Unit 6 - Notes
Unit 6: Simple Correlation and Regression Analysis
1. Meaning and Types of Correlation
1.1 Meaning of Correlation
Correlation is a statistical technique used to measure and describe the strength and direction of the linear relationship between two or more variables. It indicates how closely two variables move together. When two variables are correlated, a change in one variable is associated with a change in the other.
- Key Idea: Correlation quantifies the degree of association, but it does not imply causation. Just because two variables are highly correlated does not mean one causes the other. There could be a third, unobserved variable (a lurking variable) influencing both.
1.2 Types of Correlation
Correlation can be classified based on direction, linearity, and the number of variables.
1.2.1 Based on Direction
- Positive Correlation: The two variables move in the same direction. When one variable increases, the other also tends to increase; when one decreases, the other also tends to decrease. The correlation coefficient lies between 0 and +1.
  - Examples:
    - Height and weight of individuals.
    - Advertising expenditure and sales revenue.
    - Hours of study and exam scores.
- Negative Correlation: The two variables move in opposite directions. When one variable increases, the other tends to decrease, and vice-versa. The correlation coefficient lies between -1 and 0.
  - Examples:
    - Price of a product and its demand.
    - Speed of a car and time taken to travel a fixed distance.
    - Altitude and atmospheric pressure.
- Zero Correlation (No Correlation): There is no linear relationship between the two variables. A change in one variable is not associated with any predictable change in the other. The correlation coefficient is 0.
  - Examples:
    - A person's IQ and their shoe size.
    - The price of tea in China and the number of students in a UK university.
1.2.2 Based on Linearity
- Linear Correlation: The ratio of change between the two variables is constant. When plotted on a scatter diagram, the points tend to cluster around a straight line.
- Non-linear (Curvilinear) Correlation: The ratio of change between the two variables is not constant; the relationship is described by a curve rather than a straight line.
  - Example: The relationship between fertilizer applied and crop yield might be positive up to a point, after which more fertilizer leads to a decrease in yield.
1.2.3 Based on the Number of Variables
- Simple Correlation: Studies the relationship between only two variables (e.g., sales and advertising).
- Multiple Correlation: Studies the relationship between three or more variables simultaneously. It measures the relationship of one dependent variable with multiple independent variables (e.g., crop yield as a function of rainfall, fertilizer, and temperature).
- Partial Correlation: Studies the relationship between two variables while keeping the effect of one or more other variables constant (e.g., the correlation between sales and advertising with the effect of price held constant).
2. Pearson’s Coefficient of Correlation (r)
Karl Pearson's coefficient of correlation, denoted by 'r', is the most widely used method for measuring the degree of linear relationship between two quantitative variables. It is also known as the "product-moment correlation coefficient".
2.1 Properties of Pearson's 'r'
- Range: The value of 'r' always lies between -1 and +1, inclusive.
  - r = +1: Perfect positive linear correlation.
  - r = -1: Perfect negative linear correlation.
  - r = 0: No linear correlation.
- Interpretation of Strength:
  - |r| > 0.75: Strong correlation.
  - 0.5 < |r| ≤ 0.75: Moderate correlation.
  - 0.25 < |r| ≤ 0.5: Weak correlation.
  - |r| ≤ 0.25: Very weak or no correlation.
(Note: These are general guidelines and can vary by field of study.)
- Symmetry: The correlation between X and Y is the same as the correlation between Y and X (r_xy = r_yx).
- Unit-Free: It is a pure number, independent of the units of measurement of the variables.
- Independence of Origin and Scale: 'r' is not affected by adding/subtracting a constant to all values of a variable (change of origin) or multiplying/dividing by a constant (change of scale).
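The origin-and-scale property can be checked numerically. Below is a minimal Python sketch using the covariance form of the formula; the data values are illustrative:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's r as Cov(X, Y) / (sigma_x * sigma_y), population form."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

x = [2, 3, 5, 6, 4]        # illustrative data
y = [30, 40, 50, 60, 45]
r = pearson_r(x, y)

# Change the origin of X (add 10) and the scale of Y (multiply by 3):
# r is unchanged, as the property states.
r2 = pearson_r([xi + 10 for xi in x], [3 * yi for yi in y])
assert abs(r - r2) < 1e-12
```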
2.2 Assumptions
- Linear Relationship: The variables should have a linear relationship.
- Normality: The variables should be approximately normally distributed.
- Quantitative Data: The data must be quantitative (interval or ratio scale).
- No Outliers: Significant outliers can heavily distort the value of 'r'.
2.3 Formula for Calculation
Let X and Y be two variables with n observations.
1. Formula using Covariance and Standard Deviation:
r = Cov(X, Y) / (σx * σy)
Where:
- Cov(X, Y) is the covariance of X and Y.
- σx is the standard deviation of X.
- σy is the standard deviation of Y.
2. Formula for Raw Data (Direct Method): This is the most common formula for practical calculation.
r = [ n(ΣXY) - (ΣX)(ΣY) ] / √[ {n(ΣX²) - (ΣX)²} * {n(ΣY²) - (ΣY)²} ]
Where:
- n = Number of pairs of observations.
- ΣXY = Sum of the products of each X, Y pair.
- ΣX = Sum of X values.
- ΣY = Sum of Y values.
- ΣX² = Sum of the squares of X values.
- ΣY² = Sum of the squares of Y values.
2.4 Example Calculation
Calculate the correlation coefficient between advertising expenditure (X, in $'000) and sales revenue (Y, in $'0000).
| Advertising (X) | Sales (Y) | X² | Y² | XY |
|---|---|---|---|---|
| 2 | 30 | 4 | 900 | 60 |
| 3 | 40 | 9 | 1600 | 120 |
| 5 | 50 | 25 | 2500 | 250 |
| 6 | 60 | 36 | 3600 | 360 |
| 4 | 45 | 16 | 2025 | 180 |
| ΣX = 20 | ΣY = 225 | ΣX²=90 | ΣY²=10625 | ΣXY=970 |
Here, n = 5.
Using the formula:
r = [ 5(970) - (20)(225) ] / √[ {5(90) - (20)²} * {5(10625) - (225)²} ]
r = [ 4850 - 4500 ] / √[ {450 - 400} * {53125 - 50625} ]
r = 350 / √[ {50} * {2500} ]
r = 350 / √[ 125000 ]
r = 350 / 353.55
r ≈ +0.9899
Interpretation: There is a very strong positive linear correlation between advertising expenditure and sales.
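The calculation above can be reproduced in code. Here is a minimal Python sketch of the direct (raw data) formula applied to the same table:

```python
def pearson_r_direct(xs, ys):
    """Pearson's r using the raw-data (direct) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return num / den

x = [2, 3, 5, 6, 4]        # advertising expenditure
y = [30, 40, 50, 60, 45]   # sales
r = pearson_r_direct(x, y)
print(round(r, 4))  # ≈ 0.9899
```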
3. Rank Correlation (Spearman's Coefficient)
Spearman's Rank Correlation Coefficient, denoted by 'ρ' (rho) or 'rs', is a non-parametric measure of correlation. It is used when:
- The data is qualitative (ordinal), such as ranks, beauty, honesty, or skill.
- The assumptions of Pearson's 'r' (like normality) are not met.
- The relationship between variables is non-linear but monotonic (consistently increasing or decreasing).
3.1 Formula for Calculation
The calculation involves ranking the data for each variable from highest to lowest (or vice-versa) and then using the differences in ranks.
Case 1: No Tied Ranks
ρ = 1 - [ 6 * Σd² ] / [ n(n² - 1) ]
Where:
- d = Difference between the ranks of paired items (R1 - R2).
- Σd² = Sum of the squares of the differences.
- n = Number of pairs of observations.
Case 2: With Tied Ranks
When two or more items have the same value, they are assigned the average rank. A correction factor (CF) is added to Σd². The formula becomes:
ρ = 1 - [ 6 * (Σd² + CF) ] / [ n(n² - 1) ]
Where the correction factor CF is summed over all groups of tied values:
CF = Σ [ m(m² - 1) / 12 ]
- m is the number of items in each tie group.
Note: In practice, many simply use the first formula even with ties, as the correction factor's impact is often small unless there are many large ties.
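The correction factor itself is easy to compute from the tie-group sizes. A minimal Python sketch (the function name and the tie sizes passed in are illustrative):

```python
def tie_correction(tie_sizes):
    """CF = sum of m(m^2 - 1) / 12 over each group of m tied values."""
    return sum(m * (m ** 2 - 1) / 12 for m in tie_sizes)

# A single pair of tied values (m = 2) contributes 2 * (4 - 1) / 12 = 0.5.
print(tie_correction([2]))  # 0.5
```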
3.2 Example Calculation (With Tied Ranks)
Calculate the rank correlation between scores given by two judges to 7 contestants.
| Contestant | Judge 1 (X) | Judge 2 (Y) | Rank X (R1) | Rank Y (R2) | d = R1 - R2 | d² |
|---|---|---|---|---|---|---|
| A | 15 | 18 | 4 | 3 | 1 | 1 |
| B | 12 | 12 | 6 | 6.5 | -0.5 | 0.25 |
| C | 18 | 20 | 2 | 2 | 0 | 0 |
| D | 10 | 15 | 7 | 5 | 2 | 4 |
| E | 20 | 25 | 1 | 1 | 0 | 0 |
| F | 16 | 12 | 3 | 6.5 | -3.5 | 12.25 |
| G | 14 | 17 | 5 | 4 | 1 | 1 |
| | | | | | Σd = 0 | Σd² = 18.5 |
Ranking Explanation:
- Rank X (R1):
- 20 is 1st, 18 is 2nd, 16 is 3rd, etc. No ties.
- Rank Y (R2):
- 25 is 1st, 20 is 2nd.
- The value 12 appears twice, occupying the 6th and 7th positions. The average rank, (6 + 7) / 2 = 6.5, is assigned to both.
Here, n = 7, Σd² = 18.5.
Using the simple formula (often sufficient):
ρ = 1 - [ 6 * 18.5 ] / [ 7(7² - 1) ]
ρ = 1 - [ 111 ] / [ 7(48) ]
ρ = 1 - [ 111 / 336 ]
ρ = 1 - 0.330
ρ ≈ +0.67
Interpretation: There is a moderate to strong positive rank correlation between the scores of the two judges.
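The judges example can be checked in code. This Python sketch assigns average ranks (highest value = rank 1, ties sharing the average of their positions) and then applies the simple formula, matching the working above:

```python
def average_ranks(values):
    """Rank from highest to lowest; tied values share the average of their positions."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        positions = [i + 1 for i, w in enumerate(ordered) if w == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho via the simple formula 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    r1, r2 = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

judge1 = [15, 12, 18, 10, 20, 16, 14]
judge2 = [18, 12, 20, 15, 25, 12, 17]
rho = spearman_rho(judge1, judge2)
print(round(rho, 2))  # ≈ 0.67
```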
4. Regression Analysis
While correlation measures the strength and direction of a relationship, regression analysis aims to predict or estimate the value of one variable (dependent) based on the known value of another variable (independent).
- Dependent Variable (Y): The variable we want to predict (also called response or predicted variable).
- Independent Variable (X): The variable used to make the prediction (also called predictor or explanatory variable).
4.1 Regression Lines
There are two regression lines:
- Regression Line of Y on X: Used to predict Y for a given value of X.
  - Equation: Yc = a + b_yx * X
    - Yc is the predicted value of Y.
    - a is the Y-intercept (the value of Y when X = 0).
    - b_yx is the regression coefficient of Y on X (the slope), representing the change in Y for a one-unit change in X.
- Regression Line of X on Y: Used to predict X for a given value of Y.
  - Equation: Xc = a' + b_xy * Y
    - Xc is the predicted value of X.
    - a' is the X-intercept.
    - b_xy is the regression coefficient of X on Y, representing the change in X for a one-unit change in Y.
4.2 Calculation of Regression Coefficients
The coefficients a and b are calculated using the method of least squares, which minimizes the sum of the squared differences between the observed values and the predicted values.
1. Regression Coefficient of Y on X (b_yx)
b_yx = [ n(ΣXY) - (ΣX)(ΣY) ] / [ n(ΣX²) - (ΣX)² ]
Alternatively:
b_yx = Cov(X, Y) / Var(X) = r * (σy / σx)
2. Regression Coefficient of X on Y (b_xy)
b_xy = [ n(ΣXY) - (ΣX)(ΣY) ] / [ n(ΣY²) - (ΣY)² ]
Alternatively:
b_xy = Cov(X, Y) / Var(Y) = r * (σx / σy)
3. Intercepts a and a'
Both regression lines pass through the point of means (X_bar, Y_bar).
- For Y on X: a = Y_bar - b_yx * X_bar
- For X on Y: a' = X_bar - b_xy * Y_bar
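The coefficient and intercept formulas can be sketched directly from the raw sums. A minimal Python example (the function name `fit_lines` is illustrative), applied to the advertising/sales data from the correlation example:

```python
def fit_lines(xs, ys):
    """Return (a, b_yx, a_prime, b_xy) for the two least-squares regression lines."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b_yx = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b_xy = (n * sxy - sx * sy) / (n * syy - sy ** 2)
    x_bar, y_bar = sx / n, sy / n
    a = y_bar - b_yx * x_bar        # intercept of Y on X
    a_prime = x_bar - b_xy * y_bar  # intercept of X on Y
    return a, b_yx, a_prime, b_xy

a, b_yx, a_prime, b_xy = fit_lines([2, 3, 5, 6, 4], [30, 40, 50, 60, 45])
print(a, b_yx)  # 17.0 7.0
```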
4.3 Example Calculation
Using the same data from the correlation example:
n=5, ΣX=20, ΣY=225, ΣX²=90, ΣY²=10625, ΣXY=970.
X_bar = 20/5 = 4
Y_bar = 225/5 = 45
1. Find the regression line of Y on X (Sales on Advertising)
- Calculate b_yx:
  b_yx = [ 5(970) - (20)(225) ] / [ 5(90) - (20)² ]
  b_yx = [ 4850 - 4500 ] / [ 450 - 400 ] = 350 / 50 = 7
  Interpretation: For every additional $1,000 of advertising expenditure (one unit of X), sales increase by 7 units of Y, i.e., $70,000.
- Calculate a:
  a = Y_bar - b_yx * X_bar = 45 - 7(4) = 45 - 28 = 17
- The regression equation is: Yc = 17 + 7X
- Prediction: Predict sales if advertising expenditure is $8,000 (i.e., X = 8).
  Yc = 17 + 7(8) = 17 + 56 = 73
  Predicted sales are $730,000.
5. Properties of Regression Coefficients
- Sign: Both regression coefficients (b_yx and b_xy) always have the same sign; if one is positive, the other must be positive, and vice-versa. The correlation coefficient (r) also shares this sign.
- Range: Either coefficient can exceed 1 in absolute value, but if one coefficient is greater than 1, the other must be less than 1, because their product b_yx * b_xy = r² cannot exceed 1.
- Geometric Mean: The correlation coefficient (r) is the geometric mean of the two regression coefficients:
  r = ±√(b_yx * b_xy)
  The sign of r is the same as the sign of the coefficients.
- Independence of Origin: Regression coefficients are independent of a change of origin. Adding or subtracting a constant to all values of a series does not change them.
- Dependence on Scale: Regression coefficients are not independent of a change of scale. If X and Y are multiplied by constants c and d respectively, b_yx is multiplied by d/c and b_xy by c/d.
- Point of Intersection: The two regression lines (Y on X and X on Y) always intersect at the point of means (X_bar, Y_bar).
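These properties can be checked on the advertising/sales data from the earlier examples. A minimal Python sketch computing both coefficients from the raw sums and verifying the sign and geometric-mean properties:

```python
x = [2, 3, 5, 6, 4]
y = [30, 40, 50, 60, 45]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
sxy = sum(a * b for a, b in zip(x, y))

b_yx = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # 7.0
b_xy = (n * sxy - sx * sy) / (n * syy - sy ** 2)  # 0.14

# Same sign, and their geometric mean equals Pearson's r (≈ 0.9899 here).
assert b_yx * b_xy > 0
r = (b_yx * b_xy) ** 0.5
print(round(r, 4))  # 0.9899
```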
6. Relationships between Correlation and Regression Coefficients
Correlation and Regression are deeply connected concepts. The regression coefficients are derived from and can be expressed in terms of the correlation coefficient.
- Core Relationship Formulas:
  - The slope of the regression line of Y on X is b_yx = r * (σy / σx).
  - The slope of the regression line of X on Y is b_xy = r * (σx / σy).
  - These formulas show that the regression slope (b) is the correlation coefficient (r) scaled by the ratio of the standard deviations.
- Relationship via Geometric Mean:
  - Multiplying the two equations above:
    b_yx * b_xy = [r * (σy / σx)] * [r * (σx / σy)] = r²
    r = ±√(b_yx * b_xy)
  - This confirms that r is the geometric mean of the regression coefficients.
- Angle between Regression Lines:
  - If r = 0, then b_yx = 0 and b_xy = 0. The regression lines are Y = Y_bar and X = X_bar: perpendicular to each other and parallel to the axes, indicating no linear relationship.
  - If r = ±1, then b_yx * b_xy = 1 and the two regression lines coincide as a single line, indicating a perfect linear relationship and allowing perfect prediction.
  - The closer r is to ±1, the smaller the angle between the two regression lines; the closer r is to 0, the larger the angle.