Unit 6 - Notes

QTT201

Unit 6: Simple Correlation and Regression Analysis

1. Meaning and Types of Correlation

1.1 Meaning of Correlation

Correlation is a statistical technique used to measure and describe the strength and direction of the linear relationship between two or more variables. It indicates how closely two variables move together. When two variables are correlated, a change in one variable is associated with a change in the other.

  • Key Idea: Correlation quantifies the degree of association, but it does not imply causation. Just because two variables are highly correlated does not mean one causes the other. There could be a third, unobserved variable (a lurking variable) influencing both.

1.2 Types of Correlation

Correlation can be classified based on direction, linearity, and the number of variables.

1.2.1 Based on Direction

  • Positive Correlation: The two variables move in the same direction. When one variable increases, the other variable also tends to increase. When one decreases, the other also tends to decrease.

    • Examples:
      • Height and weight of individuals.
      • Advertising expenditure and sales revenue.
      • Hours of study and exam scores.
    • The correlation coefficient is between 0 and +1.
  • Negative Correlation: The two variables move in opposite directions. When one variable increases, the other variable tends to decrease, and vice-versa.

    • Examples:
      • Price of a product and its demand.
      • Speed of a car and time taken to travel a fixed distance.
      • Altitude and atmospheric pressure.
    • The correlation coefficient is between -1 and 0.
  • Zero Correlation (No Correlation): There is no linear relationship between the two variables. A change in one variable is not associated with any predictable change in the other.

    • Examples:
      • A person's IQ and their shoe size.
      • The price of tea in China and the number of students in a UK university.
    • The correlation coefficient is 0.

1.2.2 Based on Linearity

  • Linear Correlation: The ratio of change between the two variables is constant. When plotted on a scatter diagram, the points tend to cluster around a straight line.

  • Non-linear (or Curvilinear) Correlation: The ratio of change between the two variables is not constant. The relationship can be described by a curve, not a straight line.

    • Example: The relationship between fertilizer applied and crop yield might be positive up to a point, after which more fertilizer leads to a decrease in yield.

1.2.3 Based on the Number of Variables

  • Simple Correlation: Studies the relationship between only two variables (e.g., sales and advertising).

  • Multiple Correlation: Studies the relationship between three or more variables simultaneously. It measures the relationship of one dependent variable with multiple independent variables (e.g., crop yield as a function of rainfall, fertilizer, and temperature).

  • Partial Correlation: Studies the relationship between two variables while keeping the effect of one or more other variables constant (e.g., studying the correlation between sales and advertising while keeping the effect of price constant).


2. Pearson’s Coefficient of Correlation (r)

Karl Pearson's coefficient of correlation, denoted by 'r', is the most widely used method for measuring the degree of linear relationship between two quantitative variables. It is also known as the "product-moment correlation coefficient".

2.1 Properties of Pearson's 'r'

  1. Range: The value of 'r' always lies between -1 and +1, inclusive.
    • r = +1: Perfect positive linear correlation.
    • r = -1: Perfect negative linear correlation.
    • r = 0: No linear correlation.
  2. Interpretation of Strength:
    • |r| > 0.75: Strong correlation.
    • 0.5 < |r| ≤ 0.75: Moderate correlation.
    • 0.25 < |r| ≤ 0.5: Weak correlation.
    • |r| ≤ 0.25: Very weak or no correlation.
      (Note: These are general guidelines and can vary by field of study.)
  3. Symmetry: The correlation between X and Y is the same as the correlation between Y and X (r_xy = r_yx).
  4. Unit-Free: It is a pure number and is independent of the units of measurement of the variables.
  5. Independence of Origin and Scale: 'r' is not affected by adding/subtracting a constant to all values of a variable (change of origin) or multiplying/dividing by a constant (change of scale).
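Property 5 can be checked numerically. The sketch below implements the raw-data formula from Section 2.3 as a small helper (the function name `pearson` is illustrative, not a library call) and shows that applying a change of origin and scale to X leaves r unchanged:

```python
import math

def pearson(x, y):
    """Pearson's r via the raw-data (direct) formula."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
                    * (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

x = [2, 3, 5, 6, 4]
y = [30, 40, 50, 60, 45]

r1 = pearson(x, y)
r2 = pearson([10 * a + 5 for a in x], y)  # change of scale (×10) and origin (+5) in X
print(round(r1, 4), abs(r1 - r2) < 1e-12)  # r is unaffected by the transformation
```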

2.2 Assumptions

  1. Linear Relationship: The variables should have a linear relationship.
  2. Normality: The variables should be approximately normally distributed.
  3. Quantitative Data: The data must be quantitative (interval or ratio scale).
  4. No Outliers: Significant outliers can heavily distort the value of 'r'.

2.3 Formula for Calculation

Let X and Y be two variables with n observations.

1. Formula using Covariance and Standard Deviation:

TEXT
r = Cov(X, Y) / (σx * σy)

Where:

  • Cov(X, Y) is the covariance of X and Y.
  • σx is the standard deviation of X.
  • σy is the standard deviation of Y.

2. Formula for Raw Data (Direct Method): This is the most common formula for practical calculation.

TEXT
r = [ n(ΣXY) - (ΣX)(ΣY) ] / √[ {n(ΣX²) - (ΣX)²} * {n(ΣY²) - (ΣY)²} ]

Where:

  • n = Number of pairs of observations.
  • ΣXY = Sum of the product of each pair of X and Y values.
  • ΣX = Sum of X values.
  • ΣY = Sum of Y values.
  • ΣX² = Sum of the squares of X values.
  • ΣY² = Sum of the squares of Y values.

2.4 Example Calculation

Calculate the correlation coefficient between advertising expenditure (X, in $'000) and sales revenue (Y, in $'0000).

Advertising (X) Sales (Y) X² Y² XY
2 30 4 900 60
3 40 9 1600 120
5 50 25 2500 250
6 60 36 3600 360
4 45 16 2025 180
ΣX = 20 ΣY = 225 ΣX²=90 ΣY²=10625 ΣXY=970

Here, n = 5.

Using the formula:
r = [ 5(970) - (20)(225) ] / √[ {5(90) - (20)²} * {5(10625) - (225)²} ]
r = [ 4850 - 4500 ] / √[ {450 - 400} * {53125 - 50625} ]
r = 350 / √[ {50} * {2500} ]
r = 350 / √[ 125000 ]
r = 350 / 353.55
r ≈ +0.9899

Interpretation: There is a very strong positive linear correlation between advertising expenditure and sales.
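The hand calculation above can be reproduced with a short script. This is a minimal sketch using only Python's standard library, with the example figures hard-coded:

```python
import math

# Advertising (X) and sales (Y) figures from the worked example
x = [2, 3, 5, 6, 4]
y = [30, 40, 50, 60, 45]
n = len(x)

sx, sy = sum(x), sum(y)                      # ΣX = 20, ΣY = 225
sxy = sum(a * b for a, b in zip(x, y))       # ΣXY = 970
sx2 = sum(a * a for a in x)                  # ΣX² = 90
sy2 = sum(b * b for b in y)                  # ΣY² = 10625

# Direct (raw-data) formula for Pearson's r
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 4))  # 0.9899
```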


3. Rank Correlation (Spearman's Coefficient)

Spearman's Rank Correlation Coefficient, denoted by 'ρ' (rho) or 'rs', is a non-parametric measure of correlation. It is used when:

  1. The data is qualitative (ordinal), such as rankings of contestants by beauty, honesty, or skill.
  2. The assumptions of Pearson's 'r' (like normality) are not met.
  3. The relationship between variables is non-linear but monotonic (consistently increasing or decreasing).

3.1 Formula for Calculation

The calculation involves ranking the data for each variable from highest to lowest (or vice-versa) and then using the differences in ranks.

Case 1: No Tied Ranks

TEXT
ρ = 1 - [ 6 * Σd² ] / [ n(n² - 1) ]

Where:

  • d = Difference between the ranks of paired items (R1 - R2).
  • Σd² = Sum of the squares of the differences.
  • n = Number of pairs of observations.

Case 2: With Tied Ranks
When two or more items have the same value, they are assigned the average rank. A correction factor (CF) is added to Σd². The formula becomes:

TEXT
ρ = 1 - [ 6 * (Σd² + CF) ] / [ n(n² - 1) ]

Where the Correction Factor CF is calculated for each tie:
CF = Σ [ m(m² - 1) / 12 ]

  • m is the number of items in each tie.

Note: In practice, many simply use the first formula even with ties, as the correction factor's impact is often small unless there are many large ties.

3.2 Example Calculation (With Tied Ranks)

Calculate the rank correlation between scores given by two judges to 7 contestants.

Contestant Judge 1 (X) Judge 2 (Y) Rank X (R1) Rank Y (R2) d = R1 - R2 d²
A 15 18 4 3 1 1
B 12 12 6 6.5 -0.5 0.25
C 18 20 2 2 0 0
D 10 15 7 5 2 4
E 20 25 1 1 0 0
F 16 12 3 6.5 -3.5 12.25
G 14 17 5 4 1 1
Σd = 0 Σd²=18.5

Ranking Explanation:

  • Rank X (R1):
    • 20 is 1st, 18 is 2nd, 16 is 3rd, etc. No ties.
  • Rank Y (R2):
    • 25 is 1st, 20 is 2nd.
    • The value 12 appears twice, occupying the 6th and 7th positions. The average rank is (6 + 7) / 2 = 6.5. This rank is assigned to both.

Here, n = 7, Σd² = 18.5.

Using the simple formula (often sufficient):
ρ = 1 - [ 6 * 18.5 ] / [ 7(7² - 1) ]
ρ = 1 - [ 111 ] / [ 7(48) ]
ρ = 1 - [ 111 / 336 ]
ρ = 1 - 0.330
ρ ≈ +0.67

Interpretation: There is a moderate to strong positive rank correlation between the scores of the two judges.
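The ranking step and the simple formula can be verified in Python. The helper below is an illustrative implementation of average ranking (highest value gets rank 1, ties share the average of their positions), not a library function:

```python
def average_ranks(values):
    """Rank from highest to lowest; tied values share the average of their positions."""
    ordered = sorted(values, reverse=True)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == val) / ordered.count(val)
            for val in values]

x = [15, 12, 18, 10, 20, 16, 14]  # scores from Judge 1
y = [18, 12, 20, 15, 25, 12, 17]  # scores from Judge 2

rx, ry = average_ranks(x), average_ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # Σd² = 18.5
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))            # simple formula, no tie correction
print(d2, round(rho, 2))  # 18.5 0.67
```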


4. Regression Analysis

While correlation measures the strength and direction of a relationship, regression analysis aims to predict or estimate the value of one variable (dependent) based on the known value of another variable (independent).

  • Dependent Variable (Y): The variable we want to predict (also called response or predicted variable).
  • Independent Variable (X): The variable used to make the prediction (also called predictor or explanatory variable).

4.1 Regression Lines

There are two regression lines:

  1. Regression Line of Y on X: Used to predict Y for a given value of X.

    • Equation: Yc = a + b_yx * X
    • Yc is the predicted value of Y.
    • a is the Y-intercept (the value of Y when X=0).
    • b_yx is the regression coefficient of Y on X (the slope), representing the change in Y for a one-unit change in X.
  2. Regression Line of X on Y: Used to predict X for a given value of Y.

    • Equation: Xc = a' + b_xy * Y
    • Xc is the predicted value of X.
    • a' is the X-intercept (the value of X when Y = 0).
    • b_xy is the regression coefficient of X on Y, representing the change in X for a one-unit change in Y.

4.2 Calculation of Regression Coefficients

The coefficients a and b are calculated using the method of least squares, which minimizes the sum of the squared differences between the observed values and the predicted values.

1. Regression Coefficient of Y on X (b_yx)

TEXT
b_yx = [ n(ΣXY) - (ΣX)(ΣY) ] / [ n(ΣX²) - (ΣX)² ]

Alternatively:
TEXT
b_yx = Cov(X, Y) / Var(X) = r * (σy / σx)

2. Regression Coefficient of X on Y (b_xy)

TEXT
b_xy = [ n(ΣXY) - (ΣX)(ΣY) ] / [ n(ΣY²) - (ΣY)² ]

Alternatively:
TEXT
b_xy = Cov(X, Y) / Var(Y) = r * (σx / σy)

3. Intercepts a and a'
The regression line always passes through the mean of the data (X_bar, Y_bar).

  • For Y on X: a = Y_bar - b_yx * X_bar
  • For X on Y: a' = X_bar - b_xy * Y_bar

4.3 Example Calculation

Using the same data from the correlation example:
n=5, ΣX=20, ΣY=225, ΣX²=90, ΣY²=10625, ΣXY=970.
X_bar = 20/5 = 4
Y_bar = 225/5 = 45

1. Find the regression line of Y on X (Sales on Advertising)

  • Calculate b_yx:
    b_yx = [ 5(970) - (20)(225) ] / [ 5(90) - (20)² ]
    b_yx = [ 4850 - 4500 ] / [ 450 - 400 ] = 350 / 50 = 7
    Interpretation: For every additional $1,000 spent on advertising (one unit of X), predicted sales increase by 7 units of Y, i.e., $70,000.

  • Calculate a:
    a = Y_bar - b_yx * X_bar = 45 - 7 * (4) = 45 - 28 = 17

  • The regression equation is: Yc = 17 + 7X

  • Prediction: Predict sales if advertising expenditure is $8,000 (i.e., X=8).
    Yc = 17 + 7(8) = 17 + 56 = 73
    Predicted sales are $730,000.
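The regression line and the prediction can be reproduced from the summary sums alone. A minimal Python sketch, with the figures hard-coded from the worked example:

```python
# Summary sums from the worked example (n pairs of advertising X and sales Y)
n, sx, sy, sx2, sxy = 5, 20, 225, 90, 970

b_yx = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)  # slope of Y on X: 350 / 50
a = sy / n - b_yx * (sx / n)                      # intercept: Y_bar - b_yx * X_bar

predicted = a + b_yx * 8   # predicted sales when X = 8
print(b_yx, a, predicted)  # 7.0 17.0 73.0
```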


5. Properties of Regression Coefficients

  1. Sign: Both regression coefficients (b_yx and b_xy) will have the same sign. If one is positive, the other must be positive, and vice-versa. The correlation coefficient (r) will also share this sign.

  2. Range: The value of both coefficients can be greater than 1. However, since b_yx * b_xy = r² ≤ 1, if one coefficient is greater than 1, the other must be less than 1.

  3. Geometric Mean: The correlation coefficient (r) is the geometric mean of the two regression coefficients.

    TEXT
        r = ±√(b_yx * b_xy)
        

    The sign of r is the same as the sign of the coefficients.

  4. Independence of Origin: Regression coefficients are independent of the change of origin. Adding or subtracting a constant from the series does not change their value.

  5. Dependence on Scale: Regression coefficients are not independent of the change of scale. If X and Y are multiplied by positive constants c and d respectively, b_yx becomes (d/c) * b_yx and b_xy becomes (c/d) * b_xy.

  6. Point of Intersection: The two regression lines (Y on X and X on Y) always intersect at the point of their means (X_bar, Y_bar).
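Properties 1 and 3 can be checked numerically with the summary sums from the earlier advertising/sales example. A short Python sketch:

```python
import math

# Summary sums from the worked advertising/sales example
n, sx, sy, sx2, sy2, sxy = 5, 20, 225, 90, 10625, 970

b_yx = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)   # 350 / 50   = 7.0
b_xy = (n * sxy - sx * sy) / (n * sy2 - sy ** 2)   # 350 / 2500 = 0.14

# Property 1: both coefficients share the same sign
assert b_yx * b_xy > 0

# Property 3: r is the geometric mean of the two coefficients
r = math.sqrt(b_yx * b_xy)   # positive root, since both coefficients are positive
print(round(r, 4))  # 0.9899, matching the earlier correlation calculation
```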


6. Relationships between Correlation and Regression Coefficients

Correlation and Regression are deeply connected concepts. The regression coefficients are derived from and can be expressed in terms of the correlation coefficient.

  1. Core Relationship Formulas:

    • The slope of the regression line of Y on X is b_yx = r * (σy / σx).
    • The slope of the regression line of X on Y is b_xy = r * (σx / σy).
    • These formulas show that the regression slope (b) is the correlation coefficient (r) adjusted by the ratio of the standard deviations.
  2. Relationship via Geometric Mean:

    • Multiplying the two equations above:
      b_yx * b_xy = [r * (σy / σx)] * [r * (σx / σy)]
      b_yx * b_xy = r²
      r = ±√(b_yx * b_xy)
    • This confirms that r is the geometric mean of the regression coefficients.
  3. Angle between Regression Lines:

    • If r = 0, then b_yx = 0 and b_xy = 0. The regression lines are Y = Y_bar and X = X_bar. They are perpendicular to each other and parallel to the axes. This indicates no linear relationship.
    • If r = ±1, then b_yx * b_xy = 1. The two regression lines coincide and become a single line. This indicates a perfect linear relationship, allowing for perfect prediction.
    • The closer r is to ±1, the smaller the angle between the two regression lines. The closer r is to 0, the larger the angle between the lines.