Unit 2 - Notes
Unit 2: Correlation and Linear Regression
1. Scatter Plots
A scatter plot is a graphical representation used to display the relationship between two quantitative (numerical) variables. Each pair of values (x, y) is plotted as a single point on a two-dimensional Cartesian plane.
1.1 Purpose of a Scatter Plot
- Visualize Relationships: To visually inspect the data to see if a relationship exists between the two variables.
- Identify Patterns: To determine the form, direction, and strength of the relationship.
- Detect Outliers: To identify data points that deviate significantly from the general pattern.
1.2 Interpreting Scatter Plots
a) Form of the Relationship
- Linear: The points tend to cluster around a straight line.
- Non-linear (Curvilinear): The points follow a clear, but not straight, pattern (e.g., a curve).
- No Relationship: The points are scattered randomly with no discernible pattern.
b) Direction of the Relationship
- Positive Association: As the value of one variable (X) increases, the value of the other variable (Y) also tends to increase. The points trend upwards from left to right.
- Negative Association: As the value of one variable (X) increases, the value of the other variable (Y) tends to decrease. The points trend downwards from left to right.
c) Strength of the Relationship
- Strong: The points are tightly clustered around a discernible pattern (e.g., a line). This indicates that one variable is a good predictor of the other.
- Weak: The points are loosely scattered, and the pattern is less obvious. This indicates that one variable is a poor predictor of the other.
- Moderate: A state between strong and weak.
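The "direction" described above can be read off numerically as well as visually: the sign of Σ(x - x̄)(y - ȳ) matches the upward or downward trend of the scatter. A minimal pure-Python sketch (the helper name `association_direction` is ours, not standard terminology):

```python
def association_direction(x, y):
    """Classify the direction of association in paired data by the
    sign of the summed cross-deviations sum((x - x_bar)*(y - y_bar))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "none"

# Upward-trending points -> positive association
print(association_direction([1, 2, 3, 4], [2, 4, 5, 8]))   # positive
# Downward-trending points -> negative association
print(association_direction([1, 2, 3, 4], [9, 7, 4, 1]))   # negative
```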
2. Correlation Coefficient
A correlation coefficient is a single numerical value that measures the strength and direction of the linear relationship between two quantitative variables. It is typically denoted by r for a sample and ρ (rho) for a population.
2.1 Properties of the Correlation Coefficient (r)
- Range: The value of r is always between -1 and +1, inclusive: -1 ≤ r ≤ +1.
- Direction:
  - r > 0: Indicates a positive linear relationship.
  - r < 0: Indicates a negative linear relationship.
  - r = 0: Indicates no linear relationship. Note: there could still be a strong non-linear relationship.
- Strength: The absolute value of r indicates the strength of the linear relationship.
  - |r| = 1: A perfect linear relationship; all points lie exactly on a straight line.
  - |r| ≈ 0: A very weak or no linear relationship.
  - A general guideline for interpretation:
    - 0.7 ≤ |r| ≤ 1.0: Strong
    - 0.4 ≤ |r| < 0.7: Moderate
    - 0.1 ≤ |r| < 0.4: Weak
    - 0.0 ≤ |r| < 0.1: Trivial or no relationship
- Symmetry: The correlation between X and Y is the same as the correlation between Y and X: r(X, Y) = r(Y, X).
- Unitless: The correlation coefficient has no units; it is a pure number.
- Invariance to Change of Origin and Scale: r is not affected by adding/subtracting a constant to all values of a variable (change of origin), or by multiplying/dividing all values by a positive constant (change of scale).
  - If U = (X - a) / c and V = (Y - b) / d, where a, b, c, d are constants and c, d > 0, then r(U, V) = r(X, Y).
- Correlation ≠ Causation: A strong correlation between two variables does not imply that one variable causes the other. There may be a lurking (confounding) variable influencing both.
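These properties are easy to verify numerically. A minimal pure-Python sketch (the helper name `pearson_r` and the sample data are ours, for illustration) checks the symmetry and invariance properties on a small sample:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the definitional formula Cov(X, Y) / (sigma_x * sigma_y)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - xb) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - yb) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 11, 14]
r = pearson_r(x, y)

# Range: r lies in [-1, 1]
assert -1.0 <= r <= 1.0
# Symmetry: r(X, Y) = r(Y, X)
assert abs(r - pearson_r(y, x)) < 1e-9
# Invariance: U = (X - a)/c, V = (Y - b)/d with c, d > 0 leaves r unchanged
u = [(xi - 5) / 2 for xi in x]
v = [(yi - 1) / 10 for yi in y]
assert abs(r - pearson_r(u, v)) < 1e-9
```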
3. Karl Pearson’s Correlation Coefficient
Also known as the Product-Moment Correlation Coefficient (PMCC), this is the most widely used measure of linear correlation for continuous data.
3.1 Formulae
Let (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) be n pairs of observations.
a) Formula using Covariance and Standard Deviation:
This formula highlights the conceptual definition.
r = Cov(X, Y) / (σₓ * σᵧ)
Where:
- Cov(X, Y) is the covariance of X and Y: Cov(X, Y) = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / n
- σₓ is the standard deviation of X: σₓ = √[Σ(xᵢ - x̄)² / n]
- σᵧ is the standard deviation of Y: σᵧ = √[Σ(yᵢ - ȳ)² / n]
b) Computational Formula (easier for manual calculation):
This formula is derived from the first and avoids calculating deviations from the mean for each data point.
r = [n(Σxy) - (Σx)(Σy)] / √[ [nΣx² - (Σx)²] * [nΣy² - (Σy)²] ]
Where:
- n = number of data pairs
- Σxy = sum of the products of each pair
- Σx = sum of the x-values
- Σy = sum of the y-values
- Σx² = sum of the squared x-values
- Σy² = sum of the squared y-values
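The computational formula translates directly into code. A minimal sketch (the function name and the sample data are illustrative, not from the notes) computes r from the raw sums:

```python
import math

def pearson_computational(x, y):
    """Pearson's r via the computational (raw-sums) formula:
    r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))
    return num / den

# Example: hours studied vs exam score (made-up data)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
print(round(pearson_computational(hours, scores), 4))  # 0.9899 (strong positive)
```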
3.2 Assumptions for Pearson's r
- Linearity: The relationship between the two variables should be linear.
- Normality: The variables should be approximately normally distributed.
- Homoscedasticity: The variance of residuals should be constant for any value of X.
- No significant outliers: Outliers can heavily influence the value of r.
4. Spearman’s Rank Correlation Coefficient
Spearman's correlation (ρ or rₛ) is a non-parametric measure of correlation. It assesses how well the relationship between two variables can be described using a monotonic function (one that is always increasing or always decreasing).
4.1 When to Use Spearman's Rank Correlation
- When the data is ordinal (ranked data).
- When the relationship between variables is non-linear but monotonic.
- When the data contains significant outliers (Pearson's r is sensitive to outliers; Spearman's ρ is not).
- When the assumptions for Pearson's correlation (such as normality) are not met.
4.2 Calculation Procedure
- Rank the Data: For each variable (X and Y), rank the observations from smallest to largest. The smallest value gets rank 1, the next smallest gets rank 2, and so on.
- Handle Ties: If two or more values are the same (tied), assign the average of the ranks they would have occupied. For example, if the 4th and 5th values are tied, both get the rank (4 + 5)/2 = 4.5.
- Calculate Rank Differences (dᵢ): For each pair of observations, find the difference between their ranks: dᵢ = Rank(xᵢ) - Rank(yᵢ).
- Apply the Formula:
a) Formula (when there are no tied ranks):
ρ = 1 - [ (6 * Σdᵢ²) / (n(n² - 1)) ]
Where:
- dᵢ = difference in ranks for the i-th pair
- n = number of data pairs
b) When there are tied ranks:
The formula above can be inaccurate with many ties. The standard procedure is to calculate Pearson's correlation coefficient on the ranks themselves. The formula remains the same as Pearson's, but you substitute the raw data (x, y) with their ranks (Rank(x), Rank(y)).
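The full procedure, including the average-rank rule for ties and the Pearson-on-ranks approach, can be sketched as follows (helper names are ours):

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the
    average of the ranks they would have occupied (1-based ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson's r computed on the ranks (tie-safe)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Tied values get the average rank:
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# A monotonic but non-linear relationship (y = x cubed) gives rho = 1:
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```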
4.3 Properties of Spearman's ρ
- Range: -1 ≤ ρ ≤ +1.
- Interpretation: Similar to Pearson's r, but it measures the strength and direction of a monotonic relationship, not just a linear one.
  - ρ = +1: Perfect positive monotonic relationship (as X increases, Y always increases).
  - ρ = -1: Perfect negative monotonic relationship (as X increases, Y always decreases).
  - ρ = 0: No monotonic relationship.
5. Linear Regression
While correlation tells us about the strength and direction of a relationship, regression provides a model to describe that relationship and make predictions. Simple linear regression aims to find the "line of best fit" for a set of paired data.
5.1 The Regression Line Equation
The equation for the simple linear regression line is:
Ŷ = b₀ + b₁X
Where:
- Ŷ (Y-hat) is the predicted value of the dependent variable Y for a given value of X.
- X is the independent (or predictor/explanatory) variable.
- b₀ is the Y-intercept: the predicted value of Y when X = 0.
- b₁ is the slope: the amount by which Y is predicted to change for a one-unit increase in X.
5.2 The Method of Least Squares
The "line of best fit" is determined by the method of least squares. This method finds the values of b₀ and b₁ that minimize the sum of the squared differences between the actual Y values and the predicted Ŷ values. These differences, (Yᵢ - Ŷᵢ), are called residuals or errors.
The goal is to: Minimize Σ(Yᵢ - Ŷᵢ)²
5.3 Formulae for Slope (b₁) and Intercept (b₀)
Slope (b₁):
The slope is directly related to the correlation coefficient.
b₁ = r * (sᵧ / sₓ)
or computationally:
b₁ = [n(Σxy) - (Σx)(Σy)] / [nΣx² - (Σx)²]
Where sᵧ and sₓ are the sample standard deviations of Y and X, respectively.
Intercept (b₀):
The intercept is calculated after the slope. The regression line always passes through the point of means (x̄, ȳ).
b₀ = ȳ - b₁x̄
Where:
- ȳ is the mean of the y-values.
- x̄ is the mean of the x-values.
5.4 Properties of the Least Squares Regression Line
- Passes through the Point of Means: The line Ŷ = b₀ + b₁X always passes through the point (x̄, ȳ).
- Sum of Residuals is Zero: The sum of the errors (residuals) is always zero: Σ(Yᵢ - Ŷᵢ) = 0.
- Minimizes Sum of Squared Residuals: The line minimizes the sum of the squared vertical distances from the data points to the line.
- Uncorrelated Residuals: The residuals are uncorrelated with the independent variable X.
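The slope and intercept formulas, together with the properties above, can be checked with a short sketch (the function name `fit_line` and the data are illustrative):

```python
def fit_line(x, y):
    """Least-squares slope b1 and intercept b0 via the computational formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sx2 - sx * sx)
    b0 = sy / n - b1 * (sx / n)  # b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]
b0, b1 = fit_line(x, y)  # b1 = 1.7, b0 = 0.5 for this data

# Property: the line passes through the point of means (x_bar, y_bar)
assert abs((b0 + b1 * (sum(x) / 5)) - sum(y) / 5) < 1e-12
# Property: the residuals sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
assert abs(sum(residuals)) < 1e-12
```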
5.5 Coefficient of Determination (R²)
R² (R-squared) is a crucial metric that measures how well the regression line fits the data.
- Definition: It represents the proportion of the total variance in the dependent variable (Y) that can be explained by its linear relationship with the independent variable (X).
- Formula: For simple linear regression, R² is simply the square of the Pearson correlation coefficient r:
R² = r²
- Range: 0 ≤ R² ≤ 1.
- Interpretation: R² is often expressed as a percentage.
  - An R² of 0.75 means that 75% of the variation in the Y-values can be accounted for by the linear model with X; the remaining 25% is unexplained variation (error).
  - A higher R² indicates a better fit of the model to the data.
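A short sketch (helper names ours) computes R² directly as explained variation over total variation, and confirms that it matches r² for simple linear regression:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the definitional formula."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - xb) ** 2 for a in x))
    sy = math.sqrt(sum((b - yb) ** 2 for b in y))
    return cov / (sx * sy)

def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total, using the least-squares line."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    b0 = yb - b1 * xb
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - yb) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]
# For simple linear regression, R^2 equals r^2:
assert abs(r_squared(x, y) - pearson_r(x, y) ** 2) < 1e-12
```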