Unit 2 - Notes
Unit 2: Correlation and Linear Regression
1. Scatter Plots
A scatter plot is a graphical representation used to display the relationship between two quantitative (numerical) variables. Each pair of values (x, y) is plotted as a single point on a two-dimensional Cartesian plane.
1.1 Purpose of a Scatter Plot
- Visualize Relationships: To visually inspect the data to see if a relationship exists between the two variables.
- Identify Patterns: To determine the form, direction, and strength of the relationship.
- Detect Outliers: To identify data points that deviate significantly from the general pattern.
1.2 Interpreting Scatter Plots
a) Form of the Relationship
- Linear: The points tend to cluster around a straight line.
- Non-linear (Curvilinear): The points follow a clear, but not straight, pattern (e.g., a curve).
- No Relationship: The points are scattered randomly with no discernible pattern.
b) Direction of the Relationship
- Positive Association: As the value of one variable (X) increases, the value of the other variable (Y) also tends to increase. The points trend upwards from left to right.
- Negative Association: As the value of one variable (X) increases, the value of the other variable (Y) tends to decrease. The points trend downwards from left to right.
c) Strength of the Relationship
- Strong: The points are tightly clustered around a discernible pattern (e.g., a line). This indicates that one variable is a good predictor of the other.
- Weak: The points are loosely scattered, and the pattern is less obvious. This indicates that one variable is a poor predictor of the other.
- Moderate: A state between strong and weak.
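The "direction" described above can be read off numerically as well as visually: the sign of Σ(x - x̄)(y - ȳ) matches the upward or downward trend of the scatter. A minimal pure-Python sketch (the helper name `association_direction` is ours, not standard terminology):

```python
def association_direction(x, y):
    """Classify the direction of association in paired data by the
    sign of the summed cross-deviations sum((x - x_bar)*(y - y_bar))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "none"

# Upward-trending points -> positive association
print(association_direction([1, 2, 3, 4], [2, 4, 5, 8]))   # positive
# Downward-trending points -> negative association
print(association_direction([1, 2, 3, 4], [9, 7, 4, 1]))   # negative
```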
2. Correlation Coefficient
A correlation coefficient is a single numerical value that measures the strength and direction of the linear relationship between two quantitative variables. It is typically denoted by r for a sample and ρ (rho) for a population.
2.1 Properties of the Correlation Coefficient (r)
- Range: The value of r is always between -1 and +1, inclusive: -1 ≤ r ≤ +1.
- Direction:
  - r > 0: Indicates a positive linear relationship.
  - r < 0: Indicates a negative linear relationship.
  - r = 0: Indicates no linear relationship. Note: there could still be a strong non-linear relationship.
- Strength: The absolute value of r indicates the strength of the linear relationship.
  - |r| = 1: A perfect linear relationship; all points lie exactly on a straight line.
  - |r| ≈ 0: A very weak or no linear relationship.
  - A general guideline for interpretation:
    - 0.7 ≤ |r| ≤ 1.0: Strong
    - 0.4 ≤ |r| < 0.7: Moderate
    - 0.1 ≤ |r| < 0.4: Weak
    - 0.0 ≤ |r| < 0.1: Trivial or no relationship
- Symmetry: The correlation between X and Y is the same as the correlation between Y and X: r(X, Y) = r(Y, X).
- Unitless: The correlation coefficient has no units; it is a pure number.
- Invariance to Change of Origin and Scale: r is not affected by adding/subtracting a constant to all values of a variable (change of origin), or by multiplying/dividing all values by a positive constant (change of scale).
  - If U = (X - a) / c and V = (Y - b) / d, where a, b, c, d are constants and c, d > 0, then r(U, V) = r(X, Y).
- Correlation ≠ Causation: A strong correlation between two variables does not imply that one variable causes the other. There may be a lurking (confounding) variable influencing both.
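These properties are easy to verify numerically. A minimal pure-Python sketch (the helper name `pearson_r` and the sample data are ours, for illustration) checks the symmetry and invariance properties on a small sample:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the definitional formula Cov(X, Y) / (sigma_x * sigma_y)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - xb) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - yb) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 11, 14]
r = pearson_r(x, y)

# Range: r lies in [-1, 1]
assert -1.0 <= r <= 1.0
# Symmetry: r(X, Y) = r(Y, X)
assert abs(r - pearson_r(y, x)) < 1e-9
# Invariance: U = (X - a)/c, V = (Y - b)/d with c, d > 0 leaves r unchanged
u = [(xi - 5) / 2 for xi in x]
v = [(yi - 1) / 10 for yi in y]
assert abs(r - pearson_r(u, v)) < 1e-9
```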
3. Karl Pearson’s Correlation Coefficient
Also known as the Product-Moment Correlation Coefficient (PMCC), this is the most widely used measure of linear correlation for continuous data.
3.1 Formulae
Let (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) be n pairs of observations.
a) Formula using Covariance and Standard Deviation:
This formula highlights the conceptual definition.
r = Cov(X, Y) / (σₓ * σᵧ)
Where:
- Cov(X, Y) is the covariance of X and Y: Cov(X, Y) = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / n
- σₓ is the standard deviation of X: σₓ = √[Σ(xᵢ - x̄)² / n]
- σᵧ is the standard deviation of Y: σᵧ = √[Σ(yᵢ - ȳ)² / n]
b) Computational Formula (easier for manual calculation):
This formula is derived from the first and avoids calculating deviations from the mean for each data point.
r = [n(Σxy) - (Σx)(Σy)] / √[ [nΣx² - (Σx)²] * [nΣy² - (Σy)²] ]
Where:
- n = number of data pairs
- Σxy = sum of the products of each pair
- Σx = sum of the x-values
- Σy = sum of the y-values
- Σx² = sum of the squared x-values
- Σy² = sum of the squared y-values
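The computational formula translates directly into code. A minimal sketch (the function name and the sample data are illustrative, not from the notes) computes r from the raw sums:

```python
import math

def pearson_computational(x, y):
    """Pearson's r via the computational (raw-sums) formula:
    r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))
    return num / den

# Example: hours studied vs exam score (made-up data)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
print(round(pearson_computational(hours, scores), 4))  # 0.9899 (strong positive)
```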
3.2 Assumptions for Pearson's r
- Linearity: The relationship between the two variables should be linear.
- Normality: The variables should be approximately normally distributed.
- Homoscedasticity: The variance of residuals should be constant for any value of X.
- No significant outliers: Outliers can heavily influence the value of r.
4. Spearman’s Rank Correlation Coefficient
Spearman's correlation (ρ or rₛ) is a non-parametric measure of correlation. It assesses how well the relationship between two variables can be described using a monotonic function (one that is always increasing or always decreasing).
4.1 When to Use Spearman's Rank Correlation
- When the data is ordinal (ranked data).
- When the relationship between variables is non-linear but monotonic.
- When the data contains significant outliers (Pearson's r is sensitive to outliers; Spearman's ρ is not).
- When the assumptions for Pearson's correlation (such as normality) are not met.
4.2 Calculation Procedure
- Rank the Data: For each variable (X and Y), rank the observations from smallest to largest. The smallest value gets rank 1, the next smallest gets rank 2, and so on.
- Handle Ties: If two or more values are the same (tied), assign the average of the ranks they would have occupied. For example, if the 4th and 5th values are tied, both get the rank (4 + 5)/2 = 4.5.
- Calculate Rank Differences (dᵢ): For each pair of observations, find the difference between their ranks: dᵢ = Rank(xᵢ) - Rank(yᵢ).
- Apply the Formula:
a) Formula (when there are no tied ranks):
ρ = 1 - [ (6 * Σdᵢ²) / (n(n² - 1)) ]
Where:
- dᵢ = difference in ranks for the i-th pair
- n = number of data pairs
b) When there are tied ranks:
The formula above can be inaccurate with many ties. The standard procedure is to calculate Pearson's correlation coefficient on the ranks themselves. The formula remains the same as Pearson's, but you substitute the raw data (x, y) with their ranks (Rank(x), Rank(y)).
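The full procedure, including the average-rank rule for ties and the Pearson-on-ranks approach, can be sketched as follows (helper names are ours):

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the
    average of the ranks they would have occupied (1-based ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson's r computed on the ranks (tie-safe)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Tied values get the average rank:
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# A monotonic but non-linear relationship (y = x cubed) gives rho = 1:
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```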
4.3 Properties of Spearman's ρ
- Range: -1 ≤ ρ ≤ +1.
- Interpretation: Similar to Pearson's r, but it measures the strength and direction of a monotonic relationship, not just a linear one.
  - ρ = +1: Perfect positive monotonic relationship (as X increases, Y always increases).
  - ρ = -1: Perfect negative monotonic relationship (as X increases, Y always decreases).
  - ρ = 0: No monotonic relationship.
5. Linear Regression
While correlation tells us about the strength and direction of a relationship, regression provides a model to describe that relationship and make predictions. Simple linear regression aims to find the "line of best fit" for a set of paired data.
5.1 The Regression Line Equation
The equation for the simple linear regression line is:
Ŷ = b₀ + b₁X
Where:
- Ŷ (Y-hat) is the predicted value of the dependent variable Y for a given value of X.
- X is the independent (or predictor/explanatory) variable.
- b₀ is the Y-intercept: the predicted value of Y when X = 0.
- b₁ is the slope: the amount by which Y is predicted to change for a one-unit increase in X.
5.2 The Method of Least Squares
The "line of best fit" is determined by the method of least squares. This method finds the values of b₀ and b₁ that minimize the sum of the squared differences between the actual Y values and the predicted Ŷ values. These differences, (Yᵢ - Ŷᵢ), are called residuals or errors.
The goal is to: Minimize Σ(Yᵢ - Ŷᵢ)²
5.3 Formulae for Slope (b₁) and Intercept (b₀)
Slope (b₁):
The slope is directly related to the correlation coefficient.
b₁ = r * (sᵧ / sₓ)
or computationally:
b₁ = [n(Σxy) - (Σx)(Σy)] / [nΣx² - (Σx)²]
Where sᵧ and sₓ are the sample standard deviations of Y and X, respectively.
Intercept (b₀):
The intercept is calculated after the slope. The regression line always passes through the point of means (x̄, ȳ).
b₀ = ȳ - b₁x̄
Where:
- ȳ is the mean of the y-values.
- x̄ is the mean of the x-values.
5.4 Properties of the Least Squares Regression Line
- Passes through the Point of Means: The line Ŷ = b₀ + b₁X always passes through the point (x̄, ȳ).
- Sum of Residuals is Zero: The sum of the errors (residuals) is always zero: Σ(Yᵢ - Ŷᵢ) = 0.
- Minimizes Sum of Squared Residuals: The line minimizes the sum of the squared vertical distances from the data points to the line.
- Uncorrelated Residuals: The residuals are uncorrelated with the independent variable X.
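The slope and intercept formulas, together with the properties above, can be checked with a short sketch (the function name `fit_line` and the data are illustrative):

```python
def fit_line(x, y):
    """Least-squares slope b1 and intercept b0 via the computational formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sx2 - sx * sx)
    b0 = sy / n - b1 * (sx / n)  # b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]
b0, b1 = fit_line(x, y)  # b1 = 1.7, b0 = 0.5 for this data

# Property: the line passes through the point of means (x_bar, y_bar)
assert abs((b0 + b1 * (sum(x) / 5)) - sum(y) / 5) < 1e-12
# Property: the residuals sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
assert abs(sum(residuals)) < 1e-12
```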
5.5 Coefficient of Determination (R²)
R² (R-squared) is a crucial metric that measures how well the regression line fits the data.
- Definition: It represents the proportion of the total variance in the dependent variable (Y) that can be explained by its linear relationship with the independent variable (X).
- Formula: For simple linear regression, R² is simply the square of the Pearson correlation coefficient r:
R² = r²
- Range: 0 ≤ R² ≤ 1.
- Interpretation: R² is often expressed as a percentage.
  - An R² of 0.75 means that 75% of the variation in the Y-values can be accounted for by the linear model with X; the remaining 25% is unexplained variation (error).
  - A higher R² indicates a better fit of the model to the data.
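A short sketch (helper names ours) computes R² directly as explained variation over total variation, and confirms that it matches r² for simple linear regression:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the definitional formula."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - xb) ** 2 for a in x))
    sy = math.sqrt(sum((b - yb) ** 2 for b in y))
    return cov / (sx * sy)

def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total, using the least-squares line."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    b0 = yb - b1 * xb
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - yb) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]
# For simple linear regression, R^2 equals r^2:
assert abs(r_squared(x, y) - pearson_r(x, y) ** 2) < 1e-12
```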