Unit 5 - Notes
INT395
Unit 5: Time Series Regression
1. Time Series Data Characteristics
Time series data is a sequence of data points collected or recorded at specific time intervals. Unlike standard supervised learning datasets where observations are assumed to be independent (i.i.d.), time series data possesses temporal dependence, meaning the current value is often correlated with past values.
Key Components of Time Series
A time series signal can be decomposed into four primary components:
- Trend ($T_t$): The long-term movement or direction of the data. Trends can be deterministic (linear, exponential) or stochastic.
  - Example: Increasing global temperatures over 50 years.
- Seasonality ($S_t$): Recurring patterns or cycles that occur at fixed intervals (e.g., daily, weekly, monthly, quarterly).
  - Example: Retail sales spiking every December.
- Cyclicity ($C_t$): Fluctuations occurring at irregular intervals, usually influenced by macroeconomic factors (business cycles). Unlike seasonality, the period is not fixed.
  - Example: Economic recessions and expansions.
- Irregularity / Noise / Residuals ($R_t$): The random variation left over after extracting trend, seasonality, and cyclicity. This should ideally resemble white noise.
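These components are typically combined in one of two standard decomposition models: additive when the seasonal swings stay roughly constant in size, multiplicative when they scale with the level of the trend.

$$y_t = T_t + S_t + C_t + R_t \quad \text{(additive)}$$
$$y_t = T_t \times S_t \times C_t \times R_t \quad \text{(multiplicative)}$$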
Stationarity
A critical concept in time series analysis. A time series is stationary if its statistical properties do not change over time.
- Constant Mean: No trend.
- Constant Variance: No heteroscedasticity (volatility does not change).
- Constant Autocovariance: The relationship between $y_t$ and $y_{t-k}$ depends only on the lag $k$, not on the actual time $t$.
Why it matters: Most classical models (ARIMA) assume stationarity to make reliable future predictions.
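As a minimal sketch of checking this in practice, statsmodels provides the Augmented Dickey-Fuller test (the random walk below is synthetic, purely for illustration):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = rng.normal(size=200).cumsum()  # random walk: non-stationary by construction

# adfuller returns (test statistic, p-value, used lags, nobs, critical values, icbest)
adf_stat, p_value = adfuller(y)[:2]
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")
# p-value >= 0.05: fail to reject the unit-root null -> difference the series first
```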
2. Univariate vs. Multivariate Time Series
Univariate Time Series
Consists of a single variable observed over time. The objective is to predict future values of this variable based solely on its own history.
- Notation: $\{y_1, y_2, \dots, y_T\}$
- Example: Predicting the price of Bitcoin using only past Bitcoin prices.
Multivariate Time Series
Consists of two or more variables observed over time. One variable is usually the target, while others are exogenous (predictor) variables that may influence the target.
- Notation: $y_t$ (target) and $x_{1,t}, x_{2,t}, \dots, x_{k,t}$ (features).
- Example: Predicting "Ice Cream Sales" ($y_t$) using "Past Sales" ($y_{t-1}$) and "Temperature" ($x_t$).
3. Regular vs. Irregular Intervals
Regular Time Series
Data points are collected at equally spaced intervals (frequency is constant). This is the standard format required for most regression algorithms (ARIMA, RNNs).
- Examples: Hourly temperature readings, daily stock closing prices.
Irregular Time Series
Data points arrive sporadically or at arbitrary timestamps.
- Examples: Credit card transactions, server error logs, patient health records (visits occur only when sick).
- Handling Irregularity: Before applying standard regression, irregular data is often resampled (e.g., binning or interpolation) to convert it into a regular interval structure.
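A minimal pandas sketch of this resampling step (the timestamps and values are made up for illustration):

```python
import pandas as pd

# Irregularly timed events (e.g., individual transactions)
ts = pd.Series(
    [12.0, 7.5, 30.2, 4.1],
    index=pd.to_datetime(["2024-01-01 09:17", "2024-01-01 09:52",
                          "2024-01-01 11:03", "2024-01-01 13:41"]),
)

hourly_total = ts.resample("1h").sum()                 # binning: aggregate per hour
hourly_level = ts.resample("1h").mean().interpolate()  # interpolation for empty bins
print(hourly_total)
```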
4. Preparing Time Series Data
Data preparation involves transforming raw temporal data into a format suitable for supervised learning algorithms.
A. Handling Missing Values
Standard mean imputation is often inappropriate because it destroys temporal continuity.
- Forward Fill (Last Observation Carried Forward): Propagates the last valid observation forward.
- Interpolation: Linear or spline interpolation based on time index.
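A minimal pandas sketch of both techniques on a toy series:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
y = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan, 15.0], index=idx)

ffilled = y.ffill()                      # last observation carried forward
interped = y.interpolate(method="time")  # linear in the time index
print(pd.DataFrame({"raw": y, "ffill": ffilled, "interpolate": interped}))
```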
B. Feature Engineering
Transforming a time series into a supervised learning problem (inputs $X$, target $y$).
- Lag Features: Using past values as input features.
  - Target: $y_t$
  - Feature 1: $y_{t-1}$ (Lag 1)
  - Feature 2: $y_{t-7}$ (Lag 7 / Weekly seasonality)
- Rolling Window Statistics: Calculating summary statistics over a moving window.
  - Rolling Mean (Moving Average).
  - Rolling Standard Deviation (Volatility).
- Datetime Features: Extracting components from the timestamp.
  - Hour of day, Day of week, Is_Weekend, Month.
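A minimal pandas sketch of all three feature families (the column names are illustrative, not prescribed by any library):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=30, freq="D")
df = pd.DataFrame({"y": np.random.default_rng(1).normal(100, 5, size=30)}, index=idx)

# Lag features
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)

# Rolling window statistics (shift(1) keeps the current value out of its own window)
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["y"].shift(1).rolling(7).std()

# Datetime features
df["day_of_week"] = df.index.dayofweek
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
df["month"] = df.index.month

df = df.dropna()  # drop rows lost to lagging/rolling
```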
C. Transformations for Stationarity
- Differencing: Computing the difference between consecutive observations ($y'_t = y_t - y_{t-1}$) to remove trend.
- Log Transformation: Applying $\log(y_t)$ to stabilize increasing variance (heteroscedasticity).
- Decomposition: Separating Trend and Seasonality mathematically.
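A short sketch of the first two transformations, assuming a strictly positive synthetic series:

```python
import numpy as np
import pandas as pd

t = np.arange(100)
y = pd.Series(np.exp(0.02 * t) * (1 + np.random.default_rng(2).normal(0, 0.05, size=100)))

log_y = np.log(y)               # log transform: stabilizes variance (values must be > 0)
diff_y = log_y.diff().dropna()  # first difference: removes the (now linear) trend
```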
5. Data Splitting Strategies
CRITICAL RULE: Never use random train_test_split or standard K-Fold Cross-Validation for time series. Random shuffling causes data leakage (using future information to predict the past).
A. Temporal Train-Test Split
Split the data at a specific cutoff point.
- Train: Data from $t_1$ to $t_{\text{split}}$.
- Test: Data from $t_{\text{split}+1}$ to $t_N$.
- Con: Yields only a single estimate of model performance.
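A minimal sketch of a cutoff split on a synthetic series:

```python
import numpy as np
import pandas as pd

y = pd.Series(np.random.default_rng(3).normal(size=100),
              index=pd.date_range("2024-01-01", periods=100, freq="D"))

split = int(len(y) * 0.8)                     # 80/20 cutoff, never shuffled
train, test = y.iloc[:split], y.iloc[split:]
```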
B. Walk-Forward Validation (Time Series Cross-Validation)
Also known as "Rolling Origin" validation.
1. Train on an initial window and predict the next step.
2. Expand the training window to include the actual value of the predicted step.
3. Retrain and predict the subsequent step.
4. Repeat until the end of the dataset.
- Expanding Window: Training set grows larger with every step.
- Sliding Window: Training set size remains constant (oldest data drops off).
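scikit-learn's TimeSeriesSplit implements the expanding-window variant (pass max_train_size to get a sliding window instead); a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)

# Expanding window by default; set max_train_size for a sliding window
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```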
6. Autoregressive Models (AR)
An Autoregressive model forecasts the variable of interest using a linear combination of its past values. It assumes the current value depends linearly on its own previous values.
Formula (AR($p$))
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$
- $p$: The order of the AR model (number of lags used).
- $\phi_1, \dots, \phi_p$: Coefficients.
- $\epsilon_t$: White noise error term.
Determining $p$
Use the Partial Autocorrelation Function (PACF) plot.
- The PACF removes the indirect effect of intermediate lags.
- If the PACF cuts off (drops to zero) after lag $p$, an AR($p$) model is appropriate.
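A minimal statsmodels sketch: simulate an AR(2) process, inspect the PACF, and fit an AutoReg model (the coefficients 0.6 and -0.3 are arbitrary illustration values):

```python
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(5)
y = np.zeros(300)
for t in range(2, 300):  # simulate an AR(2) process
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

plot_pacf(y, lags=20)    # PACF should cut off after lag 2
plt.show()

model = AutoReg(y, lags=2).fit()
print(model.params)                                 # intercept + two AR coefficients
print(model.predict(start=len(y), end=len(y) + 9))  # 10-step-ahead forecast
```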
7. Moving Average Models (MA)
A Moving Average model forecasts the variable based on a linear combination of past forecast errors (shocks). It models the "noise" structure.
Formula (MA($q$))
$$y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$
- $q$: The order of the MA model.
- $\theta_1, \dots, \theta_q$: Coefficients.
- $\epsilon_{t-1}, \dots, \epsilon_{t-q}$: Prediction errors (residuals) from previous time steps.
Determining $q$
Use the Autocorrelation Function (ACF) plot.
- If the ACF cuts off after lag $q$, an MA($q$) model is appropriate.
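statsmodels has no standalone MA class, so a pure MA($q$) model is typically fit as ARIMA(0, 0, $q$); a minimal sketch with a simulated MA(1) process:

```python
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
eps = rng.normal(size=301)
y = eps[1:] + 0.7 * eps[:-1]   # simulate an MA(1) process

plot_acf(y, lags=20)           # ACF should cut off after lag 1
plt.show()

model = ARIMA(y, order=(0, 0, 1)).fit()  # pure MA(1) as ARIMA(0, 0, 1)
print(model.params)                      # const, theta_1, sigma^2
```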
8. ARIMA (AutoRegressive Integrated Moving Average)
ARIMA combines AR and MA models while handling non-stationary data through "Integration" (Differencing).
Notation: ARIMA($p, d, q$)
- $p$ (AR order): Number of lag observations included in the model.
- $d$ (Integrated order): Number of times the raw observations are differenced to make the data stationary.
- $q$ (MA order): Size of the moving average window.
The Modeling Process (Box-Jenkins Method)
1. Identification:
   - Check for stationarity (Augmented Dickey-Fuller test).
   - If non-stationary, difference the data (increase $d$) until stationary.
   - Examine ACF and PACF plots to estimate $p$ and $q$.
2. Estimation: Use Maximum Likelihood Estimation (MLE) or Least Squares to find coefficients ($\phi$ and $\theta$).
3. Diagnostic Checking: Analyze residuals. They should be white noise (no correlation, constant variance).
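A minimal statsmodels sketch of the three stages, using a synthetic random walk and an arbitrary candidate order (1, $d$, 1):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(size=300))   # random walk: needs one difference

# 1. Identification: pick d from the ADF test
d = 1 if adfuller(y)[1] >= 0.05 else 0

# 2. Estimation: fit a candidate order (p and q would come from ACF/PACF inspection)
model = ARIMA(y, order=(1, d, 1)).fit()

# 3. Diagnostic checking: residuals should show no autocorrelation
print(acorr_ljungbox(model.resid, lags=[10]))  # want a large p-value here
print(model.forecast(steps=5))
```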
9. SARIMA (Seasonal ARIMA)
Standard ARIMA struggles with strong seasonal patterns (e.g., sales always peaking in December). SARIMA extends ARIMA by adding seasonal hyperparameters.
Notation: SARIMA($p, d, q$)($P, D, Q$)$_m$
Non-Seasonal Components:
- $p$: Trend autoregression order.
- $d$: Trend difference order.
- $q$: Trend moving average order.
Seasonal Components:
- $m$: The number of time steps for a single seasonal period (e.g., $m = 12$ for monthly data, $m = 7$ for daily data with weekly seasonality).
- $P$: Seasonal autoregressive order (lags at multiples of $m$).
- $D$: Seasonal difference order (subtracting $y_{t-m}$ from $y_t$).
- $Q$: Seasonal moving average order.
Application Note
SARIMA is computationally more expensive than ARIMA but essential for data with obvious cycles (electricity demand, temperature, retail sales).
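A minimal statsmodels sketch using SARIMAX on a synthetic series with period 12 (the order values are illustrative, not tuned):

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(8)
t = np.arange(120)
y = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

# order=(p,d,q) models the trend part; seasonal_order=(P,D,Q,m) the seasonal part
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(model.forecast(steps=12))  # forecast one full seasonal cycle
```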
Model Selection (Auto-ARIMA)
In practice, manually interpreting ACF/PACF plots can be subjective. Grid Search (via AIC/BIC scores) is often used to find the optimal parameters.
- AIC (Akaike Information Criterion): Estimator of prediction error. Lower AIC indicates a better model (balances goodness of fit vs. model complexity).
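A minimal sketch of this search using the third-party pmdarima package (assumes pip install pmdarima; the synthetic series is purely illustrative):

```python
# pip install pmdarima
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(9)
t = np.arange(120)
y = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

# Stepwise search over (p,d,q)(P,D,Q,m) candidates, ranked by AIC
model = pm.auto_arima(y, seasonal=True, m=12,
                      information_criterion="aic", stepwise=True, trace=True)
print(model.summary())
```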