1 $Which of the following is an example of Nominal categorical data?$

Types of data Easy

A.

Customer satisfaction rating (Low, Medium, High)

B.

Colors of a car (Red, Blue, Green)

C.

Educational level (High School, Bachelor's, Master's)

D.

Size of a T-shirt (Small, Medium, Large)

2 $The number of students in a classroom is an example of what type of data?$

Types of data Easy

A.

Continuous numerical data

B.

Ordinal categorical data

C.

Nominal categorical data

D.

Discrete numerical data

3 $What is the primary purpose of data pre-processing in machine learning?$

Data pre-processing Easy

A.

To transform raw data into a clean and suitable format for a model

B.

To visualize the final results of a model

C.

To train the machine learning model

D.

To select the best machine learning algorithm

4 $Which of the following is NOT a typical data pre-processing step?$

Data pre-processing Easy

A.

Model evaluation and deployment

B.

Feature scaling

C.

Handling missing values

D.

Encoding categorical variables

5 $What is data leakage in machine learning?$

data leakage concept Easy

A.

When information from outside the training dataset is used to create the model

B.

When the model's accuracy is too low

C.

When data is lost due to a hardware failure

D.

When the dataset is too small to train a model

6 $To prevent data leakage, when should you split your data into training and testing sets?$

data leakage concept Easy

A.

After performing feature scaling on the entire dataset

B.

Before performing any pre-processing like scaling or imputation

C.

It does not matter when you split the data

D.

After training the model

7 $The technique of replacing missing values with a substitute value, like the mean or median, is called:$

handling missing values Easy

A.

Imputation

B.

Deletion

C.

Binning

D.

Normalization

8 $Which imputation method is most suitable for a numerical feature with significant outliers?$

handling missing values Easy

A.

Mean imputation

B.

Zero imputation

C.

Median imputation

D.

Mode imputation

9 $What is the simplest strategy for handling rows with missing values?$

handling missing values Easy

A.

Deleting the rows

B.

Replacing them with the mean

C.

Using a predictive model to estimate them

D.

Replacing them with the mode

10 $What is an outlier in a dataset?$

outliers handling Easy

A.

A data point that is missing a value

B.

A categorical data point

C.

A data point that is significantly different from other data points

D.

The average value of a feature

11 $Why is it often important to handle outliers?$

outliers handling Easy

A.

They always represent incorrect data entry

B.

They help improve the accuracy of all models

C.

They can skew statistical measures and negatively affect model performance

D.

They make the dataset larger

12 $Which encoding technique converts a categorical feature with N unique categories into N new binary features?$

handling categorical data Easy

A.

Target Encoding

B.

Ordinal Encoding

C.

Label Encoding

D.

One-Hot Encoding

13 $When would Label Encoding be an appropriate choice over One-Hot Encoding?$

handling categorical data Easy

A.

When the categorical variable is ordinal

B.

When using tree-based models, always

C.

When the categorical variable is nominal

D.

When the number of categories is very small (e.g., 2)

14 $What is the primary goal of feature scaling?$

scaling and normalization Easy

A.

To encode categorical data

B.

To bring all features to a similar scale or range

C.

To handle missing values

D.

To remove outliers from the data

15 $Min-Max Scaling typically scales the data to which of the following ranges?$

scaling and normalization Easy

A.

Mean of 0 and Standard Deviation of 1

B.

[-1, 1]

C.

[0, 1]

D.

[0, 100]

16 $Which technique transforms features to have a mean of 0 and a standard deviation of 1?$

scaling and normalization Easy

A.

Robust Scaling

B.

Min-Max Scaling

C.

One-Hot Encoding

D.

Standardization (Z-score Normalization)

17 $What does 'class imbalance' refer to in a classification problem?$

class imbalance handling Easy

A.

When features are not on the same scale

B.

When the number of observations per class is not equally distributed

C.

When there are too many categorical features

D.

When the dataset contains many outliers

18 $Which technique involves creating new synthetic data points in the minority class to balance a dataset?$

class imbalance handling Easy

A.

Principal Component Analysis (PCA)

B.

Feature Scaling

C.

Oversampling (e.g., SMOTE)

D.

Undersampling

19 $The Interquartile Range (IQR) method is commonly used for what purpose?$

outliers handling Easy

A.

To handle missing values

B.

To encode categorical data

C.

To scale numerical data

D.

To identify potential outliers
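
Illustration: a minimal sketch of the IQR rule named in this question, using only the Python standard library. The 1.5 multiplier is the conventional choice rather than something the quiz specifies, and the sample values and the helper name iqr_fences are made up for the example.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one extreme value.
data = [10, 12, 12, 13, 14, 15, 16, 18, 95]
low, high = iqr_fences(data)

# Points outside the fences are only candidates; they still need review.
flagged = [x for x in data if x < low or x > high]
print(low, high, flagged)
```

Different quartile conventions can shift the fences slightly, so the flagged set may vary between tools.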

20 $What is the primary disadvantage of One-Hot Encoding?$

handling categorical data Easy

A.

It implies an incorrect order among categories

B.

It only works for numerical data

C.

It cannot handle more than two categories

D.

It can lead to a very high number of features (curse of dimensionality)

21 $You are working with a dataset containing a feature for 'Annual Income' which has a right-skewed distribution and several missing values. Which imputation method would be the most robust and appropriate choice to fill in the missing data?$

Handling missing values Medium

A.

Dropping all rows with missing values

B.

Median imputation

C.

Mode imputation

D.

Mean imputation

22 $A data scientist is building a model to predict customer churn. Before splitting the data into training and test sets, they apply StandardScaler to the entire dataset. Why is this a problematic approach?$

Data leakage concept Medium

A.

It will permanently alter the original data, making it impossible to interpret the model's coefficients.

B.

It is computationally inefficient to scale the entire dataset at once.

C.

It causes data leakage because information from the test set (its mean and variance) is used to transform the training set.

D.

StandardScaler cannot be applied before splitting the data.
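
Illustration: a small standard-library sketch of the scenario in this question. It hand-rolls a z-score scaler (the helpers fit_scaler and transform are hypothetical stand-ins, not the scikit-learn API) to contrast statistics learned from the training split alone with statistics computed over all rows.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(x - mean) / std for x in values]


# Made-up data, already split.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuses the training statistics

# What scaling before the split would have used instead.
leaky_mean = statistics.fmean(train + test)
print(mean, leaky_mean)
```

The two means differ, which shows that statistics computed on the full dataset carry information from the held-out rows.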

23 $You are preparing data for a linear regression model. A nominal categorical feature, 'Country', has 5 unique values. Why is One-Hot Encoding preferred over Label Encoding in this scenario?$

Handling categorical data Medium

A.

Label Encoding would create a large number of new features, leading to dimensionality issues.

B.

One-Hot Encoding is computationally faster for linear models.

C.

Label Encoding would introduce an arbitrary ordinal relationship (e.g., 1 < 2 < 3) that the linear model might misinterpret as a meaningful ranking.

D.

One-Hot Encoding preserves the original 'Country' column in the dataset.

24 $You have a dataset with features on vastly different scales, and some of these features contain significant outliers. You need to prepare the data for an SVM model. Which scaling technique would be the most appropriate?$

Scaling and normalization Medium

A.

MinMaxScaler

B.

Normalizer

C.

RobustScaler

D.

StandardScaler

25 $When applying the Synthetic Minority Over-sampling Technique (SMOTE), what is the correct procedure to avoid data leakage and get a reliable estimate of model performance?$

Class imbalance handling Medium

A.

Apply SMOTE to the test set only to ensure it is balanced.

B.

Apply SMOTE to both the training and test data independently after the split.

C.

Perform the train-test split first, and then apply SMOTE only to the training data.

D.

Apply SMOTE to the entire dataset before performing a train-test split.

26 $You are analyzing a feature that is not normally distributed. Which method is generally more reliable for identifying outliers in this case?$

Outliers handling Medium

A.

Using the Interquartile Range (IQR) method.

B.

Visual inspection using a bar chart.

C.

Using the Z-score with a threshold of 3.

D.

Removing the top and bottom 1% of the data, regardless of distribution.

27 $A dataset contains a feature 'Product_Rating' with possible values: 'Poor', 'Average', 'Good', 'Excellent'. Which data type best describes this feature and why?$

Types of data Medium

A.

Interval, because the difference between 'Good' and 'Average' is quantifiable.

B.

Ratio, because there is a true zero point in the ratings.

C.

Ordinal, because the categories have a clear, meaningful order.

D.

Nominal, because the values are distinct categories.

28 $In a competition to predict if a patient has a certain disease, a feature treatment_administered is found to be highly predictive. However, this treatment is only administered after a diagnosis is confirmed. Including this feature in the model is an example of what?$

Data leakage concept Medium

A.

Multicollinearity

B.

The curse of dimensionality

C.

Target leakage

D.

Outlier-driven correlation

29 $You are dealing with a high-cardinality categorical feature like 'UserID', which has thousands of unique values. Applying One-Hot Encoding is infeasible due to the massive number of new features it would create. Which of the following is a viable alternative strategy?$

Handling categorical data Medium

A.

Binary Encoding, which creates fewer columns than One-Hot Encoding by using binary representations.

B.

Dropping the feature, as high-cardinality features are never useful.

C.

Converting the feature to an integer and using it directly in a linear model.

D.

Label Encoding, as it only creates one new feature.

30 $For which of the following machine learning algorithms is feature scaling generally NOT required for the model to perform well?$

Scaling and normalization Medium

A.

Support Vector Machines (SVM) with an RBF kernel

B.

K-Nearest Neighbors (KNN)

C.

Random Forest

D.

Principal Component Analysis (PCA)

31 $You observe that missing values in a 'Salary' column occur more frequently for participants who did not report their 'Level of Education'. This situation, where the probability of a value being missing depends on another observed feature, is known as:$

Handling missing values Medium

A.

Missing Completely At Random (MCAR)

B.

Missing Not At Random (MNAR)

C.

Systematically Missing Data (SMD)

D.

Missing At Random (MAR)

32 $What is the primary mechanism by which the Synthetic Minority Over-sampling Technique (SMOTE) works?$

Class imbalance handling Medium

A.

It removes majority class samples that are close to minority class samples.

B.

It creates new, synthetic minority class samples by interpolating between existing minority samples and their nearest neighbors.

C.

It duplicates existing minority class samples at random.

D.

It assigns higher weights to the minority class samples during model training.
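
Illustration: a sketch of a single interpolation step of the kind SMOTE performs. The nearest-neighbor search is omitted, and the two points and the helper name interpolate_sample are assumptions made for the example.

```python
import random


def interpolate_sample(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
minority_point = [1.0, 2.0]      # hypothetical minority sample
minority_neighbor = [3.0, 6.0]   # one of its minority-class neighbors
synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

The synthetic point is new: it lies between the two originals rather than duplicating either of them.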

33 $Instead of removing outliers from a 'house_price' feature, a data scientist applies a log transformation. What is the primary benefit of this approach?$

Outliers handling Medium

A.

It converts the feature into a categorical variable.

B.

It guarantees that the transformed feature will have a mean of 0 and a standard deviation of 1.

C.

It pulls in the high-value outliers, reducing the skewness of the distribution and making it more symmetric.

D.

It completely removes the influence of the outliers on the model.
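
Illustration: a short sketch of how a log transformation compresses large values. The question does not say which log variant is used, so math.log1p is an assumption here, and the prices are made up.

```python
import math
import statistics

# Hypothetical house prices with one very large value.
prices = [100_000, 120_000, 150_000, 180_000, 200_000, 5_000_000]
log_prices = [math.log1p(p) for p in prices]


def max_to_median(values):
    """Crude measure of how far the largest value sits from the center."""
    return max(values) / statistics.median(values)


ratio_raw = max_to_median(prices)
ratio_log = max_to_median(log_prices)
print(round(ratio_raw, 1), round(ratio_log, 2))
```

The extreme value is still the largest after the transformation, but it sits far closer to the rest of the data.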

34 $To ensure a robust and generalizable machine learning model, what is the correct order of these common pre-processing steps?$

Data pre-processing Medium

A.

1. Impute missing values, 2. Split data into train/test sets, 3. Scale features

B.

1. Scale features, 2. Impute missing values, 3. Split data into train/test sets

C.

1. Split data into train/test sets, 2. Scale features on the training set, 3. Apply the same scaling to the test set

D.

1. Split data into train/test sets, 2. Scale features on the entire dataset, 3. Train the model

35 $You are implementing Target Encoding for a categorical feature. What is a common technique to mitigate the risk of overfitting, especially for categories with few samples?$

Handling categorical data Medium

A.

Applying a smoothing or credibility factor, which blends the category's mean with the global mean of the target.

B.

Removing all categories that appear less than 10 times in the dataset.

C.

Using One-Hot Encoding as a backup for rare categories.

D.

Using the median of the target variable instead of the mean for encoding.
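
Illustration: a sketch of one common form of smoothed target encoding, the m-estimate, which blends each category mean with the global mean. The weight m, the data, and the function name are hypothetical, and in practice the encoding would be fitted on training folds only.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=10.0):
    """Return {category: (sum_y + m * global_mean) / (count + m)}."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}


# Category "a" has 50 rows, category "b" has a single positive row.
categories = ["a"] * 50 + ["b"]
targets = [1] * 25 + [0] * 25 + [1]

global_mean = sum(targets) / len(targets)
encoding = smoothed_target_encoding(categories, targets)
print(encoding)
```

The raw mean for "b" would be 1.0, while the smoothed value is pulled most of the way back toward the global mean because the category has only one sample.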

36 $What is the primary difference in the output of StandardScaler versus MinMaxScaler?$

Scaling and normalization Medium

A.

MinMaxScaler is a linear transformation, while StandardScaler is non-linear.

B.

StandardScaler requires the data to be normally distributed, while MinMaxScaler does not.

C.

MinMaxScaler removes outliers, while StandardScaler does not.

D.

StandardScaler transforms data to have a mean of 0 and a standard deviation of 1, while MinMaxScaler scales data to a fixed range, typically [0, 1].
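
Illustration: hand-written versions of the two transformations, using only the standard library. These are sketches of the underlying formulas, not the scikit-learn classes; the population standard deviation is used on the assumption that it matches StandardScaler's default behavior.

```python
import statistics


def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


values = [2.0, 4.0, 6.0, 8.0]  # made-up feature
scaled = min_max_scale(values)
z_scores = standardize(values)
print(scaled)
print(z_scores)
```

Both transformations are linear, so neither changes the shape of the distribution; they differ only in the reference points they use.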

37 $You are building a time-series forecasting model and have missing values. Simple mean or median imputation is not ideal. Which of the following would be a more suitable approach for this type of data?$

Handling missing values Medium

A.

K-Nearest Neighbors (KNN) imputation.

B.

Last Observation Carried Forward (LOCF) or interpolation.

C.

Dropping the time steps with missing data.

D.

Replacing missing values with a constant like -999.

38 $You are working on a credit card fraud detection project where fraud cases (positive class) are only 0.1% of the data. You train a model and achieve 99.9% accuracy. Why is accuracy a misleading metric in this context?$

Class imbalance handling Medium

A.

The dataset is too small for accuracy to be a stable metric.

B.

Accuracy is computationally expensive to calculate on imbalanced datasets.

C.

Accuracy can only be used for models that output probabilities.

D.

The model might be achieving high accuracy simply by predicting 'not fraud' for every single transaction.
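
Illustration: a sketch of the scenario in this question with a made-up dataset of 1,000 transactions at the stated 0.1% fraud rate. The "model" here is a hypothetical baseline that never predicts fraud.

```python
# 1 fraud case (label 1) among 1,000 transactions, i.e. 0.1% positives.
labels = [1] * 1 + [0] * 999

# A trivial baseline that predicts "not fraud" for everything.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy, recall)
```

Metrics that focus on the positive class, such as recall or precision, expose what the accuracy figure hides.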

39 $A feature in your dataset is 'postal_code'. While it is numerical, why is it incorrect to treat it as a continuous or ratio variable in a machine learning model?$

Types of data Medium

A.

Because postal codes often contain missing values that must be imputed first.

B.

Because the numerical difference between two postal codes (e.g., 90210 vs 90211) is not a meaningful measure of distance or magnitude.

C.

Because postal codes are always integers and models require floating-point numbers.

D.

Because postal codes have a true zero, making them ratio data by definition.

40 $Which of the following models is inherently more robust to outliers in the feature space, often requiring less stringent outlier pre-processing?$

Outliers handling Medium

A.

Support Vector Machines (SVM)

B.

Linear Regression

C.

Decision Tree based models (e.g., Random Forest)

D.

K-Means Clustering

41 $You are building a credit default model. You engineer a new feature z_score_income_by_profession by calculating the Z-score of each individual's income relative to the mean and standard deviation of income within their stated profession. You calculate these means and standard deviations using the entire dataset and then perform a 5-fold cross-validation. Why is this approach a form of data leakage?$

data leakage concept Hard

A.

The mean and standard deviation for a profession are calculated using data from all 5 folds, so information from the validation fold 'leaks' into the feature creation for the training folds.

B.

Using profession as a grouping variable introduces societal biases into the model.

C.

This is not data leakage; it is a valid form of feature engineering known as target-unaware group-wise normalization.

D.

Z-score transformation is a form of scaling and should only be applied after the model has been trained.

42 $You are working with a high-cardinality categorical feature 'user_id' (over 1 million unique IDs). You choose to use Target Encoding to convert it into a numerical feature for a gradient boosting model. What is the most critical step to prevent severe overfitting and data leakage when using this technique?$

handling categorical data Hard

A.

Using One-Hot Encoding instead, as it is immune to overfitting.

B.

Applying a smoothing factor or a Bayesian averaging approach, especially for categories with few samples.

C.

Hashing the 'user_id' feature before applying the target encoding.

D.

Normalizing the encoded feature using StandardScaler after the transformation.

43 $Consider a dataset where missing values in a feature 'serum_level' are suspected to be Missing Not At Random (MNAR). Specifically, patients with very high or very low serum levels are less likely to have their values recorded. In this scenario, why would MICE (Multivariate Imputation by Chained Equations) likely produce biased estimates if used naively?$

handling missing values Hard

A.

Because MICE requires the data to be normally distributed, which is unlikely for 'serum_level'.

B.

Because MICE can only handle numerical data and 'serum_level' might be stored as a string.

C.

Because MICE's underlying assumption is that data is Missing at Random (MAR) or Missing Completely at Random (MCAR), and it cannot model the mechanism causing the missingness itself.

D.

Because MICE is a deterministic method and will always impute the same value for a given pattern, ignoring the inherent uncertainty.

44 $You are using SMOTE (Synthetic Minority Oversampling TEchnique) to handle a severely imbalanced dataset. A colleague warns you that SMOTE can be detrimental when the minority class is very noisy and overlaps significantly with the majority class. What is the primary reason for this warning?$

class imbalance handling Hard

A.

SMOTE will change the distribution of the majority class, leading to information loss.

B.

SMOTE only works by duplicating existing minority samples, not creating new ones.

C.

SMOTE may generate synthetic samples in the overlapping region, effectively blurring the decision boundary and making class separation harder.

D.

SMOTE is computationally too expensive for datasets with significant class overlap.

45 $You are preparing a dataset with several features for a Support Vector Machine (SVM) with an RBF kernel. One feature, 'user_age', ranges from 18-90, while another, 'account_balance', ranges from -5,000 to 2,000,000. Both features contain significant outliers. Which scaling method is most appropriate and why?$

scaling and normalization Hard

A.

No scaling is needed, as tree-based kernels like RBF are immune to feature scale.

B.

MinMaxScaler, because it scales data to a fixed range [0, 1], which guarantees the fastest convergence for the SVM algorithm.

C.

RobustScaler, because it uses the interquartile range (IQR) to scale, making it insensitive to the influence of extreme outliers.

D.

StandardScaler, because it centers the data at zero, which is a strict requirement for SVMs.

46 $Instead of removing outliers, a data scientist decides to use Winsorization at the 98th percentile for a right-skewed feature. What is a key theoretical advantage of this method over simply capping the values at a fixed threshold (e.g., a domain-knowledge-based maximum)?$

outliers handling Hard

A.

Winsorization guarantees that the feature will have a normal distribution after transformation.

B.

Winsorization is guaranteed to improve the performance of any linear model.

C.

Winsorization is a non-parametric method, while capping at a fixed threshold is parametric.

D.

Winsorization pulls outliers towards the tail of the distribution, rather than clustering them at an arbitrary point, which can better preserve the feature's variance structure.

47 $In a time-series forecasting problem, you create features like a 7-day rolling average of sales. For a given day t, you calculate this average using data from days t-3 to t+3. You then use this feature to predict sales on day t. This is a classic example of data leakage. What is the correct way to engineer this feature?$

data leakage concept Hard

A.

Use a 'trailing' or 'lagged' window, calculating the average for day t using data from days t-7 to t-1.

B.

Calculate the rolling average over the entire dataset to get a stable global baseline.

C.

Apply a Box-Cox transformation to the sales data before calculating the rolling average.

D.

Use a 'centered' window but with a much larger window size, such as 30 days, to smooth out the noise.
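
Illustration: a sketch contrasting the centered window described in the question with a window that looks only at earlier days. The sales series and both helper names are made up for the example.

```python
def centered_mean(series, t, half=3):
    """Mean over days t-half .. t+half; includes day t and future days."""
    if t < half or t + half >= len(series):
        return None
    window = series[t - half:t + half + 1]
    return sum(window) / len(window)


def trailing_mean(series, t, window=7):
    """Mean over days t-window .. t-1; uses only days strictly before t."""
    if t < window:
        return None
    past = series[t - window:t]
    return sum(past) / window


# Hypothetical daily sales with a steady upward trend (index = day).
sales = list(range(1, 15))
t = 7

leaky_feature = centered_mean(sales, t)
safe_feature = trailing_mean(sales, t)
print(sales[t], leaky_feature, safe_feature)
```

On this trending series the centered average equals the value being predicted, which makes the leak easy to see; the backward-looking average is computed from information available before day t.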

48 $You are comparing One-Hot Encoding (OHE) and Dummy Coding for a categorical feature with 5 categories (A, B, C, D, E) for a multiple linear regression model. What is the primary statistical advantage of using Dummy Coding (which creates k-1 new features) over OHE (which creates k features) in this context?$

handling categorical data Hard

A.

It automatically assigns a baseline category that has the highest frequency, improving model accuracy.

B.

It avoids the 'dummy variable trap' (perfect multicollinearity), which can make the model's coefficient estimates unstable and uninterpretable.

C.

It preserves the ordinal nature of the categories, which OHE fails to do.

D.

It is more computationally efficient because it generates fewer columns.
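
Illustration: a sketch of the two encodings for the five categories in this question. The choice of the first level as the baseline is an assumption; any level could be dropped.

```python
levels = ["A", "B", "C", "D", "E"]


def one_hot(value, levels):
    """k indicator columns, one per level."""
    return [1 if value == level else 0 for level in levels]


def dummy_code(value, levels):
    """k-1 indicator columns; the first level is the all-zeros baseline."""
    return one_hot(value, levels)[1:]


ohe_rows = [one_hot(v, levels) for v in levels]
dummy_rows = [dummy_code(v, levels) for v in levels]

# Every one-hot row sums to 1, so the k columns add up to the intercept column.
row_sums = [sum(row) for row in ohe_rows]
print(row_sums)
print(dummy_rows)
```

Because the one-hot columns always sum to one, they are linearly dependent on a constant term; dropping one column removes that dependency.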

49 $You are training a model on an imbalanced dataset and decide to use ADASYN (Adaptive Synthetic Sampling) instead of SMOTE. In which scenario would ADASYN be theoretically superior to SMOTE?$

class imbalance handling Hard

A.

When the minority class samples that are harder to learn (i.e., those with more majority class neighbors) should be prioritized for synthetic sample generation.

B.

When the goal is to generate synthetic samples that are exact duplicates of the original minority samples.

C.

When the dataset is extremely large, as ADASYN is computationally simpler than SMOTE.

D.

When the feature space has a mix of categorical and numerical features.

50 $A dataset contains features with drastically different distributions (e.g., one is uniform, one is bimodal, one is heavily skewed). You want to use a method that forces all features to follow a uniform or normal distribution. Which pre-processing technique would achieve this, and what is its primary use case?$

scaling and normalization Hard

A.

StandardScaler, used to center data around a mean of 0 and standard deviation of 1.

B.

QuantileTransformer, used to handle non-Gaussian distributed features and reduce the impact of outliers for distance-based models.

C.

Normalizer (per-sample), used to scale samples to have unit norm, which is required for text classification.

D.

PowerTransformer with the 'box-cox' method, used exclusively for positive, skewed data to make it linear.

51 $You have a longitudinal dataset tracking patient measurements over time. The 'blood_pressure' feature has missing values. A simple approach is to use LOCF (Last Observation Carried Forward). What is the most significant statistical risk associated with this imputation method in time-series analysis?$

handling missing values Hard

A.

It is computationally intensive for long time series.

B.

It artificially reduces the variance of the feature and creates spurious temporal correlations, potentially leading to an underestimation of confidence intervals.

C.

It can only be applied if the first observation in the series is not missing.

D.

It introduces a systematic bias by always imputing values that are lower than the true mean.

52 $You are using the Isolation Forest algorithm for outlier detection. The algorithm works by building an ensemble of 'isolation trees' and identifies outliers as points that have a short average path length from the root node. Why is this method particularly effective for large, high-dimensional datasets compared to distance-based methods like DBSCAN or Local Outlier Factor (LOF)?$

outliers handling Hard

A.

Isolation Forest does not rely on distance calculations, which suffer from the 'curse of dimensionality' where distances between all points become almost equal.

B.

Isolation Forest can only detect global outliers, not local ones, which simplifies the problem in high dimensions.

C.

Isolation Forest requires the data to be scaled and normalized, which is a simpler pre-processing step than the parameter tuning required for DBSCAN.

D.

Isolation Forest is a supervised algorithm that learns from a small set of labeled outliers, making it more accurate.

53 $A feature in your dataset is 'customer_satisfaction_rating' with values [1, 2, 3, 4, 5], representing 'Very Unsatisfied' to 'Very Satisfied'. A colleague treats this feature as a continuous numerical variable and applies StandardScaler. What is the primary theoretical flaw in this approach?$

Types of data Hard

A.

StandardScaler cannot handle integer data; it requires floating-point inputs.

B.

The range is too small (1-5), and scaling will have no significant effect on any model.

C.

The data is ordinal, not interval or ratio. StandardScaler assumes mathematical operations like mean and standard deviation are meaningful, but the 'distance' between 1 and 2 may not be the same as between 4 and 5.

D.

The data is categorical, and One-Hot Encoding should have been used.

54 $You are designing a data pre-processing pipeline for a logistic regression model. The pipeline must include missing value imputation (using the mean), one-hot encoding for a categorical feature, and feature scaling using StandardScaler. What is the correct order of these operations, and why is any other order incorrect?$

Data pre-processing Hard

A.

Scaling -> Imputation -> One-Hot Encoding. Scaling first ensures that the imputation value (the mean) is calculated on the scaled data.

B.

One-Hot Encoding -> Imputation -> Scaling. This is the most efficient order as it allows for parallel processing.

C.

Imputation -> One-Hot Encoding -> Scaling. Encoding before imputation would be complex, and scaling before encoding would ignore the binary nature of the new columns.

D.

The order does not matter as all three operations are linear transformations.

55 $When would using Hash Encoding be a clearly superior choice over One-Hot Encoding (OHE) for a categorical feature?$

handling categorical data Hard

A.

For any feature with fewer than 10 categories, as it is more computationally efficient.

B.

In an online learning environment with a very high-cardinality feature where new categories appear frequently and memory is a major constraint.

C.

When training a linear regression model, as hash collisions have a regularizing effect similar to L2 regularization.

D.

When the categorical feature has a clear ordinal relationship that must be preserved.

56 $You are working on a fraud detection problem. Instead of using a data-level technique like SMOTE or ADASYN, you opt for an algorithmic-level approach by adjusting the class_weight parameter in your logistic regression model. How does setting class_weight='balanced' typically affect the model's decision-making process?$

class imbalance handling Hard

A.

It internally resamples the dataset to be balanced before each training epoch.

B.

It changes the optimization algorithm from gradient descent to a weighted least squares method.

C.

It adds a regularization term to the loss function that is proportional to the class imbalance ratio.

D.

It increases the penalty for misclassifying the minority class by a factor inversely proportional to its frequency, forcing the decision boundary to shift towards the majority class.
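
Illustration: assuming the question refers to scikit-learn, its documentation describes the 'balanced' heuristic as n_samples / (n_classes * count of each class). The sketch below reproduces that arithmetic in plain Python on made-up labels; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical fraud labels: 990 legitimate (0), 10 fraudulent (1).
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a fraud row costs roughly a hundred times more in the loss than an error on a legitimate row.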

57 $When using KNNImputer, what is a significant potential risk if the feature space is high-dimensional and features are not scaled properly before imputation?$

handling missing values Hard

A.

The imputer will default to mean imputation if the number of dimensions exceeds a certain threshold (typically 100).

B.

The algorithm's time complexity will increase from to, making it unusable.

C.

The curse of dimensionality will cause all nearest-neighbor distances to converge to zero, preventing the algorithm from finding any neighbors.

D.

The distance metric will be dominated by features with large scales, causing the 'nearest neighbors' to be chosen based on only a few features, leading to irrelevant imputations.
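
Illustration: a sketch of how feature scale affects Euclidean nearest-neighbor choice, which is the distance idea behind KNN-based imputation. It does not call KNNImputer itself; the points and the feature ranges are made up.

```python
import math

# (age in years, account balance); all values are hypothetical.
query = (25, 50_000)
similar_age = (26, 58_000)
different_age = (65, 50_500)
candidates = [similar_age, different_age]

# Unscaled: the balance column dominates the distance.
raw_nearest = min(candidates, key=lambda p: math.dist(query, p))

# Divide each feature by an assumed range so both contribute comparably.
ranges = (72, 2_000_000)


def scale(point):
    return tuple(v / r for v, r in zip(point, ranges))


scaled_nearest = min(candidates, key=lambda p: math.dist(scale(query), scale(p)))
print(raw_nearest, scaled_nearest)
```

Without scaling, the person 40 years older is picked as the nearest neighbor purely because the balances are close in absolute terms.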

58 $You have a very sparse dataset (many zero values), common in text analysis (e.g., TF-IDF vectors) or transaction data. You need to apply scaling before feeding it to a model. Why is StandardScaler generally a poor choice for this type of data, and which scaler would be more appropriate?$

scaling and normalization Hard

A.

StandardScaler will center the data by subtracting the mean, which destroys sparsity by converting all zero entries into non-zero, dense floating-point numbers. MaxAbsScaler is more appropriate.

B.

StandardScaler cannot handle zero values and will raise a 'division by zero' error during transformation.

C.

StandardScaler is computationally inefficient for sparse matrices. RobustScaler would be a better choice.

D.

Sparse data does not require scaling, as the zero values act as a natural baseline.
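
Illustration: a sketch of what mean-centering and max-absolute scaling each do to the zeros in a sparse column. Plain lists stand in for a sparse matrix column, and the values are made up.

```python
# One mostly-zero feature column (hypothetical values).
column = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 2.0]

# Mean-centering shifts every entry, including the zeros.
mean = sum(column) / len(column)
centered = [x - mean for x in column]

# Dividing by the largest absolute value leaves zeros at zero.
max_abs = max(abs(x) for x in column)
max_abs_scaled = [x / max_abs for x in column]


def count_zeros(values):
    return sum(1 for v in values if v == 0.0)


zeros_original = count_zeros(column)
zeros_centered = count_zeros(centered)
zeros_max_abs = count_zeros(max_abs_scaled)
print(zeros_original, zeros_centered, zeros_max_abs)
```

A sparse storage format only saves memory while the zeros stay zero, which is why a scaler that does not shift the data is preferable here.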

59 $Combining undersampling of the majority class with oversampling of the minority class is a common hybrid strategy. Which of the following pairs represents a sophisticated and synergistic combination for cleaning noisy datasets?$

class imbalance handling Hard

A.

ADASYN for oversampling followed by Cluster Centroids for undersampling.

B.

SMOTE for oversampling followed by Edited Nearest Neighbours (ENN) for undersampling.

C.

Tomek Links for undersampling followed by SMOTE for oversampling.

D.

Random Undersampling followed by Random Oversampling.

60 $During a data science competition, you discover that the row_id in the training data is perfectly correlated with the time the event occurred. For example, lower row_id values correspond to earlier dates. You decide not to use the row_id as a feature directly. However, you perform k-fold cross-validation by randomly shuffling and splitting the data. Why might your local cross-validation score be misleadingly high and not reflect your score on the future-dated, hidden test set?$

data leakage concept Hard

A.

K-fold cross-validation is not statistically valid for time-dependent data; bootstrap validation should be used instead.

B.

The row_id is a categorical feature and should have been one-hot encoded before cross-validation.

C.

The model is overfitting to the row_id even though it's not an explicit feature.

D.

The random shuffling violates the temporal ordering. The model is trained on future data to predict the past within each fold, a form of leakage that won't be possible on the real test set.

Unit 1 - Practice Quiz