1Which type of data represents categories with a meaningful order or ranking, but the difference between values is not defined?
A.Nominal Data
B.Ordinal Data
C.Interval Data
D.Ratio Data
Correct Answer: Ordinal Data
Explanation:Ordinal data represents categories with an intrinsic order (e.g., low, medium, high), but the mathematical difference between these categories is not quantifiable.
2Temperature measured in degrees Celsius is an example of which type of data?
A.Nominal Data
B.Ordinal Data
C.Interval Data
D.Ratio Data
Correct Answer: Interval Data
Explanation:Celsius is Interval data because the difference between values is meaningful, but it lacks a true absolute zero (0°C does not mean 'no temperature').
3In the context of Machine Learning, what is 'Data Leakage'?
A.The loss of data during transmission.
B.When information from outside the training dataset is used to create the model.
C.When the model performs poorly on training data.
D.The process of reducing dimensionality.
Correct Answer: When information from outside the training dataset is used to create the model.
Explanation:Data leakage occurs when the training data contains information about the target, but similar data will not be available when the model is used for prediction, leading to overly optimistic performance estimates.
4Which of the following is a common cause of data leakage during pre-processing?
A.Removing outliers from the training set.
B.Imputing missing values using the mean of the entire dataset (train + test).
C.One-hot encoding categorical variables.
D.Splitting data into train and test sets.
Correct Answer: Imputing missing values using the mean of the entire dataset (train + test).
Explanation:Calculating statistics (like mean or variance) on the entire dataset allows information from the test set to 'leak' into the training process. Statistics should be calculated only on the training set.
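A minimal sketch of leakage-free mean imputation, assuming missing values are represented as None. The helper names fit_mean and impute are hypothetical, chosen only for illustration: the mean is learned from the training values and then reused, unchanged, on the test values.

```python
from statistics import mean

def fit_mean(train_values):
    """Mean of the observed (non-None) training values only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)

def impute(values, fill_value):
    """Replace None with a previously fitted fill value."""
    return [fill_value if v is None else v for v in values]

train = [1.0, None, 3.0]
test = [None, 100.0]

train_mean = fit_mean(train)              # statistic learned from train only
train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)    # test reuses the train statistic
```

Note that the large test value (100.0) never influences the fill value, which is the point of fitting on the training set alone.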
5If data is Missing Completely at Random (MCAR), it means that:
A.The probability of missingness depends on the observed data.
B.The probability of missingness depends on the unobserved data.
C.The probability of missingness is unrelated to any data, observed or unobserved.
D.The missing values are caused by a system error only.
Correct Answer: The probability of missingness is unrelated to any data, observed or unobserved.
Explanation:MCAR implies there is no relationship between the missingness of the data and any values, observed or missing. It is essentially random.
6Which imputation technique is most suitable for handling missing values in a categorical feature?
Explanation:Categorical data cannot be averaged. Replacing missing values with the most frequent category (Mode) is a standard approach.
7What is the primary risk of dropping all rows containing missing values?
A.It introduces bias if data is not MCAR.
B.It increases the computational time.
C.It creates outliers.
D.It increases the variance of the model.
Correct Answer: It introduces bias if data is not MCAR.
Explanation:If the missingness is not random (e.g., specific groups are less likely to report income), dropping rows not only reduces the sample size but can also bias the model toward the groups that remain.
8Which formula represents Min-Max Scaling?
A.
B.
C.
D.
Correct Answer: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
Explanation:Min-Max scaling transforms features by scaling each feature to a given range, usually [0, 1].
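As a rough illustration, Min-Max scaling to a target range can be sketched as follows. The function name min_max_scale and the handling of a constant feature (mapping everything to the lower bound) are assumptions made for this example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero (assumed fallback).
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]

scaled = min_max_scale([10, 20, 30])
```

With the default range, the smallest input maps to 0 and the largest to 1.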
9Standardization (Z-score normalization) transforms data such that:
A.The minimum is 0 and the maximum is 1.
B.The mean is 0 and the standard deviation is 1.
C.The median is 0 and the range is 1.
D.All values are positive.
Correct Answer: The mean is 0 and the standard deviation is 1.
Explanation:Standardization centers the distribution around the mean (0) and scales it to have a unit standard deviation.
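A small sketch of Z-score standardization using only the standard library. It assumes the population standard deviation (pstdev); some tools use the sample standard deviation instead, which gives slightly different values.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant feature: every value equals the mean (assumed fallback).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After the transformation the values have mean 0 and standard deviation 1, up to floating-point error.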
10Which scaling technique is robust to outliers?
A.Min-Max Scaler
B.Standard Scaler
C.Robust Scaler
D.MaxAbs Scaler
Correct Answer: Robust Scaler
Explanation:The Robust Scaler uses the median and the Interquartile Range (IQR) rather than the mean and variance, making it less influenced by extreme outliers.
11In the context of outlier detection, what is the Interquartile Range (IQR)?
A.
B.
C.
D.
Correct Answer: $IQR = Q_3 - Q_1$
Explanation:IQR is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
12Using the IQR method, a data point is typically considered an outlier if it falls below or above:
A.
B.
C.
D.
Correct Answer: $Q_1 - 1.5 \times IQR$ or $Q_3 + 1.5 \times IQR$
Explanation:The standard 'fences' used in a boxplot to flag outliers lie $1.5 \times IQR$ below the first quartile ($Q_1$) and $1.5 \times IQR$ above the third quartile ($Q_3$).
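The IQR rule with the usual 1.5 multiplier can be sketched as below. Quartile conventions vary between tools; this example assumes the "inclusive" method of statistics.quantiles, and the function names are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Values falling strictly outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(data)
outliers = iqr_outliers(data)
```

For this sample, Q1 = 3 and Q3 = 7 under the assumed convention, so the fences sit at -3 and 13 and only the value 100 is flagged.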
13Which machine learning algorithm is generally insensitive to feature scaling?
A.K-Nearest Neighbors (KNN)
B.Support Vector Machines (SVM)
C.Decision Trees
D.Logistic Regression
Correct Answer: Decision Trees
Explanation:Decision trees split data based on thresholds of single features at a time. Rescaling a feature changes the numeric value of the threshold but not which samples fall on each side of the split.
14One-Hot Encoding is best used for:
A.Ordinal categorical data.
B.Nominal categorical data with low cardinality.
C.Nominal categorical data with extremely high cardinality.
D.Continuous numerical data.
Correct Answer: Nominal categorical data with low cardinality.
Explanation:One-Hot Encoding creates binary columns for each category. It is ideal for nominal data (no order) but can cause the 'Curse of Dimensionality' if the cardinality (number of unique categories) is too high.
15What is the 'Dummy Variable Trap'?
A.When categorical variables are ignored.
B.A scenario where independent variables are highly correlated (perfect multicollinearity) after One-Hot Encoding.
C.When missing values are replaced by zeros.
D.When the target variable is imbalanced.
Correct Answer: A scenario where independent variables are highly correlated (perfect multicollinearity) after One-Hot Encoding.
Explanation:If a categorical feature has $k$ unique values and we create $k$ binary columns, any one column can be predicted from the others (each row sums to 1). This perfect multicollinearity breaks some models like Linear Regression. Solution: drop one column ($k-1$ dummies).
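A minimal sketch of One-Hot Encoding with and without dropping one column. The helper one_hot and its drop_first flag are hypothetical names for this example; categories are ordered alphabetically, and the first one is the dropped baseline.

```python
def one_hot(values, drop_first=False):
    """Encode each value as a 0/1 row, one column per (sorted) category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]   # the dropped category becomes the all-zero baseline
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
full = one_hot(colors)                      # columns: blue, green, red
reduced = one_hot(colors, drop_first=True)  # columns: green, red
```

In the full encoding every row sums to 1, which is the redundancy behind the trap. In the reduced encoding "blue" is represented by a row of zeros, so no column is a linear combination of the others plus the intercept.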
16Label Encoding is potentially dangerous for linear models when used on nominal data because:
A.It introduces missing values.
B.It implies an artificial order or magnitude between categories.
C.It increases dimensionality.
D.It cannot handle text data.
Correct Answer: It implies an artificial order or magnitude between categories.
Explanation:Label encoding assigns integers (0, 1, 2...). A linear model might interpret category 2 as being 'greater than' category 1, which is incorrect for nominal data (e.g., Red vs Blue).
17Which technique for handling high cardinality categorical features involves replacing the category with the mean of the target variable?
A.Label Encoding
B.One-Hot Encoding
C.Target Encoding (Mean Encoding)
D.Frequency Encoding
Correct Answer: Target Encoding (Mean Encoding)
Explanation:Target encoding replaces a categorical value with the mean of the target variable for that specific category. It handles high cardinality well but risks overfitting.
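A simplified sketch of target (mean) encoding. It assumes the per-category means are learned on training data only and that unseen categories fall back to the global training mean; both function names are hypothetical, and no smoothing is applied, which practical implementations often add to limit overfitting.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Mean of the target for each category seen in the training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def target_encode(categories, means, default):
    """Replace each category with its fitted mean (default for unseen ones)."""
    return [means.get(c, default) for c in categories]

cities = ["A", "A", "B", "B", "B"]
churn = [1, 0, 1, 1, 0]

means = fit_target_means(cities, churn)
global_mean = sum(churn) / len(churn)
encoded = target_encode(["A", "B", "C"], means, default=global_mean)
```

Category "C" never appears in training, so it receives the global mean rather than a category-specific value.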
18Which of the following is an example of 'Structured Data'?
A.Audio recordings.
B.Images.
C.Relational database tables.
D.Emails.
Correct Answer: Relational database tables.
Explanation:Structured data is highly organized and formatted, typically in rows and columns (like SQL tables or CSVs), making it easy to search and analyze.
19SMOTE is a technique used for:
A.Feature Scaling.
B.Dimensionality Reduction.
C.Handling Class Imbalance.
D.Missing Value Imputation.
Correct Answer: Handling Class Imbalance.
Explanation:SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class to balance the dataset.
B.It interpolates between existing minority samples and their nearest neighbors.
C.It generates random noise around the majority class.
D.It removes majority samples.
Correct Answer: It interpolates between existing minority samples and their nearest neighbors.
Explanation:SMOTE selects a minority instance, finds its k-nearest neighbors, and creates a new point along the line segment joining the instance and a neighbor.
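The interpolation step alone can be sketched as follows. This is not a full SMOTE implementation: the k-nearest-neighbor search is omitted and the neighbor is simply passed in, and smote_point is a hypothetical name.

```python
import random

def smote_point(sample, neighbor, rng):
    """New point at a random position on the segment from sample to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)  # seeded only to make the sketch reproducible
synthetic = smote_point([1.0, 2.0], [3.0, 6.0], rng)
```

Every coordinate uses the same random gap, so the synthetic point always lies on the line segment between the two minority samples.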
21In the context of class imbalance, what is 'Undersampling'?
A.Reducing the number of features.
B.Reducing the number of samples in the majority class.
C.Reducing the number of samples in the minority class.
D.Reducing the model complexity.
Correct Answer: Reducing the number of samples in the majority class.
Explanation:Undersampling balances the dataset by randomly (or strategically) removing examples from the majority class.
22What is the primary drawback of Random Undersampling?
A.It leads to overfitting.
B.It increases training time significantly.
C.It may discard potentially useful information.
D.It creates synthetic data.
Correct Answer: It may discard potentially useful information.
Explanation:By removing actual data points from the majority class, the model loses information that could be valuable for learning decision boundaries.
23When handling missing values in time-series data, which method fills the missing value with the previous observed value?
A.Backward Fill
B.Forward Fill
C.Linear Interpolation
D.Mean Imputation
Correct Answer: Forward Fill
Explanation:Forward fill propagates the last valid observation forward until the next valid observation.
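A minimal sketch of forward fill on a list, assuming missing values are represented as None. Leading gaps stay missing because there is no earlier observation to carry forward.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

series = [None, 5, None, None, 7, None]
filled = forward_fill(series)
```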
24Which of the following distributions suggests that Log Transformation might be beneficial?
A.Normal Distribution
B.Uniform Distribution
C.Right-Skewed Distribution
D.Left-Skewed Distribution
Correct Answer: Right-Skewed Distribution
Explanation:Log transformation compresses the range of large values, pulling in the long right tail of a right-skewed distribution so that it becomes closer to symmetric (and often closer to normal).
25What is 'Binning' or 'Discretization' in data pre-processing?
A.Converting categorical features into numerical ones.
B.Converting continuous features into discrete intervals.
C.Removing missing values.
D.Scaling features to unit variance.
Correct Answer: Converting continuous features into discrete intervals.
Explanation:Binning groups continuous values into bins (intervals), converting numerical variables into categorical ones (e.g., Age 1-10 -> 'Child').
26Which Z-score value is typically used as a threshold to identify outliers?
A.
B.
C.
D.
Correct Answer: $|Z| > 3$
Explanation:In a normal distribution, about 99.7% of data points lie within 3 standard deviations of the mean. Values with $|Z| > 3$ are typically considered outliers.
27Why is 'Accuracy' a poor metric for imbalanced datasets?
A.It is computationally expensive to calculate.
B.It cannot handle negative values.
C.A model can achieve high accuracy by predicting only the majority class.
D.It requires scaled data.
Correct Answer: A model can achieve high accuracy by predicting only the majority class.
Explanation:If 99% of data is Class A and 1% is Class B, a model predicting 'All A' has 99% accuracy but is useless. F1-score or AUC-ROC are better metrics.
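The 99%/1% scenario can be checked directly. The sketch below uses hand-written accuracy and recall helpers (hypothetical names) and compares the "always predict A" model on both metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` samples that were predicted as `positive`."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positive:
        return 0.0
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)

y_true = ["A"] * 99 + ["B"]
y_pred = ["A"] * 100          # model that always predicts the majority class

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive="B")
```

Accuracy comes out at 0.99 while recall on Class B is 0, which is why metrics sensitive to the minority class are preferred here.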
28What is 'Winsorization' used for?
A.Imputing missing values.
B.Handling outliers by capping extreme values.
C.Encoding categorical variables.
D.Selecting features.
Correct Answer: Handling outliers by capping extreme values.
Explanation:Winsorization limits extreme values in the statistical data to reduce the effect of possibly spurious outliers by replacing them with a specified percentile value.
29In text data pre-processing, what does 'Tokenization' refer to?
A.Converting text to numbers.
B.Removing stop words.
C.Splitting text into smaller units like words or subwords.
D.Reducing words to their root form.
Correct Answer: Splitting text into smaller units like words or subwords.
Explanation:Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
30Which encoding method is suitable for cyclic features like 'Day of the Week' or 'Month'?
A.One-Hot Encoding
B.Label Encoding
C.Sin-Cos Transformation
D.Count Encoding
Correct Answer: Sin-Cos Transformation
Explanation:Cyclic features (where the end connects to the start, like Dec to Jan) are best represented using Sine and Cosine transformations to preserve the cyclic nature.
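A short sketch of the sine/cosine encoding, assuming months are numbered 0 (January) to 11 (December). The function name cyclic_encode is hypothetical.

```python
import math

def cyclic_encode(value, period):
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

jan = cyclic_encode(0, 12)
jun = cyclic_encode(5, 12)
dec = cyclic_encode(11, 12)

# Euclidean distance in the encoded space
dec_to_jan = math.dist(dec, jan)
jun_to_jan = math.dist(jun, jan)
```

In the encoded space December ends up closer to January than June is, whereas with plain integer labels December (11) would be the farthest month from January (0).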
31Which of the following indicates 'Left-Skewed' data?
A.
B.
C.
D.
Correct Answer: Mean < Median
Explanation:In a left-skewed distribution, the tail is on the left side, pulling the mean lower than the median.
32Data cleaning is distinct from Data reduction because:
Explanation:Data cleaning focuses on fixing errors and missing values. Data reduction focuses on reducing the size of the dataset (via feature selection or instance selection) without losing significant information.
33The 'Curse of Dimensionality' refers to problems caused by:
A.Too many rows in the dataset.
B.Too many missing values.
C.Too many features (columns) relative to the number of observations.
D.Highly correlated features.
Correct Answer: Too many features (columns) relative to the number of observations.
Explanation:As the number of features increases, the volume of the space increases so fast that the available data becomes sparse, making it difficult to find patterns.
34Which method handles class imbalance by modifying the loss function to penalize mistakes on the minority class more heavily?
A.SMOTE
B.Undersampling
C.Class Weights / Cost-sensitive Learning
D.Standardization
Correct Answer: Class Weights / Cost-sensitive Learning
Explanation:Instead of changing the data, this approach assigns a higher weight (cost) to misclassifying the minority class during model training.
35Which of the following is NOT a technique for Feature Scaling?
A.Min-Max Normalization
B.Z-score Standardization
C.Decimal Scaling
D.Principal Component Analysis (PCA)
Correct Answer: Principal Component Analysis (PCA)
Explanation:PCA is a dimensionality reduction technique, not a scaling technique (though it requires scaled data to work correctly).
36KNN Imputation finds missing values by:
A.Using the mean of the column.
B.Using the most frequent value.
C.Finding 'k' similar samples and averaging their values.
D.Predicting using a Linear Regression model.
Correct Answer: Finding 'k' similar samples and averaging their values.
Explanation:KNN Imputation identifies the nearest data points in the multi-dimensional space and uses their values to impute the missing data.
37When normalizing a dataset using Min-Max scaling to the range [0, 1], what happens to the maximum value of the original data?
A.It becomes 0.
B.It becomes 1.
C.It remains unchanged.
D.It becomes the mean.
Correct Answer: It becomes 1.
Explanation:By definition of the formula $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$, when $x = x_{max}$, the numerator equals the denominator and the result is 1.
38Which type of data leakage occurs when the feature set includes a variable that is a proxy for the target variable (e.g., 'Account Closed Date' when predicting 'Churn')?
A.Train-Test Contamination
B.Target Leakage
C.Optimization Leakage
D.Parameter Leakage
Correct Answer: Target Leakage
Explanation:Target leakage happens when a feature is included that would not be available at the time of prediction and effectively reveals the outcome.
39What is 'Feature Hashing' useful for?
A.Reducing the dimensionality of high-cardinality categorical data.
B.Sorting data.
C.Encrypting data for security.
D.Removing outliers.
Correct Answer: Reducing the dimensionality of high-cardinality categorical data.
Explanation:Feature Hashing maps high-cardinality categories to a fixed-size vector using a hashing function, avoiding the memory issues of One-Hot Encoding.
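A toy sketch of the hashing trick for a single categorical value. It assumes MD5 as the hash (chosen only because it is deterministic across runs) and a one-hot style output; production implementations typically use faster hashes and may use signed values to reduce the effect of collisions. The function names are hypothetical.

```python
import hashlib

def hash_bucket(category, n_buckets):
    """Deterministically map a string category to a bucket index."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(category, n_buckets=8):
    """Fixed-size vector with a 1 in the category's bucket."""
    vec = [0] * n_buckets
    vec[hash_bucket(category, n_buckets)] = 1
    return vec

vec = hash_encode("user_123456")
```

The vector length stays at n_buckets no matter how many distinct categories appear, at the cost of occasional collisions where two categories share a bucket.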
40Which of the following is a 'Univariate' method for outlier detection?
A.Z-score
B.DBSCAN
C.Isolation Forest
D.Local Outlier Factor (LOF)
Correct Answer: Z-score
Explanation:Z-score looks at a single variable's distribution. DBSCAN, Isolation Forest, and LOF are multivariate (they consider relationships between multiple variables).
41L2 Normalization (scaling to unit vector) modifies data such that:
A.The sum of the values in a row is 1.
B.The sum of the squares of the values in a row is 1.
C.The maximum value in a row is 1.
D.The mean of the row is 0.
Correct Answer: The sum of the squares of the values in a row is 1.
Explanation:L2 Normalization divides each vector by its Euclidean length (L2 norm), so the sum of squared elements equals 1.
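A minimal sketch of row-wise L2 normalization. Returning an all-zero row unchanged is an assumption made here to avoid division by zero.

```python
import math

def l2_normalize(row):
    """Divide each element by the row's Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0:
        return list(row)  # zero vector has no direction; left as-is (assumed)
    return [x / norm for x in row]

unit = l2_normalize([3.0, 4.0])
squared_sum = sum(x * x for x in unit)
```

The row [3, 4] has length 5, so it becomes [0.6, 0.8], and the squares of the result sum to 1 up to floating-point error.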
42In the context of Missing Data, what does MNAR stand for?
A.Missing Not At Random
B.Missing Null At Random
C.Missing Number As Ratio
D.Mean Null And Ratio
Correct Answer: Missing Not At Random
Explanation:MNAR means the missingness is related to the value itself (e.g., people with very high incomes refusing to disclose them).
43What is the result of 'Binarization'?
A.Scaling data to [0, 1].
B.Converting numerical features to boolean (0 or 1) based on a threshold.
C.Converting text to binary code.
D.Splitting data into two clusters.
Correct Answer: Converting numerical features to boolean (0 or 1) based on a threshold.
Explanation:Binarization maps continuous values to 0 or 1 depending on whether they are below or above a user-defined threshold.
44Which pre-processing step is essential before applying Principal Component Analysis (PCA)?
A.One-Hot Encoding
B.Feature Scaling (Standardization)
C.SMOTE
D.Binning
Correct Answer: Feature Scaling (Standardization)
Explanation:PCA seeks to maximize variance. If features have different scales (e.g., KM vs MM), PCA will be biased toward the feature with larger magnitude unless standardized.
45What is the risk of Over-sampling the minority class using simple duplication?
A.Loss of information.
B.Overfitting.
C.Underfitting.
D.Data Leakage.
Correct Answer: Overfitting.
Explanation:Duplicating existing samples causes the model to 'memorize' those specific points, leading to poor generalization (overfitting) on unseen data.
46DBSCAN is a clustering algorithm that can also be used for:
A.Imputing missing values.
B.Outlier/Noise detection.
C.Feature Scaling.
D.Label Encoding.
Correct Answer: Outlier/Noise detection.
Explanation:DBSCAN groups dense points together. Points that do not belong to any cluster (in low-density regions) are classified as noise/outliers.
47Which type of missing value mechanism allows for 'Row Deletion' (Complete Case Analysis) with the least bias?
A.MNAR (Missing Not At Random)
B.MAR (Missing At Random)
C.MCAR (Missing Completely At Random)
D.None of the above
Correct Answer: MCAR (Missing Completely At Random)
Explanation:If data is MCAR, the missing data is a random subset of the whole. Dropping it reduces sample size but does not bias the distribution.
48When using K-Fold Cross Validation, data pre-processing (like scaling) should be applied:
A.To the entire dataset before splitting into folds.
B.Inside the cross-validation loop, fitted on the training fold and applied to the validation fold.
C.Only to the training data, ignoring the validation data.
D.After the model has been trained.
Correct Answer: Inside the cross-validation loop, fitted on the training fold and applied to the validation fold.
Explanation:Applying scaling before splitting causes data leakage (information from the validation fold influences the scaler). It must be done inside the loop.
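A simplified sketch of scaling inside a cross-validation loop, on a single feature. The interleaved fold assignment and the function names are assumptions for illustration; the key point is that the mean and standard deviation come from the training fold only and are then applied to the validation fold.

```python
from statistics import mean, pstdev

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs using simple interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]

def scaled_folds(values, k):
    """Standardize each fold with statistics fitted on its training part only."""
    results = []
    for train_idx, val_idx in k_fold_indices(len(values), k):
        train = [values[i] for i in train_idx]
        mu = mean(train)
        sigma = pstdev(train) or 1.0   # guard against a constant training fold
        train_scaled = [(values[i] - mu) / sigma for i in train_idx]
        val_scaled = [(values[i] - mu) / sigma for i in val_idx]
        results.append((train_scaled, val_scaled))
    return results

folds = scaled_folds([1.0, 2.0, 3.0, 4.0, 5.0, 600.0], k=3)
```

In the fold where 600.0 is held out for validation, it has no effect on the fitted mean or standard deviation, so it shows up as a very large scaled value instead of silently shrinking the training data.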
49Frequency Encoding replaces a category with:
A.A sequential number.
B.The count or percentage of times it appears in the dataset.
C.The mean of the target variable.
D.A binary vector.
Correct Answer: The count or percentage of times it appears in the dataset.
Explanation:Frequency encoding maps the category to its frequency. It preserves information about the prevalence of the category.
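A minimal sketch of frequency encoding with either raw counts or proportions. The normalize flag is a hypothetical parameter; in practice the counts would be learned on the training set and reused on new data.

```python
from collections import Counter

def frequency_encode(values, normalize=False):
    """Replace each category with its count (or share) in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total if normalize else counts[v] for v in values]

labels = ["a", "b", "a", "c", "a"]
as_counts = frequency_encode(labels)
as_shares = frequency_encode(labels, normalize=True)
```

Note that "b" and "c" receive the same code because they occur equally often, which is a known limitation of this encoding.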
50Identify the Ratio Data among the following:
A.IQ Score
B.Temperature in Celsius
C.Height in Centimeters
D.Movie Rating (1-5 stars)
Correct Answer: Height in Centimeters
Explanation:Height is Ratio data because it has a true zero (0 cm means no height) and ratios are meaningful (e.g., 180 cm is twice as tall as 90 cm). IQ and Celsius are Interval; Ratings are Ordinal.