Unit 1 - Practice Quiz

CSE274 50 Questions

1 Which type of data represents categories with a meaningful order or ranking, but the difference between values is not defined?

A. Nominal Data
B. Ordinal Data
C. Interval Data
D. Ratio Data

2 Temperature measured in degrees Celsius is an example of which type of data?

A. Nominal Data
B. Ordinal Data
C. Interval Data
D. Ratio Data

3 In the context of Machine Learning, what is 'Data Leakage'?

A. The loss of data during transmission.
B. When information from outside the training dataset is used to create the model.
C. When the model performs poorly on training data.
D. The process of reducing dimensionality.

4 Which of the following is a common cause of data leakage during pre-processing?

A. Removing outliers from the training set.
B. Imputing missing values using the mean of the entire dataset (train + test).
C. One-hot encoding categorical variables.
D. Splitting data into train and test sets.

5 If data is Missing Completely at Random (MCAR), it means that:

A. The probability of missingness depends on the observed data.
B. The probability of missingness depends on the unobserved data.
C. The probability of missingness is unrelated to any data, observed or unobserved.
D. The missing values are caused by a system error only.

6 Which imputation technique is most suitable for handling missing values in a categorical feature?

A. Mean Imputation
B. Median Imputation
C. Mode (Frequent Category) Imputation
D. Linear Interpolation

7 What is the primary risk of dropping all rows containing missing values?

A. It introduces bias if data is not MCAR.
B. It increases the computational time.
C. It creates outliers.
D. It increases the variance of the model.

8 Which formula represents Min-Max Scaling?

A.
B.
C.
D.

9 Standardization (Z-score normalization) transforms data such that:

A. The minimum is 0 and the maximum is 1.
B. The mean is 0 and the standard deviation is 1.
C. The median is 0 and the range is 1.
D. All values are positive.

10 Which scaling technique is robust to outliers?

A. Min-Max Scaler
B. Standard Scaler
C. Robust Scaler
D. MaxAbs Scaler

11 In the context of outlier detection, what is the Interquartile Range (IQR)?

A.
B.
C.
D.

12 Using the IQR method, a data point is typically considered an outlier if it falls below or above:

A.
B.
C.
D.

13 Which machine learning algorithm is generally insensitive to feature scaling?

A. K-Nearest Neighbors (KNN)
B. Support Vector Machines (SVM)
C. Decision Trees
D. Logistic Regression

14 One-Hot Encoding is best used for:

A. Ordinal categorical data.
B. Nominal categorical data with low cardinality.
C. Nominal categorical data with extremely high cardinality.
D. Continuous numerical data.

15 What is the 'Dummy Variable Trap'?

A. When categorical variables are ignored.
B. A scenario where independent variables are highly correlated (perfect multicollinearity) after One-Hot Encoding.
C. When missing values are replaced by zeros.
D. When the target variable is imbalanced.

16 Label Encoding is potentially dangerous for linear models when used on nominal data because:

A. It introduces missing values.
B. It implies an artificial order or magnitude between categories.
C. It increases dimensionality.
D. It cannot handle text data.

17 Which technique for handling high cardinality categorical features involves replacing the category with the mean of the target variable?

A. Label Encoding
B. One-Hot Encoding
C. Target Encoding (Mean Encoding)
D. Frequency Encoding

18 Which of the following is an example of 'Structured Data'?

A. Audio recordings.
B. Images.
C. Relational database tables.
D. Emails.

19 SMOTE is a technique used for:

A. Feature Scaling.
B. Dimensionality Reduction.
C. Handling Class Imbalance.
D. Missing Value Imputation.

20 How does SMOTE generate new samples?

A. It duplicates existing minority samples randomly.
B. It interpolates between existing minority samples and their nearest neighbors.
C. It generates random noise around the majority class.
D. It removes majority samples.

21 In the context of class imbalance, what is 'Undersampling'?

A. Reducing the number of features.
B. Reducing the number of samples in the majority class.
C. Reducing the number of samples in the minority class.
D. Reducing the model complexity.

22 What is the primary drawback of Random Undersampling?

A. It leads to overfitting.
B. It increases training time significantly.
C. It may discard potentially useful information.
D. It creates synthetic data.

23 When handling missing values in time-series data, which method fills the missing value with the previous observed value?

A. Backward Fill
B. Forward Fill
C. Linear Interpolation
D. Mean Imputation

24 Which of the following distributions suggests that Log Transformation might be beneficial?

A. Normal Distribution
B. Uniform Distribution
C. Right-Skewed Distribution
D. Left-Skewed Distribution

25 What is 'Binning' or 'Discretization' in data pre-processing?

A. Converting categorical features into numerical ones.
B. Converting continuous features into discrete intervals.
C. Removing missing values.
D. Scaling features to unit variance.

26 Which Z-score value is typically used as a threshold to identify outliers?

A.
B.
C.
D.

27 Why is 'Accuracy' a poor metric for imbalanced datasets?

A. It is computationally expensive to calculate.
B. It cannot handle negative values.
C. A model can achieve high accuracy by predicting only the majority class.
D. It requires scaled data.

28 What is 'Winsorization' used for?

A. Imputing missing values.
B. Handling outliers by capping extreme values.
C. Encoding categorical variables.
D. Selecting features.

29 In text data pre-processing, what does 'Tokenization' refer to?

A. Converting text to numbers.
B. Removing stop words.
C. Splitting text into smaller units like words or subwords.
D. Reducing words to their root form.

30 Which encoding method is suitable for cyclic features like 'Day of the Week' or 'Month'?

A. One-Hot Encoding
B. Label Encoding
C. Sin-Cos Transformation
D. Count Encoding

31 Which of the following indicates 'Left-Skewed' data?

A.
B.
C.
D.

32 Data cleaning is distinct from data reduction because:

A. Cleaning handles missing values/noise; reduction decreases volume/dimensions.
B. Cleaning reduces dimensions; reduction handles noise.
C. They are the same process.
D. Cleaning is only for images.

33 The 'Curse of Dimensionality' refers to problems caused by:

A. Too many rows in the dataset.
B. Too many missing values.
C. Too many features (columns) relative to the number of observations.
D. Highly correlated features.

34 Which method handles class imbalance by modifying the loss function to penalize mistakes on the minority class more heavily?

A. SMOTE
B. Undersampling
C. Class Weights / Cost-sensitive Learning
D. Standardization

35 Which of the following is NOT a technique for Feature Scaling?

A. Min-Max Normalization
B. Z-score Standardization
C. Decimal Scaling
D. Principal Component Analysis (PCA)

36 KNN Imputation fills in missing values by:

A. Using the mean of the column.
B. Using the most frequent value.
C. Finding 'k' similar samples and averaging their values.
D. Predicting using a Linear Regression model.

37 When normalizing a dataset using Min-Max scaling to the range [0, 1], what happens to the maximum value of the original data?

A. It becomes 0.
B. It becomes 1.
C. It remains unchanged.
D. It becomes the mean.

38 Which type of data leakage occurs when the feature set includes a variable that is a proxy for the target variable (e.g., 'Account Closed Date' when predicting 'Churn')?

A. Train-Test Contamination
B. Target Leakage
C. Optimization Leakage
D. Parameter Leakage

39 What is 'Feature Hashing' useful for?

A. Reducing the dimensionality of high-cardinality categorical data.
B. Sorting data.
C. Encrypting data for security.
D. Removing outliers.

40 Which of the following is a 'Univariate' method for outlier detection?

A. Z-score
B. DBSCAN
C. Isolation Forest
D. Local Outlier Factor (LOF)

41 L2 Normalization (scaling each row to a unit vector) modifies data such that:

A. The sum of the values in a row is 1.
B. The sum of the squares of the values in a row is 1.
C. The maximum value in a row is 1.
D. The mean of the row is 0.

42 In the context of Missing Data, what does MNAR stand for?

A. Missing Not At Random
B. Missing Null At Random
C. Missing Number As Ratio
D. Mean Null And Ratio

43 What is the result of 'Binarization'?

A. Scaling data to [0, 1].
B. Converting numerical features to boolean (0 or 1) based on a threshold.
C. Converting text to binary code.
D. Splitting data into two clusters.

44 Which pre-processing step is essential before applying Principal Component Analysis (PCA)?

A. One-Hot Encoding
B. Feature Scaling (Standardization)
C. SMOTE
D. Binning

45 What is the risk of oversampling the minority class using simple duplication?

A. Loss of information.
B. Overfitting.
C. Underfitting.
D. Data Leakage.

46 DBSCAN is a clustering algorithm that can also be used for:

A. Imputing missing values.
B. Outlier/Noise detection.
C. Feature Scaling.
D. Label Encoding.

47 Which type of missing value mechanism allows for 'Row Deletion' (Complete Case Analysis) with the least bias?

A. MNAR (Missing Not At Random)
B. MAR (Missing At Random)
C. MCAR (Missing Completely At Random)
D. None of the above

48 When using K-Fold Cross Validation, data pre-processing (like scaling) should be applied:

A. To the entire dataset before splitting into folds.
B. Inside the cross-validation loop, fitted on the training fold and applied to the validation fold.
C. Only to the training data, ignoring the validation data.
D. After the model has been trained.

49 Frequency Encoding replaces a category with:

A. A sequential number.
B. The count or percentage of times it appears in the dataset.
C. The mean of the target variable.
D. A binary vector.

50 Identify the Ratio Data among the following:

A. IQ Score
B. Temperature in Celsius
C. Height in Centimeters
D. Movie Rating (1-5 stars)