Unit 1 - Practice Quiz

INT395

1 What is the primary characteristic of Supervised Learning?

A. The model learns from unlabeled data to find hidden patterns.
B. The model learns from a labeled dataset containing input-output pairs.
C. The model interacts with an environment and learns via a reward system.
D. The model groups data points based on inherent similarities without predefined categories.

2 Which of the following scenarios is a Regression problem?

A. Predicting whether an email is spam or not.
B. Predicting the price of a house based on its square footage.
C. Classifying an image as a cat or a dog.
D. Grouping customers based on purchasing behavior.

3 Which library in Python is the standard for implementing classic machine learning algorithms like Decision Trees and SVMs?

A. Pandas
B. Scikit-learn
C. Matplotlib
D. NumPy

4 In a dataset, Ordinal Data refers to:

A. Categorical data with no intrinsic order (e.g., Red, Blue, Green).
B. Categorical data with a clear ordering or ranking (e.g., Low, Medium, High).
C. Continuous numerical data (e.g., Height, Weight).
D. Binary data (e.g., True/False).

5 Which Pandas function is primarily used to load data from a Comma Separated Values file?

A. pd.load_csv()
B. pd.read_excel()
C. pd.read_csv()
D. pd.import_data()

6 What is the purpose of the df.describe() method in Pandas?

A. To visualize the correlation matrix.
B. To show the data types and non-null counts of columns.
C. To provide summary statistics (mean, std, min, max) for numerical columns.
D. To drop missing values from the dataframe.

7 When handling missing data, what is Imputation?

A. Removing the rows containing missing values.
B. Replacing missing values with substituted values (e.g., mean, median, mode).
C. Ignoring the column containing missing values.
D. Converting the missing values to a specific category like "Unknown".
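
A minimal sketch of mean imputation in Pandas, for reference. The DataFrame and its `age` column are hypothetical; median or mode could be substituted the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
df = pd.DataFrame({"age": [25, np.nan, 31, np.nan]})

# Mean of the observed values; NaN entries are skipped by default
age_mean = df["age"].mean()

# Replace each missing value with that mean (returns a new Series)
filled = df["age"].fillna(age_mean)
print(filled.tolist())
```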

8 Which visualization is most effective for identifying Outliers in a numerical feature?

A. Scatter Plot
B. Box Plot
C. Pie Chart
D. Bar Chart

9 In the context of outlier detection, what does the IQR (Interquartile Range) represent?

A. The difference between the maximum and minimum values.
B. The difference between the 75th percentile (Q3) and the 25th percentile (Q1).
C. The standard deviation of the dataset.
D. The distance between the mean and the median.
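
The sketch below computes quartiles and the IQR with NumPy on made-up values. The 1.5 × IQR fences are a common convention assumed here for illustration, not something this quiz specifies.

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# 25th and 75th percentiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(iqr, outliers)
```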

10 What is the formula for Min-Max Scaling (Normalization)?

A.
B.
C.
D.

11 Which scaling technique transforms data to have a mean of 0 and a standard deviation of 1?

A. Min-Max Scaling
B. Standardization (Z-score normalization)
C. Robust Scaling
D. Log Transformation
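
As a hedged illustration of the two scaling techniques named in questions 10 and 11, the sketch below applies each by hand with NumPy to made-up values. It uses the population standard deviation (NumPy's default), which is an assumption about the intended definition.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up feature values

# Min-Max scaling: map the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard.mean(), x_standard.std())
```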

12 Why is One-Hot Encoding preferred over Label Encoding for nominal categorical variables?

A. It requires less memory.
B. It prevents the model from assuming a mathematical order or rank between categories.
C. It handles missing values automatically.
D. It is faster to compute.

13 What is the Dummy Variable Trap?

A. When categorical variables are not encoded.
B. When independent variables are highly correlated (multicollinearity) due to including all dummy variables.
C. When the target variable is imbalanced.
D. When missing values are replaced by zeros.
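
A small sketch of one-hot encoding with `pd.get_dummies`, using a hypothetical `color` column. With every dummy column kept, the columns in each row always sum to 1, so any one of them is determined by the others; `drop_first=True` is one way to remove that redundancy.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})  # hypothetical data

# One binary column per category
full = pd.get_dummies(df["color"], dtype=int)

# Same encoding with the first category dropped
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())   # every row sums to 1
print(list(reduced.columns))
```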

14 Which technique is commonly used to handle Class Imbalance by generating synthetic samples for the minority class?

A. Random Undersampling
B. SMOTE (Synthetic Minority Over-sampling Technique)
C. Stratified K-Fold
D. Principal Component Analysis

15 What is the primary goal of Feature Selection?

A. To create new features from existing ones.
B. To select a subset of relevant features to improve model performance and reduce complexity.
C. To scale features to the same range.
D. To fill missing values in the features.

16 Which of the following is an example of a Wrapper Method for feature selection?

A. Correlation Matrix
B. Recursive Feature Elimination (RFE)
C. Lasso Regression (L1 regularization)
D. Variance Threshold

17 What is the purpose of train_test_split in machine learning?

A. To separate numerical and categorical columns.
B. To split the dataset into training and validation/test sets to evaluate generalization.
C. To split the dataset into features (X) and target (y).
D. To remove outliers from the data.

18 What is Data Leakage?

A. When data is lost during file transfer.
B. When information from outside the training dataset (like the test set) is used to create the model.
C. When the model leaks sensitive user information.
D. When the variance of the data is too high.

19 Which plot is best for visualizing the relationship between two continuous variables?

A. Histogram
B. Scatter Plot
C. Bar Chart
D. Box Plot

20 In the context of Pandas, what does df.isnull().sum() return?

A. The total number of rows in the dataframe.
B. The sum of all values in the dataframe.
C. The count of missing values in each column.
D. The count of unique values in each column.
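
For reference, a minimal sketch of `df.isnull().sum()` on a hypothetical DataFrame; the column names and values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Agra"],
})

# isnull() gives a boolean frame; sum() adds up the True values column by column
missing = df.isnull().sum()
print(missing)
```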

21 When performing a train-test split on an imbalanced dataset, which parameter ensures the class distribution is preserved in both sets?

A. shuffle=True
B. random_state=42
C. stratify=y
D. test_size=0.2
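
A hedged sketch of a train-test split on a made-up imbalanced dataset (90 samples of class 0, 10 of class 1), comparing the class counts in each resulting set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)      # made-up single feature
y = np.array([0] * 90 + [1] * 10)      # 10% minority class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Minority-class counts in each split
print(int(y_train.sum()), int(y_test.sum()))
```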

22 Which of the following is a technique for Dimensionality Reduction?

A. Linear Regression
B. Principal Component Analysis (PCA)
C. K-Nearest Neighbors
D. Logistic Regression

23 The Curse of Dimensionality refers to:

A. The difficulty of visualizing 3D data.
B. Issues that arise when analyzing data in high-dimensional spaces (sparse data, increased computation).
C. The inability to add more features to a model.
D. The error caused by using incorrect units of measurement.

24 What is Feature Engineering?

A. Selecting the best hardware for training.
B. The process of using domain knowledge to extract or create new features from raw data.
C. Removing all categorical variables.
D. Downloading datasets from the internet.

25 Which Scikit-learn module contains StandardScaler and MinMaxScaler?

A. sklearn.linear_model
B. sklearn.preprocessing
C. sklearn.metrics
D. sklearn.ensemble

26 If a feature has a Variance of 0, what does it imply?

A. The feature has a high correlation with the target.
B. The feature contains only one unique value for all samples.
C. The feature is normally distributed.
D. The feature has missing values.

27 Which of the following is considered Unstructured Data?

A. A SQL database table.
B. An Excel spreadsheet.
C. Images and Audio files.
D. A CSV file with labeled columns.

28 What does a correlation coefficient of -0.9 indicate between two features?

A. No relationship.
B. A strong positive linear relationship.
C. A strong negative linear relationship.
D. A weak negative linear relationship.

29 When using LabelEncoder, how is the data transformed?

A. It converts text labels into binary columns.
B. It converts text labels into integers (0, 1, 2, ...).
C. It scales the data between 0 and 1.
D. It removes the column.
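
A minimal sketch of `LabelEncoder` on made-up labels. Note that the learned classes are sorted alphabetically, so the resulting codes do not reflect any semantic ranking of the labels.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Low", "High", "Medium", "Low"]  # made-up labels

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(list(encoder.classes_))  # classes in sorted order
print(list(codes))
```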

30 Which algorithm is generally NOT sensitive to the scale of features?

A. K-Nearest Neighbors (KNN)
B. Support Vector Machines (SVM)
C. Decision Trees
D. K-Means Clustering

31 In Scikit-Learn, what is the role of the fit() method?

A. To make predictions on new data.
B. To calculate the accuracy of the model.
C. To learn parameters (e.g., mean, coefficients) from the training data.
D. To split the data.

32 What is the difference between fit_transform() and transform()?

A. fit_transform is used on the training set to learn parameters and apply them; transform is used on the test set using learned parameters.
B. fit_transform is used on the test set; transform is used on the training set.
C. They are identical and can be used interchangeably.
D. transform is only used for image data.

33 Which Seaborn plot is used to visualize the Distribution of a single numerical variable?

A. sns.heatmap()
B. sns.scatterplot()
C. sns.histplot() (or the deprecated sns.distplot())
D. sns.countplot()

34 How do you handle Duplicate Rows in Pandas?

A. df.drop_duplicates()
B. df.remove_copies()
C. df.delete_repeats()
D. df.unique()

35 In PCA, what represents the direction of maximum variance in the data?

A. The Eigenvalues
B. The Principal Components (Eigenvectors)
C. The Mean vector
D. The Covariance matrix

36 What is Target Encoding (or Mean Encoding)?

A. Encoding categorical variables based on the mean of the target variable for that category.
B. Encoding the target variable into a One-Hot vector.
C. Replacing the target with the mean of the features.
D. Assigning random numbers to the target.

37 Which of the following indicates a skewed distribution?

A. Mean = Median = Mode
B. The distribution is symmetrical.
C. The tail of the distribution is longer on one side than the other.
D. The standard deviation is 0.

38 What is the result of executing df.info()?

A. A summary of statistical metrics.
B. A concise summary of the DataFrame including index dtype, columns, non-null values, and memory usage.
C. The first 5 rows of the DataFrame.
D. A correlation heatmap.

39 Before feeding text data into a supervised learning model, it must be converted into numerical vectors. This process is called:

A. Vectorization (e.g., TF-IDF, Bag of Words)
B. Normalization
C. Imputation
D. Classification

40 Which method helps in identifying Multicollinearity among features?

A. Confusion Matrix
B. ROC Curve
C. Heatmap of the Correlation Matrix
D. Scatter plot of Feature vs Target

41 If a dataset has missing values that are MCAR (Missing Completely At Random), which handling method is generally safe if the dataset is large?

A. Dropping the rows with missing values.
B. Replacing with a constant like -1.
C. Using a complex prediction model.
D. Leaving them as NaN.

42 What is the advantage of using a Pipeline in Scikit-Learn?

A. It allows for parallel processing on GPUs.
B. It chains together multiple processing steps (scaling, encoding, modeling) into a single object, preventing data leakage.
C. It automatically selects the best algorithm.
D. It creates a graphical user interface.
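
A hedged sketch of a Scikit-Learn `Pipeline` on synthetic data. The step names `"scale"` and `"model"` and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and modeling chained into one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on X_train only, then fits the model on the scaled data
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```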

43 Which feature selection method uses a model's coef_ or feature_importances_ attribute to select features?

A. Filter Method
B. Embedded Method
C. Wrapper Method
D. Unsupervised Method

44 What does df.shape return in Pandas?

A. (Number of Columns, Number of Rows)
B. (Number of Rows, Number of Columns)
C. (Total Elements,)
D. (Number of Unique Values,)

45 Which of the following is a Classification algorithm?

A. Linear Regression
B. Logistic Regression
C. Polynomial Regression
D. Ridge Regression

46 When detecting outliers using the Z-score method, a common threshold to identify an outlier is a Z-score absolute value greater than:

A. 1
B. 1.5
C. 3
D. 10

47 What is the correct syntax to drop a column named 'ID' from a Pandas DataFrame df?

A. df.drop('ID', axis=0)
B. df.drop('ID', axis=1)
C. df.remove('ID')
D. df.delete('ID')
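
For reference, a small sketch of `DataFrame.drop` on a hypothetical DataFrame. `drop` returns a new DataFrame by default and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "score": [90, 85]})  # hypothetical data

# axis=1 refers to columns; axis=0 would look for a row label 'ID'
dropped = df.drop("ID", axis=1)

print(list(dropped.columns))
print(list(df.columns))  # original still has both columns
```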

48 Why is Exploratory Data Analysis (EDA) a critical first step?

A. It is required by the Python interpreter.
B. To understand data structure, detect anomalies, test assumptions, and determine preprocessing needs.
C. It automatically trains the model.
D. It increases the size of the dataset.

49 Which encoding technique creates a binary column for every category level?

A. Label Encoding
B. Target Encoding
C. One-Hot Encoding
D. Ordinal Encoding

50 What is the main drawback of PCA?

A. It increases the dimensionality of the data.
B. It is computationally very expensive for small datasets.
C. The resulting Principal Components are often difficult to interpret in terms of original features.
D. It only works on categorical data.