Unit 1 - Practice Quiz

INT395 50 Questions

1 What is the primary characteristic of Supervised Learning?

A. The model learns from unlabeled data to find hidden patterns.
B. The model learns from a labeled dataset containing input-output pairs.
C. The model groups data points based on inherent similarities without predefined categories.
D. The model interacts with an environment and learns via a reward system.

2 Which of the following scenarios is a Regression problem?

A. Predicting the price of a house based on its square footage.
B. Grouping customers based on purchasing behavior.
C. Predicting whether an email is spam or not.
D. Classifying an image as a cat or a dog.

3 Which library in Python is the standard for implementing classic machine learning algorithms like Decision Trees and SVMs?

A. Pandas
B. Matplotlib
C. NumPy
D. Scikit-learn

4 In a dataset, Ordinal Data refers to:

A. Categorical data with a clear ordering or ranking (e.g., Low, Medium, High).
B. Binary data (e.g., True/False).
C. Continuous numerical data (e.g., Height, Weight).
D. Categorical data with no intrinsic order (e.g., Red, Blue, Green).

5 Which Pandas function is primarily used to load data from a Comma Separated Values file?

A. pd.import_data()
B. pd.read_csv()
C. pd.read_excel()
D. pd.load_csv()

6 What is the purpose of the df.describe() method in Pandas?

A. To show the data types and non-null counts of columns.
B. To provide summary statistics (mean, std, min, max) for numerical columns.
C. To visualize the correlation matrix.
D. To drop missing values from the dataframe.

7 When handling missing data, what is Imputation?

A. Converting the missing values to a specific category like "Unknown".
B. Ignoring the column containing missing values.
C. Removing the rows containing missing values.
D. Replacing missing values with substituted values (e.g., mean, median, mode).

8 Which visualization is most effective for identifying Outliers in a numerical feature?

A. Box Plot
B. Pie Chart
C. Bar Chart
D. Scatter Plot

9 In the context of outlier detection, what does the IQR (Interquartile Range) represent?

A. The standard deviation of the dataset.
B. The difference between the 75th percentile (Q3) and the 25th percentile (Q1).
C. The distance between the mean and the median.
D. The difference between the maximum and minimum values.
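
Reference note (question 9): a minimal sketch of how the IQR is typically used to flag outliers, assuming the common 1.5 * IQR fence rule. The helper name `iqr_outliers` and the sample data are hypothetical, and quartile values can differ slightly between libraries depending on the interpolation method.

```python
import statistics


def iqr_outliers(values):
    """Return (iqr, lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # method="inclusive" treats the data as the full population when
    # interpolating quartiles; other methods may give slightly different cuts.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return iqr, lower, upper, outliers


sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up data for illustration
iqr, lower, upper, outliers = iqr_outliers(sample)
print(iqr, lower, upper, outliers)
```

A box plot draws the same fences visually: points beyond the whiskers correspond to the values collected in `outliers`.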

10 What is the formula for Min-Max Scaling (Normalization)?

A.
B.
C.
D.

11 Which scaling technique transforms data to have a mean of 0 and a standard deviation of 1?

A. Log Transformation
B. Min-Max Scaling
C. Standardization (Z-score normalization)
D. Robust Scaling
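
Reference note (questions 10 and 11): a plain-Python sketch contrasting the two scalings, assuming the usual definitions x' = (x - min) / (max - min) for Min-Max scaling and z = (x - mean) / std for standardization. The function names and the sample values are hypothetical; in practice scikit-learn's MinMaxScaler and StandardScaler would normally be used.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Note: a constant feature (hi == lo) would divide by zero here.
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up data
scaled = min_max_scale(heights)
z_scores = standardize(heights)
print(scaled)
print(z_scores)
```

Min-Max scaling bounds the output to [0, 1], while standardization leaves the range unbounded but fixes the mean and spread.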

12 Why is One-Hot Encoding preferred over Label Encoding for nominal categorical variables?

A. It requires less memory.
B. It handles missing values automatically.
C. It prevents the model from assuming a mathematical order or rank between categories.
D. It is faster to compute.

13 What is the Dummy Variable Trap?

A. When independent variables are highly correlated (multicollinearity) due to including all dummy variables.
B. When categorical variables are not encoded.
C. When missing values are replaced by zeros.
D. When the target variable is imbalanced.
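
Reference note (questions 12 and 13): a small sketch of One-Hot Encoding written in plain Python, with an optional `drop_first` flag modeled on the idea of dropping one dummy column. The `one_hot` helper is hypothetical and only illustrates the concept; it is not a library API.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) with one binary column per category."""
    categories = sorted(set(values))
    if drop_first:
        # Dropping one level removes the redundancy: the dropped category
        # is implied when every remaining column is 0.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["Red", "Blue", "Green", "Blue"]  # made-up nominal feature
full_cols, full_rows = one_hot(colors)
reduced_cols, reduced_rows = one_hot(colors, drop_first=True)
print(full_cols, full_rows)
print(reduced_cols, reduced_rows)
```

With all dummy columns kept, every row sums to 1, so any one column can be predicted exactly from the others. That redundancy is the source of the multicollinearity.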

14 Which technique is commonly used to handle Class Imbalance by generating synthetic samples for the minority class?

A. SMOTE (Synthetic Minority Over-sampling Technique)
B. Random Undersampling
C. Stratified K-Fold
D. Principal Component Analysis

15 What is the primary goal of Feature Selection?

A. To create new features from existing ones.
B. To scale features to the same range.
C. To select a subset of relevant features to improve model performance and reduce complexity.
D. To fill missing values in the features.

16 Which of the following is an example of a Wrapper Method for feature selection?

A. Correlation Matrix
B. Recursive Feature Elimination (RFE)
C. Lasso Regression (L1 regularization)
D. Variance Threshold

17 What is the purpose of train_test_split in machine learning?

A. To remove outliers from the data.
B. To split the dataset into training and validation/test sets to evaluate generalization.
C. To separate numerical and categorical columns.
D. To split the dataset into features (X) and target (y).

18 What is Data Leakage?

A. When the model leaks sensitive user information.
B. When information from outside the training dataset (like the test set) is used to create the model.
C. When data is lost during file transfer.
D. When the variance of the data is too high.

19 Which plot is best for visualizing the relationship between two continuous variables?

A. Bar Chart
B. Scatter Plot
C. Box Plot
D. Histogram

20 In the context of Pandas, what does df.isnull().sum() return?

A. The total number of rows in the dataframe.
B. The count of unique values in each column.
C. The count of missing values in each column.
D. The sum of all values in the dataframe.

21 When performing a train-test split on an imbalanced dataset, which parameter ensures the class distribution is preserved in both sets?

A. random_state=42
B. stratify=y
C. test_size=0.2
D. shuffle=True

22 Which of the following is a technique for Dimensionality Reduction?

A. Linear Regression
B. Logistic Regression
C. K-Nearest Neighbors
D. Principal Component Analysis (PCA)

23 The Curse of Dimensionality refers to:

A. The difficulty of visualizing 3D data.
B. The inability to add more features to a model.
C. Issues that arise when analyzing data in high-dimensional spaces (sparse data, increased computation).
D. The error caused by using incorrect units of measurement.

24 What is Feature Engineering?

A. Downloading datasets from the internet.
B. The process of using domain knowledge to extract or create new features from raw data.
C. Selecting the best hardware for training.
D. Removing all categorical variables.

25 Which Scikit-learn module contains StandardScaler and MinMaxScaler?

A. sklearn.preprocessing
B. sklearn.metrics
C. sklearn.linear_model
D. sklearn.ensemble

26 If a feature has a Variance of 0, what does it imply?

A. The feature has missing values.
B. The feature contains only one unique value for all samples.
C. The feature has a high correlation with the target.
D. The feature is normally distributed.

27 Which of the following is considered Unstructured Data?

A. A CSV file with labeled columns.
B. An Excel spreadsheet.
C. A SQL database table.
D. Images and Audio files.

28 What does a correlation coefficient of -0.9 indicate between two features?

A. A strong negative linear relationship.
B. A strong positive linear relationship.
C. No relationship.
D. A weak negative linear relationship.

29 When using LabelEncoder, how is the data transformed?

A. It converts text labels into binary columns.
B. It converts text labels into integers (0, 1, 2, ...).
C. It removes the column.
D. It scales the data between 0 and 1.

30 Which algorithm is generally NOT sensitive to the scale of features?

A. K-Nearest Neighbors (KNN)
B. K-Means Clustering
C. Decision Trees
D. Support Vector Machines (SVM)

31 In Scikit-Learn, what is the role of the fit() method?

A. To learn parameters (e.g., mean, coefficients) from the training data.
B. To calculate the accuracy of the model.
C. To make predictions on new data.
D. To split the data.

32 What is the difference between fit_transform() and transform()?

A. fit_transform is used on the test set; transform is used on the training set.
B. transform is only used for image data.
C. They are identical and can be used interchangeably.
D. fit_transform is used on the training set to learn parameters and apply them; transform is used on the test set using learned parameters.
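
Reference note (questions 31 and 32): a minimal sketch of the fit / transform convention using a hypothetical `SimpleStandardScaler` class. It only mimics the method names used by Scikit-learn transformers and is not the library's implementation. The data values are made up.

```python
import statistics


class SimpleStandardScaler:
    """Hypothetical stand-in that mirrors the fit / transform convention."""

    def fit(self, values):
        # Learn the parameters from the data passed in (the training set).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the already-learned parameters; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [10.0, 20.0, 30.0]
test = [20.0, 40.0]

scaler = SimpleStandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the training mean/std
print(train_scaled, test_scaled)
```

Calling `fit_transform` on the test set instead would re-estimate the mean and standard deviation from test data, which is one common route to data leakage.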

33 Which Seaborn plot is used to visualize the Distribution of a single numerical variable?

A. sns.heatmap()
B. sns.histplot() (or the deprecated sns.distplot())
C. sns.countplot()
D. sns.scatterplot()

34 How do you handle Duplicate Rows in Pandas?

A. df.drop_duplicates()
B. df.unique()
C. df.delete_repeats()
D. df.remove_copies()

35 In PCA, what represents the direction of maximum variance in the data?

A. The Principal Components (Eigenvectors)
B. The Eigenvalues
C. The Mean vector
D. The Covariance matrix

36 What is Target Encoding (or Mean Encoding)?

A. Replacing the target with the mean of the features.
B. Encoding categorical variables based on the mean of the target variable for that category.
C. Encoding the target variable into a One-Hot vector.
D. Assigning random numbers to the target.
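
Reference note (question 36): a plain-Python sketch of mean-based Target Encoding. The `target_encode` helper and the data are hypothetical. In practice the category means would be computed on the training data only, so that test labels do not leak into the features.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Return (per-category target means, encoded column)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    # Each category value is replaced by the mean target seen for it.
    return means, [means[c] for c in categories]


city = ["A", "A", "B", "B", "B", "B"]  # made-up categorical feature
bought = [1, 0, 1, 1, 1, 0]            # made-up binary target
means, encoded = target_encode(city, bought)
print(means, encoded)
```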

37 Which of the following indicates a skewed distribution?

A. The tail of the distribution is longer on one side than the other.
B. The standard deviation is 0.
C. The distribution is symmetrical.
D. Mean = Median = Mode

38 What is the result of executing df.info()?

A. A correlation heatmap.
B. A summary of statistical metrics.
C. The first 5 rows of the DataFrame.
D. A concise summary of the DataFrame including index dtype, columns, non-null values, and memory usage.

39 Before feeding text data into a supervised learning model, it must be converted into numerical vectors. This process is called:

A. Vectorization (e.g., TF-IDF, Bag of Words)
B. Classification
C. Normalization
D. Imputation

40 Which method helps in identifying Multicollinearity among features?

A. Scatter plot of Feature vs Target
B. Heatmap of the Correlation Matrix
C. Confusion Matrix
D. ROC Curve

41 If a dataset has missing values that are MCAR (Missing Completely At Random), which handling method is generally safe if the dataset is large?

A. Using a complex prediction model.
B. Dropping the rows with missing values.
C. Leaving them as NaN.
D. Replacing with a constant like -1.

42 What is the advantage of using a Pipeline in Scikit-Learn?

A. It allows for parallel processing on GPUs.
B. It chains together multiple processing steps (scaling, encoding, modeling) into a single object, preventing data leakage.
C. It automatically selects the best algorithm.
D. It creates a graphical user interface.

43 Which feature selection method uses a model's coef_ or feature_importances_ attribute to select features?

A. Unsupervised Method
B. Wrapper Method
C. Embedded Method
D. Filter Method

44 What does df.shape return in Pandas?

A. (Number of Unique Values,)
B. (Number of Columns, Number of Rows)
C. (Total Elements,)
D. (Number of Rows, Number of Columns)

45 Which of the following is a Classification algorithm?

A. Ridge Regression
B. Polynomial Regression
C. Linear Regression
D. Logistic Regression

46 When detecting outliers using the Z-score method, a common threshold to identify an outlier is a Z-score absolute value greater than:

A. 1
B. 3
C. 10
D. 1.5
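
Reference note (question 46): a short sketch of Z-score outlier detection in plain Python, assuming the population standard deviation and a configurable threshold. The `z_score_outliers` helper and the readings are hypothetical.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs((v - mean) / std) > threshold]


readings = [10.0] * 20 + [100.0]  # made-up data with one extreme value
outliers = z_score_outliers(readings)
print(outliers)
```

On very small samples a single extreme value inflates the standard deviation enough that no Z-score can exceed the threshold, so this method works best with a reasonable number of observations.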

47 What is the correct syntax to drop a column named 'ID' from a Pandas DataFrame df?

A. df.drop('ID', axis=0)
B. df.remove('ID')
C. df.delete('ID')
D. df.drop('ID', axis=1)

48 Why is Data Exploration (EDA) a critical first step?

A. It increases the size of the dataset.
B. It helps to understand data structure, detect anomalies, test assumptions, and determine preprocessing needs.
C. It automatically trains the model.
D. It is required by the Python interpreter.

49 Which encoding technique creates a binary column for every category level?

A. Target Encoding
B. One-Hot Encoding
C. Ordinal Encoding
D. Label Encoding

50 What is the main drawback of PCA?

A. It is computationally very expensive for small datasets.
B. The resulting Principal Components are often difficult to interpret in terms of original features.
C. It increases the dimensionality of the data.
D. It only works on categorical data.