Unit 1 - Practice Quiz

INT395 50 Questions

1 What is the primary characteristic of Supervised Learning?

A. The model interacts with an environment and learns via a reward system.
B. The model groups data points based on inherent similarities without predefined categories.
C. The model learns from unlabeled data to find hidden patterns.
D. The model learns from a labeled dataset containing input-output pairs.

2 Which of the following scenarios is a Regression problem?

A. Grouping customers based on purchasing behavior.
B. Predicting the price of a house based on its square footage.
C. Classifying an image as a cat or a dog.
D. Predicting whether an email is spam or not.

3 Which library in Python is the standard for implementing classic machine learning algorithms like Decision Trees and SVMs?

A. Scikit-learn
B. Pandas
C. Matplotlib
D. NumPy

4 In a dataset, Ordinal Data refers to:

A. Categorical data with a clear ordering or ranking (e.g., Low, Medium, High).
B. Categorical data with no intrinsic order (e.g., Red, Blue, Green).
C. Binary data (e.g., True/False).
D. Continuous numerical data (e.g., Height, Weight).

5 Which Pandas function is primarily used to load data from a Comma-Separated Values (CSV) file?

A. pd.load_csv()
B. pd.read_excel()
C. pd.import_data()
D. pd.read_csv()

6 What is the purpose of the df.describe() method in Pandas?

A. To visualize the correlation matrix.
B. To provide summary statistics (mean, std, min, max) for numerical columns.
C. To drop missing values from the dataframe.
D. To show the data types and non-null counts of columns.

7 When handling missing data, what is Imputation?

A. Ignoring the column containing missing values.
B. Removing the rows containing missing values.
C. Converting the missing values to a specific category like "Unknown".
D. Replacing missing values with substituted values (e.g., mean, median, mode).

8 Which visualization is most effective for identifying Outliers in a numerical feature?

A. Scatter Plot
B. Bar Chart
C. Pie Chart
D. Box Plot

9 In the context of outlier detection, what does the IQR (Interquartile Range) represent?

A. The standard deviation of the dataset.
B. The difference between the maximum and minimum values.
C. The distance between the mean and the median.
D. The difference between the 75th percentile (Q3) and the 25th percentile (Q1).
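
Study note: a minimal sketch of an IQR-based outlier check with NumPy, where Q1 and Q3 denote the 25th and 75th percentiles. The sample values are made up for illustration, and the 1.5 multiplier for the fences is a common convention assumed here, not something stated in the quiz.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
values = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

# Q1 and Q3 are the 25th and 75th percentiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("IQR:", iqr)
print("Outliers:", outliers)
```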

10 What is the formula for Min-Max Scaling (Normalization)?

A.
B.
C.
D.
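
Study note: for reference, Min-Max scaling is conventionally written as below. The symbols $x_{\min}$ and $x_{\max}$ are assumed here to mean the minimum and maximum of the feature in the training data; the result lies in the range [0, 1].

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```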

11 Which scaling technique transforms data to have a mean of 0 and a standard deviation of 1?

A. Log Transformation
B. Min-Max Scaling
C. Robust Scaling
D. Standardization (Z-score normalization)

12 Why is One-Hot Encoding preferred over Label Encoding for nominal categorical variables?

A. It prevents the model from assuming a mathematical order or rank between categories.
B. It is faster to compute.
C. It handles missing values automatically.
D. It requires less memory.

13 What is the Dummy Variable Trap?

A. When categorical variables are not encoded.
B. When the target variable is imbalanced.
C. When independent variables are highly correlated (multicollinearity) due to including all dummy variables.
D. When missing values are replaced by zeros.
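
Study note: a small sketch of how full one-hot encoding produces redundant columns, and how dropping one level removes the redundancy. The `color` column and its values are hypothetical; `drop_first=True` is one common way to do this in Pandas, not the only one.

```python
import pandas as pd

# Hypothetical nominal feature with three levels.
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One column per level: every row sums to 1, so any one column
# is fully determined by the others (perfect multicollinearity).
full = pd.get_dummies(df["color"])

# Dropping the first level keeps the same information with one fewer column.
reduced = pd.get_dummies(df["color"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```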

14 Which technique is commonly used to handle Class Imbalance by generating synthetic samples for the minority class?

A. SMOTE (Synthetic Minority Over-sampling Technique)
B. Principal Component Analysis
C. Stratified K-Fold
D. Random Undersampling

15 What is the primary goal of Feature Selection?

A. To scale features to the same range.
B. To fill missing values in the features.
C. To create new features from existing ones.
D. To select a subset of relevant features to improve model performance and reduce complexity.

16 Which of the following is an example of a Wrapper Method for feature selection?

A. Variance Threshold
B. Correlation Matrix
C. Lasso Regression (L1 regularization)
D. Recursive Feature Elimination (RFE)

17 What is the purpose of train_test_split in machine learning?

A. To split the dataset into training and validation/test sets to evaluate generalization.
B. To split the dataset into features (X) and target (y).
C. To remove outliers from the data.
D. To separate numerical and categorical columns.

18 What is Data Leakage?

A. When data is lost during file transfer.
B. When the model leaks sensitive user information.
C. When information from outside the training dataset (like the test set) is used to create the model.
D. When the variance of the data is too high.

19 Which plot is best for visualizing the relationship between two continuous variables?

A. Scatter Plot
B. Bar Chart
C. Box Plot
D. Histogram

20 In the context of Pandas, what does df.isnull().sum() return?

A. The sum of all values in the dataframe.
B. The total number of rows in the dataframe.
C. The count of missing values in each column.
D. The count of unique values in each column.

21 When performing a train-test split on an imbalanced dataset, which parameter ensures the class distribution is preserved in both sets?

A. test_size=0.2
B. random_state=42
C. shuffle=True
D. stratify=y

22 Which of the following is a technique for Dimensionality Reduction?

A. Linear Regression
B. K-Nearest Neighbors
C. Principal Component Analysis (PCA)
D. Logistic Regression

23 The Curse of Dimensionality refers to:

A. The difficulty of visualizing 3D data.
B. Issues that arise when analyzing data in high-dimensional spaces (sparse data, increased computation).
C. The inability to add more features to a model.
D. The error caused by using incorrect units of measurement.

24 What is Feature Engineering?

A. The process of using domain knowledge to extract or create new features from raw data.
B. Selecting the best hardware for training.
C. Removing all categorical variables.
D. Downloading datasets from the internet.

25 Which Scikit-learn module contains StandardScaler and MinMaxScaler?

A. sklearn.ensemble
B. sklearn.metrics
C. sklearn.preprocessing
D. sklearn.linear_model

26 If a feature has a Variance of 0, what does it imply?

A. The feature is normally distributed.
B. The feature has a high correlation with the target.
C. The feature contains only one unique value for all samples.
D. The feature has missing values.

27 Which of the following is considered Unstructured Data?

A. A CSV file with labeled columns.
B. A SQL database table.
C. An Excel spreadsheet.
D. Images and Audio files.

28 What does a correlation coefficient of -0.9 indicate between two features?

A. A strong negative linear relationship.
B. A weak negative linear relationship.
C. A strong positive linear relationship.
D. No relationship.

29 When using LabelEncoder, how is the data transformed?

A. It converts text labels into binary columns.
B. It scales the data between 0 and 1.
C. It removes the column.
D. It converts text labels into integers (0, 1, 2, ...).

30 Which algorithm is generally NOT sensitive to the scale of features?

A. K-Nearest Neighbors (KNN)
B. Decision Trees
C. Support Vector Machines (SVM)
D. K-Means Clustering

31 In Scikit-Learn, what is the role of the fit() method?

A. To calculate the accuracy of the model.
B. To make predictions on new data.
C. To split the data.
D. To learn parameters (e.g., mean, coefficients) from the training data.

32 What is the difference between fit_transform() and transform()?

A. fit_transform is used on the training set to learn parameters and apply them; transform is used on the test set using learned parameters.
B. transform is only used for image data.
C. fit_transform is used on the test set; transform is used on the training set.
D. They are identical and can be used interchangeably.
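
Study note: a minimal sketch of the two calls using Scikit-learn's `StandardScaler`. The tiny arrays are made up for illustration; the point is only that the statistics are learned once, from the training data, and then reused.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform: learn the mean and standard deviation from the
# training data, then scale the training data with them.
X_train_scaled = scaler.fit_transform(X_train)

# transform: scale the test data with the statistics already learned;
# nothing is re-estimated from the test set.
X_test_scaled = scaler.transform(X_test)

print("Learned mean:", scaler.mean_[0])
print("Scaled test value:", X_test_scaled[0, 0])
```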

33 Which Seaborn plot is used to visualize the Distribution of a single numerical variable?

A. sns.histplot() (or the deprecated distplot)
B. sns.countplot()
C. sns.heatmap()
D. sns.scatterplot()

34 How do you handle Duplicate Rows in Pandas?

A. df.remove_copies()
B. df.drop_duplicates()
C. df.delete_repeats()
D. df.unique()

35 In PCA, what represents the direction of maximum variance in the data?

A. The Eigenvalues
B. The Mean vector
C. The Principal Components (Eigenvectors)
D. The Covariance matrix

36 What is Target Encoding (or Mean Encoding)?

A. Assigning random numbers to the target.
B. Encoding categorical variables based on the mean of the target variable for that category.
C. Replacing the target with the mean of the features.
D. Encoding the target variable into a One-Hot vector.

37 Which of the following indicates a skewed distribution?

A. The standard deviation is 0.
B. Mean = Median = Mode
C. The tail of the distribution is longer on one side than the other.
D. The distribution is symmetrical.

38 What is the result of executing df.info()?

A. A summary of statistical metrics.
B. A concise summary of the DataFrame including index dtype, columns, non-null values, and memory usage.
C. A correlation heatmap.
D. The first 5 rows of the DataFrame.

39 Before feeding text data into a supervised learning model, it must be converted into numerical vectors. This process is called:

A. Classification
B. Vectorization (e.g., TF-IDF, Bag of Words)
C. Normalization
D. Imputation

40 Which method helps in identifying Multicollinearity among features?

A. ROC Curve
B. Scatter plot of Feature vs Target
C. Heatmap of the Correlation Matrix
D. Confusion Matrix

41 If a dataset has missing values that are MCAR (Missing Completely At Random), which handling method is generally safe if the dataset is large?

A. Leaving them as NaN.
B. Replacing with a constant like -1.
C. Using a complex prediction model.
D. Dropping the rows with missing values.

42 What is the advantage of using a Pipeline in Scikit-Learn?

A. It allows for parallel processing on GPUs.
B. It automatically selects the best algorithm.
C. It chains together multiple processing steps (scaling, encoding, modeling) into a single object, preventing data leakage.
D. It creates a graphical user interface.
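
Study note: a hedged sketch of a Scikit-learn `Pipeline` that chains scaling and a model. The synthetic dataset, the step names `"scale"` and `"model"`, and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for this example.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler and the model are fitted together as one object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() learns the scaling statistics from the training split only,
# so no information from the test split reaches the scaler.
pipe.fit(X_train, y_train)

# score() applies the learned scaling to the test split, then predicts.
accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", accuracy)
```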

43 Which feature selection method uses a model's coef_ or feature_importances_ attribute to select features?

A. Unsupervised Method
B. Filter Method
C. Wrapper Method
D. Embedded Method

44 What does df.shape return in Pandas?

A. (Number of Columns, Number of Rows)
B. (Number of Rows, Number of Columns)
C. (Total Elements,)
D. (Number of Unique Values,)

45 Which of the following is a Classification algorithm?

A. Logistic Regression
B. Linear Regression
C. Ridge Regression
D. Polynomial Regression

46 When detecting outliers using the Z-score method, a common threshold to identify an outlier is a Z-score absolute value greater than:

A. 1.5
B. 10
C. 1
D. 3

47 What is the correct syntax to drop a column named 'ID' from a Pandas DataFrame df?

A. df.delete('ID')
B. df.drop('ID', axis=0)
C. df.drop('ID', axis=1)
D. df.remove('ID')

48 Why is Data Exploration (EDA) a critical first step?

A. To understand data structure, detect anomalies, test assumptions, and determine preprocessing needs.
B. It increases the size of the dataset.
C. It is required by the Python interpreter.
D. It automatically trains the model.

49 Which encoding technique creates a binary column for every category level?

A. Ordinal Encoding
B. Label Encoding
C. One-Hot Encoding
D. Target Encoding

50 What is the main drawback of PCA?

A. It only works on categorical data.
B. The resulting Principal Components are often difficult to interpret in terms of original features.
C. It is computationally very expensive for small datasets.
D. It increases the dimensionality of the data.