1. What is the primary characteristic of Supervised Learning?
A.The model interacts with an environment and learns via a reward system.
B.The model groups data points based on inherent similarities without predefined categories.
C.The model learns from unlabeled data to find hidden patterns.
D.The model learns from a labeled dataset containing input-output pairs.
Correct Answer: The model learns from a labeled dataset containing input-output pairs.
Explanation:
In supervised learning, the algorithm is trained on a labeled dataset, meaning the data includes both the input features and the corresponding correct output (target).
2. Which of the following scenarios is a Regression problem?
A.Grouping customers based on purchasing behavior.
B.Predicting the price of a house based on its square footage.
C.Classifying an image as a cat or a dog.
D.Predicting whether an email is spam or not.
Correct Answer: Predicting the price of a house based on its square footage.
Explanation:
Regression deals with predicting continuous numerical values (like price), whereas the other options represent classification (categorical output) or clustering.
3. Which library in Python is the standard for implementing classic machine learning algorithms like Decision Trees and SVMs?
A.Scikit-learn
B.Pandas
C.Matplotlib
D.NumPy
Correct Answer: Scikit-learn
Explanation:
Scikit-learn (sklearn) is the most widely used Python library for implementing standard machine learning algorithms, preprocessing, and model evaluation.
4. In a dataset, Ordinal Data refers to:
A.Categorical data with a clear ordering or ranking (e.g., Low, Medium, High).
B.Categorical data with no intrinsic order (e.g., Red, Blue, Green).
C.Binary data (e.g., True/False).
D.Continuous numerical data (e.g., Height, Weight).
Correct Answer: Categorical data with a clear ordering or ranking (e.g., Low, Medium, High).
Explanation:
Ordinal data is a type of categorical data where the values have a meaningful order or rank, but the intervals between the ranks may not be equal.
5. Which Pandas function is primarily used to load data from a Comma Separated Values file?
A.pd.load_csv()
B.pd.read_excel()
C.pd.import_data()
D.pd.read_csv()
Correct Answer: pd.read_csv()
Explanation:
The function pd.read_csv() is the standard Pandas method for loading data from CSV files into a DataFrame.
6. What is the purpose of the df.describe() method in Pandas?
A.To visualize the correlation matrix.
B.To provide summary statistics (mean, std, min, max) for numerical columns.
C.To drop missing values from the dataframe.
D.To show the data types and non-null counts of columns.
Correct Answer: To provide summary statistics (mean, std, min, max) for numerical columns.
Explanation:
df.describe() generates descriptive statistics including those that summarize the central tendency, dispersion, and shape of a dataset’s distribution.
7. When handling missing data, what is Imputation?
A.Ignoring the column containing missing values.
B.Removing the rows containing missing values.
C.Converting the missing values to a specific category like "Unknown".
D.Replacing missing values with substituted values (e.g., mean, median, mode).
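Correct Answer: Replacing missing values with substituted values (e.g., mean, median, mode).
Explanation:
Imputation is the process of replacing missing data with substituted values so that the data point is retained for analysis.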
8. Which visualization is most effective for identifying Outliers in a numerical feature?
A.Scatter Plot
B.Bar Chart
C.Pie Chart
D.Box Plot
Correct Answer: Box Plot
Explanation:
A Box Plot visually depicts groups of numerical data through their quartiles and explicitly indicates outliers as individual points beyond the 'whiskers'.
9. In the context of outlier detection, what does the IQR (Interquartile Range) represent?
A.The standard deviation of the dataset.
B.The difference between the maximum and minimum values.
C.The distance between the mean and the median.
D.The difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$).
Correct Answer: The difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$).
Explanation:
$IQR = Q_3 - Q_1$. It measures the statistical spread of the middle 50% of the data and is used to define the bounds for outliers.
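As a rough illustration of how the IQR can define outlier bounds, the sketch below uses only the Python standard library. The 1.5 multiplier is a widely used convention rather than something stated in this quiz, and the function name `iqr_outliers` and the sample data are made up for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (q1, q3, iqr, outliers) using the k * IQR fence rule."""
    # method="inclusive" interpolates linearly between neighbouring points,
    # treating the data as the whole population.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, iqr, outliers


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
q1, q3, iqr, outliers = iqr_outliers(data)
print(q1, q3, iqr, outliers)
```

Different quantile conventions give slightly different quartiles on small samples, so the exact bounds can vary between tools.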
10. What is the formula for Min-Max Scaling (Normalization)?
A.
B.
C.
D.
Correct Answer: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
Explanation:
Min-Max scaling transforms features by scaling each feature to a given range, usually [0, 1], using the minimum and maximum values of that feature.
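A minimal sketch of the idea, assuming the usual definition x' = (x - min) / (max - min) and a target range that defaults to [0, 1]. The helper name `min_max_scale` and its parameters are hypothetical, and mapping a constant feature to the lower bound is a choice made here to avoid division by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: every value maps to the lower bound.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


scaled = min_max_scale([10, 20, 30, 40, 50])
print(scaled)
```

Because the minimum and maximum drive the result, a single extreme value can compress all other values into a narrow band.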
11. Which scaling technique transforms data to have a mean of 0 and a standard deviation of 1?
Correct Answer: Standardization
Explanation:
Standardization (using StandardScaler in sklearn) centers the distribution around 0 and scales it to unit variance.
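The arithmetic behind standardization can be sketched with the standard library alone. This is an illustration, not sklearn's implementation; the function name `standardize` is hypothetical, and the population standard deviation (dividing by n) is used because that is the convention StandardScaler follows.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (divides by n)
    if std == 0:
        # Constant feature: centering alone gives all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```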
12. Why is One-Hot Encoding preferred over Label Encoding for nominal categorical variables?
A.It prevents the model from assuming a mathematical order or rank between categories.
B.It is faster to compute.
C.It handles missing values automatically.
D.It requires less memory.
Correct Answer: It prevents the model from assuming a mathematical order or rank between categories.
Explanation:
Label encoding assigns integers (0, 1, 2...) to categories, which some algorithms might misinterpret as an ordinal relationship (2 > 1). One-Hot encoding avoids this.
13. What is the Dummy Variable Trap?
A.When categorical variables are not encoded.
B.When the target variable is imbalanced.
C.When independent variables are highly correlated (multicollinearity) due to including all dummy variables.
D.When missing values are replaced by zeros.
Correct Answer: When independent variables are highly correlated (multicollinearity) due to including all dummy variables.
Explanation:
The Dummy Variable Trap occurs when one variable can be predicted from the others (e.g., Female = 1 - Male). This multicollinearity can break some models like linear regression. It is solved by dropping one dummy column.
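To make the redundancy concrete, here is a small pure-Python sketch of one-hot encoding with an optional "drop the first category" switch. The helper `one_hot` and its `drop_first` flag are written for this example (the flag name mirrors a common library option but nothing here calls a library).

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) of 0/1 indicators for each category."""
    categories = sorted(set(values))
    if drop_first:
        # Dropping one column removes the exact linear dependency.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
full_cols, full_rows = one_hot(colors)
reduced_cols, reduced_rows = one_hot(colors, drop_first=True)

# With all columns kept, every row sums to 1, so any one column can be
# computed from the others.
print(full_cols, full_rows)
print(reduced_cols, reduced_rows)
```

In the reduced encoding, the dropped category ("blue" here) is represented by a row of all zeros.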
14. Which technique is commonly used to handle Class Imbalance by generating synthetic samples for the minority class?
Wrapper methods, like RFE, select features by recursively training the model on subsets of features and evaluating performance.
17. What is the purpose of train_test_split in machine learning?
A.To split the dataset into training and validation/test sets to evaluate generalization.
B.To split the dataset into features (X) and target (y).
C.To remove outliers from the data.
D.To separate numerical and categorical columns.
Correct Answer: To split the dataset into training and validation/test sets to evaluate generalization.
Explanation:
Splitting the data ensures the model is evaluated on examples it has not seen, so that performance is not estimated on data the model may simply have memorized during training.
18. What is Data Leakage?
A.When data is lost during file transfer.
B.When the model leaks sensitive user information.
C.When information from outside the training dataset (like the test set) is used to create the model.
D.When the variance of the data is too high.
Correct Answer: When information from outside the training dataset (like the test set) is used to create the model.
Explanation:
Data leakage occurs when information about the target or the test set reaches the model during training (e.g., fitting a scaler on the full dataset before splitting), leading to overly optimistic performance estimates.
19. Which plot is best for visualizing the relationship between two continuous variables?
A.Scatter Plot
B.Bar Chart
C.Box Plot
D.Histogram
Correct Answer: Scatter Plot
Explanation:
Scatter plots map individual data points on an X-Y plane, making them ideal for observing correlations between two continuous variables.
20. In the context of Pandas, what does df.isnull().sum() return?
A.The sum of all values in the dataframe.
B.The total number of rows in the dataframe.
C.The count of missing values in each column.
D.The count of unique values in each column.
Correct Answer: The count of missing values in each column.
Explanation:
isnull() returns a boolean mask, and sum() counts the True values (which represent missing data) per column.
21. When performing a train-test split on an imbalanced dataset, which parameter ensures the class distribution is preserved in both sets?
A.test_size=0.2
B.random_state=42
C.shuffle=True
D.stratify=y
Correct Answer: stratify=y
Explanation:
Passing stratify=y makes the training and test sets keep approximately the same class proportions as the target array y.
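The idea behind stratification can be sketched without any library: split each class separately, then combine. This is an illustration only and not how sklearn implements it; `stratified_split`, its parameters, and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_size=0.2, seed=42):
    """Split X and y so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(seq, idxs):
        return [seq[i] for i in idxs]

    return (pick(X, train_idx), pick(X, test_idx),
            pick(y, train_idx), pick(y, test_idx))


# Imbalanced toy data: 10% positives.
X = list(range(100))
y = [1] * 10 + [0] * 90
X_train, X_test, y_train, y_test = stratified_split(X, y)
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```

Both parts end up with 10% positives, whereas a plain random split of a small, imbalanced dataset could easily leave the test set with too few minority samples.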
22. Which of the following is a technique for Dimensionality Reduction?
A.Linear Regression
B.K-Nearest Neighbors
C.Principal Component Analysis (PCA)
D.Logistic Regression
Correct Answer: Principal Component Analysis (PCA)
Explanation:
PCA reduces the dimensionality of a dataset by creating new, uncorrelated variables (principal components) that retain as much of the original variance as possible, so that information loss is minimized.
23. The Curse of Dimensionality refers to:
A.The difficulty of visualizing 3D data.
B.Issues that arise when analyzing data in high-dimensional spaces (sparse data, increased computation).
C.The inability to add more features to a model.
D.The error caused by using incorrect units of measurement.
Correct Answer: Issues that arise when analyzing data in high-dimensional spaces (sparse data, increased computation).
Explanation:
As the number of features increases, the volume of the space increases exponentially, making the data sparse and distance metrics less meaningful.
24. What is Feature Engineering?
A.The process of using domain knowledge to extract or create new features from raw data.
B.Selecting the best hardware for training.
C.Removing all categorical variables.
D.Downloading datasets from the internet.
Correct Answer: The process of using domain knowledge to extract or create new features from raw data.
Explanation:
Feature engineering involves creating new input features from the existing ones (e.g., extracting 'Year' from a 'Date' column) to improve model performance.
25. Which Scikit-learn module contains StandardScaler and MinMaxScaler?
A.sklearn.ensemble
B.sklearn.metrics
C.sklearn.preprocessing
D.sklearn.linear_model
Correct Answer: sklearn.preprocessing
Explanation:
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation suitable for downstream estimators.
26. If a feature has a Variance of 0, what does it imply?
A.The feature is normally distributed.
B.The feature has a high correlation with the target.
C.The feature contains only one unique value for all samples.
D.The feature has missing values.
Correct Answer: The feature contains only one unique value for all samples.
Explanation:
Variance measures spread. If the variance is 0, the values do not spread at all, meaning all values are identical. Such a feature cannot help distinguish one sample from another and can usually be removed.
27. Which of the following is considered Unstructured Data?
A.A CSV file with labeled columns.
B.A SQL database table.
C.An Excel spreadsheet.
D.Images and Audio files.
Correct Answer: Images and Audio files.
Explanation:
Unstructured data has no predefined data model and is not organized in a predefined manner (such as tables). Examples include text, images, and audio.
28. What does a correlation coefficient of -0.9 indicate between two features?
A.A strong negative linear relationship.
B.A weak negative linear relationship.
C.A strong positive linear relationship.
D.No relationship.
Correct Answer: A strong negative linear relationship.
Explanation:
Correlation coefficients range from -1 to 1. A value close to -1 indicates that as one variable increases, the other decreases strongly.
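For reference, the Pearson correlation coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below is a plain-Python illustration; the function name `pearson_r` and the sample values are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance / (std_x * std_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov / norm


# y tends to fall as x rises, so r should be strongly negative.
r = pearson_r([1, 2, 3, 4, 5], [9, 8, 5, 5, 1])
print(round(r, 3))
```

Note that this measures linear association only; two variables can be strongly related in a non-linear way and still have a coefficient near 0.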
29. When using LabelEncoder, how is the data transformed?
A.It converts text labels into binary columns.
B.It scales the data between 0 and 1.
C.It removes the column.
D.It converts text labels into integers (0, 1, 2, ...).
Correct Answer: It converts text labels into integers (0, 1, 2, ...).
Explanation:
LabelEncoder replaces unique categories with integer codes.
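A pure-Python sketch of the same mapping is shown below. It assumes the codes follow the sorted order of the unique labels, which matches how sklearn's LabelEncoder orders its classes; the function `label_encode` itself is a hypothetical helper, not the library API.

```python
def label_encode(values):
    """Map each distinct label to an integer based on sorted order."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = label_encode(["low", "high", "medium", "low"])
print(classes)  # alphabetical order, not the semantic order
print(codes)
```

Here "high" receives 0 and "medium" receives 2, so the integer codes do not reflect the natural Low < Medium < High ranking. For ordinal features an explicit mapping is usually the safer choice.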
30. Which algorithm is generally NOT sensitive to the scale of features?
A.K-Nearest Neighbors (KNN)
B.Decision Trees
C.Support Vector Machines (SVM)
D.K-Means Clustering
Correct Answer: Decision Trees
Explanation:
Decision Trees and Random Forests split nodes on thresholds of single features, so rescaling a feature does not change the structure of the tree. Models that rely on distances or dot products between samples (KNN, SVM, K-Means) are highly sensitive to feature scale.
31. In Scikit-Learn, what is the role of the fit() method?
A.To calculate the accuracy of the model.
B.To make predictions on new data.
C.To split the data.
D.To learn parameters (e.g., mean, coefficients) from the training data.
Correct Answer: To learn parameters (e.g., mean, coefficients) from the training data.
Explanation:
fit() triggers the training process where the algorithm learns the internal parameters from the provided data.
32. What is the difference between fit_transform() and transform()?
A.fit_transform is used on the training set to learn parameters and apply them; transform is used on the test set using learned parameters.
B.transform is only used for image data.
C.fit_transform is used on the test set; transform is used on the training set.
D.They are identical and can be used interchangeably.
Correct Answer: fit_transform is used on the training set to learn parameters and apply them; transform is used on the test set using learned parameters.
Explanation:
We use fit_transform on training data to calculate means/SDs and scale the data. We use transform on test data to scale it using the training means/SDs to prevent data leakage.
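The pattern can be illustrated with a tiny stand-in scaler written in plain Python. `SimpleStandardScaler` is a hypothetical class that only mimics the fit / transform / fit_transform interface; it is not sklearn's StandardScaler, and the data is made up.

```python
import statistics


class SimpleStandardScaler:
    """Hypothetical scaler mimicking the fit/transform interface."""

    def fit(self, values):
        # Learn the parameters from the data passed in (training data).
        self.mean_ = statistics.fmean(values)
        self.scale_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        # Apply the already-learned parameters; nothing is re-estimated.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [10.0, 20.0, 30.0]
test = [40.0]

scaler = SimpleStandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the train mean/std
print(scaler.mean_, train_scaled, test_scaled)
```

Calling fit_transform on the test data instead would recompute the mean and standard deviation from the test set, which is exactly the leakage the explanation warns about.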
33. Which Seaborn plot is used to visualize the Distribution of a single numerical variable?
A.sns.histplot() (or distplot)
B.sns.countplot()
C.sns.heatmap()
D.sns.scatterplot()
Correct Answer: sns.histplot() (or distplot)
Explanation:
Histograms (histplot or the deprecated distplot) are designed to show the frequency distribution of a single continuous variable.
34. How do you handle Duplicate Rows in Pandas?
A.df.remove_copies()
B.df.drop_duplicates()
C.df.delete_repeats()
D.df.unique()
Correct Answer: df.drop_duplicates()
Explanation:
drop_duplicates() is the Pandas method to remove duplicate rows from a DataFrame.
35. In PCA, what represents the direction of maximum variance in the data?
A.The Eigenvalues
B.The Mean vector
C.The Principal Components (Eigenvectors)
D.The Covariance matrix
Correct Answer: The Principal Components (Eigenvectors)
Explanation:
The first Principal Component is the eigenvector associated with the largest eigenvalue, representing the direction of maximum variance.
36. What is Target Encoding (or Mean Encoding)?
A.Assigning random numbers to the target.
B.Encoding categorical variables based on the mean of the target variable for that category.
C.Replacing the target with the mean of the features.
D.Encoding the target variable into a One-Hot vector.
Correct Answer: Encoding categorical variables based on the mean of the target variable for that category.
Explanation:
Target encoding replaces a categorical feature value with the average value of the target variable for that specific category. It is powerful but risks overfitting.
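A bare-bones version of the technique is sketched below in plain Python, without the smoothing or out-of-fold schemes that practical implementations usually add. The function `target_encode` and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target seen for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, [means[cat] for cat in categories]


cities = ["a", "b", "a", "b", "c"]
bought = [1, 0, 0, 0, 1]
means, encoded = target_encode(cities, bought)
print(means)
print(encoded)
```

Category "c" appears only once, so its encoding is simply that row's own target value. This is the overfitting risk mentioned above: rare categories leak the label directly into the feature.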
37. Which of the following indicates a skewed distribution?
A.The standard deviation is 0.
B.Mean = Median = Mode
C.The tail of the distribution is longer on one side than the other.
D.The distribution is symmetrical.
Correct Answer: The tail of the distribution is longer on one side than the other.
Explanation:
Skewness refers to asymmetry in the distribution. A long tail to the right is positive skew; a long tail to the left is negative skew.
38. What is the result of executing df.info()?
A.A summary of statistical metrics.
B.A concise summary of the DataFrame including index dtype, columns, non-null values, and memory usage.
C.A correlation heatmap.
D.The first 5 rows of the DataFrame.
Correct Answer: A concise summary of the DataFrame including index dtype, columns, non-null values, and memory usage.
Explanation:
df.info() is useful during initial exploration for checking column data types and for spotting columns with missing values through their non-null counts.
39. Before feeding text data into a supervised learning model, it must be converted into numerical vectors. This process is called:
A.Classification
B.Vectorization (e.g., TF-IDF, Bag of Words)
C.Normalization
D.Imputation
Correct Answer: Vectorization (e.g., TF-IDF, Bag of Words)
Explanation:
Machine learning models require numerical input. Text vectorization transforms text strings into numerical vectors.
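As a minimal sketch of Bag of Words, the code below counts word occurrences per document using only the standard library. Lowercasing and whitespace tokenization are simplifying assumptions, and `bag_of_words` is a hypothetical helper rather than a library function.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One position per vocabulary word; missing words count as 0.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["the cat sat", "the dog sat on the cat"])
print(vocab)
print(vectors)
```

TF-IDF builds on the same counts but down-weights words that appear in many documents.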
40. Which method helps in identifying Multicollinearity among features?
A.ROC Curve
B.Scatter plot of Feature vs Target
C.Heatmap of the Correlation Matrix
D.Confusion Matrix
Correct Answer: Heatmap of the Correlation Matrix
Explanation:
A correlation heatmap visualizes the correlation coefficients between all pairs of features. High values between two independent features indicate multicollinearity.
41. If a dataset has missing values that are MCAR (Missing Completely At Random), which handling method is generally safe if the dataset is large?
A.Leaving them as NaN.
B.Replacing with a constant like -1.
C.Using a complex prediction model.
D.Dropping the rows with missing values.
Correct Answer: Dropping the rows with missing values.
Explanation:
If data is MCAR, the missingness is unrelated to any observed or unobserved values, so dropping the affected rows does not introduce bias. It only reduces the sample size, which matters less when the dataset is large.
42. What is the advantage of using a Pipeline in Scikit-Learn?
A.It allows for parallel processing on GPUs.
B.It automatically selects the best algorithm.
C.It chains together multiple processing steps (scaling, encoding, modeling) into a single object, preventing data leakage.
D.It creates a graphical user interface.
Correct Answer: It chains together multiple processing steps (scaling, encoding, modeling) into a single object, preventing data leakage.
Explanation:
Pipelines ensure that preprocessing steps (like scaling) are applied correctly during cross-validation (fitting only on train folds), preventing leakage.
43. Which feature selection method uses a model's coef_ or feature_importances_ attribute to select features?
A.Unsupervised Method
B.Filter Method
C.Wrapper Method
D.Embedded Method
Correct Answer: Embedded Method
Explanation:
Embedded methods (like Lasso or Random Forest) perform feature selection during the model training process, assigning weights or importance scores to features.
44. What does df.shape return in Pandas?
A.(Number of Columns, Number of Rows)
B.(Number of Rows, Number of Columns)
C.(Total Elements,)
D.(Number of Unique Values,)
Correct Answer: (Number of Rows, Number of Columns)
Explanation:
df.shape returns a tuple representing the dimensionality of the DataFrame in the format (rows, columns).
45. Which of the following is a Classification algorithm?
A.Logistic Regression
B.Linear Regression
C.Ridge Regression
D.Polynomial Regression
Correct Answer: Logistic Regression
Explanation:
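Despite the name, Logistic Regression is a classification algorithm: it models the probability that a sample belongs to a class (most commonly in binary problems) and assigns a label by thresholding that probability.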
46. When detecting outliers using the Z-score method, a common threshold to identify an outlier is a Z-score absolute value greater than:
A.1.5
B.10
C.1
D.3
Correct Answer: 3
Explanation:
In a normal distribution, about 99.7% of data points lie within 3 standard deviations of the mean. Points beyond that range are typically considered outliers.
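The rule can be sketched in a few lines of standard-library Python. The function name `zscore_outliers`, the default threshold argument, and the sample data are assumptions for illustration; the population standard deviation is used here.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # constant data has no outliers under this rule
    return [v for v in values if abs((v - mean) / std) > threshold]


data = [10] * 10 + [11] * 5 + [9] * 5 + [100]
outliers = zscore_outliers(data)
print(outliers)
```

The method needs a reasonable sample size to work: with only a handful of points, a single extreme value inflates the standard deviation so much that its own z-score cannot reach 3. For example, `zscore_outliers([1, 2, 3, 100])` flags nothing.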
47. What is the correct syntax to drop a column named 'ID' from a Pandas DataFrame df?
A.df.delete('ID')
B.df.drop('ID', axis=0)
C.df.drop('ID', axis=1)
D.df.remove('ID')
Correct Answer: df.drop('ID', axis=1)
Explanation:
axis=1 refers to columns. axis=0 refers to rows.
48. Why is Data Exploration (EDA) a critical first step?
A.To understand data structure, detect anomalies, test assumptions, and determine preprocessing needs.
B.It increases the size of the dataset.
C.It is required by the Python interpreter.
D.It automatically trains the model.
Correct Answer: To understand data structure, detect anomalies, test assumptions, and determine preprocessing needs.
Explanation:
EDA allows the data scientist to understand the nature of the data, relationships between variables, and quality issues before attempting to model.
49. Which encoding technique creates a binary column for every category level?
A.Ordinal Encoding
B.Label Encoding
C.One-Hot Encoding
D.Target Encoding
Correct Answer: One-Hot Encoding
Explanation:
One-Hot Encoding expands a categorical column into multiple binary columns (0 or 1), one for each unique category.
50. What is the main drawback of PCA?
A.It only works on categorical data.
B.The resulting Principal Components are often difficult to interpret in terms of original features.
C.It is computationally very expensive for small datasets.
D.It increases the dimensionality of the data.
Correct Answer: The resulting Principal Components are often difficult to interpret in terms of original features.
Explanation:
PCA transforms original features into linear combinations (Principal Components). While this reduces dimensions, the physical meaning of the original features is lost in the transformation.