1 $Data that represents categories with a meaningful order, such as t-shirt sizes (Small, Medium, Large), is called:$

Types of data Easy

A.

Nominal data

B.

Ratio data

C.

Numerical data

D.

Ordinal data
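
As a small illustration of categories that carry an order, the sketch below encodes t-shirt sizes as an ordered pandas categorical. The sample values and the variable names are made up for this example; only the size labels come from the question.

```python
import pandas as pd

# Declared order of the categories, from smallest to largest
size_order = ["Small", "Medium", "Large"]

# Made-up sample of sizes; ordered=True lets pandas compare and rank them
sizes = pd.Categorical(
    ["Medium", "Small", "Large", "Small"],
    categories=size_order,
    ordered=True,
)

# min() and max() are only defined because the categories are ordered
smallest = sizes.min()
largest = sizes.max()
print(smallest, largest)
```

Without `ordered=True`, the same calls raise a `TypeError`, which is one practical way the difference between ordered and unordered categories shows up in code.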

2 $Which command is used in the Pandas library to load data from a CSV file named data.csv into a DataFrame?$

Loading data using Pandas Easy

A.

pd.load_csv('data.csv')

B.

pd.read_csv('data.csv')

C.

pd.open_file('data.csv')

D.

pd.import_data('data.csv')
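
For reference, a minimal sketch of reading CSV content into a DataFrame. It reads from an in-memory buffer rather than a real `data.csv`, so the column names and rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the file data.csv
csv_text = "name,size\nAnn,Small\nBob,Large\n"

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```

With a file on disk, the buffer would simply be replaced by the path string.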

3 $What is the primary purpose of a histogram in univariate analysis?$

Univariate analysis using Histogram Easy

A.

To count the occurrences of categorical data.

B.

To visualize the frequency distribution of a single numerical variable.

C.

To show the relationship between two different variables.

D.

To display the exact value of each data point.

4 $In a box plot, what does the line inside the box represent?$

Box plot Easy

A.

The standard deviation

B.

The mode (most frequent value)

C.

The mean (average)

D.

The median (50th percentile)

5 $A count plot is most suitable for visualizing the frequency of which type of data?$

Count plots Easy

A.

Numerical data

B.

Time-series data

C.

Geospatial data

D.

Categorical data

6 $What is a scatter plot primarily used for in bivariate analysis?$

Bivariate analysis using Scatter plots Easy

A.

To visualize the relationship and correlation between two numerical variables.

B.

To plot data points over a continuous time interval.

C.

To show the distribution of a single categorical variable.

D.

To compare the means of a numerical variable across different categories.

7 $A line plot is most effective for visualizing what kind of data?$

Line plots Easy

A.

The distribution of a single feature.

B.

The correlation between two non-sequential variables.

C.

Data that changes over a continuous interval, like time.

D.

Static counts of different categories.

8 $In a correlation heatmap, what does a value close to -1.0 between two variables indicate?$

Correlation analysis using Heatmaps Easy

A.

A weak relationship.

B.

No linear relationship.

C.

A strong negative linear relationship.

D.

A strong positive linear relationship.

9 $What is the definition of multicollinearity?$

Multicollinearity Easy

A.

When an independent variable is highly correlated with the target variable.

B.

When two or more independent variables in a regression model are highly correlated.

C.

When the dataset has too many columns.

D.

When a variable's distribution is not normal.

10 $A distribution where the tail is longer on the left side is described as:$

Distribution analysis: Skewness Easy

A.

Symmetrical

B.

Normally distributed

C.

Negatively skewed

D.

Positively skewed

11 $What does the kurtosis of a data distribution measure?$

Distribution analysis: Kurtosis Easy

A.

The asymmetry or skew of the distribution.

B.

The central point of the distribution.

C.

The correlation between variables.

D.

The 'tailedness' or presence of outliers in the distribution.

12 $In EDA, a data point that lies far away from the other data points is generally called an:$

Detecting patterns, anomalies and trends Easy

A.

Mode

B.

Outlier

C.

Inlier

D.

Median

13 $What is the primary goal of Exploratory Data Analysis (EDA)?$

EDA workflow integration Easy

A.

To prove a pre-defined hypothesis with statistical tests.

B.

To collect the raw data from its source.

C.

To understand the data, find patterns, and summarize its main characteristics.

D.

To build the final, most accurate machine learning model.

14 $A country's name (e.g., 'USA', 'Canada', 'India') is an example of what type of data?$

Types of data Easy

A.

Ordinal data

B.

Ratio data

C.

Nominal data

D.

Interval data

15 $If a scatter plot shows data points forming a cloud with no discernible slope, what does this suggest about the relationship between the two variables?$

Bivariate analysis using Scatter plots Easy

A.

There is a strong positive correlation.

B.

There is little to no linear correlation.

C.

The variables are causally related.

D.

There is a strong negative correlation.

16 $In a box plot, the 'box' itself visually represents the:$

Box plot Easy

A.

Interquartile Range (IQR)

B.

Full range of the data

C.

Standard Deviation

D.

Mean

17 $A violin plot is useful because it combines a box plot with a:$

Violin plots for category vs numeric Easy

A.

Kernel Density Estimate (KDE) plot

B.

Line plot

C.

Bar chart

D.

Scatter plot

18 $What is the main advantage of using a heatmap to display a correlation matrix?$

Correlation analysis using Heatmaps Easy

A.

It calculates the p-value for each correlation automatically.

B.

It can plot more than two variables at once.

C.

It is the only way to visualize correlations.

D.

It uses color to help quickly identify the strength and direction of correlations.

19 $In a perfectly symmetrical distribution, what is the value of skewness?$

Distribution analysis: Skewness Easy

A.

0

B.

1

C.

It depends on the standard deviation.

D.

-1

20 $After performing initial data cleaning and loading, which activity is a common next step in the EDA workflow?$

EDA workflow integration Easy

A.

Writing the final project report.

B.

Univariate analysis to understand individual variables.

C.

Deploying the model to production.

D.

Choosing the final model algorithm.

21 $A data scientist is preparing data for a multiple linear regression model. They generate a correlation heatmap and notice two independent variables, house_size_sqft and num_bedrooms, have a correlation coefficient of 0.88. What is the most significant issue this presents for the model?$

Multicollinearity Medium

A.

The presence of significant outliers in both variables.

B.

A non-linear relationship that the linear model cannot capture.

C.

Multicollinearity, which can make model coefficients unstable and difficult to interpret.

D.

Low variance in the predictor variables, making them poor predictors.

22 $An analyst wants to compare the distribution of salaries across different departments in a company. Which plot provides the most detailed information by combining the summary statistics of a box plot with a kernel density estimate of the distribution?$

Violin plots for category vs numeric Medium

A.

Scatter plot

B.

Clustered bar chart

C.

Violin plot

D.

Stacked histogram

23 $You calculate the skewness of a feature representing exam scores and find it to be -1.5. How would you interpret the distribution of these scores?$

Distribution analysis: Skewness, Kurtosis Medium

A.

The distribution is heavily skewed to the right, with a long tail of high scores, and most students scored low.

B.

The distribution is heavily skewed to the left, with a long tail of low scores, and most students scored high.

C.

The distribution is bimodal, suggesting two distinct groups of student performance.

D.

The distribution is symmetric, with an equal number of high and low scores around the mean.

24 $A box plot for a feature age shows the median line (Q2) is very close to the bottom of the box (the first quartile, Q1). What does this indicate about the data distribution?$

Box plot Medium

A.

Approximately 50% of the data points are considered outliers.

B.

The data is negatively skewed (skewed to the left).

C.

The interquartile range (IQR) is very small, indicating low variability.

D.

The data is positively skewed (skewed to the right).

25 $When analyzing a correlation heatmap, you see a cell with a value of -0.92 between variables study_hours and error_rate. What is the most appropriate interpretation?$

Correlation analysis using Heatmaps Medium

A.

There is a strong positive linear relationship; as one increases, the other increases.

B.

The relationship is causal; more study hours cause a lower error rate.

C.

There is a strong negative linear relationship; as one increases, the other decreases.

D.

There is virtually no relationship between the variables.

26 $You are trying to load a very large CSV file named transactions.csv that contains millions of rows, and your machine is running out of memory. Which Pandas read_csv strategy is most effective for processing the entire file without loading it all into memory at once?$

Loading data using Pandas Medium

A.

Setting header=None to ignore the column names and save a small amount of memory.

B.

Setting low_memory=False, which forces Pandas to read the whole file at once to infer dtypes.

C.

Using usecols to load only a small subset of columns, even if all columns are needed for analysis.

D.

Using the chunksize parameter to create an iterator and process the file in smaller pieces.

27 $You are analyzing daily website traffic data over several years using a line plot. You notice that traffic consistently peaks in November-December and drops in January-February of every year. What is this recurring pattern called?$

Detecting patterns, anomalies and trends Medium

A.

An anomaly

B.

A random fluctuation

C.

Seasonality

D.

A long-term trend

28 $You create a histogram for the price of houses in a city and observe two distinct, separate peaks (a bimodal distribution). What is the most plausible interpretation?$

Univariate analysis using Histogram Medium

A.

The data contains a significant number of entry errors.

B.

The data is perfectly normally distributed.

C.

There are likely two different sub-populations of houses (e.g., apartments and detached homes) with different typical price points.

D.

The data is uniformly distributed across all price ranges.

29 $A scatter plot of fertilizer_amount vs crop_yield shows points forming a curve that goes up and then comes down, resembling an inverted 'U'. The calculated Pearson correlation coefficient is 0.05. What is the correct conclusion?$

Bivariate analysis using Scatter plots Medium

A.

There is no relationship between fertilizer amount and crop yield.

B.

There is a strong linear relationship between the variables.

C.

There is a strong non-linear relationship, and the low Pearson correlation is expected because it only measures linear association.

D.

The data contains too many outliers to draw a valid conclusion.
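
The behaviour of the Pearson coefficient on a curved pattern can be checked numerically. The sketch below uses synthetic data, not the fertilizer data from the question: `x` is symmetric around zero and `y` is an exact inverted parabola of `x`.

```python
import numpy as np

# Synthetic inputs, symmetric around 0 (an assumption made for this sketch)
x = np.linspace(-3, 3, 61)

# y is fully determined by x, but the dependence is an inverted U
y = -(x ** 2)

# Pearson correlation only captures the linear part of the association
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))
```

Here `y` can be predicted from `x` without error, yet `r` is numerically zero, so a near-zero Pearson value on its own says nothing about non-linear dependence.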

30 $In a typical machine learning project workflow, what is the primary role of Exploratory Data Analysis (EDA)?$

EDA workflow integration Medium

A.

To directly build and train the final predictive model without any data transformation.

B.

To deploy the final model to a production environment.

C.

To collect raw data from various sources.

D.

To summarize data characteristics, find patterns, detect anomalies, and inform feature engineering and model selection.

31 $A dataset of daily stock returns has a kurtosis value of 9.0. A normal distribution has a kurtosis of 3.0. What does this high kurtosis value (leptokurtosis) imply about the investment risk?$

Distribution analysis: Skewness, Kurtosis Medium

A.

The distribution is platykurtic, with fewer outliers than a normal distribution.

B.

The risk is lower than a normal distribution because returns are more predictable.

C.

The risk is higher because the distribution has "fatter tails," meaning a higher probability of extreme positive or negative returns (outliers).

D.

The kurtosis value is irrelevant for assessing financial risk.
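
To make the kurtosis scale concrete, the sketch below computes (non-excess) kurtosis for two synthetic samples: one normal and one Laplace, a heavier-tailed distribution chosen here purely for illustration. The helper `kurtosis` and the sample sizes are assumptions of this example, not part of the quiz.

```python
import numpy as np

def kurtosis(a):
    """Fourth standardized moment (a normal distribution gives about 3)."""
    d = a - a.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)   # reference, kurtosis near 3
heavy_sample = rng.laplace(size=100_000)   # heavier tails than normal

k_normal = kurtosis(normal_sample)
k_heavy = kurtosis(heavy_sample)
print(k_normal, k_heavy)
```

The heavy-tailed sample yields a clearly larger value than the normal one, reflecting more mass in the extremes.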

32 $A dataset for an e-commerce site includes a product_ID column (e.g., "SKU-84302", "SKU-91034"). Although it contains numbers, it is used for identification only, and mathematical operations on it are meaningless. What is the correct data type for this column?$

Types of data Medium

A.

Continuous Numerical

B.

Discrete Numerical

C.

Ordinal Categorical

D.

Nominal Categorical

33 $You are analyzing a dataset of customer support tickets. To quickly visualize the number of tickets for each priority level ("Low", "Medium", "High", "Urgent"), which type of plot is most suitable?$

Univariate analysis using Count plots Medium

A.

Count plot or Bar chart

B.

Histogram

C.

Scatter plot

D.

Box plot

34 $An analyst is studying the closing price of a stock over the last 5 years. To best visualize the trend and potential seasonal patterns over this continuous time period, what is the most appropriate plot?$

Bivariate analysis using Line plots Medium

A.

A line plot with Time on the x-axis and Price on the y-axis.

B.

A histogram of all the stock prices.

C.

A box plot of prices grouped by month.

D.

A scatter plot of Price vs. Day of the Year.

35 $A common method to diagnose multicollinearity is by calculating the Variance Inflation Factor (VIF) for each predictor. If a predictor variable has a VIF of 25, what does this indicate?$

Multicollinearity Medium

A.

The variable has a weak, negative correlation with the target variable.

B.

The variable is not correlated with any other predictors and is a good candidate for the model.

C.

The variance of the regression coefficient for this variable is inflated by a factor of 25 due to its strong correlation with other predictors.

D.

The variable has 25% missing values that need to be imputed.
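
The VIF of predictor j is commonly defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below computes it with plain NumPy on synthetic data; the helper `vif` and the three made-up predictors are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns (with intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

In this setup the two nearly duplicated predictors get large VIFs, while the independent one stays close to 1.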

36 $You are comparing the distribution of test scores for two different teaching methods, A and B. When would a violin plot be significantly more informative than a standard box plot?$

Violin plots for category vs numeric Medium

A.

When the dataset is extremely small (e.g., fewer than 10 students per method).

B.

When the distribution of scores for method A is bimodal (has two peaks), while method B's is unimodal.

C.

When you need to identify the exact minimum and maximum scores in the data.

D.

When you only care about the median and interquartile range.

37 $You are examining a scatter plot of Years of Experience vs Salary. You observe that the points generally form a tight, upward-sloping line, but there are a few points representing high experience with an unusually low salary. What do these points likely represent?$

Detecting patterns, anomalies and trends Medium

A.

The normal, expected variance in salary for a given experience level.

B.

Evidence of a non-linear relationship that requires a polynomial model.

C.

Anomalies, possibly due to data entry errors or representing a different employee group (e.g., a different profession or part-time workers).

D.

A negative correlation between experience and salary for senior employees.

38 $In a correlation heatmap for a real estate dataset, the cell for crime_rate and property_value is colored dark red, while the cell for school_rating and property_value is colored dark blue. The color scale indicates red for negative and blue for positive correlation. What does this imply?$

Correlation analysis using Heatmaps Medium

A.

There is no significant correlation between these variables and property_value.

B.

crime_rate has a strong negative correlation with property_value, and school_rating has a strong positive correlation.

C.

Both crime_rate and school_rating have a strong positive correlation with property_value.

D.

crime_rate has a strong positive correlation with property_value, and school_rating has a strong negative correlation.

39 $You create a scatter plot to investigate the relationship between two continuous variables, A and B. The plot shows the points arranged in a distinct funnel shape, where the vertical spread of B increases as A increases. This pattern is a classic sign of:$

Bivariate analysis using Scatter plots Medium

A.

Autocorrelation

B.

Multicollinearity

C.

Heteroscedasticity

D.

Homoscedasticity

40 $In a standard Tukey box plot, a data point is typically flagged as a potential outlier if it falls outside which of the following ranges? (Where IQR is the Interquartile Range, Q1 is the first quartile, and Q3 is the third quartile).$

Box plot Medium

A.

Below the mean minus 2 standard deviations or above the mean plus 2 standard deviations.

B.

Below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$.

C.

Outside the 5th and 95th percentiles of the data.

D.

Strictly within the range of .
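
The conventional Tukey rule places fences 1.5 × IQR below Q1 and above Q3. The sketch below applies that rule to a made-up sample; quartiles are computed with NumPy's default (linear interpolation) percentile method, which is one of several conventions.

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: points beyond them are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

Only the value 100 falls outside the fences in this sample.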

41 $You are building a predictive model and find high multicollinearity between two features, feature_A and feature_B. How would this multicollinearity likely affect a standard Linear Regression model versus a Random Forest model?$

Multicollinearity Hard

A.

It will cause overfitting in Linear Regression but lead to underfitting in the Random Forest model due to redundant information.

B.

It will destabilize both models, causing poor predictive performance and unreliable feature importances in both Linear Regression and Random Forest.

C.

It will primarily impact the Random Forest by making feature importance measures (like Gini importance) unreliable for feature_A and feature_B, but will have a lesser effect on the Linear Regression's coefficient stability.

D.

It will significantly inflate coefficient variance in Linear Regression, making interpretation unreliable, but will have a minimal impact on the Random Forest's predictive accuracy and feature importance stability.

42 $A financial analyst is modeling stock returns and finds the distribution is highly leptokurtic (excess kurtosis >> 0), even though it is symmetric (skewness \approx 0). If the analyst uses a model that assumes a normal distribution (mesokurtic), what is the most critical risk-related consequence?$

Distribution analysis: Kurtosis Hard

A.

The model will systematically underestimate the probability of extreme events (both positive and negative), leading to an underestimation of risk (e.g., Value at Risk).

B.

The model's prediction of the mean return will be biased, even though the distribution is symmetric.

C.

The model will be unable to converge because the variance of the return distribution is technically infinite.

D.

The model will systematically overestimate the probability of extreme events, leading to an overly conservative risk assessment.

43 $You are comparing the distribution of 'salaries' across two 'departments' (A and B) using violin plots. Both departments have the exact same median salary. However, Department A's violin plot is short and wide, resembling a circle, while Department B's is tall and narrow with long tails. What is the most insightful business interpretation?$

Bivariate analysis using Violin plots for category vs numeric Hard

A.

Salaries in Department B are more predictable and consistent around the median, but there are extreme outliers, whereas Department A has high salary variance for the majority of its employees.

B.

Department A has a more equitable salary distribution with most employees earning near the median, while Department B has significant salary inequality with clusters at high and low ends.

C.

Department B is a better department to work for because the potential for a very high salary is greater, as indicated by the long upper tail.

D.

The total salary expenditure for both departments is likely the same because their medians are identical.

44 $During the EDA phase of a customer churn prediction project, you discover a feature last_call_complaint that has a 98% correlation with the target variable churned. What is the most critical next step in your EDA workflow?$

EDA workflow integration Hard

A.

Immediately investigate the feature's temporal relationship with the target variable to check for data leakage.

B.

Remove the feature from the dataset to prevent multicollinearity with other features.

C.

Apply a non-linear transformation to the feature to see if the correlation can be increased to 100%.

D.

Conclude that you have found the most important predictor and proceed to build a simple logistic regression model using only this feature.

45 $You are loading a 10 GB CSV file on a machine with 8 GB of RAM using Pandas. The file contains 50 columns, but you only need to perform calculations on 3 of them: user_id, transaction_amount, and timestamp. What is the most memory-efficient approach to calculate the average transaction_amount per user_id?$

Loading data using Pandas Hard

A.

Use pd.read_csv with the chunksize and usecols parameters, process each chunk in a loop, and aggregate the results.

B.

Use the low_memory=False parameter in pd.read_csv to load the data in a single, more efficient block.

C.

Increase the system's swap/page file size to be larger than 10 GB before loading the data with pd.read_csv.

D.

Load the entire file using pd.read_csv and then immediately delete the unnecessary columns using df.drop() before processing.
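
A sketch of how a chunked, column-restricted read can feed a running aggregation. The tiny in-memory CSV stands in for the large file, the `extra` column stands in for the unused columns, and the chunk size of 2 is chosen only so the loop runs several times. Only the two columns needed for the average are requested here.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large CSV file
csv_text = (
    "user_id,transaction_amount,timestamp,extra\n"
    "u1,10.0,2024-01-01,x\n"
    "u2,20.0,2024-01-01,x\n"
    "u1,30.0,2024-01-02,x\n"
    "u2,60.0,2024-01-02,x\n"
    "u1,50.0,2024-01-03,x\n"
)

sums = {}    # running total of transaction_amount per user
counts = {}  # running number of rows per user

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["user_id", "transaction_amount"],  # skip unneeded columns
    chunksize=2,                                # rows held in memory at once
)
for chunk in reader:
    grouped = chunk.groupby("user_id")["transaction_amount"].agg(["sum", "count"])
    for user, row in grouped.iterrows():
        sums[user] = sums.get(user, 0.0) + row["sum"]
        counts[user] = counts.get(user, 0) + row["count"]

# A mean cannot be averaged across chunks directly, so combine sums and counts
avg_per_user = {user: sums[user] / counts[user] for user in sums}
print(avg_per_user)
```

Keeping sums and counts separately matters: averaging the per-chunk means would weight chunks incorrectly when a user appears a different number of times in each chunk.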

46 $You have a feature with a strong positive skew (right-skewed). You apply a log transformation (\log(x)), but the resulting distribution becomes moderately negatively skewed. What is the most likely reason for this 'overshoot' and what would be a more appropriate next step?$

Distribution analysis: Skewness Hard

A.

This indicates the presence of bimodal distribution which was not apparent before the transformation. The next step should be to use a clustering algorithm to separate the modes.

B.

The negative skew is an artifact of floating-point precision errors and should be ignored. The log-transformed data is the best version to use for modeling.

C.

The original data must contain negative values, which makes the log transformation mathematically invalid and produces unpredictable results.

D.

The log transformation was too strong for the given skewness. A milder transformation like a square root transform ($\sqrt{x}$) or a more adaptive one like a Box-Cox transformation should be tried.
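
The effect of transformation strength on skewness can be explored on synthetic data. In the sketch below the feature is deliberately constructed as the square of a roughly normal variable, so the square root is the "right" strength by design; real data will rarely be this tidy. The helper `skewness` is defined here for illustration.

```python
import numpy as np

def skewness(a):
    """Third standardized moment (0 for a symmetric distribution)."""
    d = a - a.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=1.0, size=10_000)
x = z ** 2  # right-skewed by construction (assumption of this sketch)

skew_raw = skewness(x)
skew_log = skewness(np.log(x))    # stronger transform
skew_sqrt = skewness(np.sqrt(x))  # milder transform
print(skew_raw, skew_log, skew_sqrt)
```

On this sample the raw feature is positively skewed, the log version is negatively skewed, and the square-root version is close to symmetric.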

47 $You generate a box plot for a feature and observe that the median line is identical to the first quartile (Q1). What is the most accurate interpretation of this observation?$

Univariate analysis using Box plot Hard

A.

There are no data points between the first quartile and the median.

B.

The dataset is invalid, as it is mathematically impossible for the median and Q1 to be equal.

C.

The data is highly negatively skewed, causing the median to be pulled down towards Q1.

D.

At least 25% of the data points have the exact same value, which is the median and Q1 value.
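
A quick numerical check of what it takes for the median and Q1 to coincide, using a made-up sample with many repeated values and NumPy's default percentile method:

```python
import numpy as np

# Made-up sample: the value 5 is repeated six times out of ten
values = np.array([5, 5, 5, 5, 5, 5, 6, 7, 8, 9])

q1 = np.percentile(values, 25)
median = np.percentile(values, 50)

# Fraction of observations sitting exactly at the shared value
share_at_q1 = np.mean(values == q1)
print(q1, median, share_at_q1)
```

Both quartile values land on the repeated value, and a large share of the sample is tied at that value.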

48 $In a correlation heatmap for a regression problem, you observe that feature_X has a correlation of +0.7 with the target, feature_Y has a correlation of +0.6 with the target, and feature_X and feature_Y have a correlation of +0.9 with each other. From a feature selection perspective for a linear model, what is the most strategically sound decision?$

Correlation analysis using Heatmaps Hard

A.

Keep both features and use a regularization technique like Ridge or Lasso to handle the multicollinearity during modeling.

B.

Combine feature_X and feature_Y into a single feature using Principal Component Analysis (PCA) before modeling.

C.

Discard both features because their high inter-correlation makes them unreliable predictors.

D.

Keep feature_X and discard feature_Y because X is more correlated with the target and including both would introduce strong multicollinearity.

49 $A scatter plot of a model's residuals (Y-axis) against its predicted values (X-axis) shows a distinct funnel shape, widening from left to right. What is this pattern called and what is its primary implication for a linear regression model?$

Bivariate analysis using Scatter plots Hard

A.

Multicollinearity; it suggests that two or more predictor variables are highly correlated with each other, affecting coefficient stability.

B.

Autocorrelation; it violates the assumption of independent errors, suggesting the model is not capturing some time-series or sequential pattern.

C.

Heteroscedasticity; it violates the assumption of constant variance of errors, making inferences about the coefficients (like p-values and confidence intervals) unreliable.

D.

Non-linearity; it violates the assumption of a linear relationship, indicating that a polynomial or other non-linear model is needed.

50 $While performing EDA on transaction data, you isolate a small, dense cluster of points using a density-based algorithm like DBSCAN. These points are far from the main distribution but are internally consistent. The business context is fraud detection. What is the most appropriate classification and action for this cluster?$

Detecting patterns, anomalies and trends Hard

A.

A segment of legitimate, high-value customers; they should be analyzed separately to build a specialized personalization model.

B.

A potential emerging fraud pattern (a 'wolf pack'); these points should be investigated as a group rather than being dismissed as individual random outliers.

C.

Global outliers; they should be removed from the dataset before training a model to improve its generalization on the main distribution.

D.

A collection of data entry errors; these points should be corrected or removed after consulting with the data source owner.

51 $A survey asks for user satisfaction on a scale: 1 (Very Unsatisfied), 2 (Unsatisfied), 3 (Neutral), 4 (Satisfied), 5 (Very Satisfied). A data scientist decides to calculate the mean satisfaction score. What is the fundamental assumption they are making and why might it be problematic?$

Types of data Hard

A.

They are treating ordinal data as interval data, which assumes the 'distance' between each category (e.g., between 1 and 2, and between 4 and 5) is equal and meaningful, which may not be true.

B.

They are treating interval data as ratio data, which incorrectly assumes the existence of a true zero point (e.g., a score of 4 is twice as good as a score of 2).

C.

They are treating categorical data as numerical data, which will cause a TypeError in most analytical software.

D.

They are treating nominal data as ordinal data, which assumes an inherent order in the categories that does not exist.

52 $After running a regression, you find that the model has a very high R-squared value (0.92), but most of your predictor variables have high p-values, suggesting they are not statistically significant. What is the most likely diagnosis and the most appropriate tool to confirm it?$

Multicollinearity Hard

A.

The issue is heteroscedasticity. The best diagnostic tool is to create a residual vs. fitted values plot and look for a pattern.

B.

The issue is likely severe multicollinearity. The best diagnostic tool is to calculate the Variance Inflation Factor (VIF) for each predictor.

C.

The model is overfitted to the training data. The best way to confirm this is to evaluate the model's performance on a hold-out test set.

D.

The sample size is too small to achieve statistical significance. The best approach is to collect more data.

53 $An analyst creates a histogram of customer ages and observes a distinct bimodal distribution with peaks around 25 years and 55 years. What is the most likely and actionable insight derived from this observation during EDA?$

Univariate analysis using Histogram Hard

A.

The sampling method used to collect the data was biased, over-sampling from two specific age groups, and the data is not representative of the true population.

B.

The customer base consists of two distinct subpopulations or segments (e.g., 'young professionals' and 'pre-retirees'). This suggests that a single model for all customers may underperform, and segment-specific analysis or modeling is warranted.

C.

The data contains a significant number of outliers that are creating a second, artificial peak in the distribution. These outliers should be removed.

D.

The bin width of the histogram is too small, creating artificial peaks and troughs. The analysis should be redone with a larger bin size to smooth the distribution.

54 $A data scientist generates a correlation heatmap and finds a correlation of -0.95 between average_daily_temperature and heating_oil_consumption. A colleague suggests that this implies that low temperatures cause high heating oil consumption. Why is this conclusion, although plausible, not rigorously supported by the heatmap alone?$

Correlation analysis using Heatmaps Hard

A.

A heatmap is not the correct visualization for establishing causation; a series of scatter plots would be required.

B.

The correlation value is too close to -1.0, which suggests a data leak or an error in calculation rather than a real-world relationship.

C.

The correlation is negative, which indicates an inverse relationship, not a causal one. Causal relationships must have positive correlations.

D.

Correlation does not imply causation. The heatmap only shows a statistical association; an unobserved confounding variable (e.g., time of year/season) could be the true cause for both.

55 $You are comparing two features for an anomaly detection system. Feature A has Skewness = 0, Kurtosis = 8. Feature B has Skewness = 3, Kurtosis = 3 (mesokurtic). Which feature is likely to present a greater challenge for an anomaly detection algorithm that is sensitive to extreme values, and why?$

Distribution analysis: Skewness, Kurtosis Hard

A.

Neither, because anomaly detection algorithms are specifically designed to be robust to non-normal distributions.

B.

Both will be equally challenging, as any deviation from a normal distribution (Skewness=0, Kurtosis=3) complicates anomaly detection.

C.

Feature B, because its high skewness indicates that the distribution is asymmetric, which inherently makes it harder to define a 'normal' range compared to a symmetric distribution.

D.

Feature A, because its high kurtosis (leptokurtic) indicates a distribution with extremely heavy tails, meaning outliers are more frequent and more extreme than in a normal distribution.

56 $While using pd.read_csv('data.csv'), you encounter a DtypeWarning. Upon inspection, you find a column zip_code contains mostly 5-digit numbers but also some extended 'ZIP+4' codes as strings (e.g., '90210-1234'). Pandas infers the column as mixed-type, which is inefficient. What is the most robust way to load this data, ensuring the entire zip_code column is treated as a string to preserve all information?$

Loading data using Pandas Hard

A.

Use the low_memory=False parameter to force Pandas to analyze the whole file before assigning a type.

B.

Specify dtype={'zip_code': str} within the pd.read_csv call.

C.

Load the data as is, then use df['zip_code'] = df['zip_code'].astype(str) after loading.

D.

Pre-process the CSV file with another tool (e.g., sed/awk) to add quotes around all zip codes before loading.
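
A minimal sketch of declaring a column's type at read time, using an in-memory CSV with hypothetical zip codes. One of them has a leading zero, a detail not mentioned in the question but a common reason to keep such codes as text.

```python
import io
import pandas as pd

# Hypothetical data: plain 5-digit codes mixed with a ZIP+4 code
csv_text = "zip_code\n02134\n90210-1234\n10001\n"

# Declaring the dtype up front means values are never parsed as numbers
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
zip_codes = df["zip_code"].tolist()
print(zip_codes)
```

Because no numeric parsing happens, the leading zero in `02134` survives; converting with `astype(str)` after a default load could not restore it once the value had been read as an integer.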

57 $You are performing EDA for a binary classification problem and a count plot of the target variable shows a 99:1 class imbalance. Which of the following is the most critical risk to consider during the EDA and visualization phase, even before modeling begins?$

Univariate analysis using Count plots Hard

A.

The severe imbalance means that accuracy will be a misleading metric for model evaluation.

B.

The dataset will require over-sampling techniques like SMOTE, which must be planned during EDA.

C.

Many standard visualizations (like comparing feature distributions) can be misleading because the minority class is so sparse it may be invisible or appear as noise.

D.

Count plots are an inappropriate visualization for imbalanced data; a pie chart should be used instead.

58 $Within a mature, agile machine learning pipeline, EDA is not a one-time, upfront phase. When is it most critical to re-run a significant portion of the EDA process?$

EDA workflow integration Hard

A.

Before every single model retraining cycle, even if the data source and performance are stable.

B.

When new data is ingested from a source that has undergone a known schema change or when a significant concept drift is suspected in the model's performance.

C.

Only when the lead data scientist decides the current model's accuracy has dropped by more than 5%.

D.

After the feature engineering phase is complete, to validate the new features, but not before.

59 $You are analyzing a time-series line plot of website traffic. You observe a strong weekly seasonal pattern (peaks on weekdays, troughs on weekends) and an overall upward trend. However, there is a sudden, sharp, and permanent drop in the baseline traffic starting on a specific date. What is the most likely interpretation of this drop?$

Bivariate analysis using Line plots Hard

A.

A simple outlier that should be smoothed over using a moving average before modeling.

B.

Cyclical behavior in the data, which is a long-term pattern that will eventually reverse itself.

C.

A structural break in the time series, possibly caused by an external event like a change in search engine algorithms or an internal event like a website redesign.

D.

An issue with data collection or logging that started on that date, which should be investigated before any modeling.

60 $When analyzing user behavior data, you notice that a metric 'average_session_length' is consistently increasing month-over-month (a trend). However, the 'number_of_users' is decreasing over the same period. What is the most nuanced and potentially critical business insight from these two opposing trends?$

Detecting patterns, anomalies and trends Hard

A.

The 'average_session_length' metric is likely flawed or being calculated differently over time, causing an artificial inflation.

B.

The platform is losing casual users, but retaining a core group of highly engaged 'power users', potentially indicating a shift towards a more niche but dedicated user base.

C.

The two trends are unrelated and should be analyzed and reported on separately to avoid drawing spurious conclusions.

D.

The user interface has become less efficient, causing the remaining users to take longer to accomplish the same tasks, indicating a user experience problem.

Unit 3 - Practice Quiz