1Which of the following data types represents categories with a meaningful order or ranking but no fixed distance between them?
A.Nominal Data
B.Ordinal Data
C.Interval Data
D.Ratio Data
Correct Answer: Ordinal Data
Explanation:Ordinal data represents categories that have a natural order or ranking (e.g., Low, Medium, High), but the intervals between the ranks are not necessarily equal.
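As a rough illustration, ordinal data can be modelled in Pandas with an ordered categorical. The ratings below are hypothetical, and the sketch assumes the Low/Medium/High scale from the explanation:

```python
import pandas as pd

# Hypothetical survey responses; the category order is declared explicitly.
levels = ["Low", "Medium", "High"]
ratings = pd.Series(
    pd.Categorical(["Medium", "Low", "High", "Low"], categories=levels, ordered=True)
)

# Order-based operations are meaningful for ordinal data.
print(ratings.min(), ratings.max())
print((ratings > "Low").tolist())

# The integer codes encode rank only; differences between codes are not distances.
print(ratings.cat.codes.tolist())
```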
2Which Pandas function is primarily used to load data from a Comma Separated Values file into a DataFrame?
A.pd.load_csv()
B.pd.read_file()
C.pd.read_csv()
D.pd.import_csv()
Correct Answer: pd.read_csv()
Explanation:The standard function in the Pandas library to load CSV files is pd.read_csv().
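A minimal sketch of pd.read_csv() in use. The CSV content is made up, and an in-memory buffer stands in for a file on disk so the example is self-contained:

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical data).
csv_text = "name,score\nAda,91\nLin,78\nSam,85\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```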
3In a box plot, the central line inside the box represents which statistical measure?
A.Mean
B.Mode
C.Median
D.Standard Deviation
Correct Answer: Median
Explanation:The line inside the box of a box plot represents the median (Q2, the 50th percentile) of the dataset.
4What is the primary purpose of a histogram in Univariate analysis?
A.To show the relationship between two variables
B.To visualize the frequency distribution of a continuous variable
C.To show the count of categorical variables
D.To visualize trends over time
Correct Answer: To visualize the frequency distribution of a continuous variable
Explanation:Histograms are used to represent the distribution of numerical data by dividing the data into bins and plotting the frequency of observations in each bin.
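The binning step behind a histogram can be sketched with NumPy alone, without drawing anything. The values and the choice of four bins are arbitrary assumptions for illustration:

```python
import numpy as np

# Hypothetical measurements of a single numerical variable.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range [1, 5] into 4 equal-width bins and count observations per bin.
counts, edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last, which includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {n}")
```

A plotting library would draw one bar per bin with height equal to the count.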
5Which Pandas method provides a concise summary of a DataFrame, including the index dtype and columns, non-null values, and memory usage?
A.df.describe()
B.df.head()
C.df.info()
D.df.shape()
Correct Answer: df.info()
Explanation:df.info() prints information about a DataFrame including the index dtype and columns, non-null values, and memory usage.
6A scatter plot is most suitable for analyzing the relationship between:
A.One categorical and one numerical variable
B.Two categorical variables
C.Two continuous numerical variables
D.Time and a categorical variable
Correct Answer: Two continuous numerical variables
Explanation:Scatter plots display the values of two (typically continuous) variables as points, one point per observation, which makes them useful for detecting correlations.
7When analyzing the correlation between variables, a Pearson correlation coefficient (r) of -0.95 indicates:
A.A strong positive linear relationship
B.A weak negative linear relationship
C.A strong negative linear relationship
D.No linear relationship
Correct Answer: A strong negative linear relationship
Explanation:The Pearson correlation coefficient ranges from -1 to 1. Values close to -1 indicate a strong negative linear relationship.
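To make the coefficient concrete, here is a small sketch that computes it directly from its definition (covariance divided by the product of the standard deviations). The x and y values are invented so that y falls almost linearly as x rises:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])  # hypothetical data, decreasing in x

# Deviations from the mean for each variable.
dx, dy = x - x.mean(), y - y.mean()

# Pearson r: sum of co-deviations over the product of root sums of squares.
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
print(round(r, 3))
```

The result lies close to -1, matching the "strong negative linear relationship" reading.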
8Which visualization is best suited to show the distribution of a quantitative variable across several levels of a categorical variable, including the probability density?
A.Box plot
B.Violin plot
C.Scatter plot
D.Bar chart
Correct Answer: Violin plot
Explanation:A violin plot plays a similar role to a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared, and it features a kernel density estimation of the underlying distribution.
9In the context of EDA, what is Multicollinearity?
A.When a variable has a non-linear relationship with the target
B.When two or more independent variables are highly correlated with each other
C.When the data has too many missing values
D.When the target variable is categorical
Correct Answer: When two or more independent variables are highly correlated with each other
Explanation:Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This is a problem because the model can no longer separate the individual effect of each variable, which makes the coefficient estimates unstable.
10Which metric is used to measure the asymmetry of the probability distribution of a real-valued random variable about its mean?
A.Kurtosis
B.Variance
C.Skewness
D.Standard Deviation
Correct Answer: Skewness
Explanation:Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
11If a distribution has a long tail on the right side, it is considered:
A.Negatively skewed
B.Positively skewed
C.Symmetric
D.Normal
Correct Answer: Positively skewed
Explanation:Positive skewness (right-skewed) means the tail on the right side of the distribution is longer or fatter. The mean and median are typically greater than the mode, with the mean pulled furthest toward the tail.
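A quick way to check this on data is Series.skew(). The income figures below are hypothetical, with one large value forming the right tail:

```python
import pandas as pd

# Hypothetical incomes (in thousands); 120 creates a long right tail.
incomes = pd.Series([20, 22, 22, 25, 27, 30, 34, 120])

skew = incomes.skew()  # sample skewness; positive means right-skewed
print("skewness:", round(skew, 2))
print("mean:", incomes.mean(), "median:", incomes.median(), "mode:", incomes.mode()[0])
```

In this sample the ordering mean > median > mode holds, although that ordering is a rule of thumb rather than a guarantee for every skewed dataset.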
12Which plot is specifically designed to visualize the count of observations in each categorical bin using bars?
A.Scatter plot
B.Count plot
C.Line plot
D.Violin plot
Correct Answer: Count plot
Explanation:A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
13What does the Interquartile Range (IQR) represent in a box plot?
A.The range between the minimum and maximum values
B.The difference between the 75th percentile (Q3) and the 25th percentile (Q1)
C.The variance of the data
D.The difference between the median and the mean
Correct Answer: The difference between the 75th percentile (Q3) and the 25th percentile (Q1)
Explanation:The IQR spans the middle 50% of the data, calculated as IQR = Q3 - Q1.
14Which tool is commonly used to visualize a correlation matrix?
A.Pie Chart
B.Heatmap
C.Histogram
D.Box Plot
Correct Answer: Heatmap
Explanation:Heatmaps are graphical representations of data where individual values contained in a matrix are represented as colors, making them ideal for visualizing correlation matrices.
15High Kurtosis in a data distribution implies:
A.The data has light tails or lack of outliers
B.The data is perfectly normal
C.The data has heavy tails or outliers
D.The data is flat
Correct Answer: The data has heavy tails or outliers
Explanation:High kurtosis (leptokurtic) indicates a distribution with heavy tails, often accompanied by a sharper peak, implying a higher presence of outliers compared to a normal distribution.
16Which of the following represents Ratio data?
A.Temperature in Celsius
B.Likert Scale (Satisfied, Neutral, Dissatisfied)
C.Height in centimeters
D.Zip Codes
Correct Answer: Height in centimeters
Explanation:Ratio data has a meaningful zero point (absence of the attribute) and equal intervals. Height is ratio data because 0 cm means no height.
17How do you check the first 5 rows of a Pandas DataFrame named df?
A.df.tail()
B.df.sample(5)
C.df.head()
D.df.columns
Correct Answer: df.head()
Explanation:df.head() returns the first n rows (default is 5) of the DataFrame.
18Which statistic helps in detecting outliers using the box plot method?
A.Interquartile Range (IQR)
B.Standard Deviation
C.Mean
D.Z-Score
Correct Answer: Interquartile Range (IQR)
Explanation:In a box plot, outliers are typically defined as observations that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
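The 1.5 × IQR rule can be sketched in a few lines of Pandas. The data are hypothetical, and the multiplier 1.5 is the conventional default rather than a fixed law:

```python
import pandas as pd

# Hypothetical sample with one suspiciously large value.
s = pd.Series([12, 13, 14, 15, 15, 16, 17, 18, 60])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a point is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers.tolist())
```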
19Which type of plot is best for detecting trends over a period of time?
A.Pie Chart
B.Line Plot
C.Scatter Plot
D.Violin Plot
Correct Answer: Line Plot
Explanation:Line plots display information as a series of data points called 'markers' connected by straight line segments, which makes them well suited to time-series analysis.
20If two variables have a correlation of 0, it means:
A.They are identical
B.They have a linear relationship
C.There is no linear relationship between them
D.One causes the other
Correct Answer: There is no linear relationship between them
Explanation:A correlation coefficient of 0 indicates no linear relationship between the two variables.
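The word "linear" matters here. As a small sketch with made-up data, y below is completely determined by x, yet the Pearson correlation is zero because the relationship is a symmetric curve rather than a line:

```python
import pandas as pd

x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2  # perfect, but non-linear, dependence on x

r = x.corr(y)  # Pearson correlation by default
print(r)
```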
21Which Variance Inflation Factor (VIF) value typically indicates high multicollinearity requiring attention?
A.VIF = 1
B.VIF < 5
C.VIF > 5 or 10
D.VIF = 0
Correct Answer: VIF > 5 or 10
Explanation:A VIF of 1 indicates that a variable is uncorrelated with the other independent variables. Generally, a VIF above 5 (or, by a looser convention, above 10) indicates high multicollinearity that may need to be addressed.
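One way to see where these numbers come from is to compute VIF by hand: regress each variable on the others and take 1 / (1 - R²). The sketch below uses only NumPy; the helper name vif and the synthetic columns x1, x2, x3 are assumptions made for illustration, not part of any library:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # nearly a multiple of x1
x3 = rng.normal(size=200)                      # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two nearly collinear columns get very large VIFs, while the unrelated column stays close to 1.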
22What is the skewness of a perfectly symmetrical Normal Distribution?
A.1
B.-1
C.0
D.0.5
Correct Answer: 0
Explanation:A normal distribution is symmetric around its mean, so its skewness is 0.
23In a Pandas DataFrame, what does df.describe() output?
Explanation:df.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution.
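A short sketch of what df.describe() returns on a small, made-up DataFrame. With default arguments only the numeric columns are summarized:

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["Pune", "Oslo", "Lima", "Oslo"],
})

summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```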
24Which of the following is an example of Nominal Data?
A.Age
B.Income
C.Eye Color (Blue, Brown, Green)
D.Class Rank
Correct Answer: Eye Color (Blue, Brown, Green)
Explanation:Nominal data represents labels or names without any intrinsic order. Eye color is a classic example.
25In EDA, what is an 'anomaly'?
A.A missing value
B.A data point that deviates significantly from the rest of the data
C.The average value of the dataset
D.A categorical variable
Correct Answer: A data point that deviates significantly from the rest of the data
Explanation:Anomalies (or outliers) are observations that differ significantly from the majority of the data.
26Which library is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics, like violin plots and heatmaps?
A.NumPy
B.Pandas
C.Seaborn
D.Scikit-learn
Correct Answer: Seaborn
Explanation:Seaborn is a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
27When interpreting a box plot, the 'whiskers' usually extend to:
A.The minimum and maximum values (excluding outliers)
B.The standard deviation
C.The variance
D.The 10th and 90th percentiles
Correct Answer: The minimum and maximum values (excluding outliers)
Explanation:The whiskers extend from the box to the rest of the distribution, typically to the most extreme points within 1.5 × IQR of the quartiles, representing the range of non-outlier data.
28What is the relationship between Mean, Median, and Mode in a negatively skewed distribution?
A.Mean > Median > Mode
B.Mean = Median = Mode
C.Mean < Median < Mode
D.Mode < Mean < Median
Correct Answer: Mean < Median < Mode
Explanation:In a negatively skewed (left-skewed) distribution, the tail is on the left, pulling the mean lower than the median and the mode.
29Which Pandas method is used to check for missing values in a dataset?
A.df.missing()
B.df.isnull()
C.df.check_na()
D.df.empty()
Correct Answer: df.isnull()
Explanation:df.isnull() (or df.isna()) returns a DataFrame of the same shape as df containing booleans indicating whether values are missing.
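In practice the boolean mask is usually summed to get a per-column count of missing values. A minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

mask = df.isnull()               # same shape as df, True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values
print(missing_per_column)
```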
30The visual inspection of a Scatter plot allows you to determine:
A.Only the strength of the relationship
B.Only the direction of the relationship
C.Both the strength and direction of the relationship
D.The exact equation of the line
Correct Answer: Both the strength and direction of the relationship
Explanation:Scatter plots visualize the correlation, allowing you to see if the relationship is positive/negative (direction) and how tight the points are (strength).
31A correlation matrix is a square table that shows:
A.The covariance between variables
B.The correlation coefficients between pairs of variables
C.The variance of each variable
D.The summary statistics
Correct Answer: The correlation coefficients between pairs of variables
Explanation:A correlation matrix shows the correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
32Which data type is 'Temperature in Celsius'?
A.Nominal
B.Ordinal
C.Interval
D.Ratio
Correct Answer: Interval
Explanation:It is Interval data because the difference between two values is meaningful, but there is no true zero (0 Celsius does not mean 'no temperature').
33In a Histogram, the width of the bars represents:
A.The number of observations
B.The interval (bin) size of the variable
C.The standard deviation
D.The mean value
Correct Answer: The interval (bin) size of the variable
Explanation:The horizontal axis of a histogram represents the continuous variable, divided into bins (intervals). The width represents the size of these bins.
34What is a platykurtic distribution?
A.A distribution with negative excess kurtosis (flatter than normal)
B.A distribution with positive excess kurtosis (peaked)
C.A distribution with zero excess kurtosis
D.A skewed distribution
Correct Answer: A distribution with negative excess kurtosis (flatter than normal)
Explanation:Platykurtic distributions have negative excess kurtosis. They are flatter with lighter tails compared to a normal distribution.
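Series.kurt() in Pandas reports excess kurtosis (a normal distribution scores about 0), so the sign can be read directly. The two samples below are invented: one evenly spread, one with a single extreme value:

```python
import pandas as pd

flat = pd.Series(range(1, 11))     # evenly spread values, light tails
heavy = pd.Series([5] * 9 + [100]) # one extreme value, heavy tail

flat_kurt = flat.kurt()    # negative: platykurtic
heavy_kurt = heavy.kurt()  # positive: leptokurtic
print(round(flat_kurt, 2), round(heavy_kurt, 2))
```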
35Which Pandas method is used to count how many times each unique value occurs in a specific column?
A.df['col'].unique()
B.df['col'].nunique()
C.df['col'].value_counts()
D.df['col'].count()
Correct Answer: df['col'].value_counts()
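Explanation:value_counts() returns a Series containing the count of each unique value. By contrast, nunique() returns only the number of distinct values, unique() returns the distinct values themselves, and count() returns the number of non-null entries.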
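The four options are easy to confuse, so here is a side-by-side sketch on a small hypothetical column:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = s.value_counts()  # occurrences of each distinct value, most frequent first
print(counts)

print(s.unique())   # the distinct values, in order of first appearance
print(s.nunique())  # how many distinct values there are
print(s.count())    # how many non-null entries there are
```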
36When detecting patterns, 'Seasonality' refers to:
A.A long-term increase or decrease in data
B.Random fluctuations in data
C.Regular, repeating fluctuations over a specific period
D.One-time anomalies
Correct Answer: Regular, repeating fluctuations over a specific period
Explanation:Seasonality refers to periodic fluctuations that occur at regular time intervals (e.g., daily, weekly, yearly).
37Why is handling multicollinearity important for linear regression models?
A.It ensures the target variable is normally distributed
B.It stabilizes the estimates of the regression coefficients
C.It increases the number of features
D.It removes outliers
Correct Answer: It stabilizes the estimates of the regression coefficients
Explanation:Multicollinearity makes the estimates of regression coefficients unstable and difficult to interpret because the independent variables are not truly independent.
38Which plot is essentially a box plot with a rotated kernel density plot on each side?
A.Histogram
B.Scatter plot
C.Violin plot
D.Strip plot
Correct Answer: Violin plot
Explanation:A violin plot combines a box plot with a kernel density estimation (KDE) plot.
39The command df.corr() in Pandas calculates which correlation coefficient by default?
A.Spearman
B.Kendall
C.Pearson
D.Point-Biserial
Correct Answer: Pearson
Explanation:The default method for df.corr() is 'pearson'.
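The choice of method can change the answer. In this sketch (made-up data, y equal to x cubed) the relationship is perfectly monotonic but not linear, so the rank-based Spearman coefficient is 1 while the default Pearson coefficient is lower:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3  # monotonic but curved

pearson = df.corr().loc["x", "y"]                    # default: method="pearson"
spearman = df.corr(method="spearman").loc["x", "y"]  # rank-based alternative
print(round(pearson, 3), round(spearman, 3))
```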
40To visualize the relationship between a categorical variable and a continuous variable, which pair of plots is most appropriate?
A.Scatter plot and Line plot
B.Box plot and Violin plot
C.Heatmap and Histogram
D.Pie chart and Bar chart
Correct Answer: Box plot and Violin plot
Explanation:Box plots and Violin plots are specifically designed to compare the distribution of a continuous variable across different categories.
41If a dataset has NaN values, how does df.dropna() handle them?
A.It fills them with zeros
B.It fills them with the mean
C.It removes the rows (or columns) containing missing values
D.It highlights them in red
Correct Answer: It removes the rows (or columns) containing missing values
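Explanation:dropna() removes missing values. By default it drops every row that contains at least one missing value; passing axis=1 drops the affected columns instead.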
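A minimal sketch of both behaviours on a hypothetical DataFrame. Note that dropna() returns a new object and leaves the original untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_kept = df.dropna()        # drops the row where "a" is missing
cols_kept = df.dropna(axis=1)  # drops column "a" because it contains a missing value

print(rows_kept)
print(cols_kept)
```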
42What is the primary difference between a Bar Chart and a Histogram?
A.Bar charts are for numerical data; Histograms for categorical
B.Histograms are for continuous numerical distributions; Bar charts are for categorical comparisons
C.There is no difference
D.Bar charts always touch each other; Histograms have gaps
Correct Answer: Histograms are for continuous numerical distributions; Bar charts are for categorical comparisons
Explanation:Histograms visualize the frequency distribution of continuous variables (bars touch to show continuity), whereas bar charts compare discrete categories (bars usually have gaps).
43Which of the following describes 'Discrete' quantitative data?
A.It can take any value within a range (e.g., height)
B.It can only take specific, separate values (e.g., number of students)
C.It is purely descriptive text
D.It is based on ranking
Correct Answer: It can only take specific, separate values (e.g., number of students)
Explanation:Discrete data consists of distinct, separate values, often integers (counts).
44In a heatmap, what does the color intensity typically represent?
A.The count of null values
B.The magnitude of the value or correlation coefficient
C.The index of the row
D.The data type
Correct Answer: The magnitude of the value or correlation coefficient
Explanation:In a heatmap, color variations (intensity or hue) correspond to the magnitude of the data value in that cell.
45If a variable has zero variance, what does it imply?
A.The variable is normally distributed
B.All values in the variable are the same
C.The variable has many outliers
D.The mean is zero
Correct Answer: All values in the variable are the same
Explanation:Variance measures the spread of data. If variance is zero, there is no spread, meaning all data points are identical.
46What is the first step in the EDA workflow after loading the data?
A.Training the machine learning model
B.Understanding the data structure (shape, types, head)
C.Hyperparameter tuning
D.Deploying the model
Correct Answer: Understanding the data structure (shape, types, head)
Explanation:The initial step involves inspecting the data to understand its dimensions, data types, and general look (using head/info/shape).
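A typical first look might resemble the sketch below. The DataFrame is a made-up stand-in for freshly loaded data, and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical dataset standing in for data that was just loaded.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [9.5, 12.0, 7.25, 15.0, 11.0, 8.0],
    "category": ["a", "b", "a", "c", "b", "a"],
})

print(df.shape)   # (rows, columns); shape is an attribute, not a method
print(df.dtypes)  # data type of each column
print(df.head())  # first 5 rows by default
df.info()         # index, columns, non-null counts, memory usage
```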
47Which plot is useful for visualizing the pairwise relationships and distributions for multiple variables in a dataset simultaneously?
A.Pair plot (Scatter matrix)
B.Box plot
C.Pie chart
D.Area plot
Correct Answer: Pair plot (Scatter matrix)
Explanation:A pair plot (or scatter matrix) plots pairwise relationships in a dataset. The diagonal plots usually show the univariate distribution of the data.
48In the context of Pandas, what is a DataFrame?
A.A 1D labeled array
B.A 2D labeled data structure with columns of potentially different types
C.A 3D array
D.A visualization tool
Correct Answer: A 2D labeled data structure with columns of potentially different types
Explanation:A DataFrame is the primary Pandas data structure, representing tabular data with rows and columns.
49Which statistic is most robust to outliers?
A.Mean
B.Range
C.Standard Deviation
D.Median
Correct Answer: Median
Explanation:The median is a resistant measure of center; it is not heavily influenced by extreme outliers, unlike the mean.
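The difference is easy to demonstrate with the standard library. The salary figures are hypothetical; a single extreme value is appended to show how each statistic reacts:

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]   # hypothetical values, in thousands
with_outlier = salaries + [1000]  # add one extreme observation

mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print("mean moved by:", mean_shift)
print("median moved by:", median_shift)
```

The mean jumps by more than 150, while the median moves by only 1.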
50Correlation does not imply:
A.Association
B.Relationship
C.Causation
D.Dependency
Correct Answer: Causation
Explanation:Just because two variables are correlated does not mean that one causes the other (e.g., ice cream sales and shark attacks are correlated due to temperature, not causation).