1What is the primary purpose of using a Pipeline in machine learning workflows, such as those in Scikit-Learn?
A.To visualize the neural network architecture
B.To chain together multiple processing steps (transformers) and a final estimator
C.To automatically select the best algorithm for the dataset
D.To deploy the model directly to a cloud server
Correct Answer: To chain together multiple processing steps (transformers) and a final estimator
Explanation:Pipelines allow you to sequentially apply a list of transformers to preprocess the data and conclude with a final estimator, ensuring that the same preprocessing steps are applied to both training and testing data.
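As a rough illustration, the sketch below chains one transformer and one estimator. It assumes scikit-learn is installed; the Iris dataset, the step names `scaler` and `clf`, and the choice of LogisticRegression are illustrative assumptions rather than anything the question prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One preprocessing step followed by a final estimator.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics from the training data only;
# score() reuses those statistics on the test data before predicting.
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
```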
2When using a Pipeline, how is data leakage prevented during Cross-Validation?
A.By fitting the transformer on the entire dataset before splitting
B.By using only the final estimator during cross-validation
C.By fitting transformers only on the training folds and applying them to the validation fold
D.By shuffling the data repeatedly
Correct Answer: By fitting transformers only on the training folds and applying them to the validation fold
Explanation:Pipelines ensure that preprocessing steps (like scaling or imputation) are fit only on the training subset of the cross-validation fold, preventing information from the validation set from leaking into the training process.
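One way to see this in practice is to pass the whole pipeline to `cross_val_score`, so the scaler is refit inside every fold. This is a minimal sketch; the breast-cancer dataset and the SVC estimator are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so for each of the 5 splits it is
# fit on the training folds only and then applied to the validation fold.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling `X` once before calling `cross_val_score` would instead let validation-fold statistics influence training, which is the leakage described in option A.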
3In a Scikit-Learn Pipeline, which method is called on the intermediate steps during the training phase?
A.predict()
B.transform()
C.fit_transform()
D.score()
Correct Answer: fit_transform()
Explanation:During training, fit_transform() is called on all intermediate steps to learn parameters and transform the data for the next step. The final step only has fit() called.
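The call order can be observed with a small recording transformer. `RecordingTransformer` and the `calls` list are hypothetical helpers written for this sketch, and the toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # records which methods the pipeline invokes on the transformer


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Identity transformer that logs the methods called on it."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X

    def fit_transform(self, X, y=None, **fit_params):
        calls.append("fit_transform")
        return X


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("recorder", RecordingTransformer()), ("clf", LogisticRegression())])
pipe.fit(X, y)      # intermediate step: fit_transform()
pipe.predict(X)     # intermediate step: transform()
```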
4Which of the following classes allows you to apply different transformers to different columns of an array or pandas DataFrame?
A.FeatureUnion
B.ColumnTransformer
C.FunctionTransformer
D.GridSearchCV
Correct Answer: ColumnTransformer
Explanation:ColumnTransformer allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.
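A small sketch of the idea, using a NumPy array with column indices. The toy data, the transformer names `num` and `cat`, and the choice of StandardScaler and OneHotEncoder are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Columns 0 and 1 are numeric; column 2 holds a category code (0, 1 or 2).
X = np.array([
    [1.0, 10.0, 0],
    [2.0, 20.0, 1],
    [3.0, 30.0, 2],
    [4.0, 40.0, 1],
])

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [0, 1]),  # scale the numeric columns
        ("cat", OneHotEncoder(), [2]),      # one-hot encode the category column
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: 2 scaled + 3 one-hot, concatenated side by side.
Xt = ct.fit_transform(X)
```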
5In K-Fold Cross-Validation, if K = 5, what percentage of the data is used for training in each iteration?
A.20%
B.50%
C.80%
D.100%
Correct Answer: 80%
Explanation:In 5-Fold CV, the data is split into 5 parts. In each iteration, 1 part (20%) is used for validation, and the remaining 4 parts (80%) are used for training.
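The arithmetic can be checked with scikit-learn's `KFold`. The 100-sample array is an assumption chosen so that the percentages are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 samples

kf = KFold(n_splits=5)
# For each iteration, record (training size, validation size).
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X)]
```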
6Which Cross-Validation strategy is recommended for classification problems where the target classes are imbalanced?
A.K-Fold
B.Stratified K-Fold
C.Leave-One-Out
D.TimeSeriesSplit
Correct Answer: Stratified K-Fold
Explanation:Stratified K-Fold ensures that each fold of the dataset has the same proportion of observations with a given label as the whole dataset, which is crucial for imbalanced classes.
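A minimal sketch of stratification on an imbalanced target. The 90/10 class split is a made-up example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
# Count the positives that land in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
```

Each validation fold of 20 samples keeps the 10% positive rate of the full dataset.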
7What is a significant drawback of Leave-One-Out Cross-Validation (LOOCV) on large datasets?
A.It has high bias
B.It is computationally expensive
C.It reduces the variance of the estimator significantly
D.It cannot handle categorical data
Correct Answer: It is computationally expensive
Explanation:LOOCV requires fitting the model N times (where N is the number of samples). For large datasets, this becomes extremely slow and computationally expensive.
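The cost difference shows up directly in the number of splits, each of which requires one model fit. The 1000-sample array below is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((1000, 3))  # 1000 samples

# One fit per split: LOOCV needs as many fits as there are samples.
loo_fits = LeaveOneOut().get_n_splits(X)
kfold_fits = KFold(n_splits=5).get_n_splits(X)
```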
8When debugging an algorithm, a Learning Curve plots the model performance (score) against:
A.The values of a specific hyperparameter
B.The number of training samples
C.The number of features
D.The time taken to train
Correct Answer: The number of training samples
Explanation:A Learning Curve shows the validation and training scores of an estimator for varying numbers of training samples. It helps determine if the model benefits from adding more data.
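The numbers behind such a plot can be computed with `learning_curve`. This is a sketch only; the digits dataset and GaussianNB are assumptions, and the plotting step is left out.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Evaluate the estimator at 5 training-set sizes, with 5-fold CV at each size.
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# Average over folds: one training score and one validation score per size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the learning curve: a persistent large gap suggests high variance, while two low, converged curves suggest high bias.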
9If a Learning Curve shows a high training score but a low validation score with a large gap between them, the model is suffering from:
A.High Bias (Underfitting)
B.High Variance (Overfitting)
C.Convergence failure
D.Data leakage
Correct Answer: High Variance (Overfitting)
Explanation:A large gap where the training error is low (high score) and validation error is high (low score) indicates the model has memorized the training data but generalizes poorly, which is High Variance (Overfitting).
10If both the training score and validation score converge to a low value (high error) on a Learning Curve, what is the diagnosis?
A.High Bias (Underfitting)
B.High Variance (Overfitting)
C.Optimal performance
D.Need for regularization
Correct Answer: High Bias (Underfitting)
Explanation:When both scores are low and close together, the model is not complex enough to capture the underlying pattern in the data, indicating High Bias.
11A Validation Curve is used to evaluate the effect of:
A.Training set size
B.A single hyperparameter
C.Feature scaling
D.Different random seeds
Correct Answer: A single hyperparameter
Explanation:Validation curves plot the training and validation scores against varying values of a single hyperparameter (e.g., C in SVM or max_depth in Decision Trees).
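A short sketch with `validation_curve`, sweeping `max_depth` of a decision tree. The dataset and the depth values are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = [1, 2, 4, 8, 16]

# One row of scores per hyperparameter value, one column per CV fold.
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```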
12In the context of the Bias-Variance tradeoff, increasing the complexity of a model usually leads to:
A.Lower Bias and Lower Variance
B.Higher Bias and Higher Variance
C.Lower Bias and Higher Variance
D.Higher Bias and Lower Variance
Correct Answer: Lower Bias and Higher Variance
Explanation:Complex models fit the training data very well (Low Bias) but become sensitive to noise in the training data, leading to poor generalization (High Variance).
13Which of the following actions is most likely to fix a model suffering from High Variance?
A.Adding more polynomial features
B.Reducing the size of the training set
C.Getting more training data
D.Decreasing the regularization parameter
Correct Answer: Getting more training data
Explanation:Adding more training data helps the model generalize better and reduces overfitting (High Variance). Reducing complexity or increasing regularization also helps.
14Which of the following actions is most likely to fix a model suffering from High Bias?
A.Removing features
B.Adding polynomial features or increasing model complexity
C.Increasing the regularization parameter
D.Using a simpler algorithm
Correct Answer: Adding polynomial features or increasing model complexity
Explanation:High Bias means the model is too simple. Adding features (like polynomial terms) or switching to a more complex model helps capture the data patterns.
15What is Model Serialization?
A.Converting a trained model into a format that can be stored or transmitted
B.Training a model in a serial sequence rather than parallel
C.Assigning a unique serial number to a model version
D.Converting categorical features into serial integers
Correct Answer: Converting a trained model into a format that can be stored or transmitted
Explanation:Serialization (often called pickling in Python) is the process of converting an object hierarchy (the trained model) into a byte stream to save to a disk or send over a network.
16Which Python library is commonly used for serializing Scikit-Learn models, particularly efficient for NumPy arrays?
A.json
B.joblib
C.pandas
D.csv
Correct Answer: joblib
Explanation:joblib is often recommended over Python's built-in pickle for Scikit-Learn estimators because it is more efficient at handling large NumPy arrays internally.
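A minimal save-and-load round trip with joblib. The file name `model.joblib`, the temporary directory, and the random forest are assumptions for this sketch.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)       # serialize the fitted estimator to disk
    restored = joblib.load(path)   # deserialize it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```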
17What is the inverse process of serialization called, where a saved model is loaded back into memory?
A.Decoding
B.Deserialization
C.Parsing
D.Unzipping
Correct Answer: Deserialization
Explanation:Deserialization (or unpickling) is the process of reconstructing a Python object from the serialized byte stream.
18What is a major security risk associated with deserializing data using pickle or joblib?
A.The model accuracy decreases
B.Arbitrary code execution if the file is malicious
C.The file size becomes too large
D.It converts float64 to float32
Correct Answer: Arbitrary code execution if the file is malicious
Explanation:Pickle/Joblib files can instruct the loader to call arbitrary functions while the object is being reconstructed. If you unpickle a file from an untrusted source, it can execute malicious commands on your system.
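A harmless demonstration of the mechanism: the hypothetical `Payload` class below tells pickle to call `eval` on load. Here the expression is just arithmetic, but an attacker could substitute any command.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs code when loaded."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes executes eval("6 * 7"); no Payload instance is rebuilt.
result = pickle.loads(blob)
```

`result` is the value returned by the injected call, which shows that loading alone was enough to run code.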
19Which format is specifically designed as an open standard for representing machine learning models to allow interoperability between different frameworks (e.g., PyTorch to ONNX Runtime)?
Correct Answer: ONNX (Open Neural Network Exchange)
Explanation:ONNX provides a standard definition of the computation graph, allowing models trained in one framework to be exported and run in another environment.
20What does Model Deployment generally refer to?
A.The process of cleaning data
B.Integrating a machine learning model into an existing production environment to make practical business decisions
C.Hyperparameter tuning using GridSearch
D.Writing the documentation for the model
Correct Answer: Integrating a machine learning model into an existing production environment to make practical business decisions
Explanation:Deployment is the stage where the model is moved from a research/development environment to a live application where it serves predictions to users or systems.
21In a Web Service deployment (e.g., using Flask or FastAPI), how does a client typically request a prediction?
A.By emailing the dataset to the server
B.By sending an HTTP request (usually POST) with data in JSON format
C.By downloading the model file and running it locally
D.By connecting via SSH to the server console
Correct Answer: By sending an HTTP request (usually POST) with data in JSON format
Explanation:Web services expose endpoints (APIs). Clients send data payloads (usually JSON) via HTTP methods like POST, and the server returns the prediction.
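The request and response cycle can be sketched with the standard library alone. The `/predict` path, the `features` and `prediction` JSON keys, and the `predict` function standing in for a trained model are all hypothetical; a real service would more likely use Flask or FastAPI.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    # Hypothetical stand-in for a trained model.
    return sum(features) > 10


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload sent by the client.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: POST the input features as JSON and read the prediction.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request(
    "POST",
    "/predict",
    body=json.dumps({"features": [3, 4, 5]}),
    headers={"Content-Type": "application/json"},
)
response = conn.getresponse()
status = response.status
result = json.loads(response.read())
conn.close()

server.shutdown()
server.server_close()
```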
22What is Containerization (e.g., using Docker) useful for in model deployment?
A.It increases the model's accuracy automatically
B.It packages the model with all its dependencies (libraries, OS settings) to ensure consistency across environments
C.It compresses the model to a smaller file size
D.It converts Python code to C++
Correct Answer: It packages the model with all its dependencies (libraries, OS settings) to ensure consistency across environments
Explanation:Docker containers encapsulate the application and its environment, solving the 'it works on my machine' problem and ensuring the model runs the same way in production.
23What does Serverless deployment mean for a machine learning model?
A.The model runs without any hardware physically existing anywhere
B.The developer manages the physical servers and operating system updates
C.The cloud provider dynamically manages the allocation of machine resources, and you pay only for the compute time used
D.It requires a dedicated server running 24/7
Correct Answer: The cloud provider dynamically manages the allocation of machine resources, and you pay only for the compute time used
Explanation:Serverless computing abstracts the infrastructure. Code runs in response to events (triggers), and the provider handles scaling and provisioning, charging only for execution time.
24Which of the following is a disadvantage of Local Deployment (running the model on the user's device, e.g., mobile app)?
A.High latency due to network transfer
B.Dependence on internet connectivity
C.Limited computational resources (battery, CPU/RAM) on the device
D.Data privacy concerns
Correct Answer: Limited computational resources (battery, CPU/RAM) on the device
Explanation:Local devices (phones, IoT) have limited power and processing capability compared to cloud servers, restricting the size and complexity of models that can be deployed.
25In the context of deployment, what is Concept Drift?
A.The model code changing over time due to git commits
B.The statistical properties of the target variable change over time, making the model less accurate
C.Moving the model from one cloud provider to another
D.The loss of floating-point precision during serialization
Correct Answer: The statistical properties of the target variable change over time, making the model less accurate
Explanation:Concept drift occurs when the relationship between input data and the target variable changes over time (e.g., fraud patterns changing), requiring model retraining.
26What is Batch Prediction (Offline Inference)?
A.Generating predictions for a large set of observations at once, often on a schedule
B.Generating a prediction immediately when a user clicks a button
C.Training the model in batches
D.Grouping multiple models together
Correct Answer: Generating predictions for a large set of observations at once, often on a schedule
Explanation:Batch prediction involves processing large volumes of data (e.g., nightly jobs) where immediate real-time results are not required.
27Which HTTP method is most appropriate for a REST API endpoint that accepts input features and returns a model prediction?
A.GET
B.POST
C.DELETE
D.HEAD
Correct Answer: POST
Explanation:POST is used because the client is sending data (the input features) to the server to be processed. GET is generally for retrieving resources and has URL length limits.
28Why is Pipeline serialization preferred over serializing just the model estimator?
A.It makes the file smaller
B.It ensures that raw data fed into the loaded object undergoes the exact same preprocessing steps as training data
C.Pipelines cannot be serialized
D.It is faster to load
Correct Answer: It ensures that raw data fed into the loaded object undergoes the exact same preprocessing steps as training data
Explanation:If you only save the model, you must manually reproduce the preprocessing logic in production. Saving the Pipeline guarantees the preprocessing transforms are bundled with the estimator.
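A sketch of the benefit: after a pickle round trip, the restored pipeline accepts raw, unscaled rows. The wine dataset and the estimator are assumptions, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from a file.

```python
import pickle

import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Serialize the whole pipeline: the fitted scaler travels with the model.
restored = pickle.loads(pickle.dumps(pipe))

# Raw rows go straight in; scaling happens inside the restored object.
raw_rows = X[:5]
same_predictions = np.array_equal(restored.predict(raw_rows), pipe.predict(raw_rows))
```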
29In Scikit-Learn, how do you perform a Grid Search over parameters inside a Pipeline?
A.You cannot perform Grid Search on a Pipeline
B.Pass the parameters directly to the estimator
C.Use the syntax step_name__parameter_name in the param_grid
D.Modify the source code of the library
Correct Answer: Use the syntax step_name__parameter_name in the param_grid
Explanation:To access parameters of steps within a Pipeline, Scikit-Learn uses a double underscore syntax (e.g., classifier__C) to specify which step the parameter belongs to.
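A minimal sketch of the double-underscore syntax. The step name `classifier` follows the example in the explanation; the dataset and the candidate values are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])

# "<step name>__<parameter name>" routes each value to the right step.
param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
best_params = search.best_params_
```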
30What is Nested Cross-Validation used for?
A.To tune hyperparameters without biasing the model evaluation
B.To visualize the data in 3D
C.To reduce the training time
D.To use multiple models simultaneously
Correct Answer: To tune hyperparameters without biasing the model evaluation
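Explanation:Nested CV uses an inner loop for hyperparameter tuning and an outer loop for error estimation. Because the outer test folds never influence the tuning, the resulting estimate of the generalization error is not optimistically biased by the hyperparameter search.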
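In scikit-learn, one common way to express this is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The dataset, the grid, and the fold counts below are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold search over C, rerun inside every outer training split.
inner_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10]},
    cv=3,
)

# Outer loop: 5 held-out folds that the tuning never sees.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
mean_score = outer_scores.mean()
```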
31Which cross-validation method is appropriate for time-series data?
A.Random K-Fold
B.ShuffleSplit
C.TimeSeriesSplit (Rolling basis)
D.Leave-One-Out
Correct Answer: TimeSeriesSplit (Rolling basis)
Explanation:Time series data relies on temporal order. Random splitting would train on future data to predict the past (leakage). TimeSeriesSplit respects the temporal order.
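The index pattern can be inspected directly. Note that scikit-learn's `TimeSeriesSplit` grows the training window with each split by default; the 12-sample series is a toy assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

# Every training index precedes every test index in each split.
ordered = all(max(train) < min(test) for train, test in splits)
```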
32In a validation curve, if the training score is 0.99 and the validation score is 0.60, what should you do regarding the hyperparameter being tested?
A.Keep the parameter as is
B.Adjust the parameter to increase model complexity
C.Adjust the parameter to decrease model complexity (increase regularization)
D.Stop collecting data
Correct Answer: Adjust the parameter to decrease model complexity (increase regularization)
Explanation:This large gap indicates overfitting. You should adjust the hyperparameter to constrain the model (e.g., increase regularization or reduce depth).
33What is the primary benefit of Microservices architecture for ML deployment?
A.It puts all code into one giant script
B.It allows the ML model to be developed, deployed, and scaled independently of the main application
C.It removes the need for APIs
D.It eliminates the need for data preprocessing
Correct Answer: It allows the ML model to be developed, deployed, and scaled independently of the main application
Explanation:Microservices decouple the ML model service from the rest of the application, allowing teams to update the model or scale its infrastructure without redeploying the whole app.
34Which file extension is commonly associated with a Python pickled model?
A..txt
B..pkl
C..html
D..css
Correct Answer: .pkl
Explanation:.pkl or .pickle are standard conventions for naming files created by the Python pickle module.
35A Canary Deployment strategy involves:
A.Releasing the model to a small percentage of users first to monitor performance before full rollout
B.Deploying the model to a coal mine
C.Replacing the old model instantly for all users
D.Running the model only on weekends
Correct Answer: Releasing the model to a small percentage of users first to monitor performance before full rollout
Explanation:Canary deployment is a risk-reduction strategy where the new version is rolled out to a small subset of users to catch issues early.
36In the Scikit-Learn call Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())]), what object does steps expect?
A.A dictionary of parameters
B.A list of (name, transform) tuples
C.A JSON string
D.A pandas DataFrame
Correct Answer: A list of (name, transform) tuples
Explanation:The steps argument takes a list of tuples, where the first element is a string name and the second is the estimator/transformer object.
37Which metric on a learning curve would indicate that obtaining more data is NOT worth the cost?
A.The training score is increasing rapidly
B.The validation score has plateaued and converged with the training score
C.The gap between training and validation score is widening
D.The validation score is fluctuating wildly
Correct Answer: The validation score has plateaued and converged with the training score
Explanation:If the validation score has plateaued and is close to the training score (convergence), adding more data is unlikely to improve performance significantly; the limiting factor is the model's capacity, not the amount of data.
38What is the purpose of make_pipeline in Scikit-Learn compared to the Pipeline class constructor?
A.It creates a pipeline that runs faster
B.It automatically names the steps based on the class names of the estimators
C.It only supports regression models
D.It does not support cross-validation
Correct Answer: It automatically names the steps based on the class names of the estimators
Explanation:make_pipeline is a utility function that generates names for the steps automatically (lowercased class name), saving you from typing the name strings manually.
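A quick comparison of the two constructors, using StandardScaler and SVC as example steps:

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Explicit names chosen by the user.
explicit = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# Names generated from the lowercased class names.
automatic = make_pipeline(StandardScaler(), SVC())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]
```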
39When deploying a model via a REST API, what is Latency?
A.The number of requests the server can handle per second
B.The time taken from sending the request to receiving the prediction
C.The accuracy of the model
D.The cost of the server
Correct Answer: The time taken from sending the request to receiving the prediction
Explanation:Latency is the delay or time duration required to process a single inference request.
40What is PMML (Predictive Model Markup Language)?
A.A python library for plotting
B.An XML-based standard for representing predictive models
C.A cloud service provider
D.A type of neural network
Correct Answer: An XML-based standard for representing predictive models
Explanation:PMML is an XML-based format used to describe statistical and data mining models to share them between compliant applications.
41In Repeated K-Fold Cross-Validation:
A.K-Fold CV is run n times with different randomization splits
B.The same K-Fold split is repeated exactly
C.The model is trained repeatedly on the same fold
D.Data is duplicated n times before splitting
Correct Answer: K-Fold CV is run n times with different randomization splits
Explanation:Repeated K-Fold repeats the K-Fold procedure n times, reshuffling the data to produce different splits on each repetition, which provides a more robust estimate of model performance.
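With scikit-learn's `RepeatedKFold`, the total number of fits is the number of folds times the number of repetitions. The values 5 and 3 below are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(-1, 1)

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
total_splits = rkf.get_n_splits()  # 5 folds x 3 repetitions

# Validation folds of the first split in each repetition (splits 0, 5 and 10).
all_splits = [val.tolist() for _, val in rkf.split(X)]
first_fold_per_repeat = [all_splits[i] for i in (0, 5, 10)]
```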
42What is the formula for Total Error in the context of Bias-Variance decomposition?
A.
B.
C.
D.
Correct Answer: $\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
Explanation:The expected test error decomposes into the square of the bias, the variance, and the irreducible error (noise).
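Written out for squared-error loss, the decomposition takes the following form. The notation is an assumption introduced here: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fit on a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Error}}
```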
43Which Scikit-Learn utility helps ensure that the training and testing sets have the same distribution of classes?
A.train_test_split(..., stratify=y)
B.train_test_split(..., shuffle=False)
C.StandardScaler()
D.Pipeline()
Correct Answer: train_test_split(..., stratify=y)
Explanation:The stratify parameter in train_test_split ensures the split preserves the percentage of samples for each class found in the original target variable y.
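A small check of the effect, using a made-up target with 20% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # 20% positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Both subsets keep the 20% positive rate of the full target.
train_rate = y_train.mean()
test_rate = y_test.mean()
```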
44Why might a FunctionTransformer be included in a Pipeline?
A.To execute a custom Python function (like log transformation) as a step
B.To transform the pipeline into a function
C.To debug the pipeline
D.To plot the data
Correct Answer: To execute a custom Python function (like log transformation) as a step
Explanation:FunctionTransformer allows you to wrap an arbitrary function (e.g., np.log1p) to make it compatible with the Scikit-Learn transformer API (fit/transform).
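A minimal sketch that wraps `np.log1p` as a pipeline step. The step names and the LinearRegression estimator are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function so it exposes fit() and transform().
log_step = FunctionTransformer(np.log1p)

X = np.array([[0.0], [np.e - 1.0], [np.e**2 - 1.0]])
Xt = log_step.fit_transform(X)  # log1p maps these values to 0, 1 and 2

# The wrapped function can now sit in a pipeline like any other transformer.
pipe = Pipeline([("log", log_step), ("reg", LinearRegression())])
pipe.fit(X, np.array([0.0, 1.0, 2.0]))
```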
45When serializing a model that relies on external code files (custom classes), what common issue arises?
A.The pickle file works everywhere automatically
B.The pickle file may fail to load if the custom class definition is not available in the loading environment
C.The file size doubles
D.The model becomes a regression model
Correct Answer: The pickle file may fail to load if the custom class definition is not available in the loading environment
Explanation:Pickle saves the structure and data, but not the code of custom classes. The environment loading the pickle must have the class definition importable.
46Which of the following is an example of Online Learning deployment?
A.Training a model once a year
B.The model updates its weights incrementally as new data streams in
C.Uploading the model to the internet
D.Using a static HTML page
Correct Answer: The model updates its weights incrementally as new data streams in
Explanation:Online learning involves updating the model continuously as new data arrives, rather than retraining from scratch on the whole dataset.
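In scikit-learn, incremental updates are available through `partial_fit` on estimators that support it, such as SGDClassifier. The simulated data stream below is an assumption made up for this sketch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared on the first call


def next_batch(n_rows):
    # Hypothetical stream: the label depends on the sign of x0 + x1.
    X_batch = rng.normal(size=(n_rows, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    return X_batch, y_batch


# Update the weights one mini-batch at a time instead of retraining from scratch.
for _ in range(20):
    X_batch, y_batch = next_batch(50)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new, y_new = next_batch(200)
accuracy = model.score(X_new, y_new)
```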
47What is the purpose of the random_state parameter in Cross-Validation splitters?
A.To improve accuracy
B.To ensure reproducibility of the splits
C.To randomise the hyperparameters
D.To delete random data
Correct Answer: To ensure reproducibility of the splits
Explanation:Setting a fixed random_state ensures that the random shuffling of data produces the exact same splits every time the code is run. It only takes effect when the splitter actually shuffles (e.g., shuffle=True in KFold).
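A quick reproducibility check. `first_validation_fold` is a hypothetical helper written for this sketch, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(50).reshape(-1, 1)


def first_validation_fold(seed):
    # Hypothetical helper: indices of the first validation fold for a seed.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    _, val_idx = next(kf.split(X))
    return val_idx.tolist()


# The same seed yields the same split on every run.
run_a = first_validation_fold(42)
run_b = first_validation_fold(42)
reproducible = run_a == run_b
```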
48In a pipeline, what happens if you call predict()?
A.It calls fit() on all steps
B.It calls transform() on all transformers and predict() on the final estimator
C.It calls predict() on all steps
D.It returns an error
Correct Answer: It calls transform() on all transformers and predict() on the final estimator
Explanation:During prediction, the pipeline passes the data through the transform method of all preprocessing steps and finally calls predict on the last step (the estimator).
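The equivalence can be verified by applying the steps by hand and comparing with `pipe.predict`. The dataset and the two steps are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC()).fit(X, y)

# Manual version: transform() on the preprocessing step, then predict().
scaler = pipe.named_steps["standardscaler"]
svc = pipe.named_steps["svc"]
manual_predictions = svc.predict(scaler.transform(X))

matches = np.array_equal(manual_predictions, pipe.predict(X))
```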
49Which tool is commonly used to create an isolated environment for Python dependencies to avoid version conflicts during development and deployment?
A.Pipenv / Virtualenv / Conda
B.Notepad
C.Chrome
D.Excel
Correct Answer: Pipenv / Virtualenv / Conda
Explanation:Virtual environments allow you to manage project-specific dependencies, ensuring that the deployed model uses the exact library versions it was trained with.
50What is the typical use of A/B Testing in model deployment?
A.Comparing two different models (Model A and Model B) on live traffic to see which performs better
B.Checking if the model works on inputs A and B
C.Testing the model in Alpha and Beta stages
D.Debugging syntax errors
Correct Answer: Comparing two different models (Model A and Model B) on live traffic to see which performs better
Explanation:A/B testing involves directing a subset of users to the new model (B) while others use the current model (A) to statistically compare performance (conversion, click-through, etc.).