Unit 6 - Practice Quiz

INT395

1 What is the primary purpose of using a Pipeline in machine learning workflows, such as those in Scikit-Learn?

A. To visualize the neural network architecture
B. To chain together multiple processing steps (transformers) and a final estimator
C. To automatically select the best algorithm for the dataset
D. To deploy the model directly to a cloud server

2 When using a Pipeline, how is data leakage prevented during Cross-Validation?

A. By fitting the transformer on the entire dataset before splitting
B. By using only the final estimator during cross-validation
C. By fitting transformers only on the training folds and applying them to the validation fold
D. By shuffling the data repeatedly

3 In a Scikit-Learn Pipeline, which method is called on the intermediate steps during the training phase?

A. predict()
B. transform()
C. fit_transform()
D. score()
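
The Pipeline behaviour covered in questions 1 to 3 can be sketched as follows. This is a minimal illustration, not a prescribed solution: the Iris dataset, the step names "scaler" and "clf", and the choice of LogisticRegression are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset; any (X, y) pair would do.
X, y = load_iris(return_X_y=True)

# Transformers come first, the final estimator comes last.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline on each training split,
# so the scaler never sees the held-out fold while it is being fitted.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the whole pipeline to cross_val_score, rather than pre-scaling X, is what keeps the preprocessing inside each fold.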

4 Which of the following classes allows you to apply different transformers to different columns of an array or pandas DataFrame?

A. FeatureUnion
B. ColumnTransformer
C. FunctionTransformer
D. GridSearchCV
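
Column-wise preprocessing of the kind question 4 asks about might look like the sketch below. The DataFrame, its column names, and the transformer labels "num" and "cat" are hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, 81000.0, 95000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Scale the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

Xt = preprocess.fit_transform(df)
print(Xt.shape)  # 2 scaled columns + 3 one-hot columns
```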

5 In K-Fold Cross-Validation, if K = 5, what percentage of the data is used for training in each iteration?

A. 20%
B. 50%
C. 80%
D. 100%
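
In K-Fold Cross-Validation each iteration trains on K - 1 of the K folds, so the training share is (K - 1) / K. A small sketch to check this empirically, assuming K = 5 and a made-up array of 100 samples (both chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 dummy samples; the values themselves do not matter here.
X = np.arange(100).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Fraction of the data that lands in the training split of each iteration.
fractions = [len(train_idx) / len(X) for train_idx, _ in kf.split(X)]
print(fractions)
```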

6 Which Cross-Validation strategy is recommended for classification problems where the target classes are imbalanced?

A. K-Fold
B. Stratified K-Fold
C. Leave-One-Out
D. TimeSeriesSplit
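
To see how a stratified splitter treats an imbalanced target, the following sketch counts minority-class samples per validation fold. The 90/10 class split and the placeholder feature matrix are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Number of minority-class samples in each validation fold.
minority_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(minority_per_fold)
```

Each validation fold keeps roughly the same class ratio as the full dataset, which plain K-Fold does not guarantee.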

7 What is a significant drawback of Leave-One-Out Cross-Validation (LOOCV) on large datasets?

A. It has high bias
B. It is computationally expensive
C. It reduces the variance of the estimator significantly
D. It cannot handle categorical data

8 When debugging an algorithm, a Learning Curve plots the model performance (score) against:

A. The values of a specific hyperparameter
B. The number of training samples
C. The number of features
D. The time taken to train

9 If a Learning Curve shows a high training score but a low validation score with a large gap between them, the model is suffering from:

A. High Bias (Underfitting)
B. High Variance (Overfitting)
C. Convergence failure
D. Data leakage

10 If both the training score and validation score converge to a low value (high error) on a Learning Curve, what is the diagnosis?

A. High Bias (Underfitting)
B. High Variance (Overfitting)
C. Optimal performance
D. Need for regularization
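
The learning-curve diagnostics in questions 8 to 10 can be reproduced with scikit-learn's learning_curve helper. This is a sketch under assumptions: the digits dataset and an unpruned DecisionTreeClassifier are illustrative choices, picked because such a tree tends to fit its training data almost perfectly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Score the model at five increasing training-set sizes, with 5-fold CV at each size.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean  # a persistent large gap suggests high variance

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(n, round(tr, 3), round(va, 3))
```

Plotting train_mean and val_mean against train_sizes gives the usual learning-curve figure; the gap between the two lines is what the diagnosis rests on.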

11 A Validation Curve is used to evaluate the effect of:

A. Training set size
B. A single hyperparameter
C. Feature scaling
D. Different random seeds
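
A validation curve can be computed with validation_curve, as in the sketch below. The digits dataset, the SVC estimator, and the gamma range are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Candidate values for the one hyperparameter being varied.
param_range = np.logspace(-6, -1, 4)

train_scores, val_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=3,
)

# Pick the value with the best mean validation score.
best_gamma = param_range[val_scores.mean(axis=1).argmax()]
print(best_gamma)
```

Unlike a learning curve, the x-axis here is the hyperparameter value, while the training set size stays fixed.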

12 In the context of the Bias-Variance tradeoff, increasing the complexity of a model usually leads to:

A. Lower Bias and Lower Variance
B. Higher Bias and Higher Variance
C. Lower Bias and Higher Variance
D. Higher Bias and Lower Variance

13 Which of the following actions is most likely to fix a model suffering from High Variance?

A. Adding more polynomial features
B. Reducing the size of the training set
C. Getting more training data
D. Decreasing the regularization parameter

14 Which of the following actions is most likely to fix a model suffering from High Bias?

A. Removing features
B. Adding polynomial features or increasing model complexity
C. Increasing the regularization parameter
D. Using a simpler algorithm

15 What is Model Serialization?

A. Converting a trained model into a format that can be stored or transmitted
B. Training a model in a serial sequence rather than parallel
C. Assigning a unique serial number to a model version
D. Converting categorical features into serial integers

16 Which Python library is commonly used for serializing Scikit-Learn models, particularly efficient for NumPy arrays?

A. json
B. joblib
C. pandas
D. csv

17 What is the inverse process of serialization called, where a saved model is loaded back into memory?

A. Decoding
B. Deserialization
C. Parsing
D. Unzipping

18 What is a major security risk associated with deserializing data using pickle or joblib?

A. The model accuracy decreases
B. Arbitrary code execution if the file is malicious
C. The file size becomes too large
D. It converts float64 to float32

19 Which format is specifically designed as an open standard for representing machine learning models to allow interoperability between different frameworks (e.g., PyTorch to ONNX Runtime)?

A. CSV
B. ONNX (Open Neural Network Exchange)
C. Pickle
D. HDF5

20 What does Model Deployment generally refer to?

A. The process of cleaning data
B. Integrating a machine learning model into an existing production environment to make practical business decisions
C. Hyperparameter tuning using GridSearch
D. Writing the documentation for the model

21 In a Web Service deployment (e.g., using Flask or FastAPI), how does a client typically request a prediction?

A. By emailing the dataset to the server
B. By sending an HTTP request (usually POST) with data in JSON format
C. By downloading the model file and running it locally
D. By connecting via SSH to the server console

22 What is Containerization (e.g., using Docker) useful for in model deployment?

A. It increases the model's accuracy automatically
B. It packages the model with all its dependencies (libraries, OS settings) to ensure consistency across environments
C. It compresses the model to a smaller file size
D. It converts Python code to C++

23 What characterizes Serverless deployment (e.g., AWS Lambda, Azure Functions)?

A. The model runs without any hardware physically existing anywhere
B. The developer manages the physical servers and operating system updates
C. The cloud provider dynamically manages the allocation of machine resources, and you pay only for the compute time used
D. It requires a dedicated server running 24/7

24 Which of the following is a disadvantage of Local Deployment (running the model on the user's device, e.g., mobile app)?

A. High latency due to network transfer
B. Dependence on internet connectivity
C. Limited computational resources (battery, CPU/RAM) on the device
D. Data privacy concerns

25 In the context of deployment, what is Concept Drift?

A. The model code changing over time due to git commits
B. The statistical properties of the target variable change over time, making the model less accurate
C. Moving the model from one cloud provider to another
D. The loss of floating-point precision during serialization

26 What is Batch Prediction (Offline Inference)?

A. Generating predictions for a large set of observations at once, often on a schedule
B. Generating a prediction immediately when a user clicks a button
C. Training the model in batches
D. Grouping multiple models together

27 Which HTTP method is most appropriate for a REST API endpoint that accepts input features and returns a model prediction?

A. GET
B. POST
C. DELETE
D. HEAD

28 Why is Pipeline serialization preferred over serializing just the model estimator?

A. It makes the file smaller
B. It ensures that raw data fed into the loaded object undergoes the exact same preprocessing steps as training data
C. Pipelines cannot be serialized
D. It is faster to load
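
Serializing a whole pipeline with joblib might look like the following sketch. The file name iris_pipeline.joblib, the Iris dataset, and the chosen steps are hypothetical and used only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fit preprocessing and model together so both are saved as one object.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)        # serialization
    restored = joblib.load(path)   # deserialization; only load files you trust

# The restored pipeline accepts raw, unscaled features, just like the original.
same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```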

29 In Scikit-Learn, how do you perform a Grid Search over parameters inside a Pipeline?

A. You cannot perform Grid Search on a Pipeline
B. Pass the parameters directly to the estimator
C. Use the syntax step_name__parameter_name in the param_grid
D. Modify the source code of the library
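
A grid search over parameters of a pipeline step can be sketched as below. The step names "scaler" and "svc" and the parameter values are illustrative assumptions; the double-underscore separator between step name and parameter name is scikit-learn's convention.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```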

30 What is Nested Cross-Validation used for?

A. To tune hyperparameters without biasing the model evaluation
B. To visualize the data in 3D
C. To reduce the training time
D. To use multiple models simultaneously
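
One common way to set up Nested Cross-Validation in scikit-learn is to wrap a GridSearchCV (the inner loop) inside cross_val_score (the outer loop). The sketch below assumes the Iris dataset, an SVC, and a small grid for C, all chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

# Inner loop: hyperparameter search, rerun inside every outer training split.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: each score comes from data the inner search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean())
```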

31 Which cross-validation method is appropriate for time-series data?

A. Random K-Fold
B. ShuffleSplit
C. TimeSeriesSplit (Rolling basis)
D. Leave-One-Out
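
For ordered data, the splits produced by TimeSeriesSplit can be inspected directly. The 12-sample array below is a made-up stand-in for a time-ordered dataset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, assumed to be in chronological order.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    print("train:", train, "test:", test)
```

Every training index precedes every test index, so the model is never trained on observations from the future.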

32 In a validation curve, if the training score is 0.99 and the validation score is 0.60, what should you do regarding the hyperparameter being tested?

A. Keep the parameter as is
B. Adjust the parameter to increase model complexity
C. Adjust the parameter to decrease model complexity (increase regularization)
D. Stop collecting data

33 What is the primary benefit of Microservices architecture for ML deployment?

A. It puts all code into one giant script
B. It allows the ML model to be developed, deployed, and scaled independently of the main application
C. It removes the need for APIs
D. It eliminates the need for data preprocessing

34 Which file extension is commonly associated with a Python pickled model?

A. .txt
B. .pkl
C. .html
D. .css

35 A Canary Deployment strategy involves:

A. Releasing the model to a small percentage of users first to monitor performance before full rollout
B. Deploying the model to a coal mine
C. Replacing the old model instantly for all users
D. Running the model only on weekends

36 In Scikit-Learn, consider Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())]). What object does the steps argument expect?

A. A dictionary of parameters
B. A list of (name, transform) tuples
C. A JSON string
D. A pandas DataFrame

37 Which pattern on a learning curve would indicate that obtaining more data is NOT worth the cost?

A. The training score is increasing rapidly
B. The validation score has plateaued and converged with the training score
C. The gap between training and validation score is widening
D. The validation score is fluctuating wildly

38 What is the purpose of make_pipeline in Scikit-Learn compared to the Pipeline class constructor?

A. It creates a pipeline that runs faster
B. It automatically names the steps based on the class names of the estimators
C. It only supports regression models
D. It does not support cross-validation
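
The difference between the two ways of building a pipeline shows up in the step names. A short sketch, using StandardScaler and SVC as illustrative steps:

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Explicit constructor: the caller supplies each step name.
explicit = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# Helper function: names are derived from the lowercased class names.
auto = make_pipeline(StandardScaler(), SVC())

explicit_names = [name for name, _ in explicit.steps]
auto_names = [name for name, _ in auto.steps]
print(explicit_names)  # ['scaler', 'svc']
print(auto_names)      # ['standardscaler', 'svc']
```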

39 When deploying a model via a REST API, what is Latency?

A. The number of requests the server can handle per second
B. The time taken from sending the request to receiving the prediction
C. The accuracy of the model
D. The cost of the server

40 What is PMML (Predictive Model Markup Language)?

A. A python library for plotting
B. An XML-based standard for representing predictive models
C. A cloud service provider
D. A type of neural network

41 In Repeated K-Fold Cross-Validation:

A. K-Fold CV is run n times with different random splits
B. The same K-Fold split is repeated exactly
C. The model is trained repeatedly on the same fold
D. Data is duplicated n times before splitting
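
scikit-learn exposes this strategy as RepeatedKFold. The sketch below assumes 5 folds, 3 repetitions, and 20 dummy samples, all chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(-1, 1)  # dummy samples

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Total number of train/validation iterations: n_splits * n_repeats.
n_iterations = rkf.get_n_splits(X)
all_splits = list(rkf.split(X))
print(n_iterations)
```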

42 What is the formula for Total Error in the context of Bias-Variance decomposition?

A.
B.
C.
D.
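
For reference, the standard decomposition of the expected squared error at a point x is usually written as follows. The notation is an assumption, not taken from this quiz: \hat{f} is the fitted model, y the observed target, and \sigma^2 the variance of the irreducible noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \mathrm{Bias}\big[\hat{f}(x)\big]^2
  + \mathrm{Var}\big[\hat{f}(x)\big]
  + \sigma^2
```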

43 Which Scikit-Learn utility helps ensure that the training and testing sets have the same distribution of classes?

A. train_test_split(..., stratify=y)
B. train_test_split(..., shuffle=False)
C. StandardScaler()
D. Pipeline()
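
The effect of stratification on a single train/test split can be checked with a small experiment. The 80/20 class split and the 25% test size below are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# Share of positives in each split.
train_pos = y_train.mean()
test_pos = y_test.mean()
print(train_pos, test_pos)
```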

44 Why might a FunctionTransformer be included in a Pipeline?

A. To execute a custom Python function (like log transformation) as a step
B. To transform the pipeline into a function
C. To debug the pipeline
D. To plot the data
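
A FunctionTransformer step might be used as in the sketch below, which wraps NumPy's log1p. The toy data and the LinearRegression final step are assumptions chosen so the result is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function so it can be used as a pipeline step.
log_step = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X = np.array([[0.0], [9.0], [99.0], [999.0]])
y = 2.0 * np.log1p(X).ravel()  # target is linear in log1p(x) by construction

model = make_pipeline(log_step, LinearRegression()).fit(X, y)
pred = model.predict(X)
print(pred)
```

Because the log transform lives inside the pipeline, callers pass raw feature values to predict() and the transform is applied for them.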

45 When serializing a model that relies on external code files (custom classes), what common issue arises?

A. The pickle file works everywhere automatically
B. The pickle file may fail to load if the custom class definition is not available in the loading environment
C. The file size doubles
D. The model becomes a regression model

46 Which of the following is an example of Online Learning deployment?

A. Training a model once a year
B. The model updates its weights incrementally as new data streams in
C. Uploading the model to the internet
D. Using a static HTML page

47 What is the purpose of the random_state parameter in Cross-Validation splitters?

A. To improve accuracy
B. To ensure reproducibility of the splits
C. To randomize the hyperparameters
D. To delete random data

48 In a pipeline, what happens if you call predict()?

A. It calls fit() on all steps
B. It calls transform() on all transformers and predict() on the final estimator
C. It calls predict() on all steps
D. It returns an error

49 Which tool is commonly used to create an isolated environment for Python dependencies to avoid version conflicts during development and deployment?

A. Pipenv / Virtualenv / Conda
B. Notepad
C. Chrome
D. Excel

50 What is the typical use of A/B Testing in model deployment?

A. Comparing two different models (Model A and Model B) on live traffic to see which performs better
B. Checking if the model works on inputs A and B
C. Testing the model in Alpha and Beta stages
D. Debugging syntax errors