1. What is the primary function of a tool like Tableau?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Easy
A. To store large amounts of unstructured data
B. To train complex neural networks from scratch
C. To create interactive data visualizations and dashboards
D. To write and compile computer code
Correct Answer: To create interactive data visualizations and dashboards
Explanation:
Tableau is a leading data visualization tool used to create charts, graphs, and dashboards to help users understand and interpret data.
2. Which of the following is a classic example of structured data?
Working with structured and unstructured data
Easy
A. A table of customer information in a SQL database
B. Audio recordings from a call center
C. A collection of customer review emails
D. A folder of images from a security camera
Correct Answer: A table of customer information in a SQL database
Explanation:
Structured data is highly organized and formatted in a way that is easily searchable, like data in a relational database (e.g., SQL) with clear rows and columns.
3. Which of these data types is considered unstructured?
Working with structured and unstructured data
Easy
A. Video files
B. A database of student grades
C. An Excel spreadsheet with employee IDs
D. A CSV file with sales figures
Correct Answer: Video files
Explanation:
Unstructured data, such as video, audio, and text documents, does not have a predefined data model or organization, making it more difficult to process and analyze.
4. What is the main purpose of a data pipeline?
Data pipelines and automation
Easy
A. To exclusively visualize data
B. To create AI models
C. To move data from a source to a destination, often with transformations
D. To store data backups
Correct Answer: To move data from a source to a destination, often with transformations
Explanation:
A data pipeline is a series of processes that ingests data from various sources and moves it to a destination (like a data warehouse) for analysis. It often includes steps for cleaning and transforming the data along the way.
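The ingest-transform-load flow described above can be sketched in a few lines of Python. This is a toy illustration, not any particular framework's API; all function names are hypothetical.

```python
# A toy data pipeline: ingest raw records, transform (clean) them,
# and load the results into a destination. Names are illustrative.

def ingest():
    # Source: raw CSV-like rows, some with stray whitespace.
    return ["  alice,30", "bob,25  "]

def transform(rows):
    # Clean whitespace and split each row into (name, age) tuples.
    return [tuple(r.strip().split(",")) for r in rows]

def load(records, destination):
    # Append transformed records to the destination (e.g. a warehouse table).
    destination.extend(records)
    return destination

warehouse = []
load(transform(ingest()), warehouse)
print(warehouse)  # [('alice', '30'), ('bob', '25')]
```

Real pipelines add scheduling, validation, and error handling around the same three stages.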
5. What is a major advantage of using cloud services like AWS or Azure for training AI models?
AI Model Environments & Lifecycle Basics: Cloud services
Easy
A. It is always free to use
B. Access to powerful computing resources on-demand
C. It guarantees the model will be 100% accurate
D. It does not require an internet connection
Correct Answer: Access to powerful computing resources on-demand
Explanation:
Cloud services provide scalability and access to expensive hardware (like GPUs and TPUs) on a pay-as-you-go basis, which is often more cost-effective than purchasing and maintaining the hardware yourself.
6. What does edge deployment for an AI model mean?
AI Model Environments & Lifecycle Basics: Edge deployment
Easy
A. Running the model on a central cloud server
B. Storing the model on a cutting-edge hard drive
C. Training the model on multiple computers simultaneously
D. Running the model directly on a local device like a smartphone or sensor
Correct Answer: Running the model directly on a local device like a smartphone or sensor
Explanation:
Edge deployment involves placing the AI model on the device where the data is generated (the 'edge' of the network), which reduces latency and reliance on a constant internet connection.
7. What does the term MLOps primarily refer to?
Introduction to MLOps and lifecycle management
Easy
A. A programming language for statistics
B. A set of practices for collaboration and communication between data scientists and IT professionals
C. A brand of computer hardware for AI
D. A new type of machine learning algorithm
Correct Answer: A set of practices for collaboration and communication between data scientists and IT professionals
Explanation:
MLOps (Machine Learning Operations) aims to streamline the lifecycle of machine learning models, from development to deployment and monitoring, by applying DevOps principles to the machine learning process.
8. In machine learning, what is overfitting?
Error identification
Easy
A. When the dataset is too small to use
B. When a model is too simple to capture the underlying data patterns
C. When a model performs poorly on both training and new data
D. When a model performs very well on training data but poorly on new data
Correct Answer: When a model performs very well on training data but poorly on new data
Explanation:
Overfitting occurs when a model learns the training data, including its noise and random fluctuations, so well that it fails to generalize to new, unseen data.
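An extreme, deliberately silly sketch of this failure mode: a "model" that simply memorizes its training examples scores perfectly on training data and collapses on anything unseen. This is an illustration of the symptom, not a realistic model.

```python
# A deliberately overfit "model": it memorizes every training example.
# Perfect on training data, useless on new data.

train = {(1, 2): "A", (3, 4): "B", (5, 6): "A"}

def memorizing_model(x):
    # Exact-lookup "prediction"; unseen inputs fall back to a blind default.
    return train.get(x, "A")

train_acc = sum(memorizing_model(x) == y for x, y in train.items()) / len(train)

test = {(2, 2): "B", (4, 4): "B", (7, 7): "B"}
test_acc = sum(memorizing_model(x) == y for x, y in test.items()) / len(test)

print(train_acc, test_acc)  # 1.0 0.0 -- the classic overfitting gap
```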
9. What is the primary goal of AI process automation?
AI process automation
Easy
A. To use AI to perform repetitive tasks previously done by humans
B. To analyze stock market trends exclusively
C. To replace all human jobs with robots
D. To create art and music using AI
Correct Answer: To use AI to perform repetitive tasks previously done by humans
Explanation:
AI process automation focuses on using AI and machine learning to handle routine, rule-based, and high-volume tasks, freeing up human workers to focus on more complex, strategic activities.
10. What is often considered the first step in troubleshooting a problem with an AI model?
Troubleshooting
Easy
A. Adding more data to the training set
B. Changing the model's algorithm
C. Immediately deleting the model and starting over
D. Identifying and understanding the specific problem or error
Correct Answer: Identifying and understanding the specific problem or error
Explanation:
Before you can fix a problem, you must first clearly identify what is going wrong. This involves looking at error messages, model performance metrics, and logs.
11. When using a tool like ChatGPT's Advanced Data Analysis, what kind of input do you typically provide to start an analysis?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Easy
A. A natural language prompt describing the task and a data file
B. A connection to a live-streaming database
C. Complex Python code
D. A pre-trained neural network
Correct Answer: A natural language prompt describing the task and a data file
Explanation:
These tools are designed to be user-friendly. The user uploads a data file (like a CSV) and types a request in plain English, such as "Create a bar chart of sales by region."
12. What does automation in a data pipeline help to reduce?
Data pipelines and automation
Easy
A. The amount of data being processed
B. The number of data sources
C. The need for manual intervention and human error
D. The complexity of the data
Correct Answer: The need for manual intervention and human error
Explanation:
Automating a data pipeline ensures that data processing steps run consistently and on a schedule without a person needing to manually trigger each step, which reduces the chance of errors.
13. Which of these applications is a good candidate for edge deployment?
AI Model Environments & Lifecycle Basics: Edge deployment
Easy
A. A real-time object detection feature on a smartphone camera
B. Analyzing a decade of a company's financial records
C. A massive climate change simulation model
D. Training a large language model like GPT-4
Correct Answer: A real-time object detection feature on a smartphone camera
Explanation:
Edge deployment is ideal for applications that require low latency (fast response times) and may need to function without a stable internet connection, such as real-time processing on a mobile device.
14. Which stage of the AI model lifecycle involves putting a trained model into a live environment to make predictions?
Introduction to MLOps and lifecycle management
Easy
A. Data collection
B. Deployment
C. Feature engineering
D. Model training
Correct Answer: Deployment
Explanation:
Deployment is the phase where the model is integrated into a production system, such as a web application or mobile app, so it can be used by end-users.
15. The term for using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server, is called:
AI Model Environments & Lifecycle Basics: Cloud services
Easy
A. Local Hosting
B. Cloud Computing
C. Edge Computing
D. Personal Computing
Correct Answer: Cloud Computing
Explanation:
Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user.
16. What is a syntax error in a computer program?
Error identification
Easy
A. An error caused by a user providing invalid input
B. An error in the code that violates the rules of the programming language
C. An error that occurs only when the program is out of memory
D. An error where the program runs but produces incorrect results
Correct Answer: An error in the code that violates the rules of the programming language
Explanation:
A syntax error is like a grammatical mistake in a language. The code cannot be correctly interpreted or compiled because it doesn't follow the language's structural rules.
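In Python this can be demonstrated directly: a syntax error is rejected at parse time, before any line of the code runs. The snippet below uses the built-in `compile()` to show the parser catching an ungrammatical definition.

```python
# A syntax error violates the language's grammar, so Python rejects
# the code when parsing it -- before a single statement executes.

bad_source = "def greet(:"   # ':' inside the parameter list is ungrammatical

try:
    compile(bad_source, "<example>", "exec")
    outcome = "compiled"
except SyntaxError:
    outcome = "syntax error caught at parse time"

print(outcome)  # syntax error caught at parse time
```

Contrast this with a logical error, where the code parses and runs but produces the wrong result.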
17. In software, what does debugging refer to?
Troubleshooting
Easy
A. The process of writing new features for an application
B. The process of designing the user interface
C. The process of deploying the application to a server
D. The process of finding and fixing errors or 'bugs' in code
Correct Answer: The process of finding and fixing errors or 'bugs' in code
Explanation:
Debugging is a systematic process of identifying, analyzing, and removing errors (bugs) from computer software or hardware to make it behave as expected.
18. What is the key characteristic of structured data?
Working with structured and unstructured data
Easy
A. It can only be text
B. It has a predefined format and a fixed schema
C. It is always stored in PDF files
D. It has no internal structure
Correct Answer: It has a predefined format and a fixed schema
Explanation:
The defining feature of structured data is its organization. It conforms to a tabular format with clear relationships between rows and columns or data points, defined by a schema.
19. After an AI model is deployed, what is a critical MLOps practice to ensure it continues to perform well?
Introduction to MLOps and lifecycle management
Easy
A. Monitoring and maintenance
B. Never updating the model
C. Deleting the training data
D. Hiding the model's predictions from users
Correct Answer: Monitoring and maintenance
Explanation:
Models can degrade over time due to changes in the real-world data (a concept called 'model drift'). MLOps emphasizes continuous monitoring to detect and correct such issues.
20. A business wants to automatically categorize incoming customer support emails into 'Urgent', 'Billing Question', or 'General Inquiry'. This is an example of:
AI process automation
Easy
A. Edge deployment
B. Hardware troubleshooting
C. Data visualization
D. AI process automation
Correct Answer: AI process automation
Explanation:
This task uses Natural Language Processing (an AI technique) to automate the routine process of sorting emails, which would otherwise be done manually by a person.
21. A business analyst has a 500MB CSV file of sales data and wants to quickly explore potential correlations, generate summary statistics, and create a few initial plots without writing any code. Which tool would be most efficient for this initial exploratory data analysis task?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Medium
A. Tableau by connecting to the data source and manually dragging and dropping fields to create worksheets.
B. ChatGPT Advanced Data Analysis by uploading the file and using natural language prompts.
C. Writing a custom Python script using the Pandas and Matplotlib libraries.
D. Importing the data into a SQL database and writing complex queries.
Correct Answer: ChatGPT Advanced Data Analysis by uploading the file and using natural language prompts.
Explanation:
For rapid, prompt-based exploration without coding, ChatGPT's Advanced Data Analysis is ideal. It can interpret requests like "find the correlation between sales and advertising spend" and generate the code and output automatically. While Tableau is a powerful visualization tool, it requires more manual setup. Python and SQL are powerful but require explicit coding, which conflicts with the analyst's no-code requirement.
22. An AI system is designed to analyze customer support tickets. Each ticket contains the customer's name (text), a priority level (low, medium, high), the date of submission (timestamp), and a free-text description of the problem. How should this data be categorized?
Working with structured and unstructured data
Medium
A. Entirely unstructured data because it contains free-text.
B. A mix of structured (name, priority, date) and unstructured (problem description) data.
C. Entirely structured data because it's all stored in a database.
D. Primarily time-series data due to the submission date.
Correct Answer: A mix of structured (name, priority, date) and unstructured (problem description) data.
Explanation:
Structured data has a predefined format (e.g., name, priority level, date). Unstructured data does not have a clear, predefined model (e.g., the free-text problem description). This scenario clearly contains both types.
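A minimal sketch of such a ticket makes the split concrete; the field names and values below are illustrative, not from any real system.

```python
# A support ticket mixes structured fields (fixed schema, easy to query)
# with an unstructured free-text description.

ticket = {
    "customer_name": "Ada Lovelace",      # structured: text field with fixed meaning
    "priority": "high",                   # structured: one of {low, medium, high}
    "submitted_at": "2024-05-01T09:30",   # structured: timestamp
    "description": "The app crashes whenever I open the billing page",
}

structured_fields = {k: v for k, v in ticket.items() if k != "description"}
unstructured_text = ticket["description"]

print(sorted(structured_fields))  # ['customer_name', 'priority', 'submitted_at']
```

The structured fields can go straight into a relational table; the description needs NLP-style processing first.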
23. In an automated data pipeline for a machine learning model, what is the primary purpose of the 'Data Validation' stage that typically follows data ingestion?
Data pipelines and automation
Medium
A. To check if the incoming data meets certain quality and schema expectations before processing.
B. To convert raw data into features for the model (e.g., normalization).
C. To train the machine learning model on the new data.
D. To store the raw data in a data lake or warehouse.
Correct Answer: To check if the incoming data meets certain quality and schema expectations before processing.
Explanation:
The Data Validation stage acts as a quality gate. It ensures that the new data conforms to the expected schema (e.g., correct data types, number of columns) and statistical properties, preventing corrupted or unexpected data from breaking the pipeline or degrading model performance.
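A toy version of such a quality gate can be written in a few lines. This is a sketch under assumed column names (`id`, `amount`); production pipelines use dedicated tools with richer statistical checks.

```python
# A minimal validation gate: reject a batch whose rows do not match the
# expected schema (column names and types) before it enters the pipeline.

EXPECTED_SCHEMA = {"id": int, "amount": float}

def validate(rows):
    for row in rows:
        if set(row) != set(EXPECTED_SCHEMA):
            return False  # wrong or missing columns
        if not all(isinstance(row[col], t) for col, t in EXPECTED_SCHEMA.items()):
            return False  # wrong types
    return True

good_batch = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 5.00}]
bad_batch = [{"id": "3", "amount": 9.99}]  # id arrived as a string

print(validate(good_batch), validate(bad_batch))  # True False
```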
24. A hospital is developing an AI tool to assist surgeons by providing real-time analysis of a video feed from a laparoscopic camera during an operation. The system must have minimal latency (delay) to be effective. What is the most appropriate deployment environment for this AI model?
AI Model Environments & Lifecycle Basics: Cloud services, Edge deployment
Medium
A. Hybrid deployment where data is sent to the cloud for processing and results are sent back.
B. Edge deployment on the surgical equipment itself.
C. Cloud deployment on a high-performance server in a remote data center.
D. Batch processing on a local server after the surgery is complete.
Correct Answer: Edge deployment on the surgical equipment itself.
Explanation:
For applications requiring real-time response and high reliability, such as surgical assistance, edge deployment is critical. Processing the data directly on or near the device avoids the network latency and potential connectivity issues associated with sending data to a remote cloud server.
25. A machine learning model is trained to predict customer churn. During evaluation, the model's accuracy on the training dataset is 98%, but its accuracy on a new, unseen test dataset is only 60%. This significant performance gap is a classic sign of:
Error identification, Troubleshooting
Medium
A. Underfitting
B. Data leakage
C. Class imbalance
D. Overfitting
Correct Answer: Overfitting
Explanation:
Overfitting occurs when a model learns the training data too well, including its noise and specific patterns, but fails to generalize to new, unseen data. The high accuracy on training data and low accuracy on test data is the primary indicator of this problem.
26. What is the primary role of a 'feature store' in an MLOps framework?
Introduction to MLOps and lifecycle management
Medium
A. To log the performance metrics of models in production.
B. To store the final, trained machine learning models.
C. To provide a centralized repository for storing, retrieving, and managing curated features for model training and serving.
D. To orchestrate the entire data pipeline from ingestion to deployment.
Correct Answer: To provide a centralized repository for storing, retrieving, and managing curated features for model training and serving.
Explanation:
A feature store solves problems of feature duplication and inconsistency. It's a central place where data scientists can define, store, and share features, ensuring that the same feature engineering logic is used for both model training and real-time inference, which helps prevent train-serve skew.
27. A company wants to automate the process of categorizing incoming customer support emails into 'Billing', 'Technical Issue', or 'General Inquiry' before they are assigned to an agent. Which AI technology is best suited for this task?
AI process automation
Medium
A. Anomaly detection for finding outliers.
B. Robotic Process Automation (RPA) for mimicking UI clicks.
C. Computer Vision for image recognition.
D. Natural Language Processing (NLP) for text classification.
Correct Answer: Natural Language Processing (NLP) for text classification.
Explanation:
This task involves understanding and categorizing text, which is a core function of Natural Language Processing (NLP). A text classification model can be trained to automatically assign the correct category to each email based on its content, streamlining the support workflow.
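To make the idea tangible, here is a deliberately simple keyword-based stand-in for a text classifier. A production system would use a trained NLP model rather than hand-picked keywords; the category names and keyword sets below are illustrative.

```python
# A toy rule-based stand-in for an NLP email classifier. A real system
# would learn these associations from labeled training data.

CATEGORY_KEYWORDS = {
    "Billing": {"invoice", "charge", "refund", "payment"},
    "Technical Issue": {"error", "crash", "bug", "login"},
}

def categorize(email_text):
    words = set(email_text.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:  # any keyword present
            return category
    return "General Inquiry"

print(categorize("I was charged twice, please refund my payment"))   # Billing
print(categorize("The app shows an error and then a crash"))         # Technical Issue
print(categorize("What are your opening hours?"))                    # General Inquiry
```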
28. When preparing unstructured text data, such as movie reviews, for a sentiment analysis model, a common preprocessing step is 'vectorization'. What does this process accomplish?
Working with structured and unstructured data
Medium
A. It summarizes the entire text into a single sentence.
B. It stores the text in a highly compressed format to save space.
C. It converts the text into a numerical representation (vectors) that a machine learning model can understand.
D. It corrects all spelling and grammar mistakes in the text.
Correct Answer: It converts the text into a numerical representation (vectors) that a machine learning model can understand.
Explanation:
Machine learning models operate on numbers, not raw text. Vectorization techniques (like TF-IDF or word embeddings) transform words, sentences, or documents into numerical vectors, allowing the model to perform mathematical operations and learn patterns from the text data.
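The simplest form of vectorization, a bag-of-words count, can be sketched in pure Python. Real pipelines use TF-IDF or learned embeddings, but the principle — text in, numbers out — is the same.

```python
# A minimal bag-of-words vectorizer: each document becomes a vector of
# word counts over a shared vocabulary.

docs = ["great movie great cast", "terrible movie"]

# Build the vocabulary from all documents, sorted for a stable order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc):
    words = doc.split()
    return [words.count(word) for word in vocabulary]

vectors = [vectorize(doc) for doc in docs]
print(vocabulary)  # ['cast', 'great', 'movie', 'terrible']
print(vectors)     # [[1, 2, 1, 0], [0, 0, 1, 1]]
```

Each review is now a numeric vector a model can do arithmetic on.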
29. A large e-commerce company trains its product recommendation model weekly on terabytes of new user interaction data. Why is a cloud environment better suited for this task than an on-premise server?
AI Model Environments & Lifecycle Basics: Cloud services, Edge deployment
Medium
A. On-premise servers are incapable of handling terabytes of data.
B. Cloud environments are inherently more secure than any on-premise solution.
C. Cloud services guarantee lower latency for model inference for all users globally.
D. Cloud services offer elastic scalability, allowing the company to provision powerful computing resources (like many GPUs/TPUs) for the training period and then scale them down to save costs.
Correct Answer: Cloud services offer elastic scalability, allowing the company to provision powerful computing resources (like many GPUs/TPUs) for the training period and then scale them down to save costs.
Explanation:
The key advantage of the cloud for large-scale training is elasticity. The company can access a massive amount of computational power when needed for the heavy training job and then release those resources, paying only for what they use. Maintaining such hardware on-premise would be extremely expensive and inefficient.
30. You are tasked with creating a highly interactive, public-facing dashboard that allows users to filter data by region, date range, and product category. The dashboard must be embeddable in a website and handle live data connections. Which tool is designed for this specific purpose?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Medium
A. Microsoft Excel
B. Tableau
C. A Jupyter Notebook with static plots
D. ChatGPT Advanced Data Analysis
Correct Answer: Tableau
Explanation:
Tableau specializes in creating interactive, shareable, and embeddable dashboards with features like live data connections and user-driven filters. While ChatGPT can create plots, it's not designed for building persistent, interactive dashboards. Jupyter notebooks are for analysis, not public-facing applications, and Excel has limitations in interactivity and live data handling.
31. Within the MLOps lifecycle, what is the primary purpose of 'model monitoring' after a model has been deployed?
Introduction to MLOps and lifecycle management
Medium
A. To continuously retrain the model with new data every few seconds.
B. To detect performance degradation, data drift, or concept drift in the production environment.
C. To A/B test different versions of the model's user interface.
D. To keep a version-controlled history of the model's source code.
Correct Answer: To detect performance degradation, data drift, or concept drift in the production environment.
Explanation:
Once a model is in production, it's crucial to monitor its performance on live data. Model monitoring tracks key metrics and data distributions to alert the team when the model's performance drops (degradation) or when the input data characteristics change (data drift), indicating that retraining may be necessary.
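One of the simplest drift checks compares a feature's live distribution against its training-time baseline. The sketch below uses only the mean and a hypothetical 20% threshold; real monitoring systems use richer statistics (e.g. distribution-distance tests) across many features.

```python
# A toy drift check: alert when a feature's live mean moves too far
# from its training-time baseline.
from statistics import mean

training_feature = [10, 11, 9, 10, 10, 11, 9, 10]
live_feature = [15, 16, 14, 15, 15, 16, 14, 15]   # distribution has shifted

def drift_detected(train_vals, live_vals, threshold=0.2):
    baseline = mean(train_vals)
    relative_shift = abs(mean(live_vals) - baseline) / baseline
    return relative_shift > threshold

print(drift_detected(training_feature, live_feature))  # True
```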
32. An AI model for predicting house prices is found to have high bias. What is the most likely symptom of this problem?
Error identification, Troubleshooting
Medium
A. The model's predictions fluctuate wildly with small changes in the input data.
B. The model takes an excessively long time to train.
C. The model performs perfectly on the training data but fails miserably on the test data.
D. The model performs poorly on both the training data and the test data, consistently making large errors.
Correct Answer: The model performs poorly on both the training data and the test data, consistently making large errors.
Explanation:
High bias is synonymous with underfitting. The model is too simple to capture the underlying patterns in the data. As a result, it performs poorly not just on new data, but also on the data it was trained on, indicating it has failed to learn the relationships effectively.
33. In the context of data pipelines, what is the concept of 'idempotency'?
Data pipelines and automation
Medium
A. The pipeline can process both structured and unstructured data simultaneously.
B. Running the pipeline multiple times with the same input will always produce the same output, without causing unintended side effects.
C. The pipeline runs on a predefined schedule, such as once every 24 hours.
D. The pipeline automatically scales its resources based on the volume of data.
Correct Answer: Running the pipeline multiple times with the same input will always produce the same output, without causing unintended side effects.
Explanation:
Idempotency is a critical property for data pipeline reliability. It means that if a pipeline task fails and is retried, or is run accidentally multiple times, it won't duplicate data, create incorrect entries, or corrupt the final state. The end result is the same as if it had run successfully just once.
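A common way to achieve this is to key writes by a stable identifier so a retry overwrites rather than duplicates. A minimal sketch (the record shape is illustrative):

```python
# An idempotent load step: records are keyed by ID, so re-running the
# same batch upserts instead of appending duplicates.

def idempotent_load(destination, batch):
    for record in batch:
        destination[record["id"]] = record  # upsert by key, not append
    return destination

warehouse = {}
batch = [{"id": 1, "total": 100}, {"id": 2, "total": 250}]

idempotent_load(warehouse, batch)
idempotent_load(warehouse, batch)  # retry after a "failure": no duplicates

print(len(warehouse))  # 2, same as after a single successful run
```

An append-based load, by contrast, would hold four rows after the retry.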
34. Which of the following scenarios is a better fit for traditional Robotic Process Automation (RPA) rather than a more complex AI-based automation solution?
AI process automation
Medium
A. Copying data from a specific cell in an Excel sheet and pasting it into a fixed field in a web-based form.
B. Reading a handwritten doctor's note and summarizing the key points.
C. Determining the overall sentiment (positive/negative) of a customer's email.
D. Forecasting next quarter's sales based on historical data and market trends.
Correct Answer: Copying data from a specific cell in an Excel sheet and pasting it into a fixed field in a web-based form.
Explanation:
Traditional RPA excels at automating repetitive, rule-based tasks that involve structured data and deterministic steps. Copying and pasting between well-defined locations is a perfect example. The other options require cognitive capabilities like understanding handwriting, analyzing sentiment, or predictive modeling, which fall into the domain of AI.
35. A data science team has two versions of a fraud detection model. They want to test which one performs better on live traffic without fully replacing the old model. They decide to route 10% of user requests to the new model and 90% to the old one. This deployment strategy is known as:
Introduction to MLOps and lifecycle management
Medium
A. Blue-Green Deployment
B. Shadow Deployment
C. A/B Testing
D. Canary Deployment
Correct Answer: Canary Deployment
Explanation:
Canary deployment involves rolling out a new version of a model to a small subset of users/requests first. This allows the team to monitor its performance and stability in a live environment with minimal risk. If it performs well, the rollout can be gradually increased until it replaces the old version entirely.
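The routing itself is often a simple hash-based split, sketched below. Hashing a stable request or user ID (rather than picking randomly per request) means the same caller consistently hits the same model version; the function and percentage are illustrative.

```python
# A sketch of canary routing: send ~10% of traffic to the new model by
# hashing a stable request ID into a 0-99 bucket.
import hashlib

def route(request_id, canary_percent=10):
    digest = hashlib.sha256(str(request_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "new_model" if bucket < canary_percent else "old_model"

assignments = [route(i) for i in range(1000)]
share = assignments.count("new_model") / len(assignments)
print(round(share, 2))  # close to the 10% canary target
```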
36. An AI model is being built to extract information from scanned PDF invoices. What is a primary challenge that distinguishes this task from analyzing a simple text file?
Working with structured and unstructured data
Medium
A. The text in a PDF is always perfectly clean and requires no preprocessing.
B. PDF invoices contain only structured data, which is difficult to parse.
C. The model must understand the spatial layout and structure (e.g., tables, key-value pairs) of the document, not just the raw text.
D. PDF files cannot be read by programming languages.
Correct Answer: The model must understand the spatial layout and structure (e.g., tables, key-value pairs) of the document, not just the raw text.
Explanation:
Documents like invoices have a crucial 2D structure. The location of text (e.g., text next to the label 'Total Amount') is key to its meaning. This requires technologies like Optical Character Recognition (OCR) combined with layout analysis, making it more complex than just processing a linear sequence of text.
37. During the exploratory data analysis phase, you discover that the 'price' column in your dataset, which should be numerical, contains values like '$1,200.50' and '950 USD'. If you try to feed this data directly into a regression model, what type of error will most likely occur?
Error identification, Troubleshooting
Medium
A. An overfitting error due to the high variance in price.
B. A logical error where the model produces negative price predictions.
C. A data type error, as the model expects a numeric type but receives a string.
D. A data leakage error from the currency symbols.
Correct Answer: A data type error, as the model expects a numeric type but receives a string.
Explanation:
Most machine learning libraries and models require numerical input for regression tasks. The presence of non-numeric characters ('$', ',', 'USD') makes the column a string (or object) type. This mismatch will cause the program to fail, typically with a TypeError or ValueError, until the data is cleaned and converted to a proper numeric format (e.g., float or integer).
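One common cleaning approach, sketched here, strips everything except digits and the decimal point before converting. This simple regex assumes dot-decimal, comma-thousands formatting; real currency parsing needs more care.

```python
# Cleaning messy price strings into floats so a regression model can
# consume them: strip currency symbols, thousands separators, and
# trailing currency codes.
import re

def clean_price(raw):
    # Keep only digits and the decimal point, then convert.
    numeric = re.sub(r"[^0-9.]", "", raw)
    return float(numeric)

prices = ["$1,200.50", "950 USD"]
cleaned = [clean_price(p) for p in prices]
print(cleaned)  # [1200.5, 950.0]
```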
38. A key component of a robust, automated ML training pipeline is data and model versioning. Why is it crucial to version not just the code, but also the data used for training?
Data pipelines and automation
Medium
A. To automatically encrypt the dataset for security.
B. To speed up the data loading process during training.
C. To reduce the storage space required for the dataset.
D. To ensure reproducibility, allowing you to recreate a specific model by using the exact same code and data it was trained on.
Correct Answer: To ensure reproducibility, allowing you to recreate a specific model by using the exact same code and data it was trained on.
Explanation:
Reproducibility is a cornerstone of MLOps and scientific rigor. A model is a product of both code and data. If the training data changes, even with the same code, you will get a different model. Versioning the data (e.g., using tools like DVC) allows you to track which dataset was used to produce which model version, making experiments reproducible and debugging easier.
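The core mechanism behind data versioning is content fingerprinting: hash the dataset and record the hash alongside the model. The sketch below shows the idea in miniature; tools like DVC add storage, remotes, and pipeline tracking on top.

```python
# A lightweight dataset "version tag": fingerprint the contents with a
# hash. Any change to the data yields a different tag.
import hashlib

def dataset_fingerprint(rows):
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode())
    return digest.hexdigest()[:12]  # short tag to log next to the model

v1 = dataset_fingerprint([("2024-01", 100), ("2024-02", 120)])
v2 = dataset_fingerprint([("2024-01", 100), ("2024-02", 999)])

print(v1 != v2)  # True: edited data produces a new version tag
```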
39. What is a major drawback of edge deployment compared to cloud deployment for AI models?
AI Model Environments & Lifecycle Basics: Cloud services, Edge deployment
Medium
A. Difficulty in scaling to serve millions of users simultaneously.
B. Higher costs associated with paying for on-demand cloud computing resources.
C. Higher network latency and dependence on internet connectivity.
D. Limited computational power, memory, and energy on edge devices, making it difficult to run large, complex models.
Correct Answer: Limited computational power, memory, and energy on edge devices, making it difficult to run large, complex models.
Explanation:
Edge devices (like smartphones, sensors, or IoT devices) have significant constraints on processing power, RAM, and battery life compared to cloud servers. This often requires model optimization, quantization, or the use of smaller, less complex models, which can sometimes trade off accuracy for efficiency.
40. A user provides Tableau with a dataset containing 'State', 'City', and 'Sales' data. Tableau automatically recognizes that 'State' and 'City' are geographical data types and suggests plotting them on a map. This feature is an example of:
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Medium
A. Automated data type inference and semantic recognition.
B. Natural Language Processing (NLP) of the column headers.
C. A manually programmed rule for all columns named 'State' or 'City'.
D. AI-powered predictive modeling.
Correct Answer: Automated data type inference and semantic recognition.
Explanation:
Modern BI tools like Tableau have built-in intelligence to analyze the data within a column (not just its header) and infer its semantic type. Recognizing patterns like state names or city names allows it to automatically assign a 'Geographical Role', enabling powerful map-based visualizations without manual configuration.
41. An MLOps team is managing a credit risk model for a bank. They detect that the model's predictions are systematically drifting for a specific demographic group, indicating potential fairness issues. The model is retrained automatically every month on new data. What is the most robust MLOps strategy to address this specific type of concept drift?
Introduction to MLOps and lifecycle management
Hard
A. Roll back to the previous model version and halt the automatic retraining pipeline until the data distribution stabilizes.
B. Increase the retraining frequency to weekly to adapt to the new data distribution more quickly.
C. Implement stratified retraining batches that ensure consistent demographic representation in every training cycle and add a fairness constraint to the model's loss function.
D. Trigger an alert for manual review by the data science team whenever the model's accuracy drops below a predefined threshold.
Correct Answer: Implement stratified retraining batches that ensure consistent demographic representation in every training cycle and add a fairness constraint to the model's loss function.
Explanation:
This is the most proactive and robust solution. Increasing retraining frequency (B) might just retrain the bias faster. Manual review (D) is reactive, not preventative. Rolling back (A) is a temporary fix that doesn't solve the underlying problem. Stratified retraining ensures the model sees a fair distribution of data, and adding a fairness constraint directly penalizes the model for biased predictions during training, addressing the root cause systematically.
Incorrect! Try again.
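A minimal sketch of the stratification half of the answer, assuming illustrative field names and proportions (the fairness-constrained loss is omitted):

```python
import random

def stratified_batch(rows, group_key, batch_size, proportions, seed=0):
    """Sample a training batch that matches target per-group proportions.

    `proportions` maps group label -> fraction of the batch (sums to 1).
    A sketch only: a real pipeline would also handle groups with too few rows.
    """
    rng = random.Random(seed)
    by_group = {}
    for row in rows:
        by_group.setdefault(row[group_key], []).append(row)
    batch = []
    for group, frac in proportions.items():
        k = round(batch_size * frac)
        batch.extend(rng.sample(by_group[group], k))
    return batch

# 100 rows from group A, only 10 from group B -- naive sampling would
# underrepresent B; stratified sampling holds B at the target fraction.
rows = [{"group": "A"}] * 100 + [{"group": "B"}] * 10
batch = stratified_batch(rows, "group", 20, {"A": 0.5, "B": 0.5})
print(sum(1 for r in batch if r["group"] == "B"))  # 10
```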
42A company is deploying a real-time object detection model on a fleet of 10,000 battery-powered drones. The key constraints are inference latency (< 50ms) and power consumption. The original model is a large TensorFlow FP32 model. Which model optimization strategy represents the most sophisticated and effective approach for this specific edge scenario?
AI Model Environments & Lifecycle Basics: Edge deployment
Hard
A.Prune the model by 50% to reduce its size and then use post-training static quantization with a representative dataset of images from the drones.
B.Convert the model to TensorFlow Lite (FP32) and deploy it, relying on the hardware's GPU for acceleration.
C.Implement Quantization-Aware Training (QAT) to retrain the model, simulating INT8 quantization during training, and then deploy the resulting model.
D.Use post-training dynamic range quantization to convert weights to INT8, as it requires no representative dataset and is simple to implement.
Correct Answer: Implement Quantization-Aware Training (QAT) to retrain the model, simulating INT8 quantization during training, and then deploy the resulting model.
Explanation:
For strict latency and power constraints, INT8 quantization is necessary. Quantization-Aware Training (QAT) almost always yields the highest accuracy for quantized models because it allows the model to adapt its weights to the precision loss during the training process itself. Post-training quantization (A, D) is easier but often results in a more significant accuracy drop. Simply converting to TFLite without quantization (B) may not be sufficient to meet the strict power and latency requirements on resource-constrained devices.
Incorrect! Try again.
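The core operation QAT simulates during training is "fake quantization": quantize a float to the INT8 grid, then dequantize it back. A self-contained sketch (the scale value is illustrative):

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize-dequantize a float the way QAT simulates INT8 in training:
    round to the integer grid, clamp to the INT8 range, map back to float.
    The rounding error this introduces is what the model learns to tolerate."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return (q - zero_point) * scale

w = 0.4217
print(fake_quantize(w, scale=0.05))  # snapped to the nearest multiple of 0.05
```

Because this op runs in the forward pass during training, gradient descent steers the weights toward values that survive the snap-to-grid, which is why QAT beats post-training quantization on accuracy.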
43A data engineer is designing a data pipeline that processes financial transactions. The pipeline has a critical step that aggregates transactions and writes the summary to a database. If the pipeline fails after this step and is re-run, it must not create duplicate summaries or incorrect aggregates. Which property is essential for this specific step?
Data pipelines and automation
Hard
A.Idempotency
B.Observability
C.Latency
D.Scalability
Correct Answer: Idempotency
Explanation:
Idempotency is the property of an operation that ensures running it multiple times has the same effect as running it once. In this scenario, an idempotent aggregation step would be designed to overwrite or correctly update the summary data upon a re-run, preventing duplicate records or incorrect calculations that would result from simply adding the same data again. This is crucial for data integrity in pipelines that can fail and be restarted.
Incorrect! Try again.
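A minimal sketch of an idempotent summary write using SQLite's upsert behavior (table and column names are illustrative):

```python
import sqlite3

def write_daily_summary(conn, day, total):
    """Idempotent write: re-running for the same day overwrites the row
    instead of appending a duplicate, so a pipeline re-run is safe."""
    conn.execute(
        "INSERT OR REPLACE INTO summaries (day, total) VALUES (?, ?)",
        (day, total),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summaries (day TEXT PRIMARY KEY, total REAL)")
write_daily_summary(conn, "2024-01-01", 1250.0)
write_daily_summary(conn, "2024-01-01", 1250.0)  # re-run after a failure
rows = conn.execute("SELECT * FROM summaries").fetchall()
print(rows)  # [('2024-01-01', 1250.0)] -- one row, not two
```

A plain `INSERT` here would leave two rows after the re-run; keying the write on the day makes running twice equivalent to running once.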
44During the training of a Generative Adversarial Network (GAN), the generator's loss drops to near zero while the discriminator's loss remains high and erratic. The generated images are all very similar and lack diversity. This phenomenon is best described as:
Error identification, Troubleshooting
Hard
A.Vanishing Gradients
B.Exploding Gradients
C.Overfitting
D.Mode Collapse
Correct Answer: Mode Collapse
Explanation:
Mode collapse is a common GAN failure mode where the generator learns to produce a very limited set of outputs (a few 'modes' of the data distribution) that are particularly effective at fooling the discriminator. This leads to the generator's loss becoming very low (as it's succeeding at its one trick) while the discriminator's loss stays high (as it can't distinguish the fake samples from the real ones in that specific mode). The lack of diversity in generated samples is the key symptom.
Incorrect! Try again.
45An analyst uses ChatGPT's Advanced Data Analysis to analyze a sales dataset. They ask it to "Identify the top 3 product categories by profit margin and visualize the result." The AI generates a bar chart showing 3 categories. However, when the analyst manually calculates the profit margin using the formula (Sales - Cost) / Sales, they find a different set of top 3 categories. What is the most likely cause of this discrepancy originating from the AI's process?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Hard
A.The AI's Python execution environment had a floating-point precision error that miscalculated the division for certain categories.
B.The dataset contained null or zero values in the 'Sales' column for some rows, and the AI's default data cleaning step dropped these rows, skewing the calculation.
C.The AI is non-deterministic and hallucinated the results without performing the actual calculation.
D.The AI misinterpreted "profit margin" and calculated "total profit" instead, as this is a more common and simpler metric.
Correct Answer: The AI misinterpreted "profit margin" and calculated "total profit" instead, as this is a more common and simpler metric.
Explanation:
This scenario highlights a key challenge with natural language interfaces for data analysis: ambiguity. "Profit margin" can be interpreted in several ways (e.g., gross profit margin, net profit margin) or confused with a simpler metric like total profit (Sales - Cost). The AI latching onto the simpler, more common metric (D) is a very plausible failure mode and a core concept when using these tools. While B is possible, it is a data issue rather than an interpretation issue; A and C are less likely for a standard calculation, since the tool actually writes and executes Python code rather than guessing at arithmetic.
Incorrect! Try again.
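A small illustration of why the two interpretations rank categories differently. The figures are invented for the example:

```python
# Hypothetical data: a high-volume category can top "total profit" while a
# low-volume, high-markup category tops "profit margin".
sales = {"Electronics": 100_000, "Furniture": 50_000, "Stickers": 1_000}
cost  = {"Electronics":  90_000, "Furniture": 42_000, "Stickers":    200}

total_profit  = {c: sales[c] - cost[c] for c in sales}           # absolute
profit_margin = {c: (sales[c] - cost[c]) / sales[c] for c in sales}  # ratio

by_profit = max(total_profit, key=total_profit.get)
by_margin = max(profit_margin, key=profit_margin.get)
print(by_profit, by_margin)  # Electronics Stickers -- different winners
```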
46A team is building a recommendation engine for an e-commerce site. They have structured data (user purchase history, product ratings) and unstructured data (text from product reviews). To create a hybrid model, they generate embeddings from the review text using a transformer model. What is the most significant challenge when combining these text-based embeddings with the structured user/item features?
Working with structured and unstructured data
Hard
A.Transformer models for text embeddings are too slow for real-time recommendation systems and cannot be combined with structured data.
B.The high dimensionality of text embeddings (e.g., 768 dimensions for BERT) can dominate the lower-dimensional structured features, making the model insensitive to purchase history or ratings.
C.Structured data cannot be normalized to the same scale as text embeddings, leading to training instability.
D.It is impossible to concatenate feature vectors of different data types (numerical and text-based) into a single input for a machine learning model.
Correct Answer: The high dimensionality of text embeddings (e.g., 768 dimensions for BERT) can dominate the lower-dimensional structured features, making the model insensitive to purchase history or ratings.
Explanation:
This is a classic problem in multimodal learning. High-dimensional vectors from one modality (text) can numerically overwhelm features from another (structured data), causing the model's gradients to be dominated by the text features. This requires careful feature scaling, dimensionality reduction (e.g., PCA, autoencoders) on the embeddings, or using model architectures (like multi-headed attention) specifically designed to handle this imbalance, making it the most significant challenge.
Incorrect! Try again.
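A quick numeric illustration of the dominance effect, using invented magnitudes: even with modest per-dimension values, a 768-dim embedding contributes nearly all of the combined vector's energy.

```python
# Illustrative magnitudes only.
embedding  = [0.3] * 768    # BERT-sized text embedding
structured = [1.0, 0.5]     # normalized rating and purchase-count features

sq = lambda v: sum(x * x for x in v)  # squared L2 norm
share = sq(embedding) / (sq(embedding) + sq(structured))
print(round(share, 3))  # ~0.982: the embedding dominates distances and gradients
```

This is why feature scaling or dimensionality reduction on the embeddings is usually needed before concatenation.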
47What is the primary motivation for using a dedicated Feature Store in a mature MLOps organization with multiple data science teams?
Introduction to MLOps and lifecycle management
Hard
A.To serve as a version control system for model weights and artifacts, similar to Git.
B.To provide a centralized location for data scientists to visualize and explore raw data before feature engineering.
C.To prevent training-serving skew by ensuring the exact same feature engineering logic is used during both model training and real-time inference.
D.To automate the process of hyperparameter tuning for all models in the organization.
Correct Answer: To prevent training-serving skew by ensuring the exact same feature engineering logic is used during both model training and real-time inference.
Explanation:
The core value of a feature store is solving the training-serving skew problem. Often, features are generated in a batch process for training (e.g., using a Spark job) but need to be generated in real-time with low latency for serving. Discrepancies between these two code paths can lead to a significant drop in model performance. A feature store provides a unified, consistent source for features for both contexts, directly addressing this critical MLOps challenge. While it helps with discovery (B), it is distinct from model versioning (A) and hyperparameter tuning (D).
Incorrect! Try again.
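A minimal sketch of the "single source of truth" property a feature store institutionalizes. The feature definitions here are invented for illustration:

```python
def engineer_features(raw):
    """One feature function, imported by BOTH the batch training job and the
    real-time serving path. Skew arises when these are two separate
    implementations that drift apart."""
    return {
        "spend_bucket": min(int(raw["total_spend"] // 100), 9),
        "is_weekend": raw["day_of_week"] in (5, 6),
    }

row = {"total_spend": 845.0, "day_of_week": 6}
# Training and serving call the same code, so the features cannot diverge:
assert engineer_features(row) == engineer_features(row)
print(engineer_features(row))  # {'spend_bucket': 8, 'is_weekend': True}
```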
48A startup is training a large language model on a custom dataset. They need to perform distributed training across multiple GPUs to reduce training time. They are considering different cloud services. Which of the following describes the most complex challenge they will face that is specific to distributed training in the cloud?
AI Model Environments & Lifecycle Basics: Cloud services
Hard
A.Installing the correct version of deep learning frameworks like PyTorch or TensorFlow on the cloud instances.
B.Setting up a secure network connection (VPC) to protect the training data from unauthorized access.
C.Managing inter-node communication bandwidth and latency, which can become a bottleneck and diminish the returns of adding more nodes.
D.Provisioning a single virtual machine with a sufficiently powerful GPU to handle the model's memory requirements.
Correct Answer: Managing inter-node communication bandwidth and latency, which can become a bottleneck and diminish the returns of adding more nodes.
Explanation:
In distributed training, the model's gradients and weights must be synchronized across all nodes/GPUs. This communication is highly sensitive to network performance. If the interconnect bandwidth is low or latency is high, the GPUs will spend more time waiting for data from other GPUs than performing computations. This communication overhead is often the primary bottleneck that prevents linear scaling and is a complex engineering problem to solve, often requiring specialized hardware (like AWS's EFA or NVIDIA's NVLink) and careful architecture choices.
Incorrect! Try again.
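A toy cost model of why adding nodes stops helping: compute time shrinks with more GPUs while synchronization overhead grows. The numbers and the linear all-reduce cost are deliberate simplifications, not measurements:

```python
def epoch_time(n_gpus, compute_s=100.0, allreduce_s=2.0):
    """Toy scaling model: per-GPU compute shrinks as 1/n, but gradient
    synchronization overhead grows with participant count. Illustrative
    only -- real all-reduce costs depend on topology and message size."""
    return compute_s / n_gpus + allreduce_s * n_gpus

for n in (1, 4, 8, 16):
    print(n, epoch_time(n))  # improvement flattens, then reverses
```

Past the sweet spot, GPUs spend more time waiting on the interconnect than computing, which is the bottleneck the explanation describes.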
49A company wants to automate its invoice processing. The process involves receiving invoices as PDFs via email, extracting fields like invoice number, date, and total amount, and then entering this data into an ERP system. The invoices come from hundreds of different vendors, each with a unique template. Why would a traditional OCR + template-based RPA solution be inferior to an AI-powered Intelligent Document Processing (IDP) solution for this task?
AI process automation
Hard
A.An IDP solution uses natural language understanding and computer vision to identify fields contextually, making it robust to variations in invoice templates without needing a new template for each vendor.
B.A traditional RPA solution cannot interact with web-based ERP systems, whereas an IDP solution has native API connectors.
C.Standard OCR cannot read text from PDF documents, requiring an AI-based solution to digitize the text first.
D.RPA bots are not capable of performing conditional logic (if/then statements), which is required to validate the extracted invoice data.
Correct Answer: An IDP solution uses natural language understanding and computer vision to identify fields contextually, making it robust to variations in invoice templates without needing a new template for each vendor.
Explanation:
The core weakness of traditional, template-based automation is its brittleness. It requires a predefined template for every single invoice layout. When a new vendor is added or an existing vendor changes their template, the system breaks. An AI-powered IDP solution uses models trained to understand the meaning of an invoice (e.g., it looks for a field labeled "Total" or a monetary value near the bottom), making it adaptable to new and unseen templates. This contextual understanding is the key advantage.
Incorrect! Try again.
50In the context of a streaming data pipeline using a technology like Apache Kafka, what is the primary challenge associated with ensuring 'exactly-once' processing semantics?
Data pipelines and automation
Hard
A.Achieving high enough throughput to process messages as they arrive without creating a backlog in the Kafka topics.
B.Coordinating distributed transactions between the streaming processor (e.g., Flink, Spark) and the output data sink (e.g., a database) to handle both processing failures and network failures without data duplication or loss.
C.Ensuring that messages produced to Kafka are correctly serialized and deserialized by all consumers in the pipeline.
D.Encrypting the data in transit between Kafka brokers and consumers to meet security compliance requirements.
Correct Answer: Coordinating distributed transactions between the streaming processor (e.g., Flink, Spark) and the output data sink (e.g., a database) to handle both processing failures and network failures without data duplication or loss.
Explanation:
Exactly-once semantics is extremely difficult to achieve because it requires handling failure scenarios perfectly. The system must be able to distinguish between 'at-least-once' (duplicates are possible on retry) and 'at-most-once' (data loss is possible on retry). To get it exactly right, the system needs atomic commits across distributed components. This often involves techniques like two-phase commits or transactional writes, which are complex to implement correctly and robustly, especially in the face of machine or network failures. This coordination is the hardest part of the problem.
Incorrect! Try again.
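A minimal sketch of the transactional-sink idea using SQLite as the stand-in for the output database: the aggregate update and the consumed offset commit in one transaction, so a redelivered message cannot double-count. Table names and the schema are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE totals  (account TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE offsets (partition INTEGER PRIMARY KEY, next_offset INTEGER);
""")
conn.execute("INSERT INTO offsets VALUES (0, 0)")

def process(conn, partition, offset, account, amount):
    """Commit the aggregate and the consumed offset in ONE transaction.
    A replayed message (offset already consumed) is skipped, so a crash
    between processing and acknowledging cannot cause double-counting."""
    (next_off,) = conn.execute(
        "SELECT next_offset FROM offsets WHERE partition = ?", (partition,)
    ).fetchone()
    if offset < next_off:
        return  # duplicate delivery after a failure: ignore
    with conn:  # atomic: both writes commit together or not at all
        conn.execute(
            "INSERT INTO totals VALUES (?, ?) "
            "ON CONFLICT(account) DO UPDATE SET amount = amount + excluded.amount",
            (account, amount),
        )
        conn.execute(
            "UPDATE offsets SET next_offset = ? WHERE partition = ?",
            (offset + 1, partition),
        )

process(conn, 0, 0, "acct-1", 50.0)
process(conn, 0, 0, "acct-1", 50.0)  # at-least-once redelivery of the same message
print(conn.execute("SELECT amount FROM totals").fetchone())  # (50.0,)
```

Kafka's transactional producer and Flink's two-phase-commit sinks generalize this same pattern across distributed components.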
51A classification model exhibits high accuracy (98%) on a test set, but a confusion matrix reveals it performs very poorly on the minority class (e.g., high false negatives for a 'fraud' class). When plotted, the ROC curve shows an AUC of 0.95. Why is the high AUC score misleading in this scenario?
Error identification, Troubleshooting
Hard
A.The test set was contaminated with data from the training set, artificially inflating all performance metrics including AUC.
B.The AUC calculation is mathematically incorrect when the number of positive samples is less than 10% of the total dataset.
C.A high AUC score only indicates that the model's predictions are well-calibrated, not that it has good discriminative power between classes.
D.AUC is sensitive to the class imbalance; it's calculated by integrating over all possible classification thresholds, and with a large number of easy-to-classify true negatives, the curve can be pulled up and to the left, masking poor performance on the small positive class.
Correct Answer: AUC is sensitive to the class imbalance; it's calculated by integrating over all possible classification thresholds, and with a large number of easy-to-classify true negatives, the curve can be pulled up and to the left, masking poor performance on the small positive class.
Explanation:
The Area Under the ROC Curve (AUC) is calculated from the True Positive Rate (TPR = TP / (TP + FN)) and False Positive Rate (FPR = FP / (FP + TN)). In a highly imbalanced dataset, the number of true negatives (TN) is massive. This makes the FPR very small even if the number of false positives (FP) is significant compared to the number of true positives (TP). This can result in a high AUC score that looks impressive but hides the fact that the model is failing to identify the rare, important class. For such cases, metrics like Precision-Recall AUC (PR-AUC) are more informative.
Incorrect! Try again.
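The arithmetic makes the masking effect obvious. The confusion-matrix counts below are illustrative:

```python
# Counts at one threshold on a heavily imbalanced dataset:
TP, FN = 50, 50          # the model misses half of the 100 fraud cases...
FP, TN = 500, 99_400     # ...and raises 500 false alarms

tpr = TP / (TP + FN)          # recall on the rare class: 0.5
fpr = FP / (FP + TN)          # ~0.005 -- tiny, thanks to the huge TN count
precision = TP / (TP + FP)    # ~0.09 -- the metric that exposes the problem
print(tpr, round(fpr, 4), round(precision, 3))
```

Ten false positives for every true positive, yet the FPR barely moves; precision (and hence PR-AUC) surfaces what the ROC curve hides.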
52You are creating a dashboard in Tableau to analyze customer churn. You want to display the churn rate for different customer segments, but also allow users to see how the churn rate would change if a hypothetical marketing intervention reduced churn by 15% for a user-selected segment. Which combination of Tableau features would be most effective for creating this interactive, what-if analysis?
Data analysis and visualization using AI tools (ChatGPT Advanced Data Analysis, Tableau)
Hard
A.A Story to create a sequence of dashboards, with one showing the actual churn and the next showing the manually calculated hypothetical churn.
B.A Quick Filter for segment selection and an AI-powered Forecasting model to project the future churn rate.
C.A Level of Detail (LOD) expression to fix the churn rate at the segment level and a data blend from a separate spreadsheet containing the reduction percentages.
D.A Parameter for segment selection, a Parameter for the churn reduction percentage, and a Calculated Field that uses these parameters to compute the hypothetical churn rate.
Correct Answer: A Parameter for segment selection, a Parameter for the churn reduction percentage, and a Calculated Field that uses these parameters to compute the hypothetical churn rate.
Explanation:
This is the ideal way to build interactive what-if analysis in Tableau. Parameters are user-driven variables that are not tied to the data source. One parameter allows the user to pick a segment, another allows them to input a value (like 15%). The Calculated Field can then use conditional logic (e.g., IF [Segment] = [Segment Parameter] THEN [Actual Churn] * (1 - [Reduction Parameter]) ELSE [Actual Churn] END) to display the hypothetical results dynamically. This is more flexible and interactive than the other options.
Incorrect! Try again.
53A key challenge in federated learning, where models are trained on decentralized edge devices (e.g., mobile phones) without data leaving the device, is the 'non-IID' (non-independent and identically distributed) nature of the data. What is the most severe consequence of this non-IID data distribution?
AI Model Environments & Lifecycle Basics: Edge deployment
Hard
A.The global model, aggregated from the local models, can diverge or converge to a poor-performing minimum because the weight updates from different devices pull the model in conflicting directions.
B.The edge devices may not have enough computational power to train the local model effectively.
C.It becomes impossible to ensure data privacy as the model updates inherently leak information about the local data.
D.The communication cost of sending model updates from the edge devices to the central server becomes prohibitively expensive.
Correct Answer: The global model, aggregated from the local models, can diverge or converge to a poor-performing minimum because the weight updates from different devices pull the model in conflicting directions.
Explanation:
The core assumption of many distributed optimization algorithms is that the data is IID. In federated learning, each user's data is unique (non-IID). If one user's data consists entirely of pictures of cats and another's is all dogs, their locally trained models will become highly specialized. When these specialized updates are averaged, they can conflict and destabilize the training of the global model, causing it to perform poorly for everyone. This is a central research problem in federated learning, addressed by algorithms like FedAvg and FedProx.
Incorrect! Try again.
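A tiny sketch of the averaging step at the heart of FedAvg, with two-dimensional "weight vectors" invented to show how non-IID clients can cancel each other out:

```python
def fed_avg(updates):
    """Average client weight vectors -- the aggregation step of FedAvg.
    (The real algorithm weights clients by their local dataset sizes.)"""
    n = len(updates)
    return [sum(u[i] for u in updates) / n for i in range(len(updates[0]))]

client_a = [ 0.9, -0.8]   # specialized on one data mode (e.g. all cats)
client_b = [-0.9,  0.8]   # specialized on the opposite mode (e.g. all dogs)
print(fed_avg([client_a, client_b]))  # [0.0, 0.0] -- the global step stalls
```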
54In a CI/CD/CT (Continuous Training) pipeline for MLOps, what is the most appropriate trigger for automatically initiating a full model retraining job?
Introduction to MLOps and lifecycle management
Hard
A.A statistical monitoring tool detects significant 'concept drift' where the statistical properties of the live input data have diverged from the training data distribution.
B.A fixed schedule, such as the first day of every month, to ensure the model is always up-to-date.
C.The model's predictive accuracy on the live inference data drops by more than 20% from its initial baseline.
D.A software engineer commits a change to the model's inference API code in the Git repository.
Correct Answer: A statistical monitoring tool detects significant 'concept drift' where the statistical properties of the live input data have diverged from the training data distribution.
Explanation:
This is the most sophisticated and correct trigger for continuous training. Retraining should happen when the world the model was trained on no longer matches the world it's operating in. Concept drift (or data drift) is the direct measure of this. Triggering on a code commit (D) is CI/CD, not CT. A fixed schedule (B) can be wasteful if the data hasn't changed, or too slow if it has. Waiting for a 20% accuracy drop (C) is a lagging indicator; the model has already been performing poorly for some time. Detecting drift is a proactive trigger.
Incorrect! Try again.
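A crude sketch of a drift trigger for a single numeric feature: measure how far the live mean has moved, in units of the training standard deviation. The threshold and data are illustrative; production monitors typically use PSI, KS tests, or similar:

```python
import math

def drift_score(train_values, live_values):
    """Shift of the live mean, measured in training standard deviations.
    A deliberately simple stand-in for real drift statistics."""
    n = len(train_values)
    mu = sum(train_values) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in train_values) / n)
    live_mu = sum(live_values) / len(live_values)
    return abs(live_mu - mu) / sd

train = [10.0, 12.0, 11.0, 9.0, 10.0, 12.0]   # feature at training time
live  = [18.0, 19.0, 17.0, 18.5]               # same feature in production
if drift_score(train, live) > 3.0:             # illustrative threshold
    print("trigger retraining")
```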
55You need to process a dataset of one million 100-page PDF documents to extract specific clauses for a legal AI system. The goal is to perform this task efficiently. Which of the following approaches represents the most scalable and computationally efficient architecture?
Working with structured and unstructured data
Hard
A.A single, powerful server with multiple CPU cores that iterates through each PDF, using a multithreaded Python script to process documents in parallel.
B.A distributed processing pipeline using a framework like Apache Spark, where each worker node processes a subset of PDFs. Each worker uses an OCR library to extract text and a pre-trained transformer model (running on a GPU if available on the worker) for clause identification.
C.A serverless architecture where each PDF upload triggers a cloud function (e.g., AWS Lambda). The function performs OCR and clause extraction for that single document.
D.Manually loading the PDFs into a specialized document analysis desktop application and using its built-in tools to extract the clauses, saving the results to a CSV file.
Correct Answer: A distributed processing pipeline using a framework like Apache Spark, where each worker node processes a subset of PDFs. Each worker uses an OCR library to extract text and a pre-trained transformer model (running on a GPU if available on the worker) for clause identification.
Explanation:
This problem requires processing a massive amount of data. A distributed framework like Spark is designed for this scale. It can partition the one million documents across a cluster of machines (workers) and process them in parallel, providing horizontal scalability. While a serverless approach (C) is good for event-driven tasks, it can become very expensive at this scale and may face execution time limits. A single server (B) will be a bottleneck and cannot scale effectively. Manual processing (D) is not feasible.
Incorrect! Try again.
56An organization is using Apache Airflow to orchestrate its daily ETL pipelines. A critical DAG (Directed Acyclic Graph) has a task that depends on a file arriving from an external partner in an S3 bucket. The file can arrive at any time between 2 AM and 5 AM. Which Airflow component is the most appropriate and efficient for handling this specific dependency?
Data pipelines and automation
Hard
A.Use a TriggerDagRunOperator in a separate 'poller' DAG that runs every minute to check for the file and then trigger the main DAG.
B.Write a Python function with a while True: loop and a time.sleep(60) call inside a PythonOperator to check for the file's existence.
C.Use a Sensor, specifically the S3KeySensor, which will periodically check for the existence of the file and only succeed when the file is found, allowing downstream tasks to run.
D.Run the DAG on a fixed schedule at 5:05 AM and assume the file has arrived. If it hasn't, the task will fail and the on-call engineer will be paged to re-run it manually.
Correct Answer: Use a Sensor, specifically the S3KeySensor, which will periodically check for the existence of the file and only succeed when the file is found, allowing downstream tasks to run.
Explanation:
Sensors are a core feature of Airflow designed for precisely this purpose: waiting for an external condition to be met. The S3KeySensor is a built-in operator that efficiently handles this use case. It occupies a worker slot only while it's actively checking (poking) and can be configured to 'reschedule' itself, freeing up the worker for other tasks between pokes, making it very efficient. The while loop (B) is a bad practice as it would tie up an Airflow worker slot continuously. The fixed schedule (D) is brittle and not event-driven. A separate poller DAG (A) is overly complex and less efficient than using a built-in sensor.
Incorrect! Try again.
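Stripped of Airflow's machinery, a sensor is roughly a poke loop like the sketch below (function name and parameters are invented for illustration; the real S3KeySensor wraps an S3 existence check in approximately this loop, with 'reschedule' mode releasing the worker between pokes):

```python
import time

def wait_for(predicate, poke_interval_s=60, timeout_s=3 * 3600, sleep=time.sleep):
    """Generic sensor sketch: periodically 'poke' a condition, succeed when
    it holds, raise on timeout. `sleep` is injectable for testing."""
    waited = 0
    while waited <= timeout_s:
        if predicate():
            return True
        sleep(poke_interval_s)
        waited += poke_interval_s
    raise TimeoutError("condition never became true")

# Simulate a file that appears on the third poke:
pokes = iter([False, False, True])
print(wait_for(lambda: next(pokes), sleep=lambda s: None))  # True
```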
57A financial services company is deploying a fraud detection model using a serverless inference endpoint on a major cloud provider (e.g., AWS SageMaker Serverless Inference, Google Vertex AI). What is the primary trade-off they must consider when choosing a serverless endpoint over a traditional, provisioned endpoint (a dedicated, always-on VM)?
AI Model Environments & Lifecycle Basics: Cloud services
Hard
A.They trade the support for deep learning models for exclusive support for traditional machine learning models like logistic regression.
B.They trade higher security and network isolation for the convenience of a publicly accessible API endpoint.
C.They trade lower cost for infrequent traffic and automatic scaling for potentially higher 'cold start' latency on the first request after a period of inactivity.
D.They trade the ability to use custom Docker containers for a simplified, no-code deployment process.
Correct Answer: They trade lower cost for infrequent traffic and automatic scaling for potentially higher 'cold start' latency on the first request after a period of inactivity.
Explanation:
The core value proposition of serverless is paying only for compute time used. For workloads with sporadic or unpredictable traffic, this is very cost-effective. However, the infrastructure is scaled down to zero when not in use. When a new request arrives, the cloud provider must provision a container, load the model, and initialize the environment. This 'cold start' process introduces latency that would not be present on a provisioned, always-on endpoint. This trade-off between cost/scalability and cold-start latency is the key consideration.
Incorrect! Try again.
58A hospital wants to use AI to automate the preliminary reading of chest X-rays to flag urgent cases for radiologists. The AI model has a 95% accuracy in identifying a specific condition. However, the legal and ethical implications of a misdiagnosis are severe. Which AI process automation design pattern is most appropriate for this high-stakes scenario?
AI process automation
Hard
A.A fully automated 'straight-through-processing' pattern, where the AI's positive predictions are immediately sent to the emergency department to save time.
B.An 'A/B testing' pattern, where 50% of X-rays are processed by the AI and 50% by radiologists to compare performance over time.
C.A 'Robotic Process Automation (RPA)' pattern, where a bot simply moves the X-ray files from one folder to another based on the AI model's output score.
D.A 'Human-in-the-loop' pattern, where the AI flags potential cases, but every single prediction (positive or negative) is reviewed and confirmed by a certified radiologist before any action is taken.
Correct Answer: A 'Human-in-the-loop' pattern, where the AI flags potential cases, but every single prediction (positive or negative) is reviewed and confirmed by a certified radiologist before any action is taken.
Explanation:
In high-stakes environments like healthcare, the cost of a false positive or (especially) a false negative is extremely high. The AI should be used as a tool to assist and augment human experts, not replace them. The 'Human-in-the-loop' pattern ensures that the AI's role is to prioritize and provide a preliminary analysis, but the final, accountable decision rests with a qualified professional. This balances the efficiency gains of AI with the need for safety and accountability.
Incorrect! Try again.
59When deploying a new version of a customer-facing recommendation model, an MLOps team decides to use a 'Canary Release' strategy instead of a simple A/B test. What is the primary advantage of a Canary Release in this context?
Introduction to MLOps and lifecycle management
Hard
A.It automatically rolls back the deployment if the new model's inference latency exceeds a predefined threshold, prioritizing system stability over model accuracy.
B.It allows for a gradual rollout of the new model to a small subset of users (e.g., 1%), minimizing the potential negative impact (the 'blast radius') if the new model has unforeseen issues, while monitoring its performance closely before a full rollout.
C.It ensures that the new model is only served to internal employees and beta testers before being released to the general public.
D.It allows for a statistically rigorous comparison of the new model against the old model by randomly assigning users to two equally sized groups, ensuring the results are not biased.
Correct Answer: It allows for a gradual rollout of the new model to a small subset of users (e.g., 1%), minimizing the potential negative impact (the 'blast radius') if the new model has unforeseen issues, while monitoring its performance closely before a full rollout.
Explanation:
The key concept of a canary release is risk mitigation. Unlike a classic A/B test which might split traffic 50/50, a canary release sends a very small fraction of traffic to the new version first. This allows the team to observe its performance on live traffic and metrics (latency, error rate, business KPIs). If problems arise, the impact is contained to a small user base, and the release can be easily rolled back. A/B testing (D) is more focused on comparison than on risk mitigation during the deployment phase itself.
Incorrect! Try again.
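A minimal sketch of deterministic canary routing via hashing, so each user consistently lands on the same model version. The function name and 1% fraction are illustrative:

```python
import hashlib

def route(user_id, canary_fraction=0.01):
    """Hash the user id into [0, 1) and send the bottom slice to the canary.
    Deterministic: the same user always sees the same model version."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "canary" if (h % 10_000) / 10_000 < canary_fraction else "stable"

routes = [route(f"user-{i}") for i in range(10_000)]
print(routes.count("canary"))  # close to 1% of users
```

Promoting the release is then just raising `canary_fraction` in steps (1% to 5% to 25% to 100%) while monitoring, and rollback is setting it back to 0.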
60A neural network model for time-series forecasting is consistently underperforming, with predictions that seem to lag behind the actual data by one time step. The model architecture is a standard LSTM network. What is the most probable cause of this specific 'lagging' behavior?
Error identification, Troubleshooting
Hard
A.The time-series data was not properly made stationary before being fed into the model, and the model is simply learning to predict the last observed value (y(t)) as the forecast for the next step (y(t+1)).
B.The learning rate is too low, causing the model to converge very slowly and fail to capture the dynamic patterns in the data.
C.Data leakage, where the model was inadvertently trained to predict the next time step's value by using that same value as a feature (e.g., predicting y(t+1) using a feature set that includes y(t+1)).
D.The model is suffering from vanishing gradients due to the long sequences, preventing it from learning long-term dependencies.
Correct Answer: The time-series data was not properly made stationary before being fed into the model, and the model is simply learning to predict the last observed value (y(t)) as the forecast for the next step (y(t+1)).
Explanation:
This is a classic pitfall in time-series modeling. If a series has a strong trend or seasonality (is non-stationary), the simplest and often most effective short-term prediction is just the previous value. A model that hasn't been given stationary data (e.g., via differencing) will often learn this 'naive forecast' as its optimal strategy, as it minimizes short-term error. This manifests as a prediction that perfectly mirrors the actual data but is shifted by one time step. Data leakage (C) is a different problem that usually leads to unrealistically high performance, not lagging.
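The 'lagging' pattern can be reproduced in a few lines: on a trending series, the naive forecast y_hat(t) = y(t-1) scores a small, steady error while trailing the data by exactly one step.

```python
# A trending (non-stationary) series: y_t = t for t = 1..10.
y = [float(t) for t in range(1, 11)]
naive = y[:-1]      # forecast for step t is simply the value at t-1
actual = y[1:]

mae = sum(abs(a - f) for a, f in zip(actual, naive)) / len(actual)
print(mae)  # 1.0 -- modest error, but every prediction trails by one step
```

An LSTM fed this raw series can converge to exactly this strategy, which is why differencing (predicting y(t+1) - y(t)) is the standard remedy.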