Unit 6 - Practice Quiz

INT428

1 What is the primary function of the 'Advanced Data Analysis' feature (formerly Code Interpreter) in ChatGPT?

A. To search the real-time internet for news articles
B. To execute Python code for data processing, analysis, and visualization
C. To generate photorealistic images directly from text
D. To translate spoken language into text in real-time

2 Which of the following best describes structured data?

A. Data that has no pre-defined format, such as emails or social media posts
B. Data organized in a defined format, typically rows and columns (e.g., SQL databases)
C. Data consisting purely of video and audio files
D. Data generated by sensors without any timestamps

3 In the context of data visualization tools like Tableau, what is a Dimension?

A. A quantitative numerical value that can be aggregated
B. A qualitative variable used to categorize or segment data (e.g., Region, Date)
C. A calculated field using complex Python scripts
D. The physical size of the dashboard on the screen

4 What does the acronym ETL stand for in data pipelines?

A. Extract, Transform, Load
B. Evaluate, Train, Learn
C. Estimate, Test, Launch
D. Encrypt, Transfer, Lock
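
Note: the three pipeline stages named in this question can be sketched with the Python standard library. This is only an illustrative sketch: the sample data, the `sales` table, and the function names `extract`, `transform`, and `load` are hypothetical and not part of the course material.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """region,sales
north,100
south,
east,250
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing sales, fix casing, cast to int."""
    return [
        (row["region"].title(), int(row["sales"]))
        for row in rows
        if row["sales"].strip()
    ]

def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Keeping each stage in its own function makes it possible to test or replace one stage without touching the others.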

5 Which of the following is an example of unstructured data?

A. A customer relationship management (CRM) database
B. A spreadsheet containing monthly sales figures
C. A collection of customer review images and text comments
D. A JSON file with strict schema validation

6 In AI deployment, what does Edge Computing refer to?

A. Running AI models on a centralized massive supercomputer
B. Processing data locally on a device (e.g., smartphone, IoT sensor) rather than in the cloud
C. Using the very latest 'cutting edge' algorithms only
D. Storing data on magnetic tapes for long-term archival

7 What is a primary advantage of using Cloud Services for AI model training compared to local hardware?

A. Zero latency in data transfer
B. Scalability and on-demand access to high-performance GPUs/TPUs
C. Guaranteed 100% data privacy without encryption
D. Requirement of no internet connection

8 In the context of machine learning errors, what is Overfitting?

A. When the model performs well on training data but poorly on new, unseen data
B. When the model is too simple to capture the underlying patterns in the data
C. When the model takes too long to train due to hardware limitations
D. When the model data is corrupted during the upload process

9 What is MLOps?

A. A specific algorithm for deep learning
B. A set of practices combining Machine Learning, DevOps, and Data Engineering to deploy and maintain ML systems
C. The process of manually inputting data into a spreadsheet
D. A programming language similar to Python

10 Which visualization would be most appropriate to show the correlation between two continuous variables?

A. Pie Chart
B. Scatter Plot
C. Bar Chart
D. Gantt Chart

11 What is Data Drift in the context of AI lifecycle management?

A. Moving data physically from one server to another
B. The statistical change in model input data over time, potentially degrading model performance
C. The process of cleaning data before training
D. Losing data due to hard drive failure

12 In a data pipeline, what is the role of Data Ingestion?

A. Visualizing the final results
B. Importing data from various sources (streaming or batch) into a storage or processing system
C. Deleting old data to save space
D. Training the neural network

13 When troubleshooting an AI model, a Confusion Matrix is primarily used to:

A. Confuse the user with complex mathematics
B. Visualize the performance of a classification model by showing True Positives, False Positives, etc.
C. Measure the speed of the training process
D. Organize unstructured data into tables
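
Note: a minimal sketch of how the four cells of a binary confusion matrix could be counted by hand. The helper name `confusion_counts` and the sample labels are made up for illustration; libraries such as scikit-learn provide their own implementations.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary classifier."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a missed detection (FN) or correct (TN).
            counts["FN" if actual == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
print(cm)
```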

14 Which LaTeX formula represents the calculation for Accuracy in a classification problem?

A.
B.
C.
D.
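
Note: for reference, accuracy is conventionally defined from the confusion-matrix counts as the share of all predictions that are correct. The symbols TP, TN, FP, and FN (true/false positives/negatives) are the usual abbreviations and are assumed here.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```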

15 What is the primary benefit of Edge AI regarding privacy?

A. It makes data public to everyone on the internet
B. Sensitive data does not leave the local device, reducing the risk of interception during transmission
C. It encrypts data only when it reaches the cloud
D. It requires biometric authentication for every step

16 Which of the following is a characteristic of Batch Processing in data pipelines?

A. Data is processed in real-time as it arrives
B. Data is collected over a period and processed in chunks at scheduled intervals
C. It requires ultra-low latency
D. It is only used for image data
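
Note: the idea of collecting records and processing them in chunks can be sketched as below. The helper `batches` and the sample `events` are hypothetical; a real pipeline would typically trigger each run on a schedule rather than in a loop.

```python
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Seven accumulated events, processed three at a time.
events = range(1, 8)
totals = [sum(chunk) for chunk in batches(events, 3)]
print(totals)
```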

17 In the AI lifecycle, what happens during the Deployment phase?

A. The data is labeled by humans
B. The model is integrated into a production environment to make predictions on live data
C. The model architecture is designed on a whiteboard
D. The historical data is cleaned and normalized

18 What is Imputation in the context of data analysis?

A. Accusing a model of bias
B. The process of replacing missing data with substituted values (e.g., mean, median)
C. Deleting all rows with missing values
D. Encrypting data for security
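
Note: a small sketch of mean and median imputation using only the standard library. The function `impute` and the sample `ages` list are illustrative assumptions; missing values are represented as `None` here.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 70, None]
print(impute(ages))            # fills gaps with the mean of 20, 30, 70
print(impute(ages, "median"))  # fills gaps with the median of 20, 30, 70
```

The median is less sensitive to extreme values such as the 70 above, which is one reason it is often preferred for skewed data.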

19 Which cloud service model provides a platform allowing customers to develop, run, and manage applications without the complexity of building infrastructure (e.g., Google App Engine, Azure App Service)?

A. IaaS (Infrastructure as a Service)
B. PaaS (Platform as a Service)
C. SaaS (Software as a Service)
D. DaaS (Data as a Service)

20 Tableau uses a drag-and-drop interface to create visualizations. What is the 'Measure' in Tableau terminology?

A. A qualitative category
B. A numerical value that can be measured and aggregated (e.g., Sales, Profit)
C. The width of the columns
D. The time it takes to render a chart

21 What is Data Leakage?

A. When a database is hacked
B. When information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates
C. When water damages the server room
D. When the model forgets what it learned

22 Which Python library is commonly used within ChatGPT Advanced Data Analysis for manipulating structured data frames?

A. PyGame
B. Pandas
C. Requests
D. Flask

23 In AI process automation, what distinguishes RPA (Robotic Process Automation) from AI?

A. RPA uses neural networks; AI uses if-then rules
B. RPA is strictly for physical robots; AI is for software
C. RPA mimics rule-based human actions; AI simulates intelligent decision making and learning
D. There is no difference

24 What is the purpose of A/B Testing in model deployment?

A. To test the model on Alpha and Beta versions of an operating system
B. To compare two versions of a model in a live environment to see which performs better
C. To test the model only on data starting with A or B
D. To alternate between the CPU and GPU

25 When troubleshooting a model, if you find high Bias (Underfitting), what is a potential solution?

A. Reduce the number of features
B. Increase the complexity of the model (e.g., add more layers or parameters)
C. Get less training data
D. Add more regularization

26 What is a Dashboard in the context of data visualization?

A. The command line interface for coding
B. A single display that aggregates multiple visualizations to provide a comprehensive view of data
C. The hardware component that connects the monitor
D. A database table

27 Which of the following represents Semi-structured data?

A. A raw binary audio file
B. A relational database table
C. A JSON or XML file
D. A printed book

28 In the AI lifecycle, what is Feature Engineering?

A. Designing the physical look of the robot
B. The process of using domain knowledge to extract or create new variables (features) from raw data to improve model performance
C. The final marketing of the AI product
D. Fixing bugs in the Python compiler

29 What is the primary function of Containerization (e.g., Docker) in AI environments?

A. To compress files to make them smaller
B. To package the application and its dependencies together so it runs consistently across different computing environments
C. To store data in a specialized database
D. To create 3D visualizations

30 Which metric is best for evaluating a classification model where the classes are heavily imbalanced?

A. Accuracy
B. F1 Score
C. Mean Squared Error
D. R-squared
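
Note: the sketch below compares accuracy and F1 for a hypothetical model that always predicts the majority class on a 95/5 class split. The function name and the data are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); taken as 0 here if the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# 95 negatives, 5 positives; the "model" always predicts the negative class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

Accuracy looks high because the majority class dominates, while F1 reflects that no positive case was ever detected.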

31 What is the risk of relying solely on automated AI tools for data analysis without human oversight?

A. The analysis will be too fast
B. The AI may hallucinate trends or misinterpret context that requires domain expertise
C. The data will become encrypted
D. The tools cannot handle large numbers

32 In a box plot, what does the box itself typically represent?

A. The full range of the data
B. The Interquartile Range (IQR) containing the middle 50% of the data
C. The outliers only
D. The standard deviation
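
Note: the quartiles that bound the box can be computed with `statistics.quantiles` (Python 3.8+). The sample data is hypothetical, and plotting tools may use slightly different quartile conventions, so the exact values can differ between tools.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9]

# n=4 returns the three cut points: Q1, the median, and Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # height of the box: the spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```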

33 What is Latency in the context of AI deployment?

A. The accuracy of the model
B. The time delay between a user's request and the model's response
C. The cost of the server
D. The size of the training dataset

34 Which command would you likely see in a Python script generated by ChatGPT to load a CSV file?

A. pd.read_csv('filename.csv')
B. load_data_now('filename.csv')
C. import csv_file
D. excel.open('filename.csv')

35 What is Concept Drift?

A. When the definition of the target variable changes over time (e.g., what defines 'spam' email changes)
B. When the data storage location changes
C. When the model is moved to a new cloud provider
D. When developers forget the concept of the project

36 Why is Data Cleaning considered the most time-consuming part of the AI lifecycle?

A. Computers are slow at deleting files
B. Real-world data is often incomplete, noisy, duplicated, or inconsistent
C. Data cleaning requires advanced calculus
D. AI models refuse to accept data until it is perfect

37 What is a Heatmap useful for?

A. Showing the geographical temperature only
B. Visualizing the magnitude of a phenomenon as color in two dimensions (e.g., correlation matrix)
C. Plotting a 3D object
D. Listing data in alphabetical order

38 What is Model Registry in MLOps?

A. A list of fashion models
B. A centralized repository to store, version, and manage trained machine learning models
C. A log of who accessed the computer
D. The registration fee for using cloud services

39 Which service type involves the provider managing the OS, middleware, and runtime, while you manage the data and applications?

A. On-Premises
B. IaaS
C. PaaS
D. SaaS

40 Mathematically, the Mean Squared Error (MSE) is calculated as:

A.
B.
C.
D.
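
Note: for reference, the Mean Squared Error is conventionally defined as the average squared difference between actual and predicted values. The notation is assumed here: $n$ is the number of samples, $y_i$ the actual value, and $\hat{y}_i$ the predicted value.

```latex
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]
```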

41 What is Scalability in cloud AI infrastructure?

A. The weight of the servers
B. The ability to increase or decrease computing resources based on workload demand
C. The resolution of the screen
D. The ability to run only one specific algorithm

42 When automating data pipelines, what is a Workflow Orchestrator (e.g., Apache Airflow)?

A. A musical conductor for code
B. A tool to schedule, monitor, and manage the sequence of data processing tasks
C. A database for storing images
D. A type of neural network

43 What is a False Negative (Type II Error)?

A. The model correctly predicts the negative class
B. The model incorrectly predicts the positive class
C. The model incorrectly predicts the negative class (misses a detection)
D. The model crashes

44 Which of the following is a common format for storing big data in a 'Data Lake'?

A. Parquet
B. MS Word .doc
C. PowerPoint .ppt
D. Shortcut links

45 What is the purpose of Hyperparameter Tuning?

A. To change the training data
B. To optimize the internal configuration settings of the model (e.g., learning rate, tree depth) to improve performance
C. To fix hardware overheating
D. To clean the dataset

46 In the context of ChatGPT Advanced Data Analysis, what is a Sandbox?

A. A graphical interface for drawing
B. An isolated testing environment that prevents code execution from harming the host system
C. A cloud storage drive
D. A type of dataset

47 What is the key difference between Streaming data and Static data?

A. Streaming data is continuous and unbounded; Static data is fixed and bounded
B. Streaming data is always video; Static data is text
C. Streaming data is slower
D. Static data cannot be analyzed

48 Which chart type is best for visualizing the distribution of a single numerical variable?

A. Histogram
B. Scatter Plot
C. Pie Chart
D. Network Graph

49 What is the CI/CD pipeline in the context of MLOps?

A. Continuous Integration / Continuous Deployment
B. Code Input / Code Delete
C. Cloud Integration / Cloud Data
D. Calculated Index / Calculated Data

50 When troubleshooting, if a model has high accuracy but fails on specific demographic subgroups, this is an issue of:

A. Latency
B. Bias and Fairness
C. Overfitting
D. Data Ingestion