1. What is the primary goal of data science?
Data science and its need
Easy
Correct Answer: To extract knowledge and insights from data
Explanation:
Data science is an interdisciplinary field focused on using scientific methods, processes, algorithms, and systems to extract valuable insights from structured and unstructured data.
2. In the context of Big Data, which of the '3Vs' refers to the speed at which data is generated and processed?
Big data and its 3Vs
Easy
A.Velocity
B.Veracity
C.Volume
D.Variety
Correct Answer: Velocity
Explanation:
Velocity refers to the high speed at which data is generated, collected, and processed. This is a key characteristic of Big Data, such as data from social media streams or sensor networks.
3. Which of the following tools is primarily used for data visualization and creating interactive dashboards?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Easy
A.Git
B.Apache Hadoop
C.Tableau
D.Microsoft Word
Correct Answer: Tableau
Explanation:
Tableau is a powerful and popular data visualization tool used in the Business Intelligence industry. It helps in simplifying raw data into an easily understandable visual format.
4. Which job role is primarily responsible for analyzing complex data to help a company make better business decisions?
Job roles and skillset for Data science and Big data
Easy
A.Network Administrator
B.Web Developer
C.Data Scientist
D.Graphic Designer
Correct Answer: Data Scientist
Explanation:
A Data Scientist's core function is to analyze large amounts of complex data, both structured and unstructured, to identify patterns and trends that can inform strategic business decisions.
5. What does the 'Volume' in the 3Vs of Big Data represent?
Big data and its 3Vs
Easy
A.The large amount of data
B.The accuracy of the data
C.The speed of data generation
D.The different types of data
Correct Answer: The large amount of data
Explanation:
Volume refers to the immense scale or quantity of data generated and stored. This is one of the defining characteristics of Big Data, often measured in terabytes, petabytes, or even exabytes.
6. An e-commerce website suggesting products to you based on your previous purchases is a common application of what?
Applications of data science/Big data
Easy
A.Network Security
B.Data Science
C.Database Administration
D.Software Testing
Correct Answer: Data Science
Explanation:
This is a classic example of a recommendation engine, which is a key application of data science. It uses algorithms to analyze user behavior and predict what a user might like.
7. What is generally considered the first step in the data science lifecycle?
Data science Lifecycle with use case
Easy
A.Business Understanding and Problem Definition
B.Model Deployment
C.Data Collection
D.Data Visualization
Correct Answer: Business Understanding and Problem Definition
Explanation:
Before any data is collected or analyzed, it's crucial to first understand the business problem you are trying to solve. This step defines the objectives and scope of the project.
8. What is the primary purpose of Apache Hadoop?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Easy
A.To write and compile C++ code
B.To manage email servers
C.To create visual graphics and art
D.To store and process very large datasets across clusters of computers
Correct Answer: To store and process very large datasets across clusters of computers
Explanation:
Hadoop is an open-source framework designed for distributed storage (using HDFS) and distributed processing (using MapReduce) of Big Data.
9. Which of the following is a major challenge associated with Big Data?
Challenges of Big data
Easy
A.Lack of available software
B.Ensuring data security and privacy
C.Computers being too fast
D.Having too little data to analyze
Correct Answer: Ensuring data security and privacy
Explanation:
Handling vast amounts of data, which often includes sensitive personal information, brings significant challenges in terms of storage, security, and adhering to privacy regulations.
10. The term 'Variety' in Big Data refers to:
Big data and its 3Vs
Easy
A.The number of users accessing the data
B.The financial value of the data
C.The many different types and sources of data (e.g., text, image, video)
D.The physical location where data is stored
Correct Answer: The many different types and sources of data (e.g., text, image, video)
Explanation:
Variety describes the different forms of data, including structured (like in a database), semi-structured (like XML files), and unstructured (like text documents, emails, videos, and images).
11. Which of the following is a fundamental programming skill for a data scientist?
Skill needed for Big data
Easy
A.Knowledge of a language like Python or R
B.Ability to design logos
C.Experience in hardware repair
D.Expertise in HTML and CSS
Correct Answer: Knowledge of a language like Python or R
Explanation:
Programming languages like Python and R are essential tools for data scientists for data manipulation, statistical analysis, and machine learning.
12. In the healthcare industry, what is a key use of Big Data?
Use of Big Data in different areas
Easy
A.Designing hospital architecture
B.Scheduling appointments manually
C.Predicting disease outbreaks and patient outcomes
D.Manufacturing surgical tools
Correct Answer: Predicting disease outbreaks and patient outcomes
Explanation:
By analyzing large-scale health data, organizations can identify patterns to predict the spread of diseases, personalize treatments, and improve overall patient care.
13. Which of the following is a programming language specifically popular for statistical computing and graphics?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Easy
A.R language
B.C#
C.Java
D.HTML
Correct Answer: R language
Explanation:
R is a language and environment widely used among statisticians and data miners for developing statistical software and data analysis.
14. What is a primary benefit of using cloud platforms like AWS or Azure for Big Data analytics?
Big Data on the Cloud
Easy
A.It is always free of charge
B.It offers scalability and pay-as-you-go pricing
C.It requires managing physical servers in-house
D.It works only with small datasets
Correct Answer: It offers scalability and pay-as-you-go pricing
Explanation:
Cloud computing allows organizations to easily scale their storage and computing resources up or down as needed, and they typically only pay for the resources they consume, which is cost-effective.
15. What is the main responsibility of a Data Engineer?
Job roles and skillset for Data science and Big data
Easy
A.Creating marketing campaigns
B.Building and maintaining the data pipelines and infrastructure
C.Providing customer support
D.Designing user interfaces for websites
Correct Answer: Building and maintaining the data pipelines and infrastructure
Explanation:
Data Engineers are responsible for designing, building, and managing the systems that collect, store, and process data at scale, making it available for data scientists to analyze.
16. After a data science model has been created and evaluated, what is the typical next step in the lifecycle?
Data science Lifecycle with use case
Easy
A.Deployment
B.Starting a new project
C.Business Understanding
D.Deleting all the data
Correct Answer: Deployment
Explanation:
Deployment involves integrating the model into a production environment where it can be used by the business to make decisions and provide value, such as a recommendation engine on a live website.
17. For quick and basic data entry, sorting, and creating simple charts, which desktop application is most commonly used?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Easy
A.SQL Server
B.Microsoft Excel
C.TensorFlow
D.Apache Spark
Correct Answer: Microsoft Excel
Explanation:
Excel is a powerful spreadsheet program that is excellent for managing smaller datasets, performing basic calculations, and creating straightforward charts and graphs.
18. Why is data quality a significant challenge in Big Data?
Challenges of Big data
Easy
A.Inaccurate or incomplete data leads to flawed insights and decisions
B.There is no way to measure data quality
C.High-quality data is too expensive to buy
D.High-quality data takes up too much storage space
Correct Answer: Inaccurate or incomplete data leads to flawed insights and decisions
Explanation:
The principle of 'Garbage In, Garbage Out' is critical in Big Data. If the data used for analysis is of poor quality (inaccurate, inconsistent, or incomplete), the results of the analysis will be unreliable.
19. Data science is described as an interdisciplinary field because it combines principles from:
Data science and its need
Easy
A.Literature, Music, and Philosophy
B.History, Geography, and Art
C.Manufacturing, Logistics, and Human Resources
D.Statistics, Computer Science, and Domain Expertise
Correct Answer: Statistics, Computer Science, and Domain Expertise
Explanation:
Data science sits at the intersection of multiple fields: it uses statistical techniques, computational methods from computer science, and requires specific knowledge of the business or scientific domain it's being applied to.
20. How do banks and financial institutions primarily use Big Data?
Use of Big Data in different areas
Easy
A.To design the interior of their branch offices
B.For fraud detection and risk assessment
C.To organize employee social events
D.To choose the color of their logo
Correct Answer: For fraud detection and risk assessment
Explanation:
By analyzing transaction data in real-time, Big Data systems can identify unusual patterns that may indicate fraudulent activity, helping to prevent financial losses and protect customers.
21. A retail company wants to use its historical sales data to forecast demand for the next quarter to optimize inventory. This scenario primarily demonstrates the need for data science to enable what kind of analytics?
Data science and its need
Medium
A.Descriptive Analytics
B.Prescriptive Analytics
C.Diagnostic Analytics
D.Predictive Analytics
Correct Answer: Predictive Analytics
Explanation:
Predictive analytics focuses on using historical data to make forecasts about future events. In this case, the company is predicting future demand, which is a classic application of predictive modeling in data science.
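To make this concrete, the sketch below fits a simple trend model on invented quarterly sales figures and forecasts the next quarter; it is a minimal illustration in base R, not a production forecasting method.
Example (R):
# Invented quarterly sales; fit a linear trend and forecast the next quarter
sales <- data.frame(quarter = 1:8,
                    units = c(120, 135, 150, 160, 175, 190, 205, 220))
fit <- lm(units ~ quarter, data = sales)          # simple linear trend model
predict(fit, newdata = data.frame(quarter = 9))   # predicted demand for quarter 9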
22. A social media platform analyzes text posts, images, and videos uploaded by its users to understand trending topics. The combination of these different data formats best illustrates which 'V' of Big Data?
Big data and its 3Vs
Medium
A.Variety
B.Velocity
C.Volume
D.Veracity
Correct Answer: Variety
Explanation:
Variety refers to the different types of data, including structured (like database tables), semi-structured (like XML/JSON), and unstructured (like text, images, videos). Analyzing these diverse formats is a key aspect of handling Big Data's variety.
23. In a project to build a customer churn prediction model, a data scientist spends significant time cleaning data, handling missing values, and creating new features like 'customer tenure'. Which phase of the data science lifecycle are they currently in?
Data science Lifecycle with use case
Medium
A.Model Building
B.Data Preparation (Wrangling)
C.Model Deployment
D.Business Understanding
Correct Answer: Data Preparation (Wrangling)
Explanation:
Data Preparation (also known as data wrangling or preprocessing) is the crucial step after data acquisition. It involves cleaning, transforming, and feature engineering to make the data suitable for modeling. This is often the most time-consuming phase.
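As a minimal illustration of this phase, the R sketch below imputes a missing value and engineers a 'customer tenure' feature; the column names and values are invented.
Example (R):
customers <- data.frame(
  id = 1:4,
  monthly_fee = c(29, NA, 49, 39),
  signup_date = as.Date(c("2021-03-01", "2022-07-15", "2020-11-30", "2023-01-10")))
# Impute the missing fee with the median (one simple, common strategy)
customers$monthly_fee[is.na(customers$monthly_fee)] <-
  median(customers$monthly_fee, na.rm = TRUE)
# Feature engineering: tenure in days as of a fixed reference date
customers$tenure_days <- as.numeric(as.Date("2024-01-01") - customers$signup_date)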
24. A research team needs to perform complex statistical analysis and create custom visualizations for a scientific paper. They have a moderately sized dataset (a few hundred megabytes). Which tool provides the most flexibility and power for this specific task?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Medium
A.Apache Hadoop
B.R language
C.Tableau
D.Microsoft Excel
Correct Answer: R language
Explanation:
R is a programming language specifically designed for statistical computing and graphics. It offers a vast library of packages for advanced statistical tests and highly customizable visualizations, making it ideal for academic and research purposes. Excel has limitations in statistical depth, Hadoop is for distributed processing of huge datasets, and Tableau is primarily for interactive business intelligence dashboards.
25. A financial institution aggregates data from multiple sources to assess credit risk. They discover that data from one source uses a different currency format and has many spelling errors, leading to incorrect analysis. This problem is most closely related to which challenge of Big Data?
Challenges of Big data
Medium
A.Data Storage
B.Data Quality and Veracity
C.Data Processing Speed
D.Data Security
Correct Answer: Data Quality and Veracity
Explanation:
Veracity refers to the trustworthiness, quality, and accuracy of data. Inconsistent formats, errors, and discrepancies are all issues of data veracity, which is a major challenge when integrating data from disparate sources.
26. A team is working on a big data project. One member is responsible for designing, building, and maintaining the scalable data pipelines using tools like Spark and Kafka to move data from source systems to a data lake. What is this person's most likely job role?
Job roles and skillset for Data science and Big data
Medium
A.Data Engineer
B.Data Scientist
C.Data Analyst
D.Business Intelligence Developer
Correct Answer: Data Engineer
Explanation:
A Data Engineer's primary role is to build and manage the data infrastructure and architecture. This includes creating data pipelines (ETL/ELT processes) to ensure data is available and accessible for data scientists and analysts to work with.
27. A startup is launching a new application that is expected to generate massive amounts of user data, but the initial volume is small. Why would a cloud-based Big Data solution like AWS EMR or Google Cloud Dataproc be a more strategic choice for them than building an on-premise Hadoop cluster?
Big Data on the Cloud
Medium
A.It offers higher processing speeds for small data.
B.It provides better data security by default.
C.It allows for scalability and a pay-as-you-go model, reducing initial capital expenditure.
D.It eliminates the need for data scientists.
Correct Answer: It allows for scalability and a pay-as-you-go model, reducing initial capital expenditure.
Explanation:
Cloud platforms offer elasticity, meaning the startup can scale its resources up or down based on actual data volume and processing needs. The pay-as-you-go model avoids the large upfront investment (capital expenditure) required to build and maintain a physical, on-premise data center.
28. A city's transportation department uses real-time GPS data from buses and traffic sensors to dynamically adjust traffic light timings and reroute public transport to minimize congestion. This is a practical application of Big Data in which sector?
Use of Big Data in different areas
Medium
A.Smart Cities / Urban Planning
B.Finance
C.Retail
D.Healthcare
Correct Answer: Smart Cities / Urban Planning
Explanation:
This scenario is a hallmark of Smart City initiatives, where large volumes of real-time data (Big Data) from IoT devices and sensors are analyzed to improve urban services like transportation, resource management, and public safety.
29. An IoT-based weather monitoring system collects sensor readings (temperature, humidity, pressure) every second from thousands of distributed devices. This continuous, high-speed data generation primarily emphasizes which 'V' of Big Data?
Big data and its 3Vs
Medium
A.Value
B.Volume
C.Velocity
D.Variety
Correct Answer: Velocity
Explanation:
Velocity refers to the speed at which data is generated, collected, and processed. The key aspect of this scenario is the constant, high-frequency stream of data from thousands of sources, which requires systems capable of handling this rapid influx.
30. After building a predictive model, a data scientist presents the findings to stakeholders using visualizations and a summary report, explaining how the model can help achieve a 10% reduction in operational costs. This action is a key part of which lifecycle phase?
Data science Lifecycle with use case
Medium
A.Communication / Reporting
B.Model Building
C.Data Preparation
D.Data Acquisition
Correct Answer: Communication / Reporting
Explanation:
The data science lifecycle doesn't end with a functional model. The Communication phase is critical for translating the technical results into actionable business insights and demonstrating the value of the findings to non-technical stakeholders.
31. A data professional is tasked with analyzing unstructured text from customer reviews to identify common themes and sentiment. Which combination of skills is most essential for this task?
Skill needed for Big data
Medium
A.A/B Testing and Experimental Design
B.SQL and Database Management
C.Natural Language Processing (NLP) and Text Mining
D.ETL Pipeline Development and Data Warehousing
Correct Answer: Natural Language Processing (NLP) and Text Mining
Explanation:
Analyzing unstructured text requires specialized skills. NLP and text mining are specific fields within data science focused on enabling computers to understand, process, and derive meaning from human language, which is exactly what is needed for sentiment analysis of reviews.
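A toy word-list scorer in R hints at the idea; real NLP libraries are far more sophisticated, and the tiny lexicons below are invented for illustration only.
Example (R):
positive <- c("great", "love", "excellent")
negative <- c("bad", "slow", "broken")
score_review <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]   # crude tokenization
  sum(words %in% positive) - sum(words %in% negative)
}
score_review("Great product, but shipping was slow")  # 1 positive - 1 negative = 0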
32. A large corporation needs to process several petabytes of historical log data in a distributed and fault-tolerant manner. The primary goal is batch processing to generate aggregated reports. Which tool is specifically designed for this type of large-scale, distributed data processing?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Medium
Correct Answer: Apache Hadoop
Explanation:
Apache Hadoop is an open-source framework designed for distributed storage (HDFS) and distributed processing (MapReduce, Spark) of very large data sets across clusters of computers. It is the ideal tool for batch processing petabyte-scale data in a reliable way.
33. An e-commerce website shows a customer a personalized list of 'Products you may also like' based on their browsing history and previous purchases. This feature is a direct application of what data science technique?
Applications of data science/Big data
Medium
Correct Answer: A recommendation engine
Explanation:
Recommendation engines are a very common application of data science. They use techniques like collaborative filtering (finding users with similar tastes) or content-based filtering to predict what a user might be interested in, thereby personalizing their experience.
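A minimal user-based collaborative-filtering sketch in R follows; the ratings matrix is invented (0 means unrated), and real recommenders work at far larger scale.
Example (R):
ratings <- matrix(c(5, 3, 0, 1,
                    4, 0, 0, 1,
                    1, 1, 0, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("p1", "p2", "p3", "p4")))
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# Similarity of u1 to the other users; items liked by the nearest
# neighbour become recommendation candidates for u1
sims <- apply(ratings[-1, ], 1, cosine, b = ratings["u1", ])
sims  # u2 scores highest, so u2's liked items are candidates for u1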
34. Why is data science considered an interdisciplinary field rather than a single, isolated subject?
Data science and its need
Medium
A.Because it only uses computer science principles.
B.Because it is only useful for businesses and not for scientific research.
C.Because it combines elements from statistics, computer science, and domain expertise.
D.Because it relies solely on creating data visualizations.
Correct Answer: Because it combines elements from statistics, computer science, and domain expertise.
Explanation:
Data science's power comes from its interdisciplinary nature. It requires statistical knowledge to build and evaluate models, computer science skills to handle and process data, and domain expertise to understand the context, ask the right questions, and interpret the results correctly.
35. A company wants to implement a big data analytics platform but is concerned about complying with regulations like GDPR and CCPA, which govern how customer data is collected, stored, and used. This represents which significant challenge of Big Data?
Challenges of Big data
Medium
A.Financial (Cost of Infrastructure)
B.Technological (Scalability)
C.Analytical (Finding insights)
D.Governance, Security, and Privacy
Correct Answer: Governance, Security, and Privacy
Explanation:
Data governance, security, and privacy are critical challenges. Handling large volumes of personal data makes companies responsible for protecting that data and complying with a complex web of legal and ethical regulations, such as GDPR.
36. What is the primary advantage of using a cloud data warehouse like Google BigQuery or Amazon Redshift over a traditional on-premise data warehouse for big data analytics?
Big Data on the Cloud
Medium
A.They offer weaker security features than on-premise solutions.
B.They separate storage and compute resources, allowing independent scaling.
C.They are only suitable for very small datasets.
D.They completely eliminate the need for SQL.
Correct Answer: They separate storage and compute resources, allowing independent scaling.
Explanation:
A key architectural advantage of modern cloud data warehouses is the decoupling of storage and compute. This allows a company to scale its storage capacity to hold massive amounts of data cheaply, while scaling its compute power up or down only when needed for running queries, optimizing costs and performance.
37. In precision agriculture, farmers use data from drones, soil sensors, and weather satellites to make decisions about irrigation, fertilization, and pest control for specific small sections of their fields. This practice demonstrates an application of Big Data to:
Use of Big Data in different areas
Medium
A.Analyze financial market trends for crop prices.
B.Optimize resource usage and increase crop yield.
C.Manage the logistics of food transportation.
D.Increase marketing effectiveness for farm products.
Correct Answer: Optimize resource usage and increase crop yield.
Explanation:
By collecting and analyzing a high volume and variety of data, precision agriculture allows farmers to move from uniform field treatment to highly optimized and localized care. This leads to more efficient use of resources like water and fertilizer, ultimately boosting crop yields and sustainability.
38. A manager needs a report summarizing last quarter's sales performance, including key metrics and charts. This person is not looking for a predictive model, but a clear explanation of what happened. Who is the most appropriate professional to handle this request?
Job roles and skillset for Data science and Big data
Medium
A.Data Analyst
B.Data Engineer
C.Machine Learning Engineer
D.Database Administrator
Correct Answer: Data Analyst
Explanation:
A Data Analyst's role is centered on descriptive and diagnostic analytics. They examine data to identify trends, create reports, and build dashboards to answer specific business questions about past performance, which is exactly what the manager is asking for.
39. A data scientist needs to explain the logic behind a complex model's prediction to a non-technical audience to gain their trust. Which skill is most crucial in this situation?
Skill needed for Big data
Medium
A.Distributed Computing
B.Deep Learning theory
C.Advanced Python programming
D.Data Storytelling and Visualization
Correct Answer: Data Storytelling and Visualization
Explanation:
Data storytelling is the ability to communicate complex insights from data in a clear, concise, and compelling narrative. Combined with effective visualization, it allows a data scientist to bridge the gap between technical analysis and business action, which is essential for stakeholder buy-in.
40. A credit card company develops a system that analyzes transactions in real-time. If a transaction pattern deviates significantly from a user's normal spending behavior (e.g., a large purchase in a foreign country), it is flagged for review. This is a classic application of data science for:
Applications of data science/Big data
Medium
A.Sentiment Analysis
B.Sales Forecasting
C.Fraud Detection
D.Customer Segmentation
Correct Answer: Fraud Detection
Explanation:
This system is an example of an anomaly detection model used for fraud detection. Data science techniques are used to learn a customer's normal behavior and then identify transactions that are highly improbable based on that learned pattern, indicating potential fraud.
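A minimal anomaly-detection sketch in R illustrates the principle; the history and the 3-standard-deviation threshold are invented, and real systems use far richer models.
Example (R):
history <- c(25, 40, 32, 28, 35, 30, 27, 33)    # a user's past transaction amounts
new_txn <- 950
z <- (new_txn - mean(history)) / sd(history)    # how unusual is the new amount?
abs(z) > 3                                      # TRUE: flag the transaction for review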
41. A deployed churn prediction model for a subscription service suddenly shows a significant drop in performance (e.g., AUC from 0.85 to 0.60). The retraining pipeline, which runs weekly on new data, does not improve the score. Which of the following scenarios is the most likely root cause that would necessitate a full return to the Business Understanding phase of the data science lifecycle?
Data science Lifecycle with use case
Hard
A.The company launched a new 'annual subscription' plan, fundamentally changing the definition and drivers of customer churn.
B.The customer base grew rapidly, introducing data drift where new customer behavior differs from the training data.
C.A data pipeline feeding customer support ticket information into the model features broke, leading to null values.
D.The model is overfitting to the original training data and does not generalize well to the new weekly data.
Correct Answer: The company launched a new 'annual subscription' plan, fundamentally changing the definition and drivers of customer churn.
Explanation:
While options B, C, and D are serious issues, they can typically be addressed within the Data Preparation, Modeling, or Evaluation phases. A broken pipeline (C) is a data engineering issue. Data drift (B) and overfitting (D) are modeling problems solved by retraining with new data, different algorithms, or regularization. However, a change in the fundamental business process (A) invalidates the original problem definition. The very meaning of 'churn' has changed, requiring a complete restart from the Business Understanding phase to redefine the problem, goals, and success metrics.
42. A high-frequency trading (HFT) firm is developing an arbitrage detection system. The system must process millions of market data ticks per second from multiple exchanges and execute trades within microseconds. While all 3Vs are present, which 'V' poses the most significant algorithmic and architectural challenge for this specific use case?
Big data and its 3Vs
Hard
A.Veracity, because occasional bad ticks or data feed errors can trigger disastrously wrong trades, making data quality the paramount concern.
B.Volume, because storing petabytes of historical tick data for back-testing models is the most resource-intensive part of the HFT lifecycle.
C.Variety, because the data comes from different exchanges with slightly different formats, requiring complex data integration logic.
D.Velocity, because the core challenge is making complex decisions on streaming data under extreme low-latency constraints, which dictates the choice of in-memory processing and stream-based algorithms.
Correct Answer: Velocity, because the core challenge is making complex decisions on streaming data under extreme low-latency constraints, which dictates the choice of in-memory processing and stream-based algorithms.
Explanation:
In HFT, the competitive advantage lies in speed. The primary challenge is not just processing a large volume of data, but doing so in near real-time. This high velocity dictates the entire system architecture, favoring tools like Apache Flink or custom C++ applications over batch-processing systems like Hadoop. The algorithms must be able to learn and decide 'on the fly' (online learning), which is a much harder problem than batch processing a large static volume. While the other Vs are valid challenges, Velocity is the defining constraint that shapes the entire solution.
43. A healthcare provider anonymizes two separate datasets: one with patient diagnoses and zip codes, and another with mobile phone location data (geohashes) and zip codes. They plan to merge these datasets on the 'zip code' field for a public health study. What is the most profound big data challenge this action creates?
Challenges of Big data
Hard
A.Re-identification risk and data privacy, as combining two anonymized datasets can create a rich, composite profile that makes it possible to de-anonymize individuals, violating privacy principles like GDPR.
B.Data integration, because matching zip codes between two large datasets from different systems can be computationally expensive and prone to formatting errors.
C.Data veracity, because location data from mobile phones can be inaccurate, leading to incorrect linkages with diagnostic data.
D.Data storage, as the merged dataset could become too large for traditional relational database systems.
Correct Answer: Re-identification risk and data privacy, as combining two anonymized datasets can create a rich, composite profile that makes it possible to de-anonymize individuals, violating privacy principles like GDPR.
Explanation:
This question highlights a sophisticated challenge beyond simple technical issues. The key problem is the 'mosaic effect', where multiple anonymized datasets, when joined, can reveal personal identities. A unique travel pattern (from location data) within a specific zip code could be linked to a rare diagnosis, effectively de-anonymizing an individual. This poses a massive ethical and legal challenge (e.g., under HIPAA and GDPR) that transcends the technical problems of integration, storage, or veracity.
44. A data science team needs to perform sentiment analysis on 10 terabytes of unstructured customer reviews stored as text files. The goal is to build a classification model and then create an interactive dashboard for the marketing team to explore sentiment trends by product and region. Which toolchain is most appropriately designed for this entire end-to-end task?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Hard
A.Tableau alone to connect directly to the text files, using its built-in calculation fields to perform sentiment analysis and create the dashboard.
B.Microsoft Excel with Power Query to import the text files, a VBA script for sentiment analysis, and PivotCharts for the dashboard.
C.Apache Hadoop (HDFS/MapReduce or Spark) for distributed processing of the text files, R/Python with NLP libraries for model building on a sampled or aggregated dataset, and Tableau for connecting to the aggregated results for visualization.
D.R language alone on a powerful server to read all 10TB into memory and perform the analysis and visualization using packages like shiny.
Correct Answer: Apache Hadoop (HDFS/MapReduce or Spark) for distributed processing of the text files, R/Python with NLP libraries for model building on a sampled or aggregated dataset, and Tableau for connecting to the aggregated results for visualization.
Explanation:
This option presents the most realistic and scalable architecture. The 10TB of unstructured data is far too large for Excel (B) or a single R session (D), making a distributed system like Hadoop/Spark essential for the initial heavy-lifting (ETL and feature extraction). Tableau (A) is a visualization tool and lacks the sophisticated NLP and modeling capabilities needed. The correct workflow is to use the right tool for each stage: Hadoop/Spark for big data processing, a statistical language like R/Python for specialized modeling, and a BI tool like Tableau for interactive visualization of the final, aggregated results.
45. A company's data science team has successfully developed a highly accurate fraud detection model in a Jupyter Notebook. The business now requires this model to be integrated into their live transaction processing system, which handles thousands of requests per second with a latency requirement of <50ms. The process of taking the model from the notebook to a scalable, low-latency, production-ready API is primarily the responsibility of which role?
Job roles and skillset for Data science and Big data
Hard
A.Big Data Architect, who designs the overall data storage and processing infrastructure.
B.Data Scientist, who created the model and is responsible for its accuracy and performance.
C.Machine Learning Engineer, who specializes in model deployment, automation, scalability, and MLOps practices.
D.Data Analyst, who is responsible for interpreting the model's output and creating performance reports.
Correct Answer: Machine Learning Engineer, who specializes in model deployment, automation, scalability, and MLOps practices.
Explanation:
This question highlights the critical distinction between roles. A Data Scientist's primary focus is on research, exploration, and building the model. A Machine Learning Engineer specializes in the 'ops' part of 'MLOps' – taking a validated model and engineering a robust, scalable, and efficient production service around it. This includes containerization (e.g., Docker), creating APIs (e.g., Flask/FastAPI), ensuring low latency, and setting up monitoring. While the Data Scientist provides the model artifact, the ML Engineer productionizes it.
46. A financial analytics firm needs to process a 50TB dataset of historical stock data. Their primary workload consists of complex, ad-hoc analytical queries from a team of 10 analysts. The queries are unpredictable and computationally intensive. They want a cloud solution that minimizes infrastructure management and operates on a pay-per-query pricing model. Which cloud big data solution best fits this requirement?
Big Data on the Cloud
Hard
A.Google BigQuery, because it's a serverless data warehouse that abstracts away infrastructure and charges based on the amount of data scanned by each query.
B.AWS Redshift, because it is a petabyte-scale, managed columnar data warehouse optimized for high-performance BI.
C.AWS EMR (Elastic MapReduce), which is a managed Hadoop service, to spin up clusters for specific jobs and then shut them down.
D.A self-managed Hadoop cluster on AWS EC2 instances, because it offers maximum control and customization over the processing environment.
Correct Answer: Google BigQuery, because it's a serverless data warehouse that abstracts away infrastructure and charges based on the amount of data scanned by each query.
Explanation:
The key requirements are: minimal management ('serverless'), ad-hoc complex queries, and a pay-per-query model. Google BigQuery is designed precisely for this use case. It separates storage and compute, allowing users to run SQL queries without provisioning or managing any clusters. A self-managed Hadoop cluster (D) is the opposite of minimal management. AWS EMR (C) is good for transient, batch-oriented jobs, not a persistent environment for ad-hoc queries. AWS Redshift (B) is a provisioned warehouse (not serverless) and is priced based on running cluster hours, which is less cost-effective for sporadic, unpredictable query workloads.
47. A data scientist builds a loan default prediction model with 99% accuracy on a historically biased dataset. When deployed, the model systematically denies loans to qualified applicants from minority groups. The modeler did not use protected attributes like race directly, but the model learned proxies (e.g., zip codes). This scenario reveals a critical deficiency in which specific data science skill?
Skill needed for Big data
Hard
A.Ethical judgment and bias detection, which involves proactively auditing data and models for fairness and unintended social impact.
B.Feature engineering, as the data scientist failed to create features that were uncorrelated with protected attributes.
C.Algorithm selection, as a different algorithm like a simple logistic regression might have been less biased.
D.Model evaluation, because accuracy was the wrong metric to use for an imbalanced dataset.
Correct Answer: Ethical judgment and bias detection, which involves proactively auditing data and models for fairness and unintended social impact.
Explanation:
While feature engineering, algorithm selection, and evaluation metrics are all relevant, the core failure is a lack of ethical consideration and fairness auditing. A skilled data scientist must understand that models can perpetuate and even amplify societal biases present in historical data. The critical skill is to recognize this risk before and during modeling, use fairness metrics (e.g., demographic parity, equalized odds), and implement mitigation techniques. It's a higher-level skill that governs the application of the more technical skills mentioned in the other options.
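A minimal fairness check in R shows the idea behind demographic parity: compare approval rates across groups. The decisions below are invented, and real audits use dedicated fairness tooling and multiple metrics.
Example (R):
decisions <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
                        approved = c(1, 1, 0, 0, 0, 1))
rates <- tapply(decisions$approved, decisions$group, mean)
rates               # approval rate per group
diff(range(rates))  # a large gap between groups is a red flag worth auditing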
48. In precision agriculture, big data from IoT sensors, drones, and satellites is used to optimize crop yield. A key application is variable rate irrigation, where different parts of a field receive different amounts of water. What combination of Big Data characteristics makes this a particularly complex problem?
Use of Big Data in different areas
Hard
A.High Variety and Veracity: Integrating diverse data (soil moisture, drone imagery, weather forecasts) and dealing with sensor noise/failure is the core challenge.
B.High Volume only: The sheer amount of satellite imagery is the only significant big data challenge to overcome.
C.Low Volume and high Veracity: The data is small and clean, making it a simple analytical problem.
D.High Velocity only: The speed of data from real-time soil sensors is the most critical factor.
Correct Answer: High Variety and Veracity: Integrating diverse data (soil moisture, drone imagery, weather forecasts) and dealing with sensor noise/failure is the core challenge.
Explanation:
Precision agriculture is a classic example of a data fusion problem. The main difficulty isn't just the volume or speed, but the Variety of the data sources that must be integrated to make a single, coherent decision. You must combine point-in-time soil sensor data (structured), multispectral drone images (unstructured), and predictive weather model outputs (semi-structured). Furthermore, the Veracity is a major issue, as IoT sensors can fail or provide noisy readings, and satellite data can be affected by cloud cover, requiring sophisticated cleaning and imputation techniques.
49. A retail company's executive team asks their new data scientist to "Use AI to increase our profits." Why is this initial request a poor starting point for a data science project, and what does it demonstrate a need for?
Data science and its need
Hard
A.The request assumes AI is the solution. It demonstrates the need for the data scientist to have advanced machine learning skills to build a complex profit-optimization algorithm.
B.The request is too vague and lacks a specific, measurable business problem. It demonstrates the need for the data scientist to apply problem formulation and business acumen skills to translate a general goal into a concrete, solvable data science problem (e.g., 'reduce customer churn by 5%').
C.The request focuses on profit instead of customer satisfaction. It demonstrates the need for the company to have a stronger ethical framework.
D.The request is not technically feasible. It demonstrates the need for better data infrastructure before any AI projects can be started.
Correct Answer: The request is too vague and lacks a specific, measurable business problem. It demonstrates the need for the data scientist to apply problem formulation and business acumen skills to translate a general goal into a concrete, solvable data science problem (e.g., 'reduce customer churn by 5%').
Explanation:
Data science does not start with an algorithm; it starts with a well-defined problem. "Increase profits" is a business goal, not a data science problem. The critical, and often most difficult, first step is to collaborate with stakeholders to break down this goal into specific, quantifiable questions that can be answered with data, such as 'Can we predict which customers are likely to churn?' or 'Can we optimize inventory to reduce holding costs?'. This translation from a vague business objective to a specific analytical plan is a core skill for an effective data scientist.
50. A genomics research institute processes full human genomes. Each genome is ~100GB (high Volume). They are integrating this with unstructured clinical notes and patient-reported outcomes from a mobile app. A fourth 'V', Veracity, is often added to the 3Vs. In this specific context, what is the most critical implication of low Veracity?
Big data and its 3Vs
Hard
A.Processing delays: The sheer speed of data from sequencers (Velocity) is the primary bottleneck, not the data's accuracy.
B.Integration challenges: The diverse data formats (Variety) are much harder to handle than potential inaccuracies within the data.
C.Storage costs: The volume of genomic data is the only significant financial and technical hurdle.
D.False discoveries: Inaccurate gene sequencing or misinterpretation of clinical notes could lead to incorrect correlations between genes and diseases, invalidating research findings and potentially harming patients.
Correct Answer: False discoveries: Inaccurate gene sequencing or misinterpretation of clinical notes could lead to incorrect correlations between genes and diseases, invalidating research findings and potentially harming patients.
Explanation:
In scientific and medical research, the correctness and trustworthiness of the data (Veracity) are paramount. While Volume, Velocity, and Variety are technical challenges, low Veracity has profound real-world consequences. A single error in a gene sequence or a misclassified symptom from clinical notes could lead researchers down a wrong path for years, resulting in wasted resources and, more importantly, potentially leading to incorrect medical treatments or drug development targets. In this domain, the cost of being wrong is exceptionally high, making Veracity the most critical concern.
51. An analyst needs to investigate a potential data quality issue in a 2-billion-row dataset stored in a Hadoop cluster. They need to perform a series of complex aggregations and checks (e.g., find the count of nulls per column, calculate distributions, check for outliers). Writing a full Spark job in Python/Scala is too slow for this interactive, exploratory task. Which tool or approach would be most efficient for this specific scenario?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Hard
A.Writing a custom MapReduce job in Java to calculate the required statistics.
B.Using Tableau to connect directly to the Hadoop cluster and build a dashboard to find the anomalies.
C.Using an interactive SQL query engine like Apache Hive LLAP or Presto/Trino that sits on top of the Hadoop data lake.
D.Exporting a 1% sample of the data into a CSV file and analyzing it with Microsoft Excel.
Correct Answer: Using an interactive SQL query engine like Apache Hive LLAP or Presto/Trino that sits on top of the Hadoop data lake.
Explanation:
The key here is the need for interactive exploration on a massive dataset. A custom MapReduce job (A) has high latency due to job submission overhead. Sampling to Excel (D) is dangerous because the data quality issue might be rare and missed in the sample. Tableau (B) is for visualization and may struggle or be slow with the raw, multi-billion-row dataset for the kind of deep data profiling required. Interactive SQL engines like Presto or Hive LLAP are specifically designed for this use case: providing a fast, SQL-based interface for data analysts to directly and interactively query massive datasets in place, without the boilerplate of a full programming job.
52. During the 'Model Deployment' phase of a data science project, the team discovers that the model's predictions, while accurate in offline tests, have a high variance in latency (from 50ms to 2000ms). This violates the service-level agreement (SLA) for the production application. This issue forces the team to revisit which earlier lifecycle phase most intensively?
Data science Lifecycle with use case
Hard
A.Model Evaluation, to choose a different accuracy metric that accounts for latency.
B.Data Collection, to acquire data that is faster to process.
C.Feature Engineering, as the latency variance is likely caused by complex features that are computationally expensive to generate in real-time.
D.Business Understanding, to renegotiate the SLA with the stakeholders.
Correct Answer: Feature Engineering, as the latency variance is likely caused by complex features that are computationally expensive to generate in real-time.
Explanation:
High variance in prediction latency in a production environment often points to the feature generation process. Some input data points might trigger complex, time-consuming calculations (e.g., a text feature requiring a slow NLP lookup, or a feature requiring a query to another slow service). This is not an issue with the model algorithm itself but with the data pipeline that feeds it. The team must go back to the Feature Engineering phase to redesign these features for consistent, low-latency computation, possibly by pre-calculating them or using simpler approximations.
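One remedy the explanation mentions is pre-calculating expensive features. A minimal R sketch of the idea follows; the slow function is an invented stand-in for a costly real-time lookup.
Example (R):
slow_feature <- function(id) { Sys.sleep(0.1); nchar(id) }  # stand-in for an expensive lookup
ids <- paste0("cust_", 1:3)
feature_cache <- sapply(ids, slow_feature)  # precompute offline, once per batch
feature_cache["cust_2"]                     # at serving time: a constant-latency lookup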
53. A city is implementing a predictive policing system, which uses historical crime data to predict locations where crime is likely to occur. From an ethical and societal perspective, what is the most significant risk of deploying such a system, even if it is statistically accurate on historical data?
Applications of data science/Big data
Hard
A.Computational Cost: Processing years of crime data and real-time inputs would require a significant investment in big data infrastructure.
B.Feedback Loop Amplification: The model may create a self-fulfilling prophecy where police are sent to predicted hotspots, leading to more arrests in those areas, which then generates more data to confirm the original prediction, amplifying existing biases.
C.Lack of Model Interpretability: Using a complex model like a deep neural network would make it impossible to explain to the public why a certain area was targeted.
D.Data Security Risks: The historical crime data could be hacked, revealing sensitive information about past incidents and victims.
Correct Answer: Feedback Loop Amplification: The model may create a self-fulfilling prophecy where police are sent to predicted hotspots, leading to more arrests in those areas, which then generates more data to confirm the original prediction, amplifying existing biases.
Explanation:
This is a critical and subtle risk in many data science applications. The model's output influences real-world actions (police deployment), which in turn generate the very data the model is trained on. If historical data reflects past policing biases (e.g., over-policing in certain neighborhoods), the model will learn this bias. Its predictions will then direct more police to those same neighborhoods, leading to more arrests, which reinforces the model's bias in the next training cycle. This creates a dangerous feedback loop that can entrench and amplify inequality, regardless of the model's technical accuracy.
54. A global e-commerce company wants to create a unified customer view by combining data from its regional databases in the European Union, the United States, and China. Beyond the technical data integration challenges, what is the most significant 'soft' challenge they will face?
Challenges of Big data
Hard
A.Data Silos: The different regional IT teams may be unwilling to share their data and control with a central authority.
B.Network Latency: Moving large amounts of data between continents will be slow and expensive.
C.Language and Character Encoding: The data will be in different languages and character sets (e.g., UTF-8, GB2312), requiring complex text processing.
D.Data Sovereignty and Regulatory Compliance: Each region has different data privacy laws (e.g., GDPR in the EU, PIPL in China) that restrict how data can be transferred, stored, and processed across borders, making a unified view legally complex.
Correct Answer: Data Sovereignty and Regulatory Compliance: Each region has different data privacy laws (e.g., GDPR in the EU, PIPL in China) that restrict how data can be transferred, stored, and processed across borders, making a unified view legally complex.
Explanation:
This is a major challenge for multinational corporations. Data sovereignty refers to the legal principle that data is subject to the laws of the country in which it is located. Laws like GDPR impose strict rules on transferring EU citizens' data outside the EU. China's PIPL has similar data localization requirements. Navigating this complex web of international regulations is often a bigger hurdle than solving the technical problems of data transfer or format conversion, and it fundamentally impacts the architecture of global big data systems.
55. A data scientist in R is working with a data.frame named sales_df with 10 million rows. They need to calculate the mean price for each category. They run the following two code snippets. Why is the data.table approach significantly faster than the tapply approach?
Code 1: tapply(sales_df$price, sales_df$category, mean)
Code 2: library(data.table); setDT(sales_df); sales_df[, mean(price), by = category]
R language
Hard
A.The tapply function is not designed for numeric data and performs slow type conversions internally, whereas data.table is specifically for numbers.
B.The setDT() function creates a physical copy of the data in a more efficient columnar format, which allows for faster access.
C.The data.table approach pre-compiles the aggregation logic into bytecode, while tapply is an interpreted function call, which is always slower.
D.The data.table package is written in C and is highly optimized for performance. It groups the data by reference using a radix sort on the grouping columns, avoiding the data copying and looping overhead inherent in base R functions like tapply.
Correct Answer: The data.table package is written in C and is highly optimized for performance. It groups the data by reference using a radix sort on the grouping columns, avoiding the data copying and looping overhead inherent in base R functions like tapply.
Explanation:
The performance difference is architectural. Base R functions like tapply or aggregate often involve internal loops and data copying, which become very inefficient on large datasets. data.table, on the other hand, is engineered for speed. Its by operation first performs an extremely fast radix sort to reorder the data's memory pointers (not the data itself) so that all group members are contiguous. It can then iterate through the groups and apply the function very efficiently. This, combined with the fact that its core operations are written in C, results in a massive performance gain for grouping operations on large data.
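A self-contained way to observe the difference is sketched below; it assumes the data.table package is installed, and timings will vary by machine.
Example (R):
library(data.table)
n <- 1e7
sales_df <- data.frame(price = runif(n, 1, 100),
                       category = sample(letters[1:20], n, replace = TRUE))
system.time(tapply(sales_df$price, sales_df$category, mean))  # base R grouping
setDT(sales_df)                                    # convert by reference (no copy)
system.time(sales_df[, mean(price), by = category])           # data.table grouping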
56. A company is deciding between a cloud data lake (e.g., storing files on AWS S3 and using Athena/Spark for queries) and a cloud data warehouse (e.g., Snowflake, Redshift, BigQuery). The company's data is a mix of structured transactional data and highly unstructured data like images and audio files. They prioritize schema flexibility and low-cost storage for raw data. Which architecture should they choose and why?
Big Data on the Cloud
Hard
A.Data Warehouse, because it provides a structured schema ('schema-on-write') which enforces data quality and delivers higher query performance for structured data.
B.A hybrid approach, using the data warehouse for structured data and the data lake for unstructured data, but this is architecturally impossible on major cloud platforms.
C.Data Lake, because it stores data in its native format ('schema-on-read') and decouples storage from compute, offering maximum flexibility for diverse data types and lower storage costs.
D.Neither, they should use a traditional on-premise relational database like Oracle, which can handle both structured and unstructured data using LOB types.
Correct Answer: Data Lake, because it stores data in its native format ('schema-on-read') and decouples storage from compute, offering maximum flexibility for diverse data types and lower storage costs.
Explanation:
The key requirements are handling diverse, unstructured data and schema flexibility. This is the primary use case for a data lake. Data warehouses require data to be cleaned, transformed, and loaded into a predefined schema (schema-on-write), which is inefficient or impossible for images and audio files. A data lake allows you to dump raw data of any type into low-cost object storage (like S3) and then define the schema at query time (schema-on-read). This provides the flexibility needed to handle a wide variety of current and future data sources. Option B is misleading; hybrid approaches are very common (e.g., a 'Lake House' architecture) but the primary choice based on the requirements is the data lake.
57. A team consists of a Data Analyst, a Data Scientist, and a Data Engineer. They are tasked with a project to analyze customer behavior. Which of the following correctly delineates the primary focus of each role in the initial phases of this project?
Job roles and skillset for Data science and Big data
Hard
A.Data Scientist: Designs and builds the data extraction pipelines. Data Engineer: Performs advanced statistical modeling. Data Analyst: Communicates the final model results to stakeholders.
B.Data Analyst: Is responsible for all the data cleaning. Data Engineer: Is responsible for all the feature engineering. Data Scientist: Is responsible only for choosing the final algorithm.
C.Data Engineer: Creates the final business-facing dashboards. Data Analyst: Deploys the machine learning models. Data Scientist: Is responsible for the cloud infrastructure budget.
D.Data Engineer: Builds robust, automated pipelines to extract and transport data. Data Scientist: Explores the raw data to formulate hypotheses and plan models. Data Analyst: Queries the processed data to create initial descriptive reports and dashboards.
Correct Answer: Data Engineer: Builds robust, automated pipelines to extract and transport data. Data Scientist: Explores the raw data to formulate hypotheses and plan models. Data Analyst: Queries the processed data to create initial descriptive reports and dashboards.
Explanation:
This option correctly describes the typical separation of concerns. The Data Engineer focuses on the infrastructure and movement of data (the 'pipes'). The Data Scientist performs the deep, open-ended exploration and predictive modeling. The Data Analyst typically works with cleaner, more structured data (often prepared by the engineer) to answer specific business questions and report on historical trends ('what happened'), while the scientist focuses more on prediction ('what will happen'). The other options incorrectly assign responsibilities across the roles.
58. When considering Apache Hadoop's core components, what is the fundamental architectural reason that YARN (Yet Another Resource Negotiator) was introduced to replace the resource management logic of the original MapReduce (MRv1)?
Tools usage like Apache Hadoop, Tableau, R language, Excel
Hard
A.To enable Hadoop to run on cloud platforms like AWS and Azure, as MRv1 was designed only for on-premise hardware.
B.To decouple resource management from the data processing framework, allowing different frameworks (like Spark, Flink, etc.), not just MapReduce, to run on the same Hadoop cluster.
C.To provide a better graphical user interface for monitoring Hadoop jobs, which was lacking in MRv1.
D.To improve the speed of the 'shuffle and sort' phase within MapReduce jobs by using a more efficient negotiation algorithm.
Correct Answer: To decouple resource management from the data processing framework, allowing different frameworks (like Spark, Flink, etc.), not just MapReduce, to run on the same Hadoop cluster.
Explanation:
In the original Hadoop MapReduce (MRv1), the JobTracker was responsible for both resource management (assigning map and reduce slots) and job lifecycle management. This tightly coupled the processing paradigm (MapReduce) to the cluster management. The key innovation of YARN was to separate these concerns. YARN's ResourceManager and NodeManagers handle the generic task of allocating resources (CPU, RAM) across the cluster, while an application-specific ApplicationMaster (e.g., one for MapReduce, one for Spark) negotiates for those resources and manages its own job's execution. This architectural shift transformed Hadoop from a single-application (MapReduce) system into a multi-purpose big data platform capable of running diverse workloads.
59. A data scientist presents a complex deep learning model to business stakeholders. The model is highly accurate, but the presentation is filled with technical jargon like 'ReLU activation functions,' 'dropout rates,' and 'backpropagation.' The stakeholders are confused and lose confidence in the project. This highlights a critical failure in which non-technical skill?
Skill needed for Big data
Hard
A.Domain Knowledge: The data scientist clearly did not understand the business domain well enough to build a useful model.
B.Scientific Method: The data scientist failed to form a proper hypothesis before building the model.
C.Storytelling and Communication: The ability to translate complex technical concepts and model results into a clear, concise narrative that connects to business impact and is understandable to a non-technical audience.
D.Data Visualization: The presentation probably lacked sufficient charts and graphs to explain the model's performance.
Correct Answer: Storytelling and Communication: The ability to translate complex technical concepts and model results into a clear, concise narrative that connects to business impact and is understandable to a non-technical audience.
Explanation:
A model, no matter how accurate, has zero value if it is not understood, trusted, and acted upon by the business. The failure described is purely one of communication. A critical skill for data scientists is the ability to abstract away the complex inner workings of a model and present its implications in the language of the business. This involves building a narrative: 'Here was the problem, here is how our solution addresses it, here is what it means for our customers/revenue, and here is how we know we can trust it.' Focusing on technical details without linking them to business outcomes is a common reason why data science projects fail to get adopted.
60. In the telecommunications industry, Big Data is used to analyze Call Detail Records (CDRs) for network optimization and churn prediction. A single CDR contains metadata like call duration, start/end time, and tower location, but not the call content. Why is analyzing the graph of connections (who calls whom) often more powerful for churn prediction than analyzing an individual's call statistics in isolation?
Use of Big Data in different areas
Hard
A.Because a customer's churn is heavily influenced by the churn of their social circle. If a person's most frequently called contacts start leaving the network (a property of the graph structure), that person is also highly likely to churn.
B.Because visualizing the call graph is the only way to understand network traffic patterns.
C.Because analyzing call duration and frequency for a single user (isolated statistics) provides no predictive power for churn.
D.Because graph databases like Neo4j are faster at processing CDR data than traditional relational databases.
Correct Answer: Because a customer's churn is heavily influenced by the churn of their social circle. If a person's most frequently called contacts start leaving the network (a property of the graph structure), that person is also highly likely to churn.
Explanation:
This question gets at a more advanced analytical concept: network effects. While an individual's calling patterns are useful, the real power comes from understanding their position within the social network. Graph analysis allows for the creation of powerful features like 'community churn influence.' A user might have stable calling habits, but if the graph reveals their entire 'community' of contacts is leaving, this is a massive red flag. This concept of social influence is a feature of the relationships between users, not the users themselves, and can only be captured by analyzing the data as a graph.
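A minimal R sketch of such a graph feature follows: for each subscriber, the fraction of their direct call contacts who have already churned. The edge list and churn labels are invented, and production systems would use a graph library at scale.
Example (R):
edges <- data.frame(from = c("u1", "u1", "u2", "u3", "u3"),
                    to   = c("u2", "u3", "u3", "u4", "u5"))
churned <- c(u1 = FALSE, u2 = TRUE, u3 = FALSE, u4 = TRUE, u5 = TRUE)
contacts <- function(u) unique(c(edges$to[edges$from == u], edges$from[edges$to == u]))
neighbour_churn <- sapply(names(churned), function(u) mean(churned[contacts(u)]))
neighbour_churn  # a high value is a strong churn-risk signal for that user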