1. Which of the following best defines Data Science?
A. The study of computer hardware manufacturing
B. A multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from data
C. The process of manually entering data into spreadsheets
D. The repair and maintenance of database servers
Correct Answer: A multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from data
Explanation: Data Science is an interdisciplinary field combining statistics, computer science, and domain expertise to extract meaningful insights from structured and unstructured data.
2. Data Science is often represented as the intersection of which three primary domains?
A. Physics, Chemistry, and Biology
B. Computer Science, Math/Statistics, and Business/Domain Knowledge
C. Networking, Hardware, and Software
D. Marketing, Sales, and HR
Correct Answer: Computer Science, Math/Statistics, and Business/Domain Knowledge
Explanation: Data Science requires a mix of hacking skills (CS), substantial math and statistical knowledge, and substantive expertise in the specific domain (Business).
3. Which of the following is NOT one of the original 3Vs of Big Data?
A. Volume
B. Velocity
C. Variety
D. Visualization
Correct Answer: Visualization
Explanation: The original 3Vs of Big Data are Volume (size), Velocity (speed), and Variety (format). Visualization is a technique used to present data, not a defining characteristic of Big Data itself.
4. In the context of Big Data, what does Velocity refer to?
A. The accuracy of the data
B. The sheer amount of data stored
C. The speed at which data is generated, processed, and analyzed
D. The different forms of data (images, text, video)
Correct Answer: The speed at which data is generated, processed, and analyzed
Explanation: Velocity refers to the rate at which data flows into an organization and the speed at which it must be processed to generate insights (e.g., real-time stock ticker data).
5. Social media posts, videos, and audio files are examples of what type of data?
A. Structured Data
B. Unstructured Data
C. Relational Data
D. Clean Data
Correct Answer: Unstructured Data
Explanation: Unstructured data lacks a pre-defined data model or organization. Unlike the structured data in SQL databases, items like emails, videos, and posts do not fit neatly into rows and columns.
6. Which phase of the Data Science Lifecycle involves handling missing values and correcting inconsistent data?
A. Model Building
B. Data Preparation / Cleaning
C. Model Deployment
D. Problem Definition
Correct Answer: Data Preparation / Cleaning
Explanation: Data Preparation (or Data Cleaning/Munging) involves scrubbing the data to handle missing values, remove duplicates, and correct errors before analysis.
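The cleaning steps named above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the records, field names, and the `clean` helper are all hypothetical.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"name": "Alice", "city": "new york ", "age": "34"},
    {"name": "Bob", "city": "New York", "age": ""},        # missing age
    {"name": "Alice", "city": "new york ", "age": "34"},   # exact duplicate
    {"name": "Cara", "city": "BOSTON", "age": "29"},
]


def clean(rows):
    """Drop exact duplicates, normalize text fields, and mark missing ages as None."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # skip rows already seen (duplicate removal)
            continue
        seen.add(key)
        cleaned.append({
            "name": row["name"].strip(),
            "city": row["city"].strip().title(),             # fix inconsistent case/whitespace
            "age": int(row["age"]) if row["age"] else None,  # make missing values explicit
        })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Marking missing values as None (rather than silently dropping the row) leaves the decision about how to handle them to a later step.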
7. What is Apache Hadoop primarily used for?
A. Creating real-time 3D video games
B. Distributed storage and processing of large datasets across clusters of computers
C. Editing high-resolution photos
D. Writing operating system kernels
Correct Answer: Distributed storage and processing of large datasets across clusters of computers
Explanation: Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
8. In the Hadoop ecosystem, what does HDFS stand for?
A. Hadoop Data Filtration System
B. High-Definition File Standard
C. Hadoop Distributed File System
D. Hyper-Data Fast Storage
Correct Answer: Hadoop Distributed File System
Explanation: HDFS stands for Hadoop Distributed File System. It provides high-throughput access to application data and is designed to run on commodity hardware.
9. Which programming language is specifically designed for statistical computing and graphics, and is widely used in Data Science?
A. HTML
B. C++
C. R
D. Assembly
Correct Answer: R
Explanation: R is a language and environment dedicated to statistical computing and graphics, making it a popular tool among statisticians and data miners.
10. What is the primary purpose of Tableau in a Data Science workflow?
A. Operating System management
B. Data Visualization and Business Intelligence
C. Writing low-level machine code
D. Database encryption
Correct Answer: Data Visualization and Business Intelligence
Explanation: Tableau is a powerful tool used for Data Visualization. It helps turn raw data into understandable formats such as dashboards and worksheets.
11. Which of the following is a significant challenge of Big Data?
A. Having too little data to analyze
B. Data Security and Privacy concerns
C. The low cost of storing data
D. Lack of algorithms
Correct Answer: Data Security and Privacy concerns
Explanation: Managing sensitive information, ensuring compliance with regulations (like GDPR), and protecting data from breaches are massive challenges in Big Data.
12. In the Data Science Lifecycle, what happens during the Model Building phase?
A. The business problem is defined
B. The results are presented to stakeholders
C. Machine learning algorithms are applied to training data to create a predictive model
D. The data is archived for long-term storage
Correct Answer: Machine learning algorithms are applied to training data to create a predictive model
Explanation: Model Building involves selecting appropriate algorithms (like regression or clustering) and training them on the prepared data to recognize patterns.
13. Which of the following scenarios is a common application of Data Science in E-commerce?
A. Managing warehouse physical security
B. Product Recommendation Engines (e.g., 'Customers who bought this also bought...')
C. Installing point-of-sale hardware
D. Designing the company logo
Correct Answer: Product Recommendation Engines (e.g., 'Customers who bought this also bought...')
Explanation: Recommendation engines use Big Data and machine learning to analyze user behavior and purchase history to suggest relevant products.
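As a rough illustration of the 'Customers who bought this also bought...' idea, the sketch below counts how often items appear together in the same order. Simple co-occurrence counting is swapped in here for the machine learning a real engine would use, and the baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase history: each set is one customer's order.
orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"laptop", "keyboard"},
    {"mouse", "mousepad"},
]


def build_co_purchases(baskets):
    """Count, for every item, how often each other item was bought alongside it."""
    counts = {}
    for basket in baskets:
        for item, other in permutations(basket, 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def recommend(item, co_purchases, k=2):
    """Return up to k items most often bought with `item` (ties broken alphabetically)."""
    ranked = sorted(
        co_purchases.get(item, Counter()).items(),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [other for other, _ in ranked[:k]]


co_purchases = build_co_purchases(orders)
print(recommend("laptop", co_purchases))
```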
14. What does 'Veracity' refer to in the extended 5Vs of Big Data?
A. The speed of data transfer
B. The trustworthiness, quality, and accuracy of the data
C. The variety of data types
D. The economic value of data
Correct Answer: The trustworthiness, quality, and accuracy of the data
Explanation: Veracity refers to the reliability or truthfulness of the data. Poor-quality or messy data undermines veracity.
15. Why is Cloud Computing essential for Big Data analytics?
A. It eliminates the need for internet access
B. It provides on-demand scalability and cost-effective storage/processing power
C. It forces companies to buy more physical hard drives
D. It reduces the speed of data processing
Correct Answer: It provides on-demand scalability and cost-effective storage/processing power
Explanation: Cloud platforms (like AWS, Azure, Google Cloud) allow organizations to scale resources up or down based on the massive workload requirements of Big Data without investing in physical infrastructure.
16. Which job role focuses primarily on building and maintaining the architecture (pipelines, databases) required for data generation?
A. Data Scientist
B. Data Engineer
C. Business Analyst
D. Graphic Designer
Correct Answer: Data Engineer
Explanation: A Data Engineer prepares the Big Data infrastructure that Data Scientists then analyze. Data Engineers design and build data pipelines and integrate data from various sources.
17. Which limitation does Microsoft Excel have regarding Big Data?
A. It cannot perform addition or subtraction
B. It has a row limit (approx. 1 million) and struggles with processing massive datasets efficiently
C. It requires a supercomputer to run
D. It does not support charts
Correct Answer: It has a row limit (approx. 1 million) and struggles with processing massive datasets efficiently
Explanation: While excellent for small data, Excel is not a Big Data tool because it cannot handle the volume (Petabytes) or velocity required for Big Data applications.
18. What is MapReduce?
A. A GPS navigation system
B. A programming model for processing large data sets with a parallel, distributed algorithm
C. A method to reduce the size of a map image
D. A database query language
Correct Answer: A programming model for processing large data sets with a parallel, distributed algorithm
Explanation: MapReduce is a core component of Hadoop. It splits tasks into parts, processes them in parallel (Map), and then combines the results (Reduce).
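The Map and Reduce steps can be imitated in a few lines of Python with the classic word-count example. This sketch runs in a single process and does not use Hadoop; it only shows the shape of the programming model, and the function names and input documents are illustrative.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one chunk of a large dataset.
documents = ["big data is big", "data science uses data"]


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# On a real cluster each map_phase call could run on a different machine.
mapped = [pair for document in documents for pair in map_phase(document)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```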
19. In the context of Data Science, what is Exploratory Data Analysis (EDA)?
A. Installing the database software
B. The initial investigation of data to discover patterns, spot anomalies, and check assumptions
C. The final presentation of the project
D. Writing the legal contract for data usage
Correct Answer: The initial investigation of data to discover patterns, spot anomalies, and check assumptions
Explanation: EDA is a critical step where analysts use summary statistics and graphical representations to understand the data before formal modeling.
20. Which skill is LEAST likely to be required for a Data Scientist?
A. Hardware circuit design
B. Statistical Analysis
C. Machine Learning
D. Data Visualization
Correct Answer: Hardware circuit design
Explanation: Data Scientists work with software, algorithms, and data. Hardware circuit design is an Electrical Engineering skill, not a core Data Science skill.
21. Which of the following describes Predictive Analytics?
A. Describing what happened in the past
B. Using historical data to forecast future outcomes
C. Reporting current data only
D. Manually organizing paper files
Correct Answer: Using historical data to forecast future outcomes
Explanation: Predictive analytics uses statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data.
22. Data in JSON (JavaScript Object Notation) format is an example of:
A. Unstructured Data
B. Semi-structured Data
C. Strictly Relational Data
D. Binary Data
Correct Answer: Semi-structured Data
Explanation: JSON is semi-structured. It uses markers (keys and values) to separate elements but doesn't conform to the rigid structure of a relational database table.
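A short sketch of what "semi-structured" means in practice: both records below are valid JSON, yet they do not share the same set of keys, so forcing them into one table leaves gaps. The payload and field names are made up for illustration.

```python
import json

# Hypothetical payload: two records with overlapping but different keys.
payload = """
[
  {"id": 1, "name": "Alice", "tags": ["vip"]},
  {"id": 2, "name": "Bob", "address": {"city": "Pune"}}
]
"""

records = json.loads(payload)

# The keys give the data some structure, but no fixed schema is enforced.
all_keys = sorted({key for record in records for key in record})

# Flattening into table rows leaves None wherever a record lacks a column.
table = [[record.get(key) for key in all_keys] for record in records]

print(all_keys)
for row in table:
    print(row)
```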
23. How is Big Data used in the Healthcare industry?
A. To manufacture stethoscopes
B. For disease prediction, personalized medicine, and analyzing patient records
C. To replace doctors with robots completely
D. To increase the cost of insurance manually
Correct Answer: For disease prediction, personalized medicine, and analyzing patient records
Explanation: Big Data in healthcare is used to predict epidemics, support disease research, improve quality of life via personalized medicine, and manage massive electronic health records (EHR).
24. In the Data Science Lifecycle, 'Operationalize' refers to:
A. Deleting the data
B. Deploying the model into a production environment for real-world use
C. Hiring operations managers
D. Buying new computers
Correct Answer: Deploying the model into a production environment for real-world use
Explanation: Operationalization is the final step where the developed model is integrated into existing business processes or software to deliver value continuously.
25. Which SQL command is most fundamental for extracting specific data from a database?
A. UPDATE
B. SELECT
C. DELETE
D. INSERT
Correct Answer: SELECT
Explanation: SQL (Structured Query Language) is vital for data science. The SELECT statement is used to query and retrieve data from a database.
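To see a SELECT statement in action, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The `customers` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT, spend REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, country, spend) VALUES (?, ?, ?)",
    [("Alice", "US", 120.0), ("Bob", "IN", 80.5), ("Cara", "US", 42.0)],
)

# SELECT picks the columns, WHERE filters the rows, ORDER BY sorts the result.
rows = conn.execute(
    "SELECT name, spend FROM customers WHERE country = ? ORDER BY spend DESC",
    ("US",),
).fetchall()
conn.close()

for name, spend in rows:
    print(name, spend)
```

The `?` placeholders pass values separately from the SQL text, which avoids building queries through string concatenation.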
26. What is a Data Lake?
A. A cooling system for servers
B. A centralized repository that allows you to store all your structured and unstructured data at any scale
C. A small spreadsheet
D. A visualization chart looking like water
Correct Answer: A centralized repository that allows you to store all your structured and unstructured data at any scale
Explanation: Unlike a Data Warehouse (which usually stores structured data), a Data Lake stores raw data in its native format until it is needed.
27. Which of the following is an example of Volume in Big Data?
A. Data arriving in milliseconds
B. Data containing video, text, and XML
C. An organization processing 500 Petabytes of data
D. Data having 90% accuracy
Correct Answer: An organization processing 500 Petabytes of data
Explanation: 500 Petabytes represents the size or amount of data, which corresponds to the definition of Volume.
28. What is the primary difference between a Data Analyst and a Data Scientist?
A. Data Analysts do not use computers
B. Data Scientists generally deal with more complex modeling, machine learning, and future predictions, while Analysts focus more on describing past/current trends
C. Data Analysts earn more money
D. There is no difference
Correct Answer: Data Scientists generally deal with more complex modeling, machine learning, and future predictions, while Analysts focus more on describing past/current trends
Explanation: While skills overlap, Data Scientists typically possess stronger skills in advanced statistics, machine learning, and coding for predictive modeling.
29. Which 'V' represents the economic advantage a company gains from Big Data?
A. Velocity
B. Variety
C. Value
D. Volume
Correct Answer: Value
Explanation: Data itself is useless unless it can be turned into Value. This refers to the return on investment (ROI) or insights gained from the data.
30. Which tool is known for its spreadsheet capabilities but also supports basic data analysis with Pivot Tables?
A. Apache Spark
B. Hadoop
C. Microsoft Excel
D. Docker
Correct Answer: Microsoft Excel
Explanation: Excel is the most common tool for basic data entry, calculation, and small-scale analysis using features like Pivot Tables.
31. The mathematical equation $y = mx + c$ is the basis for which common Data Science algorithm?
A. Linear Regression
B. K-Means Clustering
C. Decision Trees
D. Neural Networks
Correct Answer: Linear Regression
Explanation: Linear Regression attempts to model the relationship between two variables by fitting a linear equation ($y = mx + c$) to observed data.
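Fitting a line to observed data can be sketched with ordinary least squares in plain Python. The slope-intercept notation y = mx + c is an assumption made for this example, and the data points and the `fit_line` helper are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns slope m and intercept c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x  # the fitted line passes through the point of means
    return m, c


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, c = fit_line(xs, ys)
print(f"slope={m}, intercept={c}")
print("prediction for x=6:", m * 6 + c)
```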
32. Which of the following is a soft skill necessary for a Data Science professional?
A. Python Programming
B. Calculus
C. Storytelling and Communication
D. Cloud Architecture
Correct Answer: Storytelling and Communication
Explanation: Data professionals must communicate complex findings to non-technical stakeholders (storytelling) to drive decision-making.
33. What is Churn Prediction in the context of business applications of Data Science?
A. Predicting how fast a butter churn moves
B. Identifying customers who are likely to stop using a service or product
C. Predicting the stock market
D. Calculating employee salaries
Correct Answer: Identifying customers who are likely to stop using a service or product
Explanation: Churn prediction helps companies identify at-risk customers so they can take action (marketing, discounts) to retain them.
34. What role does IoT (Internet of Things) play in Big Data?
A. It reduces the amount of data generated
B. It acts as a massive source of real-time data generation (Velocity and Volume)
C. It is a database software
D. It is used only for printing data
Correct Answer: It acts as a massive source of real-time data generation (Velocity and Volume)
Explanation: IoT devices (sensors, smart appliances) generate continuous streams of data, significantly contributing to the Volume and Velocity of Big Data.
35. Which library in Python is most famous for data manipulation and analysis (DataFrames)?
A. Pandas
B. PyGame
C. Django
D. Flask
Correct Answer: Pandas
Explanation: Pandas is the most widely used Python library for data manipulation. It is a third-party package, not part of Python's standard library, and offers data structures like DataFrames to handle structured data.
36. Why is Data Visualization important?
A. It makes the report file size larger
B. It allows the human brain to process information more easily and identify patterns quickly
C. It hides the actual data values
D. It converts text to binary
Correct Answer: It allows the human brain to process information more easily and identify patterns quickly
Explanation: Visualizations (charts, graphs) leverage human visual perception to spot trends and outliers that are hard to see in raw rows of numbers.
37. Which of the following is a risk associated with Data Bias?
A. The model becomes too fast
B. The data takes up less space
C. The AI/Model produces unfair or discriminatory results
D. The computer overheats
Correct Answer: The AI/Model produces unfair or discriminatory results
Explanation: If the training data is biased (e.g., historical hiring data favoring one gender), the resulting model will perpetuate that bias.
38. What is the 'Discovery' phase in the Data Science Lifecycle?
A. Finding a new planet
B. Acquiring resources, framing the business problem, and formulating initial hypotheses
C. Writing the final code
D. Installing software
Correct Answer: Acquiring resources, framing the business problem, and formulating initial hypotheses
Explanation: Discovery is the first phase where the team understands the project objectives, requirements, and available data.
39. A massive dataset containing log files from servers, clickstreams from a website, and sensor data is best stored in:
A. A paper notebook
B. A standard Excel file
C. A NoSQL database or Distributed File System (like HDFS)
D. A Word document
Correct Answer: A NoSQL database or Distributed File System (like HDFS)
Explanation: These are high-volume, often semi-structured data types that exceed the capacity of traditional tools and require Big Data solutions.
40. Which of the following best describes Business Intelligence (BI) vs Data Science?
A. BI looks backward (Descriptive); Data Science looks forward (Predictive)
B. BI uses Python; Data Science uses a calculator
C. BI is for unstructured data; Data Science is for structured data
D. They are exactly the same
Correct Answer: BI looks backward (Descriptive); Data Science looks forward (Predictive)
Explanation: BI typically focuses on reporting and analyzing past events (What happened?), whereas Data Science focuses on predicting future trends (What will happen?).
41. When discussing Big Data on the Cloud, what does SaaS stand for?
A. Storage as a Service
B. Software as a Service
C. System as a Solution
D. Speed as a Service
Correct Answer: Software as a Service
Explanation: SaaS is a cloud computing model where software is licensed on a subscription basis and is centrally hosted (e.g., Google Workspace, Salesforce).
42. Which statistical concept is used to find the 'center' of a dataset?
A. Standard Deviation
B. Mean (Average)
C. Correlation
D. Variance
Correct Answer: Mean (Average)
Explanation: Measures of central tendency include Mean, Median, and Mode. The Mean is the arithmetic average.
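The three measures of central tendency can be computed with Python's built-in statistics module. The sample values are made up and include one large value to show how it pulls the mean away from the median.

```python
import statistics

# Hypothetical sample; 40 is much larger than the other values.
values = [2, 3, 3, 5, 7, 10, 40]

mean_value = statistics.mean(values)      # arithmetic average: sum / count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Here the mean (10) is twice the median (5), which is why the median is often preferred for skewed data.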
43. What is the main challenge regarding Heterogeneity in Big Data?
A. All data looks the same
B. Integrating data from diverse sources with different formats and standards
C. Data is too small
D. Computers are too fast
Correct Answer: Integrating data from diverse sources with different formats and standards
Explanation: Heterogeneity refers to the difficulty of merging data from incompatible sources (e.g., combining SQL tables with PDF documents).
44. Sentiment Analysis on Twitter data is an application of:
A. Image Processing
B. Natural Language Processing (NLP)
C. Audio Engineering
D. Database Administration
Correct Answer: Natural Language Processing (NLP)
Explanation: NLP is a branch of Data Science/AI focused on the interaction between computers and human language (text analysis).
45. Which component of Hadoop is responsible for resource management and job scheduling?
A. HDFS
B. MapReduce
C. YARN
D. Hive
Correct Answer: YARN
Explanation: YARN (Yet Another Resource Negotiator) is the cluster management layer of Hadoop.
46. In the context of the 3Vs, streaming data from a jet engine during flight represents high:
A. Velocity
B. Variety
C. Volume
D. Validity
Correct Answer: Velocity
Explanation: While it also has volume, the defining characteristic of streaming sensor data is the speed (Velocity) at which it is generated and must be processed.
47. Which chart type is best for showing the distribution of a single numerical variable?
A. Pie Chart
B. Histogram
C. Scatter Plot
D. Network Graph
Correct Answer: Histogram
Explanation: A histogram groups data into bins and displays the frequency of data points in each bin, showing the distribution.
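The binning behind a histogram can be shown without a plotting library. This sketch counts values per fixed-width bin and prints a text bar for each; the ages, the bin width, and the `histogram` helper are illustrative choices.

```python
def histogram(values, bin_width, start):
    """Count how many values fall into each fixed-width bin, keyed by the bin's lower edge."""
    counts = {}
    for value in values:
        index = (value - start) // bin_width  # which bin this value belongs to
        lower_edge = start + index * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical sample of customer ages.
ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 52]

bins = histogram(ages, bin_width=10, start=20)
for lower_edge, count in bins.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```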
48. What is the benefit of using Open Source tools like R and Hadoop?
A. They are always easier to learn
B. They prevent collaboration
C. They are free to use and have large community support
D. They only run on Windows
Correct Answer: They are free to use and have large community support
Explanation: Open source tools reduce licensing costs and benefit from community-driven innovation and troubleshooting.
49. Fraud detection in banking relies heavily on:
A. Outlier/Anomaly Detection
B. Graphic Design
C. Social Media Marketing
D. Data Compression
Correct Answer: Outlier/Anomaly Detection
Explanation: Fraud detection algorithms look for patterns that deviate significantly from the norm (anomalies) to flag suspicious transactions.
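One very simple way to flag values that deviate from the norm is a z-score check, sketched below. Real fraud systems are far more sophisticated; the transaction amounts, the threshold of 2.0, and the `flag_outliers` helper are assumptions chosen for illustration.

```python
import statistics


def flag_outliers(amounts, threshold=2.0):
    """Return the amounts whose z-score (distance from the mean in standard deviations) exceeds threshold."""
    mean = statistics.mean(amounts)
    spread = statistics.pstdev(amounts)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [a for a in amounts if abs(a - mean) / spread > threshold]


# Hypothetical card transactions: seven ordinary purchases and one unusual one.
transactions = [20, 25, 22, 19, 24, 21, 23, 950]

flagged = flag_outliers(transactions)
print("suspicious:", flagged)
```

With only a handful of points, a single extreme value inflates the standard deviation, so its z-score cannot grow very large; that is why a threshold of 2.0 is used here rather than the more common 3.0.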
50. Which of the following is a step in Data Cleaning?
Explanation: Imputation is a statistical technique used to replace missing data with substituted values (like the mean or median) during the cleaning phase.
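Mean imputation, one of the substitution strategies mentioned above, can be sketched in a few lines. The column of ages and the `impute_mean` helper are hypothetical, and missing entries are represented as None.

```python
import statistics


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill_value = statistics.mean(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [25, None, 31, 40, None, 28]

imputed_ages = impute_mean(ages)
print(imputed_ages)
```

Swapping `statistics.mean` for `statistics.median` gives median imputation, which is less sensitive to extreme values.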