Unit 1 - Practice Quiz

CSE121 50 Questions

1 Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. Which of the following sets of domains intersect to form Data Science?

A. Computer Science, Mathematics/Statistics, and Business/Domain Knowledge
B. Physics, Chemistry, and Biology
C. History, Geography, and Civics
D. Networking, Hardware Engineering, and Electrical Engineering

2 What is the primary reason for the sudden surge in the need for Data Science in recent years?

A. Computers have become more expensive
B. The explosion of unstructured data generated by social media, IoT, and mobile devices
C. A decrease in the amount of data being produced globally
D. The discontinuation of traditional database systems

3 In the context of the 3Vs of Big Data, what does Velocity refer to?

A. The reliability and accuracy of the data
B. The speed at which data is generated, processed, and analyzed
C. The sheer size of the data being stored
D. The different forms and types of data

4 Which of the following best describes Variety in Big Data?

A. Data coming in strictly tabular formats like Excel
B. The different types of data, including structured, semi-structured, and unstructured data
C. The volume of data measured in Terabytes
D. The monetary value derived from data

5 Consider the Data Science Lifecycle. Which phase typically involves handling missing values, removing duplicates, and converting data types?

A. Model Building
B. Data Preparation / Data Cleaning
C. Deployment
D. Discovery
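
The three cleaning operations named in question 5 can be sketched with pandas as follows. This is a minimal illustration, not a prescribed workflow: the `raw` table, its column names, and the choice to fill the missing age with the column mean are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw records: one duplicated row, one missing age,
# and ages stored as strings instead of numbers.
raw = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "age": ["34", "29", "29", None],
})

# 1. Remove duplicate rows (the second "Ben" row is dropped).
clean = raw.drop_duplicates().copy()

# 2. Convert the age column from strings to numbers; None becomes NaN.
clean["age"] = pd.to_numeric(clean["age"])

# 3. Handle the missing value by filling it with the mean of the known ages
#    (an assumed strategy; dropping the row is another common choice).
clean["age"] = clean["age"].fillna(clean["age"].mean())

clean = clean.reset_index(drop=True)
print(clean)
```

Which fill strategy is appropriate depends on the dataset; the mean is used here only because it keeps the example short.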

6 Which open-source software framework is primarily used for distributed storage and processing of big data using the MapReduce programming model?

A. Apache Hadoop
B. Microsoft Excel
C. Tableau
D. Adobe Photoshop

7 What is the primary function of HDFS in the Hadoop ecosystem?

A. To visualize data in charts and graphs
B. To perform statistical regression analysis
C. To store data across multiple machines in a distributed manner
D. To manage the project timeline

8 Which of the following tools is best known as a powerful Data Visualization and Business Intelligence tool?

A. Apache Spark
B. Tableau
C. MongoDB
D. Linux

9 The R programming language is most specifically designed for:

A. Operating System development
B. Web page styling (CSS)
C. Statistical computing and graphics
D. Game development

10 While Microsoft Excel is a versatile tool, what is a major limitation when dealing with Big Data?

A. It cannot perform basic arithmetic
B. It has a row limit (1,048,576 rows per worksheet), making it unsuitable for massive datasets
C. It does not support charts
D. It requires a command-line interface only

11 In the context of Big Data on the Cloud, what does Scalability (specifically Elasticity) allow organizations to do?

A. Use only one server forever
B. Automatically increase or decrease computing resources based on demand
C. Ensure data is never deleted
D. Prevent access to the internet

12 Which of the following is a major challenge associated with Big Data?

A. Data is too structured
B. Ensuring Data Security and Privacy
C. Lack of available hard drives in the market
D. Computers are too fast

13 In a Data Science Use Case for a bank, what is the most likely application of machine learning?

A. Designing the bank logo
B. Fraud Detection in credit card transactions
C. Counting the number of chairs in the lobby
D. Manually filing paper records

14 What is the specific role of a Data Engineer?

A. To build and maintain the data architecture, pipelines, and databases
B. To visualize the final report for the CEO
C. To perform hypothesis testing only
D. To manage the sales team

15 Which of the following represents Unstructured Data?

A. A SQL database table with rows and columns
B. An Excel spreadsheet
C. Video files, emails, and social media posts
D. A CSV file

16 If the 3Vs represent Volume, Velocity, and Variety, what is often considered the 4th V, concerning the reliability/quality of data?

A. Victory
B. Veracity
C. Virtualization
D. Vendor

17 In the Data Science Lifecycle, what happens during the Model Planning phase?

A. The project is cancelled
B. Techniques and algorithms are selected, and variables are explored
C. The final dashboard is presented to stakeholders
D. Raw data is collected from the internet

18 Which skill is crucial for a Data Scientist to communicate findings to non-technical stakeholders?

A. Kernel hacking
B. Data Storytelling and Visualization
C. Assembly language programming
D. Hardware repair

19 How does Netflix primarily use Big Data?

A. To manufacture television sets
B. To provide personalized movie/show recommendations to users
C. To track the weather
D. To manage their office supplies

20 Mathematical notation often used in Data Science: In the linear regression equation , what does represent?

A. The intercept
B. The slope (or coefficient) of the line
C. The error term
D. The dependent variable

21 Which of the following is a benefit of using Cloud providers (like AWS, Azure, Google Cloud) for Big Data projects?

A. High upfront capital expenditure (CapEx)
B. Pay-as-you-go pricing models (OpEx)
C. Need to physically build a data center
D. Slower internet speeds

22 In the context of Hadoop, what is MapReduce?

A. A database for storing videos
B. A programming model for processing large data sets with a parallel, distributed algorithm
C. A visualization tool
D. A cloud storage service
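
The map, shuffle, and reduce steps behind MapReduce can be sketched in plain Python with a word count. This is a single-process illustration of the idea only, not the Hadoop API: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample documents are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group all emitted values by key, as the framework would between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the values for each key into a single count.
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input: each string stands in for a file split on a cluster.
documents = ["big data big insights", "data drives insights"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

Each call to `map_phase` depends only on its own document, which is the property that lets a distributed framework spread the work across nodes.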

23 Which of the following creates a 'Talent Gap' challenge in Big Data?

A. Too many people know how to code
B. Shortage of skilled professionals who understand both data analysis and business logic
C. Universities stopped teaching math
D. Software is becoming too easy to use

24 What is Data Mining?

A. Physically drilling the earth for hard drives
B. The process of discovering patterns, correlations, and anomalies in large datasets
C. Deleting data to save space
D. Encrypting data for security

25 Which sector uses Data Science for Predictive Maintenance to anticipate when machinery will fail?

A. Manufacturing
B. Education
C. Entertainment
D. Retail

26 What is the role of a Data Analyst compared to a Data Scientist?

A. Analysts typically focus on describing 'what happened' using current data, while Scientists focus on predicting 'what will happen'
B. Analysts build the physical servers
C. Analysts earn more than Scientists
D. There is no difference

27 Which mathematical concept is fundamental to understanding probability distributions in Data Science?

A. Calculus of Variations
B. Standard Deviation ($\sigma$) and Mean ($\mu$)
C. Euclidean Geometry
D. Trigonometry
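
As a small worked example of the mean and standard deviation, both can be computed with Python's standard `statistics` module. The sample values below are made up for illustration, and the population standard deviation (`pstdev`) is used; `statistics.stdev` would give the sample version instead.

```python
import statistics

# Hypothetical sample of eight observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)          # sum of values divided by their count
std_dev = statistics.pstdev(values)     # population standard deviation

print(f"mean = {mean}, standard deviation = {std_dev}")
```

For these values the squared deviations from the mean sum to 32, so the population variance is 32 / 8 = 4 and the standard deviation is 2.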

28 In the context of Big Data storage, what does NoSQL stand for?

A. No SQL Allowed
B. Not Only SQL
C. New Operating System Query Language
D. Number Sequence Query Logic

29 Why is Python a popular tool for Data Science?

A. It is the only language that computers understand
B. It has rich libraries like Pandas, NumPy, and Scikit-learn
C. It is faster than C++
D. It comes pre-installed on every calculator

30 Which phase of the Data Science Lifecycle involves putting the model into a production environment to make real-world decisions?

A. Discovery
B. Operationalize / Deployment
C. Data Preparation
D. Model Planning

31 Big Data analytics in Retail (e.g., Walmart, Amazon) is heavily used for:

A. Inventory management and supply chain optimization
B. Patient diagnosis
C. Traffic routing
D. Seismic activity monitoring

32 Which of the following is considered a Soft Skill for a Data Scientist?

A. Knowledge of Hadoop
B. Proficiency in Python
C. Curiosity and Critical Thinking
D. Understanding Calculus

33 What does the term 'Data Integration' refer to in the challenges of Big Data?

A. Buying more computers
B. Combining data from different sources to provide a unified view
C. Deleting old data
D. Installing antivirus software

34 Apache Spark is often preferred over Hadoop MapReduce because:

A. It is older
B. It processes data in-memory, making it much faster
C. It only works on small data
D. It does not support SQL

35 Why does a Data Scientist split data into Training Sets and Testing Sets?

A. To make the file smaller
B. To train the model on one part and validate its performance on unseen data
C. To confuse the computer
D. To use two different computers
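
A training/testing split can be sketched in plain Python as follows. The helper `split_rows`, the 80/20 ratio, and the fixed seed are assumptions for this example; libraries such as scikit-learn provide their own, more complete utilities for the same task.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten records, identified by number.
data = list(range(10))
train, test = split_rows(data)

print("train:", train)
print("test:", test)
```

Shuffling before slicing guards against any ordering in the original data leaking into one of the two sets.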

36 Which symbol is typically used in R (and statistics) to denote an assignment or relationship?

A.
B.
C.
D. $$$$

37 The healthcare industry uses Big Data primarily for:

A. High-frequency trading
B. Predicting epidemics and personalized medicine
C. Route optimization
D. Inventory of office supplies

38 What is a Data Lake?

A. A cooling system for servers
B. A storage repository that holds a vast amount of raw data in its native format
C. A type of graph
D. A cleaning process

39 In the context of Smart Cities, how is Big Data used in Transportation?

A. To count the number of clouds
B. Real-time traffic management and route optimization
C. To paint roads
D. To manufacture tires

40 Which of the following describes the Volume aspect of Big Data?

A. Data generation speed
B. Data generated in Terabytes, Petabytes, or Zettabytes
C. Data trustworthiness
D. Data complexity

41 Which of the following is NOT a typical phase in the Data Science Lifecycle?

A. Data Discovery
B. Model Building
C. Hardware Assembly
D. Communicating Results

42 What is the result of in a mathematical computation often performed in programming?

A. 6
B. 8
C. 5
D. 9

43 When discussing Big Data tools, Open Source usually means:

A. The software costs $1 million
B. The source code is freely available to use and modify (e.g., Hadoop, R)
C. The software has no security
D. The software only works on weekends

44 The application of Sentiment Analysis on social media data helps companies to:

A. Delete user accounts
B. Understand public opinion about their brand or product
C. Increase internet speed
D. Change their passwords

45 Which job role typically requires the deepest knowledge of machine learning algorithms and statistical modeling?

A. Data Scientist
B. Data Entry Operator
C. Database Administrator
D. Project Manager

46 A dataset contains values: . What is the Mode of this dataset?

A. 2
B. 4
C. 10
D. 4.8

47 Why is Data Quality a challenge in Big Data?

A. Because data is always clean
B. Because data collected from diverse sources often contains errors, duplicates, and noise
C. Because high-quality data is too heavy to store
D. Because computers refuse to process good data

48 Which of these is a popular Python library specifically for data manipulation and analysis using DataFrames?

A. Pandas
B. Photoshop
C. jQuery
D. DirectX

49 How does Cloud Computing support the 'Velocity' aspect of Big Data?

A. By slowing down the network
B. By providing high-speed, distributed processing power on demand
C. By restricting data access
D. By printing data to paper

50 What is the ultimate goal of the Data Science Lifecycle?

A. To generate complex math equations
B. To create actionable business insights and value
C. To use as much electricity as possible
D. To write the longest code