CSE121 Notes
Unit 1: Data Science & Big Data
1. Introduction to Data Science
Definition
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data. It combines expertise from various domains, including:
- Statistics and Mathematics: For modeling and analysis.
- Computer Science: For programming, algorithm design, and database management.
- Domain Knowledge: Understanding the context of the data (e.g., finance, healthcare).
The Need for Data Science
- Data Explosion: With the advent of the internet, social media, and IoT, the volume of data generated is growing exponentially. Traditional tools cannot handle this influx.
- Decision Making: Organizations need to move from "gut-feeling" decisions to data-driven decisions to reduce risk and increase profitability.
- Pattern Recognition: To identify hidden patterns and trends (e.g., consumer behavior changes) that are not visible through simple observation.
- Predictive Capability: To forecast future events (e.g., stock market trends, disease outbreaks) based on historical data.
2. Applications of Data Science and Big Data
Data Science and Big Data are transforming industries across the board:
- Healthcare:
- Medical Image Analysis: Detecting tumors or anomalies in X-rays and MRIs.
- Genomics: Analyzing genetic sequences to understand diseases and personalize medicine.
- Drug Discovery: Simulating interactions to speed up the development of new drugs.
- Finance & Banking:
- Fraud Detection: Identifying unusual transaction patterns in real-time.
- Risk Assessment: Credit scoring and loan approval automation.
- Algorithmic Trading: Using data to execute high-speed trades.
- E-Commerce & Retail:
- Recommendation Engines: "Customers who bought this also bought..." (Amazon, Netflix).
- Inventory Management: Predicting demand to optimize stock levels.
- Transportation:
- Route Optimization: GPS navigation (Google Maps) predicting traffic.
- Self-Driving Cars: Processing sensor data to navigate safely.
- Social Media:
- Sentiment Analysis: Understanding public opinion on brands or topics.
- Targeted Advertising: Showing ads based on user behavior and preferences.
3. Data Science Lifecycle (with Use Case)
The Data Science Lifecycle provides a structured, repeatable approach to solving data problems.
Use Case Scenario: A telecommunications company wants to predict which customers are likely to leave (churn prediction) so that it can offer them retention deals.
Phase 1: Discovery (Problem Definition)
- Objective: Define the goal clearly.
- Action: Stakeholders determine they want to reduce customer churn by 10%.
- Output: Problem statement and hypothesis.
Phase 2: Data Preparation (Data Collection & Cleaning)
- Collection: Gathering data from SQL databases, call logs, and customer support tickets.
- Cleaning: Handling missing values (e.g., empty address fields), removing duplicates, and correcting errors.
- Transformation: Converting categorical data (e.g., "Male/Female") into numerical format for the model.
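A minimal pandas sketch of this phase for the churn use case (column names such as gender, address, monthly_bill, and contract_type are assumed for illustration, not a real telecom schema):

```python
import pandas as pd

# Assumed illustrative schema: customer_id, gender, address, monthly_bill, contract_type, churned
df = pd.read_csv("customers.csv")

# Cleaning: remove duplicates and handle missing values
df = df.drop_duplicates()
df["address"] = df["address"].fillna("unknown")    # fill empty address fields
df = df.dropna(subset=["monthly_bill"])            # drop rows missing a key numeric field

# Transformation: convert categorical data into numerical format
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df = pd.get_dummies(df, columns=["contract_type"])  # one-hot encode remaining categories
```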
Phase 3: Exploratory Data Analysis (EDA)
- Objective: Understand the data structure and patterns.
- Action: Using histograms or scatter plots to see that customers with high monthly bills and low usage are the most likely to churn.
- Tools: Python (Pandas, Matplotlib), Tableau.
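Continuing with the same assumed DataFrame, the plots described above could be produced like this (monthly_usage and churned are again illustrative column names):

```python
import matplotlib.pyplot as plt

# Histogram: distribution of monthly bills for churned vs. retained customers
df[df["churned"] == 1]["monthly_bill"].plot(kind="hist", alpha=0.5, label="churned")
df[df["churned"] == 0]["monthly_bill"].plot(kind="hist", alpha=0.5, label="retained")
plt.xlabel("Monthly bill")
plt.legend()
plt.show()

# Scatter plot: spot the high-bill / low-usage cluster
df.plot(kind="scatter", x="monthly_usage", y="monthly_bill", c="churned", colormap="coolwarm")
plt.show()
```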
Phase 4: Model Building
- Objective: Create a mathematical model to predict the outcome.
- Action: Splitting data into Training and Testing sets. Using an algorithm like Logistic Regression or Random Forest to train the model on historical data.
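A scikit-learn sketch of the split-and-train step, still using the assumed churn DataFrame (Random Forest is chosen here, but Logistic Regression would follow the same pattern):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Features vs. target; drop the ID and free-text columns
X = df.drop(columns=["customer_id", "address", "churned"])
y = df["churned"]

# Split into Training and Testing sets (an 80/20 split is a common default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on historical data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```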
Phase 5: Model Evaluation
- Objective: Test accuracy.
- Action: Testing the model on data it hasn't seen before. If the model predicts churn with 85% accuracy, it is deemed successful. If not, retrain with different parameters.
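Evaluation on the held-out test set could then look like the sketch below; because churn data is usually imbalanced, the per-class precision and recall are a useful complement to plain accuracy:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)                      # predictions on data the model has not seen
print("Accuracy:", accuracy_score(y_test, y_pred))  # compare against the 85% target
print(classification_report(y_test, y_pred))        # precision and recall per class
```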
Phase 6: Operationalization (Deployment)
- Objective: Put the model into practice.
- Action: Integrating the model into the company's CRM system. When a high-risk customer calls, the support agent gets an alert to offer a discount.
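One way the deployment step might be sketched: persist the trained model and expose a small scoring helper the CRM could call when a customer phones in (the helper name and the 0.7 alert threshold are assumptions for illustration):

```python
import joblib

# Persist the trained model so the CRM integration can load it
joblib.dump(model, "churn_model.joblib")

loaded_model = joblib.load("churn_model.joblib")

def churn_alert(customer_features):
    """Return True if the caller is high-risk and should be offered a retention deal."""
    churn_probability = loaded_model.predict_proba([customer_features])[0][1]
    return churn_probability > 0.7   # assumed alert threshold
```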
Phase 7: Communicating Results
- Objective: Reporting.
- Action: Presenting the ROI of the retention campaign to the CEO using dashboards.
4. Introduction to Big Data
Definition: Big Data refers to datasets that are too large or complex for traditional data-processing application software to deal with. It implies data that exceeds the processing capacity of conventional database systems.
The 3 Vs of Big Data
Although 5 or even 7 Vs are often cited, the core three are:
- Volume:
- Refers to the sheer size of the data.
- Scale: Terabytes (TB), Petabytes (PB), and Zettabytes (ZB).
- Example: 500 hours of video are uploaded to YouTube every minute.
- Velocity:
- Refers to the speed at which data is generated and processed.
- Requirement: Real-time or near-real-time processing.
- Example: Sensor data from a jet engine during flight; stock market ticks.
- Variety:
- Refers to the different types of data.
- Structured: Rows and columns (Excel, SQL).
- Semi-structured: XML, JSON.
- Unstructured: Video, audio, text, images (making up the majority of Big Data).
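A small Python illustration of the three levels of variety (the field names and values are made up):

```python
import json
import pandas as pd

# Structured: fixed rows and columns
table = pd.DataFrame([{"id": 1, "name": "Asha", "plan": "prepaid"}])

# Semi-structured: JSON with nested, optional fields and no rigid schema
record = json.loads('{"id": 1, "name": "Asha", "devices": [{"type": "phone"}, {"type": "tablet"}]}')
print(record["devices"][0]["type"])

# Unstructured: free text (or images, audio, video) with no predefined fields at all
review_text = "The network keeps dropping calls near my office."
```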
5. Challenges of Big Data
Despite its value, Big Data presents significant hurdles:
- Storage Issues: Rapidly growing data requires scalable and expensive storage solutions.
- Processing Power: Analyzing petabytes of data requires immense computational power and distributed processing (parallel computing).
- Data Quality (Veracity): High volume often includes noise, errors, or incomplete data. "Dirty" data leads to bad insights.
- Security and Privacy: Storing vast amounts of personal user data attracts cyberattacks and raises compliance issues (GDPR, HIPAA).
- Talent Gap: There is a global shortage of skilled professionals who can manage complex Big Data architectures.
6. Tools Usage
Apache Hadoop
An open-source framework for the distributed storage and processing of very large datasets across clusters of computers.
- HDFS (Hadoop Distributed File System): Splits files into blocks and stores them across multiple nodes (storage).
- MapReduce: The programming model that processes the data in parallel (processing).
- Usage: Used by Facebook and Yahoo for storing massive logs.
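The MapReduce idea can be sketched in plain Python with the classic word-count example; this only illustrates the map and reduce phases conceptually and is not the Hadoop API:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big storage", "data flows fast"]

# Map phase: each document independently emits (word, 1) pairs (run in parallel on Hadoop)
mapped = chain.from_iterable(((word, 1) for word in doc.split()) for doc in documents)

# Shuffle + reduce phase: pairs with the same key (word) are combined into a total count
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}
```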
Tableau
A leading Business Intelligence (BI) and data visualization tool.
- Function: Converts raw data into interactive dashboards and graphs.
- Feature: Drag-and-drop interface; requires no coding.
- Usage: Creating executive dashboards to monitor KPIs.
R Language
A programming language and free software environment for statistical computing and graphics.
- Strengths: Excellent for heavy statistical analysis, academic research, and complex visualizations (using libraries like ggplot2).
- Usage: Used by statisticians for hypothesis testing and exploratory analysis.
Excel
The fundamental spreadsheet tool.
- Usage: Good for small datasets, quick calculations, and pivot tables.
- Limitation: Cannot handle large datasets (a worksheet is capped at 1,048,576 rows); not suitable for Big Data or automated machine learning pipelines.
Big Data on the Cloud
Cloud computing allows organizations to access Big Data tools without buying physical servers.
- Benefits: Scalability (pay for what you use), accessibility, and maintenance managed by the provider.
- Providers:
- AWS (Amazon EMR, Redshift)
- Google Cloud (BigQuery)
- Microsoft Azure (HDInsight)
7. Job Roles and Skillsets
The field is divided into several specialized roles.
1. Data Analyst
- Role: Interprets data, analyzes results using statistical techniques, and provides reports. Focuses on "What happened?" and "Why did it happen?".
- Skills: Excel, SQL, Tableau/PowerBI, Basic Python/R, Communication.
2. Data Scientist
- Role: Builds models to predict future trends. Uses advanced math and machine learning. Focuses on "What will happen?".
- Skills: Advanced Python/R, Machine Learning algorithms, Statistics, Calculus, Data wrangling.
3. Data Engineer
- Role: Builds and maintains the "pipes" (pipelines) that allow data to flow. They ensure data is clean and accessible for the Data Scientists.
- Skills: Hadoop, Spark, SQL/NoSQL databases, Cloud platforms (AWS/Azure), ETL (Extract, Transform, Load) processes.
4. Big Data Architect
- Role: Designs the complete infrastructure for data management.
- Skills: System Architecture, Database design, Network security, High-performance computing.
General Skills Needed for Big Data
- Programming: Python (most popular), R, Java, Scala.
- Database Management: SQL (MySQL, PostgreSQL) and NoSQL (MongoDB, Cassandra).
- Big Data Frameworks: Hadoop, Apache Spark.
- Math/Stats: Linear algebra, probability, statistics.
- Soft Skills: Business acumen, critical thinking, storytelling with data.