Unit 2 - Notes
INT374
Unit 2: Connecting and Preparing Data
1. Power BI Architecture: Front-End vs. Back-End
Understanding the distinction between where data is transformed and where it is visualized is fundamental to Power BI workflow.
Power BI "Back-End" (Power Query Editor)
The "Kitchen" where raw ingredients (data) are prepared.
- Function: Dedicated to ETL (Extract, Transform, Load) processes.
- Language: Uses M (Mashup) language.
- Activities: Cleaning dirty data, changing data types, removing rows, splitting columns, merging tables.
- Outcome: A clean dataset ready for modeling.
- Performance: Transformations performed here are generally more efficient than calculated columns created in the front-end.
Power BI "Front-End" (Report & Data View)
The "Dining Room" where prepared food is presented.
- Function: Dedicated to Data Modeling, Visualization, and Reporting.
- Language: Uses DAX (Data Analysis Expressions).
- Activities: Creating relationships between tables, writing measures (KPIs), calculated columns, and building charts/graphs.
- Outcome: Interactive reports and dashboards.
2. Data Connectors
Power BI offers a vast library of native connectors to ingest data from various sources.
Categories of Connectors
- File: Excel, Text/CSV, XML, JSON, Folder, PDF.
- Database: SQL Server, Oracle, IBM Db2, MySQL, PostgreSQL.
- Power Platform: Power BI datasets, Dataflows, Dataverse.
- Azure: Azure SQL, Azure Blob Storage, Azure HDInsight.
- Online Services: SharePoint Online, Salesforce, Google Analytics, Dynamics 365.
- Other: Web, OData Feed, ODBC, R Script, Python Script.
Database Connections: Import vs. DirectQuery
When connecting to databases (e.g., SQL Server), you are often presented with two connectivity modes:
- Import Mode (Default/Recommended):
- Loads a copy of the data into the Power BI Desktop memory (cache).
- Pros: Fastest performance for visuals; full access to DAX and Power Query features.
- Cons: Data must be refreshed to see updates; limited by file size (1GB limit for Pro users per dataset).
- DirectQuery:
- Does not store data in Power BI; sends a query to the data source every time a visual interacts.
- Pros: Near real-time data; handles massive datasets (petabytes).
- Cons: Slower visual rendering; DAX limitations; strict dependency on source availability.
3. Extracting Data from the Web
Power BI allows users to scrape data from HTML pages without writing code.
The "Web" Connector
- URL Input: Enter the website address.
- Navigator Window: Power BI scans the HTML for
<table>tags and presents detected tables. - Add Table Using Examples: If Power BI cannot automatically detect the table structure, you can provide sample data (e.g., typing the first two rows manually), and Power BI’s AI will interpret the CSS selectors to build the table for you.
4. The Power Query Editor Interface
Upon clicking "Transform Data," the Power Query Editor opens in a separate window.
Key Components
- The Ribbon: Contains tabs (Home, Transform, Add Column, View) hosting transformation tools.
- Queries Pane (Left): Lists all tables (queries) currently connected.
- Data Preview (Center): Shows a sample of the data based on the current step.
- Formula Bar: Displays the M code generated by the UI actions.
- Query Settings / Applied Steps (Right):
- Records every action taken on the data chronologically.
- Acts as a "Time Machine"—you can click previous steps to see the data state at that moment.
- Steps can be deleted, reordered, or modified via the gear icon.
5. Data QA & Profiling Tools
Before transforming data, one must assess its quality. These tools are located in the View tab.
1. Column Quality
Displays a bar and percentages for valid, error, and empty values.
- Valid: Data matches the column type.
- Error: Data conflicts with the type (e.g., text "N/A" in a Date column).
- Empty: Null cells.
2. Column Distribution
Shows a histogram of value frequency.
- Distinct Values: Total number of different values.
- Unique Values: Number of values that appear only once. (High uniqueness suggests a primary key candidate).
3. Column Profile
Provides detailed statistics for the selected column, including:
- Min/Max values.
- Average and Standard Deviation (for numbers).
- Count of Empty/Null.
- Value distribution visualization.
6. Data Transformation Tools
Text Tools
- Split Column: Divide a column based on a delimiter (comma, space) or number of characters.
- Format: Convert to Uppercase, Lowercase, or Capitalize Each Word.
- Trim: Removes leading and trailing whitespace (crucial for matching keys).
- Clean: Removes non-printable characters (e.g., line breaks).
- Merge Columns: Concatenates two or more columns into one.
Numerical Tools
- Standard: Add, Multiply, Subtract, Divide specific values across a column.
- Statistics: Calculate Sum, Min, Max, Median, Average (usually done via Group By).
- Rounding: Round Up, Round Down, or Round to specific decimal places.
Date & Time Tools
Power Query creates new columns from a date field effortlessly.
- Date Components: Extract Year, Month Name, Quarter, Week Number, Day of Week.
- Age: Calculates the duration between the date in the row and the current system time.
- Earliest/Latest: Identifies range limits.
7. Advanced Transformations
Change Type with Locale
Standard type conversion often fails if the source region differs from the system region (e.g., interpreting "02/05/2023" as May 2nd (US) vs. Feb 5th (UK)).
- Action: Right-click column Change Type Using Locale.
- Usage: You specify the origin culture of the data so Power BI interprets the string correctly before converting it to a Date.
Index & Conditional Columns
- Index Column: Adds a sequential number to rows (starting at 0 or 1). Useful for creating unique keys for tables that lack them.
- Conditional Column: Creates a new column based on
IF-THEN-ELSElogic.- Example:
IF [Sales] > 1000 THEN "High" ELSE IF [Sales] > 500 THEN "Medium" ELSE "Low".
- Example:
Grouping & Aggregating
Aggregates data to a higher level of granularity (reducing the number of rows).
- Operation: Specify the "Group By" column (e.g., Region) and the "New Column Name" with an operation (e.g., Sum of Sales).
- Result: A summary table with one row per unique value in the grouping column.
Pivoting & Unpivoting
This alters the structure/shape of the table.
- Unpivot (Transforming Wide to Tall):
- Scenario: You have columns "Jan", "Feb", "Mar" with sales figures.
- Action: Select the Date columns Unpivot.
- Result: Creates two columns: "Attribute" (Month) and "Value" (Sales Amount). This is the required structure for Power BI visualization.
- Pivot (Transforming Tall to Wide):
- Scenario: You have a "Metric" column containing "Sales", "Cost", "Profit" and you want them as separate columns.
- Action: Select "Metric" column Pivot Column Select "Value" column for the values.
Merging Queries (Joins)
Combines two tables horizontally based on a common key (equivalent to SQL JOIN or Excel VLOOKUP).
- Join Kinds:
- Left Outer: Keep all rows from the first table, match from the second. (Most common).
- Inner: Keep only matching rows.
- Right Outer: Keep all rows from the second table.
- Full Outer: Keep all rows from both tables.
Appending Queries (Unions)
Combines two tables vertically (stacking them).
- Requirement: Column headers must share the exact same names.
- Usage: Combining "Sales 2022" and "Sales 2023" files into a single "Total Sales" table.
Data Source Settings
Located in the "Home" tab under "Data Source Settings".
- Path Management: Update file paths if the source file moves (e.g., from C:/User/Desktop to a Shared Drive).
- Permissions: Clear or Edit permissions/credentials (e.g., updating a database password or signing into a different organizational account).