Unit 4 - Notes
Unit 4: Statistical Data and Central Tendency
1. Introduction to Statistical Data
Statistical data refers to the collection of facts, figures, and measurements that are gathered, analyzed, interpreted, and presented for a specific purpose. In business, data is crucial for decision-making, forecasting, performance evaluation, and strategic planning. Data can be classified based on the number of variables being studied simultaneously.
1.1 Univariate Data
Definition: Univariate data is a type of data that consists of observations on a single variable. The primary purpose of univariate analysis is to describe the data, find patterns, and summarize its central tendency, dispersion, and shape.
- Variable: A single characteristic or attribute is measured for each subject or item.
- Focus: Describing the characteristics of that single variable. It does not deal with causes or relationships between variables.
- Analysis Techniques:
- Frequency Distribution Tables
- Graphs: Histograms, Bar Charts, Pie Charts, Line Charts, Box Plots
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Dispersion: Range, Variance, Standard Deviation
Business Examples:
- The monthly sales figures for a single product over a year. (Variable: Monthly Sales)
- The salaries of all employees in a specific department. (Variable: Salary)
- The number of units produced by a factory each day for a month. (Variable: Daily Production)
- The ages of customers who purchased a new service. (Variable: Customer Age)
1.2 Bivariate Data
Definition: Bivariate data is data in which observations are made on two variables for each subject or item. The primary purpose is to explore the relationship or association between these two variables.
- Variables: Two characteristics are measured simultaneously for each unit of observation.
- Focus: Determining the strength and direction of the relationship between the two variables. One variable is often considered the independent (or explanatory) variable, and the other is the dependent (or response) variable.
- Analysis Techniques:
- Scatter Plots to visualize the relationship.
- Correlation Analysis (e.g., Pearson's correlation coefficient) to measure the strength and direction of a linear relationship.
- Simple Linear Regression to model the relationship and make predictions.
Business Examples:
- A company's advertising expenditure and its monthly sales revenue. (Variables: Advertising Spend, Sales Revenue)
- An employee's years of experience and their annual salary. (Variables: Years of Experience, Annual Salary)
- The price of a product and the number of units sold. (Variables: Price, Units Sold)
- The temperature in a retail store and the sales of ice cream. (Variables: Temperature, Ice Cream Sales)
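As a rough illustration of the correlation analysis listed above, the sketch below computes Pearson's correlation coefficient from its definition. The function name `pearson_r` and the advertising and revenue figures are hypothetical, invented only to show the calculation; they are not data from these notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable never changes")
    return sxy / sqrt(sxx * syy)


# Hypothetical figures, invented for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales_revenue = [25, 45, 65, 85, 105]  # exactly 2 * spend + 5

r = pearson_r(ad_spend, sales_revenue)
print(r)  # close to +1 because the made-up relationship is perfectly linear
```

A value near +1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.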
1.3 Multivariate Data
Definition: Multivariate data consists of observations on three or more variables for each subject or item. It is used to understand the complex relationships among multiple variables simultaneously.
- Variables: Three or more characteristics are measured for each unit of observation.
- Focus: Analyzing the interactions and interdependencies between multiple variables. This type of analysis is common in real-world business scenarios where outcomes are influenced by many factors.
- Analysis Techniques:
- Multiple Regression Analysis to predict a dependent variable based on several independent variables.
- Factor Analysis to identify underlying variables or factors that explain the pattern of correlations within a set of observed variables.
- Cluster Analysis to group subjects or items based on similarities across several variables.
- MANOVA (Multivariate Analysis of Variance).
Business Examples:
- Predicting house prices based on size, number of bedrooms, location, and age of the property. (Variables: Price, Size, Bedrooms, Location, Age)
- Analyzing customer satisfaction based on product quality, price, customer service, and delivery time. (Variables: Satisfaction Score, Quality Rating, Price, Service Rating, Delivery Time)
- Assessing the risk of a loan applicant based on their income, credit score, age, and existing debt. (Variables: Loan Risk, Income, Credit Score, Age, Debt Level)
2. Measures of Central Tendency
Measures of central tendency are single values that attempt to describe a set of data by identifying the central or typical value within that set. They are also known as measures of location or averages. The most common measures are the arithmetic mean, median, and mode.
2.1 Arithmetic Mean (Mean)
The Arithmetic Mean, or simply the mean, is the most common measure of central tendency. It is calculated by summing all the values in a dataset and dividing by the number of values.
A. Mean for Ungrouped (Raw) Data
This applies to a simple list of numbers.
- Formula: x̄ = (Σx) / n
- Where: x̄ (pronounced "x-bar") is the sample mean, Σx is the sum of all individual values, and n is the number of values in the sample.
- Example: A sales manager records the number of sales made by a salesperson over 5 days: 8, 12, 5, 10, 15.
- Sum of values (Σx) = 8 + 12 + 5 + 10 + 15 = 50
- Number of values (n) = 5
- Mean (x̄) = 50 / 5 = 10 sales per day.
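The same calculation can be sketched in Python. The function name `arithmetic_mean` is chosen here for illustration; the standard library's `statistics.mean` is shown alongside it as a cross-check.

```python
from statistics import mean


def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


daily_sales = [8, 12, 5, 10, 15]  # the five days from the example

print(arithmetic_mean(daily_sales))  # 10.0
print(mean(daily_sales) == arithmetic_mean(daily_sales))  # True
```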
B. Mean for Discrete Frequency Distribution
This is used when data is presented in a table with frequencies.
- Formula: x̄ = (Σfx) / (Σf)
- Where: x is the value of the variable, f is the frequency of each value, Σfx is the sum of the products of each value and its frequency, and Σf is the sum of all frequencies (which equals n).
- Example: Number of defects found in a batch of 100 products.
| Defects (x) | No. of Products (f) | fx |
|---|---|---|
| 0 | 45 | 0 * 45 = 0 |
| 1 | 30 | 1 * 30 = 30 |
| 2 | 15 | 2 * 15 = 30 |
| 3 | 10 | 3 * 10 = 30 |
| Total | Σf = 100 | Σfx = 90 |
- Mean (x̄) = Σfx / Σf = 90 / 100 = 0.9 defects per product.
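A minimal sketch of the frequency-weighted mean, assuming the table is stored as a dictionary that maps each value x to its frequency f. The name `frequency_mean` is illustrative.

```python
def frequency_mean(freq_table):
    """Mean of a discrete frequency distribution: sum(f * x) / sum(f)."""
    total_f = sum(freq_table.values())
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    total_fx = sum(x * f for x, f in freq_table.items())
    return total_fx / total_f


# Defects (x) -> number of products (f), taken from the table above.
defects = {0: 45, 1: 30, 2: 15, 3: 10}

print(frequency_mean(defects))  # 0.9
```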
C. Mean for Continuous Frequency Distribution (Grouped Data)
This is used for data grouped into class intervals.
- Formula: x̄ = (Σfm) / (Σf)
- Where: f is the frequency of each class, m is the midpoint of each class interval (Midpoint = (Lower Limit + Upper Limit) / 2), Σfm is the sum of the products of each midpoint and its frequency, and Σf is the total frequency (n).
- Example: Weekly wages of 50 employees.
| Weekly Wages ($) | No. of Employees (f) | Midpoint (m) | fm |
|---|---|---|---|
| 100 - 120 | 10 | 110 | 10 * 110 = 1100 |
| 120 - 140 | 15 | 130 | 15 * 130 = 1950 |
| 140 - 160 | 20 | 150 | 20 * 150 = 3000 |
| 160 - 180 | 5 | 170 | 5 * 170 = 850 |
| Total | Σf = 50 | | Σfm = 6900 |
- Mean (x̄) = Σfm / Σf = 6900 / 50 = $138.
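For grouped data, the only extra step is replacing each class by its midpoint. The sketch below assumes each class is stored as a `(lower, upper, frequency)` tuple; that layout and the name `grouped_mean` are choices made for this illustration.

```python
def grouped_mean(classes):
    """Approximate mean of grouped data using class midpoints: sum(f * m) / sum(f)."""
    total_f = sum(f for _, _, f in classes)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    total_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fm / total_f


# (lower limit, upper limit, number of employees) from the wages table.
weekly_wages = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]

print(grouped_mean(weekly_wages))  # 138.0
```

The result is an approximation because every observation in a class is treated as if it sat exactly at the midpoint.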
D. Combined Mean
The combined arithmetic mean is used to calculate the mean of a combined group from the means of its subgroups.
- Formula (for two groups): x̄_c = (n₁x̄₁ + n₂x̄₂) / (n₁ + n₂)
- Where: x̄_c is the combined mean, n₁ and n₂ are the numbers of observations in group 1 and group 2, and x̄₁ and x̄₂ are the means of group 1 and group 2.
- Example: A company has two departments. The production department has 40 employees with an average salary of $3,000, and the other department has 10 employees with an average salary of $4,000.
- n₁ = 40, x̄₁ = 3000
- n₂ = 10, x̄₂ = 4000
- x̄_c = (40 * 3000 + 10 * 4000) / (40 + 10)
- x̄_c = (120000 + 40000) / 50
- x̄_c = 160000 / 50 = $3,200.
- The average salary for the entire company is $3,200.
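The two-group formula extends naturally to any number of groups. The sketch below assumes each group is given as a `(size, mean)` pair; the name `combined_mean` is illustrative.

```python
def combined_mean(groups):
    """Combined mean of several groups, each given as (size, group_mean)."""
    total_n = sum(n for n, _ in groups)
    if total_n == 0:
        raise ValueError("at least one group must be non-empty")
    return sum(n * m for n, m in groups) / total_n


# (number of employees, average salary) for the two departments above.
departments = [(40, 3000), (10, 4000)]

print(combined_mean(departments))  # 3200.0
```

Note that the result is not the simple average of 3,000 and 4,000 (which would be 3,500): the larger department pulls the combined mean toward its own mean.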
Properties, Pros, and Cons of the Mean
- Properties:
- It is unique for a given dataset.
- The sum of deviations of the items from their mean is always zero (Σ(x - x̄) = 0).
- It is calculated using every value in the dataset.
- Pros:
- It is rigidly defined and easy to calculate and understand.
- It is suitable for further algebraic treatment.
- Cons:
- It is highly affected by extreme values (outliers). For instance, if the salesperson in the earlier example had made 100 sales on the last day instead of 15, the mean would jump from 10 to 27, well above four of the five daily values.
- It cannot be calculated for open-ended class intervals.
- It cannot be used for qualitative data (e.g., brand preferences).
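Two of the points above can be checked numerically with the daily sales data from Section 2.1: the deviations from the mean sum to zero, and a single extreme value shifts the mean sharply. This is a small sketch; the helper name `mean_of` is illustrative.

```python
def mean_of(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


daily_sales = [8, 12, 5, 10, 15]
m = mean_of(daily_sales)

# Property: deviations from the mean cancel out.
deviation_sum = sum(x - m for x in daily_sales)

# Con: replace the last value (15) with an outlier (100).
with_outlier = daily_sales[:-1] + [100]

print(m)                      # 10.0
print(deviation_sum)          # 0.0
print(mean_of(with_outlier))  # 27.0
```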
2.2 Median
The Median is the positional middle value of a dataset that has been arranged in ascending or descending order. It divides the data into two equal halves.
A. Median for Ungrouped (Raw) Data
- Arrange the data in ascending order.
- Find the position of the median using the formula (n + 1) / 2.
- Identify the value at that position.
- Case 1: Odd number of observations (n is odd)
- Data: 5, 8, 10, 12, 15 (already sorted)
- n = 5
- Position = (5 + 1) / 2 = 3rd value.
- Median = 10.
- Case 2: Even number of observations (n is even)
- Data: 5, 8, 10, 12, 15, 20 (already sorted)
- n = 6
- Position = (6 + 1) / 2 = 3.5. This means the median is the average of the 3rd and 4th values.
- Median = (10 + 12) / 2 = 11.
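Both cases can be handled by one small function. This is a sketch rather than a reference implementation; the name `median_of` is illustrative, and Python's zero-based indexing means position (n + 1) / 2 corresponds to index n // 2 when n is odd.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


print(median_of([5, 8, 10, 12, 15]))      # 10
print(median_of([5, 8, 10, 12, 15, 20]))  # 11.0
```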
B. Median for Continuous Frequency Distribution (Grouped Data)
- Calculate the cumulative frequencies (cf).
- Determine the median class by finding the position N/2 (where N = Σf). The median class is the first class whose cumulative frequency is greater than or equal to N/2.
- Apply the interpolation formula.
- Formula: Median = L + [ (N/2 - cf) / f ] * h
- Where: L is the lower class boundary of the median class, N is the total frequency (Σf), cf is the cumulative frequency of the class preceding the median class, f is the frequency of the median class, and h is the class width of the median class.
- Example: Using the same weekly wages data.
| Weekly Wages ($) | Frequency (f) | Cumulative Freq. (cf) |
|---|---|---|
| 100 - 120 | 10 | 10 |
| 120 - 140 | 15 | 25 |
| 140 - 160 | 20 | 45 |
| 160 - 180 | 5 | 50 |
| Total | N = 50 | |
- Median Position: N/2 = 50/2 = 25.
- Median Class: The class with a cumulative frequency just greater than or equal to 25 is 120 - 140.
- Identify Values:
- L = 120
- N = 50
- cf = 10 (cf of the class before the median class)
- f = 15 (frequency of the median class)
- h = 140 - 120 = 20
- Calculate:
- Median = 120 + [ (25 - 10) / 15 ] * 20
- Median = 120 + [ 15 / 15 ] * 20
- Median = 120 + 1 * 20 = $140.
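The three steps (cumulate, locate the median class, interpolate) can be sketched as follows. The `(lower, upper, frequency)` tuple layout and the name `grouped_median` are assumptions made for this illustration, and the classes are assumed to be listed in ascending order.

```python
def grouped_median(classes):
    """Median of grouped data: L + ((N/2 - cf) / f) * h for the median class."""
    total = sum(f for _, _, f in classes)
    half = total / 2
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        # The median class is the first one whose cumulative frequency reaches N/2.
        if f > 0 and cf_before + f >= half:
            width = upper - lower
            return lower + (half - cf_before) / f * width
        cf_before += f
    raise ValueError("the distribution has no observations")


weekly_wages = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]

print(grouped_median(weekly_wages))  # 140.0
```

Here the median lands exactly on the upper boundary of the median class because the cumulative frequency of that class equals N/2.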
Properties, Pros, and Cons of the Median
- Properties:
- It is a positional average.
- It is not affected by extreme values.
- Pros:
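- Usually the preferred measure for skewed distributions (e.g., income, house prices), because a few very large or very small values do not pull it away from the typical value.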
- Can be calculated for open-ended class intervals.
- Easy to understand.
- Cons:
- It is not based on all values in the dataset.
- It is not suitable for further algebraic treatment.
- Requires data to be sorted, which can be time-consuming for large datasets.
2.3 Mode
The Mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), more than two modes (multimodal), or no mode at all.
A. Mode for Ungrouped and Discrete Data
- Simply identify the value with the highest frequency.
- Example (Ungrouped): Data: 5, 8, 10, 8, 12, 8, 15.
- The value 8 appears 3 times, more than any other value.
- Mode = 8.
- Example (Discrete): Using the product defects table. The defect count of 0 has the highest frequency (45).
- Mode = 0 defects.
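Both examples amount to counting occurrences and picking the largest count. The sketch below returns every value tied for the highest frequency, so bimodal and multimodal data are handled too. The name `modes_of` is illustrative. When every value occurs exactly once the function returns all of them, which a caller may prefer to read as "no mode"; that interpretation is a convention, not something the code enforces.

```python
from collections import Counter


def modes_of(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Ungrouped example.
print(modes_of([5, 8, 10, 8, 12, 8, 15]))  # [8]

# Discrete example: defects (x) -> number of products (f).
defects = {0: 45, 1: 30, 2: 15, 3: 10}
print(max(defects, key=defects.get))       # 0
```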
B. Mode for Continuous Frequency Distribution (Grouped Data)
- Identify the modal class, which is the class interval with the highest frequency.
- Apply the interpolation formula.
- Formula: Mode = L + [ (f₁ - f₀) / (2f₁ - f₀ - f₂) ] * h
- Where: L is the lower class boundary of the modal class, f₁ is the frequency of the modal class, f₀ is the frequency of the class preceding the modal class, f₂ is the frequency of the class succeeding the modal class, and h is the class width of the modal class.
- Example: Using the same weekly wages data.
| Weekly Wages ($) | Frequency (f) |
|---|---|
| 100 - 120 | 10 |
| 120 - 140 | 15 (f₀) |
| 140 - 160 | 20 (f₁) |
| 160 - 180 | 5 (f₂) |
- Modal Class: The class 140 - 160 has the highest frequency (20).
- Identify Values:
- L = 140
- f₁ = 20
- f₀ = 15 (frequency of the preceding class 120 - 140)
- f₂ = 5 (frequency of the succeeding class 160 - 180)
- h = 160 - 140 = 20
- Calculate:
- Mode = 140 + [ (20 - 15) / (2 * 20 - 15 - 5) ] * 20
- Mode = 140 + [ 5 / (40 - 20) ] * 20
- Mode = 140 + [ 5 / 20 ] * 20
- Mode = 140 + 5 = $145.
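The interpolation can be sketched in Python as below. The `(lower, upper, frequency)` layout and the name `grouped_mode` are illustrative. Two assumptions go beyond the notes: if several classes tie for the highest frequency the first one is used, and a missing neighbour (when the modal class is the first or last class) is treated as having frequency 0.

```python
def grouped_mode(classes):
    """Mode of grouped data: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))  # modal class (first one if there is a tie)
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0               # assumption: no class before -> 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # assumption: no class after -> 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f1 equals f0 + f2")
    return lower + (f1 - f0) / denominator * (upper - lower)


weekly_wages = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]

print(grouped_mode(weekly_wages))  # 145.0
```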
Properties, Pros, and Cons of the Mode
- Properties:
- A dataset may have no mode or multiple modes.
- It is not affected by extreme values.
- Pros:
- The only measure of central tendency that can be used for nominal (categorical) data (e.g., most popular car color).
- Easy to understand and locate in a frequency distribution.
- Cons:
- It is not rigidly defined; it may not exist or may not be unique.
- It is not based on all values in the dataset.
- It is not suitable for further algebraic treatment.
- Can be unstable, as a small change in the data can significantly change the mode.