Unit 4 - Notes

QTT201

Unit 4: Statistical Data and Central Tendency

1. Introduction to Statistical Data

Statistical data refers to the collection of facts, figures, and measurements that are gathered, analyzed, interpreted, and presented for a specific purpose. In business, data is crucial for decision-making, forecasting, performance evaluation, and strategic planning. Data can be classified based on the number of variables being studied simultaneously.

1.1 Univariate Data

Definition: Univariate data consists of observations on a single variable. The purpose of univariate analysis is to describe that variable: find patterns and summarize its central tendency, dispersion, and shape.

Variable: A single characteristic or attribute is measured for each subject or item.
Focus: Describing the characteristics of that single variable. It does not deal with causes or relationships between variables.
Analysis Techniques:
- Frequency Distribution Tables
- Graphs: Histograms, Bar Charts, Pie Charts, Line Charts, Box Plots
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Dispersion: Range, Variance, Standard Deviation

Business Examples:

The monthly sales figures for a single product over a year. (Variable: Monthly Sales)
The salaries of all employees in a specific department. (Variable: Salary)
The number of units produced by a factory each day for a month. (Variable: Daily Production)
The ages of customers who purchased a new service. (Variable: Customer Age)

1.2 Bivariate Data

Definition: Bivariate data is data in which observations are made on two variables for each subject or item. The primary purpose is to explore the relationship or association between these two variables.

Variables: Two characteristics are measured simultaneously for each unit of observation.
Focus: Determining the strength and direction of the relationship between the two variables. One variable is often considered the independent (or explanatory) variable, and the other is the dependent (or response) variable.
Analysis Techniques:
- Scatter Plots to visualize the relationship.
- Correlation Analysis (e.g., Pearson's correlation coefficient) to measure the strength and direction of a linear relationship.
- Simple Linear Regression to model the relationship and make predictions.

Business Examples:

A company's advertising expenditure and its monthly sales revenue. (Variables: Advertising Spend, Sales Revenue)
An employee's years of experience and their annual salary. (Variables: Years of Experience, Annual Salary)
The price of a product and the number of units sold. (Variables: Price, Units Sold)
The temperature in a retail store and the sales of ice cream. (Variables: Temperature, Ice Cream Sales)
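
As a rough illustration of the correlation analysis mentioned above, the sketch below computes Pearson's correlation coefficient by hand. The function name `pearson_r` and the advertising and sales figures are hypothetical values invented for this example; they do not come from any real dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: monthly advertising spend and sales revenue (in $000s)
ad_spend = [10, 20, 30, 40, 50]
sales_revenue = [25, 45, 70, 85, 110]

r = pearson_r(ad_spend, sales_revenue)
print(round(r, 3))  # a value close to +1 indicates a strong positive linear relationship
```

A value of r near +1 or -1 suggests a strong linear association, while a value near 0 suggests little or none. Correlation alone does not establish that one variable causes the other.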

1.3 Multivariate Data

Definition: Multivariate data consists of observations on three or more variables for each subject or item. It is used to understand the complex relationships among multiple variables simultaneously.

Variables: Three or more characteristics are measured for each unit of observation.
Focus: Analyzing the interactions and interdependencies between multiple variables. This type of analysis is common in real-world business scenarios where outcomes are influenced by many factors.
Analysis Techniques:
- Multiple Regression Analysis to predict a dependent variable based on several independent variables.
- Factor Analysis to identify underlying variables or factors that explain the pattern of correlations within a set of observed variables.
- Cluster Analysis to group subjects or items based on similarities across several variables.
- MANOVA (Multivariate Analysis of Variance).

Business Examples:

Predicting house prices based on size, number of bedrooms, location, and age of the property. (Variables: Price, Size, Bedrooms, Location, Age)
Analyzing customer satisfaction based on product quality, price, customer service, and delivery time. (Variables: Satisfaction Score, Quality Rating, Price, Service Rating, Delivery Time)
Assessing the risk of a loan applicant based on their income, credit score, age, and existing debt. (Variables: Loan Risk, Income, Credit Score, Age, Debt Level)

2. Measures of Central Tendency

Measures of central tendency are single values that attempt to describe a set of data by identifying the central or typical value within that set. They are also known as measures of location or averages. The most common measures are the arithmetic mean, median, and mode.

2.1 Arithmetic Mean (Mean)

The Arithmetic Mean, or simply the mean, is the most common measure of central tendency. It is calculated by summing all the values in a dataset and dividing by the number of values.

A. Mean for Ungrouped (Raw) Data

This applies to a simple list of numbers.

Formula:
```
    x̄ = (Σx) / n
```
Where:
- x̄ (pronounced "x-bar") is the sample mean.
- Σx is the sum of all individual values.
- n is the number of values in the sample.
Example: A sales manager records the number of sales made by a salesperson over 5 days: 8, 12, 5, 10, 15.
- Sum of values (Σx) = 8 + 12 + 5 + 10 + 15 = 50
- Number of values (n) = 5
- Mean (x̄) = 50 / 5 = 10 sales per day.
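
The same calculation can be checked in a few lines of Python. This is a minimal sketch using the five sales figures from the example; the variable names are illustrative only.

```python
from statistics import mean

daily_sales = [8, 12, 5, 10, 15]  # values from the example above

# Direct application of the formula: sum of values divided by their count
manual_mean = sum(daily_sales) / len(daily_sales)

# The standard library gives the same result
library_mean = mean(daily_sales)

print(manual_mean)  # 10.0
```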

B. Mean for Discrete Frequency Distribution

This is used when data is presented in a table with frequencies.

Formula:
```
    x̄ = (Σfx) / (Σf)
```
Where:
- x is the value of the variable.
- f is the frequency of each value.
- Σfx is the sum of the products of each value and its frequency.
- Σf is the sum of all frequencies (which equals n).
Example: Number of defects found in a batch of 100 products.

Defects (x)	No. of Products (f)	fx
0	45	0 * 45 = 0
1	30	1 * 30 = 30
2	15	2 * 15 = 30
3	10	3 * 10 = 30
Total	Σf = 100	Σfx = 90

Mean (x̄) = Σfx / Σf = 90 / 100 = 0.9 defects per product.
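
A short sketch of the same frequency-weighted calculation, storing the table above as a dictionary that maps each value x to its frequency f (the dictionary layout is just one convenient choice):

```python
# Defects (x) -> number of products (f), taken from the table above
defect_counts = {0: 45, 1: 30, 2: 15, 3: 10}

total_f = sum(defect_counts.values())                      # Σf
total_fx = sum(x * f for x, f in defect_counts.items())    # Σfx
mean_defects = total_fx / total_f

print(total_f, total_fx, mean_defects)  # 100 90 0.9
```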

C. Mean for Continuous Frequency Distribution (Grouped Data)

This is used for data grouped into class intervals.

Formula:
```
    x̄ = (Σfm) / (Σf)
```
Where:
- f is the frequency of each class.
- m is the midpoint of each class interval. Midpoint = (Lower Limit + Upper Limit) / 2.
- Σfm is the sum of the products of each midpoint and its frequency.
- Σf is the total frequency (n).
Example: Weekly wages of 50 employees.

Weekly Wages ($)	No. of Employees (f)	Midpoint (m)	fm
100 - 120	10	110	10 * 110 = 1100
120 - 140	15	130	15 * 130 = 1950
140 - 160	20	150	20 * 150 = 3000
160 - 180	5	170	5 * 170 = 850
Total	Σf = 50		Σfm = 6900

Mean (x̄) = Σfm / Σf = 6900 / 50 = $138.
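
The grouped-data calculation can be sketched as follows, representing each class as a (lower limit, upper limit, frequency) tuple. That representation is an assumption made for this example. Because each class is summarized by its midpoint, the result is an estimate of the true mean.

```python
# (lower limit, upper limit, frequency) for each wage class in the table above
wage_classes = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]

total_f = sum(f for _, _, f in wage_classes)                      # Σf
# Midpoint m = (lower + upper) / 2, weighted by the class frequency
total_fm = sum(f * (lo + hi) / 2 for lo, hi, f in wage_classes)   # Σfm
grouped_mean = total_fm / total_f

print(total_fm, grouped_mean)  # 6900.0 138.0
```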

D. Combined Mean

The combined arithmetic mean is used to calculate the mean of a combined group from the means of its subgroups.

Formula (for two groups):
```
    x̄_c = (n₁x̄₁ + n₂x̄₂) / (n₁ + n₂)
```
Where:
- x̄_c is the combined mean.
- n₁, n₂ are the number of observations in group 1 and group 2.
- x̄₁, x̄₂ are the means of group 1 and group 2.
Example: A company has two departments. The production department has 40 employees with an average salary of $3,000. The marketing department has 10 employees with an average salary of $4,000.
- n₁ = 40, x̄₁ = 3000
- n₂ = 10, x̄₂ = 4000
- x̄_c = (40 * 3000 + 10 * 4000) / (40 + 10)
- x̄_c = (120000 + 40000) / 50
- x̄_c = 160000 / 50 = $3,200.
- The average salary for the entire company is $3,200.
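
The formula extends naturally to any number of subgroups: weight each subgroup mean by its size and divide by the total size. The helper below, with the hypothetical name `combined_mean`, is a small sketch of that idea applied to the two departments above.

```python
def combined_mean(groups):
    """Combined mean from a list of (n, mean) pairs, one pair per subgroup."""
    total_n = sum(n for n, _ in groups)
    weighted_sum = sum(n * m for n, m in groups)
    return weighted_sum / total_n

# Production: 40 employees at $3,000; Marketing: 10 employees at $4,000
company_mean = combined_mean([(40, 3000), (10, 4000)])
print(company_mean)  # 3200.0
```

Note that the combined mean is not the simple average of the two department means ($3,500), because the departments differ in size.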

Properties, Pros, and Cons of the Mean

Properties:
- It is unique for a given dataset.
- The sum of deviations of the items from their mean is always zero (Σ(x - x̄) = 0).
- It is calculated using every value in the dataset.
Pros:
- It is rigidly defined and easy to calculate and understand.
- It is suitable for further algebraic treatment.
Cons:
- It is highly affected by extreme values (outliers). For instance, if the fifth day's sales in the ungrouped example were 100 instead of 15, the mean would jump from 10 to 27 sales per day, even though four of the five days had 12 or fewer sales.
- It cannot be calculated for open-ended class intervals.
- It cannot be used for qualitative data (e.g., brand preferences).
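
Two of the points above, the zero-sum property of deviations and the sensitivity to outliers, can be verified with the five-day sales figures from Section 2.1 A. The value 100 is a hypothetical outlier substituted for illustration.

```python
sales = [8, 12, 5, 10, 15]
mean_sales = sum(sales) / len(sales)

# Property: deviations from the mean always sum to zero
deviations = [x - mean_sales for x in sales]
print(sum(deviations))  # 0.0

# Con: replacing 15 with a hypothetical outlier of 100 pulls the mean upward
sales_with_outlier = [8, 12, 5, 10, 100]
mean_with_outlier = sum(sales_with_outlier) / len(sales_with_outlier)
print(mean_sales, mean_with_outlier)  # 10.0 27.0
```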

2.2 Median

The Median is the positional middle value of a dataset that has been arranged in ascending or descending order. It divides the data into two equal halves.

A. Median for Ungrouped (Raw) Data

Step 1: Arrange the data in ascending order.
Step 2: Find the position of the median using the formula: (n + 1) / 2.
Step 3: Identify the value at that position.

Case 1: Odd number of observations (n is odd)
- Data: 5, 8, 10, 12, 15 (already sorted)
- n = 5
- Position = (5 + 1) / 2 = 3rd value.
- Median = 10.
Case 2: Even number of observations (n is even)
- Data: 5, 8, 10, 12, 15, 20 (already sorted)
- n = 6
- Position = (6 + 1) / 2 = 3.5. This means the median is the average of the 3rd and 4th values.
- Median = (10 + 12) / 2 = 11.
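
Both cases can be handled by one small function. The sketch below uses the hypothetical name `median_by_position` and zero-based list indexing, and compares the result with Python's built-in `statistics.median`.

```python
from statistics import median

def median_by_position(values):
    """Median of a list: middle value if n is odd, mean of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # zero-based index of the upper middle element
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median_by_position([5, 8, 10, 12, 15])
even_case = median_by_position([5, 8, 10, 12, 15, 20])
print(odd_case, even_case)  # 10 11.0
```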

B. Median for Continuous Frequency Distribution (Grouped Data)

Step 1: Calculate the cumulative frequencies (cf).
Step 2: Determine the median class by finding the position N/2 (where N = Σf). The median class is the first class whose cumulative frequency is greater than or equal to N/2.
Step 3: Apply the interpolation formula.

Formula:
```
    Median = L + [ (N/2 - cf) / f ] * h
```
Where:
- L = Lower class boundary of the median class.
- N = Total frequency (Σf).
- cf = Cumulative frequency of the class preceding the median class.
- f = Frequency of the median class.
- h = Class width of the median class.
Example: Using the same weekly wages data.

Weekly Wages ($)	Frequency (f)	Cumulative Freq. (cf)
100 - 120	10	10
120 - 140	15	25
140 - 160	20	45
160 - 180	5	50
Total	N = 50

Median Position: N/2 = 50/2 = 25.
Median Class: The first class whose cumulative frequency reaches 25 is 120 - 140 (cf = 25).
Identify Values:
- L = 120
- N = 50
- cf = 10 (cf of the class before the median class)
- f = 15 (frequency of the median class)
- h = 140 - 120 = 20
Calculate:
- Median = 120 + [ (25 - 10) / 15 ] * 20
- Median = 120 + [ 15 / 15 ] * 20
- Median = 120 + 1 * 20 = $140.
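
The steps above can be sketched as a single function. This is an illustrative sketch, not a standard library routine: the name `grouped_median` is hypothetical, and it assumes the classes are given in ascending order as contiguous (lower, upper, frequency) tuples.

```python
def grouped_median(classes):
    """Interpolated median for classes given as (lower, upper, frequency) tuples."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        # The median class is the first one whose cumulative frequency reaches N/2
        if cf_before + f >= half:
            return lower + (half - cf_before) / f * (upper - lower)
        cf_before += f

wage_classes = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]
median_wage = grouped_median(wage_classes)
print(median_wage)  # 140.0
```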

Properties, Pros, and Cons of the Median

Properties:
- It is a positional average.
- It is not affected by extreme values.
Pros:
- Usually the preferred measure for skewed distributions (e.g., income, house prices), because a few extreme values do not pull it away from the typical value.
- Can be calculated for open-ended class intervals.
- Easy to understand.
Cons:
- It is not based on all values in the dataset.
- It is not suitable for further algebraic treatment.
- Requires data to be sorted, which can be time-consuming for large datasets.

2.3 Mode

The Mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), more than two modes (multimodal), or no mode at all.

A. Mode for Ungrouped and Discrete Data

Simply identify the value with the highest frequency.
Example (Ungrouped): Data: 5, 8, 10, 8, 12, 8, 15.
- The value 8 appears 3 times, more than any other value.
- Mode = 8.
Example (Discrete): Using the product defects table. The defect count of 0 has the highest frequency (45).
- Mode = 0 defects.
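
A minimal sketch of finding the mode by counting, using `collections.Counter`. The helper name `modes` is hypothetical; it returns a list so that bimodal and multimodal data are reported in full. One caveat: when every value occurs exactly once, this function returns all values, whereas the convention in these notes would describe such data as having no mode.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([5, 8, 10, 8, 12, 8, 15]))  # [8]
print(modes([1, 1, 2, 2, 3]))           # [1, 2]  (bimodal)

# Discrete frequency table: pick the value with the largest frequency
defect_counts = {0: 45, 1: 30, 2: 15, 3: 10}
discrete_mode = max(defect_counts, key=defect_counts.get)
print(discrete_mode)  # 0
```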

B. Mode for Continuous Frequency Distribution (Grouped Data)

Step 1: Identify the modal class, which is the class interval with the highest frequency.
Step 2: Apply the interpolation formula.

Formula:
```
    Mode = L + [ (f₁ - f₀) / (2f₁ - f₀ - f₂) ] * h
```
Where:
- L = Lower class boundary of the modal class.
- f₁ = Frequency of the modal class.
- f₀ = Frequency of the class preceding the modal class.
- f₂ = Frequency of the class succeeding the modal class.
- h = Class width of the modal class.
Example: Using the same weekly wages data.

Weekly Wages ($)	Frequency (f)
100 - 120	10
120 - 140	15 (f₀)
140 - 160	20 (f₁)
160 - 180	5 (f₂)

Modal Class: The class 140 - 160 has the highest frequency (20).
Identify Values:
- L = 140
- f₁ = 20
- f₀ = 15 (frequency of the preceding class 120-140)
- f₂ = 5 (frequency of the succeeding class 160-180)
- h = 160 - 140 = 20
Calculate:
- Mode = 140 + [ (20 - 15) / (2 * 20 - 15 - 5) ] * 20
- Mode = 140 + [ 5 / (40 - 20) ] * 20
- Mode = 140 + [ 5 / 20 ] * 20
- Mode = 140 + 5 = $145.
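
The interpolation can be sketched as a function over (lower, upper, frequency) tuples. The name `grouped_mode` is hypothetical, and two choices here are assumptions rather than part of the notes: a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0, and if several classes tie for the highest frequency, the first one is used.

```python
def grouped_mode(classes):
    """Interpolated mode for classes given as (lower, upper, frequency) tuples."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                      # index of the modal class
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                # preceding class frequency
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # succeeding class frequency
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is not defined by this formula for these frequencies")
    return lower + (f1 - f0) / denominator * (upper - lower)

wage_classes = [(100, 120, 10), (120, 140, 15), (140, 160, 20), (160, 180, 5)]
modal_wage = grouped_mode(wage_classes)
print(modal_wage)  # 145.0
```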

Properties, Pros, and Cons of the Mode

Properties:
- A dataset may have no mode or multiple modes.
- It is not affected by extreme values.
Pros:
- The only measure of central tendency that can be used for nominal (categorical) data (e.g., most popular car color).
- Easy to understand and locate in a frequency distribution.
Cons:
- It is not rigidly defined; it may not exist or may not be unique.
- It is not based on all values in the dataset.
- It is not suitable for further algebraic treatment.
- Can be unstable, as a small change in the data can significantly change the mode.
