Ch1. Descriptive Statistics — Summarizing Data with Mean, Variance, and Standard Deviation
What Is Statistics?
Statistics: The discipline of collecting, organizing, analyzing, and interpreting data to support decision-making under uncertainty.
Descriptive Statistics: Summarizes and describes a dataset
Inferential Statistics: Uses sample data to draw conclusions about a population
Measures of Central Tendency
Indicators that show where data tends to cluster.
Mean (Arithmetic Mean)
Arithmetic Mean = (Sum of all values) / (Number of data points)
Advantage: Intuitive and easy to compute
Disadvantage: Sensitive to extreme values (outliers)
Median
The middle value when data is sorted in order.
Odd count: The middle-positioned value
Even count: Average of the two middle values
Advantage: Robust to outliers
Usage: Real estate prices, income distributions (e.g., US Census median household income)
Mode
The value that appears most frequently. There can be multiple modes.
Usage: Categorical data (clothing sizes, preferred colors)
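The three measures above can be compared on a small dataset. The following is a minimal sketch using Python's standard `statistics` module; the salary figures are hypothetical values chosen for illustration, with one deliberate outlier.

```python
from statistics import mean, median, multimode

# Hypothetical monthly salaries in dollars; the last value is an outlier.
salaries = [3000, 3200, 3200, 3500, 4000, 25000]

avg = mean(salaries)         # sum of all values / number of data points
mid = median(salaries)       # even count: average of the two middle values
modes = multimode(salaries)  # list of the most frequent value(s)

print(f"mean   = {avg:.2f}")
print(f"median = {mid}")
print(f"modes  = {modes}")
```

The single large salary pulls the mean well above most of the values, while the median and mode stay close to where the data clusters.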
Relationship Between Mean, Median, and Mode
| Distribution Shape | Relationship |
|---|---|
| Symmetric and unimodal (e.g., normal distribution) | Mean = Median = Mode |
| Right-skewed (positive skew) | Typically Mode < Median < Mean |
| Left-skewed (negative skew) | Typically Mean < Median < Mode |
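The orderings in the table can be checked on toy data. The sketch below uses two hypothetical datasets (the left-skewed one is just the mirror image of the right-skewed one). Note that these orderings are a rule of thumb for typical unimodal data, not a guarantee for every skewed distribution.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: a long tail of large values.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the data above, giving a long tail of small values.
left_skewed = [16 - x for x in right_skewed]

def summary(data):
    """Return (mode, median, mean) for a dataset."""
    return mode(data), median(data), mean(data)

r_mode, r_median, r_mean = summary(right_skewed)
l_mode, l_median, l_mean = summary(left_skewed)

print("right-skewed:", r_mode, r_median, r_mean)  # mode < median < mean
print("left-skewed: ", l_mode, l_median, l_mean)  # mean < median < mode
```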
Measures of Dispersion
Indicators that show how spread out the data is.
Range
Range = Maximum value − Minimum value
Simple but highly sensitive to outliers.
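A short sketch of why the range is so sensitive to outliers; the values are hypothetical.

```python
# Hypothetical measurements without and with a single extreme value.
values = [12, 15, 14, 13, 16]
with_outlier = values + [90]

data_range = max(values) - min(values)
range_with_outlier = max(with_outlier) - min(with_outlier)

print(data_range)          # 16 - 12 = 4
print(range_with_outlier)  # 90 - 12 = 78
```

One added data point changes the range from 4 to 78, even though the other five values are unchanged.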
Variance
Population variance σ² = Σ(Xᵢ − μ)² / N
Sample variance s² = Σ(xᵢ − x̄)² / (n−1)
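The average of squared deviations from the mean. The sample version divides by n−1 instead of n to correct the bias that comes from estimating the mean from the same data. Difficult to interpret because the unit is squared.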
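The two variance formulas can be computed by hand and checked against Python's standard `statistics` module. This is a minimal sketch on a hypothetical dataset; the variable names are illustrative.

```python
from statistics import pvariance, variance

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mu = sum(data) / n
sum_sq_dev = sum((x - mu) ** 2 for x in data)  # Σ(x − mean)²

pop_var = sum_sq_dev / n         # population variance: divide by N
samp_var = sum_sq_dev / (n - 1)  # sample variance: divide by n − 1

print(pop_var, pvariance(data))  # both give the population variance
print(samp_var, variance(data))  # both give the sample variance
```

For the same data, the sample variance is always slightly larger than the population variance because of the smaller divisor.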
Standard Deviation
σ = √Variance (population)
s = √Sample variance (sample)
The square root of variance. Has the same units as the original data.
Applications:
- Stock market volatility (higher SD = higher risk)
- Quality control (control limits set at mean ± 3σ; for normally distributed data, about 99.7% of values fall within this band)
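As a rough illustration of the standard deviation and a ±3σ check, the sketch below computes both versions of the SD and flags any values outside mean ± 3σ. The dataset is hypothetical, and treating the data as a full population for the limits is an assumption made for simplicity.

```python
from statistics import mean, pstdev, stdev

# Hypothetical measurements (same units as the original data).
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = pstdev(data)  # population SD: square root of population variance
s = stdev(data)       # sample SD: square root of sample variance

# Assumed for illustration: limits at mean ± 3σ using the population SD.
mu = mean(data)
lower, upper = mu - 3 * sigma, mu + 3 * sigma
outside = [x for x in data if not lower <= x <= upper]

print(f"sigma = {sigma:.3f}, s = {s:.3f}")
print(f"limits = ({lower:.1f}, {upper:.1f}), outside = {outside}")
```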
Coefficient of Variation (CV)
CV = (Standard Deviation / Mean) × 100%
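Used to compare relative dispersion between groups that have different units or very different means. Because it divides by the mean, it is only meaningful when the mean is positive and not close to zero.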
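The sketch below compares two hypothetical groups measured in different units (heights in cm, weights in kg). The helper name `coefficient_of_variation` is illustrative, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Return CV in percent: (sample SD / mean) * 100."""
    m = mean(data)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(data) / m * 100

# Hypothetical measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights = {cv_height:.2f}%")
print(f"CV of weights = {cv_weight:.2f}%")
```

Because the CV is unitless, the two results can be compared directly: in this toy data, weights vary more relative to their mean than heights do.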
Quartiles and Box Plots
Quartiles:
- Q1 (25th percentile): lower 25% boundary
- Q2 (50th percentile): median
- Q3 (75th percentile): upper 25% boundary
- IQR = Q3 − Q1 (Interquartile Range)
Outlier Detection: below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
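The quartiles, IQR, and the 1.5×IQR rule can be sketched as follows. Quartile conventions differ between textbooks and software; the `method="inclusive"` setting of `statistics.quantiles` is an assumption here, and other conventions can give slightly different Q1 and Q3 values. The dataset is hypothetical.

```python
from statistics import quantiles

# Hypothetical sorted data with one suspiciously large value.
data = [1, 3, 5, 7, 9, 11, 13, 15, 40]

# n=4 splits the data into quartiles and returns the cut points [Q1, Q2, Q3].
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```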
Skewness and Kurtosis
Skewness: The degree of asymmetry in the distribution
- Positive skew (+): longer right tail
- Negative skew (−): longer left tail
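Kurtosis: Often described as the degree of peakedness, but it mainly reflects how heavy the tails are
- Normal distribution kurtosis = 3 (excess kurtosis = 0)
- Kurtosis > 3: heavier tails than a normal distribution, so extreme values are more likely (greater tail risk)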
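Skewness and kurtosis can be computed from central moments. The sketch below uses the simple population (moment-based) definitions, skewness = m₃ / m₂^1.5 and kurtosis = m₄ / m₂²; this choice is an assumption, since statistical packages may apply sample-size corrections or report excess kurtosis by default. Function names and data are illustrative.

```python
def central_moment(data, k):
    """Return the k-th central moment: mean of (x - mean)^k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based kurtosis (not excess): m4 / m2^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
left_skewed = [16 - x for x in right_skewed]

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_skewed))  # positive: longer right tail
print(skewness(left_skewed))   # negative: longer left tail
print(kurtosis(symmetric) - 3) # excess kurtosis relative to the normal
```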
Key Concept Cards
Mean vs Median ★★★★★ : When outliers are present, the median is a better representative value. This is why median household income (reported by the US Census Bureau) is preferred over mean income. Memory tip: outliers present → median is more appropriate
Standard Deviation ★★★★★ : Roughly how far data points typically deviate from the mean (the square root of the average squared deviation). The larger the SD, the more spread out the data. Memory tip: SD = the typical size of a deviation from the mean
IQR (Interquartile Range) ★★★★☆ : Q3 − Q1. The range of the middle 50% of data. Used for outlier detection and drawing box plots. Memory tip: IQR = Q3 − Q1; outlier thresholds = Q1 − 1.5×IQR and Q3 + 1.5×IQR
Practice Questions
Q. A company's employee salaries are [$48,000; $45,750; $225,000]. Which is the better representative measure: mean or median?
The median. The $225,000 salary (CEO) is an extreme outlier that greatly inflates the mean, whereas the median is largely unaffected by outliers.
Q. Compare the risk-return efficiency of Stock A (mean return 10%, SD 5%) and Stock B (mean return 20%, SD 8%) using the coefficient of variation.
CV of A = 5/10 = 50%; CV of B = 8/20 = 40%. Stock B has lower variability per unit of return, making it relatively more efficient.
OIYO Editorial
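Content Editor: A knowledge incubator and professional content creator. Researches and shares in-depth, practice- and certification-focused information useful for business, economics, law, and everyday life.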