Monday, July 1st

Today's Lesson:

  • Pandas Recap
  • Describing Data Sets
  • Data Visualization

Warm-Up

http://gg.gg/1b9wxp

Statistics for Describing Datasets

The Development of Statistics

Early Statistics

How Data Happened | Chris Wiggins, Matthew L Jones

How Data Happened by Chris Wiggins and Matthew L Jones

Early Data Collection

  • In the 18th century, as empires expanded and governance became more complex, rulers needed a way to understand their domains.
  • They began recording data on population, land, and resources.
  • "Statistics" was a term that initially meant knowledge about the state.

Adolphe Quetelet

  • Nineteenth-century Belgian astronomer
    • Body Mass Index (BMI)
    • The "Average Man" (l’homme moyen)

"The founder of the most important science in the whole world."
– Florence Nightingale

  • 1823, Quetelet travels to the Paris observatory and meets Laplace.
  • Learns techniques for resolving multiple observations of a star's position into a single value.
  • Quetelet has the insight to take these ideas and apply them to social data such as crime and suicide rates.

Despite “the fluctuation of numbers,” there is “really a number whose value we seek to determine, whether it is the height of an individual . . . , or the right ascension of the polar star.”


Normal Distribution

Bell Curve / Gaussian distribution

  • Symmetrical
  • mean == median == mode
    • ≈68% of values within ±1σ of the mean
    • ≈95% within ±2σ
    • ≈99.7% within ±3σ
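The 68 / 95 / 99.7 percentages can be checked numerically. A minimal sketch using `statistics.NormalDist` from Python's standard library; the helper name `share_within` is made up for this illustration:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.1%}")
```

The same shares hold for any mean and standard deviation, because the rule is stated in units of σ.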

Francis Galton

  • Expands on Quetelet's ideas.
  • Introduces regression and correlation.
  • Aimed to rank individuals within distributions, influencing modern testing.
  • His work led to the eugenics movement, which advocated selective breeding to improve human genetics.

“We want abler commanders, statesmen, thinkers, inventors, and artists.”


Regression

  • Models the relationship between a dependent variable and one or more independent variables.
  • Linear regression is the most common type (line of best fit).
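To make "line of best fit" concrete, here is a rough sketch of simple linear regression by ordinary least squares in plain Python. The function name `fit_line` and the hours/scores data are made up for illustration and are not from the lesson:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y move together, and how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data: hours studied (independent) vs. quiz score (dependent)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score ≈ {slope:.2f} * hours + {intercept:.2f}")
```

For this made-up data the fitted line is roughly score ≈ 4.10 * hours + 47.70.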

Correlation

  • Measures the strength and direction of the relationship between two variables.
  • Galton introduced the concept when studying heights of parents and their children.
How Should I Look at a Dataset

1. Ask yourself, What Kind of Data Do I Have?

  • Nominal: Categories (e.g., colors)
  • Ordinal: Ordered categories (e.g., ratings)
  • Interval: Numeric but no true zero (e.g., temperature in °C)
  • Ratio/Metric: Numeric w/ true zero (e.g., height, weight)

2. Measures of Central Tendency

  • Mean: Sum of the values divided by the number of values
  • Median: Middle value when data is ordered
  • Mode: Most frequent value

2. Measures of Central Tendency in Python

values = [7, 2, 3, 4, 5, 6, 7, 7, 9, 4]

Mean

mean = sum(values) / len(values)
5.4

Median

values = sorted(values)
n = len(values)
if n % 2 == 0:
    lo_med = values[n//2 - 1]
    hi_med = values[n//2]
    median = (lo_med + hi_med) / 2
else:
    median = values[n//2]
5.5

Mode

freqs = {}
for value in values:
    if value in freqs:
        freqs[value] += 1
    else:
        freqs[value] = 1
mode = -1
hi_freq = 0
for value, freq in freqs.items():
    if freq > hi_freq:
        mode = value
        hi_freq = freq  # remember the highest count seen so far
7
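As a cross-check on the hand-written versions, Python's standard-library `statistics` module offers the same three measures. A short sketch on the same list:

```python
import statistics

values = [7, 2, 3, 4, 5, 6, 7, 7, 9, 4]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # averages the two middle values when n is even
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)  # 5.4 5.5 7
```

If a dataset has several equally frequent values, `statistics.multimode(values)` returns all of them as a list.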

2. Measures of Central Tendency w/ Pandas

import pandas as pd
values = pd.Series([7, 2, 3, 4, 5, 6, 7, 7, 9, 4])

Mean

mean = values.mean()
5.4

Median

median = values.median()
5.5

Mode

mode = values.mode().iloc[0]  # mode() returns a Series of all modes; take the first
7

3. Measures of Dispersion

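  • Range: Difference between the largest and smallest values
  • Standard Deviation: Typical distance of values from the mean (square root of the variance)
  • Interquartile Range: Spread of the middle 50% of the data (Q3 − Q1)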

3. Measures of Dispersion in Python

values = [7, 2, 3, 4, 5, 6, 7, 7, 9, 4]

Range

rng = max(values) - min(values)
7

Standard Deviation

mean = sum(values) / len(values)
sum_2 = 0
for value in values:
    sum_2 += (value - mean) ** 2
variance = sum_2 / (len(values) - 1)
std_dev = variance ** (1/2)
2.1705 (rounded)

Interquartile Range

values = sorted(values)
n = len(values)
q1 = values[n//4]
q3 = values[3*n//4]
iqr = q3 - q1
3

3. Measures of Dispersion w/ Pandas

import pandas as pd
values = pd.Series([7, 2, 3, 4, 5, 6, 7, 7, 9, 4])

Range

rng = values.max() - values.min()
7

Standard Deviation

std_dev = values.std()
2.1705 (rounded)

Interquartile Range

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
3.0

4. Percentiles and Quartiles

  • Percentile: Value below which a given percentage of the data falls
  • Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile)

4. Percentiles and Quartiles w/ Pandas

values = pd.Series([7, 2, 3, 4, 5, 6, 7, 7, 9, 4])

p05 = values.quantile(0.05)
p25 = values.quantile(0.25)
p50 = values.quantile(0.50)  # Same as median
p75 = values.quantile(0.75)
p95 = values.quantile(0.95)
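To see where a value like `quantile(0.05)` comes from, here is a plain-Python sketch of a quantile with linear interpolation, which is the default interpolation pandas uses. The function name `quantile` is chosen for this illustration only:

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)        # fractional index into the sorted list
    lo = int(pos)                       # index at or just below that position
    hi = min(lo + 1, len(ordered) - 1)  # next index, clamped to the last element
    frac = pos - lo                     # how far we are between lo and hi
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

values = [7, 2, 3, 4, 5, 6, 7, 7, 9, 4]

print(quantile(values, 0.25))  # 4.0
print(quantile(values, 0.50))  # 5.5, same as the median
print(quantile(values, 0.75))  # 7.0
```

Other interpolation rules exist, so different tools can report slightly different percentiles for the same small dataset.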

5. Data Distribution

  • Is it normal?
    • There are a few ways to check normality, but they're not foolproof. Popular methods include Q-Q plots and the Shapiro-Wilk test.
  • Skewness
    • Indicates asymmetry (> 0: right-skewed, < 0: left-skewed)
  • Kurtosis
    • Measures tail heaviness
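Skewness and kurtosis can be sketched from the moments of the data. The version below is one possible illustration using the simple population (biased) formulas; the function names and the sample lists are made up. pandas' `Series.skew()` and `Series.kurt()` apply a bias correction, so their results will differ somewhat on small samples:

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over the data."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third moment scaled by the standard deviation cubed (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth moment scaled by the variance squared, minus 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive: right-skewed
print(excess_kurtosis(symmetric))
```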

6. Correlation and Covariance

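  • Correlation: Covariance rescaled to the range −1 to 1, giving the strength and direction of a linear relationship
  • Covariance: How two variables vary together, in the units of the two variables multiplied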
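A minimal plain-Python sketch of both measures, assuming the sample versions (dividing by n − 1) and Pearson correlation. The function names are illustrative, and the data is a simple made-up case where y = 2x:

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    std_x = covariance(xs, xs) ** 0.5  # covariance of x with itself is its variance
    std_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # exactly 2 * x, a perfect linear relationship

print(covariance(x, y))   # 5.0
print(correlation(x, y))  # approximately 1.0
```

Doubling every y value would double the covariance but leave the correlation unchanged, which is why correlation is easier to compare across datasets.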

6. Correlation and Covariance w/ Pandas

import pandas as pd
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10]
}
df = pd.DataFrame(data)

corr = df['x'].corr(df['y'])
cov = df['x'].cov(df['y'])

9. Outliers

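  • Is it significantly outside the inner fences (Q1 − 1.5 × IQR and Q3 + 1.5 × IQR)?

  • We can also look at the z-score, z = (x − mean) / standard deviation, to determine if it's an outlier.

    • A common rule of thumb flags z > 3 or z < −3.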
  • Outliers are only outliers until they're not.
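Both checks can be sketched with the standard library. The cutoffs used here (1.5 × IQR for the fences, 3 for the z-score) are common conventions assumed for this example, the function names are made up, and the value 25 is an invented extreme reading added to the lesson's list:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the inner fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def z_score_outliers(values, threshold=3.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [v for v in values if abs((v - mean) / std_dev) > threshold]

# The lesson's values plus one made-up extreme reading (25)
values = [7, 2, 3, 4, 5, 6, 7, 7, 9, 4, 25]

print(iqr_outliers(values))                     # [25]
print(z_score_outliers(values))                 # []
print(z_score_outliers(values, threshold=2.5))  # [25]
```

The two methods disagree here: in a small sample the extreme value inflates the standard deviation, so its z-score stays just under 3 even though it sits far outside the fences.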

Francis Galton introduced the concept of "regression" to describe the tendency of offspring to revert toward the population average rather than inherit the full extremes of their parents, a phenomenon he observed in the heights of parents and their children.