Parsing a CSV (The Old Fashioned Way)¶

In this project, you'll use Python's csv module to parse a csv. Each row in the CSV is a "block" of colocated people and their housing (and average of 1425.5 people per entry).

You can read about what a CSV is here:

https://en.wikipedia.org/wiki/Comma-separated_value

And you can read about this particular data set here:

Part 1¶

For part 1, you'll parse the CSV and compute some general aggregates.

For each column, calculate:

The minimum
The maximum
The average

Part 2¶

For part two, you will continue the work from Part 1 but calculate aggregates for different income brackets.

Calculate the same values but for each of the following 5 income bracks:
- less than 50,000
- 50,000 to 100,000
- 100,000+

Important: The median_income column in this dataset is in tens of thousands of USD! A value of 6.6085 means $66,085!

Extra Challenge¶

The above is already some very real data analysis, but if you're up for a challenge...

Compute the median of each column.
Compute the quartiles of each column
Compute the standard deviation of each column defined as: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Encapsulate all of the above as functions

Starter Code¶

To start, I've layed out some code to help you:

Open and read the lines of the CSV file.
Separate out the headers.
Read the first vaue of the first line of the CSV.

In [5]:

Copied!





import csv

with open("sample_data/california_housing_test.csv") as fp:
    rows = list(csv.DictReader(fp))

# rows = [{"median_age": 26.4, "latitude": 1234, ...}, ...]


# ABOVE THIS LINE, YOU PROBABLY WANT TO LEAVE AS IS
# ------------------------------------------------------------------------------
# BELOW THIS LINE IS EXAMPLE CODE, FEEL FREE TO CHANGE

# Each entry in rows represents one row in the CSV.
num_rows = len(rows)
print(f"This CSV has {num_rows} entries!")

# Each entry is a dictonary. The dictionary keys came from the header row.
headers = list(rows[0].keys())
print(f"The headers are {headers}")

# And we can access any value in any entry by using the correct header name.
first_row_median_age = rows[0]['housing_median_age']
print(f"The housing median age of the first entry is {first_row_median_age}")
import csv

with open("sample_data/california_housing_test.csv") as fp:
    rows = list(csv.DictReader(fp))

# rows = [{"median_age": 26.4, "latitude": 1234, ...}, ...]


# ABOVE THIS LINE, YOU PROBABLY WANT TO LEAVE AS IS
# ------------------------------------------------------------------------------
# BELOW THIS LINE IS EXAMPLE CODE, FEEL FREE TO CHANGE

# Each entry in rows represents one row in the CSV.
num_rows = len(rows)
print(f"This CSV has {num_rows} entries!")

# Each entry is a dictonary. The dictionary keys came from the header row.
headers = list(rows[0].keys())
print(f"The headers are {headers}")

# And we can access any value in any entry by using the correct header name.
first_row_median_age = rows[0]['housing_median_age']
print(f"The housing median age of the first entry is {first_row_median_age}")

This CSV has 3000 entries!
The headers are ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']
The housing median age of the first entry is 27.000000

In [6]:

Copied!





my_values = [17, 40, 23, 9, 18, 11, 21]

min_value = 1000000000000
for value in my_values:
    if value < min_value:
        min_value = value

min_value
my_values = [17, 40, 23, 9, 18, 11, 21]

min_value = 1000000000000
for value in my_values:
    if value < min_value:
        min_value = value

min_value

Out[6]:

In [21]:

Copied!





def min_of_col(col_name):
    min_value = 10000000
    for row in rows:
        value = float(row[col_name])
        if value < min_value:
            min_value = value
    return min_value

headers = list(rows[0].keys())
for header in headers:
    min_value_of_col = min_of_col(header)
    print(f"{header} min is {min_value_of_col}")
def min_of_col(col_name):
    min_value = 10000000
    for row in rows:
        value = float(row[col_name])
        if value < min_value:
            min_value = value
    return min_value

headers = list(rows[0].keys())
for header in headers:
    min_value_of_col = min_of_col(header)
    print(f"{header} min is {min_value_of_col}")

longitude min is -124.18
latitude min is 32.56
housing_median_age min is 1.0
total_rooms min is 6.0
total_bedrooms min is 2.0
population min is 5.0
households min is 2.0
median_income min is 0.4999
median_house_value min is 22500.0

In [38]:

Copied!





def min_of_values(values):
    min_value = 1000000
    for value in values:
        if value < min_value:
            min_value = value
    return min_value

def max_of_values(values):
    max_value = -1000000
    for value in values:
        if value > max_value:
            max_value = value
    return max_value

def avg_of_values(values):
    sum = 0
    for value in values:
        sum += value
    avg = sum / len(values)
    return avg

def get_col_values(target_rows, col_name):
    value_list = []
    for row in target_rows:
        value = float(row[col_name])
        value_list.append(value)
    return value_list

for header in headers:
    col_values = get_col_values(rows, header)
    min_value = min_of_values(col_values)
    max_value = max_of_values(col_values)
    avg_value = avg_of_values(col_values)
    print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
def min_of_values(values):
    min_value = 1000000
    for value in values:
        if value < min_value:
            min_value = value
    return min_value

def max_of_values(values):
    max_value = -1000000
    for value in values:
        if value > max_value:
            max_value = value
    return max_value

def avg_of_values(values):
    sum = 0
    for value in values:
        sum += value
    avg = sum / len(values)
    return avg

def get_col_values(target_rows, col_name):
    value_list = []
    for row in target_rows:
        value = float(row[col_name])
        value_list.append(value)
    return value_list

for header in headers:
    col_values = get_col_values(rows, header)
    min_value = min_of_values(col_values)
    max_value = max_of_values(col_values)
    avg_value = avg_of_values(col_values)
    print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")

For longitude, min is -124.18, max is -114.49, avg is -119.58920000000029
For latitude, min is 32.56, max is 41.92, avg is 35.63538999999999
For housing_median_age, min is 1.0, max is 52.0, avg is 28.845333333333333
For total_rooms, min is 6.0, max is 30450.0, avg is 2599.578666666667
For total_bedrooms, min is 2.0, max is 5419.0, avg is 529.9506666666666
For population, min is 5.0, max is 11935.0, avg is 1402.7986666666666
For households, min is 2.0, max is 4930.0, avg is 489.912
For median_income, min is 0.4999, max is 15.0001, avg is 3.807271799999998
For median_house_value, min is 22500.0, max is 500001.0, avg is 205846.275

In [40]:

Copied!





# Just for houses w/ median_income between 100,000 and 200,000

rows_between_100_and_200 = []
for row in rows:
    row_median_income = float(row["median_income"])
    if row_median_income >= 10 and row_median_income < 20:
        rows_between_100_and_200.append(row)

for header in headers:
    col_values = get_col_values(rows_between_100_and_200, header)
    min_value = min_of_values(col_values)
    max_value = max_of_values(col_values)
    avg_value = avg_of_values(col_values)
    print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
# Just for houses w/ median_income between 100,000 and 200,000

rows_between_100_and_200 = []
for row in rows:
    row_median_income = float(row["median_income"])
    if row_median_income >= 10 and row_median_income < 20:
        rows_between_100_and_200.append(row)

for header in headers:
    col_values = get_col_values(rows_between_100_and_200, header)
    min_value = min_of_values(col_values)
    max_value = max_of_values(col_values)
    avg_value = avg_of_values(col_values)
    print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")

For longitude, min is -122.47, max is -117.18, avg is -119.85999999999997
For latitude, min is 32.83, max is 37.9, avg is 35.446250000000006
For housing_median_age, min is 2.0, max is 52.0, avg is 29.075
For total_rooms, min is 126.0, max is 10111.0, avg is 3095.05
For total_bedrooms, min is 20.0, max is 1295.0, avg is 395.95
For population, min is 25.0, max is 3599.0, avg is 1077.5
For households, min is 9.0, max is 1257.0, avg is 380.45
For median_income, min is 10.0549, max is 15.0001, avg is 11.942279999999995
For median_house_value, min is 225000.0, max is 500001.0, avg is 485230.875

In [ ]: