Parsing a CSV (The Old Fashioned Way)¶
In this project, you'll use Python's csv
module to parse a csv. Each row in the CSV is a "block" of colocated people and their housing (and average of 1425.5 people per entry).
You can read about what a CSV is here:
And you can read about this particular data set here:
- https://www.kaggle.com/datasets/camnugent/california-housing-prices
- https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
Part 1¶
For part 1, you'll parse the CSV and compute some general aggregates.
For each column, calculate:
- The minimum
- The maximum
- The average
Part 2¶
For part two, you will continue the work from Part 1 but calculate aggregates for different income brackets.
- Calculate the same values but for each of the following 5 income bracks:
- less than 50,000
- 50,000 to 100,000
- 100,000+
Important: The median_income
column in this dataset is in tens of thousands of USD! A value of 6.6085
means $66,085
!
Extra Challenge¶
The above is already some very real data analysis, but if you're up for a challenge...
- Compute the median of each column.
- Compute the quartiles of each column
- Compute the standard deviation of each column defined as: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
- Encapsulate all of the above as functions
Starter Code¶
To start, I've layed out some code to help you:
- Open and read the lines of the CSV file.
- Separate out the headers.
- Read the first vaue of the first line of the CSV.
In [5]:
Copied!
import csv
with open("sample_data/california_housing_test.csv") as fp:
rows = list(csv.DictReader(fp))
# rows = [{"median_age": 26.4, "latitude": 1234, ...}, ...]
# ABOVE THIS LINE, YOU PROBABLY WANT TO LEAVE AS IS
# ------------------------------------------------------------------------------
# BELOW THIS LINE IS EXAMPLE CODE, FEEL FREE TO CHANGE
# Each entry in rows represents one row in the CSV.
num_rows = len(rows)
print(f"This CSV has {num_rows} entries!")
# Each entry is a dictonary. The dictionary keys came from the header row.
headers = list(rows[0].keys())
print(f"The headers are {headers}")
# And we can access any value in any entry by using the correct header name.
first_row_median_age = rows[0]['housing_median_age']
print(f"The housing median age of the first entry is {first_row_median_age}")
import csv
with open("sample_data/california_housing_test.csv") as fp:
rows = list(csv.DictReader(fp))
# rows = [{"median_age": 26.4, "latitude": 1234, ...}, ...]
# ABOVE THIS LINE, YOU PROBABLY WANT TO LEAVE AS IS
# ------------------------------------------------------------------------------
# BELOW THIS LINE IS EXAMPLE CODE, FEEL FREE TO CHANGE
# Each entry in rows represents one row in the CSV.
num_rows = len(rows)
print(f"This CSV has {num_rows} entries!")
# Each entry is a dictonary. The dictionary keys came from the header row.
headers = list(rows[0].keys())
print(f"The headers are {headers}")
# And we can access any value in any entry by using the correct header name.
first_row_median_age = rows[0]['housing_median_age']
print(f"The housing median age of the first entry is {first_row_median_age}")
This CSV has 3000 entries! The headers are ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'] The housing median age of the first entry is 27.000000
In [6]:
Copied!
my_values = [17, 40, 23, 9, 18, 11, 21]
min_value = 1000000000000
for value in my_values:
if value < min_value:
min_value = value
min_value
my_values = [17, 40, 23, 9, 18, 11, 21]
min_value = 1000000000000
for value in my_values:
if value < min_value:
min_value = value
min_value
Out[6]:
9
In [21]:
Copied!
def min_of_col(col_name):
min_value = 10000000
for row in rows:
value = float(row[col_name])
if value < min_value:
min_value = value
return min_value
headers = list(rows[0].keys())
for header in headers:
min_value_of_col = min_of_col(header)
print(f"{header} min is {min_value_of_col}")
def min_of_col(col_name):
min_value = 10000000
for row in rows:
value = float(row[col_name])
if value < min_value:
min_value = value
return min_value
headers = list(rows[0].keys())
for header in headers:
min_value_of_col = min_of_col(header)
print(f"{header} min is {min_value_of_col}")
longitude min is -124.18 latitude min is 32.56 housing_median_age min is 1.0 total_rooms min is 6.0 total_bedrooms min is 2.0 population min is 5.0 households min is 2.0 median_income min is 0.4999 median_house_value min is 22500.0
In [38]:
Copied!
def min_of_values(values):
min_value = 1000000
for value in values:
if value < min_value:
min_value = value
return min_value
def max_of_values(values):
max_value = -1000000
for value in values:
if value > max_value:
max_value = value
return max_value
def avg_of_values(values):
sum = 0
for value in values:
sum += value
avg = sum / len(values)
return avg
def get_col_values(target_rows, col_name):
value_list = []
for row in target_rows:
value = float(row[col_name])
value_list.append(value)
return value_list
for header in headers:
col_values = get_col_values(rows, header)
min_value = min_of_values(col_values)
max_value = max_of_values(col_values)
avg_value = avg_of_values(col_values)
print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
def min_of_values(values):
min_value = 1000000
for value in values:
if value < min_value:
min_value = value
return min_value
def max_of_values(values):
max_value = -1000000
for value in values:
if value > max_value:
max_value = value
return max_value
def avg_of_values(values):
sum = 0
for value in values:
sum += value
avg = sum / len(values)
return avg
def get_col_values(target_rows, col_name):
value_list = []
for row in target_rows:
value = float(row[col_name])
value_list.append(value)
return value_list
for header in headers:
col_values = get_col_values(rows, header)
min_value = min_of_values(col_values)
max_value = max_of_values(col_values)
avg_value = avg_of_values(col_values)
print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
For longitude, min is -124.18, max is -114.49, avg is -119.58920000000029 For latitude, min is 32.56, max is 41.92, avg is 35.63538999999999 For housing_median_age, min is 1.0, max is 52.0, avg is 28.845333333333333 For total_rooms, min is 6.0, max is 30450.0, avg is 2599.578666666667 For total_bedrooms, min is 2.0, max is 5419.0, avg is 529.9506666666666 For population, min is 5.0, max is 11935.0, avg is 1402.7986666666666 For households, min is 2.0, max is 4930.0, avg is 489.912 For median_income, min is 0.4999, max is 15.0001, avg is 3.807271799999998 For median_house_value, min is 22500.0, max is 500001.0, avg is 205846.275
In [40]:
Copied!
# Just for houses w/ median_income between 100,000 and 200,000
rows_between_100_and_200 = []
for row in rows:
row_median_income = float(row["median_income"])
if row_median_income >= 10 and row_median_income < 20:
rows_between_100_and_200.append(row)
for header in headers:
col_values = get_col_values(rows_between_100_and_200, header)
min_value = min_of_values(col_values)
max_value = max_of_values(col_values)
avg_value = avg_of_values(col_values)
print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
# Just for houses w/ median_income between 100,000 and 200,000
rows_between_100_and_200 = []
for row in rows:
row_median_income = float(row["median_income"])
if row_median_income >= 10 and row_median_income < 20:
rows_between_100_and_200.append(row)
for header in headers:
col_values = get_col_values(rows_between_100_and_200, header)
min_value = min_of_values(col_values)
max_value = max_of_values(col_values)
avg_value = avg_of_values(col_values)
print(f"For {header}, min is {min_value}, max is {max_value}, avg is {avg_value}")
For longitude, min is -122.47, max is -117.18, avg is -119.85999999999997 For latitude, min is 32.83, max is 37.9, avg is 35.446250000000006 For housing_median_age, min is 2.0, max is 52.0, avg is 29.075 For total_rooms, min is 126.0, max is 10111.0, avg is 3095.05 For total_bedrooms, min is 20.0, max is 1295.0, avg is 395.95 For population, min is 25.0, max is 3599.0, avg is 1077.5 For households, min is 9.0, max is 1257.0, avg is 380.45 For median_income, min is 10.0549, max is 15.0001, avg is 11.942279999999995 For median_house_value, min is 225000.0, max is 500001.0, avg is 485230.875
In [ ]:
Copied!