Visualization¶
For this exercise, you'll load a dataset and try visualizing it using different plots from Seaborn.
# First, let's load the dataset into a dataframe, df
from google.colab import drive
drive.mount('/content/gdrive')
import pandas as pd
# Choose one of the following datasets to load into a dataframe
#dataset = "pokemon.csv"
#dataset = "fixed_most_streamed_spotify_songs_2024.csv"
#dataset = "nba_stats_2023_2024.csv"
dataset = "star_wars_character_dataset.csv"
df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
df.head()
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
| | name | height | mass | hair_color | skin_color | eye_color | birth_year | sex | gender | homeworld | species | films | vehicles | starships |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Luke Skywalker | 172.0 | 77.0 | blond | fair | blue | 19.0 | male | masculine | Tatooine | Human | The Empire Strikes Back, Revenge of the Sith, ... | Snowspeeder, Imperial Speeder Bike | X-wing, Imperial shuttle |
1 | C-3PO | 167.0 | 75.0 | NaN | gold | yellow | 112.0 | none | masculine | Tatooine | Droid | The Empire Strikes Back, Attack of the Clones,... | NaN | NaN |
2 | R2-D2 | 96.0 | 32.0 | NaN | white, blue | red | 33.0 | none | masculine | Naboo | Droid | The Empire Strikes Back, Attack of the Clones,... | NaN | NaN |
3 | Darth Vader | 202.0 | 136.0 | none | white | yellow | 41.9 | male | masculine | Tatooine | Human | The Empire Strikes Back, Revenge of the Sith, ... | NaN | TIE Advanced x1 |
4 | Leia Organa | 150.0 | 49.0 | brown | light | brown | 19.0 | female | feminine | Alderaan | Human | The Empire Strikes Back, Revenge of the Sith, ... | Imperial Speeder Bike | NaN |
Line Plots¶
Line plots show how a dependent variable (on the y-axis) changes with an independent variable (on the x-axis) by connecting the data points with a line.
My example from the slides was:
# Line plot of Sp. Atk over generations
plt.figure(figsize=(12, 6))
sns.lineplot(x='Generation', y='Sp. Atk', data=df)
plt.title('Sp. Atk Over Generations')
plt.xlabel('Generation')
plt.ylabel('Sp. Atk')
plt.xticks(rotation=45)
plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# This sets the size of the figure
plt.figure(figsize=(12, 6))
# Set x and y to strings of the columns you want to plot
sns.lineplot(x="birth_year", y="height", data=df[df["birth_year"] < 100])
# Use these to set the title and axis labels
plt.title('Height vs. Age')
plt.xlabel('age')
plt.ylabel('height')
# Show the plot (this is like the print() function but for plots)
plt.show()
Bar Plots¶
Bar plots are used to compare the relative amounts of different categories.
The code from my example was:
# Bar plot of the number of pokemon per Type 1
plt.figure(figsize=(12, 6))
sns.countplot(x='Type 1', data=df)
plt.title('Number of Pokemon per Type')
plt.xlabel('Type 1')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(12, 6))
sns.countplot(x="species", data=df)
plt.xticks(rotation=45)
plt.show()
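With many categories, a count plot can be easier to read when the bars are sorted from most to least frequent. Here is a minimal sketch of one way to do that, using the `order` parameter of `sns.countplot` together with `value_counts()`. The toy dataframe is made up for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up toy data standing in for the species column
toy = pd.DataFrame({
    "species": ["Human", "Droid", "Human", "Wookiee", "Human", "Droid"],
})

# value_counts() sorts from most to least frequent, so its index gives the bar order
order = toy["species"].value_counts().index

plt.figure(figsize=(6, 3))
sns.countplot(x="species", data=toy, order=order)
plt.xticks(rotation=45)
plt.show()
```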
Scatter Plots¶
Scatter plots are used to explore the relationship between two variables that may not have a direct linear relationship.
The code from my example was:
# Scatter plot of YouTube views vs Spotify streams
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Spotify Streams', y='YouTube Views', data=df)
plt.title('YouTube Views vs Spotify Streams')
plt.xlabel('Spotify Streams')
plt.ylabel('YouTube Views')
plt.show()
plt.figure(figsize=(12, 6))
sns.scatterplot(x='mass', y='height', hue="species", data=df)
plt.show()
# "density" here is simply mass divided by height, not a physical density
df["density"] = df["mass"] / df["height"]
df.sort_values(by="density", ascending=False).head()
| | name | height | mass | hair_color | skin_color | eye_color | birth_year | sex | gender | homeworld | species | films | vehicles | starships | density |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | Jabba Desilijic Tiure | 175.0 | 1358.0 | NaN | green-tan, brown | orange | 600.0 | hermaphroditic | masculine | Nal Hutta | Hutt | The Phantom Menace, Return of the Jedi, A New ... | NaN | NaN | 7.760000 |
76 | Grievous | 216.0 | 159.0 | none | brown, white | green, yellow | NaN | male | masculine | Kalee | Kaleesh | Revenge of the Sith | Tsmeu-6 personal wheel bike | Belbullab-22 starfighter | 0.736111 |
21 | IG-88 | 200.0 | 140.0 | none | metal | red | 15.0 | none | masculine | NaN | Droid | The Empire Strikes Back | NaN | NaN | 0.700000 |
5 | Owen Lars | 178.0 | 120.0 | brown, grey | light | blue | 52.0 | male | masculine | Tatooine | Human | Attack of the Clones, Revenge of the Sith, A N... | NaN | NaN | 0.674157 |
3 | Darth Vader | 202.0 | 136.0 | none | white | yellow | 41.9 | male | masculine | Tatooine | Human | The Empire Strikes Back, Revenge of the Sith, ... | NaN | TIE Advanced x1 | 0.673267 |
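The table above shows one row (Jabba, with a mass of 1358) far outside the range of the others, which squashes every other point into a corner of the scatter plot. One possible way to handle this is to filter that row out before plotting. The sketch below rebuilds a few rows from the tables above; the cutoff of 1000 is an assumption chosen for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# A few rows copied from the tables above (name, height, and mass only)
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "Darth Vader", "Jabba Desilijic Tiure", "R2-D2"],
    "height": [172.0, 202.0, 175.0, 96.0],
    "mass": [77.0, 136.0, 1358.0, 32.0],
})

# Assumed cutoff: treat any mass of 1000 or more as an outlier and drop it
trimmed = toy[toy["mass"] < 1000]

plt.figure(figsize=(6, 3))
sns.scatterplot(x="mass", y="height", data=trimmed)
plt.show()
```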
Histograms¶
Histograms are used to understand the distribution of a single variable.
The code from my example was:
# Histogram of HP
plt.figure(figsize=(12, 6))
sns.histplot(df['HP'], bins=30, kde=True)
plt.title('Distribution of HP')
plt.xlabel('HP')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(12, 6))
sns.histplot(df['height'], bins=30, kde=True)
plt.show()
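To see what a histogram is counting, it can help to do the binning by hand. This is a rough sketch using `np.histogram` on the five heights shown in the `df.head()` output; the choice of 3 bins is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

# The five heights from the df.head() output earlier in the notebook
heights = pd.Series([172.0, 167.0, 96.0, 202.0, 150.0])

# Split the range from min to max into 3 equal-width bins and count values in each
counts, edges = np.histogram(heights, bins=3)

print("bin edges:", edges)
print("counts per bin:", counts)
```

Each bar in a histogram corresponds to one of these counts, so changing `bins` changes how coarse or fine the picture is.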
Box Plots¶
Box plots are used to display the distribution of data based on a five-number summary: the minimum, first quartile, median, third quartile, and maximum.
The code from my example was:
# Box plot of Defense by Pokémon type
plt.figure(figsize=(12, 6))
sns.boxplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()
# Violin plot of height by species (commented out)
# plt.figure(figsize=(12, 6))
# sns.violinplot(x='species', y='height', data=df)
# plt.title('Height by Species')
# plt.xlabel('Species')
# plt.ylabel('Height')
# plt.xticks(rotation=45)
# plt.show()
# Violin plot of Defense by Pokémon type
df = pd.read_csv('/content/gdrive/My Drive/datasets/pokemon.csv')
plt.figure(figsize=(12, 6))
sns.violinplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()
# Box plot of Defense by Pokémon type
df = pd.read_csv('/content/gdrive/My Drive/datasets/pokemon.csv')
plt.figure(figsize=(12, 6))
sns.boxplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()
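The five-number summary that a box plot is built on can be computed directly with pandas. Below is a minimal sketch using the five heights from the `df.head()` output earlier in the notebook. It also computes the lower "fence" at 1.5 times the interquartile range below the first quartile, which, as far as I know, is the default rule seaborn uses to decide which points get drawn as individual outliers.

```python
import pandas as pd

# The five heights from the df.head() output earlier in the notebook
heights = pd.Series([172.0, 167.0, 96.0, 202.0, 150.0])

# Minimum, first quartile, median, third quartile, maximum
summary = heights.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Interquartile range: the height of the box
iqr = summary.loc[0.75] - summary.loc[0.25]

# Points below this value would be drawn as separate outlier markers
lower_fence = summary.loc[0.25] - 1.5 * iqr
print("lower fence:", lower_fence)
```

In this small sample, R2-D2's height of 96 falls below the lower fence, so a box plot would show it as an outlier point rather than as the end of the whisker.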
Heatmaps¶
Heatmaps are used to visualize matrix-like data, such as counts or correlations between pairs of variables.
The code from my example was:
# Heatmap of count of pokemon that share types
# Assign the result back instead of using inplace=True on a column selection
df['Type 2'] = df['Type 2'].fillna(df['Type 1'])
type_counts = df.groupby(['Type 1', 'Type 2']).size().unstack()
plt.figure(figsize=(12, 6))
sns.heatmap(type_counts, cmap='coolwarm', annot=True)
plt.title('Count of Pokemon With Type 1 and Type 2')
plt.xlabel('Type 2')
plt.ylabel('Type 1')
plt.show()
dataset = "star_wars_character_dataset.csv"
df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
plt.figure(figsize=(12, 6))
counts = df.groupby(['species', 'homeworld']).size().unstack()
sns.heatmap(counts, cmap="coolwarm", annot=True)
plt.show()
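With `groupby(...).size().unstack()`, combinations that never occur come out as `NaN`, which leaves blank cells in the heatmap. One alternative worth trying is `pd.crosstab`, which fills those combinations with 0. The sketch below uses the species and homeworld values of the five characters from the `df.head()` output, so the numbers are only a tiny sample.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Species and homeworld of the five characters shown in df.head()
toy = pd.DataFrame({
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "homeworld": ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan"],
})

# Rows are species, columns are homeworlds, cells are counts (0 where no match)
counts = pd.crosstab(toy["species"], toy["homeworld"])
print(counts)

plt.figure(figsize=(6, 3))
# fmt="d" prints the annotations as whole numbers
sns.heatmap(counts, cmap="coolwarm", annot=True, fmt="d")
plt.show()
```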
Pick Your Own¶
Seaborn has a gallery of different visualizations: https://seaborn.pydata.org/examples/index.html
Pick one, look at the code, and attempt to use the visualization for your dataset below.