Visualization¶

For this exercise, you'll load a dataset and try visualizing it using different plots from Seaborn.

In [11]:

# First, let's load the dataset into a dataframe, df
from google.colab import drive
drive.mount('/content/gdrive')

import pandas as pd

# Choose one of the following datasets to load into a dataframe
#dataset = "pokemon.csv"
#dataset = "fixed_most_streamed_spotify_songs_2024.csv"
#dataset = "nba_stats_2023_2024.csv"
dataset = "star_wars_character_dataset.csv"

df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
df.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

Out[11]:

	name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species	films	vehicles	starships
0	Luke Skywalker	172.0	77.0	blond	fair	blue	19.0	male	masculine	Tatooine	Human	The Empire Strikes Back, Revenge of the Sith, ...	Snowspeeder, Imperial Speeder Bike	X-wing, Imperial shuttle
1	C-3PO	167.0	75.0	NaN	gold	yellow	112.0	none	masculine	Tatooine	Droid	The Empire Strikes Back, Attack of the Clones,...	NaN	NaN
2	R2-D2	96.0	32.0	NaN	white, blue	red	33.0	none	masculine	Naboo	Droid	The Empire Strikes Back, Attack of the Clones,...	NaN	NaN
3	Darth Vader	202.0	136.0	none	white	yellow	41.9	male	masculine	Tatooine	Human	The Empire Strikes Back, Revenge of the Sith, ...	NaN	TIE Advanced x1
4	Leia Organa	150.0	49.0	brown	light	brown	19.0	female	feminine	Alderaan	Human	The Empire Strikes Back, Revenge of the Sith, ...	Imperial Speeder Bike	NaN
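
The preview already shows NaN in several columns, so it can help to check how much data is missing before plotting. The following is a minimal sketch, assuming a small stand-in frame named sample (a hypothetical name) built from a few columns of the five rows shown above. On the real data the same calls would be made on df.

```python
import numpy as np
import pandas as pd

# Stand-in for df: a few columns copied from the five preview rows above
sample = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
    "mass": [77.0, 75.0, 32.0, 136.0, 49.0],
    "hair_color": ["blond", np.nan, np.nan, "none", "brown"],
})

# Count missing values per column; the string "none" is not counted as missing
missing = sample.isna().sum()
print(missing)

# Column types hint at which columns suit numeric plots and which are categories
print(sample.dtypes)
```

Columns with many missing values will contribute fewer points to a plot than the row count suggests.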

Line Plots¶

Line plots show how a dependent variable (on the y-axis) changes with an independent variable (on the x-axis) by connecting the data points with a line.

My example from the slides was:

# Line plot of Sp. Atk over generations
plt.figure(figsize=(12, 6))
sns.lineplot(x='Generation', y='Sp. Atk', data=df)
plt.title('Sp. Atk Over Generations')
plt.xlabel('Generation')
plt.ylabel('Sp. Atk')
plt.xticks(rotation=45)
plt.show()

In [13]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# This sets the size of the figure
plt.figure(figsize=(12, 6))

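# Set x and y to strings of the columns you want to plot.
# Rows with birth_year of 100 or more are filtered out here, which keeps
# extreme values from stretching the x-axis.
sns.lineplot(x="birth_year", y="height", data=df[df["birth_year"] < 100])

# Use these to set the title and labels
plt.title('Height vs. Birth Year')
plt.xlabel('Birth year')
plt.ylabel('Height')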

# Show the plot (this is like the print() function but for plots)
plt.show()

Bar Plots¶

Bar plots are used to compare the relative amounts of different categories.

The code from my example was:

# Bar plot of the number of pokemon per Type 1
plt.figure(figsize=(12, 6))
sns.countplot(x='Type 1', data=df)
plt.title('Number of Pokemon per Type')
plt.xlabel('Type 1')
plt.ylabel('Count')
plt.show()

In [17]:

plt.figure(figsize=(12, 6))
sns.countplot(x="species", data=df)
plt.xticks(rotation=45)
plt.show()
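
The bars drawn by sns.countplot are row counts per category, which pandas can also compute directly with value_counts. The sketch below, assuming a hypothetical stand-in frame named sample that holds the species of the five preview rows, prints those counts and passes their order to countplot so the bars appear from most to least common.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for df: species of the five rows shown in the preview
sample = pd.DataFrame({"species": ["Human", "Droid", "Droid", "Human", "Human"]})

# value_counts returns the counts sorted from most to least frequent
counts = sample["species"].value_counts()
print(counts)

# Reuse that ordering so the tallest bar comes first
plt.figure(figsize=(12, 6))
sns.countplot(x="species", data=sample, order=counts.index)
plt.xticks(rotation=45)
plt.show()
```

On a column with many categories, slicing the counts first (for example counts.head(10).index) may keep the x-axis readable.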

Scatter Plots¶

Scatter plots are used to explore the relationship between two variables, including relationships that are not simply linear.

The code from my example was:

# Scatter plot of YouTube views vs Spotify streams
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Spotify Streams', y='YouTube Views', data=df)
plt.title('YouTube Views vs Spotify Streams')
plt.xlabel('Spotify Streams')
plt.ylabel('YouTube Views')
plt.show()

In [32]:

plt.figure(figsize=(12, 6))
sns.scatterplot(x='mass', y='height', hue="species", data=df)
plt.show()

# Mass per unit of height (a rough "density" proxy), sorted to surface outliers
df["density"] = df["mass"] / df["height"]
df.sort_values(by="density", ascending=False).head()

Out[32]:

	name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species	films	vehicles	starships	density
15	Jabba Desilijic Tiure	175.0	1358.0	NaN	green-tan, brown	orange	600.0	hermaphroditic	masculine	Nal Hutta	Hutt	The Phantom Menace, Return of the Jedi, A New ...	NaN	NaN	7.760000
76	Grievous	216.0	159.0	none	brown, white	green, yellow	NaN	male	masculine	Kalee	Kaleesh	Revenge of the Sith	Tsmeu-6 personal wheel bike	Belbullab-22 starfighter	0.736111
21	IG-88	200.0	140.0	none	metal	red	15.0	none	masculine	NaN	Droid	The Empire Strikes Back	NaN	NaN	0.700000
5	Owen Lars	178.0	120.0	brown, grey	light	blue	52.0	male	masculine	Tatooine	Human	Attack of the Clones, Revenge of the Sith, A N...	NaN	NaN	0.674157
3	Darth Vader	202.0	136.0	none	white	yellow	41.9	male	masculine	Tatooine	Human	The Empire Strikes Back, Revenge of the Sith, ...	NaN	TIE Advanced x1	0.673267
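
The sorted table shows one character whose mass is far above everyone else's, and a point like that can squash the rest of a scatter plot into a corner. One possible way to handle it is sketched below, assuming a hypothetical stand-in frame named sample built from the five rows above and a hypothetical cutoff MASS_CUTOFF chosen only for illustration.

```python
import pandas as pd

# Stand-in for df: name, height and mass of the five rows shown above
sample = pd.DataFrame({
    "name": ["Jabba Desilijic Tiure", "Grievous", "IG-88", "Owen Lars", "Darth Vader"],
    "height": [175.0, 216.0, 200.0, 178.0, 202.0],
    "mass": [1358.0, 159.0, 140.0, 120.0, 136.0],
})

# Same mass-per-height ratio as in the cell above
sample["density"] = sample["mass"] / sample["height"]
heaviest = sample.loc[sample["density"].idxmax(), "name"]
print("Largest ratio:", heaviest)

# Hypothetical cutoff: drop rows above it before plotting or summarizing
MASS_CUTOFF = 1000
trimmed = sample[sample["mass"] < MASS_CUTOFF]

# Compare the mass/height correlation with and without the extreme row
print(sample["mass"].corr(sample["height"]))
print(trimmed["mass"].corr(trimmed["height"]))
```

Passing trimmed instead of the full frame to sns.scatterplot would then spread the remaining points across the axes. Whether dropping a row is appropriate depends on the question being asked.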

Histograms¶

Histograms are used to understand the distribution of a single variable.

The code from my example was:

# Histogram of HP
plt.figure(figsize=(12, 6))
sns.histplot(df['HP'], bins=30, kde=True)
plt.title('Distribution of HP')
plt.xlabel('HP')
plt.ylabel('Frequency')
plt.show()

In [25]:

plt.figure(figsize=(12, 6))
sns.histplot(df['height'], bins=30, kde=True)
plt.show()
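
With an integer bins argument, the histogram splits the range of the data into that many equal-width intervals and counts the values falling in each one. The sketch below shows that counting step with numpy.histogram on the five heights from the preview rows, using 3 bins instead of 30 so the result is easy to read. The array name heights is an assumption made for this illustration.

```python
import numpy as np

# Heights of the five preview rows
heights = np.array([172.0, 167.0, 96.0, 202.0, 150.0])

# Split the range [min, max] into 3 equal-width bins and count values per bin
counts, edges = np.histogram(heights, bins=3)

# Print each bin's interval next to its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

Setting kde=True in sns.histplot additionally overlays a smoothed estimate of the distribution on top of the bars.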

Box Plots¶

Box plots are used to display the distribution of data based on a five-number summary: the minimum, first quartile, median, third quartile, and maximum.

The code from my example was:

# Box plot of Defense by Pokémon type
plt.figure(figsize=(12, 6))
sns.boxplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()
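
The numbers behind a box plot can be computed directly with pandas. The sketch below works out the five-number summary for the five heights from the preview rows. It also applies the 1.5 × IQR rule, which is the default whisker setting in sns.boxplot, to flag points that would be drawn individually. The variable names are assumptions made for this illustration.

```python
import pandas as pd

# Heights of the five preview rows
heights = pd.Series([172.0, 167.0, 96.0, 202.0, 150.0])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
summary = heights.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Interquartile range and the fences at 1.5 * IQR beyond the quartiles
q1, q3 = summary.loc[0.25], summary.loc[0.75]
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences would be drawn as individual outlier points
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(outliers.tolist())
```

The box spans the first to third quartile, the line inside it marks the median, and the whiskers reach the most extreme values that still lie inside the fences.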

In [29]:

# Violin plot of height by species (Star Wars dataset), left commented out
# plt.figure(figsize=(12, 6))
# sns.violinplot(x='species', y='height', data=df)
# plt.title('Height by Species')
# plt.xlabel('Species')
# plt.ylabel('Height')
# plt.xticks(rotation=45)
# plt.show()

# Load the Pokémon dataset once; note that this replaces the Star Wars df
df = pd.read_csv('/content/gdrive/My Drive/datasets/pokemon.csv')

# Violin plot of Defense by Pokémon type
plt.figure(figsize=(12, 6))
sns.violinplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()

# Box plot of Defense by Pokémon type, for comparison with the violin plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='Type 1', y='Defense', data=df)
plt.title('Defense by Pokémon Type')
plt.xlabel('Type 1')
plt.ylabel('Defense')
plt.xticks(rotation=45)
plt.show()

Heatmaps¶

Heatmaps are used to visualize matrix-like data, such as correlations or counts for pairs of variables, by mapping each cell's value to a color.

The code from my example was:

# Heatmap of count of pokemon that share types
# Pokémon with no second type count as sharing their first type with itself
df['Type 2'] = df['Type 2'].fillna(df['Type 1'])
type_counts = df.groupby(['Type 1', 'Type 2']).size().unstack()
plt.figure(figsize=(12, 6))
sns.heatmap(type_counts, cmap='coolwarm', annot=True)
plt.title('Count of Pokemon With Type 1 and Type 2')
plt.xlabel('Type 2')
plt.ylabel('Type 1')
plt.show()

In [31]:

dataset = "star_wars_character_dataset.csv"
df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')

plt.figure(figsize=(12, 6))
counts = df.groupby(['species', 'homeworld']).size().unstack()
sns.heatmap(counts, cmap="coolwarm", annot=True)
plt.show()
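
The groupby, size, unstack chain leaves NaN in every cell whose combination never occurs, and those cells show up blank in the heatmap. Two ways to get explicit zeros instead are sketched below, assuming a hypothetical stand-in frame named sample built from the species and homeworld of the five preview rows.

```python
import pandas as pd

# Stand-in for df: species and homeworld of the five preview rows
sample = pd.DataFrame({
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "homeworld": ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan"],
})

# Same chain as in the cell above: absent combinations become NaN
with_gaps = sample.groupby(["species", "homeworld"]).size().unstack()
print(with_gaps)

# Option 1: ask unstack to fill absent combinations with 0
filled = sample.groupby(["species", "homeworld"]).size().unstack(fill_value=0)
print(filled)

# Option 2: pd.crosstab builds the same count table in one call
table = pd.crosstab(sample["species"], sample["homeworld"])
print(table)
```

Either filled or table can be passed to sns.heatmap in place of counts. Whether zeros or blanks read better depends on how sparse the table is.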

Pick Your Own¶

Seaborn has a gallery of different visualizations: https://seaborn.pydata.org/examples/index.html

Pick one, look at the code, and attempt to use the visualization for your dataset below.

In [ ]: