Descibing a Dataset¶

In this exercise we're going to use the tequniques we just learned to explore a dataset.

In [1]:

Copied!





# First, let's load the dataset into a dataframe, df
from google.colab import drive
drive.mount('/content/gdrive')

import pandas as pd

# Choose one of the following datasets to load into a dataframe
#dataset = "pokemon.csv"
dataset = "fixed_most_streamed_spotify_songs_2024.csv"
#dataset = "nba_stats_2023_2024.csv"
#dataset = "star_wars_character_dataset.csv"

df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
df.head()
# First, let's load the dataset into a dataframe, df
from google.colab import drive
drive.mount('/content/gdrive')

import pandas as pd

# Choose one of the following datasets to load into a dataframe
#dataset = "pokemon.csv"
dataset = "fixed_most_streamed_spotify_songs_2024.csv"
#dataset = "nba_stats_2023_2024.csv"
#dataset = "star_wars_character_dataset.csv"

df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
df.head()

Mounted at /content/gdrive

Out[1]:

	All Time Rank	Track	Album Name	Artist	Release Date	ISRC	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	SiriusXM Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Soundcloud Streams	Shazam Counts	TIDAL Popularity	Explicit Track
0	1	MILLION DOLLAR BABY	Million Dollar Baby - Single	Tommy Richman	2024-04-26	QM24S2402528	725.4	3.904709e+08	30716.0	196631588.0	...	684.0	62.0	17598718.0	114.0	18004655.0	22,931	4818457.0	2669262.0	NaN	0
1	2	Not Like Us	Not Like Us	Kendrick Lamar	2024-05-04	USUG12400910	545.9	3.237039e+08	28113.0	174597137.0	...	3.0	67.0	10422430.0	111.0	7780028.0	28,444	6623075.0	1118279.0	NaN	1
2	3	i like the way you kiss me	I like the way you kiss me	Artemas	2024-03-19	QZJ842400387	538.4	6.013093e+08	54331.0	211607669.0	...	536.0	136.0	36321847.0	172.0	5022621.0	5,639	7208651.0	5285340.0	NaN	0
3	4	Flowers	Flowers - Single	Miley Cyrus	2023-01-12	USSM12209777	444.9	2.031281e+09	269802.0	136569078.0	...	2182.0	264.0	24684248.0	210.0	190260277.0	203,384	NaN	11822942.0	NaN	0
4	5	Houdini	Houdini	Eminem	2024-05-31	USUG12403398	423.3	1.070349e+08	7223.0	151469874.0	...	1.0	82.0	17660624.0	105.0	4493884.0	7,006	207179.0	457017.0	NaN	1

5 rows × 29 columns

Why Kind of Data is it?¶

Run: df.info() and df.head() to look at the column types and a sample of the data.

For each column, ask yourself if the column is:

Nominal: Categories (e.g., colors)
Ordinal: Ordered categories (e.g., ratings)
Interval: Numeric but no true zero (e.g., temperature in °C)
Ratio/Metric: Numeric w/ true zero (e.g., height, weight)

In [3]:

Copied!

df["Explicit Track"] = df["Explicit Track"].astype("bool")

print(df.info())
df.head()
df["Explicit Track"] = df["Explicit Track"].astype("bool")

print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   All Time Rank               4600 non-null   int64  
 1   Track                       4600 non-null   object 
 2   Album Name                  4600 non-null   object 
 3   Artist                      4595 non-null   object 
 4   Release Date                4600 non-null   object 
 5   ISRC                        4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   float64
 8   Spotify Playlist Count      4530 non-null   float64
 9   Spotify Playlist Reach      4528 non-null   float64
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   float64
 12  YouTube Likes               4285 non-null   float64
 13  TikTok Posts                3427 non-null   float64
 14  TikTok Likes                3620 non-null   float64
 15  TikTok Views                3619 non-null   float64
 16  YouTube Playlist Reach      3591 non-null   float64
 17  Apple Music Playlist Count  4039 non-null   float64
 18  AirPlay Spins               4102 non-null   float64
 19  SiriusXM Spins              2477 non-null   float64
 20  Deezer Playlist Count       3679 non-null   float64
 21  Deezer Playlist Reach       3672 non-null   float64
 22  Amazon Playlist Count       3545 non-null   float64
 23  Pandora Streams             3494 non-null   float64
 24  Pandora Track Stations      3332 non-null   object 
 25  Soundcloud Streams          1267 non-null   float64
 26  Shazam Counts               4023 non-null   float64
 27  TIDAL Popularity            0 non-null      float64
 28  Explicit Track              4600 non-null   bool   
dtypes: bool(1), float64(21), int64(1), object(6)
memory usage: 1010.9+ KB
None

Out[3]:

	All Time Rank	Track	Album Name	Artist	Release Date	ISRC	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	SiriusXM Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Soundcloud Streams	Shazam Counts	TIDAL Popularity	Explicit Track
0	1	MILLION DOLLAR BABY	Million Dollar Baby - Single	Tommy Richman	2024-04-26	QM24S2402528	725.4	3.904709e+08	30716.0	196631588.0	...	684.0	62.0	17598718.0	114.0	18004655.0	22,931	4818457.0	2669262.0	NaN	False
1	2	Not Like Us	Not Like Us	Kendrick Lamar	2024-05-04	USUG12400910	545.9	3.237039e+08	28113.0	174597137.0	...	3.0	67.0	10422430.0	111.0	7780028.0	28,444	6623075.0	1118279.0	NaN	True
2	3	i like the way you kiss me	I like the way you kiss me	Artemas	2024-03-19	QZJ842400387	538.4	6.013093e+08	54331.0	211607669.0	...	536.0	136.0	36321847.0	172.0	5022621.0	5,639	7208651.0	5285340.0	NaN	False
3	4	Flowers	Flowers - Single	Miley Cyrus	2023-01-12	USSM12209777	444.9	2.031281e+09	269802.0	136569078.0	...	2182.0	264.0	24684248.0	210.0	190260277.0	203,384	NaN	11822942.0	NaN	False
4	5	Houdini	Houdini	Eminem	2024-05-31	USUG12403398	423.3	1.070349e+08	7223.0	151469874.0	...	1.0	82.0	17660624.0	105.0	4493884.0	7,006	207179.0	457017.0	NaN	True

5 rows × 29 columns

Central Tendancy¶

Choose a few metric columns to dig into. For these, calculate:

Mean: df[col_name].mean()
Median: df[col_name].median()
Mode: df[col_name].mode()

Ask yourself:

How do these compare to one another?
What do their similarities or differences say about the data?

In [12]:

Copied!





s_mean = df["Spotify Streams"].mean()
s_median = df["Spotify Streams"].median()
s_mode = df["Spotify Streams"].mode()
print(s_mean, s_median, s_mode)

df["Spotify Streams"].plot(kind="hist", x="All Time Rank", y="Spotify Streams", bins=100)
s_mean = df["Spotify Streams"].mean()
s_median = df["Spotify Streams"].median()
s_mode = df["Spotify Streams"].mode()
print(s_mean, s_median, s_mode)

df["Spotify Streams"].plot(kind="hist", x="All Time Rank", y="Spotify Streams", bins=100)

447387314.7459327 239850720.0 0    1.655575e+09
Name: Spotify Streams, dtype: float64

Out[12]:

<Axes: ylabel='Frequency'>

No description has been provided for this image

Measures of Dispersion¶

For the same metric columns calculate:

Range: df[col_name].max() - df[col_name].min()
Std Deviation: df[col_name].std()
Interquartile Range (IQR): df[col_name].quantile(0.75) - df[col_name].quantile(0.25)

How do these differe between columns? What does it say about your dataset?

In [16]:

Copied!





print(df["Spotify Streams"].max())
print(df["Spotify Streams"].min())
print(df["Spotify Streams"].std())
print(df["Spotify Streams"].quantile(0.75))
print(df["Spotify Streams"].quantile(0.25))
print(df["Spotify Streams"].max())
print(df["Spotify Streams"].min())
print(df["Spotify Streams"].std())
print(df["Spotify Streams"].quantile(0.75))
print(df["Spotify Streams"].quantile(0.25))

4281468720.0
1071.0
538443905.2812704
628363805.0
70386298.5

Percentiles and Quartiles¶

For the same metric columns calculate:

5th Percentile: df[col_name].quantile(0.05)
25th Percentile: df[col_name].quantile(0.25)
50th Percentile: df[col_name].quantile(0.50)
75th Percentile: df[col_name].quantile(0.75)
95th Percentile: df[col_name].quantile(0.95)

What do these say about the column?

In [ ]:

Outliers by Sorting the Data¶

The easiest way to start looking for outliers is to just sort the data by some column and look at what's at the top.

df = df.sort_values(col_name, ascending=False)

The ascending=False argument to the sort_values method tells the sort to put the largest items first.

Outliers using Quartiles¶

Try and find some outliers in your dataset based on one of the columns.

One way to do this is using the quartiles. Generally $x$ is considered an outlier if one of:

$x < Q_1 - 1.5 * IQR$
$x > Q_3 + 1.5 * IQR$

To calculate this and filter the pandas dataframe this way in python do:

q1 = df[col_name].quantile(0.25)
q3 = df[col_name].quantile(0.75)
iqr = q3 - q1
df[df[col_name] > q3 + iqr * 1.5]

There's a good chance this will return an empty dataframe. If it does, try making the 1.5 smaller (like 1.25 or 1) or choose a different col

In [30]:

Copied!

(pd.Timestamp.now() - pd.to_datetime(df["Release Date"])).apply(lambda x: pd.Timedelta(x).days).plot(kind="hist")
(pd.Timestamp.now() - pd.to_datetime(df["Release Date"])).apply(lambda x: pd.Timedelta(x).days).plot(kind="hist")

Out[30]:

<Axes: ylabel='Frequency'>

Outliers Using Stddev¶

We can also look at the the z-score (how many standard deviations above or below the mean it is) and then order the dataset by this.

z_score = (df[col_name] - mean) / stddev
df["Z Score"] = z_score
df.sort_values("Z Score", ascending=False)

In [27]:

Copied!





streams_mean = df["Spotify Streams"].mean()
streams_std = df["Spotify Streams"].std()
z_score_srs = (df["Spotify Streams"] - streams_mean) / streams_std
df["Z Score"] = z_score_srs

df.sort_values("Z Score", ascending=False)
streams_mean = df["Spotify Streams"].mean()
streams_std = df["Spotify Streams"].std()
z_score_srs = (df["Spotify Streams"] - streams_mean) / streams_std
df["Z Score"] = z_score_srs

df.sort_values("Z Score", ascending=False)

Out[27]:

	All Time Rank	Track	Album Name	Artist	Release Date	ISRC	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Soundcloud Streams	Shazam Counts	TIDAL Popularity	Explicit Track	Z Score
55	56	Blinding Lights	Blinding Lights	The Weeknd	2019-11-29	USUG11904206	194.0	4.281469e+09	590392.0	165544011.0	...	564.0	11701129.0	184.0	489299762.0	319,634	28659471.0	28994660.0	NaN	False	7.120670
4254	4229	Blinding Lights	Blinding Lights	xSyborg	2021-03-13	QZLBC2302779	20.5	4.261328e+09	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	False	7.083265
53	54	Shape of You	Shape of You	Ed Sheeran	2017-01-06	GBAHS1600463	195.1	3.909459e+09	316378.0	148037882.0	...	512.0	2710931.0	115.0	733565655.0	651,557	78149375.0	1146246.0	NaN	False	6.429772
1637	1807	Shape of You	Shape of You	xSyborg	2021-02-20	QZLBC2302789	36.3	3.888356e+09	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	False	6.390580
132	133	Someone You Loved	Breach	Lewis Capaldi	2018-11-08	DEUM71807062	129.7	3.427499e+09	338763.0	160553919.0	...	502.0	4834088.0	110.0	427212257.0	357,540	NaN	29266297.0	NaN	False	5.534674
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4534	4506	Mast Malang Jhoom (From "Bade Miyan Chote Miya...	Mast Malang Jhoom (From "Bade Miyan Chote Miya...	Vishal Mishra	2024-03-01	INZ031414619	19.6	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	False	NaN
4541	4498	Hood By Air Freestyle	Hood By Air Freestyle - Single	Baby Flaco	2024-04-26	QZLMG2000292	19.6	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1526440.0	NaN	False	NaN
4560	4552	Supera	Sertanejo Para Danï¿½ï	Marï¿½ï¿½lia Mendo	2021-03-05	FR0Z50062905	19.5	NaN	2.0	4.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	False	NaN
4572	4536	TV Shapes	TV Shapes	Bubbatrees	2022-08-19	UKSWX1900031	19.5	NaN	3.0	9.0	...	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	False	NaN
4578	4550	Freezing	Freezing - EP	purpl	2023-02-02	FR10S2303559	19.5	NaN	8.0	565.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	False	NaN

4600 rows × 30 columns

Correlation and Covariance¶

Consder two columns you think might be related and calculate:

Correlation: df[col_name_1].corr(df[col_name_2])
Covariance: df[col_name_1].corr(df[col_name_2])

In [35]:

Copied!

#df["Spotify Streams"].corr(df["Spotify Playlist Count"])
df["Apple Music Playlist Count"].corr(df["Spotify Playlist Count"])
#df["Spotify Streams"].corr(df["Spotify Playlist Count"])
df["Apple Music Playlist Count"].corr(df["Spotify Playlist Count"])

Out[35]:

0.6858856622525479

Filtering the Dataset¶

Pandas allows us a couple convenient ways to slice our dataset. Many of which are detailed in the Pandas Chapter of the Data Science Handbook.

One easy one is to filter down your dataset based on some boolean condition.

For example, the following code loads the pokemon dataset and then filters it down to only Water Pokemon:

df = pd.read_csv("/content/gdrive/My Drive/datasets/pokemon.csv')
water_df = df[df["Type 1"] == "Water"]

Try a filter with your dataset and repeat the exercises from above. Ask youself what differences you see.

In [ ]:

In [41]:

Copied!

df[df["Artist"] == "Taylor Swift"]["Spotify Streams"].median()
df[df["Artist"] == "Taylor Swift"]["Spotify Streams"].median()

Out[41]:

395433400.0

In [44]:

Copied!

df[df["Artist"] == "Drake"]["Spotify Streams"].median()
df[df["Artist"] == "Drake"]["Spotify Streams"].median()

Out[44]:

353267786.5

In [51]:

Copied!

df["Streams per Playlist"] = df["Spotify Streams"] / df["Spotify Playlist Count"]
df.sort_values(by="Streams per Playlist", ascending=True).head(10)
df["Streams per Playlist"] = df["Spotify Streams"] / df["Spotify Playlist Count"]
df.sort_values(by="Streams per Playlist", ascending=True).head(10)

Out[51]:

	All Time Rank	Track	Album Name	Artist	Release Date	ISRC	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Soundcloud Streams	Shazam Counts	TIDAL Popularity	Explicit Track	Z Score	Steams per Playlist	Streams per Playlist
1935	1929	Ten by Ten	Lazy Mechanics	Carrier	2023-02-24	UKMDH2100069	33.0	2265.0	37.0	2401.0	...	NaN	NaN	NaN	NaN	51.0	NaN	False	-0.830885	61.216216	61.216216
1609	1603	I Am Not Okay	I Am Not Okay	Jelly Roll	2024-06-12	USUG12403674	36.6	4522.0	63.0	11074150.0	...	10.0	277638.0	157	1863.0	7472.0	NaN	False	-0.830881	71.777778	71.777778
3824	3793	Happy Birthday	Happy Birthday	Jovanie	2019-03-22	QM4TX1990759	21.9	37866.0	449.0	53427.0	...	NaN	15747.0	604	NaN	60542.0	NaN	False	-0.830819	84.334076	84.334076
1990	1974	Sunny Day	Sunny Day	Matteo Rossanese	2021-05-28	QZHN92130190	32.4	5936.0	57.0	785.0	...	NaN	NaN	NaN	NaN	122323.0	NaN	False	-0.830878	104.140351	104.140351
468	467	Sabka Malik Ek	Sabka Malik Ek	Dada Black Sheep	2022-08-12	FR9W12228852	71.3	2754.0	22.0	441.0	...	NaN	NaN	NaN	NaN	NaN	NaN	False	-0.830884	125.181818	125.181818
2671	2666	A ver, a ver - Ric Piccolo Edit	A ver, a ver (Ric Piccolo Edit)	Donald	2022-08-24	GBUQH2200069	27.3	67852.0	531.0	98875.0	...	NaN	NaN	NaN	NaN	NaN	NaN	False	-0.830763	127.781544	127.781544
82	83	Peekaboo	Glide - Single	Dave Angel	2023-01-27	GBLTF2300027	167.2	8082.0	63.0	32758.0	...	NaN	NaN	NaN	NaN	442.0	NaN	False	-0.830874	128.285714	128.285714
2218	2213	Lullaby - BadWolf Remix	Lullaby (BadWolf Remix)	TR-MEET	2023-03-24	DEEE72301691	30.4	18204.0	113.0	5777.0	...	NaN	NaN	NaN	NaN	964.0	NaN	False	-0.830856	161.097345	161.097345
2534	2534	Halloween	Halloween	Blood Dance	2022-10-06	CAENV2255241	28.2	7332.0	45.0	1692.0	...	NaN	NaN	NaN	NaN	152.0	NaN	False	-0.830876	162.933333	162.933333
3449	3438	ECHOES: Cala Boca Menino	Echoes	Fire! Orchestra	2023-04-14	NOFGR2331100	23.3	11960.0	69.0	4543.0	...	NaN	NaN	NaN	NaN	NaN	NaN	False	-0.830867	173.333333	173.333333

10 rows × 32 columns

In [ ]: