Descibing a Dataset¶
In this exercise we're going to use the tequniques we just learned to explore a dataset.
# First, let's load the dataset into a dataframe, df
from google.colab import drive
drive.mount('/content/gdrive')
import pandas as pd
# Choose one of the following datasets to load into a dataframe
#dataset = "pokemon.csv"
dataset = "fixed_most_streamed_spotify_songs_2024.csv"
#dataset = "nba_stats_2023_2024.csv"
#dataset = "star_wars_character_dataset.csv"
df = pd.read_csv(f'/content/gdrive/My Drive/datasets/{dataset}')
df.head()
Mounted at /content/gdrive
All Time Rank | Track | Album Name | Artist | Release Date | ISRC | Track Score | Spotify Streams | Spotify Playlist Count | Spotify Playlist Reach | ... | SiriusXM Spins | Deezer Playlist Count | Deezer Playlist Reach | Amazon Playlist Count | Pandora Streams | Pandora Track Stations | Soundcloud Streams | Shazam Counts | TIDAL Popularity | Explicit Track | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | MILLION DOLLAR BABY | Million Dollar Baby - Single | Tommy Richman | 2024-04-26 | QM24S2402528 | 725.4 | 3.904709e+08 | 30716.0 | 196631588.0 | ... | 684.0 | 62.0 | 17598718.0 | 114.0 | 18004655.0 | 22,931 | 4818457.0 | 2669262.0 | NaN | 0 |
1 | 2 | Not Like Us | Not Like Us | Kendrick Lamar | 2024-05-04 | USUG12400910 | 545.9 | 3.237039e+08 | 28113.0 | 174597137.0 | ... | 3.0 | 67.0 | 10422430.0 | 111.0 | 7780028.0 | 28,444 | 6623075.0 | 1118279.0 | NaN | 1 |
2 | 3 | i like the way you kiss me | I like the way you kiss me | Artemas | 2024-03-19 | QZJ842400387 | 538.4 | 6.013093e+08 | 54331.0 | 211607669.0 | ... | 536.0 | 136.0 | 36321847.0 | 172.0 | 5022621.0 | 5,639 | 7208651.0 | 5285340.0 | NaN | 0 |
3 | 4 | Flowers | Flowers - Single | Miley Cyrus | 2023-01-12 | USSM12209777 | 444.9 | 2.031281e+09 | 269802.0 | 136569078.0 | ... | 2182.0 | 264.0 | 24684248.0 | 210.0 | 190260277.0 | 203,384 | NaN | 11822942.0 | NaN | 0 |
4 | 5 | Houdini | Houdini | Eminem | 2024-05-31 | USUG12403398 | 423.3 | 1.070349e+08 | 7223.0 | 151469874.0 | ... | 1.0 | 82.0 | 17660624.0 | 105.0 | 4493884.0 | 7,006 | 207179.0 | 457017.0 | NaN | 1 |
5 rows × 29 columns
Why Kind of Data is it?¶
Run: df.info()
and df.head()
to look at the column types and a sample of the data.
For each column, ask yourself if the column is:
- Nominal: Categories (e.g., colors)
- Ordinal: Ordered categories (e.g., ratings)
- Interval: Numeric but no true zero (e.g., temperature in °C)
- Ratio/Metric: Numeric w/ true zero (e.g., height, weight)
df["Explicit Track"] = df["Explicit Track"].astype("bool")
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4600 entries, 0 to 4599 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 All Time Rank 4600 non-null int64 1 Track 4600 non-null object 2 Album Name 4600 non-null object 3 Artist 4595 non-null object 4 Release Date 4600 non-null object 5 ISRC 4600 non-null object 6 Track Score 4600 non-null float64 7 Spotify Streams 4487 non-null float64 8 Spotify Playlist Count 4530 non-null float64 9 Spotify Playlist Reach 4528 non-null float64 10 Spotify Popularity 3796 non-null float64 11 YouTube Views 4292 non-null float64 12 YouTube Likes 4285 non-null float64 13 TikTok Posts 3427 non-null float64 14 TikTok Likes 3620 non-null float64 15 TikTok Views 3619 non-null float64 16 YouTube Playlist Reach 3591 non-null float64 17 Apple Music Playlist Count 4039 non-null float64 18 AirPlay Spins 4102 non-null float64 19 SiriusXM Spins 2477 non-null float64 20 Deezer Playlist Count 3679 non-null float64 21 Deezer Playlist Reach 3672 non-null float64 22 Amazon Playlist Count 3545 non-null float64 23 Pandora Streams 3494 non-null float64 24 Pandora Track Stations 3332 non-null object 25 Soundcloud Streams 1267 non-null float64 26 Shazam Counts 4023 non-null float64 27 TIDAL Popularity 0 non-null float64 28 Explicit Track 4600 non-null bool dtypes: bool(1), float64(21), int64(1), object(6) memory usage: 1010.9+ KB None
All Time Rank | Track | Album Name | Artist | Release Date | ISRC | Track Score | Spotify Streams | Spotify Playlist Count | Spotify Playlist Reach | ... | SiriusXM Spins | Deezer Playlist Count | Deezer Playlist Reach | Amazon Playlist Count | Pandora Streams | Pandora Track Stations | Soundcloud Streams | Shazam Counts | TIDAL Popularity | Explicit Track | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | MILLION DOLLAR BABY | Million Dollar Baby - Single | Tommy Richman | 2024-04-26 | QM24S2402528 | 725.4 | 3.904709e+08 | 30716.0 | 196631588.0 | ... | 684.0 | 62.0 | 17598718.0 | 114.0 | 18004655.0 | 22,931 | 4818457.0 | 2669262.0 | NaN | False |
1 | 2 | Not Like Us | Not Like Us | Kendrick Lamar | 2024-05-04 | USUG12400910 | 545.9 | 3.237039e+08 | 28113.0 | 174597137.0 | ... | 3.0 | 67.0 | 10422430.0 | 111.0 | 7780028.0 | 28,444 | 6623075.0 | 1118279.0 | NaN | True |
2 | 3 | i like the way you kiss me | I like the way you kiss me | Artemas | 2024-03-19 | QZJ842400387 | 538.4 | 6.013093e+08 | 54331.0 | 211607669.0 | ... | 536.0 | 136.0 | 36321847.0 | 172.0 | 5022621.0 | 5,639 | 7208651.0 | 5285340.0 | NaN | False |
3 | 4 | Flowers | Flowers - Single | Miley Cyrus | 2023-01-12 | USSM12209777 | 444.9 | 2.031281e+09 | 269802.0 | 136569078.0 | ... | 2182.0 | 264.0 | 24684248.0 | 210.0 | 190260277.0 | 203,384 | NaN | 11822942.0 | NaN | False |
4 | 5 | Houdini | Houdini | Eminem | 2024-05-31 | USUG12403398 | 423.3 | 1.070349e+08 | 7223.0 | 151469874.0 | ... | 1.0 | 82.0 | 17660624.0 | 105.0 | 4493884.0 | 7,006 | 207179.0 | 457017.0 | NaN | True |
5 rows × 29 columns
Central Tendancy¶
Choose a few metric columns to dig into. For these, calculate:
- Mean:
df[col_name].mean()
- Median:
df[col_name].median()
- Mode:
df[col_name].mode()
Ask yourself:
- How do these compare to one another?
- What do their similarities or differences say about the data?
s_mean = df["Spotify Streams"].mean()
s_median = df["Spotify Streams"].median()
s_mode = df["Spotify Streams"].mode()
print(s_mean, s_median, s_mode)
df["Spotify Streams"].plot(kind="hist", x="All Time Rank", y="Spotify Streams", bins=100)
447387314.7459327 239850720.0 0 1.655575e+09 Name: Spotify Streams, dtype: float64
<Axes: ylabel='Frequency'>
Measures of Dispersion¶
For the same metric columns calculate:
- Range:
df[col_name].max() - df[col_name].min()
- Std Deviation:
df[col_name].std()
- Interquartile Range (IQR):
df[col_name].quantile(0.75) - df[col_name].quantile(0.25)
How do these differe between columns? What does it say about your dataset?
print(df["Spotify Streams"].max())
print(df["Spotify Streams"].min())
print(df["Spotify Streams"].std())
print(df["Spotify Streams"].quantile(0.75))
print(df["Spotify Streams"].quantile(0.25))
4281468720.0 1071.0 538443905.2812704 628363805.0 70386298.5
Percentiles and Quartiles¶
For the same metric columns calculate:
- 5th Percentile:
df[col_name].quantile(0.05)
- 25th Percentile:
df[col_name].quantile(0.25)
- 50th Percentile:
df[col_name].quantile(0.50)
- 75th Percentile:
df[col_name].quantile(0.75)
- 95th Percentile:
df[col_name].quantile(0.95)
What do these say about the column?
Outliers by Sorting the Data¶
The easiest way to start looking for outliers is to just sort the data by some column and look at what's at the top.
df = df.sort_values(col_name, ascending=False)
The ascending=False
argument to the sort_values
method tells the sort to put the largest items first.
Outliers using Quartiles¶
Try and find some outliers in your dataset based on one of the columns.
One way to do this is using the quartiles. Generally $x$ is considered an outlier if one of:
- $x < Q_1 - 1.5 * IQR$
- $x > Q_3 + 1.5 * IQR$
To calculate this and filter the pandas dataframe this way in python do:
q1 = df[col_name].quantile(0.25)
q3 = df[col_name].quantile(0.75)
iqr = q3 - q1
df[df[col_name] > q3 + iqr * 1.5]
There's a good chance this will return an empty dataframe. If it does, try making the 1.5
smaller (like 1.25
or 1
) or choose a different col
(pd.Timestamp.now() - pd.to_datetime(df["Release Date"])).apply(lambda x: pd.Timedelta(x).days).plot(kind="hist")
<Axes: ylabel='Frequency'>
Outliers Using Stddev¶
We can also look at the the z-score (how many standard deviations above or below the mean it is) and then order the dataset by this.
z_score = (df[col_name] - mean) / stddev
df["Z Score"] = z_score
df.sort_values("Z Score", ascending=False)
streams_mean = df["Spotify Streams"].mean()
streams_std = df["Spotify Streams"].std()
z_score_srs = (df["Spotify Streams"] - streams_mean) / streams_std
df["Z Score"] = z_score_srs
df.sort_values("Z Score", ascending=False)
All Time Rank | Track | Album Name | Artist | Release Date | ISRC | Track Score | Spotify Streams | Spotify Playlist Count | Spotify Playlist Reach | ... | Deezer Playlist Count | Deezer Playlist Reach | Amazon Playlist Count | Pandora Streams | Pandora Track Stations | Soundcloud Streams | Shazam Counts | TIDAL Popularity | Explicit Track | Z Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | 56 | Blinding Lights | Blinding Lights | The Weeknd | 2019-11-29 | USUG11904206 | 194.0 | 4.281469e+09 | 590392.0 | 165544011.0 | ... | 564.0 | 11701129.0 | 184.0 | 489299762.0 | 319,634 | 28659471.0 | 28994660.0 | NaN | False | 7.120670 |
4254 | 4229 | Blinding Lights | Blinding Lights | xSyborg | 2021-03-13 | QZLBC2302779 | 20.5 | 4.261328e+09 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | 7.083265 |
53 | 54 | Shape of You | Shape of You | Ed Sheeran | 2017-01-06 | GBAHS1600463 | 195.1 | 3.909459e+09 | 316378.0 | 148037882.0 | ... | 512.0 | 2710931.0 | 115.0 | 733565655.0 | 651,557 | 78149375.0 | 1146246.0 | NaN | False | 6.429772 |
1637 | 1807 | Shape of You | Shape of You | xSyborg | 2021-02-20 | QZLBC2302789 | 36.3 | 3.888356e+09 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | 6.390580 |
132 | 133 | Someone You Loved | Breach | Lewis Capaldi | 2018-11-08 | DEUM71807062 | 129.7 | 3.427499e+09 | 338763.0 | 160553919.0 | ... | 502.0 | 4834088.0 | 110.0 | 427212257.0 | 357,540 | NaN | 29266297.0 | NaN | False | 5.534674 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4534 | 4506 | Mast Malang Jhoom (From "Bade Miyan Chote Miya... | Mast Malang Jhoom (From "Bade Miyan Chote Miya... | Vishal Mishra | 2024-03-01 | INZ031414619 | 19.6 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | False | NaN |
4541 | 4498 | Hood By Air Freestyle | Hood By Air Freestyle - Single | Baby Flaco | 2024-04-26 | QZLMG2000292 | 19.6 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1526440.0 | NaN | False | NaN |
4560 | 4552 | Supera | Sertanejo Para Dan�ï | Mar��lia Mendo | 2021-03-05 | FR0Z50062905 | 19.5 | NaN | 2.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | NaN |
4572 | 4536 | TV Shapes | TV Shapes | Bubbatrees | 2022-08-19 | UKSWX1900031 | 19.5 | NaN | 3.0 | 9.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | False | NaN |
4578 | 4550 | Freezing | Freezing - EP | purpl | 2023-02-02 | FR10S2303559 | 19.5 | NaN | 8.0 | 565.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | NaN |
4600 rows × 30 columns
Correlation and Covariance¶
Consder two columns you think might be related and calculate:
- Correlation:
df[col_name_1].corr(df[col_name_2])
- Covariance:
df[col_name_1].corr(df[col_name_2])
#df["Spotify Streams"].corr(df["Spotify Playlist Count"])
df["Apple Music Playlist Count"].corr(df["Spotify Playlist Count"])
0.6858856622525479
Filtering the Dataset¶
Pandas allows us a couple convenient ways to slice our dataset. Many of which are detailed in the Pandas Chapter of the Data Science Handbook.
One easy one is to filter down your dataset based on some boolean condition.
For example, the following code loads the pokemon dataset and then filters it down to only Water Pokemon:
df = pd.read_csv("/content/gdrive/My Drive/datasets/pokemon.csv')
water_df = df[df["Type 1"] == "Water"]
Try a filter with your dataset and repeat the exercises from above. Ask youself what differences you see.
df[df["Artist"] == "Taylor Swift"]["Spotify Streams"].median()
395433400.0
df[df["Artist"] == "Drake"]["Spotify Streams"].median()
353267786.5
df["Streams per Playlist"] = df["Spotify Streams"] / df["Spotify Playlist Count"]
df.sort_values(by="Streams per Playlist", ascending=True).head(10)
All Time Rank | Track | Album Name | Artist | Release Date | ISRC | Track Score | Spotify Streams | Spotify Playlist Count | Spotify Playlist Reach | ... | Amazon Playlist Count | Pandora Streams | Pandora Track Stations | Soundcloud Streams | Shazam Counts | TIDAL Popularity | Explicit Track | Z Score | Steams per Playlist | Streams per Playlist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1935 | 1929 | Ten by Ten | Lazy Mechanics | Carrier | 2023-02-24 | UKMDH2100069 | 33.0 | 2265.0 | 37.0 | 2401.0 | ... | NaN | NaN | NaN | NaN | 51.0 | NaN | False | -0.830885 | 61.216216 | 61.216216 |
1609 | 1603 | I Am Not Okay | I Am Not Okay | Jelly Roll | 2024-06-12 | USUG12403674 | 36.6 | 4522.0 | 63.0 | 11074150.0 | ... | 10.0 | 277638.0 | 157 | 1863.0 | 7472.0 | NaN | False | -0.830881 | 71.777778 | 71.777778 |
3824 | 3793 | Happy Birthday | Happy Birthday | Jovanie | 2019-03-22 | QM4TX1990759 | 21.9 | 37866.0 | 449.0 | 53427.0 | ... | NaN | 15747.0 | 604 | NaN | 60542.0 | NaN | False | -0.830819 | 84.334076 | 84.334076 |
1990 | 1974 | Sunny Day | Sunny Day | Matteo Rossanese | 2021-05-28 | QZHN92130190 | 32.4 | 5936.0 | 57.0 | 785.0 | ... | NaN | NaN | NaN | NaN | 122323.0 | NaN | False | -0.830878 | 104.140351 | 104.140351 |
468 | 467 | Sabka Malik Ek | Sabka Malik Ek | Dada Black Sheep | 2022-08-12 | FR9W12228852 | 71.3 | 2754.0 | 22.0 | 441.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | False | -0.830884 | 125.181818 | 125.181818 |
2671 | 2666 | A ver, a ver - Ric Piccolo Edit | A ver, a ver (Ric Piccolo Edit) | Donald | 2022-08-24 | GBUQH2200069 | 27.3 | 67852.0 | 531.0 | 98875.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | False | -0.830763 | 127.781544 | 127.781544 |
82 | 83 | Peekaboo | Glide - Single | Dave Angel | 2023-01-27 | GBLTF2300027 | 167.2 | 8082.0 | 63.0 | 32758.0 | ... | NaN | NaN | NaN | NaN | 442.0 | NaN | False | -0.830874 | 128.285714 | 128.285714 |
2218 | 2213 | Lullaby - BadWolf Remix | Lullaby (BadWolf Remix) | TR-MEET | 2023-03-24 | DEEE72301691 | 30.4 | 18204.0 | 113.0 | 5777.0 | ... | NaN | NaN | NaN | NaN | 964.0 | NaN | False | -0.830856 | 161.097345 | 161.097345 |
2534 | 2534 | Halloween | Halloween | Blood Dance | 2022-10-06 | CAENV2255241 | 28.2 | 7332.0 | 45.0 | 1692.0 | ... | NaN | NaN | NaN | NaN | 152.0 | NaN | False | -0.830876 | 162.933333 | 162.933333 |
3449 | 3438 | ECHOES: Cala Boca Menino | Echoes | Fire! Orchestra | 2023-04-14 | NOFGR2331100 | 23.3 | 11960.0 | 69.0 | 4543.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | False | -0.830867 | 173.333333 | 173.333333 |
10 rows × 32 columns