Exploring Data With Pandas
In this notebook, you'll be exploring a dataset with pandas. You'll get to choose a dataset from a shared datasets folder I'll be providing.
Step 1 - Getting Access to the Drive
First things first, you'll have to open this shared folder and add a shortcut to it in your own Drive.
- Open this link to our shared datasets folder in a different tab.
- In that folder, open the drop-down at the top of the page, next to the folder's name, "datasets".
- From that drop-down, go to "Organize" and then select "Add Shortcut"
- In the new list of locations to add the shortcut to, select "My Drive"
Step 2 - Connecting Colab to Your Drive
Next, we'll want to mount our Google Drive to our Notebook environment (we'll have
# Connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")
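Once the drive is mounted, it can help to check which CSV files you can actually see before picking one. Here is a minimal sketch using only the standard library. The helper name list_csv_files and the DATASETS_DIR path are assumptions for illustration (the path assumes the shortcut is called "datasets" and sits directly in "My Drive"), so adjust them to match your setup.

```python
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted names of the CSV files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.csv"))


# Assumed location of the shortcut after mounting; adjust if yours differs.
DATASETS_DIR = "/content/gdrive/MyDrive/datasets"

if Path(DATASETS_DIR).is_dir():
    print(list_csv_files(DATASETS_DIR))
else:
    print("Folder not found. Check that the drive is mounted and the shortcut exists.")
```

If the folder is not found, rerun the mount cell and confirm the shortcut shows up in "My Drive".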
Step 3 - Read Your Selected CSV
- In this cell, we're going to import the pandas module:
import pandas as pd
- We're going to pick which CSV we want to read, specify its file path in our Colab Notebook's filesystem, and then read it using pandas into a dataframe variable called df.
df = pd.read_csv("gdrive/MyDrive/datasets/pokemon.csv")
And to confirm that the data loaded correctly, we'll look at the .head() (first 5 rows) of our DataFrame. Just call this as the last line in your cell to see the results.
df.head()
Alternatively, if you want to see a random sample of rows, you can use:
df.sample(5)
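If you want to try the read, head, and sample steps without touching the drive, here is a small self-contained sketch. The CSV text is made up for illustration and is not the contents of any file in the shared folder. The column names (name, type, hp) and the variable toy_df are hypothetical.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file; the columns are hypothetical.
csv_text = """name,type,hp
Alpha,Grass,45
Beta,Fire,39
Gamma,Water,44
Delta,Fire,78
Epsilon,Grass,60
Zeta,Water,79
"""

# pd.read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

first_rows = toy_df.head()     # first 5 rows by default
first_two = toy_df.head(2)     # pass a number to change how many rows you get
# random_state makes the "random" sample repeatable between runs.
random_rows = toy_df.sample(3, random_state=0)

print(first_rows)
print(random_rows)
```

Leaving out random_state gives a different sample each time you run the cell, which is usually what you want when you are just browsing.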
Step 4 - Inspect the DataFrame
In this cell, use a few different methods on the dataframe to inspect it and get a feel for it.
df.info() # Note: This is a method (you call it)
df.describe() # Note: This is a method
df.columns # Note: This is an attribute (you don't call it)
df.dtypes # Note: This is an attribute
df.index # Note: This is an attribute
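The method versus attribute distinction in the comments above is easy to trip over, so here is a short sketch on a toy DataFrame. The data and column names are invented for illustration.

```python
import pandas as pd

# Toy data, invented for this example.
toy_df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "type": ["Grass", "Fire", "Water"],
    "hp": [45, 39, 44],
})

# Attributes are stored properties: no parentheses.
column_names = list(toy_df.columns)
n_rows, n_cols = toy_df.shape

# Methods do some work: you call them with parentheses.
summary = toy_df.describe()   # returns a new DataFrame of summary statistics
toy_df.info()                 # prints a report and returns None

# Forgetting the parentheses gives you the method itself, not its result.
not_called = toy_df.describe
print(callable(not_called))   # True
```

If a cell shows something like "bound method ..." instead of a table, the parentheses are probably missing.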
Ask yourself, what does this say about the dataframe?
For context, here's the documentation for:
Step 5 - Pull Out a Single Series From the DataFrame
Next, try pulling out a single Series from the dataframe. A Series is like a fancy list: one column of values that all share a data type, plus an index that labels each value.
column_name = ... # This should be a string
df[column_name]
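To make the "fancy list" idea concrete, here is a minimal sketch on invented data. The column name "hp" is hypothetical, so substitute a column from your own dataset.

```python
import pandas as pd

# Toy data, invented for this example.
toy_df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "hp": [45, 39, 44],
})

column_name = "hp"             # hypothetical column name
hp = toy_df[column_name]       # single brackets with a string give a Series

print(type(hp).__name__)       # Series
print(hp.name)                 # hp
print(hp.iloc[0])              # 45 (access by position, like a list)
print(hp.index.tolist())       # [0, 1, 2] (the labels a plain list does not have)

# Double brackets (a list of names) give a one-column DataFrame instead.
hp_frame = toy_df[[column_name]]
print(type(hp_frame).__name__)  # DataFrame
```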
Step 6 - Use Methods to Describe That Series
Try some of the Series Descriptive Methods.
For example:
df[column_name].mode()
Try a few different aggregate methods. What do they tell you about the Series?
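Here is a sketch of a few common Series aggregates, run on a small invented Series so you can check the results by hand. The values and the name "hp" are made up.

```python
import pandas as pd

# Invented values; 39 appears twice on purpose.
hp = pd.Series([45, 39, 44, 78, 39], name="hp")

print(hp.mean())             # 49.0
print(hp.median())           # 44.0
print(hp.min(), hp.max())    # 39 78
print(hp.nunique())          # 4 distinct values
print(hp.mode().tolist())    # [39]
print(hp.value_counts())     # how many times each value appears
```

Note that .mode() returns a Series rather than a single number, because several values can tie for most frequent.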
Step 7 - Use Methods to Describe All of the Series in the DataFrame
Try some of the DataFrame Descriptive Methods
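Most of the Series aggregates also exist on the DataFrame, where they run once per column and hand back a Series of results. A minimal sketch on invented data follows. Passing numeric_only=True is one way to skip text columns, and the column names are hypothetical.

```python
import pandas as pd

# Toy data, invented for this example.
toy_df = pd.DataFrame({
    "type": ["Grass", "Fire", "Water", "Fire"],
    "hp": [45, 39, 44, 78],
    "attack": [49, 52, 48, 84],
})

# Each call returns a Series indexed by column name.
means = toy_df.mean(numeric_only=True)   # skips the text column "type"
maxes = toy_df.max(numeric_only=True)
uniques = toy_df.nunique()               # works on every column, text included

print(means)
print(maxes)
print(uniques)
```

You can pull a single result out by column name, for example means["hp"].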
Step 8 - Try the DataFrame .describe() Method
Try df.describe()
Ask yourself, what does this tell me about the DataFrame?
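By default, .describe() only summarizes the numeric columns. Here is a sketch on invented data showing that default alongside include="all", which also summarizes text columns. The column names are hypothetical.

```python
import pandas as pd

# Toy data, invented for this example.
toy_df = pd.DataFrame({
    "type": ["Grass", "Fire", "Water", "Fire"],
    "hp": [45, 39, 44, 78],
    "attack": [49, 52, 48, 84],
})

# Default: numeric columns only, one row per statistic.
numeric_summary = toy_df.describe()
print(numeric_summary)

# include="all" adds text columns, with rows such as unique, top, and freq.
full_summary = toy_df.describe(include="all")
print(full_summary)

# The result is itself a DataFrame, so you can index into it.
print(numeric_summary.loc["mean", "hp"])   # 51.5
```

In the full summary, statistics that do not apply to a column (such as the mean of a text column) show up as NaN.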
Step 9 - Try Filtering a DataFrame by a Boolean
One of the major features of Pandas is that we can filter (keep a subset of) the rows based on a comparison, such as checking a column against a value.
df[df[str_col_name] == "Target Value"]
df[df[num_col_name] >= 200]
For now, only do one comparison at a time. Don't use any "and" or "or" statements.
Try assigning these to variables. Try using the descriptive methods from above on these new filtered dataframes.
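It can help to see what the comparison produces on its own before using it as a filter. Here is a minimal sketch on invented data. The column names, the "Fire" target, and the threshold of 60 are all hypothetical.

```python
import pandas as pd

# Toy data, invented for this example.
toy_df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    "type": ["Grass", "Fire", "Water", "Fire", "Grass"],
    "hp": [45, 39, 44, 78, 60],
})

# The comparison alone gives a Series of True/False values, one per row.
is_fire = toy_df["type"] == "Fire"
print(is_fire.tolist())        # [False, True, False, True, False]

# Using that Series inside the brackets keeps only the True rows.
fire_df = toy_df[is_fire]
strong_df = toy_df[toy_df["hp"] >= 60]

# Filtered results are ordinary DataFrames, so the earlier methods still work.
print(fire_df["hp"].mean())    # 58.5
print(len(strong_df))          # 2
print(fire_df.index.tolist())  # [1, 3] (rows keep their original index labels)
```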