Sorting Hat w/ One-Hot Encoding

In this exercise, we're going to build another decision tree, this time on purely categorical data, and we're going to do it using one-hot encoding.

One-Hot Encoding

One-hot encoding is a form of feature engineering that converts categorical data into a numerical format.
It creates a binary column for each category in the original feature. The binary column has a value of 1 if the category is present and 0 otherwise.
This lets us use categorical data in machine learning models that require numerical input.

A simple way to do it in pandas is with pd.get_dummies(df).

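import pandas as pd

df = pd.DataFrame({"animal": ["dog", "cat", "dog", "penguin"]})
# pandas 2.0+ returns True/False by default; .astype(int) gives the 0/1 shown below
pd.get_dummies(df).astype(int)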
   animal_cat  animal_dog  animal_penguin
0           0           1               0
1           1           0               0
2           0           1               0
3           0           0               1
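
To make the "one binary column per category" idea concrete, here is a small sketch that builds the same columns by hand and compares them with pd.get_dummies. The name df_animals is just an illustrative choice, and dtype=int is used so both versions hold 0/1 rather than booleans.

```python
import pandas as pd

df_animals = pd.DataFrame({"animal": ["dog", "cat", "dog", "penguin"]})

# By hand: one column per category, 1 where the row matches that category, else 0
manual = pd.DataFrame(
    {
        f"animal_{category}": (df_animals["animal"] == category).astype(int)
        for category in sorted(df_animals["animal"].unique())
    }
)

# The pandas shortcut; dtype=int asks for 0/1 instead of True/False
encoded = pd.get_dummies(df_animals, dtype=int)

print(manual)
print(encoded)
```

Each row has exactly one 1 across the three columns, which is where the name "one-hot" comes from.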

Exercise

Let's say we wanted to train a model to be a Hogwarts Sorting Hat. Let's use a set of features about the student to predict which house they should go into.

Each entry represents a student and some information about them, including their "House", which will be the label we train our tree on.

Inspect the dataframe and decide which features you want to use for your decision tree to guess the "House".

One-hot encode the feature columns using pandas. Here's a toy example to use for reference.

# Initial Features
feature_cols = ["A", "B", "C"]

# One-hot encode all the feature columns and reconstruct the dataframe with the new columns
onehot_features_df = pd.get_dummies(df[feature_cols]).astype(int)
label_df = df[["D"]]
df = pd.concat([label_df, onehot_features_df], axis=1)

# Re-assign the feature_cols to the new column names
feature_cols = onehot_features_df.columns

Inspect this new dataframe and make sense of what you're seeing.
Now go back to our decision tree exercise and use your code from there to train a new tree on this data and evaluate the results.
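
If you want to see what the toy example produces before touching the real data, here is a runnable version of it. The values in columns "A" through "D" are made up purely for illustration; only the encoding steps come from the snippet above.

```python
import pandas as pd

# Hypothetical toy data: "A", "B", "C" are categorical features, "D" is the label
df = pd.DataFrame(
    {
        "A": ["x", "y", "x"],
        "B": ["red", "red", "blue"],
        "C": ["yes", "no", "yes"],
        "D": ["cat", "dog", "cat"],
    }
)
feature_cols = ["A", "B", "C"]

# One-hot encode the features, then put the label column back in front
onehot_features_df = pd.get_dummies(df[feature_cols]).astype(int)
df = pd.concat([df[["D"]], onehot_features_df], axis=1)

# The feature list now holds the generated column names, e.g. "A_x", "B_red"
feature_cols = list(onehot_features_df.columns)

print(df.columns.tolist())
print(df)
```

Three original feature columns turn into six binary ones here, because each of "A", "B", and "C" has two distinct values in this made-up data.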

Try manipulating aspects of the tree to see if you can get better accuracy. I don't expect this to be a very accurate model; it's just for fun.
Plot the tree! I've left some code below to help you do this.

In [17]:

from google.colab import drive
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load and clean the data
# I did this part for you!
drive.mount('/content/gdrive')

df = pd.read_csv('/content/gdrive/My Drive/datasets/harry_potter_characters_fixed_clean.csv')
df.columns

feature_cols = ['Gender', 'Blood Status', 'Eye Color', 'Hair Color']

# 1.5. One-hot encode all the features, make sure you create a new feature_cols list
onehot_features_df = pd.get_dummies(df[feature_cols]).astype(int)
label_df = df[["House"]]
df = pd.concat([label_df, onehot_features_df], axis=1)

# Re-assign the feature_cols to the new column names
feature_cols = onehot_features_df.columns

# 2. Train/Test Split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# 3. Create X_train, y_train, X_test, y_test variables
X_train = df_train[feature_cols]
y_train = df_train["House"]
X_test = df_test[feature_cols]
y_test = df_test["House"]

# 4. Train the Model
model = DecisionTreeClassifier(min_samples_split=3)
model.fit(X_train, y_train)

# 5. Evaluate the Model (Report Accuracy as correct/all)
y_pred = model.predict(X_test)

df_test = df_test.copy()  # work on a copy so pandas doesn't warn about setting on a slice
df_test["House Prediction"] = y_pred
accuracy = len(df_test[df_test["House"] == df_test["House Prediction"]]) / len(df_test)
print(f"Accuracy is {accuracy}")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Accuracy is 0.6
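
For the "try manipulating aspects of the tree" step, one option is to loop over a setting such as max_depth and compare test accuracy. The sketch below is only an illustration: the toy frame and its hair/eyes/house values are invented stand-ins for the real dataset, and the list of depths is an arbitrary choice. It uses scikit-learn's accuracy_score, which computes the same correct/all ratio as the cell above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; swap in the one-hot encoded Hogwarts frame instead
toy = pd.DataFrame(
    {
        "hair": ["red", "black", "blond", "brown"] * 10,
        "eyes": ["blue", "green", "grey", "brown", "hazel"] * 8,
        "house": ["Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff"] * 10,
    }
)
X = pd.get_dummies(toy[["hair", "eyes"]]).astype(int)
y = toy["house"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train one tree per max_depth value and record its test accuracy
results = {}
for max_depth in [1, 2, 3, None]:  # None lets the tree grow until it stops on its own
    model = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_split=3, random_state=42
    )
    model.fit(X_train, y_train)
    results[max_depth] = accuracy_score(y_test, model.predict(X_test))

for max_depth, acc in results.items():
    print(f"max_depth={max_depth}: accuracy={acc:.2f}")
```

The same loop works for other settings like min_samples_split or criterion. With a test set this small, expect the numbers to jump around from one split to the next.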

In [18]:

# Plot the tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(60, 24))
# Pass a plain list: some scikit-learn versions reject a pandas Index for feature_names
plot_tree(model, feature_names=list(X_train.columns), class_names=sorted(y_train.unique()), filled=True)
plt.show()

[Image: plot of the trained decision tree]
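
If the plotted tree is too big to read, scikit-learn's export_text gives a plain-text view of the same splits. This is a minimal sketch on an invented two-house dataset, not the notebook's actual model; with the real tree you would pass model and list(X_train.columns) instead.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data where hair color alone decides the house
toy = pd.DataFrame(
    {
        "hair": ["red", "black", "red", "black"],
        "house": ["Gryffindor", "Slytherin", "Gryffindor", "Slytherin"],
    }
)
X = pd.get_dummies(toy[["hair"]]).astype(int)
y = toy["house"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each indented line is a split on a one-hot column; leaves show the predicted class
tree_text = export_text(model, feature_names=list(X.columns))
print(tree_text)
```

A split such as "hair_red <= 0.50" reads as "hair is not red", since one-hot columns only hold 0 or 1.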

In [20]:

df["Eye Color"].unique()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3652         try:
-> 3653             return self._engine.get_loc(casted_key)
   3654         except KeyError as err:

/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Eye Color'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-20-9dfabbf5c840> in <cell line: 1>()
----> 1 df["Eye Color"].unique()

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3759             if self.columns.nlevels > 1:
   3760                 return self._getitem_multilevel(key)
-> 3761             indexer = self.columns.get_loc(key)
   3762             if is_integer(indexer):
   3763                 indexer = [indexer]

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3653             return self._engine.get_loc(casted_key)
   3654         except KeyError as err:
-> 3655             raise KeyError(key) from err
   3656         except TypeError:
   3657             # If we have a listlike key, _check_indexing_error will raise

KeyError: 'Eye Color'
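
The KeyError above is what you'd expect once df has been overwritten with the one-hot encoded frame: the original "Eye Color" column is gone, replaced by columns named "Eye Color_<value>". One way around it, sketched here with invented rows and the hypothetical names raw_df and encoded_df, is to keep the raw frame under its own name.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
raw_df = pd.DataFrame(
    {
        "House": ["Gryffindor", "Slytherin", "Ravenclaw"],
        "Eye Color": ["Green", "Grey", "Blue"],
    }
)

# Keep the raw frame intact and store the encoded version separately
onehot_features_df = pd.get_dummies(raw_df[["Eye Color"]]).astype(int)
encoded_df = pd.concat([raw_df[["House"]], onehot_features_df], axis=1)

# The original column only exists on the raw frame
print("Eye Color" in encoded_df.columns)
print(raw_df["Eye Color"].unique())

# On the encoded frame, look for the generated columns by prefix
eye_cols = [c for c in encoded_df.columns if c.startswith("Eye Color_")]
print(eye_cols)
```

In the notebook itself, re-running the load cell's pd.read_csv line would also bring the original column back.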
