Clustering Italian Restaurants¶
Here's a (hopefully) fun application of K-means clustering. I've provided a CSV of Italian restaurants with their names, latitudes, longitudes, and ratings.
You can read more about this dataset and its fields on Kaggle: https://www.kaggle.com/datasets/jcraggy/nyc-italian-restaurants-plus
Pitch¶
Imagine you're a data scientist for a startup that's trying to provide recommendations to restaurant-goers. When your customer can't get a table at their desired restaurant, you'd like to recommend an alternative they might find similar.
You're going to do this by k-means clustering the restaurants from this dataset and then visualizing the results by scatter-plotting them on a map.
Instructions¶
Load the dataset.
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

drive.mount('/content/gdrive')
df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_plus_loc.csv")
# Save a copy of the original df before you do any transformation to it; we'll want this later for plotting.
original_df = df.copy()
Do some k-means clustering. Feel free to do a simple k-means, or introduce feature scaling and/or PCA like we did in yesterday's Penguins clustering exercise (just like great art, great data science is often theft of prior work).
You pick the features to cluster on. You may not feel all of them are relevant to your clustering.
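If the scale-then-cluster step is unclear, here's a minimal sketch on made-up toy values (the column names Price and Food match this dataset, but the numbers are invented — this is not the real data):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented toy values: two cheap restaurants, two expensive ones.
toy_df = pd.DataFrame({"Price": [20, 22, 60, 65], "Food": [15, 16, 25, 26]})

# Scale so both features contribute comparably to the distance metric.
X = StandardScaler().fit_transform(toy_df)

# Cluster into 2 groups; each row gets a label in toy_kmeans.labels_.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(toy_kmeans.labels_)
```

The two cheap rows land in one cluster and the two expensive rows in the other; which cluster gets label 0 is arbitrary.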
Make sure you call the k-means results kmeans like we did in the prior exercise. If you call it something else, you'll just have to change the code below.
Use the new k-means clusters to group and average your original dataframe, and print it out. Each row in this new grouped table represents one of your clusters. How would you describe each cluster?
original_df["kmeans"] = kmeans.labels_
clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
print(clusters_df)
Once you've built your clusters, scatter plot the results on their original lat/longs over a map of Manhattan using folium.
import folium

fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
    color = colors[kmeans.labels_[index]]
    description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
    folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
# Display the map
fmap
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
drive.mount('/content/gdrive')
df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_plus_loc.csv")
# Save a copy of the original df before you do any transformation to it; we'll want this later for plotting.
original_df = df.copy()
df.head()
Mounted at /content/gdrive
|   | Case | Restaurant | Price | Food | Decor | Service | East | latitude | longitude |
|---|------|------------|-------|------|-------|---------|------|----------|-----------|
| 0 | 1 | Daniella Ristorante | 43 | 22 | 18 | 20 | 0 | 40.746831 | -73.996758 |
| 1 | 2 | Tello's Ristorante | 32 | 20 | 19 | 19 | 0 | 40.743421 | -73.999537 |
| 2 | 3 | Biricchino | 34 | 21 | 13 | 18 | 0 | 40.748864 | -73.995519 |
| 3 | 4 | Bottino | 41 | 20 | 20 | 17 | 0 | 40.748485 | -74.003313 |
| 4 | 5 | Da Umberto | 54 | 24 | 19 | 21 | 0 | 40.739581 | -73.995910 |
import matplotlib.pyplot as plt
df = original_df.copy()
# Choose features
df = df[["latitude", "longitude", "Price", "Food", "Service", "Decor"]]
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(df)
df = pd.DataFrame(data=X, columns=df.columns)
# PCA
pca = PCA(n_components=None)
dfx_pca = pca.fit(df)
dfx_pca.explained_variance_ratio_
n_components = 3
pca = PCA(n_components=n_components)
df = pd.DataFrame(pca.fit_transform(df))
# # Pick number of clusters
# inertia = []
# for k in range(1, 10):
# kmeans = KMeans(n_clusters=k, random_state=42).fit(df)
# inertia.append(kmeans.inertia_)
# plt.plot(range(1, 10), inertia, marker="o")
# plt.xlabel("Number of clusters")
# plt.ylabel("Inertia")
# plt.title("Elbow Method")
# plt.show()
n_clusters = 6
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(df)
# original_df["kmeans"] = kmeans.labels_
# clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
# clusters_df
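The commented-out elbow plot above is one way to pick the number of clusters; the silhouette score is a common alternative. A sketch on synthetic blobs (not the restaurant data — make_blobs generates fake, well-separated clusters for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 known clusters, purely for demonstration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Score each candidate k; higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Unlike the elbow plot, this gives a single number per k, so you can pick the maximum programmatically instead of eyeballing a bend.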
import folium
fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
color = colors[kmeans.labels_[index]]
description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
# Display the map
fmap
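With clusters assigned, the recommendation idea from the pitch reduces to a same-cluster lookup. A sketch with a hypothetical mini dataframe standing in for original_df (restaurant names are from the dataset, but the cluster labels here are invented):

```python
import pandas as pd

# Hypothetical stand-in for original_df with k-means labels already attached.
toy_df = pd.DataFrame({
    "Restaurant": ["Daniella Ristorante", "Tello's Ristorante", "Da Umberto", "Bottino"],
    "kmeans": [0, 0, 1, 1],
})

def recommend_alternatives(df, name):
    """Return the other restaurants in the same k-means cluster as `name`."""
    cluster = df.loc[df["Restaurant"] == name, "kmeans"].iloc[0]
    same = df[(df["kmeans"] == cluster) & (df["Restaurant"] != name)]
    return same["Restaurant"].tolist()

print(recommend_alternatives(toy_df, "Daniella Ristorante"))
```

When a customer can't get a table, everything else in their restaurant's cluster becomes a candidate suggestion.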