Clustering Italian Restaurants¶
Here's a (hopefully) fun application of K-means clustering. I've provided a CSV of Italian restaurants with their names, latitudes, longitudes, and ratings.
You can read more about this dataset and its fields on Kaggle: https://www.kaggle.com/datasets/jcraggy/nyc-italian-restaurants-plus
Pitch¶
Imagine you're a data scientist for a startup that's trying to provide recommendations to restaurant-goers. When your customer can't get a table at their desired restaurant, you'd like to recommend an alternative they might find similar.
You're going to do this by k-means clustering the restaurants from this dataset and then visualizing the results by scatter-plotting them on a map.
Instructions¶
Load the dataset.
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

drive.mount('/content/gdrive')
df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_plus_loc.csv")
# Save a copy of the original df before you do any transformation to it; we'll want this later for plotting.
original_df = df.copy()
Do some k-means clustering. Feel free to do a simple k-means, or introduce feature scaling and/or PCA like we did in yesterday's Penguins clustering exercise (just like great art, great data science is often theft of prior work).
You pick the features to cluster on. You may not feel all of them are relevant to your clustering.
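If the scale-then-cluster step is unclear, here's a minimal sketch on made-up toy values (the column names Price and Food match this dataset, but the numbers are invented — this is not the real data):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented toy values: two cheap restaurants, two expensive ones.
toy_df = pd.DataFrame({"Price": [20, 22, 60, 65], "Food": [15, 16, 25, 26]})

# Scale so both features contribute comparably to the distance metric.
X = StandardScaler().fit_transform(toy_df)

# Cluster into 2 groups; each row gets a label in toy_kmeans.labels_.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(toy_kmeans.labels_)
```

The two cheap rows land in one cluster and the two expensive rows in the other; which cluster gets label 0 is arbitrary.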
Make sure you call the k-means results kmeans like we did in the prior exercise. If you call it something else, you'll just have to change the code below.
Use the new k-means clusters to group and average your original dataframe, and print it out. Each row in this new grouped table represents one of your clusters. How would you describe each cluster?
original_df["kmeans"] = kmeans.labels_
clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
print(clusters_df)
Once you've built your clusters, scatter plot the results on their original lat/longs over a map of Manhattan using folium.
import folium

fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
    color = colors[kmeans.labels_[index]]
    description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
    folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
# Display the map
fmap
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
drive.mount('/content/gdrive')
df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_plus_loc.csv")
# Save a copy of the original df before you do any transformation to it; we'll want this later for plotting.
original_df = df.copy()
df.head()
Mounted at /content/gdrive
|   | Case | Restaurant | Price | Food | Decor | Service | East | latitude | longitude |
|---|------|------------|-------|------|-------|---------|------|----------|-----------|
| 0 | 1 | Daniella Ristorante | 43 | 22 | 18 | 20 | 0 | 40.746831 | -73.996758 |
| 1 | 2 | Tello's Ristorante | 32 | 20 | 19 | 19 | 0 | 40.743421 | -73.999537 |
| 2 | 3 | Biricchino | 34 | 21 | 13 | 18 | 0 | 40.748864 | -73.995519 |
| 3 | 4 | Bottino | 41 | 20 | 20 | 17 | 0 | 40.748485 | -74.003313 |
| 4 | 5 | Da Umberto | 54 | 24 | 19 | 21 | 0 | 40.739581 | -73.995910 |
import matplotlib.pyplot as plt
df = original_df.copy()
# Choose features
df = df[["latitude", "longitude", "Price", "Food", "Service", "Decor"]]
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(df)
df = pd.DataFrame(data=X, columns=df.columns)
# PCA
pca = PCA(n_components=None)
dfx_pca = pca.fit(df)
dfx_pca.explained_variance_ratio_
n_components = 3
pca = PCA(n_components=n_components)
df = pd.DataFrame(pca.fit_transform(df))
# # Pick number of clusters
# inertia = []
# for k in range(1, 10):
# kmeans = KMeans(n_clusters=k, random_state=42).fit(df)
# inertia.append(kmeans.inertia_)
# plt.plot(range(1, 10), inertia, marker="o")
# plt.xlabel("Number of clusters")
# plt.ylabel("Inertia")
# plt.title("Elbow Method")
# plt.show()
n_clusters = 6
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(df)
# original_df["kmeans"] = kmeans.labels_
# clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
# clusters_df
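The commented-out elbow plot above is one way to pick the number of clusters; the silhouette score is a common alternative. A sketch on synthetic blobs (not the restaurant data — make_blobs generates fake, well-separated clusters for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 known clusters, purely for demonstration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Score each candidate k; higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Unlike the elbow plot, this gives a single number per k, so you can pick the maximum programmatically instead of eyeballing a bend.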
import folium
fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
color = colors[kmeans.labels_[index]]
description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
# Display the map
fmap
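With clusters assigned, the recommendation idea from the pitch reduces to a same-cluster lookup. A sketch with a hypothetical mini dataframe standing in for original_df (restaurant names are from the dataset, but the cluster labels here are invented):

```python
import pandas as pd

# Hypothetical stand-in for original_df with k-means labels already attached.
toy_df = pd.DataFrame({
    "Restaurant": ["Daniella Ristorante", "Tello's Ristorante", "Da Umberto", "Bottino"],
    "kmeans": [0, 0, 1, 1],
})

def recommend_alternatives(df, name):
    """Return the other restaurants in the same k-means cluster as `name`."""
    cluster = df.loc[df["Restaurant"] == name, "kmeans"].iloc[0]
    same = df[(df["kmeans"] == cluster) & (df["Restaurant"] != name)]
    return same["Restaurant"].tolist()

print(recommend_alternatives(toy_df, "Daniella Ristorante"))
```

When a customer can't get a table, everything else in their restaurant's cluster becomes a candidate suggestion.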