# Clustering Italian Restaurants

Here's a (hopefully) fun application of k-means clustering. I've provided a CSV of Italian restaurants with their names, latitudes and longitudes, and ratings.

You can read more about this dataset and its fields on Kaggle:
https://www.kaggle.com/datasets/jcraggy/nyc-italian-restaurants-plus


## Pitch

Imagine you're a data scientist for a startup that's trying to provide recommendations to restaurant-goers. When a customer can't get a table at their desired restaurant, you'd like to recommend an alternative they might find similar.

You're going to do this by k-means clustering the restaurants in this dataset and then visualizing the clusters by scatter-plotting them on a map.


## Instructions

1. Load the dataset.
   ```python
   from google.colab import drive
   import pandas as pd
   from sklearn.cluster import KMeans
   from sklearn.decomposition import PCA
   from sklearn.preprocessing import StandardScaler
   
   drive.mount('/content/gdrive')
   
   df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_italian.csv")
   
   # Save a copy of the original df before transforming it; we'll want it later for plotting.
   original_df = df.copy()

   # Inspect a random row with df.sample() (df.head() works too)
   print("Sample:")
   display(df.sample())
   ```

2. Do some k-means clustering. Feel free to do a simple k-means or to introduce feature scaling and/or PCA as we did in the [Penguins Clustering Exercise](https://colab.research.google.com/drive/1MtnMkyvg9x1oA9nSIwHemoQtsnlIPZwh?usp=sharing) (just like great art, great data science is often theft of prior work).

   You pick the features to cluster on; you may decide that not all of them are relevant.

   **Don't bother with the train/test split this time**; just cluster all of the data.

   **Make sure you name the fitted k-means object `kmeans`**, like we did in the prior exercise. If you call it something else, you'll have to change the code below.


3. Use the new k-means cluster labels to group your original dataframe, average each group, and print the result. Each row in this grouped table represents one of your clusters. How would you describe each cluster?

   ```python
   original_df["kmeans"] = kmeans.labels_
   clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
   print("Clusters:")
   display(clusters_df)
   ```

4. Once you've built your clusters, scatter-plot the results at their original lat/longs over a map of Manhattan using `folium`.
   ```python
   import seaborn as sns
   import matplotlib.pyplot as plt
   import folium
   
   fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
   colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
   
   # Plot each entry in original_df by its latitude and longitude on the folium map
   for index, row in original_df.iterrows():
       color = colors[kmeans.labels_[index]]
       description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
       folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
   
   # Display the map
   display(fmap)
   ```
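
Step 2 suggests optionally combining feature scaling, PCA, and k-means. One way to keep those steps together is a scikit-learn pipeline. The sketch below is only an illustration under assumptions: it runs on random stand-in data rather than the restaurant CSV, and the choices of 2 PCA components and 3 clusters are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numeric restaurant features (assumption: 4 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))

# Chain scaling, PCA, and k-means so a single fit() call runs all three steps.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
pipeline.fit(X_demo)

# make_pipeline names each step after its lowercased class name.
# Keeping the `kmeans` name means code that reads kmeans.labels_ still works.
kmeans = pipeline.named_steps["kmeans"]
print(kmeans.labels_[:10])
```

With the real data, `X_demo` would be replaced by whichever feature columns you choose to cluster on.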

In [2]:
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

drive.mount('/content/gdrive')

df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_italian.csv")

# Save a copy of the original df before transforming it; we'll want it later for plotting.
original_df = df.copy()

# Inspect five random rows with df.sample(5) (df.head() works too)
print("Sample:")
display(df.sample(5))

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Sample:


| | Case | Restaurant | Price | Food | Decor | Service | East | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 37 | 38 | Il Menestrello | 52 | 22 | 19 | 22 | 1 | 40.757374 | -73.97491 |
| 136 | 137 | Limoncello | 46 | 19 | 18 | 20 | 0 | 40.76141 | -73.982782 |
| 129 | 130 | Rainbow Grill | 65 | 19 | 23 | 18 | 0 | 40.759318 | -73.97935 |
| 73 | 74 | Maruzzella | 33 | 19 | 14 | 18 | 1 | 40.771311 | -73.953691 |
| 112 | 113 | Enoteca i Trulli | 43 | 23 | 20 | 21 | 1 | 40.742114 | -73.983603 |


In [17]:
X = df[["Price", "Food", "Decor", "Service", "East", "latitude", "longitude"]]

# Fit scaler and transform
scaler = StandardScaler()
X = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns,
    index=X.index
)

# Fit a full PCA and count the components that each explain more than 10% of the variance
pca = PCA(n_components=None)
pca_temp = pca.fit(X)
n_components = sum(pca_temp.explained_variance_ratio_ > 0.1)
print(f"Number of components with variance > 0.1: {n_components}")

# Now refit PCA, keeping only that many components
pca = PCA(n_components=n_components)
X = pd.DataFrame(
    pca.fit_transform(X),
    index=X.index
)

n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
train_clusters = kmeans.labels_

original_df["kmeans"] = kmeans.labels_
clusters_df = original_df.drop(["Case", "Restaurant"], axis=1).groupby("kmeans").mean()
print("Clusters:")
display(clusters_df)

Number of components with variance > 0.1: 3
Clusters:


| kmeans | Price | Food | Decor | Service | East | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 36.166667 | 19.055556 | 15.416667 | 17.5 | 0.972222 | 40.769489 | -73.962658 |
| 1 | 47.628571 | 21.457143 | 19.628571 | 20.6 | 0.514286 | 40.746842 | -73.9873 |
| 2 | 49.509434 | 22.283019 | 19.018868 | 21.188679 | 0.943396 | 40.769639 | -73.962396 |
| 3 | 35.909091 | 19.136364 | 16.409091 | 17.840909 | 0.068182 | 40.759697 | -73.986406 |
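
The cell above fixes `n_clusters = 4` by hand. One common way to sanity-check a choice like that is to compare silhouette scores across several values of k. The sketch below is not part of the original exercise: `best_k_by_silhouette` is a hypothetical helper, and the data are synthetic blobs standing in for the scaled restaurant features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_k_by_silhouette(X, k_values, random_state=42):
    """Fit KMeans for each k and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # The k with the highest mean silhouette score is the candidate to inspect first.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data (assumption): four blobs in three dimensions.
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

best_k, scores = best_k_by_silhouette(X_demo, range(2, 8))
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

In the notebook, the same helper could be called on the PCA-transformed `X` before fitting the final `kmeans`. A higher score only suggests tighter, better-separated clusters; whether the clusters are useful for recommendations is still a judgment call.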


In [18]:
import seaborn as sns
import matplotlib.pyplot as plt
import folium

fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']

# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
    color = colors[kmeans.labels_[index]]
    description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
    folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)

# Display the map
display(fmap)
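
To close the loop on the pitch, the cluster labels can drive a simple recommendation: given a restaurant the customer couldn't book, suggest others from the same cluster. The following is a minimal sketch under assumptions: `recommend_alternatives` is a hypothetical helper, the small dataframe is made-up demo data with hand-assigned labels rather than rows from the Kaggle file, and ranking by closeness in `Price` is just one possible ordering.

```python
import pandas as pd


def recommend_alternatives(df, name, n=3, label_col="kmeans"):
    """Return up to n other restaurants from the same cluster as `name`,
    ordered by how close their Price is to the requested restaurant's."""
    match = df[df["Restaurant"] == name]
    if match.empty:
        raise ValueError(f"Unknown restaurant: {name}")
    target = match.iloc[0]
    # Same cluster label, excluding the restaurant that was requested.
    same_cluster = df[(df[label_col] == target[label_col]) & (df["Restaurant"] != name)]
    price_gap = (same_cluster["Price"] - target["Price"]).abs()
    return same_cluster.loc[price_gap.sort_values().index].head(n)


# Made-up demo rows (not from the real dataset) with hand-assigned cluster labels.
demo_df = pd.DataFrame({
    "Restaurant": ["A", "B", "C", "D", "E"],
    "Price": [50, 48, 35, 55, 36],
    "kmeans": [0, 0, 1, 0, 1],
})

suggestions = recommend_alternatives(demo_df, "A", n=2)["Restaurant"].tolist()
print(suggestions)
```

In the notebook, the same call would presumably take `original_df` once its `kmeans` column has been added, for example `recommend_alternatives(original_df, "Limoncello")`.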