# Introduction to Machine Learning with sklearn

In this exercise, we'll learn the basics of machine learning using sklearn (scikit-learn), one of Python's most widely used ML libraries. We'll predict airline passenger numbers over time.

## Part 1: Load and Prepare the Data


In [None]:
# JUST RUN THIS

from google.colab import drive
import pandas as pd

drive.mount('/content/gdrive')

# Load and clean the data
df = pd.read_csv('/content/gdrive/MyDrive/datasets/air_passengers.csv')
df.rename(columns={'#Passengers': 'Passengers'}, inplace=True)
df['Month'] = pd.to_datetime(df['Month'])

# Create our feature: months since start
df['Month_Count'] = (df['Month'].dt.year - 1949) * 12 + df['Month'].dt.month - 1

# Look at our data
print(df.head())
print(f"\nWe have {len(df)} months of data")


## Part 2: Preparing Data for Machine Learning

In machine learning:
- **Features (X)**: What we use to make predictions (Month_Count)
- **Labels (y)**: What we're trying to predict (Passengers)
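
As a minimal sketch of what `X` and `y` look like, the snippet below builds a tiny made-up table and derives `Month_Count` with the same formula as Part 1. The three dates, the passenger counts, and the names `toy`, `X_toy`, and `y_toy` are illustrative assumptions, not values from the real file.

```python
import pandas as pd

# Toy data: three made-up rows standing in for the real CSV
toy = pd.DataFrame({
    "Month": pd.to_datetime(["1949-01-01", "1949-12-01", "1950-01-01"]),
    "Passengers": [100, 110, 120],
})

# Months elapsed since January 1949 (same formula as in Part 1)
toy["Month_Count"] = (toy["Month"].dt.year - 1949) * 12 + toy["Month"].dt.month - 1

X_toy = toy[["Month_Count"]]  # double brackets -> 2D DataFrame
y_toy = toy["Passengers"]     # single brackets -> 1D Series

print(toy["Month_Count"].tolist())  # [0, 11, 12]
print(X_toy.shape, y_toy.shape)     # (3, 1) (3,)
```

sklearn expects features as a 2D table (rows are samples, columns are features), which is why `X` keeps its column dimension even with a single feature.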

We also need to split our data:
- **Training data**: Used to learn the pattern
- **Testing data**: Used to check if we learned correctly
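
For data ordered in time, one common approach is to split by date so that the test set lies entirely after the training set. Here is a rough sketch on made-up data; the `cutoff` date and the `toy` table are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy monthly series: 24 made-up months starting January 1949
toy = pd.DataFrame({"Month": pd.date_range("1949-01-01", periods=24, freq="MS")})
toy["Value"] = range(24)

# Split by date rather than at random, so the test rows all come later in time
cutoff = "1950-07-01"  # hypothetical cutoff for this toy example
train = toy[toy["Month"] < cutoff]
test = toy[toy["Month"] >= cutoff]

print(len(train), len(test))  # 18 6
```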



In [None]:
# JUST RUN THIS

# Create X and y for sklearn
# X needs to be 2D (that's why we use double brackets)
X = df[['Month_Count']]  # Features - notice the double brackets!
y = df['Passengers']     # Labels

print(f"X shape: {X.shape}")  # Should be (144, 1)
print(f"y shape: {y.shape}")  # Should be (144,)

# Split into training and testing sets
# Everything before 1958 is training
train_mask = df['Month'] < '1958-01-01'
test_mask = df['Month'] >= '1958-01-01'

X_train = X[train_mask]
y_train = y[train_mask]
X_test = X[test_mask]
y_test = y[test_mask]

print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")


## Part 3: Your First sklearn Model

Let's implement a function that trains a linear regression model.

First, initialize a model:
```python
model = LinearRegression()
```

Then call its `.fit` method, passing it `X_train` and `y_train`:

```python
model.fit(X_train, y_train)
```

Make sure you return the model when you're done.
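
To see what fitting does before you try it on the passenger data, here is a small sketch on made-up points that lie exactly on the line y = 2x + 1. The names `X_toy`, `y_toy`, and `toy_model` are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on the line y = 2x + 1
X_toy = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
y_toy = 2 * X_toy["x"] + 1

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# coef_ holds one slope per feature; intercept_ is a single number
print(round(toy_model.coef_[0], 2), round(toy_model.intercept_, 2))  # 2.0 1.0
```

Because the toy points sit on a perfect line, the fitted slope and intercept recover 2 and 1. Real data such as the passenger counts will not fit this cleanly.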


In [None]:
# EDIT THIS

from sklearn.linear_model import LinearRegression

def train_model(X_train, y_train):
    # Input: X_train is a DataFrame of features
    #        y_train is a Series of labels
    # Output: Returns a trained model

    # TODO: Your code here!
    # 1. Create a LinearRegression model
    # 2. Train it using .fit(X_train, y_train)
    # 3. Return the trained model
    pass

# Train the model using your function
model = train_model(X_train, y_train)

# Check what the model learned
print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Line equation: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")


## Part 4: Making Predictions

Now implement a function that uses the trained model to make predictions.

The input to this function is a set of `X` values, such as the `X_test` values that we _didn't_ train the model on (in our case, `X` is just the month count).

You'll do this using:

```python
y_pred = model.predict(X_values)
```

The output is the labels `y` that the model predicts for those `X` values, returned as a NumPy array (in our case, `y` is the number of passengers for the month in `X`).
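
As a rough illustration of `.predict`, the sketch below fits a model on made-up points from the line y = 3x and then asks for predictions at `x` values it never saw. All names and numbers here are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit on made-up points from the line y = 3x
X_seen = pd.DataFrame({"x": [0, 1, 2]})
y_seen = pd.Series([0, 3, 6])
toy_model = LinearRegression().fit(X_seen, y_seen)

# Predict for x values the model never saw during training
X_new = pd.DataFrame({"x": [10, 20]})
y_new = toy_model.predict(X_new)

print(type(y_new).__name__, y_new.shape)  # ndarray (2,)
print(np.round(y_new, 2))
```

Note that `.predict` returns a NumPy array with one value per input row. If you need a pandas Series, you can wrap the result yourself, for example with `pd.Series(y_new, index=X_new.index)`.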

In [None]:
# EDIT THIS

def make_predictions(model, X_values):
    # Input: model is a trained sklearn model
    #        X_values is a DataFrame of features
    # Output: Returns a NumPy array of predictions

    # TODO: Use model.predict() to get predictions
    # Hint: predictions = model.predict(X_values)
    pass

# Test your function
predictions = make_predictions(model, X)
print(f"First 5 predictions: {pd.Series(predictions).head()}")


## Part 5: Visualize the Results

Once you've made the predictions, we can visualize the results.

In [None]:
# JUST RUN THIS

# Add predictions to our dataframe
df['Predictions'] = predictions

# Plot actual vs predicted
ax = df.plot(x='Month', y='Passengers', label='Actual', figsize=(10, 6))
df.plot(x='Month', y='Predictions', ax=ax, color='red', label='Predicted')

# Add a line showing train/test split
import matplotlib.pyplot as plt
plt.axvline(pd.to_datetime('1958-01-01'), color='green', linestyle='--', label='Train/Test Split')
plt.title('Linear Regression Predictions')
plt.legend()
plt.show()

## Part 6: Evaluate the Model

How good are our predictions? At this point, we have two sets of values (vectors):

- `y_test`: The real passenger numbers we set aside for testing and did not use to train the model.
- `y_pred`: The passenger numbers our linear regression model predicts for those same months (stored as `test_predictions` in the cell below).

Both cover the same months: one holds the actual data, and the other holds the model's predictions.

To evaluate our model, we'll calculate the Mean Squared Error (MSE):

1. Find the difference between the real values (`y_test`) and the predicted values (`y_pred`) for each month.
2. Square each difference.
3. Calculate the average of these squared differences.
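
Once your own function works, one way to sanity-check it is to compare against sklearn's built-in `mean_squared_error`. The sketch below uses three made-up values so the arithmetic is easy to follow by hand; `actual`, `predicted`, and `toy_mse` are illustrative names.

```python
from sklearn.metrics import mean_squared_error

# Made-up values for three months
actual = [3, 5, 7]
predicted = [2, 5, 9]

# Errors are 1, 0, -2; squares are 1, 0, 4; their mean is 5/3
toy_mse = mean_squared_error(actual, predicted)
print(round(toy_mse, 4))  # 1.6667
```

Because the errors are squared, one large miss raises the MSE more than several small ones, and the result is in squared units (here, passengers squared).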

In [None]:
# EDIT THIS

def calculate_test_mse(y_true, predictions):
    # Input: y_true is actual values (Series)
    #        predictions is predicted values (Series)
    # Output: Returns the MSE (a single number)

    # TODO: Calculate MSE
    # 1. Calculate errors: y_true - predictions
    # 2. Square the errors
    # 3. Return the mean
    pass

# Calculate MSE on test data only
test_predictions = predictions[test_mask]
test_mse = calculate_test_mse(y_test, test_predictions)
print(f"Test MSE: {test_mse:.2f}")

## Bonus: A Better Model

Notice how our data curves upward? Linear regression on `Month_Count` alone can only fit a straight line. Here's a model that can fit curved patterns by also using the square of `Month_Count`:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# This creates a model that can fit curves!
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])

# Try it out if you're curious!
poly_model.fit(X_train, y_train)
poly_predictions = poly_model.predict(X)

# Plot the curved predictions
df['Poly_Predictions'] = poly_predictions
ax = df.plot(x='Month', y='Passengers', label='Actual', figsize=(10, 6))
df.plot(x='Month', y='Predictions', ax=ax, color='red', label='Linear')
df.plot(x='Month', y='Poly_Predictions', ax=ax, color='green', label='Polynomial')
plt.legend()
plt.show()
```

This polynomial model can follow the upward curve in airline passengers more closely than a straight line. To check whether it actually predicts better, compare its test MSE with the linear model's.
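
To see what the `PolynomialFeatures` step contributes, here is a small sketch that applies it to two made-up month counts. The `months` array and the name `expanded` are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up month counts as a 2D array (rows = samples, columns = features)
months = np.array([[2], [3]])

# degree=2 expands each x into the columns [1, x, x^2]
expanded = PolynomialFeatures(degree=2).fit_transform(months)
print(expanded.tolist())  # [[1.0, 2.0, 4.0], [1.0, 3.0, 9.0]]
```

The `LinearRegression` step then fits one coefficient per column, so the model is still linear in its coefficients even though the fitted curve bends with `x`.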


In [None]:
# BONUS CODE HERE


