Categorical Supervised Machine Learning Algorithms

Logistic Regression

Not to be confused with Linear Regression!

Problem Statement

  • Logistic regression is used for binary classification problems.
  • It models the probability that a given input belongs to a particular category.

Supervised Learning Examples


Credit Card Fraud Detection

  • Features: Vendor, location, time, distance from last transaction
  • Labels: Chargebacks on previous transactions

A Breief Interlude into Probability

  • The probability of something happening is denoted as:

  • Two independent events have a joint probability:

  • For example, if the probability of flipping a coin on heads once is 1/2 than the probability of flipping it twice and getting heads both times is:

Conditional Probability

  • The probability of the positive class y given some non-independent event x is denoted as:

  • Bayes' theorem relates the probability of the positive class to the likelihood and prior probability:

COVID-19 Infection Example

  • Question: I have a cough, what's the likelihood I have COVID-19?

    • : Prior probability of having COVID-19 (e.g., infection rate in the population)
    • : Likelihood of having symptoms given that you have COVID-19
    • : Probability of observing the symptoms (e.g., general probability of being sick)
  • Using Bayes' theorem:

Example Calculation

  • Suppose:

    • (5% of the population is infected)
    • (90% of infected people show symptoms)
    • (20% of the population shows symptoms)
  • Then:

  • Interpretation: Given that you have symptoms, there is a 22.5% probability that you have COVID-19.

Logistic Function

  • The logistic function (or sigmoid function) is defined as:

    where

Probability Prediction

  • The model predicts the probability of the positive class:

Decision Boundary

  • The decision boundary is defined as:

Loss Function and Optimization

  • Logistic regression uses the cross-entropy loss function:
  • The model parameters are optimized using gradient descent to minimize the loss function.

Validating Classification Models

  • Train/Test Split: Randomly remove 20% of the examples to evaluate the model's performance.

    df_train = df.sample(frac=0.8)
    df_test = df.drop(df_train.index)
    

    or

    df_train, df_test = train_test_split(df, test_size=0.2)
    
  • Accuracy: The proportion of correctly classified instances.

Exercise

https://shorturl.at/QwcIA

TODO include precision and recall in here