Decision Trees

Monday, July 8th

Today

  • Decision Trees
  • Feature Engineering
  • Unsupervised Machine Learning
Decision Trees

Decision Tree Learning

[Figure: decision tree]
Decision Trees

What is a Decision Tree?

  • A decision tree is a flowchart-like structure used for classification and regression tasks.
  • It recursively splits the dataset into subsets based on feature values, forming a tree of decisions.
Decision Trees

Problem Statement

Objective

  • Objective: Predict the class label of an instance based on its feature values.
  • Input: Numerical or categorical features and class labels.
  • Output: A decision tree that classifies instances into predefined classes.
Decision Trees

Decision Tree Prediction

X = pd.DataFrame({"num_legs": [4], "num_eyes": [2]})
model.predict(X)
dog
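For context, model above could be a scikit-learn DecisionTreeClassifier fit on a small animal dataset; the training data below is a hypothetical sketch, not the course's actual dataset:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset; the actual training data is not shown on the slide.
train = pd.DataFrame({
    "num_legs": [4, 4, 8, 8, 2, 2],
    "num_eyes": [2, 2, 8, 6, 2, 2],
    "species":  ["dog", "dog", "spider", "spider", "penguin", "penguin"],
})

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(train[["num_legs", "num_eyes"]], train["species"])

X = pd.DataFrame({"num_legs": [4], "num_eyes": [2]})
model.predict(X)  # expected to yield "dog" for this toy data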
Decision Trees

Why Use Decision Trees?

  • Interpretability: Easy to understand and visualize.
  • Feature Selection: Automatically selects important features.
  • Non-Parametric: No assumptions about the underlying data distribution.
  • Portability: You can easily port the model to code.
Decision Trees

Portability

def predict(x):
    # Hand-written version of the learned tree: each if/else mirrors an internal node.
    num_legs = x["num_legs"]
    num_eyes = x["num_eyes"]
    if num_legs >= 3:
        if num_eyes >= 3:
            return "spider"
        else:
            return "dog"
    else:
        return "penguin"
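For example, predict({"num_legs": 4, "num_eyes": 2}) returns "dog", with no machine learning library needed at inference time.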
Decision Trees

Tree Construction

  • Step 1: Start with the entire dataset at the root.
  • Step 2: Select the best feature to split the data based on the chosen criterion.
  • Step 3: Split the data into subsets.
  • Step 4: Recursively apply steps 2 and 3 to each subset.
  • Step 5: Stop splitting when a stopping condition is met (e.g., maximum depth, minimum instances per node).
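A minimal sketch of this greedy, recursive procedure (assuming numeric features and Gini impurity as the splitting criterion; the names and the max_depth stopping rule are illustrative):

import numpy as np

def gini(y):
    # Gini impurity of the labels at a node.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Greedy search over all (feature, threshold) pairs for the lowest weighted impurity.
    best, best_score = None, float("inf")
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            if left.all() or not left.any():
                continue  # a split must send instances to both sides
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    split = best_split(X, y)
    # Stop if the node is pure, the depth limit is hit, or no useful split exists.
    if len(np.unique(y)) == 1 or depth == max_depth or split is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j, t = split
    left = X[:, j] < t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}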
Decision Trees

Splitting Criteria - Gini Impurity

  • Measures the impurity of a node. Lower values indicate purer nodes.
  • Gini impurity for a node with classes $1, \dots, C$:

    $G = 1 - \sum_{i=1}^{C} p_i^2$

    where $p_i$ is the proportion of instances of class $i$ in the node.
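As a quick numeric check of the formula (the helper below is a sketch; NumPy is an assumed dependency):

import numpy as np

def gini_impurity(labels):
    # G = 1 - sum_i p_i^2, where p_i is the proportion of class i in the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

gini_impurity(["dog", "dog", "spider", "penguin"])  # 0.625
gini_impurity(["dog", "dog", "dog", "dog"])         # 0.0 (pure node)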
Decision Trees

Splitting Criteria - Entropy

  • Measures the randomness in the node. Lower values indicate less randomness.
  • Entropy for a node with classes $1, \dots, C$:

    $H = -\sum_{i=1}^{C} p_i \log_2 p_i$

    where $p_i$ is the proportion of instances of class $i$ in the node.
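A corresponding sketch for entropy (again assuming NumPy; names are illustrative):

import numpy as np

def entropy(labels):
    # H = -sum_i p_i * log2(p_i); pure nodes have entropy 0.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

entropy(["dog", "dog", "spider", "spider"])  # 1.0 (two equally likely classes)
entropy(["dog", "dog", "dog", "dog"])        # 0.0 (pure node)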
Decision Trees

Splitting Criteria - Information Gain

  • Measures the reduction in entropy after a split.
  • Information Gain for a split of a parent node $S$ into subsets $S_1, \dots, S_k$:

    $IG = H(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|} H(S_j)$

    where $H(S)$ is the entropy of the parent node, $S_j$ are the subsets formed by the split, and $|S_j|$ is the number of instances in subset $S_j$.
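Putting the two together, a sketch of information gain for a candidate split (illustrative names; NumPy assumed):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, subsets):
    # IG = H(parent) - sum_j (|S_j| / |S|) * H(S_j)
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["dog", "dog", "spider", "spider"]
information_gain(parent, [["dog", "dog"], ["spider", "spider"]])  # 1.0: perfect split
information_gain(parent, [["dog", "spider"], ["dog", "spider"]])  # 0.0: uninformative split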
Decision Trees

Pruning

  • Purpose: Reduce overfitting by removing branches that have little importance.
  • Types:
    • Pre-pruning: Stop growing the tree early based on a predefined condition.
      • Maximum depth.
      • Minimum instances per node.
    • Post-pruning: Grow the full tree and then remove branches that do not provide significant predictive power.
      • Cost complexity pruning.
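In scikit-learn, for example, pre-pruning corresponds to constructor parameters such as max_depth and min_samples_leaf, while post-pruning is done via cost complexity with ccp_alpha (a sketch on the iris dataset, which is not from the slides):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Post-pruning: grow the full tree, then prune weak branches (ccp_alpha > 0).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

pre_pruned.get_n_leaves(), post_pruned.get_n_leaves()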
Decision Trees

Cost Complexity Pruning

Add a penalty for tree complexity to the cost function:

$R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|$

where $R(T)$ is the misclassification rate of tree $T$, $\alpha$ is a complexity parameter, and $|\tilde{T}|$ is the number of leaves in the tree.
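scikit-learn exposes this directly: cost_complexity_pruning_path returns the effective alpha values at which subtrees get pruned, and ccp_alpha applies the penalty when fitting (again a sketch on the iris dataset, used only for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Effective alpha values at which successive subtrees would be pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")  # larger alpha -> fewer leaves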

Decision Trees

Advantages and Disadvantages

Advantages

  • Easy to understand and interpret.
  • Can handle both numerical and categorical data.
  • Requires little data preprocessing.

Disadvantages

  • Prone to overfitting, especially with deep trees.
  • Can be unstable; small changes in data can lead to different splits.
  • Greedy algorithms may not find the globally optimal tree.
Decision Trees

Exercise

https://shorturl.at/uk0fi