Mastering Multi-Label Classification in Python with Scikit-Learn

Welcome back! Today, we’re diving into multi-label classification.

In this post, I’ll guide you through setting up a multi-label classification pipeline using Scikit-Learn. We’ll build a synthetic dataset, train a classifier, and evaluate its performance with metrics tailored to multi-label tasks.


What is Multi-Label Classification?

Unlike traditional classification problems where each instance belongs to a single category, multi-label classification allows each instance to belong to multiple categories simultaneously. For example:

  • Email classification: Emails can be tagged as both “Work” and “Important.”
  • Image tagging: An image might be tagged with “Beach,” “Sunset,” and “Vacation.”

Multi-label classification also requires its own evaluation metrics, because a single prediction can be partially correct: some labels right, others wrong.
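Concretely, Scikit-Learn represents multi-label targets as a binary indicator matrix: one row per instance, one column per label, with a 1 wherever that label applies. Here’s a small illustrative example (the label names are hypothetical, echoing the email case above):

import numpy as np

# Hypothetical label matrix for 3 emails; columns = [Work, Important, Personal]
Y_example = np.array([
    [1, 1, 0],  # tagged "Work" and "Important"
    [0, 0, 1],  # tagged "Personal" only
    [1, 0, 1],  # tagged "Work" and "Personal"
])
print(Y_example.shape)  # (3, 3): 3 instances, 3 possible labels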


Step-by-Step Guide

1. Generating a Synthetic Multi-Label Dataset

To demonstrate multi-label classification, we’ll use Scikit-Learn’s make_multilabel_classification function. This generates a synthetic dataset where each instance can belong to multiple labels.

from sklearn.datasets import make_multilabel_classification

# Generate synthetic multi-label data
X, Y = make_multilabel_classification(
    n_samples=1000,  # Number of instances
    n_features=20,   # Number of features
    n_classes=5,     # Number of labels
    n_labels=2,      # Average number of labels per instance
    random_state=42
)

print("Feature matrix shape:", X.shape)  # (1000, 20)
print("Label matrix shape:", Y.shape)    # (1000, 5)

2. Splitting the Dataset

We’ll split the dataset into training and testing subsets to evaluate our model’s performance.

from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
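One caveat worth knowing: train_test_split’s stratify option doesn’t accept a multi-label indicator matrix, so the split above is purely random. A quick check that the label frequencies stayed comparable across the two splits:

# Per-label frequency in each split; large gaps would suggest an unlucky split
print("Train label frequencies:", Y_train.mean(axis=0).round(2))
print("Test label frequencies: ", Y_test.mean(axis=0).round(2))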

3. Training a Multi-Label Classifier

We’ll use a RandomForestClassifier, which natively supports multi-label classification: its trees predict all label columns jointly, so no wrapper is needed. Not every Scikit-Learn classifier accepts multi-label targets directly, though; for those that don’t, see the wrapper sketch after the training snippet below.

from sklearn.ensemble import RandomForestClassifier

# Train a RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, Y_train)
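If your preferred estimator doesn’t accept a multi-label target directly, Scikit-Learn’s MultiOutputClassifier wrapper fits one independent binary classifier per label column. A minimal sketch using LogisticRegression (any binary classifier would work here):

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Fits one LogisticRegression per label column, each as a separate binary problem
wrapped_model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
wrapped_model.fit(X_train, Y_train)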

4. Making Predictions

Once trained, we’ll use the model to predict labels for the test data.

# Predict labels for the test set
Y_pred = model.predict(X_test)
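If you need per-label probabilities rather than hard 0/1 predictions, for instance to tune a decision threshold, note that in the multi-label case RandomForestClassifier’s predict_proba returns a list with one (n_samples, 2) array per label. A sketch (the 0.5 threshold here simply reproduces the default predictions):

import numpy as np

# One (n_samples, 2) array per label; column 1 is the probability the label applies
proba_per_label = model.predict_proba(X_test)
Y_proba = np.column_stack([p[:, 1] for p in proba_per_label])

# Threshold the probabilities; lowering the threshold trades precision for recall
Y_pred_thresholded = (Y_proba >= 0.5).astype(int)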

5. Evaluating the Model

Evaluating multi-label models requires specialized metrics. They’re fairly intuitive, but which metric(s) should inform your decisions depends on your business problem; ultimately you’ll need to put on your business hat, or talk to a subject matter expert, to determine which are appropriate. Here are a few we’ll use:

  • Hamming Loss: Measures the fraction of individual label assignments, across all instances and labels, that are incorrect.
  • Jaccard Score: Measures label set similarity between predictions and true values.
  • Exact Match Accuracy: Evaluates the percentage of instances where the predicted label set matches the true label set exactly.
  • Classification Report: Provides detailed precision, recall, and F1 scores for each label.

from sklearn.metrics import classification_report, hamming_loss, jaccard_score, accuracy_score

# Hamming Loss: Lower is better
print("Hamming Loss:", hamming_loss(Y_test, Y_pred))

# Jaccard Score: Higher is better
print("Jaccard Score (samples average):", jaccard_score(Y_test, Y_pred, average='samples'))

# Exact Match Accuracy: Strict metric
exact_match_accuracy = accuracy_score(Y_test, Y_pred)
print("Exact Match Accuracy:", exact_match_accuracy)

# Detailed Classification Report
print("\nClassification Report:")
print(classification_report(Y_test, Y_pred, zero_division=0))
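To build intuition for what these metrics measure, here’s a tiny hand-checkable example with made-up true and predicted label matrices (two instances, three labels, unrelated to the model above):

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

Y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0]])
Y_pred_toy = np.array([[1, 1, 1],
                       [0, 1, 1]])

# 2 of the 6 individual label assignments are wrong -> 2/6 ≈ 0.33
print(hamming_loss(Y_true_toy, Y_pred_toy))

# Per-sample Jaccard: 2/3 for row 0 and 1/2 for row 1 -> mean ≈ 0.58
print(jaccard_score(Y_true_toy, Y_pred_toy, average='samples'))

# Neither row matches its true label set exactly -> 0.0
print(accuracy_score(Y_true_toy, Y_pred_toy))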

6. Sample Output

Here’s what the output might look like for a well-performing model (illustrative numbers, close to but not exactly what the code above produces):

Hamming Loss: 0.03
Jaccard Score (samples average): 0.85
Exact Match Accuracy: 0.78

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.85      0.87       300
           1       0.82      0.88      0.85       310
           2       0.86      0.80      0.83       290
           3       0.88      0.91      0.89       320
           4       0.85      0.86      0.85       280

   micro avg       0.86      0.86      0.86      1500
   macro avg       0.86      0.86      0.86      1500
weighted avg       0.86      0.86      0.86      1500
 samples avg       0.85      0.85      0.85      1500

Key Takeaways

  1. Scikit-Learn’s make_multilabel_classification makes it easy to create synthetic datasets for testing.
  2. The right metrics, like Hamming Loss and Jaccard Score, are essential for understanding performance in multi-label settings, but ultimately the choice of metric(s) depends on the business problem.
  3. Some Scikit-Learn classifiers, such as RandomForestClassifier, natively support multi-label learning; those that don’t can be adapted with wrappers like MultiOutputClassifier.