Selecting a multi-label classifier

In this document you will learn:

  • what classifier approaches are available in scikit-multilearn
  • how to perform classification

This section assumes that you have prepared a data set for classification and:

  • X_train, X_test variables contain the input feature matrices for training and testing
  • y_train, y_test variables contain the output label matrices for training and testing
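
To make the assumed layout concrete, here is a minimal sketch using plain nested lists as stand-ins for the matrices. The data and the split_rows helper are hypothetical and only for illustration; in practice these would be array-like or sparse matrices prepared beforehand.

```python
# Toy stand-ins for the matrices assumed in this section (hypothetical data).
# Each row of X is one sample; each row of y is that sample's binary label
# indicator vector: y[i][j] == 1 means sample i carries label j.
X = [[0.1, 1.2], [0.9, 0.3], [0.4, 0.8], [1.5, 0.2]]
y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1]]


def split_rows(rows, n_train):
    """Hypothetical helper: first n_train rows for training, rest for testing."""
    return rows[:n_train], rows[n_train:]


X_train, X_test = split_rows(X, 3)
y_train, y_test = split_rows(y, 3)

# Feature and label matrices must agree on the number of samples.
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)
```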

As noted in the Concepts guide, multi-label classification can be performed under three approaches:

  • algorithm adaptation approach
  • problem transformation approach
  • ensemble of multi-label classifiers approach

Adapted algorithms

The algorithm adaptation approach is based on a single-label classification method adapted for the multi-label classification problem. Scikit-multilearn provides:

Algorithm adaptation methods usually require parameter estimation. Selecting the best parameters for algorithm adaptation classifiers is discussed in Estimating parameters.

Example code for using skmultilearn.adapt.MLkNN looks like this:

from skmultilearn.adapt import MLkNN

classifier = MLkNN(k=3)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

Problem transformation

Problem transformation approaches are provided in the skmultilearn.problem_transform module. They require a scikit-learn compatible single-label base classifier, which is cloned one or more times during the problem transformation. Scikit-learn provides a variety of base classifiers such as:

Scikit-multilearn provides three problem transformation approaches:

Problem transformation classifiers take two arguments:

  • classifier - an instance of a base classifier object, to be cloned and refitted during the multi-label classifier's fit stage
  • require_dense - a [boolean, boolean] pair governing whether the base classifier receives dense or sparse arguments: the first value applies to the X matrix, the second to the y matrix. It is explained in detail in The data format for multi-label classification

An example of a Label Powerset transformation from multi-label classification to a single-label multi-class problem to be solved using a Gaussian Naive Bayes classifier:

from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# initialize Label Powerset multi-label classifier
# with a gaussian naive bayes base classifier
classifier = LabelPowerset(GaussianNB())

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
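
The idea behind the transformation can be sketched in plain Python: every distinct label combination becomes one class of a single-label multi-class problem, and predicted classes are mapped back to label combinations. This is an illustrative sketch only, not the library's implementation; the function names label_powerset_encode and label_powerset_decode are hypothetical.

```python
def label_powerset_encode(y):
    """Map each distinct label combination (row of y) to a class id."""
    combo_to_class = {}
    classes = []
    for row in y:
        combo = tuple(row)
        if combo not in combo_to_class:
            # First time this combination is seen: assign the next class id.
            combo_to_class[combo] = len(combo_to_class)
        classes.append(combo_to_class[combo])
    return classes, combo_to_class


def label_powerset_decode(classes, combo_to_class):
    """Turn class ids back into label indicator rows."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return [list(class_to_combo[c]) for c in classes]


# Hypothetical toy label matrix: 4 samples, 3 labels.
y_toy = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1]]
classes, mapping = label_powerset_encode(y_toy)
print(classes)  # [0, 1, 0, 2]
print(label_powerset_decode(classes, mapping) == y_toy)  # True
```

The base classifier is then trained on the class ids, so it can only ever predict label combinations that occurred in the training data.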

By default the base classifier will be provided with a dense representation, but some scikit-learn classifiers also support sparse representations. This is an example use of a Binary Relevance classifier with a single-label SVM classifier that can handle a sparse input matrix:

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC

# initialize Binary Relevance multi-label classifier
# with an SVM base classifier
# scikit-learn's SVM accepts a sparse representation only for the X matrix,
# so X is passed sparse (False) and y dense (True)
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False, True])

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
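
Conceptually, Binary Relevance fits one independent binary classifier per label column and stacks the per-label predictions. The following plain-Python sketch illustrates that idea under simplifying assumptions; it is not the library's code, and MajorityClassifier is a hypothetical stand-in for a real base classifier such as an SVM.

```python
class MajorityClassifier:
    """Hypothetical single-label base classifier: always predicts the
    most frequent training target (ties go to 1)."""

    def fit(self, X, targets):
        ones = sum(targets)
        self.prediction_ = 1 if 2 * ones >= len(targets) else 0
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


def binary_relevance_fit(X, y, make_classifier):
    """Fit one independent binary classifier per label column of y."""
    n_labels = len(y[0])
    return [
        make_classifier().fit(X, [row[j] for row in y])
        for j in range(n_labels)
    ]


def binary_relevance_predict(models, X):
    """Stack the per-label predictions column-wise into a label matrix."""
    per_label = [model.predict(X) for model in models]
    return [list(row) for row in zip(*per_label)]


# Hypothetical toy data: 4 samples, 1 feature, 3 labels.
X_toy = [[0.1], [0.9], [0.4], [1.5]]
y_toy = [[1, 0, 1], [0, 0, 1], [1, 0, 1], [1, 1, 0]]
models = binary_relevance_fit(X_toy, y_toy, MajorityClassifier)
print(binary_relevance_predict(models, [[0.5], [0.7]]))  # [[1, 0, 1], [1, 0, 1]]
```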

Ensemble approaches

In multi-label classification it is often useful to train several models, each on a subset of labels, especially for large label spaces: a well-selected smaller label subspace can allow more efficient classification. For this purpose the skmultilearn.ensemble module implements ensemble classification schemes that construct an ensemble of base multi-label classifiers.

Currently the following ensemble classification schemes are available in scikit-multilearn:

Example code for an ensemble of Label Powerset multi-label classifiers with random forest base classifiers, one trained per label subspace, where the label space is partitioned using fast greedy community detection on a label co-occurrence graph, looks like this:

from sklearn.ensemble import RandomForestClassifier
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

# construct base forest classifier
base_classifier = RandomForestClassifier()

# setup problem transformation approach with sparse matrices for random forest
problem_transform_classifier = LabelPowerset(classifier=base_classifier,
    require_dense=[False, False])

# partition the label space using fastgreedy community detection
# on a weighted label co-occurrence graph with self-loops allowed
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True,
    include_self_edges=True)

# setup the ensemble metaclassifier
classifier = LabelSpacePartitioningClassifier(problem_transform_classifier, clusterer)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
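
The partition-and-merge idea behind such an ensemble can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the partition is hand-picked here rather than found by community detection, and both partitioned_fit_predict and most_common_combination are hypothetical names.

```python
from collections import Counter


def partitioned_fit_predict(X_train, y_train, X_test, partition, fit_predict):
    """Train one multi-label model per label subset and merge the outputs.

    partition   : list of lists of label column indices (disjoint, covering)
    fit_predict : hypothetical callable (X_train, y_sub, X_test) -> rows of
                  predicted indicators for the label subspace
    """
    n_labels = len(y_train[0])
    merged = [[0] * n_labels for _ in X_test]
    for label_ids in partition:
        # Restrict the label matrix to this subspace.
        y_sub = [[row[j] for j in label_ids] for row in y_train]
        sub_pred = fit_predict(X_train, y_sub, X_test)
        # Write subspace predictions back into their original columns.
        for i, pred_row in enumerate(sub_pred):
            for j, value in zip(label_ids, pred_row):
                merged[i][j] = value
    return merged


def most_common_combination(X_train, y_sub, X_test):
    """Hypothetical stand-in model: predict the most frequent training
    label combination for every test sample."""
    best, _ = Counter(tuple(row) for row in y_sub).most_common(1)[0]
    return [list(best) for _ in X_test]


# Hypothetical toy data: 4 training samples, 4 labels, 2 test samples.
X_tr = [[0.1], [0.9], [0.4], [1.5]]
y_tr = [[1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0]]
X_te = [[0.5], [0.7]]
partition = [[0, 3], [1, 2]]  # hand-picked label subsets
merged = partitioned_fit_predict(X_tr, y_tr, X_te, partition,
                                 most_common_combination)
print(merged)  # [[1, 0, 1, 0], [1, 0, 1, 0]]
```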

MEKA classifiers

When you need a method not yet implemented in scikit-multilearn, a MEKA/MULAN wrapper is provided; it is described in the section Using the meka wrapper.