5. How to select a classifier

This document will guide you through the process of selecting a classifier for your problem.

Note that there is no established, scientifically proven rule-set for selecting a classifier to solve a general multi-label classification problem. Successful approaches often come from mixing intuitions about which classifiers are worth considering, decomposition into subproblems, and experimental model selection.

There are two things you need to consider before choosing a classifier:

  • performance, i.e. generalization quality: how well the model captures the relationship between features and labels. Note that for different use cases you might want to measure quality using different measures; we’ll talk about the measures in a moment
  • efficiency, i.e. how fast the classifier trains and predicts, whether it scales, and whether it is usable for your problem given the number of labels, samples, or label combinations

There are two ways to make the choice:

  • intuition based on asymptotic performance and results from empirical studies
  • data-driven model selection using cross-validated parameter search

Let’s load a data set first so we have something to work with.

In [2]:
from skmultilearn.dataset import load_dataset
In [4]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
emotions:train - does not exists downloading
Downloaded emotions-train
emotions:test - does not exists downloading
Downloaded emotions-test

Usually a classifier’s performance depends on four elements:

  • number of samples
  • number of labels
  • number of unique label combinations
  • number of features

We can obtain the first two from the shape of our output space matrices:

In [8]:
y_train.shape, y_test.shape
Out[8]:
((391, 6), (202, 6))

We can use numpy and the per-row lists of non-zero columns in the output matrices (the rows attribute of a scipy.sparse lil_matrix) to get the number of unique label combinations.

In [9]:
import numpy as np
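# each entry of .rows is the list of non-zero columns (i.e. assigned labels) of that row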
np.unique(y_train.rows).shape, np.unique(y_test.rows).shape
Out[9]:
((26,), (21,))

The number of features can be found in the shape of the input matrix:

In [10]:
X_train.shape[1]
Out[10]:
72

5.1. Intuitions

5.1.1. Generalization quality measures

There are several ways to measure a classifier’s generalization quality:

  • Hamming loss measures how well the classifier predicts each of the labels, i.e. the fraction of incorrectly predicted labels, averaged over samples, then over labels
  • accuracy score measures how well the classifier predicts label combinations, counting only exact matches, averaged over samples
  • Jaccard similarity measures the size of the intersection of the predicted and true label sets divided by the size of their union, averaged over samples
  • precision measures what fraction of the labels predicted for a sample are correct
  • recall measures what fraction of a sample’s true labels are predicted
  • F1 score measures the harmonic mean of precision and recall, in which both have the same impact on the score

These measures are conveniently provided by sklearn:

In [ ]:
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)
In [7]:
import sklearn.metrics as metrics

metrics.hamming_loss(y_test, prediction)
Out[7]:
0.2953795379537954
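
The remaining measures follow the same API. A minimal sketch of computing them on the same prediction (the Jaccard-based score is omitted, as its name differs across scikit-learn versions):

In [ ]:
# subset accuracy: exact match of the full label set, per sample
print(metrics.accuracy_score(y_test, prediction))

# precision, recall and F1, macro-averaged over labels
print(metrics.precision_score(y_test, prediction, average='macro'))
print(metrics.recall_score(y_test, prediction, average='macro'))
print(metrics.f1_score(y_test, prediction, average='macro'))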

5.1.2. Performance

Scikit-multilearn provides 11 classifiers that cover a wide variety of classification scenarios through label partitioning and ensemble classification. Let’s look at the important factors influencing their performance.

BRkNNaClassifier, BRkNNbClassifier

Parameter estimation needed: Yes, 1 parameter

Complexity: O(n_{labels} * n_{samples} * n_{features} * k)

BRkNN classifiers train a k Nearest Neighbors model per label and infer label assignment in one of two variants.

Strong sides:

  • takes some label relations into account while estimating single-label classifiers
  • works when distance between samples is a good predictor for label assignment; often used in the biosciences

Weak sides:

  • trains a classifier per label
  • less suitable for large label spaces
  • requires parameter estimation
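
As a usage sketch (k=3 is an arbitrary value here; in practice k should be estimated, e.g. with a grid search as in section 5.2):

In [ ]:
from skmultilearn.adapt import BRkNNaClassifier

# variant "a" of the BR-kNN classifier; k is illustrative, not tuned
classifier = BRkNNaClassifier(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)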

MLkNN

Parameter estimation needed: Yes, 2 parameters

Complexity: O((n_{samples} * n_{features} + n_{labels}) * k)

MLkNN uses k Nearest Neighbors to find the examples nearest to a test instance and applies Bayesian inference to select the assigned labels.

Strong sides:

  • estimates one multi-class subclassifier
  • works when distance between samples is a good predictor for label assignment
  • often used in the biosciences

Weak sides:

  • requires parameter estimation

MLARAM

Parameter estimation needed: Yes, 2 parameters

Complexity: O(n_{samples})

An ART (Adaptive Resonance Theory) classifier that clusters learned prototypes into large clusters to improve performance.

Strong sides:

  • linear in the number of samples, scales well

Weak sides:

  • requires parameter estimation
  • ART techniques have had generalization limits in the past
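
A usage sketch follows; the vigilance and threshold values are illustrative, not tuned, and we pass dense arrays, which MLARAM handles most reliably:

In [ ]:
from skmultilearn.adapt import MLARAM

# vigilance controls how eagerly prototypes are merged, threshold cuts off label activations
classifier = MLARAM(vigilance=0.9, threshold=0.02)
prediction = classifier.fit(X_train.toarray(), y_train.toarray()).predict(X_test.toarray())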

BinaryRelevance

Parameter estimation needed: Only for base classifier

Complexity: O(n_{labels} * base_single_class_classifier_complexity)

Transforms a multi-label classification problem with L labels into L separate single-label binary classification problems.

Strong sides:

  • estimates single-label classifiers
  • can generalize beyond available label combinations

Weak sides:

  • not suitable for a large number of labels
  • ignores label relations
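
A minimal sketch with Gaussian Naive Bayes as the base classifier; require_dense=[True, True] converts the sparse matrices per subproblem because GaussianNB does not accept sparse input:

In [ ]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

# one independent binary GaussianNB classifier per label
classifier = BinaryRelevance(classifier=GaussianNB(), require_dense=[True, True])
prediction = classifier.fit(X_train, y_train).predict(X_test)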

ClassifierChain

Parameter estimation needed: Yes, 1 + parameters for base classifier

Complexity: O(n_{labels} * base_single_class_classifier_complexity)

Trains a chain of binary classifiers, one per label; each classifier in the chain also receives the predictions of the preceding classifiers as additional features.

Strong sides:

  • estimates single-label classifiers
  • can generalize beyond available label combinations
  • takes label relations into account

Weak sides:

  • not suitable for a large number of labels
  • quality strongly depends on the label ordering in the chain
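
Usage mirrors Binary Relevance; a sketch with the same base classifier (by default the chain follows the order of the label columns):

In [ ]:
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB

# each classifier in the chain sees the predictions of the previous ones as extra features
classifier = ClassifierChain(classifier=GaussianNB(), require_dense=[True, True])
prediction = classifier.fit(X_train, y_train).predict(X_test)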

LabelPowerset

Parameter estimation needed: Only for base classifier

Complexity: O(base_multi_class_classifier_complexity(n_classes = n_label_combinations))

Transforms multi-label problem to a multi-class problem where each label combination is a separate class and uses a multi-class classifier to solve the problem.

Strong sides:

  • estimates label dependencies with only one classifier
  • often the best solution for subset accuracy if the training data contains all relevant label combinations

Weak sides:

  • requires all label combinations predictable by the classifier to be present in the training data
  • very prone to underfitting with large label spaces
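
A sketch with a multi-class Random Forest as the base classifier (n_estimators=100 is illustrative):

In [ ]:
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

# a single multi-class problem whose classes are the unique label combinations
# (26 of them in the emotions training set, as counted above)
classifier = LabelPowerset(classifier=RandomForestClassifier(n_estimators=100))
prediction = classifier.fit(X_train, y_train).predict(X_test)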

RakelD

Parameter estimation needed: Yes, 1 + base classifier’s parameters

Complexity: O(n_{partitions} * base_multi_class_classifier_complexity(n_classes = n_label_combinations_per_partition))

Randomly partitions label space and trains a Label Powerset classifier per partition with a base multi-class classifier.

Strong sides:

  • may use fewer classifiers than Binary Relevance and still generalize label relations while not underfitting like LabelPowerset

Weak sides:

  • the random partitioning approach is unlikely to draw an optimal label space division
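
A usage sketch (labelset_size=3 splits the six emotions labels into two disjoint partitions; the value is illustrative):

In [ ]:
from skmultilearn.ensemble import RakelD
from sklearn.naive_bayes import GaussianNB

# random disjoint label partitions of size 3, one Label Powerset per partition
classifier = RakelD(
    base_classifier=GaussianNB(),
    base_classifier_require_dense=[True, True],
    labelset_size=3
)
prediction = classifier.fit(X_train, y_train).predict(X_test)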

RakelO

Parameter estimation needed: Yes, 2 + base classifier’s parameters

Complexity: O(n_{partitions} * base_multi_class_classifier_complexity(n_classes = n_label_combinations_per_cluster))

Randomly draws label subspaces (possibly overlapping) and trains a Label Powerset classifier per subspace with a base multi-class classifier; labels are assigned based on voting.

Strong sides:

  • may provide better results with overlapping models

Weak sides:

  • requires a large number of classifiers to generate improvement; not scalable
  • random subspaces may not be optimal
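
A usage sketch (model_count=6 overlapping labelsets of size 3 is illustrative; the voting only pays off with a sufficiently large number of models):

In [ ]:
from skmultilearn.ensemble import RakelO
from sklearn.naive_bayes import GaussianNB

# 6 overlapping random labelsets of size 3; final labels decided by majority vote
classifier = RakelO(
    base_classifier=GaussianNB(),
    base_classifier_require_dense=[True, True],
    labelset_size=3,
    model_count=6
)
prediction = classifier.fit(X_train, y_train).predict(X_test)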

LabelSpacePartitioningClassifier

Parameter estimation needed: Only base classifier

Complexity: O(n_{partitions} * base_classifier_complexity(n_classes = n_label_combinations_per_partition))

Uses clustering methods to divide the label space into separate partitions and trains a base classifier per partition.

Strong sides:

  • accommodates different types of problems
  • infers when to divide into subproblems and when to use fewer classifiers than Binary Relevance
  • scalable to data sets with large numbers of labels
  • generalizes label relations well while not underfitting like LabelPowerset
  • does not require parameter estimation

Weak sides:

  • requires the label relationships present in the training data to be representative of the problem
  • partitioning may prevent certain label combinations from being correctly classified; this depends on the base classifier
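
A direct usage sketch with a label co-occurrence graph and louvain community detection, the same building blocks used in the grid search in section 5.2.2:

In [ ]:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import NetworkXLabelGraphClusterer, LabelCooccurrenceGraphBuilder
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# partition labels by communities detected in the label co-occurrence graph
graph_builder = LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False)
classifier = LabelSpacePartitioningClassifier(
    classifier=LabelPowerset(classifier=GaussianNB(), require_dense=[True, True]),
    clusterer=NetworkXLabelGraphClusterer(graph_builder, 'louvain')
)
prediction = classifier.fit(X_train, y_train).predict(X_test)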

MajorityVotingClassifier

Parameter estimation needed: Only base classifier

Complexity: O(n_{clusters} * base_classifier_complexity(n_classes = n_label_combinations_per_cluster))

Uses clustering methods to divide the label space into subspaces (possibly overlapping) and trains a base classifier per subspace; labels are assigned based on voting.

Strong sides:

  • accommodates different types of problems
  • infers when to divide into subproblems and when to use fewer classifiers than Binary Relevance
  • scalable to data sets with large numbers of labels
  • generalizes label relations well while not underfitting like LabelPowerset
  • does not require parameter estimation

Weak sides:

  • requires the label relationships present in the training data to be representative of the problem
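
A sketch with a hand-fixed, overlapping label space division; the clusters below are purely illustrative, and FixedLabelSpaceClusterer simply returns whatever division it is given:

In [ ]:
from skmultilearn.ensemble import MajorityVotingClassifier
from skmultilearn.cluster import FixedLabelSpaceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# labels 2 and 4 appear in two subproblems each; votes decide their final assignment
classifier = MajorityVotingClassifier(
    classifier=LabelPowerset(classifier=GaussianNB(), require_dense=[True, True]),
    clusterer=FixedLabelSpaceClusterer(clusters=[[0, 1, 2], [2, 3, 4], [4, 5]])
)
prediction = classifier.fit(X_train, y_train).predict(X_test)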

5.2. Data-driven model selection

Scikit-multilearn allows estimating parameters to select the best model for multi-label classification using scikit-learn’s model selection GridSearchCV API. In the simplest version it can look for the best parameters of a scikit-multilearn classifier, which we’ll show on the example case of estimating parameters for MLkNN; in the more complicated cases of problem transformation methods it can estimate both the method’s hyper-parameters and the base classifier’s parameters.

5.2.1. Estimating hyper-parameter k for MLkNN

In the case of estimating the hyper-parameters of a multi-label classifier, we first import the relevant classifier and scikit-learn’s GridSearchCV class. Then we define the values of the parameters we want to evaluate. We are interested in which combination of k - the number of neighbours - and s - the smoothing parameter - works best. We also need to select a measure which we want to optimize - we’ve chosen the F1 macro score.

After selecting the parameters we initialize and run the cross-validated grid search and print the best hyper-parameters.

In [5]:
from skmultilearn.adapt import MLkNN
from sklearn.model_selection import GridSearchCV

parameters = {'k': range(1,3), 's': [0.5, 0.7, 1.0]}

clf = GridSearchCV(MLkNN(), parameters, scoring='f1_macro')
clf.fit(X_train, y_train)

print (clf.best_params_, clf.best_score_)
({'k': 1, 's': 0.5}, 0.45223607257008969)

These values can then be used directly with the classifier.
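
For example, to train the final model with the parameters found above:

In [ ]:
# reuse the best parameter combination found by the grid search
classifier = MLkNN(**clf.best_params_)
prediction = classifier.fit(X_train, y_train).predict(X_test)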

5.2.2. Estimating hyper-parameter k for embedded classifiers

In problem transformation classifiers we often need to estimate not only a hyper-parameter, but also the parameters of the base classifier, and perhaps even the problem transformation method itself. Let’s take a look at this on a three-layer construction: an ensemble of problem transformation classifiers using label space partitioning. The parameters include:

  • classifier: the classifier for transforming the multi-label classification problem into single-label classification problems; we will decide between Label Powerset and Classifier Chains
  • classifier__classifier: the base classifier for the transformation strategy; we will use random forests here
  • classifier__classifier__n_estimators: the number of trees in the forest, passed to the random forest object
  • clusterer: a label space partitioning class; we will decide between two approaches provided by the NetworkX library.
In [10]:
from skmultilearn.problem_transform import ClassifierChain, LabelPowerset
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from skmultilearn.cluster import NetworkXLabelGraphClusterer
from skmultilearn.cluster import LabelCooccurrenceGraphBuilder
from skmultilearn.ensemble import LabelSpacePartitioningClassifier

from sklearn.svm import SVC

parameters = {
    'classifier': [LabelPowerset(), ClassifierChain()],
    'classifier__classifier': [RandomForestClassifier()],
    'classifier__classifier__n_estimators': [10, 20, 50],
    'clusterer' : [
        NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'louvain'),
        NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'lpa')
    ]
}

clf = GridSearchCV(LabelSpacePartitioningClassifier(), parameters, scoring='f1_macro')
clf.fit(X_train, y_train)

print (clf.best_params_, clf.best_score_)
({'classifier__classifier__n_estimators': 50, 'classifier__classifier': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False), 'classifier': LabelPowerset(classifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       require_dense=[True, True]), 'clusterer': <skmultilearn.cluster.networkx.NetworkXLabelGraphClusterer object at 0x7fc42ec75e50>}, 0.59569187181557981)