5. How to select a classifier¶
This document will guide you through the process of selecting a classifier for your problem.
Note that there is no established, scientifically proven rule-set for selecting a classifier to solve a general multi-label classification problem. Succesful approaches often come from mixing intuitions about which classifiers are worth considering, decomposition in to subproblems, and experimental model selection.
There are two things you need to consider before choosing a classifier:
- performance, i.e. generalization quality, how well will the model understand the relationship between features and labels, note that there for different use cases you might want to measure the quality using different measures, we’ll talk about the measures in a moment
- efficiency, i.e. how fast the classifier will perform, does it scale, is it usable in your problem based on number of labels, samples or label combinations
There are two ways to make the choice: - intuition based on asymptotic performance and results from empirical studies - data-driven model selection using cross-validated parameter search
Let’s load up a data set to see have some thing to work on first.
In [2]:
from skmultilearn.dataset import load_dataset
In [4]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ =load_dataset('emotions', 'test')
emotions:train - does not exists downloading
Downloaded emotions-train
emotions:test - does not exists downloading
Downloaded emotions-test
Usually classifier’s performance depends on three elements:
- number of samples
- number of labels
- number of unique label classes
- number of features
We can obtain the first two from the shape of our output space matrices:
In [8]:
y_train.shape, y_test.shape
Out[8]:
((391, 6), (202, 6))
We can use numpy and the list of rows with non-zero values in output matrices to get the number of unique label combinations.
In [9]:
import numpy as np
np.unique(y_train.rows).shape, np.unique(y_test.rows).shape
Out[9]:
((26,), (21,))
Number of features can be found in the shape of the input matrix:
In [10]:
X_train.shape[1]
Out[10]:
72
5.1. Intutions¶
5.1.1. Generalization quality measures¶
There are several ways to measure a classifier’s generalization quality:
- Hamming loss measures how well the classifier predicts each of the labels, averaged over samples, then over labels
- accuracy score measures how well the classifier predicts label combinations, averaged over samples
- jaccard similarity measures the proportion of predicted labels for a sample to its correct assignment, averaged over samples
- precision measures how many samples with ,
- recall measures how many samples ,
- F1 score measures a weighted average of precision and recall, where both have the same impact on the score
These measures are conveniently provided by sklearn:
In [ ]:
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)
In [7]:
import sklearn.metrics as metrics
metrics.hamming_loss(y_test, prediction)
Out[7]:
0.2953795379537954
5.1.2. Performance¶
Scikit-multilearn provides 11 classifiers that allow a strong variety of classification scenarios through label partitioning and ensemble classification, let’s look at the important factors influencing performance. $ g(x) $ denotes the performance of the base classifier in some of the classifiers.
Parameter estimation needed: Yes, 1 parameter
Complexity:
O(n_{labels} * n_{samples} * n_{features} * k)
BRkNN classifiers train a k Nearest Neighbor per label and use infer label assignment in one of the two variants.
Strong sides: - takes some label relations into account while estimating single-label classifers - works when distance between samples is a good predictor for label assignment. Often used in biosciences.
Weak sides: - trains a classifier per label - less suitable for large label space - requires parameter estimation.
Parameter estimation needed: Yes, 2 parameters
Complexity:
O((n_{samples} * n_{features} + n_{labels}) * k)
MLkNN builds uses k-NearestNeighbors find nearest examples to a test class and uses Bayesian inference to select assigned labels.
Strong sides: - estimates one multi-label SVM subclassifier without any one-vs-all or one-vs-rest comparisons, O(1) classifiers instead of O(l^2). - works when distance between samples is a good predictor for label assignment
Weak sides: - requires parameter estimation
Parameter estimation needed: Yes, 2 parameters
Complexity:
O((n_{samples} * n_{features} + n_{labels}) * k)
MLkNN builds uses k-NearestNeighbors find nearest examples to a test class and uses Bayesian inference to select assigned labels.
Strong sides: - estimates one multi-class subclassifier - works when distance between samples is a good predictor for label assignment - often used in biosciences.
Weak sides: - requires parameter estimation
Parameter estimation needed: Yes, 2 parameters
Complexity:
O(n_{samples})
An ART classifier which uses clustering of learned prototypes into large clusters improve performance.
Strong sides: - linear in number of samples, scales well
Weak sides: - requires parameter estimation - ART techniques have had generalization limits in the past
Parameter estimation needed: Only for base classifier
Complexity:
O(n_{labels} * base_single_class_classifier_complexity)
Transforms a multi-label classification problem with L labels into L single-label separate binary classification problems.
Strong sides: - estimates single-label classifiers - can generalize beyond avialable label combinations
Weak sides: - not suitable for large number of labels - ignores label relations
Parameter estimation needed: Yes, 1 + parameters for base classifier
Complexity:
O(n_{labels} * base_single_class_classifier_complexity)
Transforms multi-label problem to a multi-class problem where each label combination is a separate class.
Strong sides: - estimates single-label classifiers - can generalize beyond avialable label combinations - takes label relations into account
Weak sides: - not suitable for large number of labels - quality strongly depends on the label ordering in chain.
Parameter estimation needed: Only for base classifier
Complexity:
O(base_multi_class_classifier_complexity(n_classes = n_label_combinations))
Transforms multi-label problem to a multi-class problem where each label combination is a separate class and uses a multi-class classifier to solve the problem.
Strong sides: - estimates label dependencies, with only one classifier - often best solution for subset accuracy if training data contains all relevant label combinations
Weak sides: - requires all label combinations predictable by the classifier to be present in the training data - very prone to underfitting with large label spaces
Parameter estimation needed: Yes, 1 + base classifier’s parameters Complexity:
O(n_{partitions} * base_multi_class_classifier_complexity(n_classes = n_label_combinations_per_partition))
Randomly partitions label space and trains a Label Powerset classifier per partition with a base multi-class classifier.
Strong sides:
- may use less classifiers than Binary Relevance and still generalize label relations while not underfitting like LabelPowerset
Weak sides:
- using random approach is not very probable to draw an optimal label space division
Parameter estimation needed: Yes, 2 + base classifier’s parameters Complexity:
O(n_{partitions} * base_multi_class_classifier_complexity(n_classes = n_label_combinations_per_cluster))
Randomly draw label subspaces (possibly overlapping) and trains a Label Powerset classifier per partition with a base multi-class classifier, labels are assigned based on voting.
Strong sides:
- may provide better results with overlapping models
Weak sides:
- takes large number of classifiers to generate improvement, not scalable
- random subspaces may not be optimal
Parameter estimation needed: Only base classifier Complexity:
O(n_{partitions} * base_classifier_complexity(n_classes = n_label_combinations_per_partition))
Uses clustering methods to divide the label space into subspaces and trains a base classifier per partition with a base multi-class classifier.
Strong sides:
- accomodates to different types of problems
- infers when to divide into subproblems or not and decide when to use less classifiers than Binary Relevance
- scalable to data sets with large numbers of labels
- generalizes label relations well while not underfitting like LabelPowerset
- does not require parameter estimation
Weak sides:
- requires label relationships present in training data to be representable of the problem
- partitioning may prevent certain label combinations from being correctly classified, depends on base classifier
Parameter estimation needed: Only base classifier Complexity:
O(n_{clusters} * base_classifier_complexity(n_classes = n_label_combinations_per_cluster))
Uses clustering methods to divide the label space into subspaces (possibly overlapping) and trains a base classifier per partition with a base multi-class classifier, labels are assigned based on voting.
Strong sides:
- accomodates to different types of problems
- infers when to divide into subproblems or not and decide when to use less classifiers than Binary Relevance
- scalable to data sets with large numbers of labels
- generalizes label relations well while not underfitting like LabelPowerset
- does not require parameter estimation
Weak sides:
- requires label relationships present in training data to be representable of the problem
Parameter estimation needed: Only for embedder Complexity: depends on the selection of embedder, regressor and classifier
Embedds the label space, trains a regressor (or many) for unseen samples to predict their embeddings, and a classifier to correct the regression error
Strong sides:
- improves discriminability and joint label probability distributions
- good results with low-complexity linear embeddings and weak regressors/classifiers
- Weak sides:
- requires some parameter estimation while rule-of-thumb ideas exist in papers
5.2. Data-driven model selection¶
Scikit-multilearn allows estimating parameters to select best models for multi-label classification using scikit-learn’s model selection GridSearchCV API. In the simplest version it can look for the best parameter of a scikit-multilearn’s classifier, which we’ll show on the example case of estimating parameters for MLkNN, and in the more complicated cases of problem transformation methods it can estimate both the method’s hyper parameters and the base classifiers parameter.
5.2.1. Estimating hyper-parameter k for MLkNN¶
In the case of estimating the hyperparameter of a multi-label
classifier, we first import the relevant classifier and scikit-learn’s
GridSearchCV class. Then we define the values of parameters we want to
evaluate. We are interested in which combination of k
- the number
of neighbours, s
- the smoothing parameter works best. We also need
to select a measure which we want to optimize - we’ve chosen the F1
macro score.
After selecting the parameters we intialize and _run the cross validation grid search and print the best hyper parameters.
In [5]:
from skmultilearn.adapt import MLkNN
from sklearn.model_selection import GridSearchCV
parameters = {'k': range(1,3), 's': [0.5, 0.7, 1.0]}
clf = GridSearchCV(MLkNN(), parameters, scoring='f1_macro')
clf.fit(X_train, y_train)
print (clf.best_params_, clf.best_score_)
({'k': 1, 's': 0.5}, 0.45223607257008969)
These values can be then used directly with the classifier.
5.2.2. Estimating hyper-parameter k for embedded classifiers¶
In problem transformation classifiers we often need to estimate not only a hyper parameter, but also the parameter of the base classifier, and also - maybe even the problem transformation method. Let’s take a look at this on a three-layer construction of ensemble of problem transformation classifiers using label space partitioning, the parameters include:
classifier
: which takes a parameter - a classifier for transforming multi-label classification problem to a single-label classification, we will decide between the Label Powerset and Classifier Chainsclassifier__classifier
: which is the base classifier for the transformation strategy, we will use random forests hereclassifier__classifier__n_estimators
: the number of trees to be used in the forest, will be passed to the random forest objectclusterer
: a label space partitioning class, we will decide between two approaches provided by the NetworkX library.
In [10]:
from skmultilearn.problem_transform import ClassifierChain, LabelPowerset
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from skmultilearn.cluster import NetworkXLabelGraphClusterer
from skmultilearn.cluster import LabelCooccurrenceGraphBuilder
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from sklearn.svm import SVC
parameters = {
'classifier': [LabelPowerset(), ClassifierChain()],
'classifier__classifier': [RandomForestClassifier()],
'classifier__classifier__n_estimators': [10, 20, 50],
'clusterer' : [
NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'louvain'),
NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'lpa')
]
}
clf = GridSearchCV(LabelSpacePartitioningClassifier(), parameters, scoring = 'f1_macro')
clf.fit(X_train, y_train)
print (clf.best_params_, clf.best_score_)
({'classifier__classifier__n_estimators': 50, 'classifier__classifier': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False), 'classifier': LabelPowerset(classifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
require_dense=[True, True]), 'clusterer': <skmultilearn.cluster.networkx.NetworkXLabelGraphClusterer object at 0x7fc42ec75e50>}, 0.59569187181557981)
In [ ]: