1. Getting started with scikit-multilearn

Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.

To install it, just run:

$ pip install scikit-multilearn

Scikit-multilearn works with Python 2 and 3 on Windows, Linux, and macOS. The module name is skmultilearn.
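
A quick sanity check is to import the module from the command line; if the installation succeeded, this produces no output:

$ python -c "import skmultilearn"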

In [1]:
from skmultilearn.dataset import load_dataset

Let’s load up some data. In this tutorial we will be working with the emotions data set (Trohidis et al., 2008), in which pieces of music are annotated with six emotion labels.
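
If you are curious which data sets load_dataset can fetch, skmultilearn.dataset also provides available_data_sets (assuming a recent scikit-multilearn release); the keys of the dictionary it returns are (set name, variant) pairs:

from skmultilearn.dataset import available_data_sets

# keys of the returned dict are (set name, variant) tuples, e.g. ('emotions', 'train')
print(set(name for name, variant in available_data_sets().keys()))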

In [2]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading

The feature_names variable contains a list of (feature name, type) pairs as provided in the original data set. In the case of the emotions data the authors write:

The extracted features fall into two categories: rhythmic and timbre.

Let’s take a look at the first few features:

In [3]:
feature_names[:10]
Out[3]:
[(u'Mean_Acc1298_Mean_Mem40_Centroid', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_Rolloff', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_Flux', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_0', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_1', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_2', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_3', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_4', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_5', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_6', u'NUMERIC')]
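
The feature and label matrices returned by load_dataset come in scipy sparse format; checking their shapes shows the number of examples, features, and labels we are working with:

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)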

The label_names variable contains a list of (label name, type) pairs for the labels that were used to annotate the music. The paper states that:

The Tellegen-Watson-Clark model was employed for labeling the data with emotions. The sound clips were annotated by three male experts of age 20, 25 and 30 from the School of Music Studies.

The label counts in the training data are as follows:

Label   Description        # Examples
L1      amazed-surprised   173
L2      happy-pleased      166
L3      relaxing-calm      264
L4      quiet-still        148
L5      sad-lonely         168
L6      angry-aggressive   189

Let’s see the contents of label_names:

In [4]:
label_names
Out[4]:
[(u'amazed-suprised', [u'0', u'1']),
 (u'happy-pleased', [u'0', u'1']),
 (u'relaxing-calm', [u'0', u'1']),
 (u'quiet-still', [u'0', u'1']),
 (u'sad-lonely', [u'0', u'1']),
 (u'angry-aggresive', [u'0', u'1'])]
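
Note that the spellings ‘amazed-suprised’ and ‘angry-aggresive’ come straight from the ARFF files of the original data set; scikit-multilearn reports the label names exactly as they appear there.

Now let’s train our first multi-label classifier. We will use the Binary Relevance problem transformation approach with a Support Vector Machine as the base classifier: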
In [5]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC
In [6]:
clf = BinaryRelevance(
    classifier=SVC(),
    require_dense=[False, True]
)

As a side note, Binary Relevance trains one binary classifier per label, and the require_dense=[False, True] setting tells the wrapper to pass the input matrix X to the base classifiers in sparse form while converting the label matrix y to a dense representation first. We can see that the per-label classifiers haven’t been trained yet:

In [7]:
clf.classifiers
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-5aa82f5c3cc2> in <module>()
----> 1 clf.classifiers

AttributeError: 'BinaryRelevance' object has no attribute 'classifiers'
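
Conceptually, training a Binary Relevance model is close to the following sketch (a simplification for illustration, not the library’s actual code): one independent binary classifier is fitted against each column of the label matrix.

import numpy as np
from sklearn.svm import SVC

def binary_relevance_fit(X, y):
    # y is an (n_samples, n_labels) indicator matrix;
    # train one independent binary classifier per label column
    y_dense = np.asarray(y.todense())
    return [SVC().fit(X, y_dense[:, i]) for i in range(y_dense.shape[1])]

def binary_relevance_predict(classifiers, X):
    # stack the per-label predictions back into an indicator matrix
    return np.column_stack([c.predict(X) for c in classifiers])

With six labels this amounts to training six SVMs, which is why the fitted clf.classifiers below contains six SVC objects.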

Scikit-learn introduces a convention for how classifiers are organized. The typical usage of a classifier is to:

  • fit it to the data (trains the classifier and returns self)
  • predict results on new data (returns predicted results)

Scikit-multilearn follows these conventions. Let’s train a multi-label classifier:

In [8]:
clf.fit(X_train, y_train)
Out[8]:
BinaryRelevance(classifier=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
        require_dense=[False, True])

The base classifiers have now been trained:

In [9]:
clf.classifiers
Out[9]:
[SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False)]
In [10]:
prediction = clf.predict(X_test)
In [11]:
prediction
Out[11]:
<202x6 sparse matrix of type '<type 'numpy.int64'>'
    with 246 stored elements in Compressed Sparse Column format>
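
The prediction comes back as a scipy sparse matrix with one row per test example and one column per label. To inspect a few rows, densify a slice:

import numpy as np

# convert the sparse prediction matrix to a dense array for inspection
print(np.asarray(prediction.todense())[:5])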

2. Measuring prediction quality

In [13]:
import sklearn.metrics as metrics

Scikit-learn provides a set of metrics useful for evaluating model quality. They are most often used by passing the true label assignment matrix/array as the first argument and the prediction matrix/array as the second.

In [14]:
metrics.hamming_loss(y_test, prediction)
Out[14]:
0.26485148514851486
In [15]:
metrics.accuracy_score(y_test, prediction)
Out[15]:
0.14356435643564355
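
Note that these two metrics measure very different things in the multi-label setting. Hamming loss is the fraction of individual label assignments that are wrong, while accuracy_score computes subset accuracy: a test example only counts as correct when all six of its labels match exactly. A quick manual computation (a sketch that first densifies the sparse matrices) reproduces both values:

import numpy as np

y_true = np.asarray(y_test.todense())
y_pred = np.asarray(prediction.todense())

# Hamming loss: fraction of wrongly predicted label slots
print((y_true != y_pred).mean())

# subset accuracy: fraction of examples with every label predicted correctly
print((y_true == y_pred).all(axis=1).mean())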