1. Getting started with scikit-multilearn
Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.
To install it, just run:
$ pip install scikit-multilearn
Scikit-multilearn works with Python 2 and 3 on Windows, Linux and OSX.
The module name is skmultilearn.
In [1]:
from skmultilearn.dataset import load_dataset
Let’s load some data. In this tutorial we will be working with the emotions data set.
In [2]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading
The feature_names variable contains a list of (feature name, type) pairs as provided in the original data set. In the case of the emotions data, the authors write:
The extracted features fall into two categories: rhythmic and timbre.
Let’s take a look at the first few features:
In [3]:
feature_names[:10]
Out[3]:
[(u'Mean_Acc1298_Mean_Mem40_Centroid', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_Rolloff', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_Flux', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_0', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_1', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_2', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_3', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_4', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_5', u'NUMERIC'),
(u'Mean_Acc1298_Mean_Mem40_MFCC_6', u'NUMERIC')]
The label_names variable contains a list of (label name, type) pairs for the labels that were used to annotate the music. The paper states that:
The Tellegen-Watson-Clark model was employed for labeling the data with emotions. The sound clips were annotated by three male experts of age 20, 25 and 30 from the School of Music Studies.
The label counts in the training data are as follows:
Label | Description | # Examples |
---|---|---|
L1 | amazed-surprised | 173 |
L2 | happy-pleased | 166 |
L3 | relaxing-calm | 264 |
L4 | quiet-still | 148 |
L5 | sad-lonely | 168 |
L6 | angry-aggressive | 189 |
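Counts like these can be recovered directly from the training matrix: summing the label indicator matrix along axis 0 gives the number of examples per label. A minimal sketch, with a small toy matrix standing in for the y_train returned by load_dataset (the computation is identical on the real sparse matrix):

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy stand-in for y_train: a sparse (n_samples x n_labels)
# 0/1 indicator matrix (here 4 samples, 3 labels).
y_toy = lil_matrix(np.array([[1, 0, 1],
                             [0, 1, 1],
                             [1, 0, 0],
                             [1, 1, 0]]))

# Column sums give the number of examples annotated with each label.
counts = np.asarray(y_toy.sum(axis=0)).ravel()
print(counts)  # [3 2 2]
```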
Let’s see the contents of label_names:
In [4]:
label_names
Out[4]:
[(u'amazed-suprised', [u'0', u'1']),
(u'happy-pleased', [u'0', u'1']),
(u'relaxing-calm', [u'0', u'1']),
(u'quiet-still', [u'0', u'1']),
(u'sad-lonely', [u'0', u'1']),
(u'angry-aggresive', [u'0', u'1'])]
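Since each entry is a (label name, value set) pair, the bare label names can be pulled out with a simple comprehension. A sketch with a toy stand-in for the list above:

```python
# Toy stand-in for the label_names list returned by load_dataset.
label_names_toy = [('amazed-suprised', ['0', '1']),
                   ('happy-pleased', ['0', '1'])]

# Keep only the first element of each (name, values) pair.
names = [name for name, _ in label_names_toy]
print(names)  # ['amazed-suprised', 'happy-pleased']
```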
In [5]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC
In [6]:
clf = BinaryRelevance(
classifier=SVC(),
require_dense=[False, True]
)
On a side note, Binary Relevance trains one classifier per label. We can see that the classifiers haven’t been trained yet:
In [7]:
clf.classifiers
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-5aa82f5c3cc2> in <module>()
----> 1 clf.classifiers
AttributeError: 'BinaryRelevance' object has no attribute 'classifiers'
Scikit-learn introduces a convention for how classifiers are organized. The typical usage of a classifier is to:
- fit it to the data (trains the classifier and returns self)
- predict results on new data (returns predicted results)
Scikit-multilearn follows these conventions. Let’s train a multi-label classifier:
In [8]:
clf.fit(X_train, y_train)
Out[8]:
BinaryRelevance(classifier=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
require_dense=[False, True])
The base classifiers have been trained now:
In [9]:
clf.classifiers
Out[9]:
[SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)]
In [10]:
prediction = clf.predict(X_test)
In [11]:
prediction
Out[11]:
<202x6 sparse matrix of type '<type 'numpy.int64'>'
with 246 stored elements in Compressed Sparse Column format>
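The prediction comes back as a scipy sparse matrix; to inspect the per-sample 0/1 label assignments you can densify it with toarray(). A sketch with a toy sparse matrix standing in for the real prediction:

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy stand-in for the sparse matrix returned by predict()
# (2 samples, 3 labels).
prediction_toy = csc_matrix(np.array([[0, 1, 0],
                                      [1, 0, 0]]))

# toarray() yields a dense 0/1 matrix: one row per sample,
# one column per label.
dense = prediction_toy.toarray()
print(dense)
```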
2. Measure the quality
In [13]:
import sklearn.metrics as metrics
Scikit-learn provides a set of metrics useful for evaluating the quality of the model. They are most often used by providing the true assignment matrix/array as the first argument, and the prediction matrix/array as the second argument.
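The two metrics used below behave differently in the multi-label setting: hamming_loss counts the fraction of individual label assignments that are wrong, while accuracy_score is subset accuracy, counting a sample as correct only if all of its labels match. A small worked example on toy arrays:

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],   # one of this sample's three labels is wrong
                   [0, 1, 0]])  # exact match

# Hamming loss: 1 wrong assignment out of 6 total.
hl = hamming_loss(y_true, y_pred)
# Subset accuracy: only 1 of the 2 rows matches exactly.
acc = accuracy_score(y_true, y_pred)
print(hl, acc)  # 0.16666666666666666 0.5
```

This is why the two scores reported below differ so much: a sample with five of six labels correct still counts as a miss for accuracy_score.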
In [14]:
metrics.hamming_loss(y_test, prediction)
Out[14]:
0.26485148514851486
In [15]:
metrics.accuracy_score(y_test, prediction)
Out[15]:
0.14356435643564355
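Beyond these two, scikit-learn's F1 score with micro or macro averaging is commonly reported for multi-label models and could be applied to y_test and prediction in the same way. A sketch on the same kind of toy arrays (the zero_division argument is set explicitly to silence the warning for a label with no positive predictions):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Micro-averaged F1 pools true/false positives and negatives over
# all labels before computing a single F1.
micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
# Macro-averaged F1 computes F1 per label, then takes the mean.
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(micro, macro)
```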