5. Multi-label data stratification

With the development of more complex multi-label transformation methods, the community has come to realize how strongly classification quality depends on how the data is split into train/test sets or into folds for parameter estimation. Questions about multi-label stratification methods appear more and more often on Stack Overflow and Cross Validated.

For many reasons, traditional single-label approaches to stratifying data fail to provide balanced data set divisions, which prevents classifiers from generalizing label relations.

Some train/test splits don’t include evidence for a given label in the train set at all; others put as much as 70% of the evidence for a label pair in the test set, leaving the train set without enough examples to generalize conditional probabilities for label relations.
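
To make this concrete, here is a minimal sketch (a synthetic toy example of ours, not part of the scene walkthrough below) of how a plain random split can starve one side of a rare label:

In [ ]:
# Toy illustration on synthetic data: a rare label can end up entirely,
# or almost entirely, on one side of a naive random split.
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.zeros((100, 2), dtype=int)
y_toy[:, 0] = 1     # label 0: present in every row
y_toy[:3, 1] = 1    # label 1: present in only 3 of 100 rows
X_toy = np.random.rand(100, 5)

_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5)
# with only 3 positives for label 1, there is a fair chance all of them
# land in one split, leaving the other with no evidence for the label
y_tr[:, 1].sum(), y_te[:, 1].sum()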

You can also watch a great video presentation from ECML 2011 which explains this in depth:

On the Stratification of Multi-Label Data, by Grigorios Tsoumakas

Scikit-multilearn provides an implementation of iterative stratification, which aims to provide a well-balanced distribution of evidence of label relations up to a given order. To see what this means, let’s load some data. We’ll use the scene data set, in both its divided and undivided variants, to illustrate the problem.

In [263]:
from skmultilearn.dataset import load_dataset
X,y, _, _ = load_dataset('scene', 'undivided')
scene:undivided - exists, not redownloading

Let’s look at how many examples are available per label combination:

In [264]:
from collections import Counter
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix
In [265]:
Counter(combination for row in get_combination_wise_output_matrix(y.A, order=2) for combination in row)
Out[265]:
Counter({(0, 0): 427,
         (0, 3): 1,
         (0, 4): 38,
         (0, 5): 19,
         (1, 1): 364,
         (2, 2): 397,
         (2, 3): 24,
         (2, 4): 14,
         (3, 3): 433,
         (3, 4): 76,
         (3, 5): 6,
         (4, 4): 533,
         (4, 5): 1,
         (5, 5): 431})
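
The diagonal entries (i, i) above are plain per-label counts, while off-diagonal entries count label co-occurrences. As a quick sanity check of how get_combination_wise_output_matrix behaves, here is a sketch on a tiny dense matrix (our addition, not from the original walkthrough):

In [ ]:
# For order=2, each row of the output lists every pair (i, j), i <= j,
# of labels active in that row; pairs (i, i) mark single-label presence.
import numpy as np
list(get_combination_wise_output_matrix(np.array([[1, 0, 1],
                                                  [0, 1, 1]]), order=2))
# expected per row: row 0 -> (0, 0), (0, 2), (2, 2)
#                   row 1 -> (1, 1), (1, 2), (2, 2)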

Let’s load up the original division, to see how the set was split into train/test data in 2004, before multi-label stratification methods appeared.

In [266]:
_, original_y_train, _, _ = load_dataset('scene', 'train')
_, original_y_test, _, _ = load_dataset('scene', 'test')
scene:train - exists, not redownloading
scene:test - exists, not redownloading
In [267]:
import pandas as pd
In [268]:
pd.DataFrame({
    'train': Counter(str(combination) for row in get_combination_wise_output_matrix(original_y_train.A, order=2) for combination in row),
    'test' : Counter(str(combination) for row in get_combination_wise_output_matrix(original_y_test.A, order=2) for combination in row)
}).T.fillna(0.0)
Out[268]:
(0, 0) (0, 3) (0, 4) (0, 5) (1, 1) (2, 2) (2, 3) (2, 4) (3, 3) (3, 4) (3, 5) (4, 4) (4, 5) (5, 5)
test 200.0 1.0 17.0 7.0 199.0 200.0 16.0 8.0 237.0 49.0 5.0 256.0 0.0 207.0
train 227.0 0.0 21.0 12.0 165.0 197.0 8.0 6.0 196.0 27.0 1.0 277.0 1.0 224.0
In [269]:
original_y_train.shape[0], original_y_test.shape[0]
Out[269]:
(1211, 1196)

We can see that the split sizes are nearly identical, yet the evidence for some label combinations is far from balanced between the splits: the pair (0, 3) appears only in the test set, and (3, 4) has 49 test examples against 27 in train. While this is a toy case on a small data set, such phenomena are common in larger data sets. We would like to fix this.
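
One quick way to quantify the imbalance (a sketch we add here, reusing the counters from above) is to look at what fraction of each pair’s evidence went to the train split; a balanced 50/50 division would put roughly 0.5 everywhere:

In [ ]:
df = pd.DataFrame({
    'train': Counter(str(c) for row in get_combination_wise_output_matrix(original_y_train.A, order=2) for c in row),
    'test' : Counter(str(c) for row in get_combination_wise_output_matrix(original_y_test.A, order=2) for c in row)
}).T.fillna(0.0)
# fraction of each pair's evidence that landed in train; 0.0 or 1.0 means
# one split carries no evidence for that pair at all
(df.loc['train'] / (df.loc['train'] + df.loc['test'])).sort_values()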

Let’s load the iterative stratifier and divide the set again.

In [270]:
from skmultilearn.model_selection import iterative_train_test_split
In [278]:
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size = 0.5)
In [279]:
pd.DataFrame({
    'train': Counter(str(combination) for row in get_combination_wise_output_matrix(y_train.A, order=2) for combination in row),
    'test' : Counter(str(combination) for row in get_combination_wise_output_matrix(y_test.A, order=2) for combination in row)
}).T.fillna(0.0)
Out[279]:
(0, 0) (0, 3) (0, 4) (0, 5) (1, 1) (2, 2) (2, 3) (2, 4) (3, 3) (3, 4) (3, 5) (4, 4) (4, 5) (5, 5)
test 213.0 0.0 19.0 9.0 182.0 199.0 12.0 7.0 217.0 38.0 3.0 267.0 1.0 215.0
train 214.0 1.0 19.0 10.0 182.0 198.0 12.0 7.0 216.0 38.0 3.0 266.0 0.0 216.0

We can see that the new division is much more balanced: the evidence for each label pair is now split nearly 50/50 between train and test.
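
The same mechanism extends to cross-validation. A short sketch, assuming the scikit-learn-style KFold interface of scikit-multilearn’s IterativeStratification class:

In [ ]:
from skmultilearn.model_selection import IterativeStratification

# stratify evidence of pairwise (order=2) label relations across 2 folds
k_fold = IterativeStratification(n_splits=2, order=2)
for train_idx, test_idx in k_fold.split(X, y):
    # train_idx / test_idx are index arrays, as with scikit-learn's KFold
    print(train_idx.shape[0], test_idx.shape[0])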