5. Multi-label data stratification¶
With the development of more complex multi-label transformation methods, the community has come to realize how strongly classification quality depends on how the data is split into train/test sets or into folds for parameter estimation. Questions about multi-label stratification methods appear regularly on Stack Overflow and Cross Validated.
For many reasons, described here and here, traditional single-label approaches to stratifying data fail to provide balanced data set divisions, which prevents classifiers from generalizing information.
Some train/test splits don't include any evidence for a given label in the train set at all; others put as much as 70% of the evidence for a label pair in the test set, leaving the train set without proper evidence for generalizing conditional probabilities for label relations.
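To build intuition for how this happens, here is a minimal sketch (pure standard library, with a made-up toy label set) of a naive random split starving a rare label pair; the row counts and helper name `pair_counts` are illustrative, not part of any library:

```python
import random
from collections import Counter
from itertools import combinations_with_replacement

random.seed(0)

# Toy data: 100 rows, label pair (0, 1) co-occurs in only 4 of them.
rows = [(0,)] * 48 + [(1,)] * 48 + [(0, 1)] * 4
random.shuffle(rows)

# Naive 50/50 split that ignores the labels entirely.
train, test = rows[:50], rows[50:]

def pair_counts(split):
    """Count order-2 label combinations (with repetition) across a split."""
    return Counter(pair for labels in split
                   for pair in combinations_with_replacement(sorted(labels), 2))

# With only 4 examples of (0, 1) overall, a random split can easily put
# most (or all) of them on one side.
print(pair_counts(train)[(0, 1)], pair_counts(test)[(0, 1)])
```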
You can also watch a great video presentation from ECML 2011 which explains this in depth:
On the Stratification of Multi-Label Data Grigorios Tsoumakas
Scikit-multilearn provides an implementation of iterative stratification, which aims to provide a well-balanced distribution of evidence of label relations up to a given order. To see what this means, let's load up some data. We'll be using the scene data set, both in divided and undivided variants, to illustrate the problem.
In [263]:
from skmultilearn.dataset import load_dataset
X,y, _, _ = load_dataset('scene', 'undivided')
scene:undivided - exists, not redownloading
Let’s look at how many examples are available per label combination:
In [264]:
from collections import Counter
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix
In [265]:
Counter(combination for row in get_combination_wise_output_matrix(y.A, order=2) for combination in row)
Out[265]:
Counter({(0, 0): 427,
(0, 3): 1,
(0, 4): 38,
(0, 5): 19,
(1, 1): 364,
(2, 2): 397,
(2, 3): 24,
(2, 4): 14,
(3, 3): 433,
(3, 4): 76,
(3, 5): 6,
(4, 4): 533,
(4, 5): 1,
(5, 5): 431})
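For intuition, the order-2 combinations for a single row are just all index pairs (with repetition) of that row's active labels, so a pair like `(0, 0)` counts rows where label 0 is present and `(0, 4)` counts co-occurrences. A rough standard-library equivalent of what `get_combination_wise_output_matrix` emits per row (the helper name `row_combinations` is hypothetical) might look like:

```python
from itertools import combinations_with_replacement

def row_combinations(row, order=2):
    """All sorted index tuples (with repetition) of active labels in one row.
    Illustrative sketch only, not the library's implementation."""
    active = [i for i, v in enumerate(row) if v]
    return list(combinations_with_replacement(active, order))

# A row with labels 0 and 4 set contributes (0, 0), (0, 4) and (4, 4).
print(row_combinations([1, 0, 0, 0, 1, 0]))  # [(0, 0), (0, 4), (4, 4)]
```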
Let’s load up the original division, to see how the set was split into train/test data in 2004, before multi-label stratification methods appeared.
In [266]:
_, original_y_train, _, _ = load_dataset('scene', 'train')
_, original_y_test, _, _ = load_dataset('scene', 'test')
scene:train - exists, not redownloading
scene:test - exists, not redownloading
In [267]:
import pandas as pd
In [268]:
pd.DataFrame({
'train': Counter(str(combination) for row in get_combination_wise_output_matrix(original_y_train.A, order=2) for combination in row),
'test' : Counter(str(combination) for row in get_combination_wise_output_matrix(original_y_test.A, order=2) for combination in row)
}).T.fillna(0.0)
Out[268]:
|       | (0, 0) | (0, 3) | (0, 4) | (0, 5) | (1, 1) | (2, 2) | (2, 3) | (2, 4) | (3, 3) | (3, 4) | (3, 5) | (4, 4) | (4, 5) | (5, 5) |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| test  | 200.0  | 1.0    | 17.0   | 7.0    | 199.0  | 200.0  | 16.0   | 8.0    | 237.0  | 49.0   | 5.0    | 256.0  | 0.0    | 207.0  |
| train | 227.0  | 0.0    | 21.0   | 12.0   | 165.0  | 197.0  | 8.0    | 6.0    | 196.0  | 27.0   | 1.0    | 277.0  | 1.0    | 224.0  |
In [269]:
original_y_train.shape[0], original_y_test.shape[0]
Out[269]:
(1211, 1196)
We can see that the split sizes are nearly identical, yet the evidence for some label combinations is far from balanced between the splits: pair (3, 5) has 5 examples in the test set but only 1 in the train set, and pair (0, 3) appears only in the test set. While this is a toy case on a small data set, such phenomena are common in larger datasets. We would like to fix this.
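One way to quantify this imbalance is to compute, per label pair, the fraction of its total evidence that landed in the train split; a perfectly balanced 50/50 split would give 0.5 everywhere. A small sketch using two of the rare pairs from the table above (the helper name `train_fraction` is made up for illustration):

```python
from collections import Counter

def train_fraction(train_counts, test_counts):
    """Per label pair, the fraction of its total evidence placed in train."""
    pairs = set(train_counts) | set(test_counts)
    return {p: train_counts[p] / (train_counts[p] + test_counts[p])
            for p in pairs}

# Counts for two rare pairs from the original 2004 scene split above.
orig_train = Counter({(3, 5): 1, (4, 5): 1})
orig_test = Counter({(3, 5): 5, (4, 5): 0})
fractions = train_fraction(orig_train, orig_test)
print(fractions[(3, 5)])  # ~0.17: most of this pair's evidence is in test
print(fractions[(4, 5)])  # 1.0: the test set never sees this pair at all
```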
Let’s load the iterative stratifier and divide the set again.
In [270]:
from skmultilearn.model_selection import iterative_train_test_split
In [278]:
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size = 0.5)
In [279]:
pd.DataFrame({
'train': Counter(str(combination) for row in get_combination_wise_output_matrix(y_train.A, order=2) for combination in row),
'test' : Counter(str(combination) for row in get_combination_wise_output_matrix(y_test.A, order=2) for combination in row)
}).T.fillna(0.0)
Out[279]:
|       | (0, 0) | (0, 3) | (0, 4) | (0, 5) | (1, 1) | (2, 2) | (2, 3) | (2, 4) | (3, 3) | (3, 4) | (3, 5) | (4, 4) | (4, 5) | (5, 5) |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| test  | 213.0  | 0.0    | 19.0   | 9.0    | 182.0  | 199.0  | 12.0   | 7.0    | 217.0  | 38.0   | 3.0    | 267.0  | 1.0    | 215.0  |
| train | 214.0  | 1.0    | 19.0   | 10.0   | 182.0  | 198.0  | 12.0   | 7.0    | 216.0  | 38.0   | 3.0    | 266.0  | 0.0    | 216.0  |
We can see that the new division is much better balanced: the evidence for each label pair is now split almost exactly in half between the train and test sets.