skmultilearn.model_selection.iterative_stratification module
Iterative stratification for multilabel data
The stratifier follows the methods outlined in the Sechidis11 and Szymanski17 papers on stratifying multilabel data.
In general, what we expect from a stratification output is that each stratum, or fold, is close to a given, demanded size: usually 1/k of the data in a k-fold approach, or an x% train to test division in 2-fold splits.
The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired by a given fold. As more and more assignments are made, some folds fill up and positive evidence is directed into the other folds. In the end, negative evidence is distributed based on each fold's desirability of size.
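The assignment procedure described above can be sketched in plain Python for the first-order case. This is a simplified illustration, not the library's implementation: the function name iterative_stratify_sketch and its arguments are hypothetical, each sample is represented as a set of label ids, and "negative evidence" is interpreted here as samples without any positive label, which is an assumption.

```python
import random


def iterative_stratify_sketch(label_sets, fold_fractions, seed=0):
    """Greedy first-order stratification sketch (hypothetical helper).

    label_sets: list of sets, the positive labels of each sample.
    fold_fractions: desired share of samples per fold, e.g. [0.5, 0.5].
    Returns a list of folds, each a list of sample indices.
    """
    rng = random.Random(seed)
    n_folds = len(fold_fractions)
    labels = sorted(set().union(*label_sets))

    # How many samples each fold still wants overall.
    want_size = [len(label_sets) * f for f in fold_fractions]
    # How many positive samples of each label each fold still wants.
    want_label = {
        lab: [sum(lab in s for s in label_sets) * f for f in fold_fractions]
        for lab in labels
    }

    remaining = set(range(len(label_sets)))
    folds = [[] for _ in range(n_folds)]

    def assign(i, fold):
        folds[fold].append(i)
        remaining.discard(i)
        want_size[fold] -= 1
        for lab in label_sets[i]:
            want_label[lab][fold] -= 1

    while True:
        # Count how many unassigned samples carry each label.
        counts = {}
        for i in remaining:
            for lab in label_sets[i]:
                counts[lab] = counts.get(lab, 0) + 1
        if not counts:
            break
        # Handle the rarest remaining label first.
        rarest = min(counts, key=lambda lab: (counts[lab], lab))
        for i in sorted(i for i in remaining if rarest in label_sets[i]):
            # Pick the fold that desires this label most; break ties by
            # desired fold size, then randomly.
            best = max(
                range(n_folds),
                key=lambda f: (want_label[rarest][f], want_size[f], rng.random()),
            )
            assign(i, best)

    # Samples with no positive labels are spread by desired fold size only.
    for i in sorted(remaining):
        best = max(range(n_folds), key=lambda f: (want_size[f], rng.random()))
        assign(i, best)
    return folds


# Example: eight samples, two labels, two equally sized folds.
example_labels = [{0}, {0}, {0, 1}, {0, 1}, {1}, {1}, set(), set()]
example_folds = iterative_stratify_sketch(example_labels, [0.5, 0.5])
print([len(fold) for fold in example_folds])
```

Processing the rarest label first follows the Sechidis11 idea that rare labels are the hardest to balance, so they get their pick of folds before the folds fill up.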
You can also watch a video presentation by G. Tsoumakas which explains the algorithm. In 2017 Szymański & Kajdanowicz extended the algorithm to handle high-order relationships in the data set; if order = 1, the algorithm falls back to the original Sechidis11 setting.
If order is larger than 1, this class constructs a list of label combinations with replacement, i.e. allowing combinations of lower order to be taken into account. For example, for combinations of order 2, the stratifier will consider both label pairs such as (1, 2) and single labels, denoted as (1, 1) in the algorithm. In higher-order cases, when two combinations of different size have similar desirability, the larger, i.e. more specific, combination is taken into consideration first. Thus if a label pair (1, 2) and label 1, represented as (1, 1), are of similar desirability, evidence for (1, 2) will be assigned to folds first.
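To make "combinations with replacement" concrete, the following minimal sketch enumerates the combinations for one sample's labels with the standard library. The helper names label_combinations and specificity are hypothetical and only illustrate the representation described above; they are not part of the skmultilearn API.

```python
from itertools import combinations_with_replacement


def label_combinations(labels, order):
    """All order-sized label combinations with replacement for one sample."""
    return list(combinations_with_replacement(sorted(labels), order))


def specificity(combination):
    """Number of distinct labels in a combination: (1, 2) -> 2, (1, 1) -> 1."""
    return len(set(combination))


# A sample carrying labels 1 and 2, with order=2:
combos = label_combinations({1, 2}, order=2)
print(combos)  # [(1, 1), (1, 2), (2, 2)]

# More specific combinations first, mirroring the tie-breaking rule in the text.
print(sorted(combos, key=specificity, reverse=True))
```

Here (1, 1) and (2, 2) stand for the single labels 1 and 2, while (1, 2) is the genuine pair.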
You can use this class exactly the same way you would use a normal scikit-learn KFold class:
from skmultilearn.model_selection import IterativeStratification

# X, y and classifier are assumed to be defined already
k_fold = IterativeStratification(n_splits=2, order=1)
for train, test in k_fold.split(X, y):
    classifier.fit(X[train], y[train])
    result = classifier.predict(X[test])
    # do something with the result, comparing it to y[test]
Most of the methods of this class are private; you will not need them unless you are extending the method.
If you use this method to stratify data, please cite both:
Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the stratification of multilabel data. Machine Learning and Knowledge Discovery in Databases, 145-158. http://lpis.csd.auth.gr/publications/sechidisecmlpkdd2011.pdf
Szymański, P., & Kajdanowicz, T. (2017). A Network Perspective on Stratification of Multi-Label Data. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 74:22-35. http://proceedings.mlr.press/v74/szyma%C5%84ski17a.html
Bibtex:
@article{sechidis2011stratification,
  title={On the stratification of multilabel data},
  author={Sechidis, Konstantinos and Tsoumakas, Grigorios and Vlahavas, Ioannis},
  journal={Machine Learning and Knowledge Discovery in Databases},
  pages={145--158},
  year={2011},
  publisher={Springer}
}
@InProceedings{pmlr-v74-szymański17a,
  title = {A Network Perspective on Stratification of Multi-Label Data},
  author = {Piotr Szymański and Tomasz Kajdanowicz},
  booktitle = {Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages = {22--35},
  year = {2017},
  editor = {Luís Torgo and Bartosz Krawczyk and Paula Branco and Nuno Moniz},
  volume = {74},
  series = {Proceedings of Machine Learning Research},
  address = {ECML-PKDD, Skopje, Macedonia},
  publisher = {PMLR},
}

class skmultilearn.model_selection.iterative_stratification.IterativeStratification(n_splits=3, order=1, sample_distribution_per_fold=None, random_state=None)
Bases: sklearn.model_selection._split._BaseKFold
Iteratively stratify a multilabel data set into folds
Construct an iterative stratifier that splits the data set into folds, trying to maintain a balanced representation with respect to order-th label combinations.

n_splits
int – the number of splits, i.e. the number of folds to stratify into

order
int, >= 1 – the order of label relationship to take into account when balancing sample distribution across labels

sample_distribution_per_fold
None or List[float], (n_splits) – desired percentage of samples in each of the folds; if None, an equal distribution of samples per fold is assumed, i.e. 1/n_splits for each fold. The value is held in self.percentage_per_fold.

random_state
int – the random state seed (optional)
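
The default for sample_distribution_per_fold described above can be illustrated with a short sketch. The helper names resolve_fold_distribution and desired_samples_per_fold are hypothetical and only mirror the documented behaviour (1/n_splits per fold when None is given); the library's internal handling may differ.

```python
def resolve_fold_distribution(n_splits, sample_distribution_per_fold=None):
    """Return the per-fold percentages, defaulting to an equal split."""
    if sample_distribution_per_fold is None:
        return [1.0 / n_splits] * n_splits
    if len(sample_distribution_per_fold) != n_splits:
        raise ValueError("expected one percentage per fold")
    return list(sample_distribution_per_fold)


def desired_samples_per_fold(n_samples, percentages):
    """Translate per-fold percentages into demanded fold sizes."""
    return [n_samples * p for p in percentages]


# Equal split into four folds, then an uneven 25% / 75% split of 200 samples.
print(resolve_fold_distribution(4))
print(desired_samples_per_fold(200, resolve_fold_distribution(2, [0.25, 0.75])))
```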


skmultilearn.model_selection.iterative_stratification.iterative_train_test_split(X, y, test_size)
Iteratively stratified train/test split
Parameters: test_size (float, [0,1]) – the proportion of the dataset to include in the test split; the rest will be put in the train set
Returns: stratified division into train/test split
Return type: X_train, y_train, X_test, y_test
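
One way to sanity-check a split with this return shape is to compare, label by label, how many positive samples ended up in the test set. The sketch below is a hypothetical helper, not part of skmultilearn, and it assumes y_train and y_test are dense binary indicator matrices given as row-iterable sequences (for example lists of lists); sparse matrices would need converting first.

```python
def positive_share_in_test(y_train, y_test):
    """For each label, the fraction of its positive samples found in y_test."""
    n_labels = len(y_train[0]) if len(y_train) else len(y_test[0])
    shares = []
    for j in range(n_labels):
        train_pos = sum(row[j] for row in y_train)
        test_pos = sum(row[j] for row in y_test)
        total = train_pos + test_pos
        # A label with no positives at all has no meaningful share.
        shares.append(test_pos / total if total else float("nan"))
    return shares


# Toy indicator matrices with two labels.
toy_y_train = [[1, 0], [1, 1], [0, 1], [1, 0]]
toy_y_test = [[1, 1], [0, 1]]
print(positive_share_in_test(toy_y_train, toy_y_test))
```

For a well-stratified split, each share should be close to the requested test_size.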