skmultilearn.model_selection.iterative_stratification module
Iterative stratification for multi-label data
This stratifier follows the methods outlined in the Sechidis11 and Szymanski17 papers on stratifying multi-label data.
In general, what we expect from a given stratification output is that each stratum, or fold, is close to a given demanded size: usually 1/k of the data in a k-fold approach, or an x% train to test set division in 2-fold splits.
The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired by a given fold. As more and more assignments are made, some folds fill up and positive evidence is directed into other folds; in the end, negative evidence is distributed based on each fold's desired size.
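For intuition, the following is a minimal, hedged sketch of that assignment idea for order = 1 and a dense binary label matrix y: samples carrying the currently rarest label go to the fold that still desires that label the most, and label-less samples (negative evidence) are spread at the end by remaining fold capacity. This is an illustration only, not the library's implementation; tie-breaking and other details of the published algorithm are simplified.

import numpy as np

def sketch_iterative_stratification(y, n_splits=3):
    # illustrative only: a simplified, order-1 variant of the assignment idea above
    n_samples, n_labels = y.shape
    folds = [[] for _ in range(n_splits)]
    desired_fold_size = np.full(n_splits, n_samples / n_splits)        # how many samples each fold still wants
    desired_label_count = np.tile(y.sum(axis=0) / n_splits, (n_splits, 1))  # per-fold desire for each label
    unassigned = set(range(n_samples))

    while unassigned:
        remaining = y[sorted(unassigned)].sum(axis=0)
        positive_labels = np.flatnonzero(remaining)
        if positive_labels.size == 0:
            # only negative evidence left: spread it by remaining fold capacity
            for sample in sorted(unassigned):
                fold = int(np.argmax(desired_fold_size))
                folds[fold].append(sample)
                desired_fold_size[fold] -= 1
            break
        # pick the rarest label still present among the unassigned samples
        label = positive_labels[np.argmin(remaining[positive_labels])]
        for sample in [s for s in sorted(unassigned) if y[s, label]]:
            fold = int(np.argmax(desired_label_count[:, label]))
            folds[fold].append(sample)
            desired_label_count[fold] -= y[sample]
            desired_fold_size[fold] -= 1
            unassigned.discard(sample)
    return folds

# example: sketch_iterative_stratification(np.array([[1, 0], [1, 1], [0, 1], [0, 0]]), n_splits=2)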
You can also watch a video presentation by G. Tsoumakas that explains the algorithm. In 2017, Szymański & Kajdanowicz extended the algorithm to handle high-order relationships in the data set; if order = 1, the algorithm falls back to the original Sechidis11 setting.
If order is larger than 1, this class constructs a list of label combinations with replacement, i.e. allowing combinations of lower order to be taken into account. For example, for combinations of order 2 the stratifier will consider both label pairs (1, 2) and single labels, denoted as (1, 1) in the algorithm. In higher-order cases, when two combinations of different size have similar desirability, the larger, i.e. more specific, combination is taken into consideration first; thus if a label pair (1, 2) and label 1, represented as (1, 1), are of similar desirability, evidence for (1, 2) will be assigned to folds first.
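As an illustration of such combinations with replacement, a sample whose active labels are {1, 2} yields, at order 2, the combinations (1, 1), (1, 2) and (2, 2). A minimal sketch using itertools; the helper name is hypothetical and not part of the library API:

from itertools import combinations_with_replacement

def row_label_combinations(active_labels, order=2):
    # enumerate order-sized label combinations with replacement for one sample
    return list(combinations_with_replacement(sorted(active_labels), order))

print(row_label_combinations({1, 2}))  # [(1, 1), (1, 2), (2, 2)]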
You can use this class exactly the same way you would use a normal scikit-learn KFold class:

from skmultilearn.model_selection import IterativeStratification

k_fold = IterativeStratification(n_splits=2, order=1)
for train, test in k_fold.split(X, y):
    classifier.fit(X[train], y[train])
    result = classifier.predict(X[test])
    # do something with the result, comparing it to y[test]
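Below is a self-contained variant of the snippet above that runs as-is; the toy data and the scikit-learn multi-output classifier are placeholders, not part of this module:

import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import hamming_loss
from skmultilearn.model_selection import IterativeStratification

rng = np.random.default_rng(42)
X = rng.random((40, 5))                      # 40 samples, 5 features
y = (rng.random((40, 3)) > 0.5).astype(int)  # 3 binary labels

k_fold = IterativeStratification(n_splits=2, order=1)
for train, test in k_fold.split(X, y):
    classifier = MultiOutputClassifier(DecisionTreeClassifier())
    classifier.fit(X[train], y[train])
    result = classifier.predict(X[test])
    print(hamming_loss(y[test], result))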
Most of the methods of this class are private, you will not need them unless you are extending the method.
If you use this method to stratify data, please cite both:
Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, 145-158. http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf
Szymański, P., & Kajdanowicz, T. (2017). A Network Perspective on Stratification of Multi-Label Data. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 74:22-35. http://proceedings.mlr.press/v74/szyma%C5%84ski17a.html
Bibtex:
@article{sechidis2011stratification,
title={On the stratification of multi-label data},
author={Sechidis, Konstantinos and Tsoumakas, Grigorios and Vlahavas, Ioannis},
journal={Machine Learning and Knowledge Discovery in Databases},
pages={145--158},
year={2011},
publisher={Springer}
}
@InProceedings{pmlr-v74-szymański17a,
title = {A Network Perspective on Stratification of Multi-Label Data},
author = {Piotr Szymański and Tomasz Kajdanowicz},
booktitle = {Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications},
pages = {22--35},
year = {2017},
editor = {Luís Torgo and Bartosz Krawczyk and Paula Branco and Nuno Moniz},
volume = {74},
series = {Proceedings of Machine Learning Research},
address = {ECML-PKDD, Skopje, Macedonia},
publisher = {PMLR},
}
class skmultilearn.model_selection.iterative_stratification.IterativeStratification(n_splits=3, order=1, sample_distribution_per_fold=None, random_state=None)
Bases: sklearn.model_selection._split._BaseKFold
Iteratively stratify a multi-label data set into folds.
Construct an iterative stratifier that splits the data set into folds, trying to maintain a balanced representation with respect to order-th label combinations.
order
the order of label relationship to take into account when balancing sample distribution across labels
Type: int, >= 1
sample_distribution_per_fold
desired percentage of samples in each of the folds; if None, an equal distribution of samples per fold is assumed, i.e. 1/n_splits for each fold. The value is held in self.percentage_per_fold.
Type: None or List[float], (n_splits)
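As a hedged sketch of this parameter (with toy numpy data standing in for a real data set): a 2-fold split in which fold 0 is asked for roughly 25% of the samples and fold 1 for roughly 75% can express an approximate 75%/25% train/test division.

import numpy as np
from skmultilearn.model_selection import IterativeStratification

rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = (rng.random((40, 3)) > 0.5).astype(int)

stratifier = IterativeStratification(
    n_splits=2,
    order=1,
    sample_distribution_per_fold=[0.25, 0.75],
)
train_idx, test_idx = next(stratifier.split(X, y))  # test fold ~25% of samples, train fold ~75%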
skmultilearn.model_selection.iterative_stratification.iterative_train_test_split(X, y, test_size)
Iteratively stratified train/test split.
Parameters: test_size (float, [0,1]) – the proportion of the dataset to include in the test split; the rest will be put in the train set
Returns: stratified division into train/test split
Return type: X_train, y_train, X_test, y_test
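A short usage sketch of the documented signature and return order, again with toy numpy data standing in for a real multi-label data set:

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = (rng.random((40, 3)) > 0.5).astype(int)

X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)  # roughly a 75%/25% split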