skmultilearn.model_selection.iterative_stratification module

Iterative stratification for multi-label data

This stratifier follows the methods outlined in the Sechidis11 and Szymanski17 papers on stratifying multi-label data.

In general, what we expect from a stratification output is that each stratum, or fold, is close to a given, demanded size: usually 1/k of the data in a k-fold approach, or an x% train to test set division in 2-fold splits.
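As a rough illustration of these demanded sizes, the sketch below computes how many samples each fold would ideally receive. The helper name `desired_fold_sizes` and its arguments are hypothetical and are not part of scikit-multilearn; the only thing taken from the text is that each fold defaults to a 1/k share.

```python
def desired_fold_sizes(n_samples, n_splits, proportions=None):
    """Return the ideal (possibly fractional) number of samples per fold.

    Hypothetical helper for illustration only, not a library function.
    """
    if proportions is None:
        # Default k-fold case: every fold demands 1/k of the data.
        proportions = [1.0 / n_splits] * n_splits
    if len(proportions) != n_splits:
        raise ValueError("need exactly one proportion per fold")
    return [n_samples * p for p in proportions]


# Two equal folds over 10 samples.
equal_folds = desired_fold_sizes(10, 2)

# A 75% / 25% train to test division over 100 samples.
train_test = desired_fold_sizes(100, 2, [0.75, 0.25])
```

The sizes are kept fractional on purpose: a stratifier can only approximate them, since whole samples have to be assigned to folds.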

The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired by a given fold. As more and more assignments are made, some folds are filled and positive evidence is directed into other folds. In the end, negative evidence is distributed based on how many more samples each fold still desires.
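The assignment loop described above can be sketched in plain Python for the first-order case (single labels only). This is a simplified illustration and not the library's implementation: the function name `iterative_stratify_sketch` is hypothetical, samples are given as sets of integer labels, and ties are broken deterministically by the lowest fold index, which is an assumption made here to keep the example reproducible.

```python
def iterative_stratify_sketch(label_sets, proportions):
    """Assign each sample (a set of labels) to a fold; return fold indices."""
    n_folds = len(proportions)
    # How many samples each fold still desires overall.
    want_size = [len(label_sets) * p for p in proportions]
    # How many positive samples of each label each fold still desires.
    labels = sorted(set().union(*label_sets))
    want_label = {
        lab: [sum(lab in s for s in label_sets) * p for p in proportions]
        for lab in labels
    }
    assignment = [None] * len(label_sets)
    remaining = set(range(len(label_sets)))

    def place(i, fold):
        # Record the assignment and lower the fold's remaining desire.
        assignment[i] = fold
        remaining.discard(i)
        want_size[fold] -= 1
        for lab in label_sets[i]:
            want_label[lab][fold] -= 1

    while True:
        # Count the remaining positive samples of every label.
        counts = {}
        for i in remaining:
            for lab in label_sets[i]:
                counts[lab] = counts.get(lab, 0) + 1
        if not counts:
            break  # only unlabeled samples (negative evidence) are left
        # Handle the rarest remaining label first.
        rare = min(sorted(counts), key=counts.get)
        for i in sorted(i for i in remaining if rare in label_sets[i]):
            # Pick the fold that desires this label most, then the fold that
            # desires samples most, then the lowest fold index.
            fold = max(
                range(n_folds),
                key=lambda f: (want_label[rare][f], want_size[f], -f),
            )
            place(i, fold)

    # Distribute samples without labels by remaining desired fold size.
    for i in sorted(remaining):
        fold = max(range(n_folds), key=lambda f: (want_size[f], -f))
        place(i, fold)
    return assignment


samples = [{0}, {0}, {0}, {0}, {1}, {1}, set(), set()]
folds = iterative_stratify_sketch(samples, [0.5, 0.5])
```

With two equal folds, each fold ends up with four samples, two positives for label 0 and one positive for label 1.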

You can also watch a video presentation by G. Tsoumakas which explains the algorithm. In 2017 Szymanski & Kajdanowicz extended the algorithm to handle high-order relationships in the data set. If order = 1, the algorithm falls back to the original Sechidis11 setting.

If order is larger than 1, this class constructs a list of label combinations with replacement, i.e. allowing combinations of lower order to be taken into account. For example, for combinations of order 2, the stratifier will consider both label pairs such as (1, 2) and single labels, denoted as (1, 1) in the algorithm. In higher-order cases, when two combinations of different size have similar desirability, the larger, i.e. more specific, combination is taken into consideration first. Thus, if the label pair (1, 2) and label 1, represented as (1, 1), are of similar desirability, evidence for (1, 2) will be assigned to folds first.
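One way to enumerate such combinations with replacement is Python's standard `itertools.combinations_with_replacement`. The helper `label_combinations` below is a hypothetical name used only for illustration; whether the class builds its combinations in exactly this way is an assumption.

```python
from itertools import combinations_with_replacement


def label_combinations(sample_labels, order):
    """List the label combinations of one sample, lower orders included."""
    return list(combinations_with_replacement(sorted(sample_labels), order))


# For a sample carrying labels 1 and 2 and order 2, the tuples (1, 1) and
# (2, 2) stand for the single labels, while (1, 2) is the actual label pair.
pairs = label_combinations({1, 2}, order=2)

# With order 1 only the single labels remain, as in the original setting.
singles = label_combinations({1, 2}, order=1)
```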

You can use this class in exactly the same way you would use a normal scikit-learn KFold class:

from skmultilearn.model_selection import IterativeStratification

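# X, y and classifier are assumed to be defined already: a feature matrix,
# a binary label indicator matrix and a multi-label classifier
k_fold = IterativeStratification(n_splits=2, order=1)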
for train, test in k_fold.split(X, y):
    classifier.fit(X[train], y[train])
    result = classifier.predict(X[test])
    # do something with the result, comparing it to y[test]

Most of the methods of this class are private; you will not need them unless you are extending the method.

If you use this method to stratify data, please cite both:

Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, 145-158. http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf

Szymański, P., & Kajdanowicz, T. (2017). A Network Perspective on Stratification of Multi-Label Data. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 74:22-35. http://proceedings.mlr.press/v74/szyma%C5%84ski17a.html

Bibtex:

@article{sechidis2011stratification,
  title={On the stratification of multi-label data},
  author={Sechidis, Konstantinos and Tsoumakas, Grigorios and Vlahavas, Ioannis},
  journal={Machine Learning and Knowledge Discovery in Databases},
  pages={145--158},
  year={2011},
  publisher={Springer}
}

@InProceedings{pmlr-v74-szymański17a,
  title =    {A Network Perspective on Stratification of Multi-Label Data},
  author =   {Piotr Szymański and Tomasz Kajdanowicz},
  booktitle =    {Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages =    {22--35},
  year =     {2017},
  editor =   {Luís Torgo and Bartosz Krawczyk and Paula Branco and Nuno Moniz},
  volume =   {74},
  series =   {Proceedings of Machine Learning Research},
  address =      {ECML-PKDD, Skopje, Macedonia},
  publisher =    {PMLR},
}

class skmultilearn.model_selection.iterative_stratification.IterativeStratification(n_splits=3, order=1, sample_distribution_per_fold=None, random_state=None)

Bases: sklearn.model_selection._split._BaseKFold

Iteratively stratify a multi-label data set into folds

Construct an iterative stratifier that splits the data set into folds, trying to maintain a balanced representation with respect to order-th label combinations.

n_splits

the number of folds (splits) to stratify into

Type: int

order

the order of label relationship to take into account when balancing sample distribution across labels

Type: int, >= 1

sample_distribution_per_fold

desired percentage of samples in each of the folds; if None, an equal distribution of samples per fold is assumed, i.e. 1/n_splits for each fold. The value is held in self.percentage_per_fold.

Type: None or List[float], (n_splits)

random_state

the random state seed (optional)

Type: int

skmultilearn.model_selection.iterative_stratification.iterative_train_test_split(X, y, test_size)

Iteratively stratified train/test split

Parameters: test_size (float, [0,1]) – the proportion of the dataset to include in the test split; the rest is put in the train set

Returns: stratified division into train/test split

Return type: X_train, y_train, X_test, y_test