Source code for skmultilearn.model_selection.iterative_stratification

# -*- coding: utf-8 -*-
"""Iterative stratification for multi-label data

The classifier follows methods outlined in Sechidis11 and Szymanski17 papers related to stratyfing
multi-label data.

In general what we expect from a given stratification output is that a strata, or a fold, is close to a given, demanded size,
usually equal to 1/k in k-fold approach, or a x% train to test set division in 2-fold splits.

The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired
by a given fold, as more and more assignments are made, some folds are filled and positive evidence is directed into other folds,
in the end negative evidence is distributed based on a folds desirability of size.

You can also watch a `video presentation <http://videolectures.net/ecmlpkdd2011_tsoumakas_stratification/?q=stratification%20multi%20label>`_ by G. Tsoumakas which explains the algorithm. In 2017 Szymanski & Kajdanowicz extended the algorithm
to handle high-order relationships in the data set, if order = 1, the algorithm falls back to the original Sechidis11 setting.

If order is larger than 1 this class constructs a list of label combinations with replacement, i.e. allowing combinations of lower
order to be take into account. For example for combinations of order 2, the stratifier will consider both
label pairs (1, 2) and single labels denoted as (1,1) in the algorithm. In higher order cases the
when two combinations of different size have similar desirablity: the larger, i.e. more specific combination
is taken into consideration first, thus if a label pair (1,2) and label 1 represented as (1,1) are of similar
desirability, evidence for (1,2) will be assigned to folds first.

You can use this class exactly the same way you would use a normal scikit KFold class:

.. code-block :: python

    from skmultilearn.model_selection import IterativeStratification

    k_fold = IterativeStratification(n_splits=2, order=1):
    for train, test in k_fold.split(X, y):
        classifier.fit(X[train], y[train])
        result = classifier.predict(X[test])
        # do something with the result, comparing it to y[test]

Most of the methods of this class are private, you will not need them unless you are extending the method.

If you use this method to stratify data please cite both:
Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, 145-158.
http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf

Piotr Szymański, Tomasz Kajdanowicz ; Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 74:22-35, 2017.
http://proceedings.mlr.press/v74/szyma%C5%84ski17a.html

Bibtex:

.. code-block:: bibtex

    @article{sechidis2011stratification,
      title={On the stratification of multi-label data},
      author={Sechidis, Konstantinos and Tsoumakas, Grigorios and Vlahavas, Ioannis},
      journal={Machine Learning and Knowledge Discovery in Databases},
      pages={145--158},
      year={2011},
      publisher={Springer}
    }

    @InProceedings{pmlr-v74-szymański17a,
      title =    {A Network Perspective on Stratification of Multi-Label Data},
      author =   {Piotr Szymański and Tomasz Kajdanowicz},
      booktitle =    {Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications},
      pages =    {22--35},
      year =     {2017},
      editor =   {Luís Torgo and Bartosz Krawczyk and Paula Branco and Nuno Moniz},
      volume =   {74},
      series =   {Proceedings of Machine Learning Research},
      address =      {ECML-PKDD, Skopje, Macedonia},
      publisher =    {PMLR},
    }
"""

from sklearn.model_selection._split import _BaseKFold
import numpy as np
import scipy.sparse as sp
import itertools
from sklearn.utils import check_random_state

[docs]def iterative_train_test_split(X, y, test_size):
    """Iteratively stratified train/test split

    Parameters
    ----------
    test_size : float, [0,1]
        the proportion of the dataset to include in the test split, the rest will be put in the train set

    Returns
    -------
    X_train, y_train, X_test, y_test
        stratified division into train/test split
    """

    stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
    train_indexes, test_indexes = next(stratifier.split(X, y))

    X_train, y_train = X[train_indexes, :], y[train_indexes, :]
    X_test, y_test = X[test_indexes, :], y[test_indexes, :]

    return X_train, y_train, X_test, y_test



def _fold_tie_break(desired_samples_per_fold, M):
    """Helper function to split a tie between folds with same desirability of a given sample

    Parameters
    ----------
    desired_samples_per_fold: np.array[Float], :code:`(n_splits)`
        number of samples desired per fold
    M : np.array(int)
        List of folds between which to break the tie

    Returns
    -------
    fold_number : int
        The selected fold index to put samples into
    """
    if len(M) == 1:
        return M[0]
    else:
        max_val = max(desired_samples_per_fold[M])
        M_prim = np.where(
            np.array(desired_samples_per_fold) == max_val)[0]
        M_prim = np.array([x for x in M_prim if x in M])
        return np.random.choice(M_prim, 1)[0]


def _get_most_desired_combination(samples_with_combination):
    """Select the next most desired combination whose evidence should be split among folds

    Parameters
    ----------
    samples_with_combination : Dict[Combination, List[int]], :code:`(n_combinations)`
            map from each label combination present in y to list of sample indexes that have this combination assigned

    Returns
    -------
    combination: Combination
        the combination to split next
    """
    currently_chosen = None
    best_number_of_combinations, best_support_size = None, None

    for combination, evidence in samples_with_combination.items():
        number_of_combinations, support_size = (len(set(combination)), len(evidence))
        if support_size == 0:
            continue
        if currently_chosen is None or (
                best_number_of_combinations < number_of_combinations and best_support_size > support_size
        ):
            currently_chosen = combination
            best_number_of_combinations, best_support_size = number_of_combinations, support_size

    return currently_chosen


[docs]class IterativeStratification(_BaseKFold):
    """Iteratively stratify a multi-label data set into folds

    Construct an interative stratifier that splits the data set into folds trying to maintain balanced representation
    with respect to order-th label combinations.

    Attributes
    ----------

    n_splits : number of splits, int
        the number of folds to stratify into

    order : int, >= 1
        the order of label relationship to take into account when balancing sample distribution across labels

    sample_distribution_per_fold : None or List[float], :code:`(n_splits)`
        desired percentage of samples in each of the folds, if None and equal distribution of samples per fold
        is assumed i.e. 1/n_splits for each fold. The value is held in :code:`self.percentage_per_fold`.

    random_state : int
        the random state seed (optional)
    """

    def __init__(self, n_splits=3, order=1, sample_distribution_per_fold = None, random_state=None):
        self.order = order
        super(
            IterativeStratification,
            self).__init__(n_splits,
                           shuffle=False,
                           random_state=random_state)

        if sample_distribution_per_fold:
            self.percentage_per_fold = sample_distribution_per_fold
        else:
            self.percentage_per_fold = [1 / float(self.n_splits) for _ in range(self.n_splits)]

    def _prepare_stratification(self, y):
        """Prepares variables for performing stratification

        For the purpose of clarity, the type Combination denotes List[int], :code:`(self.order)` and represents a
        label combination of the order we want to preserve among folds in stratification. The total number of
        combinations present in :code:`(y)` will be denoted as :code:`(n_combinations)`.

        Sets
        ----

        self.n_samples, self.n_labels : int, int
            shape of y

        self.desired_samples_per_fold: np.array[Float], :code:`(n_splits)`
            number of samples desired per fold

        self.desired_samples_per_combination_per_fold: Dict[Combination, np.array[Float]], :code:`(n_combinations, n_splits)`
            number of samples evidencing each combination desired per each fold

        Parameters
        ----------

        y : output matrix or array of arrays (n_samples, n_labels)

        Returns
        -------

        rows : List[List[int]], :code:`(n_samples, n_labels)`
            list of label indices assigned to each sample

        rows_used : Dict[int, bool], :code:`(n_samples)`
            boolean map from a given sample index to boolean value whether it has been already assigned to a fold or not

        all_combinations :  List[Combination], :code:`(n_combinations)`
            list of all label combinations of order self.order present in y

        per_row_combinations : List[Combination], :code:`(n_samples)`
            list of all label combinations of order self.order present in y per row

        samples_with_combination : Dict[Combination, List[int]], :code:`(n_combinations)`
            map from each label combination present in y to list of sample indexes that have this combination assigned

        folds: List[List[int]] (n_splits)
            list of lists to be populated with samples

        """
        self.n_samples, self.n_labels = y.shape
        self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
                                                  for i in range(self.n_splits)])
        rows = sp.lil_matrix(y).rows
        rows_used = {i: False for i in range(self.n_samples)}
        all_combinations = []
        per_row_combinations = [[] for i in range(self.n_samples)]
        samples_with_combination = {}
        folds = [[] for _ in range(self.n_splits)]

        # for every row
        for sample_index, label_assignment in enumerate(rows):
            # for every n-th order label combination
            # register combination in maps and lists used later
            for combination in itertools.combinations_with_replacement(label_assignment, self.order):
                if combination not in samples_with_combination:
                    samples_with_combination[combination] = []

                samples_with_combination[combination].append(sample_index)
                all_combinations.append(combination)
                per_row_combinations[sample_index].append(combination)

        all_combinations = [list(x) for x in set(all_combinations)]

        self.desired_samples_per_combination_per_fold = {
            combination:
                np.array([len(evidence_for_combination) * self.percentage_per_fold[j]
                          for j in range(self.n_splits)])
            for combination, evidence_for_combination in samples_with_combination.items()
        }
        return rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds

    def _distribute_positive_evidence(self, rows_used, folds, samples_with_combination, per_row_combinations):
        """Internal method to distribute evidence for labeled samples across folds

        For params, see documentation of :code:`self._prepare_stratification`. Does not return anything,
        modifies params.
        """
        l = _get_most_desired_combination(samples_with_combination)
        while l is not None:
            while len(samples_with_combination[l]) > 0:
                row = samples_with_combination[l].pop()
                if rows_used[row]:
                    continue

                max_val = max(self.desired_samples_per_combination_per_fold[l])
                M = np.where(
                    np.array(self.desired_samples_per_combination_per_fold[l]) == max_val)[0]
                m = _fold_tie_break(self.desired_samples_per_combination_per_fold[l], M)
                folds[m].append(row)
                rows_used[row] = True
                for i in per_row_combinations[row]:
                    if row in samples_with_combination[i]:
                        samples_with_combination[i].remove(row)
                    self.desired_samples_per_combination_per_fold[i][m] -= 1
                self.desired_samples_per_fold[m] -= 1

            l = _get_most_desired_combination(samples_with_combination)

    def _distribute_negative_evidence(self, rows_used, folds):
        """Internal method to distribute evidence for unlabeled samples across folds

        For params, see documentation of :code:`self._prepare_stratification`. Does not return anything,
        modifies params.
        """
        available_samples = [
            i for i, v in rows_used.items() if not v]
        samples_left = len(available_samples)

        while samples_left > 0:
            row = available_samples.pop()
            rows_used[row] = True
            samples_left -= 1
            fold_selected = np.random.choice(np.where(self.desired_samples_per_fold > 0)[0], 1)[0]
            self.desired_samples_per_fold[fold_selected] -= 1
            folds[fold_selected].append(row)

    def _iter_test_indices(self, X, y=None, groups=None):
        """Internal method for providing scikit-learn's split with folds

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
            Note that providing ``y`` is sufficient to generate the splits and
            hence ``np.zeros(n_samples)`` may be used as a placeholder for
            ``X`` instead of actual training data.
        y : array-like, shape (n_samples,)
            The target variable for supervised learning problems.
            Stratification is done based on the y labels.
        groups : object
            Always ignored, exists for compatibility.

        Yields
        ------
        fold : List[int]
            indexes of test samples for a given fold, yielded for each of the folds
        """
        if self.random_state:
            check_random_state(self.random_state)

        rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
            self._prepare_stratification(y)

        self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)
        self._distribute_negative_evidence(rows_used, folds)

        for fold in folds:
            yield fold