{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Relevant Concepts in Multi-Label Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will learn the basic concepts behind multi-label classification.\n", "\n", "## Aim\n", "\n", "Classification aims to assign classes/labels to objects. Objects usually represent things we come across in daily life: photos, audio recordings, text documents, videos, but can also include complicated biological systems. \n", "\n", "Objects are usually represented by their selected features (its count denoted as ``n_features`` in the documentation). Features are the characteristics of objects that distinguish them from others. For example text documents can be represented by words that are present in them. \n", "\n", "The output of classification for a given object is either a class or a set of classes. Traditional classification, usually due to computational limits, aimed at solving only single-label scenarios in which at most one class had been assigned to an object.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Single-label vs multi-label classification\n", "\n", "One can identify two types of single-label classification problems:\n", "\n", "- a single-class one, where the decision is whether to assign the class or not, for ex. having a photo sample from someones pancreas, deciding if it is a photo of cancer sample or not. This is also sometimes called binary classification, as the output values of the predictions are always ``0`` or ``1``\n", "- a multi-class problem where the class, if assigned, is selected from a number of available classes: for example, assigning a brand to a photo of a car\n", "\n", "In multi-label classification one can assign more than one label/class out of the available `n_labels` to a given object.\n", "\n", "[Madjarov et al.](http://kt.ijs.si/DragiKocev/wikipage/lib/exe/fetch.php?media=2012pr_ml_comparison.pdf) divide approaches to multi-label classification into three categories, you should select a scikit-multilearn base class according to the philosophy behind your classifier: \n", "\n", "- algorithm adaptation, currently none in ``scikit-multilearn`` in the future they will be placed in ``skmultilearn.adapt``\n", "- problem transformation, such as Binary Relevance, Label Powerset & more, are now available from ``skmultilearn.problem_transformation``\n", "- ensemble classification, such as ``RAkEL`` or label space partitioning classifiers, are now available from ``skmultilearn.ensemble``\n", "\n", "A single-label classifier is a function that given an object represented as a feature vector of length ``n_features`` assigns a class (a number, or None). A multi-label classifier outputs a set of assigned labels, either in a form of a list of assigned labels or as a binary vector in which a ``1`` or ``0`` on ``i``-th position indicates if an ``i``-th label is assigned or not.\n", "\n", "To learn a classifier we use a training set that provides ``n_samples`` of sampled objects represented by ``n_features`` with evidence concerning which labels out of ``n_labels`` are assigned to each of the object. The quality of the classifier is tested on a test set that follows the same format.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-label classification data\n", "\n", "To train a classification model we need data about a phenomenon that the classifier is supposed to generalise. 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-label classification data\n",
    "\n",
    "To train a classification model we need data about a phenomenon that the classifier is supposed to generalise. Such data usually comes in two parts:\n",
    "\n",
    "- the objects to classify - the input space - which we will denote as ``X`` and which consists of ``n_samples`` objects, each represented using ``n_features`` features\n",
    "- the labels assigned to the ``n_samples`` objects - the output space - which we will denote as ``y``; ``y`` provides information about which of the ``n_labels`` available labels are actually assigned to each of the ``n_samples`` objects\n",
    "\n",
    "### The multi-label data representation\n",
    "\n",
    "On input, scikit-multilearn expects:\n",
    "\n",
    "- ``X`` to be a matrix of shape ``(n_samples, n_features)``\n",
    "- ``y`` to be a matrix of shape ``(n_samples, n_labels)``\n",
    "\n",
    "Let's load up a data set to see this in practice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "emotions:train - exists, not redownloading\n"
     ]
    }
   ],
   "source": [
    "from skmultilearn.dataset import load_dataset\n",
    "X, y, _, _ = load_dataset('emotions', 'train')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<391x72 sparse matrix of type ''\n",
       " \twith 28059 stored elements in LInked List format>,\n",
       " <391x6 sparse matrix of type ''\n",
       " \twith 709 stored elements in LInked List format>)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that in the case of the emotions data the values are:\n",
    "- n_samples: 391\n",
    "- n_features: 72\n",
    "- n_labels: 6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By a matrix scikit-multilearn understands a data structure that follows the ``A[i,j]`` element-access scheme. Sparse matrices should be used instead of dense ones, especially for the output space. Scikit-multilearn will internally convert dense representations to the sparse representations that are most suitable to a given classification procedure, and it will output sparse matrices as well.\n",
    "\n",
    "``X`` can store any type of data that a given classification method can handle, but nominal encoding is always helpful. Nominal encoding is enabled by default when loading data with the :meth:`skmultilearn.dataset.Dataset.load_arff_to_numpy` helper, which also returns sparse representations of ``X`` and ``y`` loaded from an ARFF data file.\n",
    "\n",
    "``y`` is expected to be a binary ``integer`` indicator matrix of shape ``(n_samples, n_labels)``. In the binary indicator matrix each element ``A[i,j]`` should be ``1`` if label ``j`` is assigned to object ``i``, and ``0`` if not.\n",
    "\n",
    "We highly recommend that every multi-label output space be stored in a sparse matrix; scikit-multilearn classifiers operate internally only on sparse binary label indicator matrices. This is also the format of predicted label assignments. Sparse representation is employed as the default because it is very rare for a real-world output space ``y`` to be dense. Usually, the number of labels assigned per instance is just a small portion of all labels. The average percentage of labels assigned per object is called `label density` and in established data sets it tends to be small.\n",
    "\n",
    "### Single-label representations in problem transformation\n",
    "\n",
    "The problem transformation approach to multi-label classification converts multi-label problems to single-label problems: single-class or multi-class. Those problems are then solved using base classifiers, as sketched below.\n"
   ]
  },
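  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell is a rough, hand-rolled sketch of that idea in the spirit of Binary Relevance: it slices the label indicator matrix ``y`` loaded above column by column and fits one independent single-label (binary) classifier per label. The choice of ``GaussianNB`` as the base classifier and the densification of ``X`` are arbitrary simplifications for illustration; the classifiers in ``skmultilearn.problem_transform`` implement such transformations (and the data-format handling discussed below) for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# Binary Relevance idea by hand: one binary single-label problem per label.\n",
    "# y is the sparse (n_samples, n_labels) indicator matrix loaded above.\n",
    "X_dense = X.toarray()\n",
    "per_label_classifiers = []\n",
    "for label in range(y.shape[1]):\n",
    "    # 0/1 target vector for this single label\n",
    "    y_label = np.ravel(y[:, label].toarray())\n",
    "    per_label_classifiers.append(GaussianNB().fit(X_dense, y_label))\n",
    "\n",
    "# one single-label classifier was trained per label\n",
    "len(per_label_classifiers)"
   ]
  },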
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To solve the resulting single-label subproblems, scikit-multilearn maintains compatibility with the [scikit-learn data format for single-label classifiers](http://scikit-learn.org/stable/modules/multiclass.html), which expects:\n",
    "\n",
    "- ``X`` to have an ``(n_samples, n_features)`` shape and be one of the following:\n",
    "\n",
    "\t- an ``array-like`` of ``array-likes``, which usually means a nested array, where the ``i``-th row and ``j``-th column are addressed as ``X[i][j]``; in many cases the classifiers expect the ``array-like`` to be an ``np.array``\n",
    "\t- a dense matrix of the type ``np.matrix``\n",
    "\t- a scipy sparse matrix\n",
    "\n",
    "- ``y`` to be a one-dimensional ``array-like`` of shape ``(n_samples,)`` with one class value per sample, which is a natural representation of a single-label problem\n",
    "\n",
    "The data set is stored in sparse matrices for efficiency. However, not all scikit-learn classifiers support matrix input and sparse representations. For this reason, every scikit-multilearn classifier that follows a problem transformation approach admits a ``require_dense`` parameter in its constructor. As these scikit-multilearn classifiers transform the multi-label problem to a set of single-label problems and solve them using scikit-learn base classifiers, the ``require_dense`` parameter controls which format of the transformed input and output space is passed to the base classifier.\n",
    "\n",
    "The parameter ``require_dense`` expects a two-element list, ``[bool or None, bool or None]``, which controls the input and output space formats, respectively. If an element is ``None``, the base classifier will receive a dense representation if it does not inherit from :class:`skmultilearn.base.MLClassifierBase`; otherwise the forwarded representation will be sparse. The dense representation for ``X`` is a ``numpy.matrix``, while for ``y`` it is a ``numpy.array`` of ``int`` (scikit-learn's required format for the output space).\n",
    "\n",
    "Scikit-learn's expected format is described [in the scikit-learn docs](http://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification-format) and assumes that:\n",
    "\n",
    "- ``X`` is provided either as a ``numpy.matrix``, a scipy sparse matrix, or as array-likes of array-likes (vectors) of features, i.e. an array of row vectors that consist of input features of the same length (the feature/attribute count); for example a two-object set with each row being a small 1px x 2px image with RGB channels (three ``int8`` values describing the red, green and blue colors of each pixel): ``[[128,10,10,20,30,128], [10,155,30,10,155,10]]``. Scikit-multilearn will expect a matrix representation and will forward a matrix representation to the base classifier\n",
    "- ``y`` is expected to be provided as an array of array-likes\n",
    "\n",
    "Some scikit-learn classifiers support a sparse representation of ``X``, especially for textual data. To have it forwarded as such to the scikit-learn classifier, pass ``require_dense = [False, None]`` to the scikit-multilearn classifier's constructor. If you are sure that the base classifier you use can handle a sparse matrix representation of ``y``, pass ``require_dense = [None, False]``. Pass ``require_dense = [False, False]`` if both ``X`` and ``y`` are supported in sparse representation.\n"
   ]
  },
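  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal usage sketch (the choice of ``SVC`` as the base classifier is arbitrary; any scikit-learn classifier that accepts a sparse ``X`` would do), the cell below builds a Binary Relevance classifier that forwards ``X`` to the base classifier in sparse form while converting ``y`` to a dense array of integers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skmultilearn.problem_transform import BinaryRelevance\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# first element: keep X sparse; second element: hand y to the base\n",
    "# classifier as a dense array of ints\n",
    "classifier = BinaryRelevance(classifier=SVC(), require_dense=[False, True])\n",
    "classifier.fit(X, y)\n",
    "\n",
    "# predictions come back as a sparse binary indicator matrix;\n",
    "# here we simply predict on the training data for illustration\n",
    "predictions = classifier.predict(X)"
   ]
  },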
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}