{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Developer documentation\n", "\n", "Scikit-multilearn development team is an open international community that welcomes contributions and new developers. This document is for you if you want to implement a new:\n", "\n", "- classifier\n", "- relationship graph builder\n", "- label space clusterer\n", "\n", "Before we can go into development details, we need to discuss how to setup a comfortable development environment and what is the best way to contribute.\n", "\n", "\n", "### Working with the repository\n", "\n", "Scikit-learn is developed on github using git for code version management. To get the current codebase you need to checkout the scikit-multilearn repository\n", "\n", "```\n", "git clone git@github.com:scikit-multilearn/scikit-multilearn.git\n", "```\n", "\n", "To make a contribution to the repository your should fork the repository, clone your fork, and start development based on the `master` branch. Once you're done, push your commits to your repository and submit a pull request for review. \n", "\n", "The review usually includes:\n", "- making sure that your code works, i.e. it has enough unit tests and tests pass\n", "- reading your code's documentation, it should follow the numpydoc standard\n", "- checking whether your code works properly on sparse matrix input\n", "- your class should not store more data in memory than neccessary\n", "\n", "Once your contributions adhere to reviewer comments, your code will be included in the next release.\n", "\n", "### Development Docker image\n", "\n", "To ease development and testing we provide a docker image containing all libraries needed to test all of scikit-multilearn codebase. It is an ubuntu based docker image with libraries that are very costly to compile such as python-graphtool. 
This Docker image can be easily integrated with your PyCharm environment.\n", "\n", "To pull the [scikit-multilearn docker image](https://github.com/scikit-multilearn/development-docker) from Docker Hub just use:\n", "\n", "```bash\n", "$ docker pull niedakh/scikit-multilearn-dev:latest\n", "```\n", "\n", "This image contains two Python environments set up for scikit-multilearn: 2.7 and 3.x. To use the first one run `python2` and `pip2`; the second is available via `python3` and `pip3`.\n", "\n", "After cloning the scikit-multilearn repository, start a container, replacing `YOUR_CLONE_DIR` with the path to your clone:\n", "```bash\n", "$ docker run -e \"MEKA_CLASSPATH=/opt/meka/lib\" -v \"YOUR_CLONE_DIR:/home/python-dev/repo\" --name scikit_multilearn_dev_test_docker -p 8888:8888 -d niedakh/scikit-multilearn-dev:latest\n", "```\n", "\n", "To run the tests under the Python 2.7 environment use:\n", "```bash\n", "$ docker exec -it scikit_multilearn_dev_test_docker python2 -m pytest /home/python-dev/repo\n", "```\n", "\n", "or for Python 3.x use:\n", "```bash\n", "$ docker exec -it scikit_multilearn_dev_test_docker python3 -m pytest /home/python-dev/repo\n", "```\n", "\n", "To play around just log in with:\n", "```bash\n", "$ docker exec -it scikit_multilearn_dev_test_docker bash\n", "```\n", "\n", "To start Jupyter notebook run:\n", "\n", "```bash\n", "$ docker exec -it scikit_multilearn_dev_test_docker bash -c \"cd /home/python-dev/repo && jupyter notebook\"\n", "```\n", "\n", "### Building documentation\n", "\n", "In order to build the HTML documentation just run:\n", "\n", "```bash\n", "$ docker exec -it scikit_multilearn_dev_test_docker bash -c \"cd /home/python-dev/repo/docs && make html\"\n", "```\n", "\n", "\n", "### Development\n", "\n", "One of the most comfortable ways to work on the library is to use [PyCharm](https://www.jetbrains.com/pycharm/) 
and its [support for docker-contained interpreters](https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html). Just configure access to the Docker server, set it up in PyCharm, use `niedakh/scikit-multilearn-dev:latest` as the image name and set up the relevant path mappings. You can then use this environment for development, debugging and running tests within the IDE. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing code\n", "\n", "At the very least you should make sure that your code:\n", "\n", "- works on Python 2 and Python 3 on Windows 10/Linux/OSX, as checked by the Travis/AppVeyor builds\n", "\n", "- follows the PEP8 coding guidelines\n", "\n", "- follows scikit-learn interfaces if relevant interfaces exist\n", "\n", "- is documented in the [numpydoc fashion](http://numpydoc.readthedocs.io/en/latest/format.html); in particular, all public API should be documented, including attributes and an example use case (see existing code for inspiration)\n", "\n", "- has tests written; you can find relevant tests in ``skmultilearn.cluster.tests`` and ``skmultilearn.problem_transform.tests``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing a label space clusterer\n", "\n", "One of the approaches to multi-label classification is to cluster the label space into subspaces and perform classification in smaller subproblems to reduce the risk of under/overfitting.\n", "\n", "In order to create your own label space clusterer you need to inherit from :class:`LabelSpaceClustererBase` and implement the ``fit_predict(X, y)`` method. Expect ``X`` and ``y`` to be sparse matrices; you can also use :func:`skmultilearn.utils.get_matrix_in_format` to convert them to a desired matrix format. ``fit_predict(X, y)`` should return an array-like (preferably ``ndarray`` or at least a ``list``) of ``n_clusters`` subarrays which contain lists of labels present in a given cluster. 
An example of a correct partition of five labels is: ``np.array([[0,1], [2,3,4]])`` and of overlapping clusters: ``np.array([[0,1,2], [2,3,4]])``.\n", "\n", "\n", "### Example Clusterer\n", "\n", "Let us look at a toy example, where a clusterer divides the label space based on the remainder of each label's index modulo a given number of clusters." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emotions:train - exists, not redownloading\n", "emotions:test - exists, not redownloading\n" ] } ], "source": [ "X_train, y_train, _, _ = load_dataset('emotions', 'train')\n", "X_test, y_test, _, _ = load_dataset('emotions', 'test')" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n", "from skmultilearn.cluster.base import LabelSpaceClustererBase\n", "\n", "\n", "class ModuloClusterer(LabelSpaceClustererBase):\n", "    \"\"\"Assigns labels to clusters by label index modulo ``n_clusters``\n", "\n", "    Parameters\n", "    ----------\n", "    n_clusters: int\n", "        number of clusters to partition into\n", "    \"\"\"\n", "    def __init__(self, n_clusters = None):\n", "        super(ModuloClusterer, self).__init__()\n", "        self.n_clusters = n_clusters\n", "\n", "    def fit_predict(self, X, y):\n", "        \"\"\"Partitions the label space\n", "\n", "        Returns\n", "        -------\n", "        array-like of array-like, (n_clusters,)\n", "            list of lists of label indexes, each sublist represents labels\n", "            that are in that cluster\n", "        \"\"\"\n", "        n_labels = y.shape[1]\n", "        partition_list = [[] for _ in range(self.n_clusters)]\n", "        for label in range(n_labels):\n", "            partition_list[label % self.n_clusters].append(label)\n", "        return np.array(partition_list)\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "array([[0, 3],\n", " [1, 4],\n", " [2, 5]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusterer = ModuloClusterer(n_clusters=3)\n", "clusterer.fit_predict(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the example Clusterer\n", "Such a clusterer can then be used with an ensemble classifier such as the ``LabelSpacePartitioningClassifier``." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n", "from skmultilearn.problem_transform import LabelPowerset\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LabelSpacePartitioningClassifier(classifier=LabelPowerset(classifier=GaussianNB(priors=None), require_dense=[True, True]),\n", " clusterer=ModuloClusterer(n_clusters=3),\n", " require_dense=[False, False])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LabelSpacePartitioningClassifier(\n", " classifier = LabelPowerset(classifier=GaussianNB()),\n", " clusterer = clusterer\n", ")\n", "clf " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.23762376237623761" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train)\n", "prediction = clf.predict(X_test)\n", "accuracy_score(y_test, prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing a Graph Builder\n", "\n", "Scikit-multilearn implements clusterers that are capable of infering label space clusters (in network science the word communities is used more often) from a graph/network depicting label relationships. 
These clusterers are further described in the [Label relations](labelrelations.ipynb) chapter of the user guide.\n", "\n", "To implement your own graph builder you need to subclass `GraphBuilderBase` and implement the `transform` function, which should return a weighted (or unweighted) adjacency matrix in the form of a dictionary, with keys ``(label1, label2)`` and values representing a weight.\n", "\n", "\n", "### Example GraphBuilder\n", "\n", "Let's implement a simple graph builder which returns the correlations between labels." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "from skmultilearn.cluster import GraphBuilderBase\n", "from skmultilearn.utils import get_matrix_in_format\n", "\n", "class LabelCorrelationGraphBuilder(GraphBuilderBase):\n", "    \"\"\"Builds a graph with label correlations on edge weights\"\"\"\n", "\n", "    def transform(self, y):\n", "        \"\"\"Generate weighted adjacency matrix from label matrix\n", "\n", "        This function generates a weighted label correlation\n", "        graph based on input binary label vectors\n", "\n", "        Parameters\n", "        ----------\n", "        y : numpy.ndarray or scipy.sparse\n", "            dense or sparse binary matrix with shape\n", "            ``(n_samples, n_labels)``\n", "\n", "        Returns\n", "        -------\n", "        dict\n", "            weight map with a tuple of ints as keys\n", "            and a float value ``{ (int, int) : float }``\n", "        \"\"\"\n", "        label_data = get_matrix_in_format(y, 'csc')\n", "        labels = range(label_data.shape[1])\n", "\n", "        self.is_weighted = True\n", "\n", "        edge_map = {}\n", "\n", "        for label_1 in labels:\n", "            for label_2 in range(0, label_1+1):\n", "                # calculate the Pearson correlation coefficient for each label pair;\n", "                # the graph is undirected, so only edges on and above the diagonal are included\n", "                column_1 = label_data[:,label_1].toarray().ravel()  # flatten to 1-D for pearsonr\n", "                column_2 = label_data[:,label_2].toarray().ravel()\n", "                pearson_r, _ = stats.pearsonr(column_2, column_1)\n", "                edge_map[(label_2, label_1)] = float(pearson_r)\n", "\n", "        return edge_map\n" ] }, { 
"cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "graph_builder = LabelCorrelationGraphBuilder()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{(0, 0): 1.0,\n", " (0, 1): 0.0054205072520802679,\n", " (0, 2): -0.4730507042031965,\n", " (0, 3): -0.35907118960632034,\n", " (0, 4): -0.32287762681546733,\n", " (0, 5): 0.24883125852376733,\n", " (1, 1): 1.0,\n", " (1, 2): 0.1393556218283642,\n", " (1, 3): -0.25112700233108359,\n", " (1, 4): -0.3343594619173676,\n", " (1, 5): -0.36277277605002756,\n", " (2, 2): 1.0,\n", " (2, 3): 0.34204580629202336,\n", " (2, 4): 0.23107157941324433,\n", " (2, 5): -0.56137098197912705,\n", " (3, 3): 1.0,\n", " (3, 4): 0.48890609122000817,\n", " (3, 5): -0.35949125643829821,\n", " (4, 4): 1.0,\n", " (4, 5): -0.28842101609587079,\n", " (5, 5): 1.0}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_builder.transform(y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This adjacency matrix can be then used by a Label Graph clusterer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the example GraphBuilder" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 5], [1], [2], [3, 4]], dtype=object)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skmultilearn.cluster import NetworkXLabelGraphClusterer\n", "clusterer = NetworkXLabelGraphClusterer(graph_builder=graph_builder)\n", "clusterer.fit_predict(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clusterer can be then used with the LabelSpacePartitioning classifier." 
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.13861386138613863" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skmultilearn.ensemble import LabelSpacePartitioningClassifier\n", "from skmultilearn.problem_transform import LabelPowerset\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.metrics import accuracy_score\n", "\n", "clf = LabelSpacePartitioningClassifier(\n", " classifier = LabelPowerset(classifier=GaussianNB()),\n", " clusterer = clusterer\n", ")\n", "clf.fit(X_train, y_train)\n", "prediction = clf.predict(X_test)\n", "accuracy_score(y_test, prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing a classifier\n", "\n", "To implement a multi-label classifier you need to subclass a classifier base class. Currently, you can select of a few classifier base classes depending on which approach to multi-label classification you follow.\n", "\n", "Scikit-multilearn inheritance tree for the classifier is shown on the figure below.\n", "\n", "![Classifier inheritance diagram][inheritance]\n", "\n", "[inheritance]: inheritance.png\n", "\n", "\n", "To implement a scikit-learn's ecosystem compatible classifier, we need to subclass two classes from sklearn.base: BaseEstimator and ClassifierMixin. For that we provide :class:`skmultilearn.base.MLClassifierBase` base class. We further extend this class with properties specific to the problem transformation approach in multi-label classification in :class:`skmultilearn.base.ProblemTransformationBase`.\n", "\n", "To implement a scikit-learn's ecosystem compatible classifier, we need to subclass two classes from sklearn.base: BaseEstimator and ClassifierMixin. For that we provide :class:`skmultilearn.base.MLClassifierBase` base class. 
We further extend this class with properties specific to the problem transformation approach in multi-label classification in :class:`skmultilearn.base.ProblemTransformationBase`.\n", "\n", "### Scikit-learn base classses\n", "\n", "#### BaseEstimator\n", "\n", "The base estimator class from scikit is responsible for providing the ability of cloning classifiers, for example when multiple instances of the same classifier are needed for cross-validation performed using the CrossValidation class.\n", "\n", "The class provides two functions responsible for that: ``get_params``, which fetches parameters from a classifier object and ``set_params``, which sets params of the target clone. The params should also be acceptable by the constructor.\n", "\n", "#### ClassifierMixin\n", "\n", "This is an interface with a non-important method that allows different classes in scikit to detect that our classifier behaves as a classifier (i.e. implements ``fit``/``predict`` etc.) and provides certain kind of outputs.\n", "\n", "\n", "### MLClassifierBase\n", "\n", "The base multi-label classifier in scikit-multilearn is :class:`skmultilearn.base.MLClassifierBase`. It provides two abstract methods: fit(X, y) to train the classifier and predict(X) to predict labels for a set of samples. These functions are expected from every classifier. It also provides a default implementation of get_params/set_params that works for multi-label classifiers.\n", "\n", "All you need to do in your classifier is: \n", "\n", "1. subclass ``MLClassifierBase`` or a derivative class\n", "2. set ``self.copyable_attrs`` in your class's constructor to a list of fields (as strings), that should be cloned (usually it is equal to the list of constructor's arguments)\n", "3. implement the ``fit`` method that trains your classifier\n", "4. 
implement the ``predict`` method that predicts results\n", "\n", "#### Copyable fields\n", "\n", "One of the most important concepts in scikit-learn's ``BaseEstimator`` is cloning. Scikit-learn provides many methods for running experiments, cross-validation among others, which require the ability to clone a classifier. Scikit-multilearn's base multi-label class, ``MLClassifierBase``, provides infrastructure for automatic cloning support.\n", "\n", "\n", "An example of this would be: \n", "\n", "```python\n", "from skmultilearn.base import MLClassifierBase\n", "\n", "class AssignKBestLabels(MLClassifierBase):\n", "    \"\"\"Assigns k most frequent labels\n", "\n", "    Parameters\n", "    ----------\n", "    k : int\n", "        number of most frequent labels to assign\n", "\n", "    Example\n", "    -------\n", "    An example use case for AssignKBestLabels:\n", "\n", "    .. code-block:: python\n", "\n", "        from skmultilearn. import AssignKBestLabels\n", "\n", "        # initialize the classifier to assign 3 labels\n", "        classifier = AssignKBestLabels(\n", "            k = 3\n", "        )\n", "\n", "        # train\n", "        classifier.fit(X_train, y_train)\n", "\n", "        # predict\n", "        predictions = classifier.predict(X_test)\n", "    \"\"\"\n", "\n", "\n", "    def __init__(self, k = None):\n", "        super(AssignKBestLabels, self).__init__()\n", "        self.k = k\n", "        self.copyable_attrs = ['k']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The fit method\n", "\n", "The ``fit(self, X, y)`` method expects classifier training data:\n", "\n", "- ``X`` should be a sparse matrix of shape ``(n_samples, n_features)``, although for compatibility reasons an array of arrays and a dense matrix are also supported.\n", "\n", "- ``y`` should be a sparse binary indicator matrix of shape ``(n_samples, n_labels)`` with 1 in position ``i,j`` when the ``i``-th sample is labelled with label no. ``j``\n", "\n", "It should return ``self`` after the classifier has been fitted to training data. 
It is customary for ``fit`` to remember the number of labels. In practice we store ``n_labels`` as ``self.label_count`` in scikit-multilearn classifiers.\n", "\n", "Let's make our classifier trainable:\n", "```python\n", "    def fit(self, X, y):\n", "        \"\"\"Fits classifier to training data\n", "\n", "        Parameters\n", "        ----------\n", "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n", "            input feature matrix\n", "        y : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n", "            binary indicator matrix with label assignments\n", "\n", "        Returns\n", "        -------\n", "        self\n", "            fitted instance of self\n", "        \"\"\"\n", "        # use the ``y`` argument rather than a global training matrix\n", "        self.n_labels = y.shape[1]\n", "        frequencies = (y.sum(axis=0)/float(y.sum().sum())).A.tolist()[0]\n", "        labels_sorted_by_frequency = sorted(range(y.shape[1]), key = lambda i: frequencies[i])\n", "        self.labels_to_assign = labels_sorted_by_frequency[:self.k]\n", "\n", "        return self\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The predict and predict_proba methods\n", "\n", "The ``predict(self, X)`` method returns a prediction of labels for the samples from ``X``:\n", "\n", "- ``X`` should be a sparse matrix of shape ``(n_samples, n_features)``, although for compatibility reasons an array of arrays and a dense matrix are also supported.\n", "\n", "The returned value is similar to ``y`` in ``fit``. It should be a sparse binary indicator matrix of shape ``(n_samples, n_labels)``.\n", "\n", "In some cases, while scikit-learn continues to progress towards a complete switch to sparse matrices, it might be necessary to convert the sparse matrix to a `dense matrix` or even an `array-like of array-likes`. Such is the case for some scoring functions in scikit-learn. 
This problem should go away in future versions of scikit-learn.\n", "\n", "The ``predict_proba(self, X)`` method works similarly but returns the likelihood of each label being correctly assigned to the samples from ``X``.\n", "\n", "Let's add the prediction functionality to our classifier and see how it works:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.10396039603960396" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skmultilearn.base import MLClassifierBase\n", "from scipy.sparse import lil_matrix\n", "\n", "class AssignKBestLabels(MLClassifierBase):\n", "    \"\"\"Assigns k most frequent labels\n", "\n", "    Parameters\n", "    ----------\n", "    k : int\n", "        number of most frequent labels to assign\n", "\n", "    Example\n", "    -------\n", "    An example use case for AssignKBestLabels:\n", "\n", "    .. code-block:: python\n", "\n", "        from skmultilearn. import AssignKBestLabels\n", "\n", "        # initialize the classifier to assign 3 labels\n", "        classifier = AssignKBestLabels(\n", "            k = 3\n", "        )\n", "\n", "        # train\n", "        classifier.fit(X_train, y_train)\n", "\n", "        # predict\n", "        predictions = classifier.predict(X_test)\n", "    \"\"\"\n", "\n", "    def __init__(self, k = None):\n", "        super(AssignKBestLabels, self).__init__()\n", "        self.k = k\n", "        self.copyable_attrs = ['k']\n", "\n", "    def fit(self, X, y):\n", "        \"\"\"Fits classifier to training data\n", "\n", "        Parameters\n", "        ----------\n", "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n", "            input feature matrix\n", "        y : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n", "            binary indicator matrix with label assignments\n", "\n", "        Returns\n", "        -------\n", "        self\n", "            fitted instance of self\n", "        \"\"\"\n", "        self.n_labels = y.shape[1]\n", "        frequencies = 
(y.sum(axis=0)/float(y.sum().sum())).A.tolist()[0]\n", "        labels_sorted_by_frequency = sorted(range(y.shape[1]), key = lambda i: frequencies[i])\n", "        self.labels_to_assign = labels_sorted_by_frequency[:self.k]\n", "\n", "        return self\n", "\n", "    def predict(self, X):\n", "        \"\"\"Predict labels for X\n", "\n", "        Parameters\n", "        ----------\n", "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n", "            input feature matrix\n", "\n", "        Returns\n", "        -------\n", "        :mod:`scipy.sparse` matrix of `{0, 1}`, shape=(n_samples, n_labels)\n", "            binary indicator matrix with label assignments\n", "        \"\"\"\n", "        # build an empty sparse matrix directly, without a dense intermediate\n", "        prediction = lil_matrix((X.shape[0], self.n_labels), dtype=int)\n", "        prediction[:,self.labels_to_assign] = 1\n", "\n", "        return prediction\n", "\n", "    def predict_proba(self, X):\n", "        \"\"\"Predict probabilities of label assignments for X\n", "\n", "        Parameters\n", "        ----------\n", "        X : `array_like`, :class:`numpy.matrix` or :mod:`scipy.sparse` matrix, shape=(n_samples, n_features)\n", "            input feature matrix\n", "\n", "        Returns\n", "        -------\n", "        :mod:`scipy.sparse` matrix of `float in [0.0, 1.0]`, shape=(n_samples, n_labels)\n", "            matrix with label assignment probabilities\n", "        \"\"\"\n", "        probabilities = lil_matrix((X.shape[0], self.n_labels), dtype=float)\n", "        probabilities[:,self.labels_to_assign] = 1.0\n", "\n", "        return probabilities\n", "\n", "clf = AssignKBestLabels(k=2)\n", "clf.fit(X_train, y_train)\n", "prediction = clf.predict(X_test)\n", "accuracy_score(y_test, prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Selecting the base class\n", "\n", "Madjarov et al. 
divide approaches to multi-label classification into three categories. You should select a scikit-multilearn base class according to the philosophy behind your classifier:\n", "\n", "- algorithm adaptation, when a single-label algorithm is directly adapted to the multi-label case, e.g. Decision Trees can be adapted by taking multiple labels into consideration in decision functions; for now the base class for this approach is ``MLClassifierBase``\n", "\n", "- problem transformation, when the multi-label problem is transformed to a set of single-label problems, solved there and converted to a multi-label solution afterwards; for this approach we provide a convenient ``ProblemTransformationBase`` base class\n", "\n", "- ensemble classification, when the multi-label classification is performed by an ensemble of multi-label classifiers to improve performance, overcome overfitting etc. If your classifier concentrates on clustering the label space, you should use :class:`LabelSpacePartitioningClassifier`, which partitions a label space using a clusterer class that implements the :class:`LabelSpaceClustererBase` interface.\n", "\n", "\n", "#### Problem transformation\n", "\n", "The problem transformation approach is centred around the idea of converting a multi-label problem into one or more single-label problems, which are usually solved by single- or multi-class classifiers. Scikit-learn is the de facto standard source of Python implementations of single-label classifiers.\n", "\n", "To perform the transformation, every problem transformation classifier needs a base classifier. As all classifiers that follow scikit-learn's ``BaseEstimator`` are clonable, scikit-multilearn's base class for problem transformation classifiers requires an instance of a base classifier in initialization. 
Such an instance can be cloned if needed, and its parameters can be set up comfortably.\n", "\n", "The biggest problem with joining single-label scikit-learn classifiers with multi-label classifiers is that there exists no way to learn whether a given scikit-learn classifier accepts sparse matrices as input for the ``fit``/``predict`` functions. For this reason ``ProblemTransformationBase`` requires another parameter, ``require_dense`` : ``[ bool, bool ]``, a list/tuple of two boolean values. If the first one is true, the base classifier expects a dense (scikit-learn-compatible array-like of array-likes) representation of the sample feature space ``X``. If the second one is true, the target space ``y`` is passed to the base classifier as an array-like of numbers. If either of these is false, the corresponding argument is passed as a sparse matrix.\n", "\n", "If the ``require_dense`` argument is not passed, it is set to ``[False, False]`` if the base classifier inherits from :class:`MLClassifierBase` and to ``[True, True]`` as a fallback otherwise. In short, it assumes a dense representation is required for the base classifier if the base classifier is not a scikit-multilearn classifier.\n", "\n", "\n", "\n", "#### Ensemble classification\n", "\n", "Ensemble classification is an approach of transforming a multi-label classification problem into a family (an ensemble) of multi-label subproblems. \n", "\n", "\n", "\n", "### Unit testing classifiers\n", "\n", "Scikit-multilearn provides a base unit test class for testing classifiers. 
Please check ``skmultilearn.tests.classifier_basetest`` for a general framework for testing a multi-label classifier.\n", "\n", "Currently the tests check three capabilities of the classifier:\n", "- whether the classifier works with dense/sparse input data: :func:`ClassifierBaseTest.assertClassifierWorksWithSparsity`\n", "- whether the classifier predicts probabilities using ``predict_proba`` for dense/sparse input data: :func:`ClassifierBaseTest.assertClassifierPredictsProbabilities`\n", "- whether it is clonable and works with scikit-learn's cross-validation classes: :func:`ClassifierBaseTest.assertClassifierWorksWithCV`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }