{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How to select a classifier\n", "\n", "This document will guide you through the process of selecting a classifier for your problem.\n", "\n", "Note that there is no established, scientifically proven rule-set for selecting a classifier to solve a general multi-label classification problem. Succesful approaches often come from mixing intuitions about which classifiers are worth considering, decomposition in to subproblems, and experimental model selection.\n", "\n", "There are two things you need to consider before choosing a classifier:\n", "\n", "- performance, i.e. generalization quality, how well will the model understand the relationship between features and labels, note that there for different use cases you might want to measure the quality using different measures, we'll talk about the measures in a moment \n", "- efficiency, i.e. how fast the classifier will perform, does it scale, is it usable in your problem based on number of labels, samples or label combinations\n", "\n", "There are two ways to make the choice:\n", "- intuition based on asymptotic performance and results from empirical studies\n", "- data-driven model selection using cross-validated parameter search\n", "\n", "Let's load up a data set to see have some thing to work on first." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emotions:train - does not exists downloading\n", "Downloaded emotions-train\n", "emotions:test - does not exists downloading\n", "Downloaded emotions-test\n" ] } ], "source": [ "X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')\n", "X_test, y_test, _, _ =load_dataset('emotions', 'test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually classifier's performance depends on three elements:\n", "\n", "- number of samples\n", "- number of labels\n", "- number of unique label classes\n", "- number of features\n", "\n", "We can obtain the first two from the shape of our output space matrices:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((391, 6), (202, 6))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use numpy and the list of rows with non-zero values in output matrices to get the number of unique label combinations." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((26,), (21,))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.unique(y_train.rows).shape, np.unique(y_test.rows).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of features can be found in the shape of the input matrix:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "72" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intutions\n", "\n", "### Generalization quality measures\n", "\n", "\n", "There are several ways to measure a classifier's generalization quality:\n", "\n", "- [Hamming loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html#sklearn.metrics.hamming_loss) measures how well the classifier predicts each of the labels, averaged over samples, then over labels \n", "- [accuracy score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) measures how well the classifier predicts label combinations, averaged over samples\n", "- [jaccard similarity](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn.metrics.jaccard_similarity_score) measures the proportion of predicted labels for a sample to its correct assignment, averaged over samples\n", "- [precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) measures how many samples with ,\n", "- [recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) measures how many samples , \n", "- [F1 
score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) measures the harmonic mean of precision and recall, where both have the same impact on the score \n", "\n", "These measures are conveniently provided by sklearn:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.adapt import MLkNN\n", "classifier = MLkNN(k=3)\n", "prediction = classifier.fit(X_train, y_train).predict(X_test)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2953795379537954" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn.metrics as metrics\n", "\n", "metrics.hamming_loss(y_test, prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Performance\n", "\n", "Scikit-multilearn provides 11 classifiers that cover a wide variety of classification scenarios through label partitioning and ensemble classification. Let's look at the important factors influencing performance. $ g(x) $ denotes the performance of the base classifier used inside some of the classifiers.\n", "\n", "
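To see how the measures listed above differ, they can be computed side by side on the same prediction (a minimal sketch on toy label-indicator matrices invented for illustration; in the notebook, `y_test` and `prediction` would take their place; note that newer scikit-learn versions expose Jaccard as `jaccard_score` rather than `jaccard_similarity_score`):

```python
import numpy as np
import sklearn.metrics as metrics

# Toy binary indicator matrices: rows are samples, columns are labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Hamming loss: 2 wrong label assignments out of 9 -> 0.2222...
print(metrics.hamming_loss(y_true, y_pred))
# Accuracy: 1 exactly matched label combination out of 3 samples -> 0.3333...
print(metrics.accuracy_score(y_true, y_pred))
# Per-sample Jaccard (intersection over union), averaged -> 0.6666...
print(metrics.jaccard_score(y_true, y_pred, average='samples'))
# Micro-averaged F1: precision 1.0, recall 0.6 -> 0.75
print(metrics.f1_score(y_true, y_pred, average='micro'))
```

A classifier can thus look good on Hamming loss while scoring poorly on accuracy, because accuracy requires the entire label combination to match.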