{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with scikit-multilearn\n", "\n", "\n", "\n", "Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.\n", "\n", "To install it just run the command:\n", "\n", "```bash\n", "$ pip install scikit-multilearn\n", "```\n", "\n", "Scikit-multilearn works with Python 2 and 3 on Windows, Linux and OSX. The module name is `skmultilearn`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load up some data. In this tutorial we will be working with the `emotions` data set introduced in [emotions](http://cos).\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emotions:train - exists, not redownloading\n", "emotions:test - exists, not redownloading\n" ] } ], "source": [ "X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')\n", "X_test, y_test, _, _ = load_dataset('emotions', 'test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `feature_names` variable contains list of pairs (feature name, type) that were provided in the original data set. In the case of `emotions` data the authors write: \n", "\n", "> The extracted features fall into two categories: rhythmic and timbre. \n", "\n", "Let's take a look at the first few features:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(u'Mean_Acc1298_Mean_Mem40_Centroid', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_Rolloff', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_Flux', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_0', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_1', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_2', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_3', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_4', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_5', u'NUMERIC'),\n", " (u'Mean_Acc1298_Mean_Mem40_MFCC_6', u'NUMERIC')]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_names[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `label_names` variable contains list of pairs (label name, type) of labels that were used to annotate the music. The paper states that:\n", "\n", "> The Tellegen-Watson-Clark model was employed for labeling\n", "the data with emotions. The sound clips were annotated by three male experts of\n", "age 20, 25 and 30 from the School of Music Studies \n", "\n", "The labels counts in the training data are as follows:\n", "\n", "|Label|Description|# Examples|\n", "|-----|-----------|----------|\n", "|L1|amazed-surprised|173|\n", "|L2|happy-pleased|166|\n", "|L3|relaxing-calm|264|\n", "|L4|quiet-still|148|\n", "|L5|sad-lonely|168|\n", "|L6|angry-fearful|189|\n", "\n", "Let's see the contents of `label_names`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(u'amazed-suprised', [u'0', u'1']),\n", " (u'happy-pleased', [u'0', u'1']),\n", " (u'relaxing-calm', [u'0', u'1']),\n", " (u'quiet-still', [u'0', u'1']),\n", " (u'sad-lonely', [u'0', u'1']),\n", " (u'angry-aggresive', [u'0', u'1'])]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.problem_transform import BinaryRelevance\n", "from sklearn.svm import SVC" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "clf = BinaryRelevance(\n", " classifier=SVC(),\n", " require_dense=[False, True]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On a side note, Binary Relevance trains a classifier per each of the labels, we can see that the classifier hasn't been trained yet:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'BinaryRelevance' object has no attribute 'classifiers'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mclf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclassifiers\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'BinaryRelevance' object has no attribute 'classifiers'" ] } ], "source": [ "clf.classifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn introduces a convention of how classifiers are organized. The typical usage of classifier is:\n", "\n", "- fit it to the data (trains the classifier and returns self)\n", "- predict results on new data (returns predicted results)\n", "\n", "Scikit-multilearn follows these conventions, let's train a multi-label classifier:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BinaryRelevance(classifier=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " require_dense=[False, True])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The base classifiers have been trained now:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False),\n", " SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.classifiers" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "prediction = clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<202x6 sparse matrix of type ''\n", "\twith 246 stored elements in Compressed Sparse Column format>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prediction" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "## Measure the quality" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import sklearn.metrics as metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn provides a [set of metrics](http://scikit-learn.org/stable/modules/classes.html#classification-metrics) useful for evaluating the quality of the model. They are most often used by providing the true assignment matrix/array as the first argument, and the prediction matrix/array as the second argument." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.26485148514851486" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metrics.hamming_loss(y_test, prediction)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.14356435643564355" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metrics.accuracy_score(y_test, prediction)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }