{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset handling\n", "\n", "Scikit-multilearn provides methods to load, save and manipulate multi-label data sets in two formats:\n", "\n", "- a scikit-multilearn pickle of data set in scipy sparse format\n", "- the traditional ARFF file format\n", "\n", "The functionality is provided in the :mod:`skmultilearn.dataset` module.\n", "\n", "Scikit-multilearn also provides a repository of most popular benchmark data sets in the scipy sparse format and convienience functions to access them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## scikit-multilearn format" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_dataset_dump, save_dataset_dump" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading scikit-multilearn data format is easier as it stores more information than the ARFF file, all you need to do is specify the path to the data set file." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "X, y, feature_names, label_names = load_dataset_dump('_static/example.pkl.bz2')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(<65x19 sparse matrix of type ''\n", " \twith 491 stored elements in LInked List format>,\n", " <65x7 sparse matrix of type ''\n", " \twith 217 stored elements in LInked List format>,\n", " [('landmass', ['1', '2', '3', '4', '5', '6']),\n", " ('zone', ['1', '2', '3', '4']),\n", " ('area', 'NUMERIC')],\n", " [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y, feature_names[:3], label_names[:3]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'X': <10x4 sparse matrix of type ''\n", " \twith 27 stored elements in LInked List format>,\n", " 'y': <10x3 sparse matrix of type ''\n", " \twith 16 stored elements in LInked List format>,\n", " 'features': [('landmass', ['1', '2', '3', '4', '5', '6']),\n", " ('zone', ['1', '2', '3', '4']),\n", " ('area', 'NUMERIC'),\n", " ('population', 'NUMERIC')],\n", " 'labels': [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])]}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "save_dataset_dump(X[:10,:4], y[:10, :3], feature_names[:4], label_names[:3], filename=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the ``filename`` argument is not ``None`` this dictionary is saved as a bzip2 compressed pickle and the function does not return anything." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# scikit-multilearn repository" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import available_data_sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following benchmark data sets, originally provided in the [MULAN data repository](http://mulan.sourceforge.net/datasets-mlc.html) are provided in ``train``, ``test``, and ``undivided`` variants. The undivided variant contains the complete data set, before the train/test split." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Corel5k',\n", " 'bibtex',\n", " 'birds',\n", " 'delicious',\n", " 'emotions',\n", " 'enron',\n", " 'genbase',\n", " 'mediamill',\n", " 'medical',\n", " 'rcv1subset1',\n", " 'rcv1subset2',\n", " 'rcv1subset3',\n", " 'rcv1subset4',\n", " 'rcv1subset5',\n", " 'scene',\n", " 'tmc2007_500',\n", " 'yeast'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set([x[0] for x in available_data_sets().keys()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variants:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'test', 'train', 'undivided'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set([x[1] for x in available_data_sets().keys()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-multilearn can automatically download the data sets for you, similar to [scikit-learn's data set API](http://scikit-learn.org/stable/datasets/index.html).\n", "\n", "The data is stored by default in the subfolder ``scikit_ml_learn_data`` of your ``SCIKIT_ML_LEARN_DATA`` environment variable. If the variable is not set, the data is stored in ``~/scikit_ml_learn_data``.\n", "\n", "To download a data set use the :meth:`load_dataset` function." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_dataset" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scene - exists, not redownloading\n" ] } ], "source": [ "X, y, feature_names, label_names = load_dataset('scene', 'train')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(<1211x294 sparse matrix of type ''\n", " \twith 351805 stored elements in LInked List format>,\n", " <1211x6 sparse matrix of type ''\n", " \twith 1286 stored elements in LInked List format>,\n", " [('Att1', 'NUMERIC'), ('Att2', 'NUMERIC'), ('Att3', 'NUMERIC')],\n", " [('Beach', ['0', '1']), ('Sunset', ['0', '1']), ('FallFoliage', ['0', '1'])])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y, feature_names[:3], label_names[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ARFF files\n", "\n", "The most common way for storing multi-label data is the [ARFF file format](https://www.cs.waikato.ac.nz/ml/weka/arff.html) created by the [WEKA](https://www.cs.waikato.ac.nz/ml/weka/) library. You can find many benchmark data sets in ARFF format on the [MULAN data repository](http://mulan.sourceforge.net/datasets-mlc.html).\n", "\n", "Loading both dense and sparse ARFF files is simple in scikit-multilearn, just use :func:`load_from_arff`, like this:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import load_from_arff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading multi-label ARFF files requires additional information as the number or placement of labels, is not indicated in the format directly." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "path_to_arff_file = '_static/example.arff'\n", "label_count = 7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different software expects labels in different parts of the ARFF file:\n", "\n", "- MEKA expects labels to appear at the beginning of the file\n", "- MULAN expects labels to appear at the end of the file\n", "\n", "As the `example.arff` comes from MULAN, we set the label location to ``end``." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "label_location=\"end\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to save ARFF data:\n", "- dense, where the file contains a complete dump of the data set row by row, including places where the value is 0\n", "- sparse, as a dictionary of keys, where for each row the non-zero elements are listed with their index\n", "\n", "The example file is not sparse, that's why we set the ``load_sparse`` argument to ``False``" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "arff_file_is_sparse = False" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X, y = load_from_arff(\n", " path_to_arff_file, \n", " label_count=label_count, \n", " label_location=label_location,\n", " load_sparse=arff_file_is_sparse\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or if you also want the metadata: feature and label names:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "X, y, feature_names, label_names = load_from_arff(\n", " path_to_arff_file, \n", " label_count=label_count, \n", " label_location=label_location,\n", " load_sparse=arff_file_is_sparse,\n", " return_attribute_definitions=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see scikit-multilearn encodes nominal types by default as integers, and converts the input space to floats, while the output space to binary indicators 0/1 represented as integers. To change this behavior specify your own params to `load_from_arff` as described in the API documentation." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(<65x19 sparse matrix of type ''\n", " \twith 491 stored elements in LInked List format>,\n", " <65x7 sparse matrix of type ''\n", " \twith 217 stored elements in LInked List format>,\n", " [('landmass', ['1', '2', '3', '4', '5', '6']),\n", " ('zone', ['1', '2', '3', '4']),\n", " ('area', 'NUMERIC')],\n", " [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y, feature_names[:3], label_names[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to save ARFF files, you can use the :meth:`save_arff` function, which can both return a string containing an ARFF dump of the data set, or save it to a provided file when the `filename` argument is passed." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.dataset import save_to_arff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we want to save a subset of the data in a sparse format and with labels at the begining of the file." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "% traindata\n", "@RELATION \"traindata: -C 3\"\n", "\n", "@ATTRIBUTE y0 {0, 1}\n", "@ATTRIBUTE y1 {0, 1}\n", "@ATTRIBUTE y2 {0, 1}\n", "@ATTRIBUTE X0 NUMERIC\n", "@ATTRIBUTE X1 NUMERIC\n", "@ATTRIBUTE X2 NUMERIC\n", "@ATTRIBUTE X3 NUMERIC\n", "\n", "@DATA\n", "{ 0 1,3 3.0,5 1001.0,6 47.0 }\n", "{ 2 1,3 1.0,4 2.0,5 178.0,6 3.0 }\n", "{ 0 1,2 1,3 1.0,4 3.0,5 76.0,6 2.0 }\n", "{ 0 1,2 1,3 5.0,4 1.0 }\n", "{ 0 1,3 4.0,5 47.0,6 1.0 }\n", "{ 2 1,4 3.0 }\n", "{ 0 1,2 1,3 4.0,5 121.0,6 18.0 }\n", "{ 0 1,1 1,3 2.0,5 301.0,6 57.0 }\n", "{ 0 1,1 1,3 4.0 }\n", "{ 0 1,1 1,3 3.0,5 2388.0,6 20.0 }\n", "\n" ] } ], "source": [ "print(save_to_arff(X[:10,:4], y[:10, :3], label_location='start', save_sparse=True))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }