3. Dataset handling

Scikit-multilearn provides methods to load, save and manipulate multi-label data sets in two formats:

  • a scikit-multilearn pickle of the data set in scipy sparse format
  • the traditional ARFF file format

The functionality is provided in the :mod:skmultilearn.dataset module.

Scikit-multilearn also provides a repository of the most popular benchmark data sets in the scipy sparse format, along with convenience functions to access them.

3.1. scikit-multilearn format

In [1]:
from skmultilearn.dataset import load_dataset_dump, save_dataset_dump

Loading the scikit-multilearn data format is easier, as it stores more information than an ARFF file: all you need to do is specify the path to the data set file.

In [2]:
X, y, feature_names, label_names = load_dataset_dump('_static/example.pkl.bz2')
In [3]:
X, y, feature_names[:3], label_names[:3]
Out[3]:
(<65x19 sparse matrix of type '<class 'numpy.float64'>'
    with 491 stored elements in LInked List format>,
 <65x7 sparse matrix of type '<class 'numpy.int64'>'
    with 217 stored elements in LInked List format>,
 [('landmass', ['1', '2', '3', '4', '5', '6']),
  ('zone', ['1', '2', '3', '4']),
  ('area', 'NUMERIC')],
 [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])
In [4]:
save_dataset_dump(X[:10,:4], y[:10, :3], feature_names[:4], label_names[:3], filename=None)
Out[4]:
{'X': <10x4 sparse matrix of type '<class 'numpy.float64'>'
    with 27 stored elements in LInked List format>,
 'y': <10x3 sparse matrix of type '<class 'numpy.int64'>'
    with 16 stored elements in LInked List format>,
 'features': [('landmass', ['1', '2', '3', '4', '5', '6']),
  ('zone', ['1', '2', '3', '4']),
  ('area', 'NUMERIC'),
  ('population', 'NUMERIC')],
 'labels': [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])]}

If the filename argument is not None, this dictionary is saved as a bzip2-compressed pickle and the function does not return anything.
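For example, a minimal sketch of dumping the same subset to disk and reading it back, using the two functions imported above (the file name my_subset.pkl.bz2 is arbitrary here):

# save the subset as a bzip2-compressed pickle; nothing is returned
save_dataset_dump(X[:10, :4], y[:10, :3], feature_names[:4], label_names[:3],
                  filename='my_subset.pkl.bz2')

# load it back from the same file
X_sub, y_sub, sub_features, sub_labels = load_dataset_dump('my_subset.pkl.bz2')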

3.2. scikit-multilearn repository

In [5]:
from skmultilearn.dataset import available_data_sets

The following benchmark data sets, originally available in the MULAN data repository, are provided in train, test, and undivided variants. The undivided variant contains the complete data set, before the train/test split.

In [6]:
set([x[0] for x in available_data_sets().keys()])
Out[6]:
{'Corel5k',
 'bibtex',
 'birds',
 'delicious',
 'emotions',
 'enron',
 'genbase',
 'mediamill',
 'medical',
 'rcv1subset1',
 'rcv1subset2',
 'rcv1subset3',
 'rcv1subset4',
 'rcv1subset5',
 'scene',
 'tmc2007_500',
 'yeast'}

Variants:

In [7]:
set([x[1] for x in available_data_sets().keys()])
Out[7]:
{'test', 'train', 'undivided'}

Scikit-multilearn can automatically download the data sets for you, similar to scikit-learn’s data set API.

By default the data is stored in the folder pointed to by the SCIKIT_ML_LEARN_DATA environment variable. If the variable is not set, the data is stored in ~/scikit_ml_learn_data.
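To check where downloaded files will be cached, you can query the data home directly. The helper get_data_home below is assumed to be exposed by :mod:skmultilearn.dataset (it backs the download machinery), so treat this as a sketch:

from skmultilearn.dataset import get_data_home

# print the directory used to cache downloaded data sets
# (assumed helper: resolves SCIKIT_ML_LEARN_DATA and falls back to
#  ~/scikit_ml_learn_data)
print(get_data_home())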

To download a data set use the :meth:load_dataset function.

In [8]:
from skmultilearn.dataset import load_dataset
In [9]:
X, y, feature_names, label_names = load_dataset('scene', 'train')
scene - exists, not redownloading
In [10]:
X, y, feature_names[:3], label_names[:3]
Out[10]:
(<1211x294 sparse matrix of type '<class 'numpy.float64'>'
    with 351805 stored elements in LInked List format>,
 <1211x6 sparse matrix of type '<class 'numpy.int64'>'
    with 1286 stored elements in LInked List format>,
 [('Att1', 'NUMERIC'), ('Att2', 'NUMERIC'), ('Att3', 'NUMERIC')],
 [('Beach', ['0', '1']), ('Sunset', ['0', '1']), ('FallFoliage', ['0', '1'])])

3.3. ARFF files

The most common way of storing multi-label data is the ARFF file format, created by the WEKA library. You can find many benchmark data sets in ARFF format in the MULAN data repository.

Loading both dense and sparse ARFF files is simple in scikit-multilearn: just use :func:load_from_arff, like this:

In [11]:
from skmultilearn.dataset import load_from_arff

Loading multi-label ARFF files requires additional information, as the number and placement of labels are not indicated in the format directly.

In [12]:
path_to_arff_file = '_static/example.arff'
label_count = 7

Different software expects labels in different parts of the ARFF file:

  • MEKA expects labels to appear at the beginning of the file
  • MULAN expects labels to appear at the end of the file

As the example.arff comes from MULAN, we set the label location to end.

In [13]:
label_location="end"

There are two ways to save ARFF data:

  • dense, where the file contains a complete dump of the data set row by row, including places where the value is 0
  • sparse, as a dictionary of keys, where for each row the non-zero elements are listed with their index

The example file is not sparse, so we set the load_sparse argument to False.

In [14]:
arff_file_is_sparse = False
In [15]:
X, y = load_from_arff(
    path_to_arff_file,
    label_count=label_count,
    label_location=label_location,
    load_sparse=arff_file_is_sparse
)

Or if you also want the metadata: feature and label names:

In [16]:
X, y, feature_names, label_names = load_from_arff(
    path_to_arff_file,
    label_count=label_count,
    label_location=label_location,
    load_sparse=arff_file_is_sparse,
    return_attribute_definitions=True
)

As you can see, scikit-multilearn by default encodes nominal attributes as integers, converts the input space to floats, and represents the output space as binary 0/1 indicators stored as integers. To change this behavior, pass your own parameters to load_from_arff as described in the API documentation; a sketch follows the output below.

In [17]:
X, y, feature_names[:3], label_names[:3]
Out[17]:
(<65x19 sparse matrix of type '<class 'numpy.float64'>'
    with 491 stored elements in LInked List format>,
 <65x7 sparse matrix of type '<class 'numpy.int64'>'
    with 217 stored elements in LInked List format>,
 [('landmass', ['1', '2', '3', '4', '5', '6']),
  ('zone', ['1', '2', '3', '4']),
  ('area', 'NUMERIC')],
 [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])
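For instance, here is a minimal sketch of overriding the defaults to keep integer features and unencoded nominal values. The parameter names input_feature_type and encode_nominal are assumptions here, so verify them against the load_from_arff API documentation before relying on them:

# assumed parameters: input_feature_type, encode_nominal -- check the
# load_from_arff API documentation before relying on these names
X_raw, y_raw = load_from_arff(
    path_to_arff_file,
    label_count=label_count,
    label_location=label_location,
    load_sparse=arff_file_is_sparse,
    input_feature_type='int',
    encode_nominal=False
)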

If you want to save ARFF files, you can use the :meth:save_to_arff function, which can either return a string containing an ARFF dump of the data set or save it to a file when the filename argument is passed.

In [18]:
from skmultilearn.dataset import save_to_arff

Let’s say we want to save a subset of the data in a sparse format and with labels at the beginning of the file.

In [19]:
print(save_to_arff(X[:10,:4], y[:10, :3], label_location='start', save_sparse=True))
% traindata
@RELATION "traindata: -C 3"

@ATTRIBUTE y0 {0, 1}
@ATTRIBUTE y1 {0, 1}
@ATTRIBUTE y2 {0, 1}
@ATTRIBUTE X0 NUMERIC
@ATTRIBUTE X1 NUMERIC
@ATTRIBUTE X2 NUMERIC
@ATTRIBUTE X3 NUMERIC

@DATA
{ 0 1,3 3.0,5 1001.0,6 47.0 }
{ 2 1,3 1.0,4 2.0,5 178.0,6 3.0 }
{ 0 1,2 1,3 1.0,4 3.0,5 76.0,6 2.0 }
{ 0 1,2 1,3 5.0,4 1.0 }
{ 0 1,3 4.0,5 47.0,6 1.0 }
{ 2 1,4 3.0 }
{ 0 1,2 1,3 4.0,5 121.0,6 18.0 }
{ 0 1,1 1,3 2.0,5 301.0,6 57.0 }
{ 0 1,1 1,3 4.0 }
{ 0 1,1 1,3 3.0,5 2388.0,6 20.0 }
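To write the dump to a file instead of working with the returned string, pass the filename argument mentioned above; a sketch (the name example_subset.arff is arbitrary here):

# write the same sparse, labels-first dump to a file
# (example_subset.arff is an arbitrary file name)
save_to_arff(X[:10, :4], y[:10, :3], label_location='start',
             save_sparse=True, filename='example_subset.arff')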