3. Dataset handling
Scikit-multilearn provides methods to load, save and manipulate multi-label data sets in two formats:
- a scikit-multilearn pickle of the data set in scipy sparse format
- the traditional ARFF file format
The functionality is provided in the skmultilearn.dataset module.
Scikit-multilearn also provides a repository of the most popular benchmark data sets in the scipy sparse format, along with convenience functions to access them.
3.1. scikit-multilearn format
In [1]:
from skmultilearn.dataset import load_dataset_dump, save_dataset_dump
Loading the scikit-multilearn data format is easier because it stores more information than an ARFF file: all you need to do is specify the path to the data set file.
In [2]:
X, y, feature_names, label_names = load_dataset_dump('_static/example.pkl.bz2')
In [3]:
X, y, feature_names[:3], label_names[:3]
Out[3]:
(<65x19 sparse matrix of type '<class 'numpy.float64'>'
with 491 stored elements in LInked List format>,
<65x7 sparse matrix of type '<class 'numpy.int64'>'
with 217 stored elements in LInked List format>,
[('landmass', ['1', '2', '3', '4', '5', '6']),
('zone', ['1', '2', '3', '4']),
('area', 'NUMERIC')],
[('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])
In [4]:
save_dataset_dump(X[:10,:4], y[:10, :3], feature_names[:4], label_names[:3], filename=None)
Out[4]:
{'X': <10x4 sparse matrix of type '<class 'numpy.float64'>'
with 27 stored elements in LInked List format>,
'y': <10x3 sparse matrix of type '<class 'numpy.int64'>'
with 16 stored elements in LInked List format>,
'features': [('landmass', ['1', '2', '3', '4', '5', '6']),
('zone', ['1', '2', '3', '4']),
('area', 'NUMERIC'),
('population', 'NUMERIC')],
'labels': [('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])]}
If the filename argument is not None, this dictionary is saved as a bzip2-compressed pickle and the function does not return anything.
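To make the phrase "bzip2 compressed pickle" concrete, here is a minimal sketch that uses only the Python standard library. It is not scikit-multilearn's implementation: the helper names write_dump and read_dump are hypothetical, and plain lists stand in for the sparse matrices. Only the dictionary keys are taken from the output above.

```python
import bz2
import os
import pickle
import tempfile


def write_dump(data, filename):
    """Pickle `data` into a bzip2-compressed file (hypothetical helper)."""
    with bz2.BZ2File(filename, "wb") as handle:
        pickle.dump(data, handle)


def read_dump(filename):
    """Load an object written by `write_dump` (hypothetical helper)."""
    with bz2.BZ2File(filename, "rb") as handle:
        return pickle.load(handle)


# Plain lists stand in for the sparse X and y matrices (illustrative values).
dump = {
    "X": [[3.0, 0.0], [1.0, 2.0]],
    "y": [[1, 0], [0, 1]],
    "features": [("landmass", ["1", "2", "3", "4", "5", "6"]),
                 ("zone", ["1", "2", "3", "4"])],
    "labels": [("red", ["0", "1"]), ("green", ["0", "1"])],
}

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl.bz2")
    write_dump(dump, path)
    restored = read_dump(path)

print(restored == dump)
```

As with any pickle, only load dump files that come from a source you trust.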
4. scikit-multilearn repository
In [5]:
from skmultilearn.dataset import available_data_sets
The following benchmark data sets, originally provided in the MULAN data repository, are provided in train, test, and undivided variants. The undivided variant contains the complete data set, before the train/test split.
In [6]:
set([x[0] for x in available_data_sets().keys()])
Out[6]:
{'Corel5k',
'bibtex',
'birds',
'delicious',
'emotions',
'enron',
'genbase',
'mediamill',
'medical',
'rcv1subset1',
'rcv1subset2',
'rcv1subset3',
'rcv1subset4',
'rcv1subset5',
'scene',
'tmc2007_500',
'yeast'}
Variants:
In [7]:
set([x[1] for x in available_data_sets().keys()])
Out[7]:
{'test', 'train', 'undivided'}
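The two cells above rely on each key being a (name, variant) pair. As a small sketch of working with keys of that shape, the snippet below groups variants by data set name. The sample_index dictionary is a hypothetical stand-in, not the real return value of available_data_sets(); its values are placeholders, since only the key shape is taken from the text.

```python
from collections import defaultdict

# Hypothetical stand-in for the dictionary returned by available_data_sets();
# only the (name, variant) key shape is taken from the text above.
sample_index = {
    ("scene", "train"): None,
    ("scene", "test"): None,
    ("scene", "undivided"): None,
    ("yeast", "train"): None,
    ("yeast", "test"): None,
    ("yeast", "undivided"): None,
}

# Collect the set of variants available for every data set name.
variants_by_name = defaultdict(set)
for name, variant in sample_index:
    variants_by_name[name].add(variant)

for name in sorted(variants_by_name):
    print(name, sorted(variants_by_name[name]))
```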
Scikit-multilearn can automatically download the data sets for you, similar to scikit-learn’s data set API.
By default, the data is stored in the scikit_ml_learn_data subfolder of the location given by the SCIKIT_ML_LEARN_DATA environment variable. If the variable is not set, the data is stored in ~/scikit_ml_learn_data.
To download a data set, use the load_dataset function.
In [8]:
from skmultilearn.dataset import load_dataset
In [9]:
X, y, feature_names, label_names = load_dataset('scene', 'train')
scene - exists, not redownloading
In [10]:
X, y, feature_names[:3], label_names[:3]
Out[10]:
(<1211x294 sparse matrix of type '<class 'numpy.float64'>'
with 351805 stored elements in LInked List format>,
<1211x6 sparse matrix of type '<class 'numpy.int64'>'
with 1286 stored elements in LInked List format>,
[('Att1', 'NUMERIC'), ('Att2', 'NUMERIC'), ('Att3', 'NUMERIC')],
[('Beach', ['0', '1']), ('Sunset', ['0', '1']), ('FallFoliage', ['0', '1'])])
4.1. ARFF files
The most common way of storing multi-label data is the ARFF file format created by the WEKA library. You can find many benchmark data sets in ARFF format in the MULAN data repository.
Loading both dense and sparse ARFF files is simple in scikit-multilearn: just use load_from_arff, like this:
In [11]:
from skmultilearn.dataset import load_from_arff
Loading multi-label ARFF files requires additional information, as neither the number nor the placement of labels is indicated in the format directly.
In [12]:
path_to_arff_file = '_static/example.arff'
label_count = 7
Different software expects labels in different parts of the ARFF file:
- MEKA expects labels to appear at the beginning of the file
- MULAN expects labels to appear at the end of the file
As example.arff comes from MULAN, we set the label location to end.
In [13]:
label_location = "end"
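What the label location means for a single row can be sketched in plain Python. The helper split_features_labels below is hypothetical and is not part of scikit-multilearn; the row values are made up for illustration. It only shows how a label count and a location of "start" or "end" determine which columns are treated as labels.

```python
def split_features_labels(row, label_count, label_location="end"):
    """Split one dense row into (features, labels).

    Hypothetical helper: `label_location` is "end" (MULAN-style) or
    "start" (MEKA-style), following the convention described above.
    """
    if not 0 < label_count <= len(row):
        raise ValueError("label_count must be between 1 and the row length")
    if label_location == "end":
        return row[:-label_count], row[-label_count:]
    if label_location == "start":
        return row[label_count:], row[:label_count]
    raise ValueError("label_location must be 'start' or 'end'")


# The same made-up example, once with labels last and once with labels first.
mulan_row = [4.0, 1.0, 1001.0, 1, 0, 1]
meka_row = [1, 0, 1, 4.0, 1.0, 1001.0]

features_end, labels_end = split_features_labels(mulan_row, 3, "end")
features_start, labels_start = split_features_labels(meka_row, 3, "start")

print(features_end, labels_end)
print(features_start, labels_start)
```

Both calls yield the same features and labels, which is why the loader needs to be told where the labels sit.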
There are two ways to save ARFF data:
- dense, where the file contains a complete dump of the data set row by row, including places where the value is 0
- sparse, as a dictionary of keys, where for each row the non-zero elements are listed with their index
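The difference between the two layouts can be illustrated with a short sketch that renders one row both ways. The two helper functions are hypothetical and are not scikit-multilearn's writer; the sparse layout follows the usual ARFF convention of brace-enclosed, comma-separated "index value" pairs, and the exact spacing is an assumption.

```python
def to_dense_arff_row(values):
    """Render every value of the row, zeros included (hypothetical helper)."""
    return ",".join(str(value) for value in values)


def to_sparse_arff_row(values):
    """Render only non-zero values as 'index value' pairs (hypothetical helper)."""
    pairs = ["{} {}".format(index, value)
             for index, value in enumerate(values)
             if value != 0]
    if not pairs:
        return "{ }"
    return "{ " + ",".join(pairs) + " }"


# Made-up row: three label columns followed by four feature columns.
row = [1, 0, 0, 3.0, 0.0, 1001.0, 47.0]

dense_line = to_dense_arff_row(row)
sparse_line = to_sparse_arff_row(row)

print(dense_line)
print(sparse_line)
```

The sparse form pays off when most entries are zero, as is typical for label matrices.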
The example file is not sparse, which is why we set the load_sparse argument to False.
In [14]:
arff_file_is_sparse = False
In [15]:
X, y = load_from_arff(
path_to_arff_file,
label_count=label_count,
label_location=label_location,
load_sparse=arff_file_is_sparse
)
Or, if you also want the metadata (feature and label names):
In [16]:
X, y, feature_names, label_names = load_from_arff(
path_to_arff_file,
label_count=label_count,
label_location=label_location,
load_sparse=arff_file_is_sparse,
return_attribute_definitions=True
)
As you can see, scikit-multilearn by default encodes nominal types as integers, converts the input space to floats, and converts the output space to binary indicators 0/1 represented as integers. To change this behavior, specify your own parameters to load_from_arff as described in the API documentation.
In [17]:
X, y, feature_names[:3], label_names[:3]
Out[17]:
(<65x19 sparse matrix of type '<class 'numpy.float64'>'
with 491 stored elements in LInked List format>,
<65x7 sparse matrix of type '<class 'numpy.int64'>'
with 217 stored elements in LInked List format>,
[('landmass', ['1', '2', '3', '4', '5', '6']),
('zone', ['1', '2', '3', '4']),
('area', 'NUMERIC')],
[('red', ['0', '1']), ('green', ['0', '1']), ('blue', ['0', '1'])])
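The default encoding described above can be sketched for a single row. This is an illustration only, under the assumption that a nominal value is encoded as its position in the declared list of values; the helper encode_value and the raw row are hypothetical, while the attribute definitions are copied from the output above.

```python
# Attribute definitions in the (name, type) shape shown in the output above.
feature_definitions = [
    ("landmass", ["1", "2", "3", "4", "5", "6"]),
    ("zone", ["1", "2", "3", "4"]),
    ("area", "NUMERIC"),
]


def encode_value(raw, definition):
    """Encode one raw ARFF value as a float (hypothetical helper).

    Assumption: a nominal value maps to its index in the declared list,
    and a numeric value is parsed directly.
    """
    _, kind = definition
    if isinstance(kind, list):
        return float(kind.index(raw))
    return float(raw)


# Made-up raw row, as the strings would appear in a dense ARFF file.
raw_row = ["4", "1", "1001"]
encoded = [encode_value(raw, definition)
           for raw, definition in zip(raw_row, feature_definitions)]

print(encoded)  # [3.0, 0.0, 1001.0]
```

Under this assumption the first declared nominal value encodes to 0, so it is stored as an implicit zero in a sparse matrix.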
If you want to save ARFF files, you can use the save_to_arff function, which either returns a string containing an ARFF dump of the data set or saves it to the provided file when the filename argument is passed.
In [18]:
from skmultilearn.dataset import save_to_arff
Let’s say we want to save a subset of the data in a sparse format, with labels at the beginning of the file.
In [19]:
print(save_to_arff(X[:10,:4], y[:10, :3], label_location='start', save_sparse=True))
% traindata
@RELATION "traindata: -C 3"
@ATTRIBUTE y0 {0, 1}
@ATTRIBUTE y1 {0, 1}
@ATTRIBUTE y2 {0, 1}
@ATTRIBUTE X0 NUMERIC
@ATTRIBUTE X1 NUMERIC
@ATTRIBUTE X2 NUMERIC
@ATTRIBUTE X3 NUMERIC
@DATA
{ 0 1,3 3.0,5 1001.0,6 47.0 }
{ 2 1,3 1.0,4 2.0,5 178.0,6 3.0 }
{ 0 1,2 1,3 1.0,4 3.0,5 76.0,6 2.0 }
{ 0 1,2 1,3 5.0,4 1.0 }
{ 0 1,3 4.0,5 47.0,6 1.0 }
{ 2 1,4 3.0 }
{ 0 1,2 1,3 4.0,5 121.0,6 18.0 }
{ 0 1,1 1,3 2.0,5 301.0,6 57.0 }
{ 0 1,1 1,3 4.0 }
{ 0 1,1 1,3 3.0,5 2388.0,6 20.0 }
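To read the sparse rows above by hand, the following sketch expands one of them back into a dense row and separates the three leading label columns from the four feature columns. The parser is a hypothetical helper for illustration, not scikit-multilearn's loader, and it assumes purely numeric values with no quoting.

```python
def parse_sparse_arff_row(line, width):
    """Expand a sparse ARFF data row into a dense list of floats.

    Hypothetical helper: assumes numeric values only and that `width`
    is the total number of attributes declared in the header.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF row: {!r}".format(line))
    dense = [0.0] * width
    inner = body[1:-1].strip()
    if inner:
        for pair in inner.split(","):
            index, value = pair.split()
            dense[int(index)] = float(value)
    return dense


# First data row of the dump above: 3 label columns, then 4 feature columns.
row = parse_sparse_arff_row("{ 0 1,3 3.0,5 1001.0,6 47.0 }", width=7)
labels, features = row[:3], row[3:]

print(labels)
print(features)
```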