skmultilearn.dataset module

skmultilearn.dataset.available_data_sets()

Lists available data sets and their variants

Returns:	a dictionary keyed by `(set_name, variant_name)` tuples; each value holds the md5 checksum and the file name of that data set variant on the server
Return type:	dict[(set_name, variant_name)] -> [md5, file_name]
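
As an illustration of the documented return shape, the sketch below regroups such a mapping by set name. It works on a hand-written stand-in dictionary rather than on a live call; the set names, variants, checksums and file names are placeholders, and `variants_by_set` is a hypothetical helper, not part of the module.

```python
from collections import defaultdict

# Stand-in for the documented return value of available_data_sets():
# (set_name, variant_name) -> [md5, file_name]. All values are placeholders.
available = {
    ("example_set", "train"): ["<md5-placeholder>", "example_set-train.bz2"],
    ("example_set", "test"): ["<md5-placeholder>", "example_set-test.bz2"],
    ("other_set", "undivided"): ["<md5-placeholder>", "other_set-undivided.bz2"],
}


def variants_by_set(data_sets):
    """Group variant names under their set name (hypothetical helper)."""
    grouped = defaultdict(list)
    # Iterating the dict yields its (set_name, variant_name) keys.
    for set_name, variant_name in data_sets:
        grouped[set_name].append(variant_name)
    return {name: sorted(variants) for name, variants in grouped.items()}


grouped = variants_by_set(available)
for set_name in sorted(grouped):
    print(set_name, grouped[set_name])
```

Each `(set_name, variant_name)` key found this way is the pair of arguments that `download_dataset` and `load_dataset` expect.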

skmultilearn.dataset.clear_data_home(data_home=None)

Delete all the content of the data home cache.

Parameters:	data_home (str (default is None)) – the path to the directory in which scikit-multilearn data sets should be stored.

skmultilearn.dataset.download_dataset(set_name, variant, data_home=None)

Downloads a data set

Parameters:

set_name (str) – name of set from `available_data_sets()`
variant (str) – variant of the data set from `available_data_sets()`
data_home (str, default None) – custom base folder for data; if None, the default is used

Returns:	path to the downloaded data set file on disk
Return type:	str

skmultilearn.dataset.get_data_home(data_home=None, subdirectory='')

Return the path of the scikit-multilearn data directory.

This folder is used by some large dataset loaders to avoid downloading the data several times.

By default the data_home is set to a folder named 'scikit_ml_learn_data' in the user home folder.

Alternatively, it can be set by the 'SCIKIT_ML_LEARN_DATA' environment variable or programmatically by giving an explicit folder path. The '~' symbol is expanded to the user home folder.

If the folder does not already exist, it is automatically created.

Parameters:

data_home (str (default is None)) – the path to the directory in which scikit-multilearn data sets should be stored; if None, the path is generated as stated above
subdirectory (str, default '') – subdirectory to append to the returned path, under data_home if it is passed or under the default location otherwise

Returns:	the path to the data home
Return type:	str
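
The lookup order described above can be sketched with the standard library alone. `resolve_data_home` is a hypothetical re-implementation written for illustration, not the library's code; only the environment variable name and the default folder name come from the description.

```python
import os
import tempfile

# Names taken from the description above.
ENV_VAR = "SCIKIT_ML_LEARN_DATA"
DEFAULT_DIR = os.path.join("~", "scikit_ml_learn_data")


def resolve_data_home(data_home=None, subdirectory=""):
    """Hypothetical sketch of the documented lookup order."""
    if data_home is None:
        # The environment variable takes precedence over the default folder.
        data_home = os.environ.get(ENV_VAR, DEFAULT_DIR)
    if subdirectory:
        data_home = os.path.join(data_home, subdirectory)
    # Expand '~' to the user home folder and create the folder if missing.
    data_home = os.path.expanduser(data_home)
    os.makedirs(data_home, exist_ok=True)
    return data_home


# Usage: an explicit path is used as given and missing folders are created.
with tempfile.TemporaryDirectory() as tmp:
    path = resolve_data_home(os.path.join(tmp, "cache"), subdirectory="arff")
    expected = os.path.join(tmp, "cache", "arff")
    created = os.path.isdir(path)
    print(path)
```

The usage example passes an explicit temporary path so that nothing is created in the real user home folder.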

skmultilearn.dataset.load_dataset(set_name, variant, data_home=None)

Loads a selected variant of the given data set

Parameters:

set_name (str) – name of set from `available_data_sets()`
variant (str) – variant of the data set
data_home (str, default None) – custom base folder for data; if None, the default is used

Returns:	the loaded multilabel data set variant in the scikit-multilearn format, see data_sets
Return type:	dict

skmultilearn.dataset.load_dataset_dump(filename)

Loads a compressed data set dump

Parameters:

filename (str) – path to the dump file; if it does not end in .bz2, the .bz2 extension is appended.

Returns:

X (array_like, `numpy.matrix` or `scipy.sparse` matrix, shape=(n_samples, n_features)) – input feature matrix
y (array_like, `numpy.matrix` or `scipy.sparse` matrix of {0, 1}, shape=(n_samples, n_labels)) – binary indicator matrix with label assignments
names of attributes (List[str]) – list of attribute names for X columns
names of labels (List[str]) – list of label names for y columns

skmultilearn.dataset.load_from_arff(filename, label_count, label_location='end', input_feature_type='float', encode_nominal=True, load_sparse=False, return_attribute_definitions=False)

Method for loading ARFF files into input feature and label matrices

Parameters:

filename (str) – path to ARFF file
label_count (integer) – number of labels in the ARFF file
label_location (str {"start", "end"} (default is "end")) – whether the ARFF file contains labels at the beginning of the attributes list (“start”, MEKA format) or at the end (“end”, MULAN format)
input_feature_type (numpy.type as string (default is "float")) – the desired type of the contents of the returned ‘X’ array-likes; should be a numpy type, see http://docs.scipy.org/doc/numpy/user/basics.types.html
encode_nominal (bool (default is True)) – whether to convert categorical data into numeric factors; required for some scikit classifiers that can’t handle non-numeric input features.
load_sparse (boolean (default is False)) – whether to read the ARFF file as a sparse file format; liac-arff breaks if sparse reading is enabled for non-sparse ARFFs.
return_attribute_definitions (boolean (default is False)) – whether to return the definitions for each attribute in the dataset

Returns:

X (`scipy.sparse.lil_matrix` of input_feature_type, shape=(n_samples, n_features)) – input feature matrix
y (`scipy.sparse.lil_matrix` of {0, 1}, shape=(n_samples, n_labels)) – binary indicator matrix with label assignments
names of attributes (List[str]) – list of attribute names from ARFF file
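
To make the label position and label count arguments concrete, the sketch below splits a single data row into features and labels under both conventions. It uses plain Python lists and a hypothetical helper, `split_features_and_labels`; it illustrates the convention only and is not the loader's implementation.

```python
def split_features_and_labels(row, label_count, label_location="end"):
    """Split one data row into (features, labels). Hypothetical helper."""
    if label_location == "start":
        # MEKA convention: the first label_count attributes are labels.
        return row[label_count:], row[:label_count]
    if label_location == "end":
        # MULAN convention: the last label_count attributes are labels.
        split = len(row) - label_count
        return row[:split], row[split:]
    raise ValueError("label_location must be 'start' or 'end'")


# The same sample written in both layouts: 3 features, 2 labels.
mulan_row = [0.5, 1.2, 3.0, 1, 0]
meka_row = [1, 0, 0.5, 1.2, 3.0]

mulan_split = split_features_and_labels(mulan_row, 2, "end")
meka_split = split_features_and_labels(meka_row, 2, "start")
print(mulan_split)
print(meka_split)
```

Both calls yield the same features and labels, which is why the label count and the label position have to match the file being read.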

skmultilearn.dataset.save_dataset_dump(input_space, labels, feature_names, label_names, filename=None)

Saves a compressed data set dump

Parameters:

input_space (array-like of array-likes) – array-like of input feature vectors
labels (array-like of binary label vectors) – array-like of labels assigned to each input vector, as a binary indicator vector (i.e. if the 5th position has value 1, the input vector has label no. 5)
feature_names (array-like, optional) – names of features
label_names (array-like, optional) – names of labels
filename (str, optional) – path to the dump file; if it does not end in .bz2, the .bz2 extension is appended.
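
The extension rule and the save/load round trip can be sketched with the standard library. The on-disk layout used here (a pickled dictionary inside a bz2 file) is an assumption made for illustration, and `with_bz2_extension`, `save_dump` and `load_dump` are hypothetical stand-ins; files written by this sketch are not claimed to be interchangeable with the library's dumps.

```python
import bz2
import os
import pickle
import tempfile


def with_bz2_extension(filename):
    """Append '.bz2' unless the name already ends with it."""
    return filename if filename.endswith(".bz2") else filename + ".bz2"


def save_dump(input_space, labels, feature_names, label_names, filename):
    """Hypothetical stand-in: pickle the four parts into a bz2 file."""
    payload = {
        "X": input_space,
        "y": labels,
        "features": feature_names,
        "labels": label_names,
    }
    with bz2.open(with_bz2_extension(filename), "wb") as handle:
        pickle.dump(payload, handle)


def load_dump(filename):
    """Hypothetical stand-in: read back the four parts in the same order."""
    with bz2.open(with_bz2_extension(filename), "rb") as handle:
        payload = pickle.load(handle)
    return payload["X"], payload["y"], payload["features"], payload["labels"]


# Usage: save without the extension, load with it; both name the same file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "toy_dump")
    save_dump([[0.1, 0.2], [0.3, 0.4]], [[1, 0], [0, 1]],
              ["f1", "f2"], ["l1", "l2"], target)
    on_disk = sorted(os.listdir(tmp))
    X, y, feature_names, label_names = load_dump(target + ".bz2")
    print(on_disk, feature_names, label_names)
```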

skmultilearn.dataset.save_to_arff(X, y, label_location='end', save_sparse=True, filename=None)

Method for dumping data to ARFF files

Parameters:

X (array_like, `numpy.matrix` or `scipy.sparse` matrix, shape=(n_samples, n_features)) – input feature matrix
y (array_like, `numpy.matrix` or `scipy.sparse` matrix of {0, 1}, shape=(n_samples, n_labels)) – binary indicator matrix with label assignments
label_location (string {"start", "end"} (default is "end")) – whether the ARFF file will contain labels at the beginning of the attributes list (“start”, MEKA format) or at the end (“end”, MULAN format)
save_sparse (boolean (default is True)) – whether to save in ARFF’s sparse dictionary-like format instead of listing all zeroes within the file; very useful in multi-label classification.
filename (str or None) – path to the ARFF file; if None, the ARFF representation is returned as a string

Returns:	the ARFF dump string, if filename is None
Return type:	str or None
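
The sketch below shows why the sparse format helps with mostly-zero label vectors: a sparse ARFF data row lists only the non-zero entries as zero-based `index value` pairs inside braces. `dense_row` and `sparse_row` are hypothetical helpers that format a single row for illustration; the exact spacing the library writes may differ.

```python
def dense_row(values):
    """Format a row the dense way: every value, comma-separated."""
    return ",".join(str(v) for v in values)


def sparse_row(values):
    """Format a row the sparse way: only non-zero 'index value' pairs."""
    pairs = ["%d %s" % (index, value)
             for index, value in enumerate(values) if value != 0]
    return "{" + ", ".join(pairs) + "}"


# A label vector with 8 labels, 2 of them assigned.
label_vector = [0, 0, 1, 0, 0, 0, 1, 0]

dense = dense_row(label_vector)
sparse = sparse_row(label_vector)
print(dense)   # 0,0,1,0,0,0,1,0
print(sparse)  # {2 1, 6 1}
```

The saving grows with the number of labels, since the sparse row length depends on the number of assigned labels rather than on the total label count.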