Developing a label space clusterer
One of the approaches to multi-label classification is to cluster the label space into subspaces and perform classification in smaller subproblems to reduce the risk of under/overfitting.
scikit-multilearn follows scikit-learn’s ClusterMixin interface for clustering classes.
scikit-learn concentrates on single-label classification, in which one usually clusters the input space X and not the output class space. In scikit-learn the clusterer class is expected to take X and provide a clustering of X as an array-like of n_samples elements, each element being the id of the cluster to which the observation is assigned. Thus, if sample no. 3 is assigned to cluster no. 1, result[2] should be equal to 1.
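To make the return convention concrete, here is a minimal sketch in plain Python. ThresholdClusterer is a hypothetical toy class, not part of scikit-learn; it only mimics the shape of a fit_predict result, that is, one cluster id per sample.

```python
class ThresholdClusterer:
    """Hypothetical toy clusterer: cluster 1 if the first feature >= threshold, else 0."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit_predict(self, X, y=None):
        # y is accepted for interface compatibility and ignored,
        # as in scikit-learn-style clusterers.
        return [1 if row[0] >= self.threshold else 0 for row in X]


X = [[0.1], [0.2], [0.9], [0.3]]  # four samples, one feature each
result = ThresholdClusterer(threshold=0.5).fit_predict(X)

# One cluster id per sample; sample no. 3 (index 2) falls in cluster no. 1.
print(result)  # [0, 0, 1, 0]
```

The result has as many elements as X has rows, and indexing is zero-based, which is why the third sample is found at index 2.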
The clusterer for the label space in scikit-multilearn follows this interface. To create your own label space clusterer, inherit from LabelSpaceClustererBase and implement the fit_predict(X, y) method. Expect X and y to be sparse matrices; you can also use skmultilearn.utils.get_matrix_in_format() to convert them to a desired matrix format.
fit_predict(X, y) should return an array-like, preferably an ndarray, of n_labels integers indicating the number of the cluster each label is assigned to, similarly to how it is done in scikit-learn.
Let us look at a toy example in which a clusterer divides the label space by taking each label’s ordinal modulo a given number of clusters.
    # Import path follows this guide; it may differ between scikit-multilearn versions.
    from skmultilearn.ensemble import LabelSpaceClustererBase


    class ModuloClusterer(LabelSpaceClustererBase):
        def __init__(self, number_of_clusters=None):
            super(ModuloClusterer, self).__init__()
            self.number_of_clusters = number_of_clusters

        def fit_predict(self, X, y):
            # y has shape (n_samples, n_labels)
            number_of_labels = y.shape[1]
            # assign each label to cluster no. (label ordinal % number of clusters)
            return [label % self.number_of_clusters
                    for label in range(number_of_labels)]
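The modulo rule itself can be tried out without scikit-multilearn. The sketch below is plain Python; modulo_assignment and group_labels are hypothetical helper names used only for illustration. It shows the per-label cluster ids that a fit_predict of this kind would return, and how they translate into groups of labels.

```python
def modulo_assignment(number_of_labels, number_of_clusters):
    # Cluster id for each label: label ordinal modulo the number of clusters.
    return [label % number_of_clusters for label in range(number_of_labels)]


def group_labels(assignment):
    # Collect the labels that share a cluster id.
    groups = {}
    for label, cluster in enumerate(assignment):
        groups.setdefault(cluster, []).append(label)
    return groups


assignment = modulo_assignment(number_of_labels=5, number_of_clusters=2)
groups = group_labels(assignment)
print(assignment)  # [0, 1, 0, 1, 0]
print(groups)      # {0: [0, 2, 4], 1: [1, 3]}
```

Note that the returned list has one entry per label, not per sample, which is the key difference from clustering the input space.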
Label co-occurrence graph
A feature currently present only in scikit-multilearn is the possibility to divide the label space by analysing a label co-occurrence graph. In such a graph labels are represented as nodes, and edges are generated based on how labels co-occur among samples. In other words, an edge between label no. a and label no. b is present if there exists a sample in X that is labeled with both a and b.
LabelCooccurenceClustererBase provides a method LabelCooccurenceClustererBase.generate_coocurence_adjacency_matrix(X, y) which generates a dict with label pairs as keys and a float edge weight as the value:
- 1.0 in an unweighted setting
- the number of samples in X that are labeled with both labels, in a weighted setting (when
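The edge map described above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the library's code: build_edge_map and its weighted flag are hypothetical names, the label matrix is a dense list of 0/1 rows rather than a sparse matrix, and self-edges are left out.

```python
from itertools import combinations


def build_edge_map(y_rows, weighted=False):
    """Map each co-occurring label pair (a, b), a < b, to a float edge weight."""
    edge_map = {}
    for row in y_rows:
        # Labels assigned to this sample.
        labels = [label for label, value in enumerate(row) if value]
        for pair in combinations(labels, 2):
            if weighted:
                # Count the samples in which both labels occur.
                edge_map[pair] = edge_map.get(pair, 0.0) + 1.0
            else:
                # Presence only: every edge has weight 1.0.
                edge_map[pair] = 1.0
    return edge_map


# Three samples, three labels (hypothetical data).
y_rows = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

print(build_edge_map(y_rows))                 # {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
print(build_edge_map(y_rows, weighted=True))  # {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}
```

Labels 0 and 1 appear together in two samples, so their edge weighs 2.0 in the weighted setting and 1.0 otherwise.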
The edge_map is both returned and stored on the instance as self.edge_map. LabelCooccurenceClustererBase is a base class for building label co-occurrence graphs. Subclass LabelCooccurenceClustererBase and call generate_coocurence_adjacency_matrix at the beginning of your fit_predict as shown below, then build a graph using the edge_map property and infer the communities from the graph. Interfaces for two popular Python graph libraries already exist:
- skmultilearn.ensemble.GraphToolCooccurenceClusterer, which constructs a graph-tool graph object and uses stochastic block modelling for clustering
- skmultilearn.ensemble.IGraphLabelCooccurenceClusterer, which constructs an igraph graph object and allows the use of a variety of igraph’s community detection methods for clustering
To use them, just subclass the relevant class and start your fit_predict method like this:
    def fit_predict(self, X, y):
        self.generate_coocurence_adjacency_matrix(y)
        self.generate_coocurence_graph()
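What follows those two calls depends on the graph library. As a library-free sketch of the remaining step (turning an edge map into one cluster id per label), the code below uses connected components as a deliberately simple stand-in for community detection; it is not the stochastic block model or any igraph method. components_from_edge_map is a hypothetical helper, and the edge map is assumed to be a dict keyed by label pairs as described above.

```python
def components_from_edge_map(edge_map, label_count):
    """Return one cluster id per label, using connected components (union-find)."""
    parent = list(range(label_count))

    def find(node):
        # Follow parent links to the root, compressing the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edge_map:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Renumber roots as consecutive cluster ids in order of first appearance.
    cluster_ids = {}
    result = []
    for label in range(label_count):
        root = find(label)
        cluster_ids.setdefault(root, len(cluster_ids))
        result.append(cluster_ids[root])
    return result


# Hypothetical edge map: labels 0-1-2 are linked, labels 3-4 are linked.
edge_map = {(0, 1): 1.0, (1, 2): 1.0, (3, 4): 1.0}
print(components_from_edge_map(edge_map, label_count=5))  # [0, 0, 0, 1, 1]
```

Connected components ignore edge weights and merge everything that is reachable, so a real community detection method will usually produce a finer partition on densely connected label graphs.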