# Developing a label space clusterer

One of the approaches to multi-label classification is to cluster the label space into subspaces and perform classification in smaller subproblems to reduce the risk of under/overfitting.

`scikit-multilearn` follows scikit-learn's `ClusterMixin` interface API for a clustering class. `scikit-learn` concentrates on single-label classification, in which one usually performs clustering of the input space `X` and not the output class space. In `scikit-learn` the clusterer class is expected to take `X` and `y` and provide a clustering of `X` as an `ndarray` of `n_samples` elements, each element corresponding to the id of the cluster to which the observation is assigned; thus if sample no. 3 is assigned to cluster no. 1, `result[3]` should be equal to 1.
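A minimal illustration of this convention (the cluster ids below are made up for the example):

```python
# A hypothetical clustering of 5 samples into 2 clusters.
# result[i] is the id of the cluster that sample no. i is assigned to.
result = [0, 1, 1, 1, 0]

# Sample no. 3 was assigned to cluster no. 1:
print(result[3])  # prints 1
```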

The label space clusterer in scikit-multilearn follows this interface. To create your own label space clusterer, inherit from `LabelSpaceClustererBase` and implement the `fit_predict(X, y)` method. Expect `X` and `y` to be sparse matrices; you can also use `skmultilearn.utils.get_matrix_in_format()` to convert them to a desired matrix format. `fit_predict(X, y)` should return an array-like, preferably an `ndarray` or a `list`, of `n_labels` integers, each indicating the number of the cluster a given label is assigned to, similarly to how it is done in `scikit-learn` clusterers.

## Example clusterer

Let us look at a toy example, where a clusterer assigns each label to a cluster based on the label's ordinal modulo a given number of clusters.

```
from skmultilearn.ensemble import LabelSpaceClustererBase

class ModuloClusterer(LabelSpaceClustererBase):
    def __init__(self, number_of_clusters=None):
        super(ModuloClusterer, self).__init__()
        self.number_of_clusters = number_of_clusters

    def fit_predict(self, X, y):
        number_of_labels = y.shape[1]
        # assign each label to cluster no. (label ordinal % number of clusters)
        return [label % self.number_of_clusters
                for label in range(number_of_labels)]
```

## Label co-occurence graph

A feature currently present only in `scikit-multilearn` is the possibility to divide the label space based on analysing a label co-occurrence graph. In such a graph labels are represented as nodes, and edges are generated based on how labels co-occur among samples. In other words, an edge between label no. `a` and label no. `b` is present if there exists a sample in `X` that is labeled with both `a` and `b`. The `LabelCooccurenceClustererBase` class provides a method, `LabelCooccurenceClustererBase.generate_coocurence_adjacency_matrix(y)`, which generates a dict containing label pairs as keys and a float edge weight as value:

- `1.0` in an unweighted setting
- the number of samples in `X` that are labeled with both labels, in a weighted setting (when `self.is_weighted` is `True`)

The edge map is both returned and stored in the class as `self.edge_map`. This is a base class for building label co-occurrence graphs. Subclass `LabelCooccurenceClustererBase` and call `generate_coocurence_adjacency_matrix` at the beginning of your `fit_predict` as shown below, then build a graph using the `edge_map` property and infer the communities from the graph. Interfaces for two popular Python graph libraries already exist:

- `skmultilearn.ensemble.GraphToolCooccurenceClusterer`, which constructs a graph-tool graph object and uses stochastic block modelling for clustering
- `skmultilearn.ensemble.IGraphLabelCooccurenceClusterer`, which constructs an igraph graph object and allows the use of a variety of igraph's community detection methods for clustering

To use them, just subclass the relevant class and start your `fit_predict` method like this:

```
def fit_predict(self, X, y):
    self.generate_coocurence_adjacency_matrix(y)
    self.generate_coocurence_graph()
```

Your graph object (a graph-tool `Graph` or an igraph `Graph`) is available at `self.coocurence_graph` after those two lines.
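To make the edge-map idea concrete, here is a self-contained sketch, independent of `scikit-multilearn` and its graph-library interfaces, that builds an unweighted co-occurrence edge map from a toy dense label matrix and then clusters labels into the connected components of that graph; the helper names `build_edge_map` and `connected_components` are made up for illustration:

```python
from itertools import combinations

def build_edge_map(y):
    """Unweighted co-occurrence edge map: keys are label-index pairs (a, b),
    a < b, present whenever some row of y has a 1 in both columns."""
    edge_map = {}
    for row in y:
        labels = [i for i, v in enumerate(row) if v == 1]
        for a, b in combinations(labels, 2):
            edge_map[(a, b)] = 1.0
    return edge_map

def connected_components(n_labels, edge_map):
    """A very simple 'community detection': each connected component of the
    co-occurrence graph becomes one label cluster (union-find)."""
    parent = list(range(n_labels))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edge_map:
        parent[find(a)] = find(b)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_labels)]

# Toy data: 4 samples, 5 labels; labels 0-2 co-occur, labels 3-4 co-occur.
y = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
edge_map = build_edge_map(y)
print(sorted(edge_map))                   # [(0, 1), (0, 2), (1, 2), (3, 4)]
print(connected_components(5, edge_map))  # [0, 0, 0, 1, 1]
```

Real clusterers would replace the connected-components step with a proper community detection method (e.g. igraph's algorithms or graph-tool's stochastic block modelling), but the resulting array-like of `n_labels` cluster ids has the same shape either way.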