{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multi-label embedding-based classification " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multi-label embedding techniques emerged as a response the need to cope with a large label space, but with the rise of computing power they became a method of improving classification quality. Typically the embedding-based multi-label classification starts with embedding the label matrix of the training set in some way, training a regressor for unseen samples to predict their embeddings, and a classifier (sometimes very simple ones) to correct the regression error. Scikit-multilearn provides several multi-label embedders alongisde a general regressor-classifier classification class. \n", "\n", "Currently available embedding strategies include: \n", "\n", "- Label Network Embeddings via OpenNE network embedding library, as in the [LNEMLC paper](https://arxiv.org/abs/1812.02956)\n", "- Cost-Sensitive Label Embedding with Multidimensional Scaling, as in the [CLEMS paper](https://github.com/ej0cl6/csmlc)\n", "- scikit-learn based embeddings such as PCA or [manifold learning approaches](https://scikit-learn.org/stable/modules/manifold.html)\n", "\n", "Let's start with loading some data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emotions:train - exists, not redownloading\n", "emotions:test - exists, not redownloading\n" ] } ], "source": [ "import numpy\n", "import sklearn.metrics as metrics\n", "from skmultilearn.dataset import load_dataset\n", "\n", "X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')\n", "X_test, y_test, _, _ = load_dataset('emotions', 'test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Label Network Embeddings\n", "\n", "The label network embeddings approaches require a working tensorflow installation and the OpenNE library. To install them, run the following code:\n", "\n", "```bash\n", "pip install networkx tensorflow\n", "git clone https://github.com/thunlp/OpenNE/\n", "pip install -e OpenNE/src\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![ ](_static/lnemlc.png)\n", "\n", "For an example we will use the LINE embedding method, one of the most efficient and well-performing state of the art approaches, for the meaning of parameters consult the [OpenNE documentation](). We select `order = 3` which means that the method will take both first and second order proximities between labels for embedding. We select a dimension of 5 times the number of labels, as the linear embeddings tend to need more dimensions for best performance, normalize the label weights to maintain normalized distances in the network and agregate label embedings per sample by summation which is a classical approach." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.embedding import OpenNetworkEmbedder \n", "from skmultilearn.cluster import LabelCooccurrenceGraphBuilder" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "graph_builder = LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False)\n", "openne_line_params = dict(batch_size=1000, order=3)\n", "embedder = OpenNetworkEmbedder(\n", " graph_builder, \n", " 'LINE', \n", " dimension = 5*y_train.shape[1], \n", " aggregation_function = 'add', \n", " normalize_weights=True, \n", " param_dict = openne_line_params\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to select a regressor and a classifier, we use random forest regressors with MLkNN which is a well working combination often used for multi-label embedding:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.embedding import EmbeddingClassifier\n", "from sklearn.ensemble import RandomForestRegressor\n", "from skmultilearn.adapt import MLkNN" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pre-procesing for non-uniform negative sampling!\n", "Pre-procesing for non-uniform negative sampling!\n", "epoch:0 sum of loss:4.97153359652\n", "epoch:0 sum of loss:5.11869335175\n", "epoch:1 sum of loss:4.98133981228\n", "epoch:1 sum of loss:4.97720247507\n", "epoch:2 sum of loss:4.81723511219\n", "epoch:2 sum of loss:5.05428689718\n", "epoch:3 sum of loss:5.09079384804\n", "epoch:3 sum of loss:4.72988516092\n", "epoch:4 sum of loss:5.0347994566\n", "epoch:4 sum of loss:4.95063251257\n", "epoch:5 sum of loss:4.68008613586\n", "epoch:5 sum of loss:4.9329983592\n", "epoch:6 sum of loss:4.74205821753\n", "epoch:6 sum of loss:4.68989795446\n", "epoch:7 sum of loss:4.62912601233\n", "epoch:7 sum of loss:4.81548637152\n", "epoch:8 sum of loss:4.40033769608\n", "epoch:8 sum of loss:4.73801320791\n", "epoch:9 sum of loss:4.61178982258\n", "epoch:9 sum of loss:4.91443294287\n" ] } ], "source": [ "clf = EmbeddingClassifier(\n", " embedder,\n", " RandomForestRegressor(n_estimators=10),\n", " MLkNN(k=5)\n", ")\n", "\n", "clf.fit(X_train, y_train)\n", "\n", "predictions = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cost-Sensitive Label Embedding with Multidimensional Scaling\n", "\n", "CLEMS is another well-perfoming method in multi-label embeddings. It uses weighted multi-dimensional scaling to embedd a cost-matrix of unique label combinations. The cost-matrix contains the cost of mistaking a given label combination for another, thus real-valued functions are better ideas than discrete ones. Also, the `is_score` parameter is used to tell the embedder if the cost function is a score (the higher the better) or a loss (the lower the better). Additional params can be also assigned to the weighted scaler. The most efficient parameter for the number of dimensions is equal to number of labels, and is thus enforced here." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.embedding import CLEMS, EmbeddingClassifier\n", "from sklearn.ensemble import RandomForestRegressor\n", "from skmultilearn.adapt import MLkNN\n", "\n", "dimensional_scaler_params = {'n_jobs': -1}\n", "\n", "clf = EmbeddingClassifier(\n", " CLEMS(metrics.jaccard_similarity_score, is_score=True, params=dimensional_scaler_params),\n", " RandomForestRegressor(n_estimators=10, n_jobs=-1),\n", " MLkNN(k=1),\n", " regressor_per_dimension= True\n", ")\n", "\n", "clf.fit(X_train, y_train)\n", "\n", "predictions = clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn based embedders\n", "\n", "Any scikit-learn embedder can be used for multi-label classification embeddings with scikit-multilearn, just select one and try, here's a spectral embedding approach with 10 dimensions of the embedding space:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from skmultilearn.embedding import SKLearnEmbedder, EmbeddingClassifier\n", "from sklearn.manifold import SpectralEmbedding\n", "from sklearn.ensemble import RandomForestRegressor\n", "from skmultilearn.adapt import MLkNN\n", "\n", "clf = EmbeddingClassifier(\n", " SKLearnEmbedder(SpectralEmbedding(n_components = 10)),\n", " RandomForestRegressor(n_estimators=10),\n", " MLkNN(k=5)\n", ")\n", "\n", "clf.fit(X_train, y_train)\n", "\n", "predictions = clf.predict(X_test)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }