Project author: wilsonjr

Project description:
Explaining dimensionality reduction results using SHAP values

Primary language: Jupyter Notebook
Repository: git://github.com/wilsonjr/ClusterShapley.git
Created: 2021-05-12T19:34:40Z
Project community: https://github.com/wilsonjr/ClusterShapley

License: BSD 3-Clause "New" or "Revised" License

.. -*- mode: rst -*-

|pypi_version| |pypi_downloads|

.. |pypi_version| image:: https://img.shields.io/pypi/v/cluster-shapley.svg
   :target: https://pypi.python.org/pypi/cluster-shapley/

.. |pypi_downloads| image:: https://pepy.tech/badge/cluster-shapley/month
   :target: https://pepy.tech/project/cluster-shapley

==============
ClusterShapley
==============

ClusterShapley is a technique to explain non-linear dimensionality reduction results. After reducing the dimensionality to 2D, you can explain the resulting cluster formation. Read the `preprint <https://arxiv.org/abs/2103.05678>`_ or the `publisher <https://doi.org/10.1016/j.eswa.2021.115020>`_ version for further details.


Installation
------------

ClusterShapley depends upon common machine learning libraries, such as scikit-learn and NumPy. It also depends on SHAP.

Requirements:

* shap
* numpy
* scipy
* scikit-learn
* pybind11

If you have these requirements installed, use PyPI:

.. code:: bash

  pip install cluster-shapley

Usage examples
--------------

The ClusterShapley package follows the same pattern as scikit-learn estimators: you fit the explainer on the data and then transform the samples you want to explain.

Explaining cluster formation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose you want to investigate the decisions a dimensionality reduction (DR) technique made when projecting a dataset to 2D. The first step is to project the dataset.

.. code:: python

  import umap
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_iris

  # load the dataset
  data = load_iris()
  X, y = data.data, data.target

  # project to 2D with UMAP
  reducer = umap.UMAP(verbose=0, random_state=0)
  embedding = reducer.fit_transform(X)
  plt.scatter(embedding[:, 0], embedding[:, 1], c=y)

.. image:: docs/artwork/iris.png
   :alt: UMAP embedding of the Iris dataset

Compute explanations
~~~~~~~~~~~~~~~~~~~~

Now, you can generate explanations to understand why UMAP (or any other DR technique) imposed that cluster formation.

.. code:: python

  import random

  import numpy as np

  # our library
  import dr_explainer as dre

  # fit the explainer on the dataset and its labels
  clusterShapley = dre.ClusterShapley()
  clusterShapley.fit(X, y)

  # compute explanations for a random 20% subset of the data
  to_explain = np.array(random.sample(X.tolist(), int(X.shape[0] * 0.2)))
  shap_values = clusterShapley.transform(to_explain)
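As a side note, the 20% sample can also be drawn without converting the array to a Python list. A minimal NumPy sketch, using a synthetic array as a stand-in for the Iris feature matrix:

.. code:: python

  import numpy as np

  # synthetic stand-in for the Iris feature matrix (150 samples, 4 features)
  X = np.arange(150 * 4, dtype=float).reshape(150, 4)

  # reproducible 20% row sample, drawn without replacement
  rng = np.random.default_rng(0)
  idx = rng.choice(X.shape[0], size=int(X.shape[0] * 0.2), replace=False)
  to_explain = X[idx]

  print(to_explain.shape)  # (30, 4)

Using ``default_rng`` with a fixed seed keeps the sample reproducible across runs, which matters if you want to compare explanations between experiments.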

The matrix ``shap_values`` of shape ``(3, 30, 4)`` contains the features' contributions:

* for each class (3);
* for each sample used to generate explanations (30);
* for each feature (4).
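Beyond plotting, the returned array can be summarized numerically. A sketch that ranks features by mean absolute contribution per class, using simulated SHAP values of the shape described above (the real ones come from the ``transform`` call):

.. code:: python

  import numpy as np

  # simulated SHAP values: (3 classes, 30 samples, 4 features)
  rng = np.random.default_rng(42)
  shap_values = rng.normal(size=(3, 30, 4))
  feature_names = ["sepal length (cm)", "sepal width (cm)",
                   "petal length (cm)", "petal width (cm)"]

  # mean absolute contribution per feature, per class -> shape (3, 4)
  mean_abs = np.abs(shap_values).mean(axis=1)
  for klass, row in enumerate(mean_abs):
      ranked = [feature_names[i] for i in np.argsort(row)[::-1]]
      print(f"class {klass}: most important = {ranked[0]}")

This is the same aggregation the beeswarm plot below shows visually: averaging absolute SHAP values over samples gives a global importance score per feature and class.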

Visualize the contributions using SHAP plots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For now, you can rely on the SHAP library to visualize the contributions:

.. code:: python

  import shap

  klass = 0
  c_exp = shap.Explanation(shap_values[klass], data=to_explain,
                           feature_names=data.feature_names)
  shap.plots.beeswarm(c_exp)

.. image:: docs/artwork/explanation_iris0.png
   :alt: Contributions for the embedding of class 0

The plot shows the contribution of each feature to the cohesion of the selected class. For example, for 'petal length (cm)':

* low feature values (blue) contribute to the cohesion of the selected class;
* higher feature values (red) *do not* contribute to the cohesion.
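This blue/red reading can also be checked numerically: the correlation between a feature's values and its SHAP column gives the direction of its effect (negative means low values push toward the class). A sketch with synthetic values standing in for one feature/class pair:

.. code:: python

  import numpy as np

  # synthetic feature values, standing in for petal length (cm)
  rng = np.random.default_rng(0)
  feature = rng.uniform(1.0, 7.0, size=30)

  # synthetic contributions: low feature values -> positive contribution
  shap_col = -(feature - feature.mean())

  # perfectly negative correlation: low values favor the class cohesion
  corr = np.corrcoef(feature, shap_col)[0, 1]
  print(round(corr, 2))  # -1.0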

Defining your own clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose you want to investigate why UMAP clustered two classes together while projecting the third one far away in 2D.

To understand that, we can use ClusterShapley to explain how the features contribute to these two major clusters.

.. code:: python

  from sklearn.cluster import KMeans

  # fit KMeans with two clusters on the 2D embedding
  # (see notebooks/ for the complete code)
  kmeans = KMeans(n_clusters=2, random_state=0).fit(embedding)

.. image:: docs/artwork/kmeans_clusters.png
   :alt: Two clusters returned by KMeans on the embedding

Let's generate explanations, knowing that cluster 0 is on the right and cluster 1 is on the left.

.. code:: python

  clusterShapley = dre.ClusterShapley()
  clusterShapley.fit(X, kmeans.labels_)
  shap_values = clusterShapley.transform(to_explain)

For the right cluster:

.. code:: python

  c_exp = shap.Explanation(shap_values[0], data=to_explain,
                           feature_names=data.feature_names)
  shap.plots.beeswarm(c_exp)

.. image:: docs/artwork/explanation0.png
   :alt: Features' contributions for cluster 0

The right cluster is characterized by low values of petal length (cm), petal width (cm), and sepal length (cm).

For the left cluster:

.. code:: python

  c_exp = shap.Explanation(shap_values[1], data=to_explain,
                           feature_names=data.feature_names)
  shap.plots.beeswarm(c_exp)

.. image:: docs/artwork/explanation1.png
   :alt: Features' contributions for cluster 1

On the other hand, the left cluster (composed of two classes) is characterized by high values of petal length (cm), petal width (cm), and sepal length (cm).


Citation
--------

Please use the following reference for further details and to cite ClusterShapley in your work:

.. code:: bibtex

  @article{MarcilioJr2021_ClusterShapley,
    title = {Explaining dimensionality reduction results using Shapley values},
    journal = {Expert Systems with Applications},
    volume = {178},
    pages = {115020},
    year = {2021},
    issn = {0957-4174},
    doi = {https://doi.org/10.1016/j.eswa.2021.115020},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417421004619},
    author = {Wilson E. Marcílio-Jr and Danilo M. Eler}
  }

License
-------

ClusterShapley follows the 3-clause BSD license.

ClusterShapley uses the open-source `SHAP <https://github.com/slundberg/shap>`_ implementation.
