Fast and memory-efficient clustering + coreset construction, including fast distance kernels for Bregman and f-divergences.
Minicore is a fast, generic library for constructing and clustering coresets on graphs, in metric spaces and under non-metric dissimilarity measures.
It includes methods for constant-factor and bicriteria approximation solutions, as well as coreset sampling algorithms.
These methods allow for fast and accurate summarization and clustering of datasets, with strong theoretical guarantees.
Minicore stands for both “mini” and “core”, as it builds concise representations via core-sets, and is also a portmanteau of Manticore and Minotaur.
Certain applications have specific requirements, such as libosmium for kzclustexp (which computes coresets over OpenStreetMap data), but these dependencies are primarily included as submodules.
Python bindings require numpy, scipy, and a recent C++ compiler capable of compiling C++17.
See python/README.md for an example and installation instructions, or install by running `python3 setup.py install` from the base directory.
On some operating systems, SLEEF must be linked dynamically. If `python3 setup.py install` fails, try `python3 dynsetup.py install`, which will build a dynamic library.
If you are using Conda, `dynsetup.py` will install the libsleef.so/dylib files as necessary; otherwise, you will need to add a directory containing one of these dynamic libraries to LD_LIBRARY_PATH or DYLD_LIBRARY_PATH.
Because minicore compiles distance code for the destination hardware, it is difficult to distribute via PyPI, but it can still be installed in a single command via pip:

```shell
python3 -m pip install git+git://github.com/dnbaker/minicore@main
```
- `CoresetSampler`: methods for building an importance sampling framework, performing sampling, and reweighting
- `lsearch.h`
- `jv_solver.h`
- [Thorup, 2005]-sampling for pruning the search space, in both graph shortest-paths metrics and oracle metrics
- `minicore/streaming.h`: an α-approximate metric clusterer
The eventual goal is, given a dataset, k, and a dissimilarity measure,
to select the correct approximation algorithm and sampling strategy to generate a coreset,
and then to optimize the clustering problem using that coreset.
For Bregman divergences and the squared L2 distance, D^2 sampling works.
For all other measures, we will either use the Thorup-sampled JV/local search approximation method
for metrics or the streaming k-means method for alpha-approximate metrics to achieve the approximate solution.
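D² sampling is the kmeans++ seeding rule: each new center is drawn with probability proportional to a point's squared distance from its nearest already-chosen center. A minimal numpy sketch (the function name `d2_sample` is illustrative and not part of minicore's API):

```python
import numpy as np

def d2_sample(X, k, rng=None):
    """Pick k centers from the rows of X, each chosen with probability
    proportional to squared distance from the nearest prior center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [int(rng.integers(n))]                 # first center uniform
    d2 = ((X - X[centers[0]]) ** 2).sum(axis=1)      # squared distances
    for _ in range(k - 1):
        c = int(rng.choice(n, p=d2 / d2.sum()))      # sample prop. to D^2
        centers.append(c)
        d2 = np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1))
    return centers
```

Since a chosen center has squared distance zero to itself, it can never be drawn twice (for distinct input points).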
Once this is achieved and importance sampling is performed, we optimize the clustering problem over the coreset.
There exists the potential to achieve higher-accuracy clusterings using coresets than with the full
data, because coresets are small enough to permit exhaustive techniques. We have not yet explored this.
graph.h contains a wrapper for boost::adjacency_list
tailored for k-median and other optimal transport problems.
k-center 2-approximation (farthest-point traversal)
Algorithm from:
T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
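Gonzalez's farthest-point traversal is simple enough to sketch in a few lines of numpy (a generic sketch for squared-Euclidean-free metric k-center, not minicore's implementation):

```python
import numpy as np

def gonzalez(X, k):
    """Greedy farthest-point traversal: each new center is the point
    farthest from all centers chosen so far. 2-approximation for k-center."""
    centers = [0]                                    # arbitrary first center
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                   # farthest remaining point
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers
```

Each iteration is a single pass over the data, so the whole traversal costs O(nk) distance evaluations.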
Algorithms from:
Hu Ding, Haikuo Yu, Zixiu Wang
Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction
- `IndexCoreset<IT, FT>`, where IT is the index type (integral) and FT is the weight type (floating-point)
- `MatrixCoreset<MatType, FT>` (matrix type and floating-point weight type): constructed from an IndexCoreset and a matrix, it simply concatenates both matrices and weight vectors, and can be reduced using coreset construction
wrappers in the blz namespace for blaze::DynamicMatrix and blaze::CustomMatrix, with rowiterator()
and columniterator()
functions allowing range-based loops over the rows or columns of a matrix.
structs providing distances under given norms (effectively distance oracles), used in kmeans.h
Uses mio for mmap’d IO. Some of our software uses
in-memory matrices up to a certain size and then falls back to mmap.
Provides norms and distances.
Includes L1, L2, L3, L4, general p-norms, Bhattacharyya, Matusita,
Multinomial and Poisson Bregman divergences, Multinomial Jensen-Shannon Divergence,
and the Multinomial Jensen-Shannon Metric, optionally with priors.
Contains ProbDivApplicator, which is a caching applicator of a particular measure of dissimilarity.
Also contains code for generating D^2 samplers for approximate solutions.
Measures using logs or square roots cache these values.
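As an example of one such measure, the multinomial Jensen-Shannon divergence (whose square root is the Jensen-Shannon metric) can be computed as below. This is a plain numpy sketch, not the cached ProbDivApplicator code path, and the additive `prior` handling is illustrative rather than minicore's exact scheme:

```python
import numpy as np

def jsd(p, q, prior=None):
    """Jensen-Shannon divergence between two discrete distributions.
    An optional additive prior (e.g. a Dirichlet pseudocount) smooths zeros."""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    if prior is not None:
        p += prior
        q += prior
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # 0 * log(0) = 0 by convention
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The divergence is symmetric and bounded above by ln 2; caching the logarithms of the inputs, as the text describes, avoids recomputing them across many pairwise evaluations.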
The k-center 2-approximation is Gonzalez’s algorithm (1985).
The k-center clustering with outliers (2-approximation) and its coreset construction are due to Ding, Yu, and Wang.
The importance sampling framework we use is from the Braverman, Feldman, and Lang paper from 2016,
while its application to graph metrics is from Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu.
We also support the Varadarajan-Xiao, Feldman-Langberg, and Bachem et al. methods for coreset sampling under differing dissimilarity measures.
We use a modified iterative version of the sampling from Thorup’s 2005 paper
for an initial graph bicriteria approximation, as described in the Baker et al. paper above. This is available for shortest-paths graph metrics and oracle metrics in minicore/bicriteria.h.
By default, we use dynamic linking for gcc and libstdc++. If this causes runtime issues on your machine, you can install the Python bindings after setting the MINICORE_STATIC variable:

```shell
cd python
export MINICORE_STATIC=1
python3 setup.py install
# or
MINICORE_STATIC=1 python3 setup.py install
```