A high-performance topological machine learning toolbox in Python

giotto-tda is a high-performance topological machine learning toolbox in Python, built on top of scikit-learn and distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.

Guiding principles

  • Seamless integration with scikit-learn
    Strictly adhere to the scikit-learn API and development guidelines, and inherit the strengths of that framework.
  • Code modularity
    Implement topological feature creation steps as transformers, allowing for the creation of a large number of topologically powered machine learning pipelines.
  • Standardisation
    Implement the most successful techniques from the literature into a generic framework with a consistent API.
  • Innovation
    Improve on existing algorithms, and make new ones available in open source.
  • Performance
    For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Use vectorized code and multi-core parallelism (with joblib).
  • Data structures
    Support for tabular data, time series, graphs, and images.

30s guide to giotto-tda

To install giotto-tda, see the installation instructions.

The functionalities of giotto-tda are provided in scikit-learn–style transformers. This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:

from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()

which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:

diagrams = VR.fit_transform(point_clouds)

A plotting API allows for quick visual inspection of the outputs of many of giotto-tda’s transformers. To visualize the i-th output sample, run

VR.plot(diagrams, sample=i)

You can create scalar or vector features from persistence diagrams using giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:

from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
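To give some intuition for what a persistence entropy measures, here is a small pure-Python sketch of the textbook definition: the Shannon entropy of the normalised lifetimes (death minus birth) in a diagram. The helper persistence_entropy is hypothetical and is not giotto-tda's implementation. The base-2 logarithm and the handling of zero-lifetime points are assumptions, and PersistenceEntropy itself computes one such value per homology dimension.

```python
import math


def persistence_entropy(diagram):
    """Shannon entropy (base 2, by assumption) of normalised lifetimes.

    `diagram` is an iterable of (birth, death) pairs.
    """
    # Points on the diagonal (death == birth) carry no lifetime and are skipped.
    lifetimes = [death - birth for birth, death in diagram if death > birth]
    total = sum(lifetimes)
    if total == 0:
        return 0.0
    probabilities = [lifetime / total for lifetime in lifetimes]
    return -sum(p * math.log2(p) for p in probabilities)


# Two features with equal lifetimes plus one trivial point on the diagonal.
toy_diagram = [(0.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
print(persistence_entropy(toy_diagram))  # 1.0
```

A diagram dominated by a single long-lived feature has entropy close to zero, while many features of similar lifetime push the entropy up.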

The features array returned by PersistenceEntropy is a two-dimensional numpy array. This matters because it lets this type of topological feature generation fit into a typical scikit-learn machine learning workflow. In particular, topological feature creation steps can be fed to or used alongside models from scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation, optimised via grid-searches, etc.:

from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)

giotto-tda also implements the Mapper algorithm as a highly customisable scikit-learn Pipeline, and provides simple plotting functions for visualizing output Mapper graphs and interacting with the pipeline parameters in real time:

from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)


Tutorials and examples

We provide a number of tutorials and examples, which offer:

  • quick start guides to the API;

  • in-depth examples showcasing more of the library’s features;

  • intuitive explanations of topological techniques.

Use cases

A selection of use cases for giotto-tda is collected on this page. The related repositories can be found on GitHub.

What’s new

Major Features and Improvements

This is a major release which substantially broadens the scope of giotto-tda and introduces several improvements. The library’s documentation has been greatly improved and is now hosted via GitHub Pages. It includes rendered Jupyter notebooks from the repository’s examples folder, as well as an improved theory glossary, more detailed installation instructions, improved guidelines for contributing, and an FAQ.

Plotting functions and plotting API

This version introduces built-in plotting capabilities to giotto-tda. These come in the form of:

  • a new plotting subpackage populated with plotting functions for common data structures;

  • a new PlotterMixin and a class-level plotting API based on newly introduced plot, transform_plot and fit_transform_plot methods which are now available in several of giotto-tda’s transformers.

Changes and additions to gtda.homology

The internal structure of this subpackage has been changed. ConsistentRescaling has been moved to a new point_clouds subpackage (see below), and gtda.homology no longer contains a point_clouds submodule. Instead, it contains two submodules, simplicial and cubical. simplicial contains the VietorisRipsPersistence class as well as the following new classes:

  • SparseRipsPersistence,

  • EuclideanCechPersistence.

The cubical submodule contains CubicalPersistence, a new class for computing persistent homology of filtered cubical complexes such as those coming from 2D or 3D greyscale images.

New images subpackage

The new gtda.images subpackage contains classes which, together with gtda.homology.CubicalPersistence, extend the capabilities of giotto-tda to computer vision, by handling input representing binary or greyscale 2D/3D images represented as arrays.

The classes in gtda.images.filtrations are responsible for converting binary image input into greyscale images in a variety of ways. The greyscale output can then be fed to gtda.homology.CubicalPersistence to extract topological signatures in the form of persistence diagrams. These classes are:

  • HeightFiltration,

  • RadialFiltration,

  • DilationFiltration,

  • ErosionFiltration,

  • SignedDistanceFiltration.

The classes in gtda.images.preprocessing perform a variety of preprocessing steps on either binary or greyscale image input, as well as conversion to point cloud format. They are:

  • Binarizer,

  • Inverter,

  • Padder,

  • ImageToPointCloud.

New point_clouds subpackage

ConsistentRescaling is no longer placed in gtda.homology. Instead, it is now in a point_clouds subpackage containing classes which process or modify the geometry of point cloud data. gtda.point_clouds also contains the new class ConsecutiveRescaling, written with time series applications in mind.

List of point cloud input

All classes in gtda.homology.simplicial (VietorisRipsPersistence, SparseRipsPersistence, and EuclideanCechPersistence) can now take lists of 2D arrays, instead of only 3D arrays, as inputs to the fit and transform methods. In this way, collections of point clouds with varying numbers of points can be processed.

Changes and additions to gtda.diagrams

The diagrams subpackage contains the following new classes:

  • PersistenceImage

  • Silhouette

Additionally, the subpackage has been reorganised as follows:

  • The features submodule now only contains the scalar feature generation classes Amplitude (moved there from distance) and PersistenceEntropy.

  • Classes which produce vector representations from persistence diagrams have been moved to the new representations submodule.

Changes and additions to gtda.utils

  • validate_params has been thoroughly refactored, documented and exposed for the benefit of developers.

  • check_diagrams has been modified, documented and exposed for the benefit of developers.

  • The new check_point_clouds performs validation of inputs consisting of collections of point clouds or distance matrices. It accepts both lists of 2D ndarrays and 3D ndarrays, and is used in the fit and transform methods of classes in gtda.homology.simplicial to allow for list input (see above).

External modules and HPC improvements

A substantial effort has been put into improving the quality of the high-performance components contained in gtda.externals. The end result is cleaner packaging as well as faster execution of C++ functions due to improved bindings. In particular:

  • Two binaries are now shipped for ripser, one of them being optimised for calculations with mod 2 coefficients.

  • Recent improvements by the authors of the hera C++ library have been integrated in giotto-tda.

  • Compiler optimisations for Windows-based systems have been added.

  • The integration of pybind11 has been improved and several issues arising with CMake and boost during developer installations have been addressed.

Bug Fixes

  • Fixed a bug with TakensEmbedding’s algorithm for search of optimal parameters.

  • Inconsistencies between the meaning of “bottleneck amplitude” in the theory and in the code have been ironed out. The code has been modified to agree with the theory glossary. The outputs of the gtda.diagrams classes Amplitude, Scaler and Filtering are affected.

  • Fixed bugs affecting color normalization in Mapper graph plots.

Backwards-Incompatible Changes

  • Python 3.5 is no longer supported.

  • Mac OS X versions below 10.14 are no longer supported by the wheels shipped via PyPI.

  • ConsistentRescaling is no longer found in gtda.homology and is now part of gtda.point_clouds.

  • The outputs of the gtda.diagrams classes Amplitude, Scaler and Filtering have changed due to sqrt(2) factors (see Bug Fixes).

  • The meta_transformers module has been removed.

  • The plotting module has been removed from the examples folder of the repository.

Thanks to our Contributors

This release contains contributions from many people:

Umberto Lupo, Guillaume Tauzin, Wojciech Reise, Julian Burella Pérez, Roman Yurchak, Lewis Tunstall, Anibal Medina-Mardones, and Adélie Garin.

We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.