Overview
A high-performance topological machine learning toolbox in Python
giotto-tda is a high-performance topological machine learning toolbox in Python, built on top of
scikit-learn and distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.
Guiding principles
Seamless integration with scikit-learn: Strictly adhere to the scikit-learn API and development guidelines, and inherit the strengths of that framework.
Code modularity: Topological feature creation steps are implemented as transformers, allowing for the creation of a large number of topologically-powered machine learning pipelines.
Standardisation: Implement the most successful techniques from the literature into a generic framework with a consistent API.
Innovation: Improve on existing algorithms, and make new ones available in open source.
Performance: For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Code is vectorized and implements multi-core parallelism (with joblib).
Data structures: Support for tabular data, time series, graphs, and images.
30s guide to giotto-tda
To install giotto-tda, see the installation instructions.
The functionalities of giotto-tda are provided in scikit-learn–style transformers.
This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()
which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:
diagrams = VR.fit_transform(point_clouds)
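To give a feel for what a persistence diagram records, here is a minimal standard-library sketch of the 0-dimensional part for a single point cloud. It is not how giotto-tda computes diagrams (the library relies on C++ backends); the function name h0_persistence is hypothetical, and the sketch assumes Euclidean distances and the convention that an edge enters the Vietoris-Rips filtration at its length. Under those assumptions, every connected component is born at 0 and dies when it merges with another one, which happens exactly at the edge lengths of a minimum spanning tree.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Finite (birth, death) pairs of 0-dimensional Vietoris-Rips persistence.

    Hypothetical helper for illustration only. The single component that
    never dies is omitted from the output.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (Kruskal's algorithm)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this edge length
            parent[root_i] = root_j
            pairs.append((0.0, length))
    return pairs


cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(h0_persistence(cloud))  # [(0.0, 1.0), (0.0, 4.0)]
```

The long-lived pair (0.0, 4.0) reflects the fact that the third point sits far from the other two, which is the kind of multi-scale information a diagram summarises.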
A plotting API allows for quick visual inspection of the outputs of many of giotto-tda’s transformers. To visualize the i-th output sample, run
VR.plot(diagrams, sample=i)
You can create scalar or vector features from persistence diagrams using giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
features is a two-dimensional NumPy array, which is what allows this type of topological feature generation to fit into a typical scikit-learn machine learning workflow.
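As an illustration of how a scalar feature can be derived from a diagram, the sketch below computes a persistence entropy by hand: the Shannon entropy of the lifetimes (death minus birth) after normalising them to sum to one. The function name persistence_entropy, the base-2 logarithm, and the value returned for an empty diagram are assumptions made for this example, not a description of the library's internals.

```python
import math


def persistence_entropy(diagram):
    """Shannon entropy (base 2, an assumption) of normalised lifetimes.

    `diagram` is a list of (birth, death) pairs for one homology dimension.
    Hypothetical helper for illustration only.
    """
    # Points on the diagonal have zero lifetime and carry no weight
    lifetimes = [death - birth for birth, death in diagram if death > birth]
    total = sum(lifetimes)
    if total == 0:
        # Convention chosen for this sketch: no features means zero entropy
        return 0.0
    probabilities = [lifetime / total for lifetime in lifetimes]
    return -sum(p * math.log2(p) for p in probabilities)


diagram = [(0.0, 1.0), (0.0, 1.0), (0.0, 2.0)]
print(persistence_entropy(diagram))  # 1.5
```

One number of this kind per sample and per homology dimension is what gives a feature array its two-dimensional shape.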
In particular, topological feature creation steps can be fed to or used alongside models from scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation,
optimised via grid-searches, etc.:
from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
giotto-tda also implements the Mapper algorithm as a highly customisable scikit-learn Pipeline, and provides simple plotting functions
for visualizing the output Mapper graphs and for interacting with the pipeline parameters in real time:
from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)
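For readers new to Mapper, the following standard-library sketch walks through the usual steps of the algorithm: apply a filter function, cover its range with overlapping intervals, cluster the points falling in each interval, and connect clusters that share points. It is a simplified illustration under stated assumptions, not giotto-tda's implementation: the names interval_cover, eps_components and mapper_graph are hypothetical, the cover is one-dimensional with equal-length intervals, and clustering is reduced to connected components of an eps-neighbourhood graph.

```python
import math
from itertools import combinations


def interval_cover(lo, hi, n_intervals, overlap):
    """Equal-length intervals covering [lo, hi]; neighbours share a
    fraction `overlap` (0 <= overlap < 1) of their length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    intervals = [(lo + k * step, lo + k * step + length)
                 for k in range(n_intervals)]
    # Pin the last endpoint to hi so rounding cannot exclude the maximum
    intervals[-1] = (intervals[-1][0], hi)
    return intervals


def eps_components(indices, points, eps):
    """Connected components of the graph joining points at distance <= eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, filter_func, n_intervals=3, overlap=0.25, eps=1.0):
    """Return (nodes, edges): nodes are sets of point indices, and an edge
    joins two nodes whenever they have a point in common."""
    values = [filter_func(p) for p in points]
    nodes = []
    for a, b in interval_cover(min(values), max(values), n_intervals, overlap):
        pullback = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(eps_components(pullback, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Seven evenly spaced points on a line, filtered by their first coordinate
points = [(float(x), 0.0) for x in range(7)]
nodes, edges = mapper_graph(points, filter_func=lambda p: p[0])
print([sorted(node) for node in nodes])  # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
print(edges)                             # [(0, 1), (1, 2)]
```

The resulting graph is a path with three nodes, mirroring the shape of the input. In the library's pipeline, the filter function and the clusterer play the same roles but are supplied as scikit-learn objects, as in the call above.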
Resources
Tutorials and examples
We provide a number of tutorials and examples, which offer:
quick start guides to the API;
in-depth examples showcasing more of the library’s features;
intuitive explanations of topological techniques.
What’s new
Major Features and Improvements
The theory glossary has been improved to include the notions of vectorization, kernel and amplitude for persistence diagrams.
The ripser function in gtda.externals.python.ripser_interface no longer uses scikit-learn’s pairwise_distances when metric is 'precomputed', thus allowing square arrays with negative entries or infinities to be passed.
check_point_clouds in gtda.utils.validation now checks for square array input when the input should be a collection of distance-type matrices. Warnings guide the user to correctly setting the distance_matrices parameter. force_all_finite=False no longer means accepting NaN input (only infinite input is accepted).
VietorisRipsPersistence in gtda.homology.simplicial no longer masks out infinite entries in the input to be fed to ripser.
The docstrings for check_point_clouds and VietorisRipsPersistence have been improved to reflect these changes and the extra level of generality for ripser.
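The input rules described above (square matrices, NaN always rejected, infinities accepted only when explicitly allowed) can be pictured with a small standard-library sketch. The function validate_distance_matrix and its allow_infinite flag are hypothetical names chosen for this example; they are not part of giotto-tda and do not reproduce the behaviour of check_point_clouds in detail.

```python
import math


def validate_distance_matrix(matrix, allow_infinite=False):
    """Raise ValueError unless `matrix` is square, free of NaN and, when
    `allow_infinite` is False, free of infinite entries.

    Hypothetical helper for illustration only.
    """
    n_rows = len(matrix)
    for row in matrix:
        if len(row) != n_rows:
            raise ValueError("expected a square matrix")
        for entry in row:
            if math.isnan(entry):
                raise ValueError("NaN entries are never accepted")
            if math.isinf(entry) and not allow_infinite:
                raise ValueError("infinite entries require allow_infinite=True")
    return matrix


# An infinite entry can stand for a pair of vertices that is never connected
distances = [[0.0, 1.0, math.inf],
             [1.0, 0.0, 2.0],
             [math.inf, 2.0, 0.0]]
validate_distance_matrix(distances, allow_infinite=True)  # passes silently
```

Negative entries are deliberately left unchecked here, in line with the note that precomputed inputs may contain them.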
Bug Fixes
The variable used to indicate the location of Boost headers has been renamed from Boost_INCLUDE_DIR to Boost_INCLUDE_DIRS to address developer installation issues in some Linux systems.
Backwards-Incompatible Changes
The keyword parameter distance_matrix in check_point_clouds has been renamed to distance_matrices.
Thanks to our Contributors
This release contains contributions from many people:
Umberto Lupo, Anibal Medina-Mardones, Julian Burella Pérez, Guillaume Tauzin, and Wojciech Reise.
We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.