A high-performance topological machine learning toolbox in Python
giotto-tda is a high-performance topological machine learning toolbox in Python built on top of
scikit-learn and is distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.
- Seamless integration with scikit-learn: Strictly adhere to the scikit-learn API and development guidelines, inherit the strengths of that framework.
- Code modularity: Topological feature creation steps as transformers. Allow for the creation of a large number of topologically-powered machine learning pipelines.
- Standardisation: Implement the most successful techniques from the literature into a generic framework with a consistent API.
- Innovation: Improve on existing algorithms, and make new ones available in open source.
- Performance: For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Vectorized code and implements multi-core parallelism (with
- Data structures: Support for tabular data, time series, graphs, and images.
30s guide to
To install giotto-tda, see the installation instructions.
The functionalities of
giotto-tda are provided in
This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()
which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:
diagrams = VR.fit_transform(point_clouds)
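To make the notion of a persistence diagram concrete, here is a minimal pure-Python sketch. It is not giotto-tda's implementation (which relies on C++ backends) and only covers the degree-0 part of a Vietoris–Rips persistence diagram, i.e. connected components, for a single point cloud. The function name h0_persistence and its reduced flag are hypothetical and chosen for this illustration only.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points, reduced=True):
    """Degree-0 (birth, death) pairs of a Vietoris-Rips filtration.

    Illustrative sketch only: every point is born at 0, and a component
    dies at the length of the edge that merges it into another one.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all edges by increasing length (the filtration parameter).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairs.append((0.0, length))
    if not reduced:
        # The component that never dies gives one infinite bar.
        pairs.append((0.0, inf))
    return pairs


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(h0_persistence(square))  # [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
```

With reduced=False the sketch keeps the single infinite bar of the component that survives forever, so the unit square yields four pairs instead of three.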
A plotting API allows for quick visual inspection of the outputs of many of
giotto-tda’s transformers. To visualize the i-th output sample, run
VR.plot(diagrams, sample=i)
You can create scalar or vector features from persistence diagrams using
giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
features is a two-dimensional numpy array. This is what lets this type of topological feature generation fit into a typical machine learning workflow from scikit-learn.
In particular, topological feature creation steps can be fed to or used alongside models from
scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation,
optimised via grid-searches, etc.:
from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
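For intuition about the scalar feature used in this pipeline, persistence entropy can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the library's code: a diagram is taken to be a list of (birth, death) pairs for one homology dimension, the logarithm is base 2, and an empty or trivial diagram is mapped to 0.0. The function name persistence_entropy is hypothetical.

```python
from math import log2


def persistence_entropy(diagram):
    """Shannon entropy of the normalised lifetimes of a diagram.

    Assumptions of this sketch: `diagram` is a list of (birth, death)
    pairs, base-2 logarithm, and 0.0 when no pair has positive lifetime.
    """
    lifetimes = [death - birth for birth, death in diagram if death > birth]
    total = sum(lifetimes)
    if total == 0:
        return 0.0
    return -sum((l / total) * log2(l / total) for l in lifetimes)


# Two bars of equal length carry exactly one bit of entropy.
print(persistence_entropy([(0.0, 1.0), (0.0, 1.0)]))  # 1.0
```

A diagram dominated by a single long bar has entropy close to 0, while many bars of similar length push the entropy up.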
giotto-tda also implements the Mapper algorithm as a highly customisable
Pipeline, and provides simple plotting functions
for visualizing output Mapper graphs and interacting in real time with the pipeline parameters:
from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)
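The steps behind a Mapper pipeline (filter, cover, clustering of each preimage, nerve) can be sketched in pure Python as follows. This is a toy illustration and not how giotto-tda implements Mapper: the function toy_mapper, its parameters, the way overlap is defined, and the simple distance-threshold clustering are all assumptions of this sketch.

```python
from itertools import combinations
from math import cos, sin, pi, dist


def toy_mapper(points, filter_values, n_intervals=4, overlap_frac=0.3, eps=0.2):
    """Toy Mapper graph: nodes are clusters, edges are shared points."""
    lo, hi = min(filter_values), max(filter_values)
    length = (hi - lo) / n_intervals
    pad = overlap_frac * length / 2  # enlarge each interval on both sides
    nodes = []  # each node is a set of point indices
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        # Preimage of the k-th cover interval under the filter.
        unvisited = {i for i, f in enumerate(filter_values) if a <= f <= b}
        # Cluster the preimage: connected components of the graph that
        # links points at distance <= eps (single linkage).
        while unvisited:
            seed = unvisited.pop()
            component, stack = {seed}, [seed]
            while stack:
                i = stack.pop()
                close = {j for j in unvisited
                         if dist(points[i], points[j]) <= eps}
                unvisited -= close
                component |= close
                stack.extend(close)
            nodes.append(component)
    # Nerve: connect two clusters whenever they share at least one point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# 40 points on the unit circle, filtered by their x coordinate.
circle = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
nodes, edges = toy_mapper(circle, [x for x, _ in circle])
print(len(nodes), len(edges))
```

On this circle the sketch produces six nodes joined in a single cycle, which is the expected Mapper summary of a loop.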
Tutorials and examples¶
We provide a number of tutorials and examples, which offer:
quick start guides to the API;
in-depth examples showcasing more of the library’s features;
intuitive explanations of topological techniques.
Major Features and Improvements¶
This is a major release which adds substantial new functionality and introduces several improvements.
Persistent homology of directed flag complexes via
The FlagserPersistence transformer has been added to gtda.homology (#339). It wraps pyflagser.flagser_weighted to allow for computations of persistence diagrams from directed or undirected weighted graphs. A new notebook demonstrates its use.
Edge collapsing and performance improvements for persistent homology¶
GUDHI C++ components have been updated to the state of GUDHI v3.3.0, yielding performance improvements in
Bindings for GUDHI’s edge collapser have been created and can now be used as an optional preprocessing step via the optional keyword argument
gtda.externals.ripser (#469 and #483). When
collapse_edges=True, and the input data and/or number of required homology dimensions is sufficiently large, the resulting runtimes for Vietoris–Rips persistent homology are state of the art.
New transformers and functionality in
The WeakAlphaPersistence transformer has been added to
EuclideanCechPersistence, it computes persistent homology from point clouds, but its runtime can scale much better with size in low dimensions.
VietorisRipsPersistence now accepts sparse input when
CubicalPersistence now accepts lists of 2D arrays (#503).
A reduced_homology parameter has been added to all persistent homology transformers. When True, one infinite bar in the H0 barcode is removed for the user automatically. Previously, it was not possible to keep these bars in the simplicial homology transformers. The default is always True, which implies a breaking change in the case of CubicalPersistence (#467).
A ComplexPolynomial feature extraction transformer has been added (#479).
A NumberOfPoints feature extraction transformer has been added (#496).
An option to normalize the entropy in PersistenceEntropy according to a heuristic has been added, and a nan_fill_value parameter makes it possible to replace any NaN produced by the entropy calculation with a fixed constant (#450).
The computations in
PersistenceImage and in the pairwise distances and amplitudes related to them have been changed to yield the continuum limit when n_bins tends to infinity; sigma is now measured in the same units as the filtration parameter and defaults to 0.1 (#454).
A curves subpackage has been added to preprocess, and extract features from, collections of multi-channel curves such as those returned by
Silhouette (#480). It contains:
A StandardFeatures transformer that can extract features channel-wise in a generic way.
A Derivative transformer that computes channel-wise derivatives of any order by discrete differences (#492).
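As a rough picture of what a derivative by discrete differences means for one channel of a curve, consider the following pure-Python sketch. The function discrete_derivative is hypothetical and written for this illustration; the Derivative transformer's own handling of orders, boundaries, and array shapes may differ.

```python
def discrete_derivative(curve, order=1):
    """Apply `order` rounds of first differences to a sampled curve.

    Each round shortens the curve by one sample, as consecutive values
    are replaced by their difference.
    """
    values = list(curve)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


samples = [0, 1, 4, 9, 16]  # squares of 0..4
print(discrete_derivative(samples, order=1))  # [1, 3, 5, 7]
print(discrete_derivative(samples, order=2))  # [2, 2, 2]
```

For a multi-channel curve, the same operation would simply be applied to each channel independently.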
A metaestimator subpackage has been added with a CollectionTransformer meta-estimator which converts any transformer instance into a fit-transformer acting on collections (#495).
TakensEmbedding is now a new transformer acting on collections of time series (#460).
The former TakensEmbedding acting on a single time series has been renamed to SingleTakensEmbedding, and the internal logic employed in its fit for computing optimal hyperparameters is now available via a takens_embedding_optimal_parameters convenience function (#460).
SlidingWindow has been made public and renamed into
GraphGeodesicDistance has been improved as follows (#422):
The new parameters
method have been added.
The rules on the role of zero entries, infinity entries, and non-stored values have been made clearer.
Masked arrays are now supported.
A mode parameter has been added to KNeighborsGraph; as in scikit-learn, it can be set to either
List input is now accepted by all transformers in gtda.graphs, and outputs are consistently either lists or 3D arrays (#478).
Sparse matrices returned by TransitionGraph now have int dtype (0-1 adjacency matrices), and are not necessarily symmetric (#478).
Pullback cover set labels and partial cluster labels have been added to Mapper node hovertexts (#445).
Node and edge metadata are now accessible in output igraph.Graph objects by means of the
es (respectively). Graph-level dictionaries are no longer used.
Available node metadata can be accessed by
attr_name is one of
Sizes of intersections are automatically stored as edge weights, accessible by
A "store_intersections" keyword argument has been added to make_mapper_pipeline to allow storing the indices defining node intersections as edge attributes, accessible via
A contract_nodes optional parameter has been added to both
make_mapper_pipeline; nodes which are subsets of other nodes are thrown away from the graph when this parameter is set to
A graph_ attribute is stored during
Two of the
min_intersection and the new contract_nodes) are now available in the widgets generated by plot_interactive_mapper_graph, and the layout of these widgets has been improved (#456).
Nerve have been exposed in the documentation and in
A plot_params kwarg is available in plotting functions and methods throughout to allow user customisability of output figures. The user must pass a dictionary with keys
"traces" in some cases) (#441).
Several plots produced by plot class methods now have default titles (#453).
Infinite deaths are now plotted by
Possible multiplicities of persistence pairs in persistence diagram plots are now indicated in the hovertext (#454).
plot_heatmap now accepts boolean array input (#444).
New tutorials and examples¶
The following new tutorials have been added:
Topology of time series, which explains the theory of the Takens time-delay embedding and its use with persistent homology, demonstrates the new API of several components in gtda.time_series, and shows how to construct time series classification pipelines in giotto-tda by partially reproducing arXiv:1910.08245.
Topology in time series forecasting, which explains how to set up time series forecasting pipelines in
TransformerResamplerMixins and the giotto-tda
Topological feature extraction from graphs, which explains what the features extracted from directed or undirected graphs by
Classifying handwritten digits, which presents a fully-fledged machine learning pipeline in which cubical persistent homology is applied to the classification of handwritten images from the MNIST dataset, partially reproducing arXiv:1910.08345.
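As a quick reminder of the construction behind the time series tutorials, a Takens time-delay embedding turns a scalar series into a point cloud of delay vectors. The sketch below is a pure-Python illustration, not the library's transformer; the function name takens_embedding and its dimension and time_delay arguments are names chosen for this example.

```python
def takens_embedding(series, dimension=3, time_delay=1):
    """Return the delay vectors (x[t], x[t + d], ..., x[t + (m - 1) d]).

    Here m is `dimension` and d is `time_delay`; the number of vectors
    is len(series) - (m - 1) * d.
    """
    n_points = len(series) - (dimension - 1) * time_delay
    if n_points <= 0:
        raise ValueError("series is too short for this dimension and delay")
    return [
        tuple(series[t + k * time_delay] for k in range(dimension))
        for t in range(n_points)
    ]


print(takens_embedding([0, 1, 2, 3, 4, 5], dimension=3, time_delay=2))
# [(0, 2, 4), (1, 3, 5)]
```

The resulting list of points is the kind of point cloud on which persistent homology can then be computed.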
A bug has been fixed which could lead to features with negative lifetime in persistent homology transformers when infinity_values was set too low (#339).
By relying on
graph_shortest_path, some errors in computing GraphGeodesicDistance (e.g. when some edges are zero) have been fixed (#422).
A bug in the handling of COO matrices by the ripser interface has been fixed (#465).
A bug which led to the incorrect handling of the
Filtering has been fixed (#439).
An issue with the use of
joblib.Parallel, which led to errors when attempting to run
PersistenceImage, and the corresponding amplitudes and distances on large datasets, has been fixed (#428 and #481).
A bug leading to plots of persistence diagrams not showing points with negative births or deaths has been fixed, as has a bug with the computation of the range to be shown in the plot (#437).
A bug in the handling of persistence pairs with negative death values by Filtering has been fixed (#436).
A bug in the handling of homology_dimension_ix (now renamed to homology_dimension_idx) in the
PersistenceImage has been fixed (#452).
PersistenceLandscape plots now show all homology dimensions, instead of just the first (#454).
A bug in the computation of amplitudes and pairwise distances based on persistence images has been fixed (#454).
Silhouette no longer creates NaNs when a subdiagram is trivial (#454).
CubicalPersistence no longer creates pairs with negative persistence when infinity_values is set too low (#467).
Warnings are no longer thrown by
A bug in Labeller.resample, affecting cases in which n_steps_future >= size - 1, has been fixed (#460).
A bug in
validate_params, affecting the case of tuples of allowed types, has been fixed (#502).
The minimum required versions of most of the dependencies have been bumped. The updated dependencies are numpy >= 1.19.1, scipy >= 1.5.0, joblib >= 0.16.0, scikit-learn >= 0.23.1, python-igraph >= 0.8.2, plotly >= 4.8.2, and pyflagser >= 0.4.1 (#457).
GraphGeodesicDistance now returns either lists or 3D dense ndarrays for compatibility with the homology transformers (#422).
The output of
PairwiseDistance has been transposed to match
plot class methods now return figures instead of showing them (#441).
Mapper node and edge attributes are no longer stored as graph-level dictionaries, "node_id" is no longer an available node attribute, and the attributes
edges_ previously stored by Nerve.fit have been removed in favour of a
The homology_dimension_ix parameter available in some transformers in gtda.diagrams has been renamed to homology_dimension_idx.
The outputs of
HeatKernel and of the pairwise distances and amplitudes based on them are now different due to the improvements described above.
Weights are no longer stored in the
Scaler objects when the metric is persistence-image–based; only the weight function is (#454).
The homology_dimensions_ attributes of several transformers have been converted from lists to tuples. When possible, homology dimensions stored as parts of attributes are now presented as ints (#454).
gaussian_filter (used to make heat– and persistence-image–based representations/pairwise distances/amplitudes) is now called with
The default value of
Amplitude has been changed from
None, giving vector instead of scalar features (#454).
The meaning of the default
metric="persistence_image") has been changed from the identity function to the function returning a vector of ones (#454).
Due to the updates in the GUDHI components, some of the bindings and Python interfaces to the GUDHI C++ components in gtda.externals have changed (#468).
Labeller.transform now returns a 1D array instead of a column array (#475).
PersistenceLandscape now returns 3D arrays instead of 4D ones, for compatibility with the new
CubicalPersistence now removes one infinite bar in H0 (#467, and see above).
Labeller has been replaced with a more intuitive size parameter. The relation between the two is: size = width + 1 (#460).
clusterer is now a required parameter in
FirstHistogramGap now indicates the floor of max_fraction * n_samples; its default value has been changed from
Thanks to our Contributors¶
This release contains contributions from many people:
Umberto Lupo, Guillaume Tauzin, Julian Burella Pérez, Wojciech Reise, Lewis Tunstall, Nick Sale, and Anibal Medina-Mardones.
We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.