Overview¶
A high-performance topological machine learning toolbox in Python
giotto-tda is a high-performance topological machine learning toolbox in Python built on top of
scikit-learn and distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.
Guiding principles¶
Seamless integration with scikit-learn: Strictly adhere to the scikit-learn API and development guidelines, and inherit the strengths of that framework.
Code modularity: Topological feature creation steps are implemented as transformers, allowing for the creation of a large number of topologically-powered machine learning pipelines.
Standardisation: Implement the most successful techniques from the literature into a generic framework with a consistent API.
Innovation: Improve on existing algorithms, and make new ones available in open source.
Performance: For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Code is vectorized and implements multi-core parallelism (with joblib).
Data structures: Support for tabular data, time series, graphs, and images.
30s guide to giotto-tda¶
To install giotto-tda, see the installation instructions.
The functionalities of giotto-tda are provided in scikit-learn–style transformers.
This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()
which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:
diagrams = VR.fit_transform(point_clouds)
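To give a feel for what a persistence diagram records, here is a minimal pure-Python sketch of the 0-dimensional part of Vietoris–Rips persistence for a single point cloud. It is not giotto-tda's implementation (which relies on optimised C++ code); the function name `h0_vietoris_rips` is hypothetical, and the sketch assumes Euclidean distances and leaves out the single component that never dies.

```python
from itertools import combinations
from math import dist


def h0_vietoris_rips(points):
    """Return the finite (birth, death) pairs of the H0 Vietoris-Rips barcode.

    Every point is born at filtration value 0; a connected component dies
    when an edge first merges it into another one (Kruskal's algorithm).
    The one component that never dies (the infinite bar) is left out.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by length (the filtration value of the edge).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge merges two components
            parent[root_i] = root_j
            pairs.append((0.0, length))
    return pairs


# Two tight clusters far apart: two short bars and one long bar.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(h0_vietoris_rips(cloud))  # [(0.0, 1.0), (0.0, 1.0), (0.0, 9.0)]
```

The long bar reflects the gap between the two clusters, which is the kind of coarse structure that persistence diagrams are meant to capture.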
A plotting API allows for quick visual inspection of the outputs of many of giotto-tda’s transformers. To visualize the i-th output sample, run
fig = VR.plot(diagrams, sample=i)
You can create scalar or vector features from persistence diagrams using giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
features is a two-dimensional numpy array. This matters because it lets this kind of topological feature generation fit into a typical scikit-learn machine learning workflow.
In particular, topological feature creation steps can be fed to or used alongside models from scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation,
optimised via grid-searches, etc.:
from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
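As a rough illustration of what a scalar feature such as persistence entropy measures, the sketch below computes a base-2 Shannon entropy of the bar lifetimes in each homology dimension. It is an assumption-laden toy and not giotto-tda's code: the function name `persistence_entropy` is hypothetical, a diagram is taken to be a plain list of (birth, death, dimension) triples, and no normalisation or NaN handling is attempted.

```python
from collections import defaultdict
from math import log2


def persistence_entropy(diagram):
    """Base-2 Shannon entropy of the lifetimes, per homology dimension.

    `diagram` is a list of (birth, death, dimension) triples.
    """
    lifetimes = defaultdict(list)
    for birth, death, dim in diagram:
        if death > birth:  # ignore points on the diagonal
            lifetimes[dim].append(death - birth)

    entropies = {}
    for dim, values in sorted(lifetimes.items()):
        total = sum(values)
        # Each lifetime is turned into a probability p = value / total.
        entropies[dim] = sum(
            (v / total) * log2(total / v) for v in values
        )
    return entropies


diagram = [(0.0, 1.0, 0), (0.0, 1.0, 0), (0.5, 2.5, 1)]
print(persistence_entropy(diagram))  # {0: 1.0, 1: 0.0}
```

Two equally long bars give one bit of entropy, while a single bar gives zero: the feature summarises how evenly persistence is spread across the bars of a diagram.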
giotto-tda also implements the Mapper algorithm as a highly customisable scikit-learn Pipeline, and provides simple plotting functions
for visualizing output Mapper graphs and interacting in real time with the pipeline parameters:
from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)
Resources¶
Tutorials and examples¶
We provide a number of tutorials and examples, which offer:
quick start guides to the API;
in-depth examples showcasing more of the library’s features;
intuitive explanations of topological techniques.
What’s new¶
Major Features and Improvements¶
This is a major release which adds substantial new functionality and introduces several improvements.
Persistent homology of directed flag complexes via pyflagser¶
The pyflagser package (source, docs) is now an official dependency of giotto-tda.
The FlagserPersistence transformer has been added to gtda.homology (#339). It wraps pyflagser.flagser_weighted to allow for computations of persistence diagrams from directed or undirected weighted graphs. A new notebook demonstrates its use.
Edge collapsing and performance improvements for persistent homology¶
GUDHI C++ components have been updated to the state of GUDHI v3.3.0, yielding performance improvements in SparseRipsPersistence, EuclideanCechPersistence and CubicalPersistence (#468).
Bindings for GUDHI’s edge collapser have been created and can now be used as an optional preprocessing step via the optional keyword argument collapse_edges in VietorisRipsPersistence and in gtda.externals.ripser (#469 and #483). When collapse_edges=True, and the input data and/or number of required homology dimensions is sufficiently large, the resulting runtimes for Vietoris–Rips persistent homology are state of the art.
The performance of the Ripser bindings has otherwise been improved by avoiding unnecessary data copies, better managing the memory, and using more efficient matrix routines (#501 and #507).
New transformers and functionality in gtda.homology¶
The WeakAlphaPersistence transformer has been added to gtda.homology (#464). Like VietorisRipsPersistence, SparseRipsPersistence and EuclideanCechPersistence, it computes persistent homology from point clouds, but its runtime can scale much better with size in low dimensions.
VietorisRipsPersistence now accepts sparse input when metric="precomputed" (#424).
CubicalPersistence now accepts lists of 2D arrays (#503).
A reduced_homology parameter has been added to all persistent homology transformers. When True, one infinite bar in the H0 barcode is removed for the user automatically. Previously, it was not possible to keep these bars in the simplicial homology transformers. The default is always True, which implies a breaking change in the case of CubicalPersistence (#467).
Persistence diagrams¶
A ComplexPolynomial feature extraction transformer has been added (#479).
A NumberOfPoints feature extraction transformer has been added (#496).
An option to normalize the entropy in PersistenceEntropy according to a heuristic has been added, and a nan_fill_value parameter allows replacing any NaN produced by the entropy calculation with a fixed constant (#450).
The computations in HeatKernel, PersistenceImage and in the pairwise distances and amplitudes related to them have been changed to yield the continuum limit when n_bins tends to infinity; sigma is now measured in the same units as the filtration parameter and defaults to 0.1 (#454).
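The simplest of these features can be sketched in a few lines. The toy below counts off-diagonal points per homology dimension, which is the idea behind a number-of-points feature. The function name `number_of_points` and the list-of-triples diagram format are assumptions made for illustration, as is the premise that points with birth equal to death act as padding and should not be counted.

```python
from collections import Counter


def number_of_points(diagram, homology_dimensions):
    """Count off-diagonal points in each requested homology dimension.

    `diagram` is a list of (birth, death, dimension) triples; points with
    birth == death lie on the diagonal and are skipped.
    """
    counts = Counter(
        dim for birth, death, dim in diagram if death != birth
    )
    return [counts[dim] for dim in homology_dimensions]


diagram = [(0.0, 1.0, 0), (0.0, 0.0, 0), (0.5, 2.5, 1), (1.0, 1.0, 1)]
print(number_of_points(diagram, (0, 1, 2)))  # [1, 1, 0]
```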
New curves subpackage¶
A new curves subpackage has been added to preprocess, and extract features from, collections of multi-channel curves such as those returned by BettiCurve, PersistenceLandscape and Silhouette (#480). It contains:
A StandardFeatures transformer that can extract features channel-wise in a generic way.
A Derivative transformer that computes channel-wise derivatives of any order by discrete differences (#492).
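A derivative "by discrete differences" can be pictured as follows. This is a minimal sketch under stated assumptions, not the library's code: the function name `derivative` is hypothetical, a multi-channel curve is a plain list of lists, and no scaling by the sampling step is applied.

```python
def derivative(curves, order=1):
    """Channel-wise discrete differences of the given order.

    `curves` is a list of channels, each a list of sampled values.
    Each pass replaces a channel by the differences of consecutive samples,
    so the output channels are `order` samples shorter than the input.
    """
    if order < 1:
        raise ValueError("order must be a positive integer")
    result = [list(channel) for channel in curves]
    for _ in range(order):
        result = [
            [b - a for a, b in zip(channel, channel[1:])]
            for channel in result
        ]
    return result


betti_like = [[0, 1, 4, 9, 16], [2, 2, 3, 3, 3]]
print(derivative(betti_like, order=1))  # [[1, 3, 5, 7], [0, 1, 0, 0]]
print(derivative(betti_like, order=2))  # [[2, 2, 2], [1, -1, 0]]
```

Each channel is processed independently, which is what "channel-wise" means here.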
New metaestimators subpackage¶
A new metaestimators subpackage has been added with a CollectionTransformer meta-estimator which converts any transformer instance into a fit-transformer acting on collections (#495).
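The idea of lifting a transformer to collections can be sketched without any dependencies. Both classes below are hypothetical stand-ins written for this illustration: `MinMaxScaler1D` is a toy transformer, and `CollectionWrapper` only mimics the general pattern (an independent copy of the transformer is fit-transformed on each element) without the parallelism or validation a real meta-estimator would provide.

```python
from copy import deepcopy


class MinMaxScaler1D:
    """Toy transformer: rescales one sequence of numbers to [0, 1]."""

    def fit(self, X, y=None):
        self.min_, self.max_ = min(X), max(X)
        return self

    def transform(self, X):
        span = (self.max_ - self.min_) or 1.0  # avoid division by zero
        return [(x - self.min_) / span for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class CollectionWrapper:
    """Hypothetical stand-in for a collection meta-estimator."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit_transform(self, collection, y=None):
        # A fresh copy per element, so nothing learned on one element
        # leaks into the transformation of another.
        return [
            deepcopy(self.transformer).fit_transform(element)
            for element in collection
        ]


scaled = CollectionWrapper(MinMaxScaler1D()).fit_transform([[0, 5, 10], [2, 4]])
print(scaled)  # [[0.0, 0.5, 1.0], [0.0, 1.0]]
```

Note that the elements of the collection may have different lengths, which is one reason to treat them one at a time.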
Images¶
Time series¶
TakensEmbedding is now a new transformer acting on collections of time series (#460).
The former TakensEmbedding acting on a single time series has been renamed to SingleTakensEmbedding, and the internal logic employed in its fit for computing optimal hyperparameters is now available via a takens_embedding_optimal_parameters convenience function (#460).
The _slice_windows method of SlidingWindow has been made public and renamed to slice_windows (#460).
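For readers new to these operations, here is a small pure-Python sketch of a time-delay embedding and of sliding windows. The function names and parameters (`takens_embedding`, `sliding_windows`, `time_delay`, `dimension`, `stride`, `size`) are chosen for this illustration, and giotto-tda's own alignment of windows and handling of leftover samples may differ.

```python
def takens_embedding(series, time_delay=1, dimension=2, stride=1):
    """Time-delay embedding of a single time series.

    Each output vector is
    (x[i], x[i + time_delay], ..., x[i + (dimension - 1) * time_delay]).
    """
    span = (dimension - 1) * time_delay
    return [
        [series[i + k * time_delay] for k in range(dimension)]
        for i in range(0, len(series) - span, stride)
    ]


def sliding_windows(series, size, stride=1):
    """Windows of `size` consecutive samples, moved forward by `stride`."""
    return [
        series[start:start + size]
        for start in range(0, len(series) - size + 1, stride)
    ]


signal = [0, 1, 2, 3, 4, 5]
print(takens_embedding(signal, time_delay=2, dimension=3))
# [[0, 2, 4], [1, 3, 5]]
print(sliding_windows(signal, size=3, stride=2))
# [[0, 1, 2], [2, 3, 4]]
```

Acting on a collection of time series then amounts to applying `takens_embedding` to each series in turn, which yields one point cloud per series.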
Graphs¶
GraphGeodesicDistance has been improved as follows (#422):
    The new parameters directed, unweighted and method have been added.
    The rules on the role of zero entries, infinity entries, and non-stored values have been made clearer.
    Masked arrays are now supported.
A mode parameter has been added to KNeighborsGraph; as in scikit-learn, it can be set to either "distance" or "connectivity" (#478).
List input is now accepted by all transformers in gtda.graphs, and outputs are consistently either lists or 3D arrays (#478).
Sparse matrices returned by KNeighborsGraph and TransitionGraph now have int dtype (0-1 adjacency matrices), and are not necessarily symmetric (#478).
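Why a k-nearest-neighbour adjacency matrix need not be symmetric is easiest to see on a tiny example. The sketch below is a dense, pure-Python toy with a hypothetical `k_neighbors_graph` function; it assumes Euclidean distances and does not reproduce the sparse output or the exact conventions of the library.

```python
from math import dist


def k_neighbors_graph(points, n_neighbors=1, mode="connectivity"):
    """Dense adjacency matrix of a directed k-nearest-neighbour graph.

    Row i has a non-zero entry in column j when j is among the
    `n_neighbors` nearest points to i (excluding i itself).
    """
    if mode not in ("connectivity", "distance"):
        raise ValueError("mode must be 'connectivity' or 'distance'")
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for distance, j in others[:n_neighbors]:
            adjacency[i][j] = 1 if mode == "connectivity" else distance
    return adjacency


line = [(0.0,), (1.0,), (3.0,)]
print(k_neighbors_graph(line))
# [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The last point has the middle point as its nearest neighbour, but not the other way round, so entry (2, 1) is set while entry (1, 2) is not. With `mode="distance"`, the non-zero entries hold the distances instead of ones.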
Mapper¶
Pullback cover set labels and partial cluster labels have been added to Mapper node hovertexts (#445).
The functionality of Nerve and make_mapper_pipeline has been greatly extended (#447 and #456):
    Node and edge metadata are now accessible in output igraph.Graph objects by means of the VertexSeq and EdgeSeq attributes vs and es (respectively). Graph-level dictionaries are no longer used.
    Available node metadata can be accessed by graph.vs[attr_name], where attr_name is one of "pullback_set_label", "partial_cluster_label", or "node_elements".
    Sizes of intersections are automatically stored as edge weights, accessible by graph.es["weight"].
    A "store_intersections" keyword argument has been added to Nerve and make_mapper_pipeline to allow storing the indices defining node intersections as edge attributes, accessible via graph.es["edge_elements"].
    A contract_nodes optional parameter has been added to both Nerve and make_mapper_pipeline; nodes which are subsets of other nodes are thrown away from the graph when this parameter is set to True.
    A graph_ attribute is stored during Nerve.fit.
Two of the Nerve parameters (min_intersection and the new contract_nodes) are now available in the widgets generated by plot_interactive_mapper_graph, and the layout of these widgets has been improved (#456).
ParallelClustering and Nerve have been exposed in the documentation and in gtda.mapper’s __init__ (#447).
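The roles of intersection sizes, min_intersection and contract_nodes can be illustrated with a toy nerve construction. This is a sketch under assumptions rather than the library's algorithm: the function `nerve` below is hypothetical, nodes are plain Python sets of data-point indices, the output is a dictionary instead of an igraph.Graph, and only proper subsets are contracted (how exact duplicates are treated is left open).

```python
from itertools import combinations


def nerve(nodes, min_intersection=1, contract_nodes=False):
    """Build a nerve graph from nodes given as sets of data-point indices.

    Returns (kept_nodes, edges), where `edges` maps a pair of positions in
    `kept_nodes` to the size of the intersection of the two nodes.
    """
    kept = [set(node) for node in nodes]
    if contract_nodes:
        # Drop every node that is a proper subset of another node.
        kept = [
            node for node in kept
            if not any(node < other for other in kept)
        ]
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(kept), 2):
        weight = len(a & b)
        if weight >= min_intersection:
            edges[(i, j)] = weight
    return kept, edges


clusters = [{0, 1, 2}, {2, 3, 4}, {3, 4}, {4, 5}]
_, edges = nerve(clusters)
print(edges)  # {(0, 1): 1, (1, 2): 2, (1, 3): 1, (2, 3): 1}
_, strict_edges = nerve(clusters, min_intersection=2)
print(strict_edges)  # {(1, 2): 2}
```

With `contract_nodes=True`, the node `{3, 4}` is dropped because it is contained in `{2, 3, 4}`, leaving three nodes joined in a chain.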
Plotting¶
A plot_params kwarg is available in plotting functions and methods throughout to allow user customisability of output figures. The user must pass a dictionary with keys "layout" and/or "trace" (or "traces" in some cases) (#441).
Several plots produced by plot class methods now have default titles (#453).
Infinite deaths are now plotted by plot_diagrams (#461).
Possible multiplicities of persistence pairs in persistence diagram plots are now indicated in the hovertext (#454).
plot_heatmap now accepts boolean array input (#444).
New tutorials and examples¶
The following new tutorials have been added:
Topology of time series, which explains the theory of the Takens time-delay embedding and its use with persistent homology, demonstrates the new API of several components in gtda.time_series, and shows how to construct time series classification pipelines in giotto-tda by partially reproducing arXiv:1910.08245.
Topology in time series forecasting, which explains how to set up time series forecasting pipelines in giotto-tda via TransformerResamplerMixins and the giotto-tda Pipeline class.
Topological feature extraction from graphs, which explains what the features extracted from directed or undirected graphs by VietorisRipsPersistence, SparseRipsPersistence and FlagserPersistence are.
Classifying handwritten digits, which presents a fully-fledged machine learning pipeline in which cubical persistent homology is applied to the classification of handwritten images from the MNIST dataset, partially reproducing arXiv:1910.08345.
Utils¶
Bug Fixes¶
A bug has been fixed which could lead to features with negative lifetime in persistent homology transformers when infinity_values was set too low (#339).
By relying on scipy’s shortest_path instead of scikit-learn’s graph_shortest_path, some errors in computing GraphGeodesicDistance (e.g. when some edges are zero) have been fixed (#422).
A bug in the handling of COO matrices by the ripser interface has been fixed (#465).
A bug which led to the incorrect handling of the homology_dimensions parameter in Filtering has been fixed (#439).
An issue with the use of joblib.Parallel, which led to errors when attempting to run HeatKernel, PersistenceImage, and the corresponding amplitudes and distances on large datasets, has been fixed (#428 and #481).
A bug leading to plots of persistence diagrams not showing points with negative births or deaths has been fixed, as has a bug with the computation of the range to be shown in the plot (#437).
A bug in the handling of persistence pairs with negative death values by Filtering has been fixed (#436).
A bug in the handling of homology_dimension_ix (now renamed to homology_dimension_idx) in the plot methods of HeatKernel and PersistenceImage has been fixed (#452).
A bug in the labelling of axes in HeatKernel and PersistenceImage plots has been fixed (#453 and #454).
PersistenceLandscape plots now show all homology dimensions, instead of just the first (#454).
A bug in the computation of amplitudes and pairwise distances based on persistence images has been fixed (#454).
Silhouette now does not create NaNs when a subdiagram is trivial (#454).
CubicalPersistence now does not create pairs with negative persistence when infinity_values is set too low (#467).
Warnings are no longer thrown by KNeighborsGraph when metric="precomputed" (#506).
A bug in Labeller.resample, affecting cases in which n_steps_future >= size - 1, has been fixed (#460).
A bug in validate_params, affecting the case of tuples of allowed types, has been fixed (#502).
Backwards-Incompatible Changes¶
The minimum required versions of most of the dependencies have been bumped. The updated dependencies are numpy >= 1.19.1, scipy >= 1.5.0, joblib >= 0.16.0, scikit-learn >= 0.23.1, python-igraph >= 0.8.2, plotly >= 4.8.2, and pyflagser >= 0.4.1 (#457).
GraphGeodesicDistance now returns either lists or 3D dense ndarrays for compatibility with the homology transformers (#422).
The output of PairwiseDistance has been transposed to match the scikit-learn convention (n_samples_transform, n_samples_fit) (#420).
plot class methods now return figures instead of showing them (#441).
Mapper node and edge attributes are no longer stored as graph-level dictionaries, "node_id" is no longer an available node attribute, and the attributes nodes_ and edges_ previously stored by Nerve.fit have been removed in favour of a graph_ attribute (#447).
The homology_dimension_ix parameter available in some transformers in gtda.diagrams has been renamed to homology_dimension_idx (#452).
The base of the logarithm used by PersistenceEntropy is now 2 instead of e, and NaN values are replaced with -1 instead of 0 by default (#450 and #474).
The outputs of PersistenceImage, HeatKernel and of the pairwise distances and amplitudes based on them are now different due to the improvements described above.
Weights are no longer stored in the effective_metric_params_ attribute of PairwiseDistance, Amplitude and Scaler objects when the metric is persistence-image–based; only the weight function is (#454).
The homology_dimensions_ attributes of several transformers have been converted from lists to tuples. When possible, homology dimensions stored as parts of attributes are now presented as ints (#454).
gaussian_filter (used to make heat– and persistence-image–based representations/pairwise distances/amplitudes) is now called with mode="constant" instead of "reflect" (#454).
The default value of order in Amplitude has been changed from 2. to None, giving vector instead of scalar features (#454).
The meaning of the default None for weight_function in PersistenceImage (and in Amplitude and PairwiseDistance when metric="persistence_image") has been changed from the identity function to the function returning a vector of ones (#454).
Due to the updates in the GUDHI components, some of the bindings and Python interfaces to the GUDHI C++ components in gtda.externals have changed (#468).
Labeller.transform now returns a 1D array instead of a column array (#475).
PersistenceLandscape now returns 3D arrays instead of 4D ones, for compatibility with the new curves subpackage (#480).
By default, CubicalPersistence now removes one infinite bar in H0 (#467, and see above).
The former width parameter in SlidingWindow and Labeller has been replaced with a more intuitive size parameter. The relation between the two is: size = width + 1 (#460).
clusterer is now a required parameter in ParallelClustering (#508).
The max_fraction parameter in FirstSimpleGap and FirstHistogramGap now indicates the floor of max_fraction * n_samples; its default value has been changed from None to 1 (#412).
Thanks to our Contributors¶
This release contains contributions from many people:
Umberto Lupo, Guillaume Tauzin, Julian Burella Pérez, Wojciech Reise, Lewis Tunstall, Nick Sale, and Anibal Medina-Mardones.
We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.