Overview¶
A high-performance topological machine learning toolbox in Python
giotto-tda is a high-performance topological machine learning toolbox in Python built on top of scikit-learn and is distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.
Guiding principles¶
Seamless integration with scikit-learn: Strictly adhere to the scikit-learn API and development guidelines, inherit the strengths of that framework.
Code modularity: Topological feature creation steps as transformers. Allow for the creation of a large number of topologically-powered machine learning pipelines.
Standardisation: Implement the most successful techniques from the literature into a generic framework with a consistent API.
Innovation: Improve on existing algorithms, and make new ones available in open source.
Performance: For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Code is vectorized and implements multi-core parallelism (with joblib).
Data structures: Support for tabular data, time series, graphs, and images.
30s guide to giotto-tda¶
To install giotto-tda, see the installation instructions.
The functionality of giotto-tda is provided in scikit-learn–style transformers. This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()
which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:
diagrams = VR.fit_transform(point_clouds)
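For intuition about what a persistence diagram records, the sketch below computes only the degree-0 part of Vietoris–Rips persistence for one small point cloud in pure Python. It is not how giotto-tda computes diagrams. The function name h0_persistence is hypothetical, and using the pairwise Euclidean distance as the filtration value of an edge is an assumption of this sketch. Every point is born at 0, and a connected component dies at the length of the edge that merges it into another component. The single component that never dies is omitted.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return finite (birth, death) pairs of connected components (hypothetical helper)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (Kruskal-style sweep over the filtration).
    edges = sorted(
        (math.dist(p, q), i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this edge length.
            parent[root_i] = root_j
            diagram.append((0.0, length))
    return diagram


point_cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(h0_persistence(point_cloud))  # [(0.0, 1.0), (0.0, 2.0)]
```

A cloud of n points yields n - 1 finite pairs in this sketch, one per merge.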
A plotting API allows for quick visual inspection of the outputs of many of giotto-tda’s transformers. To visualize the i-th output sample, run
VR.plot(diagrams, sample=i)
You can create scalar or vector features from persistence diagrams using giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
features is a two-dimensional numpy array. This is what allows this type of topological feature generation to fit into a typical scikit-learn machine learning workflow.
In particular, topological feature creation steps can be fed to or used alongside models from scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation, optimised via grid-searches, etc.:
from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
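For intuition about the scalar features that PersistenceEntropy feeds into such a pipeline, here is a minimal pure-Python sketch of persistence entropy for a single subdiagram: the Shannon entropy of the lifetimes (death minus birth) after normalising them to sum to one. It is an illustration under stated assumptions, not giotto-tda's implementation. The helper name persistence_entropy is hypothetical, and the base-2 logarithm and the fill value of -1 for an empty or trivial subdiagram are assumptions of this sketch.

```python
import math


def persistence_entropy(diagram, nan_fill_value=-1.0):
    """Shannon entropy (base 2) of normalised lifetimes; hypothetical helper."""
    lifetimes = [death - birth for birth, death in diagram if death > birth]
    total = sum(lifetimes)
    if total == 0:
        # No bar with positive lifetime: the entropy is undefined, so use the fill value.
        return nan_fill_value
    probabilities = [lifetime / total for lifetime in lifetimes]
    return -sum(p * math.log2(p) for p in probabilities)


# Four bars of equal length behave like a uniform distribution over four outcomes.
diagram = [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0), (0.25, 1.25)]
print(persistence_entropy(diagram))  # 2.0
```

One such number per homology dimension and per sample gives a two-dimensional feature array of the kind a classifier can consume.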
giotto-tda also implements the Mapper algorithm as a highly customisable scikit-learn Pipeline, and provides simple plotting functions for visualizing output Mapper graphs and for interacting with the pipeline parameters in real time:
from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)
Resources¶
Tutorials and examples¶
We provide a number of tutorials and examples, which offer:
quick start guides to the API;
in-depth examples showcasing more of the library’s features;
intuitive explanations of topological techniques.
What’s new¶
Major Features and Improvements¶
This is a major release which adds substantial new functionality and introduces several improvements.
Persistent homology of directed flag complexes via pyflagser¶
The pyflagser package (source, docs) is now an official dependency of giotto-tda.
The FlagserPersistence transformer has been added to gtda.homology (#339). It wraps pyflagser.flagser_weighted to allow for computations of persistence diagrams from directed or undirected weighted graphs. A new notebook demonstrates its use.
Edge collapsing and performance improvements for persistent homology¶
GUDHI C++ components have been updated to the state of GUDHI v3.3.0, yielding performance improvements in SparseRipsPersistence, EuclideanCechPersistence and CubicalPersistence (#468).
Bindings for GUDHI’s edge collapser have been created and can now be used as an optional preprocessing step via the optional keyword argument collapse_edges in VietorisRipsPersistence and in gtda.externals.ripser (#469 and #483). When collapse_edges=True, and the input data and/or number of required homology dimensions is sufficiently large, the resulting runtimes for Vietoris–Rips persistent homology are state of the art.
The performance of the Ripser bindings has otherwise been improved by avoiding unnecessary data copies, better managing the memory, and using more efficient matrix routines (#501 and #507).
New transformers and functionality in gtda.homology¶
The WeakAlphaPersistence transformer has been added to gtda.homology (#464). Like VietorisRipsPersistence, SparseRipsPersistence and EuclideanCechPersistence, it computes persistent homology from point clouds, but its runtime can scale much better with size in low dimensions.
VietorisRipsPersistence now accepts sparse input when metric="precomputed" (#424).
CubicalPersistence now accepts lists of 2D arrays (#503).
A reduced_homology parameter has been added to all persistent homology transformers. When True, one infinite bar in the H0 barcode is removed for the user automatically. Previously, it was not possible to keep these bars in the simplicial homology transformers. The default is always True, which implies a breaking change in the case of CubicalPersistence (#467).
Persistence diagrams¶
A ComplexPolynomial feature extraction transformer has been added (#479).
A NumberOfPoints feature extraction transformer has been added (#496).
An option to normalize the entropy in PersistenceEntropy according to a heuristic has been added, and a nan_fill_value parameter allows any NaN produced by the entropy calculation to be replaced with a fixed constant (#450).
The computations in HeatKernel, PersistenceImage and in the pairwise distances and amplitudes related to them have been changed to yield the continuum limit when n_bins tends to infinity; sigma is now measured in the same units as the filtration parameter and defaults to 0.1 (#454).
New curves subpackage¶
A new curves subpackage has been added to preprocess, and extract features from, collections of multi-channel curves such as returned by BettiCurve, PersistenceLandscape and Silhouette (#480). It contains:
A StandardFeatures transformer that can extract features channel-wise in a generic way.
A Derivative transformer that computes channel-wise derivatives of any order by discrete differences (#492).
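The idea of a channel-wise derivative by discrete differences can be illustrated in a few lines of pure Python. This is a sketch of the general technique, not the code of the Derivative transformer: the helper names are hypothetical, each curve is assumed to be a list of channels sampled on a uniform grid, and no rescaling by the grid step or padding is applied.

```python
def discrete_derivative(curve, order=1):
    """Apply first differences `order` times to one channel (hypothetical helper)."""
    values = list(curve)
    for _ in range(order):
        # Each pass shortens the channel by one sample.
        values = [after - before for before, after in zip(values, values[1:])]
    return values


def channelwise_derivative(curves, order=1):
    """Differentiate every channel of a multi-channel curve independently."""
    return [discrete_derivative(channel, order) for channel in curves]


# Two channels: a parabola sampled at integers, and a constant curve.
curves = [[0, 1, 4, 9, 16], [5, 5, 5, 5, 5]]
print(channelwise_derivative(curves, order=1))  # [[1, 3, 5, 7], [0, 0, 0, 0]]
print(channelwise_derivative(curves, order=2))  # [[2, 2, 2], [0, 0, 0]]
```

In this sketch a derivative of order k shortens each channel by k samples.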
New metaestimators subpackage¶
A new metaestimators subpackage has been added with a CollectionTransformer meta-estimator which converts any transformer instance into a fit-transformer acting on collections (#495).
Images¶
Time series¶
TakensEmbedding is now a new transformer acting on collections of time series (#460).
The former TakensEmbedding acting on a single time series has been renamed to SingleTakensEmbedding, and the internal logic employed in its fit for computing optimal hyperparameters is now available via a takens_embedding_optimal_parameters convenience function (#460).
The _slice_windows method of SlidingWindow has been made public and renamed to slice_windows (#460).
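As background for the Takens embedding transformers mentioned above, the following pure-Python sketch shows what a time-delay embedding of a single univariate series looks like. It illustrates the general technique only. The function name and the parameter names dimension and time_delay are assumptions made for this example, the stride between consecutive embedded points is fixed to 1, and no search for optimal parameters is attempted.

```python
def takens_embedding(series, dimension=3, time_delay=1):
    """Map a scalar series to points (x[t], x[t + d], ..., x[t + (m - 1) * d])."""
    n_points = len(series) - (dimension - 1) * time_delay
    if n_points <= 0:
        raise ValueError("series is too short for this dimension and time_delay")
    return [
        [series[start + k * time_delay] for k in range(dimension)]
        for start in range(n_points)
    ]


series = [0, 1, 2, 3, 4, 5]
print(takens_embedding(series, dimension=3, time_delay=2))  # [[0, 2, 4], [1, 3, 5]]
```

The resulting list of points is a point cloud, which is the kind of input a persistent homology transformer works on.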
Graphs¶
GraphGeodesicDistance has been improved as follows (#422):
    The new parameters directed, unweighted and method have been added.
    The rules on the role of zero entries, infinity entries, and non-stored values have been made clearer.
    Masked arrays are now supported.
A mode parameter has been added to KNeighborsGraph; as in scikit-learn, it can be set to either "distance" or "connectivity" (#478).
List input is now accepted by all transformers in gtda.graphs, and outputs are consistently either lists or 3D arrays (#478).
Sparse matrices returned by KNeighborsGraph and TransitionGraph now have int dtype (0-1 adjacency matrices), and are not necessarily symmetric (#478).
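To see why a k-nearest-neighbour graph need not be symmetric, and what the two mode values distinguish, consider the toy sketch below. It is pure Python and does not reproduce KNeighborsGraph: the function name knn_graph, the dictionary-of-dictionaries output, and the Euclidean metric are all assumptions of this example.

```python
import math


def knn_graph(points, n_neighbors=1, mode="connectivity"):
    """Directed k-nearest-neighbour graph as {source: {target: value}} (hypothetical helper)."""
    if mode not in ("connectivity", "distance"):
        raise ValueError("mode must be 'connectivity' or 'distance'")
    graph = {}
    for i, point in enumerate(points):
        # Distances from point i to every other point, nearest first.
        neighbours = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i
        )
        graph[i] = {
            j: (1 if mode == "connectivity" else distance)
            for distance, j in neighbours[:n_neighbors]
        }
    return graph


points = [(0.0,), (1.0,), (3.0,)]
print(knn_graph(points, mode="connectivity"))  # {0: {1: 1}, 1: {0: 1}, 2: {1: 1}}
print(knn_graph(points, mode="distance"))      # {0: {1: 1.0}, 1: {0: 1.0}, 2: {1: 2.0}}
```

Point 2 lists point 1 as its nearest neighbour, but point 1 does not list point 2, so the adjacency structure is not symmetric.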
Mapper¶
Pullback cover set labels and partial cluster labels have been added to Mapper node hovertexts (#445).
The functionality of Nerve and make_mapper_pipeline has been greatly extended (#447 and #456):
    Node and edge metadata are now accessible in output igraph.Graph objects by means of the VertexSeq and EdgeSeq attributes vs and es (respectively). Graph-level dictionaries are no longer used.
    Available node metadata can be accessed by graph.vs[attr_name] where attr_name is one of "pullback_set_label", "partial_cluster_label", or "node_elements".
    Sizes of intersections are automatically stored as edge weights, accessible by graph.es["weight"].
    A "store_intersections" keyword argument has been added to Nerve and make_mapper_pipeline to allow storing the indices defining node intersections as edge attributes, accessible via graph.es["edge_elements"].
    A contract_nodes optional parameter has been added to both Nerve and make_mapper_pipeline; nodes which are subsets of other nodes are thrown away from the graph when this parameter is set to True.
    A graph_ attribute is stored during Nerve.fit.
Two of the Nerve parameters (min_intersection and the new contract_nodes) are now available in the widgets generated by plot_interactive_mapper_graph, and the layout of these widgets has been improved (#456).
ParallelClustering and Nerve have been exposed in the documentation and in gtda.mapper’s __init__ (#447).
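The roles of intersection sizes as edge weights, of min_intersection, and of contract_nodes can be illustrated with a toy nerve construction in pure Python. This sketch does not use igraph and is not the Nerve implementation: build_nerve is a hypothetical helper, nodes are plain sets of sample indices, and the choice to drop only nodes strictly contained in another node is an assumption.

```python
from itertools import combinations


def build_nerve(node_elements, min_intersection=1, contract_nodes=False):
    """Toy nerve: nodes are sets of sample indices, edges join overlapping nodes."""
    nodes = [frozenset(elements) for elements in node_elements]
    if contract_nodes:
        # Assumption: only nodes strictly contained in another node are dropped.
        nodes = [
            node for node in nodes
            if not any(node < other for other in nodes)
        ]
    edges = {}
    for (i, first), (j, second) in combinations(enumerate(nodes), 2):
        common = first & second
        if len(common) >= min_intersection:
            # The edge weight is the size of the intersection.
            edges[(i, j)] = {
                "weight": len(common),
                "edge_elements": sorted(common),
            }
    return nodes, edges


clusters = [[0, 1, 2], [2, 3], [3, 4, 5], [4, 5]]
nodes, edges = build_nerve(clusters)
print(edges[(2, 3)])  # {'weight': 2, 'edge_elements': [4, 5]}

contracted_nodes, contracted_edges = build_nerve(clusters, contract_nodes=True)
print(len(contracted_nodes))  # 3
```

With contract_nodes=True the cluster [4, 5] disappears because it is contained in [3, 4, 5], and the edge between them goes with it.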
Plotting¶
A plot_params kwarg is available in plotting functions and methods throughout to allow user customisability of output figures. The user must pass a dictionary with keys "layout" and/or "trace" (or "traces" in some cases) (#441).
Several plots produced by plot class methods now have default titles (#453).
Infinite deaths are now plotted by plot_diagrams (#461).
Possible multiplicities of persistence pairs in persistence diagram plots are now indicated in the hovertext (#454).
plot_heatmap now accepts boolean array input (#444).
New tutorials and examples¶
The following new tutorials have been added:
Topology of time series, which explains the theory of the Takens time-delay embedding and its use with persistent homology, demonstrates the new API of several components in gtda.time_series, and shows how to construct time series classification pipelines in giotto-tda by partially reproducing arXiv:1910.08245.
Topology in time series forecasting, which explains how to set up time series forecasting pipelines in giotto-tda via TransformerResamplerMixins and the giotto-tda Pipeline class.
Topological feature extraction from graphs, which explains what the features extracted from directed or undirected graphs by VietorisRipsPersistence, SparseRipsPersistence and FlagserPersistence are.
Classifying handwritten digits, which presents a fully-fledged machine learning pipeline in which cubical persistent homology is applied to the classification of handwritten images from the MNIST dataset, partially reproducing arXiv:1910.08345.
Utils¶
Bug Fixes¶
A bug has been fixed which could lead to features with negative lifetime in persistent homology transformers when infinity_values was set too low (#339).
By relying on scipy’s shortest_path instead of scikit-learn’s graph_shortest_path, some errors in computing GraphGeodesicDistance (e.g. when some edges are zero) have been fixed (#422).
A bug in the handling of COO matrices by the ripser interface has been fixed (#465).
A bug which led to the incorrect handling of the homology_dimensions parameter in Filtering has been fixed (#439).
An issue with the use of joblib.Parallel, which led to errors when attempting to run HeatKernel, PersistenceImage, and the corresponding amplitudes and distances on large datasets, has been fixed (#428 and #481).
A bug leading to plots of persistence diagrams not showing points with negative births or deaths has been fixed, as has a bug with the computation of the range to be shown in the plot (#437).
A bug in the handling of persistence pairs with negative death values by Filtering has been fixed (#436).
A bug in the handling of homology_dimension_ix (now renamed to homology_dimension_idx) in the plot methods of HeatKernel and PersistenceImage has been fixed (#452).
A bug in the labelling of axes in HeatKernel and PersistenceImage plots has been fixed (#453 and #454).
PersistenceLandscape plots now show all homology dimensions, instead of just the first (#454).
A bug in the computation of amplitudes and pairwise distances based on persistence images has been fixed (#454).
Silhouette now does not create NaNs when a subdiagram is trivial (#454).
CubicalPersistence now does not create pairs with negative persistence when infinity_values is set too low (#467).
Warnings are no longer thrown by KNeighborsGraph when metric="precomputed" (#506).
A bug in Labeller.resample affecting cases in which n_steps_future >= size - 1 has been fixed (#460).
A bug in validate_params, affecting the case of tuples of allowed types, has been fixed (#502).
Backwards-Incompatible Changes¶
The minimum required versions of most of the dependencies have been bumped. The updated dependencies are numpy >= 1.19.1, scipy >= 1.5.0, joblib >= 0.16.0, scikit-learn >= 0.23.1, python-igraph >= 0.8.2, plotly >= 4.8.2, and pyflagser >= 0.4.1 (#457).
GraphGeodesicDistance now returns either lists or 3D dense ndarrays for compatibility with the homology transformers (#422).
The output of PairwiseDistance has been transposed to match the scikit-learn convention (n_samples_transform, n_samples_fit) (#420).
plot class methods now return figures instead of showing them (#441).
Mapper node and edge attributes are no longer stored as graph-level dictionaries, "node_id" is no longer an available node attribute, and the attributes nodes_ and edges_ previously stored by Nerve.fit have been removed in favour of a graph_ attribute (#447).
The homology_dimension_ix parameter available in some transformers in gtda.diagrams has been renamed to homology_dimension_idx (#452).
The base of the logarithm used by PersistenceEntropy is now 2 instead of e, and NaN values are replaced with -1 instead of 0 by default (#450 and #474).
The outputs of PersistenceImage, HeatKernel and of the pairwise distances and amplitudes based on them are now different due to the improvements described above.
Weights are no longer stored in the effective_metric_params_ attribute of PairwiseDistance, Amplitude and Scaler objects when the metric is persistence-image–based; only the weight function is (#454).
The homology_dimensions_ attributes of several transformers have been converted from lists to tuples. When possible, homology dimensions stored as parts of attributes are now presented as ints (#454).
gaussian_filter (used to make heat– and persistence-image–based representations/pairwise distances/amplitudes) is now called with mode="constant" instead of "reflect" (#454).
The default value of order in Amplitude has been changed from 2. to None, giving vector instead of scalar features (#454).
The meaning of the default None for weight_function in PersistenceImage (and in Amplitude and PairwiseDistance when metric="persistence_image") has been changed from the identity function to the function returning a vector of ones (#454).
Due to the updates in the GUDHI components, some of the bindings and Python interfaces to the GUDHI C++ components in gtda.externals have changed (#468).
Labeller.transform now returns a 1D array instead of a column array (#475).
PersistenceLandscape now returns 3D arrays instead of 4D ones, for compatibility with the new curves subpackage (#480).
By default, CubicalPersistence now removes one infinite bar in H0 (#467, and see above).
The former width parameter in SlidingWindow and Labeller has been replaced with a more intuitive size parameter. The relation between the two is: size = width + 1 (#460).
clusterer is now a required parameter in ParallelClustering (#508).
The max_fraction parameter in FirstSimpleGap and FirstHistogramGap now indicates the floor of max_fraction * n_samples; its default value has been changed from None to 1 (#412).
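When migrating code across the width to size change listed above, the relation size = width + 1 can be checked on a toy series. The sketch below is pure Python and is not the SlidingWindow implementation: both helper names are hypothetical, size is assumed to count the values in each window, and a stride of 1 starting from the beginning of the series is assumed.

```python
def width_to_size(width):
    """Convert the former width setting to the new size setting."""
    return width + 1


def sliding_windows(series, size, stride=1):
    """Return consecutive windows of `size` values (hypothetical helper)."""
    last_start = len(series) - size
    return [
        list(series[start:start + size])
        for start in range(0, last_start + 1, stride)
    ]


series = [10, 11, 12, 13, 14]
size = width_to_size(2)  # a former width of 2 becomes a size of 3
print(sliding_windows(series, size))  # [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
```

How the library aligns windows when the stride does not divide the series evenly is not modelled here.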
Thanks to our Contributors¶
This release contains contributions from many people:
Umberto Lupo, Guillaume Tauzin, Julian Burella Pérez, Wojciech Reise, Lewis Tunstall, Nick Sale, and Anibal Medina-Mardones.
We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.