# Overview

**A high-performance topological machine learning toolbox in Python**

`giotto-tda` is a high-performance topological machine learning toolbox in Python built on top of `scikit-learn` and is distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.

## Guiding principles

- **Seamless integration with `scikit-learn`:** Strictly adhere to the `scikit-learn` API and development guidelines, and inherit the strengths of that framework.
- **Code modularity:** Topological feature creation steps are implemented as transformers. This allows for the creation of a large number of topologically-powered machine learning pipelines.
- **Standardisation:** Implement the most successful techniques from the literature in a generic framework with a consistent API.
- **Innovation:** Improve on existing algorithms, and make new ones available in open source.
- **Performance:** For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. The code is vectorized and implements multi-core parallelism (with `joblib`).
- **Data structures:** Support for tabular data, time series, graphs, and images.

## 30s guide to `giotto-tda`

To install `giotto-tda`, see the installation instructions.

The functionalities of `giotto-tda` are provided in `scikit-learn`–style transformers. This allows you to generate topological features from your data in a familiar way. Here is an example with the `VietorisRipsPersistence` transformer:

```
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()
```

which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:

```
diagrams = VR.fit_transform(point_clouds)
```
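To give a feel for what a persistence diagram records, the sketch below computes the 0-dimensional part of a Vietoris–Rips diagram for a tiny point cloud in plain Python. It is an illustration only, not how `giotto-tda` computes diagrams, and the function name `h0_persistence` is made up for this example. Every connected component is born at 0 and dies at the edge length at which it merges into another component; the single component that never dies is left out.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the finite H0 (birth, death) pairs, in order of increasing death.

    Illustrative sketch: single-linkage merging with a union-find structure.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (the filtration value).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this edge length.
            parent[root_i] = root_j
            pairs.append((0.0, length))
    return pairs


# Two tight pairs of points, far apart from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
diagram_h0 = h0_persistence(cloud)
```

The two short bars come from the points within each tight pair joining up, and the long bar records how far apart the two pairs are.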

A plotting API allows for quick visual inspection of the outputs of many of `giotto-tda`’s transformers. To visualize the i-th output sample, run

```
fig = VR.plot(diagrams, sample=i)  # returns a figure; `diagrams` stays available for later steps
fig.show()
```

You can create scalar or vector features from persistence diagrams using `giotto-tda`’s dedicated transformers. Here is an example with the `PersistenceEntropy` transformer:

```
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
```

`features` is a two-dimensional `numpy` array. This is important for making this type of topological feature generation fit into a typical machine learning workflow from `scikit-learn`. In particular, topological feature creation steps can be fed to or used alongside models from `scikit-learn`, creating end-to-end pipelines which can be evaluated in cross-validation, optimised via grid-searches, etc.:

```
from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
```
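As background for the `PersistenceEntropy` step used in the pipeline above, here is a minimal sketch of the quantity it is named after: the Shannon entropy of the normalised bar lifetimes of one diagram. The function `persistence_entropy` is a hypothetical helper written for this page; it uses a base-2 logarithm and leaves out the optional normalisation and NaN handling that the library offers.

```python
import math


def persistence_entropy(pairs):
    """Shannon entropy (base 2) of the normalised lifetimes of (birth, death) pairs."""
    lifetimes = [death - birth for birth, death in pairs if death > birth]
    total = sum(lifetimes)
    return -sum((l / total) * math.log2(l / total) for l in lifetimes)


# Four bars of equal length: the entropy is log2(4) = 2 bits.
equal_bars = [(0.0, 1.0), (0.0, 1.0), (2.0, 3.0), (5.0, 6.0)]
entropy_equal = persistence_entropy(equal_bars)

# One dominant bar and one short one: the entropy drops below 1 bit.
uneven_bars = [(0.0, 9.0), (0.0, 1.0)]
entropy_uneven = persistence_entropy(uneven_bars)
```

A diagram reduces to one number per homology dimension in this way, which is why the output for a collection of diagrams fits in a two-dimensional array.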

`giotto-tda` also implements the Mapper algorithm as a highly customisable `scikit-learn` `Pipeline`, and provides simple plotting functions for visualizing output Mapper graphs and interacting with the pipeline parameters in real time:

```
from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)
```

## Resources

### Tutorials and examples

We provide a number of tutorials and examples, which offer:

- quick start guides to the API;
- in-depth examples showcasing more of the library’s features;
- intuitive explanations of topological techniques.

## What’s new

### Major Features and Improvements

This is a major release which adds substantial new functionality and introduces several improvements.

#### Persistent homology of directed flag complexes via `pyflagser`

- The `pyflagser` package (source, docs) is now an official dependency of `giotto-tda`.
- The `FlagserPersistence` transformer has been added to `gtda.homology` (#339). It wraps `pyflagser.flagser_weighted` to allow for computations of persistence diagrams from directed or undirected weighted graphs. A new notebook demonstrates its use.

#### Edge collapsing and performance improvements for persistent homology

- GUDHI C++ components have been updated to the state of GUDHI v3.3.0, yielding performance improvements in `SparseRipsPersistence`, `EuclideanCechPersistence` and `CubicalPersistence` (#468).
- Bindings for GUDHI’s edge collapser have been created and can now be used as an optional preprocessing step via the optional keyword argument `collapse_edges` in `VietorisRipsPersistence` and in `gtda.externals.ripser` (#469 and #483). When `collapse_edges=True`, and the input data and/or number of required homology dimensions is sufficiently large, the resulting runtimes for Vietoris–Rips persistent homology are state of the art.
- The performance of the Ripser bindings has otherwise been improved by avoiding unnecessary data copies, better managing the memory, and using more efficient matrix routines (#501 and #507).

#### New transformers and functionality in `gtda.homology`

- The `WeakAlphaPersistence` transformer has been added to `gtda.homology` (#464). Like `VietorisRipsPersistence`, `SparseRipsPersistence` and `EuclideanCechPersistence`, it computes persistent homology from point clouds, but its runtime can scale much better with size in low dimensions.
- `VietorisRipsPersistence` now accepts sparse input when `metric="precomputed"` (#424).
- `CubicalPersistence` now accepts lists of 2D arrays (#503).
- A `reduced_homology` parameter has been added to all persistent homology transformers. When `True`, one infinite bar in the H0 barcode is removed for the user automatically. Previously, it was not possible to *keep* these bars in the simplicial homology transformers. The default is always `True`, which implies a breaking change in the case of `CubicalPersistence` (#467).
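The effect described for `reduced_homology=True` can be pictured with a small sketch on a barcode held as a list of `(birth, death)` tuples. The helper `drop_one_infinite_bar` is hypothetical and only mirrors the sentence above; it does not show how the library stores or truncates infinite bars internally.

```python
import math


def drop_one_infinite_bar(h0_bars):
    """Return a copy of an H0 barcode with a single infinite bar removed, if present."""
    bars = list(h0_bars)
    for index, (_, death) in enumerate(bars):
        if math.isinf(death):
            del bars[index]
            break  # remove at most one bar
    return bars


# H0 barcode of a connected space: exactly one bar never dies.
full_h0 = [(0.0, 1.0), (0.0, 2.5), (0.0, math.inf)]
reduced_h0 = drop_one_infinite_bar(full_h0)
```

For a connected space the removed bar carries no information, since it is always present, which is why dropping it by default is convenient.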

#### Persistence diagrams

- A `ComplexPolynomial` feature extraction transformer has been added (#479).
- A `NumberOfPoints` feature extraction transformer has been added (#496).
- An option to normalize the entropy in `PersistenceEntropy` according to a heuristic has been added, and a `nan_fill_value` parameter allows replacing any NaN produced by the entropy calculation with a fixed constant (#450).
- The computations in `HeatKernel`, `PersistenceImage` and in the pairwise distances and amplitudes related to them have been changed to yield the continuum limit when `n_bins` tends to infinity; `sigma` is now measured in the same units as the filtration parameter and defaults to 0.1 (#454).

#### New `curves` subpackage

A new `curves` subpackage has been added to preprocess, and extract features from, collections of multi-channel curves such as those returned by `BettiCurve`, `PersistenceLandscape` and `Silhouette` (#480). It contains:

- A `StandardFeatures` transformer that can extract features channel-wise in a generic way.
- A `Derivative` transformer that computes channel-wise derivatives of any order by discrete differences (#492).
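A channel-wise derivative by discrete differences can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the `Derivative` transformer itself: the function names are invented here, and no scaling by the sampling step or padding is applied, so each differentiation shortens a curve by one sample.

```python
def discrete_derivative(curve, order=1):
    """Apply first differences `order` times to a sequence of samples."""
    values = list(curve)
    for _ in range(order):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


def channelwise_derivative(curves, order=1):
    """Differentiate every channel (one curve per channel) independently."""
    return [discrete_derivative(channel, order) for channel in curves]


# Two channels sampled at the same five points.
multi_channel = [
    [0, 1, 4, 9, 16],  # squares of 0..4
    [2, 2, 3, 3, 5],
]
first = channelwise_derivative(multi_channel, order=1)
second = channelwise_derivative(multi_channel, order=2)
```

On the channel of squares the second differences are constant, as expected for a quadratic.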

#### New `metaestimators` subpackage

A new `metaestimators` subpackage has been added with a `CollectionTransformer` meta-estimator which converts any transformer instance into a fit-transformer acting on collections (#495).
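The idea of lifting a transformer to collections can be sketched without any dependencies. Both classes below are hypothetical stand-ins written for this page (`MeanCenterer` is a toy transformer, `CollectionWrapper` a toy meta-estimator); how `CollectionTransformer` copies the wrapped transformer or parallelises the work is not covered here.

```python
import copy


class MeanCenterer:
    """Toy transformer: subtracts the mean of the data it is fitted on."""

    def fit_transform(self, X):
        self.mean_ = sum(X) / len(X)
        return [x - self.mean_ for x in X]


class CollectionWrapper:
    """Apply an independent copy of `transformer` to each item in a collection."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit_transform(self, collection):
        # A fresh copy per item keeps the fits independent of one another.
        return [
            copy.deepcopy(self.transformer).fit_transform(item)
            for item in collection
        ]


# Items in the collection may have different lengths.
collection = [[1.0, 2.0, 3.0], [10.0, 20.0]]
centered = CollectionWrapper(MeanCenterer()).fit_transform(collection)
```

Each item is fitted and transformed on its own, so the second list is centred around 15 rather than around a mean shared with the first.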

#### Images

#### Time series

- `TakensEmbedding` is now a new transformer acting on collections of time series (#460).
- The former `TakensEmbedding` acting on a single time series has been renamed to `SingleTakensEmbedding`, and the internal logic employed in its `fit` for computing optimal hyperparameters is now available via a `takens_embedding_optimal_parameters` convenience function (#460).
- The `_slice_windows` method of `SlidingWindow` has been made public and renamed to `slice_windows` (#460).
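For readers new to the Takens time-delay embedding behind these transformers, the sketch below builds the delay vectors of a single series in plain Python. The function and its parameter names (`dimension`, `time_delay`, `stride`) are assumptions made for this example and are not presented as the library's signatures.

```python
def takens_embedding(series, dimension, time_delay, stride=1):
    """Return the list of delay vectors of a 1D time series."""
    span = (dimension - 1) * time_delay  # offset between first and last coordinate
    return [
        [series[start + k * time_delay] for k in range(dimension)]
        for start in range(0, len(series) - span, stride)
    ]


series = [0, 1, 2, 3, 4, 5, 6]
embedded = takens_embedding(series, dimension=3, time_delay=2)

# Acting on a collection amounts to embedding each series separately.
collection = [[0, 1, 2, 3], [5, 4, 3]]
embedded_collection = [takens_embedding(s, dimension=2, time_delay=1) for s in collection]
```

Each delay vector is a point in `dimension`-dimensional space, so one series becomes one point cloud that can be passed on to a persistent homology step.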

#### Graphs

- `GraphGeodesicDistance` has been improved as follows (#422):
  - The new parameters `directed`, `unweighted` and `method` have been added.
  - The rules on the role of zero entries, infinity entries, and non-stored values have been made clearer.
  - Masked arrays are now supported.
- A `mode` parameter has been added to `KNeighborsGraph`; as in `scikit-learn`, it can be set to either `"distance"` or `"connectivity"` (#478).
- List input is now accepted by all transformers in `gtda.graphs`, and outputs are consistently either lists or 3D arrays (#478).
- Sparse matrices returned by `KNeighborsGraph` and `TransitionGraph` now have int dtype (0-1 adjacency matrices), and are not necessarily symmetric (#478).
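The difference between the two `mode` values, and the reason a k-nearest-neighbour adjacency matrix need not be symmetric, can be seen on three points. The function `knn_graph` below is a dense, pure-Python sketch written for this page; it is not the `KNeighborsGraph` implementation, which returns sparse matrices.

```python
import math


def knn_graph(points, n_neighbors, mode="connectivity"):
    """Dense k-nearest-neighbour adjacency matrix as nested lists.

    Row i is non-zero at the n_neighbors points closest to point i:
    1 in "connectivity" mode, the Euclidean distance in "distance" mode.
    """
    if mode not in ("connectivity", "distance"):
        raise ValueError("mode must be 'connectivity' or 'distance'")
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:n_neighbors]:
            if mode == "connectivity":
                adjacency[i][j] = 1
            else:
                adjacency[i][j] = math.dist(points[i], points[j])
    return adjacency


# Points on a line at 0, 1 and 3: the nearest neighbour of 3 is 1, but not vice versa.
points = [(0.0,), (1.0,), (3.0,)]
connectivity = knn_graph(points, n_neighbors=1)
distances = knn_graph(points, n_neighbors=1, mode="distance")
```

Here the entry in row 2, column 1 is set while the entry in row 1, column 2 is not, so the matrix describes a directed graph.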

#### Mapper

- Pullback cover set labels and partial cluster labels have been added to Mapper node hovertexts (#445).
- The functionality of `Nerve` and `make_mapper_pipeline` has been greatly extended (#447 and #456):
  - Node and edge metadata are now accessible in output `igraph.Graph` objects by means of the `VertexSeq` and `EdgeSeq` attributes `vs` and `es` (respectively). Graph-level dictionaries are no longer used.
  - Available node metadata can be accessed by `graph.vs[attr_name]` where `attr_name` is one of `"pullback_set_label"`, `"partial_cluster_label"`, or `"node_elements"`.
  - Sizes of intersections are automatically stored as edge weights, accessible by `graph.es["weight"]`.
  - A `store_intersections` keyword argument has been added to `Nerve` and `make_mapper_pipeline` to allow storing the indices defining node intersections as edge attributes, accessible via `graph.es["edge_elements"]`.
  - A `contract_nodes` optional parameter has been added to both `Nerve` and `make_mapper_pipeline`; nodes which are subsets of other nodes are thrown away from the graph when this parameter is set to `True`.
  - A `graph_` attribute is stored during `Nerve.fit`.
- Two of the `Nerve` parameters (`min_intersection` and the new `contract_nodes`) are now available in the widgets generated by `plot_interactive_mapper_graph`, and the layout of these widgets has been improved (#456).
- `ParallelClustering` and `Nerve` have been exposed in the documentation and in `gtda.mapper`’s `__init__` (#447).
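The edge metadata listed above follows from how a nerve is built: two nodes are joined when the sets of data points they contain overlap. The sketch below reproduces that rule with plain dictionaries instead of an `igraph.Graph`; the function `nerve_edges` and its return format are assumptions made for this illustration, with key names borrowed from the attributes mentioned above.

```python
from itertools import combinations


def nerve_edges(node_elements, min_intersection=1, store_intersections=False):
    """Edges between nodes whose element sets share at least `min_intersection` points.

    Returns a list of dicts with keys "nodes" and "weight" (intersection size),
    plus "edge_elements" when `store_intersections` is True.
    """
    edges = []
    for (i, elems_i), (j, elems_j) in combinations(enumerate(node_elements), 2):
        shared = sorted(set(elems_i) & set(elems_j))
        if len(shared) >= min_intersection:
            edge = {"nodes": (i, j), "weight": len(shared)}
            if store_intersections:
                edge["edge_elements"] = shared
            edges.append(edge)
    return edges


# Each node is the list of data indices it contains.
node_elements = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 1]]
edges = nerve_edges(node_elements, store_intersections=True)
strong_edges = nerve_edges(node_elements, min_intersection=2)
```

In this toy input the last node is a subset of the first one, which is the situation that node contraction is described as removing.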

#### Plotting

- A `plot_params` kwarg is available in plotting functions and methods throughout to allow user customisability of output figures. The user must pass a dictionary with keys `"layout"` and/or `"trace"` (or `"traces"` in some cases) (#441).
- Several plots produced by `plot` class methods now have default titles (#453).
- Infinite deaths are now plotted by `plot_diagrams` (#461).
- Possible multiplicities of persistence pairs in persistence diagram plots are now indicated in the hovertext (#454).
- `plot_heatmap` now accepts boolean array input (#444).
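A customisation dictionary of the shape described in the first item might look like the sketch below. Only the top-level keys come from the note above; the nested entries are ordinary plotly layout and marker properties picked for illustration, and the exact keyword name and accepted keys should be checked against the signature of the plotting function in the installed version.

```python
# Hypothetical customisation dictionary; the nested keys are plotly
# layout and trace properties, chosen here only as examples.
plot_params = {
    "layout": {"title": "Persistence diagram, sample 0", "width": 600, "height": 600},
    "trace": {"marker": {"size": 6, "opacity": 0.8}},
}

# Sanity check: only the top-level keys named in the release note are present.
allowed_keys = {"layout", "trace", "traces"}
unexpected = set(plot_params) - allowed_keys
```

Keeping layout and trace settings in separate sub-dictionaries mirrors the way plotly itself distinguishes figure-level options from per-trace options.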

#### New tutorials and examples

The following new tutorials have been added:

- Topology of time series, which explains the theory of the Takens time-delay embedding and its use with persistent homology, demonstrates the new API of several components in `gtda.time_series`, and shows how to construct time series *classification* pipelines in `giotto-tda` by partially reproducing arXiv:1910.08245.
- Topology in time series forecasting, which explains how to set up time series *forecasting* pipelines in `giotto-tda` via `TransformerResamplerMixin`s and the `giotto-tda` `Pipeline` class.
- Topological feature extraction from graphs, which explains what the features extracted from directed or undirected graphs by `VietorisRipsPersistence`, `SparseRipsPersistence` and `FlagserPersistence` are.
- Classifying handwritten digits, which presents a fully-fledged machine learning pipeline in which cubical persistent homology is applied to the classification of handwritten images from the MNIST dataset, partially reproducing arXiv:1910.08345.

#### Utils

### Bug Fixes

- A bug has been fixed which could lead to features with negative lifetime in persistent homology transformers when `infinity_values` was set too low (#339).
- By relying on `scipy`’s `shortest_path` instead of `scikit-learn`’s `graph_shortest_path`, some errors in computing `GraphGeodesicDistance` (e.g. when some edges are zero) have been fixed (#422).
- A bug in the handling of COO matrices by the `ripser` interface has been fixed (#465).
- A bug which led to the incorrect handling of the `homology_dimensions` parameter in `Filtering` has been fixed (#439).
- An issue with the use of `joblib.Parallel`, which led to errors when attempting to run `HeatKernel`, `PersistenceImage`, and the corresponding amplitudes and distances on large datasets, has been fixed (#428 and #481).
- A bug leading to plots of persistence diagrams not showing points with negative births or deaths has been fixed, as has a bug with the computation of the range to be shown in the plot (#437).
- A bug in the handling of persistence pairs with negative death values by `Filtering` has been fixed (#436).
- A bug in the handling of `homology_dimension_ix` (now renamed to `homology_dimension_idx`) in the `plot` methods of `HeatKernel` and `PersistenceImage` has been fixed (#452).
- A bug in the labelling of axes in `HeatKernel` and `PersistenceImage` plots has been fixed (#453 and #454).
- `PersistenceLandscape` plots now show all homology dimensions, instead of just the first (#454).
- A bug in the computation of amplitudes and pairwise distances based on persistence images has been fixed (#454).
- `Silhouette` now does not create NaNs when a subdiagram is trivial (#454).
- `CubicalPersistence` now does not create pairs with negative persistence when `infinity_values` is set too low (#467).
- Warnings are no longer thrown by `KNeighborsGraph` when `metric="precomputed"` (#506).
- A bug in `Labeller.resample` affecting cases in which `n_steps_future >= size - 1` has been fixed (#460).
- A bug in `validate_params`, affecting the case of tuples of allowed types, has been fixed (#502).

### Backwards-Incompatible Changes

- The minimum required versions of most of the dependencies have been bumped. The updated dependencies are `numpy >= 1.19.1`, `scipy >= 1.5.0`, `joblib >= 0.16.0`, `scikit-learn >= 0.23.1`, `python-igraph >= 0.8.2`, `plotly >= 4.8.2`, and `pyflagser >= 0.4.1` (#457).
- `GraphGeodesicDistance` now returns either lists or 3D dense ndarrays for compatibility with the homology transformers (#422).
- The output of `PairwiseDistance` has been transposed to match the `scikit-learn` convention `(n_samples_transform, n_samples_fit)` (#420).
- `plot` class methods now return figures instead of showing them (#441).
- Mapper node and edge attributes are no longer stored as graph-level dictionaries, `"node_id"` is no longer an available node attribute, and the attributes `nodes_` and `edges_` previously stored by `Nerve.fit` have been removed in favour of a `graph_` attribute (#447).
- The `homology_dimension_ix` parameter available in some transformers in `gtda.diagrams` has been renamed to `homology_dimension_idx` (#452).
- The base of the logarithm used by `PersistenceEntropy` is now 2 instead of *e*, and NaN values are replaced with -1 instead of 0 by default (#450 and #474).
- The outputs of `PersistenceImage`, `HeatKernel` and of the pairwise distances and amplitudes based on them are now different due to the improvements described above.
- Weights are no longer stored in the `effective_metric_params_` attribute of `PairwiseDistance`, `Amplitude` and `Scaler` objects when the metric is persistence-image–based; only the weight function is (#454).
- The `homology_dimensions_` attributes of several transformers have been converted from lists to tuples. When possible, homology dimensions stored as parts of attributes are now presented as ints (#454).
- `gaussian_filter` (used to make heat– and persistence-image–based representations/pairwise distances/amplitudes) is now called with `mode="constant"` instead of `"reflect"` (#454).
- The default value of `order` in `Amplitude` has been changed from `2.` to `None`, giving vector instead of scalar features (#454).
- The meaning of the default `None` for `weight_function` in `PersistenceImage` (and in `Amplitude` and `PairwiseDistance` when `metric="persistence_image"`) has been changed from the identity function to the function returning a vector of ones (#454).
- Due to the updates in the GUDHI components, some of the bindings and Python interfaces to the GUDHI C++ components in `gtda.externals` have changed (#468).
- `Labeller.transform` now returns a 1D array instead of a column array (#475).
- `PersistenceLandscape` now returns 3D arrays instead of 4D ones, for compatibility with the new `curves` subpackage (#480).
- By default, `CubicalPersistence` now removes one infinite bar in H0 (#467, and see above).
- The former `width` parameter in `SlidingWindow` and `Labeller` has been replaced with a more intuitive `size` parameter. The relation between the two is: `size = width + 1` (#460).
- `clusterer` is now a required parameter in `ParallelClustering` (#508).
- The `max_fraction` parameter in `FirstSimpleGap` and `FirstHistogramGap` now indicates the floor of `max_fraction * n_samples`; its default value has been changed from `None` to `1` (#412).
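Some of the changes above are simple numerical conversions, which the sketch below spells out. All three helpers are hypothetical and exist only to make the arithmetic explicit: the `width` to `size` relation, the change of logarithm base for entropies (valid for values computed without normalisation or NaN replacement), and the floor applied to `max_fraction * n_samples`.

```python
import math


def width_to_size(width):
    """New `size` corresponding to a former `width` (size = width + 1)."""
    return width + 1


def entropy_nats_to_bits(entropy_base_e):
    """Convert an entropy computed with natural logarithms to base 2."""
    return entropy_base_e / math.log(2)


def max_fraction_count(max_fraction, n_samples):
    """Floor of max_fraction * n_samples."""
    return math.floor(max_fraction * n_samples)


new_size = width_to_size(9)                   # a window of former width 9
bits = entropy_nats_to_bits(math.log(4))      # entropy of 4 equally likely outcomes
count = max_fraction_count(0.25, 10)          # floor of 2.5
```

When porting code written against the previous release, conversions like these are a quick way to check that old and new parameter values describe the same computation.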

### Thanks to our Contributors

This release contains contributions from many people:

Umberto Lupo, Guillaume Tauzin, Julian Burella Pérez, Wojciech Reise, Lewis Tunstall, Nick Sale, and Anibal Medina-Mardones.

We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.