Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy
In this notebook, we showcase the ease of use of one of the core components of giotto-tda: VietorisRipsPersistence, along with vectorization methods. We first list the steps in a typical topological feature extraction routine and then show how to encapsulate them in a standard scikit-learn-like pipeline.
If you are looking at a static version of this notebook and would like to run its contents, head over to GitHub and download the source.
License: AGPLv3
Generate data
Let’s begin by generating 3D point clouds of spheres and tori, along with a label of 0 (1) for each sphere (torus). We also add noise to each point cloud, whose effect is to displace the points sampling the surfaces by a random amount in a random direction. Note: you will need the auxiliary module generate_datasets.py to run this cell. You can change the second argument of make_point_clouds to obtain a finer or coarser sampling, or tune the level of noise via the third.
from data.generate_datasets import make_point_clouds

n_samples_per_class = 10
point_clouds, labels = make_point_clouds(n_samples_per_class, 10, 0.1)
print(f"There are {point_clouds.shape[0]} point clouds in {point_clouds.shape[2]} dimensions, "
      f"each with {point_clouds.shape[1]} points.")
There are 30 point clouds in 3 dimensions, each with 100 points.
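If the auxiliary module is not available, a rough stand-in can be written with NumPy alone. The sketch below is purely illustrative: the sampling scheme, radii and function names are assumptions, not the actual implementation of make_point_clouds.

import numpy as np

# Hypothetical stand-in for make_point_clouds (NOT the actual module code).
def sample_sphere(n_points, noise, rng):
    # Uniform directions on the unit sphere, plus Gaussian displacement.
    pts = rng.normal(size=(n_points, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    return pts + noise * rng.normal(size=pts.shape)

def sample_torus(n_points, noise, rng, R=2.0, r=1.0):
    # Standard torus parametrization: major radius R, minor radius r.
    u = rng.uniform(0.0, 2 * np.pi, n_points)
    v = rng.uniform(0.0, 2 * np.pi, n_points)
    pts = np.stack([(R + r * np.cos(v)) * np.cos(u),
                    (R + r * np.cos(v)) * np.sin(u),
                    r * np.sin(v)], axis=1)
    return pts + noise * rng.normal(size=pts.shape)

rng = np.random.default_rng(0)
clouds = np.stack([sample_sphere(100, 0.1, rng) for _ in range(10)]
                  + [sample_torus(100, 0.1, rng) for _ in range(10)])
lbls = np.array([0] * 10 + [1] * 10)  # 0 = sphere, 1 = torus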
Calculate persistent homology
Instantiate a VietorisRipsPersistence transformer and calculate so-called persistence diagrams for this collection of point clouds.
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence(homology_dimensions=[0, 1, 2]) # Parameter explained in the text
diagrams = VR.fit_transform(point_clouds)
diagrams.shape
(30, 171, 3)
Important note: VietorisRipsPersistence, and all other “persistent homology” transformers in gtda.homology, expect input in the form of a 3D array or, in some cases, a list of 2D arrays. For each entry in the input (here, for each point cloud in point_clouds) they compute a topological summary which is also a 2D array, and then stack all these summaries into a single output 3D array. So, in our case, diagrams[i] represents the topology of point_clouds[i]. diagrams[i] is interpreted as follows:

- Each row is a triplet describing a single topological feature found in point_clouds[i].
- The first and second entries in the triplet denote the values of the “filtration parameter” at which the feature appears and disappears, respectively. They are referred to as the “birth” and “death” values of the feature. The meaning of “filtration parameter” depends on the specific transformer, but in the case of VietorisRipsPersistence on point clouds it has the interpretation of a length scale.
- A topological feature can be a connected component, 1D hole/loop, 2D cavity, or more generally a \(d\)-dimensional “void” which exists in the data at scales between its birth and death values. The integer \(d\) is the homology dimension (or degree) of the feature and is stored as the third entry in the triplet. In this example, the shapes should have 2D cavities, so we explicitly tell VietorisRipsPersistence to look for these via the homology_dimensions parameter!
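To make this concrete, we can slice a single diagram by column and count features per homology dimension. The following snippet is a small exploratory addition (not part of the original notebook); note that giotto-tda pads diagrams with zero-persistence triplets so that all samples fit in one 3D array.

import numpy as np

# Split one diagram into birth, death and homology-dimension columns.
births, deaths, dims = diagrams[0].T

# Count features per homology dimension and report the most persistent
# one (largest death - birth). Counts include zero-persistence padding.
for d in np.unique(dims):
    mask = dims == d
    lifetimes = deaths[mask] - births[mask]
    print(f"H{int(d)}: {mask.sum()} features, max persistence {lifetimes.max():.3f}")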
If we make one scatter plot per available homology dimension, and plot births and deaths as x- and y-coordinates of points in 2D, we end up with a 2D representation of diagrams[i], which is why it is called a persistence diagram:
from gtda.plotting import plot_diagram
i = 0
plot_diagram(diagrams[i])
The notebook Plotting in giotto-tda delves deeper into the plotting functions and class methods available in giotto-tda.
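For instance, recent versions of giotto-tda also expose a plot method directly on the transformer classes; assuming such a version, the figure above can be reproduced with:

VR.plot(diagrams, sample=i)  # equivalent to plot_diagram(diagrams[i])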
Extract features
Instantiate a PersistenceEntropy transformer and extract scalar features from the persistence diagrams.
from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)
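PersistenceEntropy produces one scalar feature per homology dimension per diagram, so with the three homology dimensions used above we expect a feature matrix of shape (30, 3). A quick check, added here for illustration:

# One entropy value per homology dimension for each point cloud.
print(features.shape)  # expected: (30, 3)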
Use the new features in a standard classifier
Leverage the compatibility with scikit-learn to perform a train-test split and score the features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(features, labels)
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
1.0
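Both train_test_split and RandomForestClassifier are stochastic, so the score can vary between runs. A common variation, not part of the original notebook, fixes the random seeds for reproducibility:

# Reproducible variant: fix random_state in the split and the model.
X_train, X_valid, y_train, y_valid = train_test_split(
    features, labels, random_state=42, stratify=labels)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))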
Encapsulate the steps above in a pipeline
1. Define an end-to-end pipeline by chaining transformers from giotto-tda with scikit-learn ones.
2. Train-test split the input point cloud data and labels.
3. Fit the pipeline on the training data.
4. Score the fitted pipeline on the test data.
from sklearn.pipeline import make_pipeline
steps = [VietorisRipsPersistence(homology_dimensions=[0, 1, 2]),
         PersistenceEntropy(),
         RandomForestClassifier()]
pipeline = make_pipeline(*steps)
pcs_train, pcs_valid, labels_train, labels_valid = train_test_split(point_clouds, labels)
pipeline.fit(pcs_train, labels_train)
pipeline.score(pcs_valid, labels_valid)
1.0
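Since the pipeline is a standard scikit-learn estimator, it also composes with model-selection utilities such as cross_val_score. A brief sketch, added for illustration:

from sklearn.model_selection import cross_val_score

# Cross-validate the full pipeline directly on the raw point clouds.
scores = cross_val_score(pipeline, point_clouds, labels, cv=3)
print(scores.mean())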