Getting started with Mapper¶
In this notebook we explore a few of the core features included in
giotto-tda’s implementation of the Mapper
algorithm.
If you are looking at a static version of this notebook and would like to run its contents, head over to GitHub and download the source.
Useful references¶
- An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists 
- An Introduction to Topological Data Analysis for Physicists: From LGM to FRBs 
License: AGPLv3
Import libraries¶
# Data wrangling
import numpy as np
import pandas as pd  # Not a requirement of giotto-tda, but is compatible with the gtda.mapper module
# Data viz
from gtda.plotting import plot_point_cloud
# TDA magic
from gtda.mapper import (
    CubicalCover,
    make_mapper_pipeline,
    Projection,
    plot_static_mapper_graph,
    plot_interactive_mapper_graph
)
# ML tools
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
Generate and visualise data¶
As a simple example, let’s generate a two-dimensional point cloud of two concentric circles. The goal will be to examine how Mapper can be used to generate a topological graph that captures the salient features of the data.
data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)
plot_point_cloud(data)
Configure the Mapper pipeline¶
Given a dataset \({\cal D}\) of points \(x \in \mathbb{R}^n\), the basic steps behind Mapper are as follows:
- Map \({\cal D}\) to a lower-dimensional space using a filter function \(f: \mathbb{R}^n \to \mathbb{R}^m\). Common choices for the filter function include projection onto one or more axes via PCA or density-based methods. In - giotto-tda, you can import a variety of filter functions as follows:
from gtda.mapper.filter import FilterFunctionName
- Construct a cover of the filter values \({\cal U} = (U_i)_{i\in I}\), typically in the form of a set of overlapping intervals which have constant length. As with the filter, a choice of cover can be imported as follows: 
from gtda.mapper.cover import CoverName
- For each interval \(U_i \in {\cal U}\) cluster the points in the preimage \(f^{-1}(U_i)\) into sets \(C_{i,1}, \ldots , C_{i,k_i}\). The choice of clustering algorithm can be any of - scikit-learn’s clustering methods or an implementation of agglomerative clustering in- giotto-tda:
# scikit-learn method
from sklearn.cluster import ClusteringAlgorithm
# giotto-tda method
from gtda.mapper.cluster import FirstSimpleGap
- Construct the topological graph whose vertices are the cluster sets \((C_{i,j})_{i\in I, j \in \{1,\ldots,k_i\}}\) and an edge exists between two nodes if they share points in common: \(C_{i,j} \cap C_{k,l} \neq \emptyset\). This step is handled automatically by - giotto-tda.
These four steps are implemented in the MapperPipeline object that
mimics the Pipeline class from scikit-learn. We provide a
convenience function make_mapper_pipeline that allows you to pass
the choice of filter function, cover, and clustering algorithm as
arguments. For example, to project our data onto the \(x\)- and
\(y\)-axes, we could setup the pipeline as follows:
# Define filter function – can be any scikit-learn transformer
filter_func = Projection(columns=[0, 1])
# Define cover
cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
# Choose clustering algorithm – default is DBSCAN
clusterer = DBSCAN()
# Configure parallelism of clustering step
n_jobs = 1
# Initialise pipeline
pipe = make_mapper_pipeline(
    filter_func=filter_func,
    cover=cover,
    clusterer=clusterer,
    verbose=False,
    n_jobs=n_jobs,
)
Visualise the Mapper graph¶
With the Mapper pipeline at hand, it is now a simple matter to visualise
it. To warm up, let’s examine the graph in two-dimensions using the
default arguments of giotto-tda’s plotting function:
fig = plot_static_mapper_graph(pipe, data)
fig.show(config={'scrollZoom': True})
From the figure we can see that we have captured the salient topological features of our underlying data, namely two holes!
Configure the coloring of the Mapper graph¶
By default, the nodes of the Mapper graph are colored by the mean value
of the points that belong to a given node. However, in this example it
is more instructive to colour by the \(x\)- and \(y\)-axes. This
can be achieved by toggling the color_by_columns_dropdown, which
calculates the coloring for each column in the input data array. At the
same time, let’s configure the choice of colorscale:
plotly_params = {"node_trace": {"marker_colorscale": "Blues"}}
fig = plot_static_mapper_graph(
    pipe, data, color_by_columns_dropdown=True, plotly_params=plotly_params
)
fig.show(config={'scrollZoom': True})
In the dropdown menu, the entry color_variable refers to a
user-defined quantity to color by – by default it is the average value
of the points in each node. In general, one can configure this quantity
to be an array, a scikit-learn transformer, or a list of indices to
select from the data. For example, coloring by a PCA component can be
implemented as follows:
# Initialise estimator to color graph by
pca = PCA(n_components=1).fit(data)
fig = plot_static_mapper_graph(
    pipe, data, color_by_columns_dropdown=True, color_variable=pca
)
fig.show(config={'scrollZoom': True})
Pass a pandas DataFrame as input¶
It is also possible to feed plot_static_mapper_graph a pandas
DataFrame:
data_df = pd.DataFrame(data, columns=["x", "y"])
data_df.head()
| x | y | |
|---|---|---|
| 0 | -0.711917 | -0.546609 | 
| 1 | 0.306951 | -0.007028 | 
| 2 | 0.288193 | 0.123284 | 
| 3 | -0.892223 | 0.502352 | 
| 4 | -0.143615 | 0.938935 | 
Before plotting we need to update the Mapper pipeline to know about the
projection onto the column names. This can be achieved using the
set_params method as follows:
pipe.set_params(filter_func=Projection(columns=["x", "y"]));
MapperPipeline(steps=[('pullback_cover',
                       ListFeatureUnion(transformer_list=[('clustering_preprocessing',
                                                           FunctionTransformer(validate=True)),
                                                          ('map_and_cover',
                                                           Pipeline(steps=[('scaler',
                                                                            FunctionTransformer()),
                                                                           ('filter_func',
                                                                            Projection(columns=['x',
                                                                                                'y'])),
                                                                           ('cover',
                                                                            CubicalCover(overlap_frac=0.3))]))])),
                      ('clustering',
                       ParallelClustering(clusterer=DBSCAN(), n_jobs=1)),
                      ('nerve', Nerve())])
fig = plot_static_mapper_graph(pipe, data_df, color_by_columns_dropdown=True)
fig.show(config={'scrollZoom': True})
Change the layout algorithm¶
By default, plot_static_mapper_graph uses the Kamada–Kawai algorithm
for the layout; however any of the layout algorithms defined in
python-igraph are supported (see
here for a
list of possible layouts). For example, we can switch to the
Fruchterman–Reingold layout as follows:
# Reset back to numpy projection
pipe.set_params(filter_func=Projection(columns=[0, 1]));
MapperPipeline(steps=[('pullback_cover',
                       ListFeatureUnion(transformer_list=[('clustering_preprocessing',
                                                           FunctionTransformer(validate=True)),
                                                          ('map_and_cover',
                                                           Pipeline(steps=[('scaler',
                                                                            FunctionTransformer()),
                                                                           ('filter_func',
                                                                            Projection(columns=[0,
                                                                                                1])),
                                                                           ('cover',
                                                                            CubicalCover(overlap_frac=0.3))]))])),
                      ('clustering',
                       ParallelClustering(clusterer=DBSCAN(), n_jobs=1)),
                      ('nerve', Nerve())])
fig = plot_static_mapper_graph(
    pipe, data, layout="fruchterman_reingold", color_by_columns_dropdown=True
)
fig.show(config={'scrollZoom': True})
Change the layout dimension¶
It is also possible to visualise the Mapper graph in 3 dimensions by
configuring the layout_dim argument:
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, color_by_columns_dropdown=True)
fig.show(config={'scrollZoom': True})
Change the node size scale¶
In general, node sizes are proportional to the number of dataset
elements contained in the nodes. Sometimes, however, the default scale
leads to graphs which are difficult to decipher, due to e.g. excessively
small nodes. The node_scale parameter can be used to configure this
scale.
node_scale = 30
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, node_scale=node_scale)
fig.show(config={'scrollZoom': True})
Run the Mapper pipeline¶
Behind the scenes of plot_static_mapper_graph is a
MapperPipeline object pipe that can be used like a typical
scikit-learn estimator. For example, to extract the underlying graph
data structure we can do the following:
graph = pipe.fit_transform(data)
The resulting graph is a python-igraph object which stores node metadata in the form of attributes. We can access this data as follows:
graph.vs.attributes()
['pullback_set_label', 'partial_cluster_label', 'node_elements']
Here 'pullback_set_label' and 'partial_cluster_label' refer to
the interval and cluster sets described above. 'node_elements'
refers to the indices of our original data that belong to each node. For
example, to find which points belong to the first node of the graph we
can access the desired data as follows:
node_id = 0
node_elements = graph.vs["node_elements"]
print(f"""
Node ID: {node_id}
Node elements: {node_elements[node_id]}
Data points: {data[node_elements[node_id]]}
""")
Node ID: 0
Node elements: [1675 1944 2425 4464]
Data points: [[-0.85719291 -0.74172245]
 [-0.88187702 -0.69106254]
 [-0.8881879  -0.63188784]
 [-0.84665374 -0.73453227]]
Creating custom filter functions¶
In some cases, the list of filter functions provided in
gtda.mapper.filter.py or scikit-learn may not be sufficient for
the task at hand. In such cases, one can pass any callable to the
pipeline that acts row-wise on the input data. For example, we can
project by taking the sum of the \((x,y)\) coordinates as follows:
filter_func = np.sum
pipe = make_mapper_pipeline(
    filter_func=filter_func,
    cover=cover,
    clusterer=clusterer,
    verbose=True,
    n_jobs=n_jobs,
)
fig = plot_static_mapper_graph(pipe, data)
fig.show(config={'scrollZoom': True})
[Pipeline] ............ (step 1 of 3) Processing scaler, total=   0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total=   0.0s
[Pipeline] ............. (step 3 of 3) Processing cover, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total=   0.0s
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   0.1s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.0s
Visualise the 2D Mapper graph interactively (Live Jupyter session needed)¶
In general, building useful Mapper graphs requires some iteration
through the various parameters in the cover and clustering algorithm. To
simplify that process, giotto-tda provides an interactive figure
that can be configured in real time.
If invalid parameters are selected, the Show logs checkbox can be used to see what went wrong.
To see the interactive output, please download the notebook from GitHub and execute it locally.
pipe = make_mapper_pipeline()
# Generate interactive widget
plot_interactive_mapper_graph(pipe, data, color_by_columns_dropdown=True)
VBox(children=(HBox(children=(VBox(children=(HTML(value='<b>Cover parameters</b>'), Text(value='uniform', cont…