Getting started with Mapper¶
In this notebook we explore a few of the core features included in
giotto-tda
’s implementation of the Mapper
algorithm.
If you are looking at a static version of this notebook and would like to run its contents, head over to GitHub and download the source.
Useful references¶
An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists
An Introduction to Topological Data Analysis for Physicists: From LGM to FRBs
License: AGPLv3
Import libraries¶
# Data wrangling
import numpy as np
import pandas as pd # Not a requirement of giotto-tda, but is compatible with the gtda.mapper module
# Data viz
from gtda.plotting import plot_point_cloud
# TDA magic
from gtda.mapper import (
CubicalCover,
make_mapper_pipeline,
Projection,
plot_static_mapper_graph,
plot_interactive_mapper_graph,
MapperInteractivePlotter
)
# ML tools
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
Generate and visualise data¶
As a simple example, let’s generate a two-dimensional point cloud of two concentric circles. The goal will be to examine how Mapper can be used to generate a topological graph that captures the salient features of the data.
data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)
plot_point_cloud(data)
Configure the Mapper pipeline¶
Given a dataset \({\cal D}\) of points \(x \in \mathbb{R}^n\), the basic steps behind Mapper are as follows:
Map \({\cal D}\) to a lower-dimensional space using a filter function \(f: \mathbb{R}^n \to \mathbb{R}^m\). Common choices for the filter function include projection onto one or more axes via PCA or density-based methods. In
giotto-tda
, you can import a variety of filter functions as follows:
from gtda.mapper.filter import FilterFunctionName
Construct a cover of the filter values \({\cal U} = (U_i)_{i\in I}\), typically in the form of a set of overlapping intervals which have constant length. As with the filter, a choice of cover can be imported as follows:
from gtda.mapper.cover import CoverName
For each interval \(U_i \in {\cal U}\) cluster the points in the preimage \(f^{-1}(U_i)\) into sets \(C_{i,1}, \ldots , C_{i,k_i}\). The choice of clustering algorithm can be any of
scikit-learn
’s clustering methods or an implementation of agglomerative clustering ingiotto-tda
:
# scikit-learn method
from sklearn.cluster import ClusteringAlgorithm
# giotto-tda method
from gtda.mapper.cluster import FirstSimpleGap
Construct the topological graph whose vertices are the cluster sets \((C_{i,j})_{i\in I, j \in \{1,\ldots,k_i\}}\) and an edge exists between two nodes if they share points in common: \(C_{i,j} \cap C_{k,l} \neq \emptyset\). This step is handled automatically by
giotto-tda
.
These four steps are implemented in the MapperPipeline
object that
mimics the Pipeline
class from scikit-learn
. We provide a
convenience function make_mapper_pipeline
that allows you to pass
the choice of filter function, cover, and clustering algorithm as
arguments. For example, to project our data onto the \(x\)- and
\(y\)-axes, we could setup the pipeline as follows:
# Define filter function – can be any scikit-learn transformer
filter_func = Projection(columns=[0, 1])
# Define cover
cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
# Choose clustering algorithm – default is DBSCAN
clusterer = DBSCAN()
# Configure parallelism of clustering step
n_jobs = 1
# Initialise pipeline
pipe = make_mapper_pipeline(
filter_func=filter_func,
cover=cover,
clusterer=clusterer,
verbose=False,
n_jobs=n_jobs,
)
Visualise the Mapper graph¶
With the Mapper pipeline at hand, it is now a simple matter to visualise
it. To warm up, let’s examine the graph in two-dimensions using the
default arguments of giotto-tda
’s plotting function:
fig = plot_static_mapper_graph(pipe, data)
fig.show(config={'scrollZoom': True})
By default, nodes are coloured according to the average row index of the data points they represent.
From the figure we can see that we have captured the salient topological features of our underlying data, namely two holes!
Configure the colouring of the Mapper graph¶
In this example, it is more instructive to colour by the average values
of the \(x\)- and \(y\)-coordinates. This can be achieved by
passing the input data again as the keyword argument color_data
. In
general, any numpy
array or pandas
dataframe explicitly passed
as color_data
will be used to calculate one colouring per column
present. A dropdown menu is automatically created if color_data
has
more than one column, to easily switch between column-based colourings.
At the same time, let’s configure the choice of colorscale:
plotly_params = {"node_trace": {"marker_colorscale": "Blues"}}
fig = plot_static_mapper_graph(
pipe, data, color_data=data, plotly_params=plotly_params
)
fig.show(config={'scrollZoom': True})
Even finer control over the colouring can be achieved by making use of
the additional keyword arguments color_features
and
node_color_statistic
. The former can be a scikit-learn
transformer or a list of indices or column names to select from the
data. For example, coloring by a PCA component can be neatly implemented
as follows:
# Initialise estimator to color graph by
pca = PCA(n_components=1)
fig = plot_static_mapper_graph(
pipe, data, color_data=data, color_features=pca
)
fig.show(config={'scrollZoom': True})
node_color_statistic
refers to the function used to extract single
colour values for each node, starting from the values of
color_features
at each data point. The default, as we have seen, is
np.mean
. But any other callable is acceptable which sends column
vectors to scalars:
fig = plot_static_mapper_graph(
pipe, data, color_data=data, color_features=pca, node_color_statistic=lambda x: np.mean(x) / 2
)
fig.show(config={'scrollZoom': True})
If you prefer to just input custom node colours directly, you can do so
by passing a numpy
array or pandas
dataframe of the correct
length as node_color_statistic
. For example (see “Run the Mapper
pipeline” below), the Mapper nodes for this particular Mapper pipeline
would be as follows:
graph = pipe.fit_transform(data)
node_elements = graph.vs["node_elements"]
print(f"There are {len(node_elements)} nodes.\nThe first node consists of row indices {node_elements[0]}.")
There are 90 nodes.
The first node consists of row indices [ 0 85 100 167 253 264 276 372 385 390 427 462 504 508
542 616 626 627 711 762 801 811 825 856 911 921 945 956
963 1025 1029 1038 1058 1122 1134 1211 1280 1315 1347 1352 1357 1383
1390 1398 1406 1450 1507 1585 1616 1650 1725 1738 1739 1744 1774 1786
1817 1864 1899 1902 1932 1944 1964 1997 2009 2081 2231 2234 2270 2274
2425 2450 2500 2512 2615 2632 2716 2756 2769 2773 2818 2830 2843 2876
2893 2942 2956 2957 3010 3100 3101 3118 3150 3229 3321 3352 3383 3429
3451 3455 3527 3531 3567 3589 3600 3666 3697 3708 3726 3738 3748 3761
3779 3823 3947 3988 4048 4223 4282 4501 4506 4561 4587 4628 4749 4865
4887 4912 4938 4945 4952 4967].
Let us try this:
fig = plot_static_mapper_graph(
pipe, data, node_color_statistic=np.arange(len(node_elements))
)
fig.show(config={'scrollZoom': True})
Pass a pandas DataFrame as input¶
It is also possible to feed plot_static_mapper_graph
a pandas
DataFrame:
df = pd.DataFrame(data, columns=["x", "y"])
df.head()
x | y | |
---|---|---|
0 | -0.711917 | -0.546609 |
1 | 0.306951 | -0.007028 |
2 | 0.288193 | 0.123284 |
3 | -0.892223 | 0.502352 |
4 | -0.143615 | 0.938935 |
Before plotting we need to update the Mapper pipeline to know about the
projection onto the column names. This can be achieved using the
set_params
method as follows:
pipe.set_params(filter_func=Projection(columns=["x", "y"]));
MapperPipeline(steps=[('pullback_cover',
ListFeatureUnion(transformer_list=[('clustering_preprocessing',
FunctionTransformer(validate=True)),
('map_and_cover',
Pipeline(steps=[('scaler',
FunctionTransformer()),
('filter_func',
Projection(columns=['x',
'y'])),
('cover',
CubicalCover(overlap_frac=0.3))]))])),
('clustering',
ParallelClustering(clusterer=DBSCAN(), n_jobs=1)),
('nerve', Nerve())])
fig = plot_static_mapper_graph(pipe, df, color_data=df)
fig.show(config={'scrollZoom': True})
Example: colour by proportions of categorical variables¶
Often one has a dataset of observations, each belonging to a category
(e.g. a country or region name). It can be very useful to visualize the
distributions of each category in the nodes of the Mapper graph. As a
trivial example, let us add a categorical column to our dataframe, with
value equal to "A"
for the outer circle, and "B"
for the inner
one:
df["Circle"] = df["x"] ** 2 + df["y"] ** 2 < 0.25
df["Circle"] = df["Circle"].replace([False, True], ["A", "B"])
To visualize the proportions of data points in each Mapper node
belonging to either circle, we can create a dataframe of one-hot
encodings of the categorical variable "Circle"
, and pass it to
plot_static_mapper_graph
as color_data
:
color_data = pd.get_dummies(df["Circle"], prefix="Circle")
fig = plot_static_mapper_graph(pipe, df[["x", "y"]], color_data=color_data)
fig.show(config={'scrollZoom': True})
The dropdown menu allows us to quickly switch colourings according to each category, without needing to recompute the underlying graph.
Change the layout algorithm¶
By default, plot_static_mapper_graph
uses the Kamada–Kawai algorithm
for the layout; however any of the layout algorithms defined in
python-igraph are supported (see
here for a
list of possible layouts). For example, we can switch to the
Fruchterman–Reingold layout as follows:
# Reset back to numpy projection
pipe.set_params(filter_func=Projection(columns=[0, 1]));
MapperPipeline(steps=[('pullback_cover',
ListFeatureUnion(transformer_list=[('clustering_preprocessing',
FunctionTransformer(validate=True)),
('map_and_cover',
Pipeline(steps=[('scaler',
FunctionTransformer()),
('filter_func',
Projection(columns=[0,
1])),
('cover',
CubicalCover(overlap_frac=0.3))]))])),
('clustering',
ParallelClustering(clusterer=DBSCAN(), n_jobs=1)),
('nerve', Nerve())])
fig = plot_static_mapper_graph(
pipe, data, layout="fruchterman_reingold", color_data=data
)
fig.show(config={'scrollZoom': True})
Change the layout dimension¶
It is also possible to visualise the Mapper graph in 3 dimensions by
configuring the layout_dim
argument:
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, color_data=data)
fig.show(config={'scrollZoom': True})
Change the node size scale¶
In general, node sizes are proportional to the number of dataset
elements contained in the nodes. Sometimes, however, the default scale
leads to graphs which are difficult to decipher, due to e.g. excessively
small nodes. The node_scale
parameter can be used to configure this
scale.
node_scale = 30
fig = plot_static_mapper_graph(pipe, data, layout_dim=3, node_scale=node_scale)
fig.show(config={'scrollZoom': True})
Run the Mapper pipeline¶
Behind the scenes of plot_static_mapper_graph
is a
MapperPipeline
object pipe
that can be used like a typical
scikit-learn
estimator. For example, to extract the underlying graph
data structure we can do the following:
graph = pipe.fit_transform(data)
The resulting graph is a python-igraph object which stores node metadata in the form of attributes. We can access this data as follows:
graph.vs.attributes()
['pullback_set_label', 'partial_cluster_label', 'node_elements']
Here 'pullback_set_label'
and 'partial_cluster_label'
refer to
the interval and cluster sets described above. 'node_elements'
refers to the indices of our original data that belong to each node. For
example, to find which points belong to the first node of the graph we
can access the desired data as follows:
node_id = 0
node_elements = graph.vs["node_elements"]
print(f"""
Node ID: {node_id}
Node elements: {node_elements[node_id]}
Data points: {data[node_elements[node_id]]}
""")
Node ID: 0
Node elements: [ 0 85 100 167 253 264 276 372 385 390 427 462 504 508
542 616 626 627 711 762 801 811 825 856 911 921 945 956
963 1025 1029 1038 1058 1122 1134 1211 1280 1315 1347 1352 1357 1383
1390 1398 1406 1450 1507 1585 1616 1650 1725 1738 1739 1744 1774 1786
1817 1864 1899 1902 1932 1944 1964 1997 2009 2081 2231 2234 2270 2274
2425 2450 2500 2512 2615 2632 2716 2756 2769 2773 2818 2830 2843 2876
2893 2942 2956 2957 3010 3100 3101 3118 3150 3229 3321 3352 3383 3429
3451 3455 3527 3531 3567 3589 3600 3666 3697 3708 3726 3738 3748 3761
3779 3823 3947 3988 4048 4223 4282 4501 4506 4561 4587 4628 4749 4865
4887 4912 4938 4945 4952 4967]
Data points: [[-0.7119167 -0.54660896]
[-0.91976898 -0.43025704]
[-0.73571693 -0.62569565]
[-0.73630218 -0.66957306]
[-0.79134262 -0.57047771]
[-0.84548577 -0.61571079]
[-0.85448295 -0.60924645]
[-0.78585898 -0.67334614]
[-0.75059868 -0.72410451]
[-0.63086658 -0.66158518]
[-0.8793931 -0.45210713]
[-0.89681717 -0.55974007]
[-0.91495418 -0.45362697]
[-0.64268627 -0.62058323]
[-0.71492575 -0.6276173 ]
[-0.87427643 -0.43744983]
[-0.74185898 -0.51963635]
[-0.9155821 -0.46166875]
[-0.82050944 -0.60449751]
[-0.62808579 -0.66000767]
[-0.79501518 -0.53502405]
[-0.80007606 -0.54826085]
[-0.82127487 -0.57208059]
[-0.82126992 -0.59692436]
[-0.83383599 -0.55533627]
[-0.77936643 -0.66418784]
[-0.68137772 -0.67191541]
[-0.82137261 -0.59406356]
[-0.64133206 -0.63149164]
[-0.8123391 -0.63014858]
[-0.68707588 -0.71031759]
[-0.84179048 -0.53209953]
[-0.6977265 -0.6259798 ]
[-0.83586226 -0.48715506]
[-0.75300801 -0.47301243]
[-0.84167859 -0.43589552]
[-0.88751953 -0.46751497]
[-0.87069424 -0.54158073]
[-0.85702681 -0.626 ]
[-0.79042123 -0.63569662]
[-0.87609199 -0.46483997]
[-0.72253729 -0.69784363]
[-0.76463663 -0.55821976]
[-0.74971665 -0.71296053]
[-0.82529518 -0.51891302]
[-0.81612374 -0.65115425]
[-0.80232395 -0.58526501]
[-0.73593052 -0.53508381]
[-0.86462132 -0.56373356]
[-0.86929375 -0.55961294]
[-0.78298321 -0.57210066]
[-0.7872312 -0.56868301]
[-0.73258265 -0.71735012]
[-0.8254331 -0.53157209]
[-0.85181121 -0.52891324]
[-0.87370717 -0.5516612 ]
[-0.71240599 -0.66143786]
[-0.87734726 -0.46852084]
[-0.6527069 -0.69089512]
[-0.80780021 -0.69128276]
[-0.8659319 -0.48910451]
[-0.88187702 -0.69106254]
[-0.74491876 -0.62820799]
[-0.86400856 -0.49989724]
[-0.85938208 -0.52788963]
[-0.68336712 -0.70497211]
[-0.75952652 -0.53473902]
[-0.78422787 -0.60809146]
[-0.92048135 -0.41332241]
[-0.83284535 -0.57647591]
[-0.8881879 -0.63188784]
[-0.86636643 -0.48300648]
[-0.76247244 -0.60433738]
[-0.71068803 -0.70678112]
[-0.71200444 -0.61667848]
[-0.71638204 -0.69011083]
[-0.79823316 -0.48635254]
[-0.77289574 -0.44032077]
[-0.91803077 -0.49548077]
[-0.72544168 -0.57008828]
[-0.67945734 -0.67580757]
[-0.72848698 -0.65290641]
[-0.89024962 -0.52350426]
[-0.85152872 -0.53610581]
[-0.87454306 -0.48924153]
[-0.85185385 -0.50291956]
[-0.86305593 -0.49496938]
[-0.89123057 -0.59910578]
[-0.76766698 -0.58284467]
[-0.8275488 -0.47319 ]
[-0.88838038 -0.54427538]
[-0.72674348 -0.71441004]
[-0.73130649 -0.59098862]
[-0.90095827 -0.44271079]
[-0.72385179 -0.71359489]
[-0.78660712 -0.61703413]
[-0.8871533 -0.52372038]
[-0.82032139 -0.45262681]
[-0.70972921 -0.64363582]
[-0.81668143 -0.57801295]
[-0.83557157 -0.41711814]
[-0.79842186 -0.67499645]
[-0.75584145 -0.69893297]
[-0.61715239 -0.71845733]
[-0.82439243 -0.4373306 ]
[-0.9212387 -0.54977963]
[-0.75484995 -0.47032845]
[-0.78994636 -0.61715555]
[-0.78734765 -0.67127232]
[-0.77472141 -0.64491004]
[-0.73740532 -0.47463457]
[-0.82270782 -0.4652456 ]
[-0.73300033 -0.58335525]
[-0.90371319 -0.42713173]
[-0.77921032 -0.68092808]
[-0.81474095 -0.53970325]
[-0.87536476 -0.58972263]
[-0.62898173 -0.71793583]
[-0.76162876 -0.63240972]
[-0.75568871 -0.65138702]
[-0.70943504 -0.5606257 ]
[-0.82395527 -0.60970858]
[-0.67140111 -0.72273923]
[-0.87834745 -0.48729418]
[-0.76484614 -0.58745599]
[-0.79262396 -0.67690592]
[-0.81175736 -0.49967249]
[-0.79016125 -0.57104991]
[-0.82238559 -0.58930577]
[-0.73359008 -0.68840631]
[-0.87622509 -0.5704099 ]
[-0.76044518 -0.66671203]]
Creating custom filter functions¶
In some cases, the list of filter functions provided in
gtda.mapper.filter.py
or scikit-learn
may not be sufficient for
the task at hand. In such cases, one can pass any callable to the
pipeline that acts row-wise on the input data. For example, we can
project by taking the sum of the \((x,y)\) coordinates as follows:
filter_func = np.sum
pipe = make_mapper_pipeline(
filter_func=filter_func,
cover=cover,
clusterer=clusterer,
verbose=True,
n_jobs=n_jobs,
)
fig = plot_static_mapper_graph(pipe, data)
fig.show(config={'scrollZoom': True})
[Pipeline] ............ (step 1 of 3) Processing scaler, total= 0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total= 0.0s
[Pipeline] ............. (step 3 of 3) Processing cover, total= 0.0s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total= 0.0s
[Pipeline] ........ (step 2 of 3) Processing clustering, total= 0.1s
[Pipeline] ............. (step 3 of 3) Processing nerve, total= 0.0s
Visualise the Mapper graph interactively (Live Jupyter session needed)¶
In general, building useful Mapper graphs requires some iteration
through the various parameters in the cover and clustering algorithm. To
simplify that process, giotto-tda
provides an interactive figure
that can be configured in real time by tweaking the pipeline
hyperparameters. You can produce it in two ways, namely:
by using the
plot_interactive_mapper_graph
function in a similar same way asplot_static_mapper_graph
;
pipe = make_mapper_pipeline()
# Generate interactive widget
plot_interactive_mapper_graph(pipe, data, color_data=data)
VBox(children=(HBox(children=(VBox(children=(HTML(value='<b>Cover parameters</b>'), Text(value='uniform', cont…
(recommended, new in
giotto-tda
0.5.0) in an object-oriented way, by instantiating aMapperInteractivePlotter
onject and then calling itsplot
method to create the widget.
# Create the plotter object
MIP = MapperInteractivePlotter(pipe, data)
# Generate interactive widget
MIP.plot(color_data=data)
VBox(children=(HBox(children=(VBox(children=(HTML(value='<b>Cover parameters</b>'), Text(value='uniform', cont…
The advantage of using the class API with MapperInteractivePlotter
is that, once you are done tweaking the hyperparameters, you can inspect
the latest state of the objects (graph, colours, pipeline, inner figure)
which got changed during the interactive session.
print("Attributes created by `.plot` and updated during the interactive session:\n",
[attr for attr in dir(MIP) if attr.endswith("_") and attr[0] != "_"])
Attributes created by .plot and updated during the interactive session: ['color_features_', 'figure_', 'graph_', 'node_summaries_', 'pipeline_']
In the widgets, if invalid parameters are selected, the Show logs checkbox can be used to see what went wrong.
To see the interactive outputs above, please download the notebook from GitHub and execute it locally.