make_mapper_pipeline

gtda.mapper.pipeline.make_mapper_pipeline(scaler=None, filter_func=None, cover=None, clustering_preprocessing=None, clusterer=None, n_jobs=None, parallel_backend_prefer='threads', graph_step=True, min_intersection=1, memory=None, verbose=False)

Construct a MapperPipeline object according to the specified Mapper steps [1].

The role of this function’s main parameters is illustrated in the diagram in the online giotto-tda documentation. All computational steps may be scikit-learn estimators, including Pipeline objects.

Parameters
  • scaler (object or None, optional, default: None) – If None, no scaling is performed. Otherwise, it must be an object with a fit_transform method.

  • filter_func (object, callable or None, optional, default: None) – If None, PCA (sklearn.decomposition.PCA) with 2 components and default parameters is used as the default filter function. Otherwise, it may be an object with a fit_transform method, or a callable acting on one-dimensional arrays – in which case the callable is applied independently to each row of the (scaled) data.

  • cover (object or None, optional, default: None) – Covering transformer, e.g. an instance of OneDimensionalCover or of CubicalCover. None is equivalent to passing an instance of CubicalCover with its default parameters.

  • clustering_preprocessing (object or None, optional, default: None) – If not None, it is a transformer which is applied to the data independently of the scaler -> filter_func -> cover pipeline. Clustering is then performed on portions (determined by the scaler -> filter_func -> cover pipeline) of the transformed data (see the clustering_preprocessing example below).

  • clusterer (object or None, optional, default: None) – Clustering object with a fit method which stores cluster labels. None is equivalent to passing an instance of sklearn.cluster.DBSCAN with its default parameters.

  • n_jobs (int or None, optional, default: None) – The number of jobs to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with parallel_backend_prefer. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • parallel_backend_prefer ('processes' | 'threads', optional, default: 'threads') – Soft hint for the default joblib backend to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with n_jobs. The default process-based backend is ‘loky’ and the default thread-based backend is ‘threading’. See [2].

  • graph_step (bool, optional, default: True) – Whether the resulting pipeline should stop at the calculation of the Mapper cover, or include the construction of the Mapper graph (see the graph_step example below).

  • min_intersection (int, optional, default: 1) – Minimum size of the intersection between clusters required for creating an edge in the Mapper graph. Ignored if graph_step is set to False.

  • memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the fitted transformers which make up the pipeline. This is advantageous when the fitting of early steps is time consuming and only later steps in the pipeline are modified (e.g. using set_params) before refitting on the same data. To be used exactly as for sklearn.pipeline.make_pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. See [3].

  • verbose (bool, optional, default: False) – If True, the time elapsed while fitting each step will be printed as it is completed.

Returns

mapper_pipeline – Output Mapper pipeline.

Return type

MapperPipeline object

Examples

Basic usage with default parameters

>>> import numpy as np
>>> from gtda.mapper import make_mapper_pipeline
>>> mapper = make_mapper_pipeline()
>>> print(mapper.__class__)
<class 'gtda.mapper.pipeline.MapperPipeline'>
>>> mapper_params = mapper.get_mapper_params()
>>> print(mapper_params['filter_func'].__class__)
<class 'sklearn.decomposition._pca.PCA'>
>>> print(mapper_params['cover'].__class__)
<class 'gtda.mapper.cover.CubicalCover'>
>>> print(mapper_params['clusterer'].__class__)
<class 'sklearn.cluster._dbscan.DBSCAN'>
>>> X = np.random.random((10000, 4))  # 10000 points in 4-dimensional space
>>> mapper_graph = mapper.fit_transform(X)  # Create the mapper graph
>>> print(type(mapper_graph))
<class 'igraph.Graph'>
>>> # Node metadata stored as dict in graph object
>>> print(mapper_graph['node_metadata'].keys())
dict_keys(['node_id', 'pullback_set_label', 'partial_cluster_label',
           'node_elements'])
>>> # Find which points belong to first node of graph
>>> node_id = mapper_graph['node_metadata']['node_id']
>>> node_elements = mapper_graph['node_metadata']['node_elements']
>>> print(f"Node ID: {node_id[0]}, Node elements: {node_elements[0]}, "
...       f"Data points: {X[node_elements[0]]}")
Node ID: 0, Node elements: [8768], Data points: [[0.01838998 0.76928754 0.98199244 0.0074299 ]]
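
Stopping the pipeline at the Mapper cover instead of constructing the graph, by passing graph_step=False (a minimal sketch; the variable names are illustrative, and the output of fit_transform is then the refined cover rather than an igraph.Graph)

>>> mapper_cover_only = make_mapper_pipeline(graph_step=False)
>>> refined_cover = mapper_cover_only.fit_transform(X)  # refined Mapper cover, not a graph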

Using a scaler from scikit-learn, a filter function from gtda.mapper.filter, and a clusterer from gtda.mapper.cluster

>>> from sklearn.preprocessing import MinMaxScaler
>>> from gtda.mapper import Projection, FirstHistogramGap
>>> scaler = MinMaxScaler()
>>> filter_func = Projection(columns=[0, 1])
>>> clusterer = FirstHistogramGap()
>>> mapper = make_mapper_pipeline(scaler=scaler,
...                               filter_func=filter_func,
...                               clusterer=clusterer)

Using a callable acting on each row of X separately

>>> import numpy as np
>>> from gtda.mapper import OneDimensionalCover
>>> cover = OneDimensionalCover()
>>> mapper.set_params(scaler=None, filter_func=np.sum, cover=cover)

Setting the memory parameter to cache each step and avoid recomputation of early steps

>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> cachedir = mkdtemp()
>>> mapper.set_params(memory=cachedir, verbose=True)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total=   0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total=   0.0s
[Pipeline] ............. (step 3 of 3) Processing cover, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total=   0.0s
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   0.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.0s
>>> mapper.set_params(min_intersection=3)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.0s
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
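
Clustering on a separately transformed copy of the data via clustering_preprocessing (a sketch; PCA with 2 components is an illustrative choice, not a recommendation)

>>> from sklearn.decomposition import PCA
>>> mapper = make_mapper_pipeline(clustering_preprocessing=PCA(n_components=2))
>>> mapper_graph = mapper.fit_transform(X)  # clustering acts on the PCA-transformed data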

Using a large dataset for which parallelism in clustering across the pullback cover sets can be beneficial

>>> from sklearn.cluster import DBSCAN
>>> mapper = make_mapper_pipeline(clusterer=DBSCAN(),
...                               n_jobs=6,
...                               memory=mkdtemp(),
...                               verbose=True)
>>> X = np.random.random((100000, 4))
>>> mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total=   0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total=   0.1s
[Pipeline] ............. (step 3 of 3) Processing cover, total=   0.6s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total=   0.7s
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   1.9s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.3s
>>> mapper.set_params(n_jobs=1)
>>> mapper.fit_transform(X)
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   5.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.3s
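
Requesting process-based rather than thread-based parallelism in the clustering step via parallel_backend_prefer (a sketch; the values 'processes' and n_jobs=-1 are illustrative, and the actual benefit depends on the clusterer and the data)

>>> mapper = make_mapper_pipeline(n_jobs=-1, parallel_backend_prefer='processes')
>>> mapper_graph = mapper.fit_transform(X)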

References

[1] G. Singh, F. Mémoli, and G. Carlsson, “Topological methods for the analysis of high dimensional data sets and 3D object recognition”; in SPBG, pp. 91–100, 2007.

[2] “Thread-based parallelism vs process-based parallelism”, in the joblib documentation.

[3] “Caching transformers: avoid repeated computation”, in the scikit-learn documentation.