make_mapper_pipeline

gtda.mapper.pipeline.make_mapper_pipeline(scaler=None, filter_func=None, cover=None, clustering_preprocessing=None, clusterer=None, n_jobs=None, parallel_backend_prefer='threads', graph_step=True, min_intersection=1, memory=None, verbose=False)[source]

Construct a MapperPipeline object according to the specified Mapper steps [1].

The role of this function’s main parameters is illustrated in the diagram in the online documentation. All computational steps may be scikit-learn estimators, including Pipeline objects.

Parameters
  • scaler (object or None, optional, default: None) – If None, no scaling is performed. Otherwise, it must be an object with a fit_transform method.

  • filter_func (object, callable or None, optional, default: None) – If None, PCA (sklearn.decomposition.PCA) with 2 components and otherwise default parameters is used as the filter function. Otherwise, it may be an object with a fit_transform method, or a callable acting on one-dimensional arrays, in which case the callable is applied independently to each row of the (scaled) data.

  • cover (object or None, optional, default: None) – Covering transformer, e.g. an instance of OneDimensionalCover or of CubicalCover. None is equivalent to passing an instance of CubicalCover with its default parameters.

  • clustering_preprocessing (object or None, optional, default: None) – If not None, it is a transformer which is applied to the data independently of the scaler -> filter_func -> cover pipeline. Clustering is then performed on portions (determined by the scaler -> filter_func -> cover pipeline) of the transformed data; see the Examples section below for a sketch.

  • clusterer (object or None, optional, default: None) – Clustering object with a fit method which stores cluster labels. None is equivalent to passing an instance of sklearn.cluster.DBSCAN with its default parameters.

  • n_jobs (int or None, optional, default: None) – The number of jobs to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with parallel_backend_prefer. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • parallel_backend_prefer ('processes' | 'threads', optional, default: 'threads') – Soft hint for the default joblib backend to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with n_jobs. The default process-based backend is ‘loky’ and the default thread-based backend is ‘threading’. See [2].

  • graph_step (bool, optional, default: True) – Whether the resulting pipeline should stop at the calculation of the Mapper cover, or include the construction of the Mapper graph.

  • min_intersection (int, optional, default: 1) – Minimum size of the intersection between clusters required for creating an edge in the Mapper graph. Ignored if graph_step is set to False.

  • memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.

  • verbose (bool, optional, default: False) – If True, the time elapsed while fitting each step will be printed as it is completed.

Returns

mapper_pipeline – Output Mapper pipeline.

Return type

MapperPipeline object

Examples

>>> # Example of basic usage with default parameters
>>> import numpy as np
>>> from gtda.mapper import make_mapper_pipeline
>>> mapper = make_mapper_pipeline()
>>> print(mapper.__class__)
<class 'gtda.mapper.pipeline.MapperPipeline'>
>>> mapper_params = mapper.get_mapper_params()
>>> print(mapper_params['filter_func'].__class__)
<class 'sklearn.decomposition._pca.PCA'>
>>> print(mapper_params['cover'].__class__)
<class 'gtda.mapper.cover.CubicalCover'>
>>> print(mapper_params['clusterer'].__class__)
<class 'sklearn.cluster._dbscan.DBSCAN'>
>>> X = np.random.random((10000, 4))  # 10000 points in 4-dimensional space
>>> mapper_graph = mapper.fit_transform(X)  # Create the mapper graph
>>> print(type(mapper_graph))
<class 'igraph.Graph'>
>>> # Node metadata stored as dict in graph object
>>> print(mapper_graph['node_metadata'].keys())
dict_keys(['node_id', 'pullback_set_label', 'partial_cluster_label',
           'node_elements'])
>>> # Find which points belong to first node of graph
>>> node_id, node_elements = (mapper_graph['node_metadata']['node_id'],
...                           mapper_graph['node_metadata']['node_elements'])
>>> print(f'Node Id: {node_id[0]}, Node elements: {node_elements[0]}, '
...       f'Data points: {X[node_elements[0]]}')
Node Id: 0, Node elements: [8768], Data points: [[0.01838998 0.76928754 0.98199244 0.0074299 ]]
>>> #######################################################################
>>> # Example using a scaler from scikit-learn, a filter function from
>>> # gtda.mapper.filter, and a clusterer from gtda.mapper.cluster
>>> from sklearn.preprocessing import MinMaxScaler
>>> from gtda.mapper import Projection, FirstHistogramGap
>>> scaler = MinMaxScaler()
>>> filter_func = Projection(columns=[0, 1])
>>> clusterer = FirstHistogramGap()
>>> mapper = make_mapper_pipeline(scaler=scaler,
...                               filter_func=filter_func,
...                               clusterer=clusterer)
>>> #######################################################################
>>> # Example using a callable acting on each row of X separately
>>> import numpy as np
>>> from gtda.mapper import OneDimensionalCover
>>> cover = OneDimensionalCover()
>>> mapper.set_params(scaler=None, filter_func=np.sum, cover=cover)
>>> #######################################################################
>>> # Example setting the memory parameter to cache each step and avoid
>>> # recomputation of early steps
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> cachedir = mkdtemp()
>>> mapper.set_params(memory=cachedir, verbose=True)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total=   0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total=   0.0s
[Pipeline] ............. (step 3 of 3) Processing cover, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total=   0.0s
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   0.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.0s
>>> mapper.set_params(min_intersection=3)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.0s
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
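>>> #######################################################################
>>> # Illustrative sketch of the clustering_preprocessing parameter:
>>> # clustering is performed on standardized data instead of the raw
>>> # input. StandardScaler is an arbitrary choice here -- any transformer
>>> # with a fit_transform method would do.
>>> from sklearn.preprocessing import StandardScaler
>>> mapper = make_mapper_pipeline(clustering_preprocessing=StandardScaler())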
>>> #######################################################################
>>> # Example using a large dataset for which parallelism in
>>> # clustering across the pullback cover sets can be beneficial
>>> from sklearn.cluster import DBSCAN
>>> mapper = make_mapper_pipeline(clusterer=DBSCAN(),
...                               n_jobs=6,
...                               memory=mkdtemp(),
...                               verbose=True)
>>> X = np.random.random((100000, 4))
>>> mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total=   0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total=   0.1s
[Pipeline] ............. (step 3 of 3) Processing cover, total=   0.6s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total=   0.7s
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   1.9s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.3s
>>> mapper.set_params(n_jobs=1)
>>> mapper.fit_transform(X)
[Pipeline] ........ (step 2 of 3) Processing clustering, total=   5.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total=   0.3s
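>>> #######################################################################
>>> # Illustrative sketches (constructions only) of the graph_step and
>>> # parallel_backend_prefer parameters; the values chosen are arbitrary.
>>> # Stop at the computation of the Mapper cover, without building a graph
>>> mapper = make_mapper_pipeline(graph_step=False)
>>> # Prefer a process-based joblib backend for the parallel clustering step
>>> mapper = make_mapper_pipeline(n_jobs=-1,
...                               parallel_backend_prefer='processes')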

References

[1] G. Singh, F. Mémoli, and G. Carlsson, “Topological methods for the analysis of high dimensional data sets and 3D object recognition”; in SPBG, pp. 91–100, 2007.

[2] “Thread-based parallelism vs process-based parallelism”, in the joblib documentation.