make_mapper_pipeline

gtda.mapper.pipeline.make_mapper_pipeline(scaler=None, filter_func=None, cover=None, clustering_preprocessing=None, clusterer=None, n_jobs=None, parallel_backend_prefer='threads', graph_step=True, min_intersection=1, memory=None, verbose=False)

Construct a MapperPipeline object according to the specified Mapper steps [1].
The role of this function’s main parameters is illustrated in a diagram in the library’s documentation. All computational steps may be scikit-learn estimators, including Pipeline objects.
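For instance, a minimal sketch (not taken from the library’s documentation; the choice of StandardScaler followed by PCA is purely illustrative) of a scikit-learn Pipeline used as the filter function, which is accepted because it exposes a fit_transform method:

>>> # Sketch: a scikit-learn Pipeline serving as the filter_func step
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from gtda.mapper import make_mapper_pipeline
>>> filter_func = make_pipeline(StandardScaler(), PCA(n_components=2))
>>> mapper = make_mapper_pipeline(filter_func=filter_func)
>>> mapper_graph = mapper.fit_transform(np.random.random((1000, 4)))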
- Parameters
  - scaler (object or None, optional, default: None) – If None, no scaling is performed. Otherwise, it must be an object with a fit_transform method.
  - filter_func (object, callable or None, optional, default: None) – If None, PCA (sklearn.decomposition.PCA) with 2 components and default parameters is used as the default filter function. Otherwise, it may be an object with a fit_transform method, or a callable acting on one-dimensional arrays – in which case the callable is applied independently to each row of the (scaled) data.
  - cover (object or None, optional, default: None) – Covering transformer, e.g. an instance of OneDimensionalCover or of CubicalCover. None is equivalent to passing an instance of CubicalCover with its default parameters.
  - clustering_preprocessing (object or None, optional, default: None) – If not None, it is a transformer which is applied to the data independently of the scaler -> filter_func -> cover pipeline. Clustering is then performed on portions (determined by the scaler -> filter_func -> cover pipeline) of the transformed data (see the sketch following the Examples below).
  - clusterer (object or None, optional, default: None) – Clustering object with a fit method which stores cluster labels. None is equivalent to passing an instance of sklearn.cluster.DBSCAN with its default parameters.
  - n_jobs (int or None, optional, default: None) – The number of jobs to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with parallel_backend_prefer. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
  - parallel_backend_prefer ('processes' | 'threads', optional, default: 'threads') – Soft hint for the default joblib backend to use in a joblib-parallel application of the clustering step across pullback cover sets. To be used in conjunction with n_jobs. The default process-based backend is 'loky' and the default thread-based backend is 'threading'. See [2].
  - graph_step (bool, optional, default: True) – Whether the resulting pipeline should stop at the calculation of the Mapper cover, or include the construction of the Mapper graph.
  - min_intersection (int, optional, default: 1) – Minimum size of the intersection between clusters required for creating an edge in the Mapper graph. Ignored if graph_step is set to False.
  - memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
  - verbose (bool, optional, default: False) – If True, the time elapsed while fitting each step will be printed as it is completed.
- Returns
mapper_pipeline – Output Mapper pipeline.
- Return type
MapperPipeline object
Examples
>>> # Example of basic usage with default parameters
>>> import numpy as np
>>> from gtda.mapper import make_mapper_pipeline
>>> mapper = make_mapper_pipeline()
>>> print(mapper.__class__)
<class 'gtda.mapper.pipeline.MapperPipeline'>
>>> mapper_params = mapper.get_mapper_params()
>>> print(mapper_params['filter_func'].__class__)
<class 'sklearn.decomposition._pca.PCA'>
>>> print(mapper_params['cover'].__class__)
<class 'gtda.mapper.cover.CubicalCover'>
>>> print(mapper_params['clusterer'].__class__)
<class 'sklearn.cluster._dbscan.DBSCAN'>
>>> X = np.random.random((10000, 4))  # 10000 points in 4-dimensional space
>>> mapper_graph = mapper.fit_transform(X)  # Create the mapper graph
>>> print(type(mapper_graph))
<class 'igraph.Graph'>
>>> # Node metadata stored as dict in graph object
>>> print(mapper_graph['node_metadata'].keys())
dict_keys(['node_id', 'pullback_set_label', 'partial_cluster_label', 'node_elements'])
>>> # Find which points belong to first node of graph
>>> node_id, node_elements = mapper_graph['node_metadata']['node_id'], \
...     mapper_graph['node_metadata']['node_elements']
>>> print(f'Node Id: {node_id[0]}, Node elements: {node_elements[0]}, '
...       f'Data points: {X[node_elements[0]]}')
Node Id: 0, Node elements: [8768], Data points: [[0.01838998 0.76928754 0.98199244 0.0074299 ]]
>>> #######################################################################
>>> # Example using a scaler from scikit-learn, a filter function from
>>> # gtda.mapper.filter, and a clusterer from gtda.mapper.cluster
>>> from sklearn.preprocessing import MinMaxScaler
>>> from gtda.mapper import Projection, FirstHistogramGap
>>> scaler = MinMaxScaler()
>>> filter_func = Projection(columns=[0, 1])
>>> clusterer = FirstHistogramGap()
>>> mapper = make_mapper_pipeline(scaler=scaler,
...                               filter_func=filter_func,
...                               clusterer=clusterer)
>>> #######################################################################
>>> # Example using a callable acting on each row of X separately
>>> import numpy as np
>>> from gtda.mapper import OneDimensionalCover
>>> cover = OneDimensionalCover()
>>> mapper.set_params(scaler=None, filter_func=np.sum, cover=cover)
>>> #######################################################################
>>> # Example setting the memory parameter to cache each step and avoid
>>> # recomputation of early steps
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> cachedir = mkdtemp()
>>> mapper.set_params(memory=cachedir, verbose=True)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total= 0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total= 0.0s
[Pipeline] ............. (step 3 of 3) Processing cover, total= 0.0s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total= 0.0s
[Pipeline] ........ (step 2 of 3) Processing clustering, total= 0.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total= 0.0s
>>> mapper.set_params(min_intersection=3)
>>> mapper_graph = mapper.fit_transform(X)
[Pipeline] ............. (step 3 of 3) Processing nerve, total= 0.0s
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
>>> #######################################################################
>>> # Example using a large dataset for which parallelism in
>>> # clustering across the pullback cover sets can be beneficial
>>> from sklearn.cluster import DBSCAN
>>> mapper = make_mapper_pipeline(clusterer=DBSCAN(),
...                               n_jobs=6,
...                               memory=mkdtemp(),
...                               verbose=True)
>>> X = np.random.random((100000, 4))
>>> mapper.fit_transform(X)
[Pipeline] ............ (step 1 of 3) Processing scaler, total= 0.0s
[Pipeline] ....... (step 2 of 3) Processing filter_func, total= 0.1s
[Pipeline] ............. (step 3 of 3) Processing cover, total= 0.6s
[Pipeline] .... (step 1 of 3) Processing pullback_cover, total= 0.7s
[Pipeline] ........ (step 2 of 3) Processing clustering, total= 1.9s
[Pipeline] ............. (step 3 of 3) Processing nerve, total= 0.3s
>>> mapper.set_params(n_jobs=1)
>>> mapper.fit_transform(X)
[Pipeline] ........ (step 2 of 3) Processing clustering, total= 5.3s
[Pipeline] ............. (step 3 of 3) Processing nerve, total= 0.3s
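The following sketch is not part of the library’s documented examples; it illustrates the clustering_preprocessing and graph_step parameters described above, under the assumption that PCA(n_components=3) is merely an arbitrary choice of preprocessing and that X is the array defined earlier.

>>> # Sketch: cluster in a space given by an independent transformer, and
>>> # stop before the graph-construction step
>>> from sklearn.decomposition import PCA
>>> mapper = make_mapper_pipeline(clustering_preprocessing=PCA(n_components=3))
>>> mapper_graph = mapper.fit_transform(X)  # clustering acts on the PCA-transformed data
>>> # With graph_step=False the pipeline stops at the clustered Mapper cover,
>>> # per the graph_step description, and min_intersection is ignored
>>> mapper_cover = make_mapper_pipeline(graph_step=False)
>>> cover_output = mapper_cover.fit_transform(X)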
References
- [1] G. Singh, F. Mémoli, and G. Carlsson, “Topological methods for the analysis of high dimensional data sets and 3D object recognition”; in SPBG, pp. 91–100, 2007.
- [2] “Thread-based parallelism vs process-based parallelism”, in the joblib documentation.