CollectionTransformer

class gtda.metaestimators.CollectionTransformer(transformer, n_jobs=None, parallel_backend_prefer=None, parallel_backend_require=None)[source]

Meta-transformer for applying a fit-transformer to each input in a collection.

If transformer possesses a fit_transform method, CollectionTransformer(transformer) also possesses a fit_transform method which, on each entry in its input X, fit-transforms a clone of transformer. A collection (list or ndarray) of outputs is returned.

Note: for compatibility with scikit-learn and giotto-tda pipelines, a transform method is also provided, but it is simply an alias for fit_transform.
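The behaviour described above can be sketched with a small stand-in class. This is only an illustration under stated assumptions, not the giotto-tda implementation: TinyCollectionTransformer and CenterColumns are hypothetical names, and copy.deepcopy is used in place of estimator cloning so that the sketch needs only NumPy.

```python
import copy

import numpy as np


class CenterColumns:
    """Hypothetical fit-transformer: subtracts the column means it learns."""

    def fit_transform(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return X - self.mean_


class TinyCollectionTransformer:
    """Illustrative stand-in, not the giotto-tda class."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Nothing is learned at the collection level.
        return self

    def fit_transform(self, X, y=None):
        # Assumption: deepcopy stands in for cloning. Each entry of X is
        # fit-transformed by its own fresh copy of the wrapped transformer.
        return [copy.deepcopy(self.transformer).fit_transform(x) for x in X]

    # transform is an alias for fit_transform, as in the note above.
    transform = fit_transform


rng = np.random.default_rng(0)
X = [rng.random((5, 2)), rng.random((8, 2))]  # entries of different lengths
base = CenterColumns()
Xt = TinyCollectionTransformer(base).transform(X)
```

Because every entry is handled by its own copy, the instance passed to the constructor (base here) is never fitted, and no fitted state is shared between entries.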

Parameters
  • transformer (object) – The fit-transformer instance from which the transformer acting on collections is built. Should implement fit_transform.

  • n_jobs (int or None, optional, default: None) – The number of jobs to use in a joblib-parallel application of transformer’s fit_transform to each input. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • parallel_backend_prefer ("processes" | "threads" | None, optional, default: None) – Soft hint for the default joblib backend to use in a joblib-parallel application of transformer’s fit_transform to each input. See [1].

  • parallel_backend_require ("sharedmem" or None, optional, default: None) – Hard constraint to select the backend. If set to 'sharedmem', the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with parallel_backend.
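The three parallelism parameters share their meaning with the n_jobs, prefer and require arguments of joblib.Parallel. The sketch below calls joblib directly to show what such a per-entry parallel loop looks like. It assumes, without the source confirming it, that CollectionTransformer forwards its arguments in this way; column_means is a hypothetical stand-in for a fit_transform call.

```python
import numpy as np
from joblib import Parallel, delayed


def column_means(x):
    """Hypothetical per-entry work standing in for a fit_transform call."""
    return x.mean(axis=0)


rng = np.random.default_rng(0)
X = rng.random((6, 10, 4))  # a collection of six 10x4 inputs

# Assumption: n_jobs, parallel_backend_prefer and parallel_backend_require
# correspond to joblib.Parallel's n_jobs, prefer and require.
results = Parallel(n_jobs=2, prefer="threads", require="sharedmem")(
    delayed(column_means)(x) for x in X
)
```

joblib returns the results as a list in the same order as the inputs, so results[i] corresponds to X[i].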

Examples

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from gtda.metaestimators import CollectionTransformer
>>> rng = np.random.default_rng()

Create a collection of 1000 2D inputs for PCA, as a single 3D ndarray (we could also create a list of 2D inputs instead).

>>> X = rng.random((1000, 100, 50))

In the case of PCA, joblib parallelism can be very beneficial!

>>> multi_pca = CollectionTransformer(PCA(n_components=3), n_jobs=-1)
>>> Xt = multi_pca.fit_transform(X)

Since all PCA outputs have the same shape, Xt is an ndarray.

>>> print(Xt.shape)
(1000, 100, 3)

See also

gtda.mapper.utils.pipeline.transformer_from_callable_on_rows, gtda.mapper.utils.decorators.method_to_transform

References

[1] “Thread-based parallelism vs process-based parallelism”, in joblib documentation.

__init__(transformer, n_jobs=None, parallel_backend_prefer=None, parallel_backend_require=None)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None)[source]

Do nothing and return the estimator unchanged.

This method is here to implement the usual scikit-learn API and hence work in pipelines.

Parameters
  • X (list of length n_samples, or ndarray of shape (n_samples, ...)) – Collection of inputs to be fit-transformed by transformer.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

Returns

self

Return type

object

fit_transform(X, y=None)[source]

Fit-transform a clone of transformer to each element in the collection X.

Parameters
  • X (list of length n_samples, or ndarray of shape (n_samples, ...)) – Collection of inputs to be fit-transformed by transformer.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

Returns

Xt – Collection of outputs. It is a list unless all outputs have the same shape, in which case it is converted to an ndarray.

Return type

list of length n_samples, or ndarray of shape (n_samples, ...)
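The return-type rule for fit_transform (a list, unless all outputs have the same shape) can be illustrated with a few lines of NumPy. The helper collect_outputs is hypothetical and only mirrors the documented rule; how the library performs the check internally is not stated here.

```python
import numpy as np


def collect_outputs(outputs):
    """Hypothetical helper: stack into an ndarray only if all shapes agree."""
    shapes = {np.shape(out) for out in outputs}
    if len(shapes) == 1:
        # Uniform shapes: add a leading axis of length n_samples.
        return np.stack(outputs)
    # Mixed shapes cannot form a regular array, so keep a plain list.
    return list(outputs)


same = collect_outputs([np.zeros((4, 3)), np.ones((4, 3))])
mixed = collect_outputs([np.zeros((4, 3)), np.ones((7, 3))])
```

Here same is an ndarray of shape (2, 4, 3), while mixed stays a list of two arrays because the entries differ in their first dimension.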

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object
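The <component>__<parameter> convention used by set_params can be seen on a plain scikit-learn pipeline. The step names "scale" and "pca" below are chosen for illustration only. By the same convention, the parameters of the estimator wrapped by CollectionTransformer would presumably be addressed as transformer__<parameter>, since transformer is the name of the constructor argument; that is an inference from the convention, not something this page states.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: a pipeline with two named components.
pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=3))])

# "<component>__<parameter>" addresses a parameter of a named step.
pipe.set_params(pca__n_components=2, scale__with_std=False)

# deep=True also returns the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```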

transform(X, y=None)[source]

Alias for fit_transform.

Allows for this class to be used as an intermediate step in a scikit-learn pipeline.

Parameters
  • X (list of length n_samples, or ndarray of shape (n_samples, ...)) – Collection of inputs to be fit-transformed by transformer.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

Returns

Xt – Collection of outputs. It is a list unless all outputs have the same shape, in which case it is converted to an ndarray.

Return type

list of length n_samples, or ndarray of shape (n_samples, ...)