ParallelClustering

class gtda.mapper.ParallelClustering(clusterer, n_jobs=None, parallel_backend_prefer=None)

Employ joblib parallelism to cluster different portions of a dataset.

Any clustering object that stores a labels_ attribute during fit can be passed to the constructor; most classes in sklearn.cluster qualify. The input of fit is of the form [X_tot, masks], where X_tot is the full dataset and masks is a two-dimensional boolean array, each column of which marks a portion of X_tot to cluster separately. Parallelism is achieved over the columns of masks.
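The column-wise layout of masks can be illustrated without giotto-tda. The sketch below uses plain Python lists in place of NumPy arrays; portion_indices is a hypothetical helper, not part of the library, and the data are made up. With NumPy, the same row indices would come from np.flatnonzero(masks[:, j]).

```python
# Toy stand-ins for X_tot (4 samples, 2 features) and masks (4 samples, 3 portions).
X_tot = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
masks = [
    [True,  False, True],
    [True,  False, False],
    [False, True,  True],
    [False, True,  False],
]


def portion_indices(masks, j):
    """Return the row indices of X_tot selected by column j of masks (hypothetical helper)."""
    return [row for row, flags in enumerate(masks) if flags[j]]


n_portions = len(masks[0])
portions = [portion_indices(masks, j) for j in range(n_portions)]

# Each portion would be clustered independently; portions may overlap.
subsets = [[X_tot[row] for row in rows] for rows in portions]
```

Here row 0 and row 2 each appear in two portions, which is allowed: the columns of masks need not be disjoint.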

Parameters
  • clusterer (object) – Clustering object derived from sklearn.base.ClusterMixin.

  • n_jobs (int or None, optional, default: None) – The number of jobs to use for the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • parallel_backend_prefer ("processes" | "threads" | None, optional, default: None) – Soft hint for the selection of the default joblib backend. The default process-based backend is ‘loky’ and the default thread-based backend is ‘threading’. See [1].

clusterers_

Clones of clusterer fitted to the portions of the full data array specified in fit.

Type

tuple of object

clusters_

Labels and indices of each cluster found in fit. The i-th entry corresponds to the i-th portion of the data; it is a list of triples of the form (i, label, indices), where label is a cluster label and indices is the array of indices of points belonging to cluster (i, label).

Type

list of list of tuple
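To make the nesting concrete, here is a hedged sketch of how per-portion labels could be turned into the triples described above. It is plain Python rather than the library's own code: build_clusters and the toy labels are hypothetical, and the sketch assumes that indices refer to rows of the full dataset.

```python
def build_clusters(portion_rows, portion_labels):
    """Group rows by label within each portion (hypothetical helper).

    portion_rows[i] lists the rows of the full dataset in portion i;
    portion_labels[i] gives one cluster label per row of that portion.
    """
    clusters = []
    for i, (rows, labels) in enumerate(zip(portion_rows, portion_labels)):
        triples = []
        for label in sorted(set(labels)):
            indices = [row for row, lab in zip(rows, labels) if lab == label]
            triples.append((i, label, indices))
        clusters.append(triples)
    return clusters


# Made-up labels, shaped like a clusterer's labels_ (DBSCAN uses -1 for noise).
portion_rows = [[0, 1, 4], [2, 3]]
portion_labels = [[0, 0, -1], [0, 1]]
clusters = build_clusters(portion_rows, portion_labels)
```

Flattening the outer list yields one triple per (portion, label) pair, so the same label value in two different portions still identifies two distinct clusters.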

References

[1] “Thread-based parallelism vs process-based parallelism”, in joblib documentation.

__init__(clusterer, n_jobs=None, parallel_backend_prefer=None)

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None, sample_weight=None)

Fit the clusterer on each portion of the data.

clusterers_ and clusters_ are computed and stored.

Parameters
  • X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

  • sample_weight (array-like or None, optional, default: None) – The weights for each observation in the full data. If None, all observations are assigned equal weight. Otherwise, it has shape (n_samples,).

Returns

self

Return type

object
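Because sample_weight has one entry per row of X_tot while each clusterer only sees one portion, the weights presumably have to be restricted with the same mask column before being passed on. A minimal plain-Python sketch of that restriction follows; the values are made up, and the restriction is an assumption about the behavior rather than a transcription of the library code.

```python
sample_weight = [1.0, 2.0, 0.5, 4.0]     # one weight per row of X_tot
mask_column = [True, False, True, True]  # one column of masks

# Keep only the weights of the rows that belong to this portion (assumed behavior).
portion_weight = [w for w, keep in zip(sample_weight, mask_column) if keep]
```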

fit_predict(X, y=None, sample_weight=None)

Fit to the data, and return the found clusters.

Parameters
  • X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

  • sample_weight (array-like or None, optional, default: None) – The weights for each observation in the full data. If None, all observations are assigned equal weight. Otherwise, it has shape (n_samples,).

Returns

clusters – See clusters_.

Return type

list of list of tuple

fit_transform(X, y=None, **fit_params)

Alias for fit_predict.

Allows for this class to be used as an intermediate step in a scikit-learn pipeline.

Parameters
  • X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.

Returns

Xt – See clusters_.

Return type

list of list of tuple

get_params(deep=True)

Get parameters for this estimator.

Parameters
  • deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
  • **params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

transform(X, y=None)

Not implemented.

Only present so that the class is a valid step in a scikit-learn pipeline.

Parameters
  • X (Ignored) – Ignored.

  • y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.