ParallelClustering
-
class gtda.mapper.ParallelClustering(clusterer, n_jobs=None, parallel_backend_prefer=None)
Employ joblib parallelism to cluster different portions of a dataset.
An arbitrary clustering class which stores a labels_ attribute in fit can be passed to the constructor. Examples are most classes in sklearn.cluster. The input of fit is of the form [X_tot, masks], where X_tot is the full dataset and masks is a 2D boolean array, each column of which indicates the location of a portion of X_tot to cluster separately. Parallelism is achieved over the columns of masks.
- Parameters
clusterer (object) – Clustering object derived from sklearn.base.ClusterMixin.
n_jobs (int or None, optional, default: None) – The number of jobs to use for the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
parallel_backend_prefer ("processes" | "threads" | None, optional, default: None) – Soft hint for the selection of the default joblib backend. The default process-based backend is ‘loky’ and the default thread-based backend is ‘threading’. See [1].
-
labels_
For each point in the dataset passed to fit, a tuple of pairs of the form (i, partial_label), where i is the index of a boolean mask which selects that point and partial_label is the cluster label assigned to the point when clustering the subset of the data selected by mask i.
- Type
ndarray of shape (n_samples,)
References
[1] “Thread-based parallelism vs process-based parallelism”, in joblib documentation.
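The column-wise scheme described for this class can be illustrated without the library. What follows is a minimal, standard-library-only sketch and not giotto-tda's implementation: `toy_clusterer` is a hypothetical stand-in for a scikit-learn clusterer, a thread pool stands in for joblib, and `cluster_column` and `parallel_cluster` are illustrative names. It runs one clustering task per column of `masks` and assembles, for each point, a tuple of `(i, partial_label)` pairs.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_clusterer(points, gap=1.0):
    """Hypothetical 1D clusterer: sorted neighbours closer than `gap` share a label."""
    order = sorted(range(len(points)), key=lambda k: points[k])
    labels = [0] * len(points)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > gap:
            current += 1  # a large gap starts a new cluster
        labels[nxt] = current
    return labels


def cluster_column(x_tot, masks, i):
    """Cluster only the rows of x_tot selected by column i of masks."""
    rows = [r for r in range(len(x_tot)) if masks[r][i]]
    partial = toy_clusterer([x_tot[r] for r in rows])
    return i, rows, partial


def parallel_cluster(x_tot, masks, max_workers=2):
    """Run one clustering task per mask column, then merge the partial labels."""
    n_portions = len(masks[0])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda i: cluster_column(x_tot, masks, i), range(n_portions))
        )
    labels = [[] for _ in x_tot]
    for i, rows, partial in results:
        for r, lab in zip(rows, partial):
            labels[r].append((i, lab))
    return [tuple(pairs) for pairs in labels]


# Five 1D points; the third point belongs to both portions.
x_tot = [0.0, 0.5, 1.0, 5.0, 5.5]
masks = [
    [True, False],
    [True, False],
    [True, True],
    [False, True],
    [False, True],
]
labels = parallel_cluster(x_tot, masks)
print(labels)
```

A point selected by several masks receives several pairs, which is why each entry is a tuple rather than a single label.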
-
__init__(clusterer, n_jobs=None, parallel_backend_prefer=None)
Initialize self. See help(type(self)) for accurate signature.
-
fit(X, y=None, sample_weight=None)
Fit the clusterer on each portion of the data. clusterers_ and clusters_ are computed and stored.
- Parameters
X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.
y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.
sample_weight (array-like or None, optional, default: None) – The weights for each observation in the full data. If None, all observations are assigned equal weight. Otherwise, it has shape (n_samples,).
- Returns
self
- Return type
object
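To make the expected layout of `X` concrete, here is a small sketch in plain Python. It is only an illustration under stated assumptions: nested lists stand in for the ndarrays the method expects, the overlapping `intervals` on the first coordinate are made up for the example, and `build_masks` and `check_input` are hypothetical helpers, not part of giotto-tda.

```python
def build_masks(filter_values, intervals):
    """Row-major boolean table: one row per sample, one column per interval."""
    return [[lo <= v <= hi for (lo, hi) in intervals] for v in filter_values]


def check_input(X):
    """Validate the [X_tot, masks] layout and return (n_samples, n_portions)."""
    if len(X) != 2:
        raise ValueError("expected a list of length 2: [X_tot, masks]")
    X_tot, masks = X
    if len(masks) != len(X_tot):
        raise ValueError("masks needs exactly one row per sample of X_tot")
    widths = {len(row) for row in masks}
    if len(widths) > 1:
        raise ValueError("every row of masks needs the same number of columns")
    n_portions = widths.pop() if widths else 0
    return len(X_tot), n_portions


# Four samples with two features each (illustrative data).
X_tot = [[0.0, 1.0], [0.4, 0.9], [1.1, 0.2], [1.8, 0.1]]
# Two overlapping intervals on the first feature define two portions.
intervals = [(0.0, 1.2), (1.0, 2.0)]
masks = build_masks([row[0] for row in X_tot], intervals)
shape = check_input([X_tot, masks])
print(masks, shape)
```

Column `i` of `masks` marks the samples that form portion `i`, so a sample whose value lies in the overlap of two intervals is marked in both columns.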
-
fit_predict(X, y=None, sample_weight=None)
Fit to the data, and return the found clusters.
- Parameters
X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.
y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.
sample_weight (array-like or None, optional, default: None) – The weights for each observation in the full data. If None, all observations are assigned equal weight. Otherwise, it has shape (n_samples,).
- Returns
labels – See labels_.
- Return type
ndarray of shape (n_samples,)
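One way to read the returned labels is to regroup them by `(i, partial_label)` pair, which gives the member points of each partial cluster. The sketch below assumes the labels behave like a sequence of tuples of pairs, as described under `labels_`. The `labels` value is hand-written example data rather than library output, and `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_cluster(labels):
    """Map each (mask index, partial label) pair to the indices of its points."""
    clusters = defaultdict(list)
    for point, pairs in enumerate(labels):
        for i, partial_label in pairs:
            clusters[(i, partial_label)].append(point)
    return dict(clusters)


# Hand-written example: point 2 is selected by mask 0 and by mask 1.
labels = [((0, 0),), ((0, 0),), ((0, 0), (1, 0)), ((1, 1),), ((1, 1),)]
clusters = group_by_cluster(labels)
print(clusters)
```

Points that appear under more than one key are exactly those selected by more than one mask.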
-
fit_transform(X, y=None, **fit_params)
Alias for fit_predict. Allows for this class to be used as an intermediate step in a scikit-learn pipeline.
- Parameters
X (list-like of form [X_tot, masks]) – Input data as a list of length 2. X_tot is an ndarray of shape (n_samples, n_features) or (n_samples, n_samples) specifying the full data. masks is a boolean ndarray of shape (n_samples, n_portions) whose columns are boolean masks on X_tot, specifying the portions of X_tot to be independently clustered.
y (None) – There is no need for a target in a transformer, yet the pipeline API requires this parameter.
- Returns
Xt – See labels_.
- Return type
ndarray of shape (n_samples,)
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
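The `<component>__<parameter>` naming convention can be sketched in a few lines of plain Python. This is an illustration of the convention only, not scikit-learn's or giotto-tda's code: `ToyEstimator` is a hypothetical class, and `eps` is an assumed parameter of the inner object chosen for the example.

```python
class ToyEstimator:
    """Hypothetical estimator that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate the remainder of the key to the nested object.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self


inner = ToyEstimator(eps=0.5)
outer = ToyEstimator(clusterer=inner, n_jobs=None)

# Update a parameter of the nested object and one of the outer object.
result = outer.set_params(clusterer__eps=0.2, n_jobs=2)
print(inner.params, outer.params["n_jobs"])
```

Returning `self` is what allows calls to be chained after `set_params`.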