FirstSimpleGap¶
-
class
gtda.mapper.
FirstSimpleGap
(relative_gap_size=0.3, max_fraction=None, affinity='euclidean', memory=None, linkage='single')[source]¶ Agglomerative clustering cutting the dendrogram at the first instance of a sufficiently large gap.
A simple threshold is determined as a fraction of the largest linkage value in the full dendrogram. If possible, the dendrogram is cut at the first occurrence of a gap, between the linkage values of successive merges, which exceeds this threshold. Otherwise, a single cluster is returned. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.
- Parameters
relative_gap_size (float, optional, default:
0.3
) – The fraction of the largest linkage in the dendrogram to be used as a threshold for determining a large enough gap.max_fraction (float or None, optional, default:
None
) – When notNone
, the algorithm is constrained to produce no more thanmax_fraction * (n_samples - 1)
clusters, even if a candidate gap is observed in the iterative process which would produce a greater number of clusters.affinity (str, optional, default:
'euclidean'
) – Metric used to compute the linkage. Can be'euclidean'
,'l1'
,'l2'
,'manhattan'
,'cosine'
, or'precomputed'
. If linkage is'ward'
, only'euclidean'
is accepted. If'precomputed'
, a distance matrix (instead of a similarity matrix) is needed as input forfit
.memory (None, str or object with the joblib.Memory interface, optional, default:
None
) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.linkage (
'ward'
|'complete'
|'average'
|'single'
, optional, default:'single'
) –Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.
'ward'
minimizes the variance of the clusters being merged.'average'
uses the average of the distances of each observation of the two sets.'complete'
linkage uses the maximum distances between all observations of the two sets.'single'
uses the minimum of the distances between all observations of the two sets.
-
n_clusters_
¶ The number of clusters found by the algorithm.
- Type
int
-
labels_
¶ Cluster labels for each sample.
- Type
ndarray of shape (n_samples,)
-
children_
¶ The children of each non-leaf node. Values less than
n_samples
correspond to leaves of the tree which are the original samples. A nodei
greater than or equal ton_samples
is a non-leaf node and has childrenchildren_[i - n_samples]
. Alternatively at thei
-th iteration,children[i][0]
andchildren[i][1]
are merged to form noden_samples + i
.- Type
ndarray of shape (n_nodes - 1, 2)
-
n_leaves_
¶ Number of leaves in the hierarchical tree.
- Type
int
-
distances_
¶ Distances between nodes in the corresponding place in
children_
.- Type
ndarray of shape (n_nodes - 1,)
See also
-
__init__
(relative_gap_size=0.3, max_fraction=None, affinity='euclidean', memory=None, linkage='single')[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
fit
(X, y=None)[source]¶ Fit the agglomerative clustering from features or distance matrix.
The stopping rule is used to determine
n_clusters_
, and the full dendrogram is cut there to computelabels_
.- Parameters
X (ndarray of shape (n_samples, n_features) or (n_samples, n_samples)) – Training instances to cluster, or distances between instances if
affinity='precomputed'
.y (ignored) – Not used, present here for API consistency by convention.
- Returns
- Return type
self
-
fit_predict
(X, y=None)¶ Perform clustering on X and returns cluster labels.
- Parameters
X (ndarray, shape (n_samples, n_features)) – Input data.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
labels – Cluster labels.
- Return type
ndarray, shape (n_samples,)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object