FirstSimpleGap

class gtda.mapper.FirstSimpleGap(relative_gap_size=0.3, max_fraction=None, affinity='euclidean', memory=None, linkage='single')[source]

Agglomerative clustering cutting the dendrogram at the first instance of a sufficiently large gap.

A simple threshold is determined as a fraction of the largest linkage value in the full dendrogram. If possible, the dendrogram is cut at the first occurrence of a gap, between the linkage values of successive merges, which exceeds this threshold. Otherwise, a single cluster is returned. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.

Parameters
  • relative_gap_size (float, optional, default: 0.3) – The fraction of the largest linkage in the dendrogram to be used as a threshold for determining a large enough gap.

  • max_fraction (float or None, optional, default: None) – When not None, the algorithm is constrained to produce no more than max_fraction * (n_samples - 1) clusters, even if a candidate gap is observed in the iterative process which would produce a greater number of clusters.

  • affinity (str, optional, default: 'euclidean') – Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix (instead of a similarity matrix) is needed as input for fit.

  • memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

  • linkage ('ward' | 'complete' | 'average' | 'single', optional, default: 'single') –

    Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.

    • 'ward' minimizes the variance of the clusters being merged.

    • 'average' uses the average of the distances of each observation of the two sets.

    • 'complete' linkage uses the maximum distances between all observations of the two sets.

    • 'single' uses the minimum of the distances between all observations of the two sets.

n_clusters_

The number of clusters found by the algorithm.

Type

int

labels_

Cluster labels for each sample.

Type

ndarray of shape (n_samples,)

children_

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i.

Type

ndarray of shape (n_nodes - 1, 2)

n_leaves_

Number of leaves in the hierarchical tree.

Type

int

distances_

Distances between nodes in the corresponding place in children_.

Type

ndarray of shape (n_nodes - 1,)

__init__(relative_gap_size=0.3, max_fraction=None, affinity='euclidean', memory=None, linkage='single')[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None)[source]

Fit the agglomerative clustering from features or distance matrix.

The stopping rule is used to determine n_clusters_, and the full dendrogram is cut there to compute labels_.

Parameters
  • X (ndarray of shape (n_samples, n_features) or (n_samples, n_samples)) – Training instances to cluster, or distances between instances if affinity='precomputed'.

  • y (ignored) – Not used, present here for API consistency by convention.

Returns

Return type

self

fit_predict(X, y=None)

Perform clustering on X and returns cluster labels.

Parameters
  • X (ndarray, shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

labels – Cluster labels.

Return type

ndarray, shape (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object