FirstSimpleGap¶

class gtda.mapper.FirstSimpleGap(relative_gap_size=0.3, max_fraction=None, affinity='euclidean', memory=None, linkage='single')[source]¶

Agglomerative clustering cutting the dendrogram at the first instance of a sufficiently large gap.

A simple threshold is determined as a fraction of the largest linkage value in the full dendrogram. If possible, the dendrogram is cut at the first occurrence of a gap, between the linkage values of successive merges, which exceeds this threshold. Otherwise, a single cluster is returned. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.

Parameters

relative_gap_size (float, optional, default: 0.3) – The fraction of the largest linkage in the dendrogram to be used as a threshold for determining a large enough gap.
max_fraction (float or None, optional, default: None) – When not None, the algorithm is constrained to produce no more than max_fraction * (n_samples - 1) clusters, even if a candidate gap is observed in the iterative process which would produce a greater number of clusters.
affinity (str, optional, default: 'euclidean') – Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix (instead of a similarity matrix) is needed as input for fit.
memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
linkage ('ward' | 'complete' | 'average' | 'single', optional, default: 'single') –
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.
- 'ward' minimizes the variance of the clusters being merged.
- 'average' uses the average of the distances of each observation of the two sets.
- 'complete' linkage uses the maximum distances between all observations of the two sets.
- 'single' uses the minimum of the distances between all observations of the two sets.

n_clusters\_

The number of clusters found by the algorithm.

Type: int

labels\_

Cluster labels for each sample.

Type: ndarray of shape (n_samples,)

children\_

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i.

Type: ndarray of shape (n_nodes - 1, 2)

n_leaves\_

Number of leaves in the hierarchical tree.

Type: int

distances\_

Distances between nodes in the corresponding place in children_.

Type: ndarray of shape (n_nodes - 1,)