FirstHistogramGap

class gtda.mapper.FirstHistogramGap(linkage='single', affinity='euclidean', freq_threshold=0, max_fraction=1.0, n_bins_start=5, memory=None)

Agglomerative clustering with a stopping rule given by a histogram-based version of the first gap method, introduced in [1].

Given a frequency threshold f and an initial integer k: 1) create a histogram of k equally spaced bins of the number of merges in the dendrogram, as a function of the linkage parameter; 2) the value of linkage at which the tree is to be cut is the first one after which a bin of height no greater than f (i.e. a “gap”) is observed; 3) if no gap is observed, increase k and repeat 1) and 2) until termination. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.
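The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `first_gap_threshold` and the sample merge distances are hypothetical, the cut value is taken to be the left edge of the first gap bin, and the `max_fraction` override is omitted.

```python
import numpy as np


def first_gap_threshold(merge_distances, freq_threshold=0, n_bins_start=5):
    """Return a linkage value at which to cut the tree, or None if no gap exists.

    Illustrative helper only; it is not part of gtda.
    """
    distances = np.asarray(merge_distances, dtype=float)
    if distances.size < 2 or distances.min() == distances.max():
        return None  # a histogram over a single value cannot contain a gap
    # With more bins than merges, some bin must be empty, so the search
    # terminates for any freq_threshold >= 0.
    max_bins = max(n_bins_start, distances.size + 1)
    for n_bins in range(n_bins_start, max_bins + 1):
        # Step 1: histogram of merge counts as a function of linkage value.
        counts, edges = np.histogram(distances, bins=n_bins)
        # Step 2: a "gap" is a bin whose height does not exceed the threshold.
        gaps = np.flatnonzero(counts <= freq_threshold)
        if gaps.size > 0:
            # Left edge of the first gap bin: merges in earlier bins lie below it.
            return float(edges[gaps[0]])
        # Step 3: no gap found, so retry with one more bin.
    return None


# Made-up merge distances for a dendrogram over 8 samples (7 merges).
merge_distances = [0.1, 0.12, 0.15, 0.2, 0.22, 1.0, 1.1]
threshold = first_gap_threshold(merge_distances)
n_samples = len(merge_distances) + 1
n_merges_kept = int(np.sum(np.asarray(merge_distances) < threshold))
n_clusters = n_samples - n_merges_kept
print(threshold, n_clusters)
```

In this example the five small merges fall in the first bin, the next bin is empty, and cutting there leaves three clusters.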

Parameters

linkage ('ward' | 'complete' | 'average' | 'single', optional, default: 'single') –
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. At each step, the algorithm merges the pair of clusters that minimizes this criterion.
- 'ward' minimizes the variance of the clusters being merged.
- 'average' uses the average of the distances of each observation of the two sets.
- 'complete' uses the maximum of the distances between all observations of the two sets.
- 'single' uses the minimum of the distances between all observations of the two sets.
affinity (str, optional, default: 'euclidean') – Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix (instead of a similarity matrix) is needed as input for fit.
freq_threshold (int, optional, default: 0) – The frequency threshold for declaring that a gap in the histogram of merges is present.
max_fraction (float, optional, default: 1.0) – When not None, the algorithm is constrained to produce no more than max_fraction * n_samples clusters, even if a candidate gap observed during the iterative process would produce a greater number of clusters.
n_bins_start (int, optional, default: 5) – The initial number of bins in the iterative process for finding a gap in the histogram of merges.
memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the output of the computation of the tree. By default, no caching is performed. If a string is given, it is the path to the caching directory.
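To make the linkage criteria listed under the linkage parameter concrete, the sketch below evaluates them by hand on two tiny sets of observations. The data are made up, and the 'ward' value is computed as the textbook increase in within-cluster sum of squares caused by the merge; the scale the library reports for 'ward' may differ.

```python
import numpy as np

# Two small sets of 1-D observations (made-up data).
set_a = np.array([[0.0], [1.0]])
set_b = np.array([[3.0], [5.0]])

# Euclidean distances between every observation in set_a and every one in set_b.
pairwise = np.linalg.norm(set_a[:, None, :] - set_b[None, :, :], axis=-1)

single = pairwise.min()     # 'single': minimum of the pairwise distances
complete = pairwise.max()   # 'complete': maximum of the pairwise distances
average = pairwise.mean()   # 'average': mean of the pairwise distances

# 'ward': increase in within-cluster sum of squares if the two sets are merged.
n_a, n_b = len(set_a), len(set_b)
centroid_gap = set_a.mean(axis=0) - set_b.mean(axis=0)
ward_cost = n_a * n_b / (n_a + n_b) * float(centroid_gap @ centroid_gap)

print(single, complete, average, ward_cost)
```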

n_clusters_

The number of clusters found by the algorithm.

Type: int

labels_

Cluster labels for each sample.

Type: ndarray of shape (n_samples,)

children_

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree, which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively, at the i-th iteration, children_[i][0] and children_[i][1] are merged to form node n_samples + i.

Type: ndarray of shape (n_nodes - 1, 2)
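The node numbering used by children_ can be unpacked with a short loop. The array below is a hand-written, hypothetical merge tree for four samples rather than the output of a fitted estimator; the sketch only shows how row i defines node n_samples + i.

```python
import numpy as np

# Hypothetical children_ array for n_samples = 4:
# row 0 merges leaves 0 and 1, row 1 merges leaves 2 and 3,
# row 2 merges the two internal nodes created by rows 0 and 1.
children = np.array([[0, 1], [2, 3], [4, 5]])
n_samples = children.shape[0] + 1

# Map every node id to the original samples it contains.
members = {leaf: [leaf] for leaf in range(n_samples)}
for i, (left, right) in enumerate(children):
    # Row i creates the non-leaf node with id n_samples + i.
    members[n_samples + i] = members[int(left)] + members[int(right)]

for node in range(n_samples, 2 * n_samples - 1):
    print(node, members[node])
```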

n_leaves_

Number of leaves in the hierarchical tree.

Type: int

distances_

Distances between nodes in the corresponding place in children_.

Type: ndarray of shape (n_nodes - 1,)
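Since distances_ is aligned row by row with children_, the two arrays together are enough to cut the tree at a given linkage value. The sketch below does this by hand on hypothetical arrays and a hypothetical threshold. It assumes merge distances that do not decrease from row to row and keeps merges strictly below the threshold; whether the library uses a strict or non-strict comparison is not stated here.

```python
import numpy as np

# Hypothetical fitted attributes for n_samples = 4 (not produced by gtda).
children = np.array([[0, 1], [2, 3], [4, 5]])
distances = np.array([0.1, 0.2, 1.0])
threshold = 0.5  # illustrative cut value

n_samples = children.shape[0] + 1
# Parent pointer for every node; a node that is its own parent is a root.
parent = np.arange(2 * n_samples - 1)

for i, ((left, right), dist) in enumerate(zip(children, distances)):
    if dist < threshold:
        # Keep this merge: both children now hang under node n_samples + i.
        parent[left] = n_samples + i
        parent[right] = n_samples + i


def find_root(node):
    """Follow parent pointers up to the root of the kept sub-tree."""
    while parent[node] != node:
        node = parent[node]
    return int(node)


roots = [find_root(leaf) for leaf in range(n_samples)]
# Relabel the roots as consecutive integers starting from 0.
_, labels = np.unique(roots, return_inverse=True)
n_clusters = len(set(roots))
print(labels, n_clusters)
```

Each kept merge reduces the cluster count by one, so here two kept merges out of three leave two clusters.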