User:Fredseadroid/Draft of the page for HCS clustering Algorithm

Class	Cluster analysis (on a similarity graph)
Data structure	Graph
Worst-case performance	O(2N x f(n,m))

The HCS clustering algorithm (also known as the HCS algorithm) is an algorithm for cluster analysis that first represents the similarity data as a similarity graph and then reports all highly connected subgraphs as clusters. The algorithm makes no prior assumptions about the number of clusters. It was published by Erez Hartuv and Ron Shamir in 2000.

There are several approaches to modeling similarity data for cluster analysis, and for each cluster model different algorithms can be given. The HCS clustering algorithm works on connectivity models, where a high density of edges between vertices indicates high similarity between elements.

Similarity Modeling and Preprocessing

The goal of cluster analysis is to group elements into disjoint subsets, or clusters, based on similarity between elements, so that elements in the same cluster are highly similar to each other (homogeneity), while elements from different clusters have low similarity to each other (separation). A similarity graph is one model for representing the similarity between elements, which in turn facilitates generating clusters. To construct a similarity graph from similarity data, represent elements as vertices, and add an edge between two vertices when the similarity value between them is above some threshold.
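The construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_similarity_graph`, the adjacency-dictionary representation, and the example matrix and threshold are assumptions, not part of the original description.

```python
from itertools import combinations

def build_similarity_graph(similarity, threshold):
    """Return an adjacency dict {vertex: set of neighbours}.

    `similarity` is assumed to be a symmetric square matrix (list of lists).
    An edge joins i and j when similarity[i][j] is above `threshold`.
    """
    n = len(similarity)
    graph = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if similarity[i][j] > threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Hypothetical similarity values for four elements.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
graph = build_similarity_graph(sim, threshold=0.5)
```

With this threshold, elements 0, 1 and 2 form a triangle and element 3 has no edges.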

Algorithm

In the similarity graph, the more edges exist for a given number of vertices, the more similar the vertices in that set are to each other. In other words, if we try to disconnect a similarity graph by removing edges, the more edges we need to remove before the graph becomes disconnected, the more similar the vertices in this graph are. A minimum cut is a smallest set of edges whose removal disconnects the graph.

The HCS clustering algorithm finds all subgraphs with n vertices whose minimum cut contains more than n/2 edges, and identifies them as clusters. Such a subgraph is called a Highly Connected Subgraph (HCS). Single vertices are not considered clusters and are grouped into a singletons set S.

Given a similarity graph G(V,E), the HCS clustering algorithm checks whether it is already highly connected. If so, it returns G. Otherwise it uses a minimum cut of G to partition G into two subgraphs H and H', and recursively runs the HCS clustering algorithm on H and H'.
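The recursion can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the graph is an adjacency dictionary, the names `minimum_cut` and `hcs` are hypothetical, and the minimum cut is computed with the deterministic Stoer-Wagner algorithm, which is one possible choice among several.

```python
def minimum_cut(graph):
    """Stoer-Wagner minimum cut of an undirected graph without self-loops.

    Returns (cut_size, side), where `side` is the vertex set on one shore.
    """
    # weights[u][v] = number of original edges between merged groups u and v
    weights = {u: {v: 1 for v in nbrs} for u, nbrs in graph.items()}
    groups = {u: {u} for u in graph}
    best_size, best_side = float("inf"), None
    while len(weights) > 1:
        # One phase: add vertices in order of strongest attachment so far.
        start = next(iter(weights))
        added = [start]
        attach = {v: weights[start].get(v, 0) for v in weights if v != start}
        while attach:
            nxt = max(attach, key=attach.get)
            cut_of_phase = attach.pop(nxt)
            added.append(nxt)
            for v, w in weights[nxt].items():
                if v in attach:
                    attach[v] += w
        s, t = added[-2], added[-1]
        # cut_of_phase separates the group of t from all other vertices.
        if cut_of_phase < best_size:
            best_size, best_side = cut_of_phase, set(groups[t])
        # Merge t into s before the next phase.
        groups[s] |= groups.pop(t)
        for v, w in weights.pop(t).items():
            del weights[v][t]
            if v != s:
                weights[s][v] = weights[s].get(v, 0) + w
                weights[v][s] = weights[s][v]
    return best_size, best_side

def hcs(graph):
    """Return (clusters, singletons) for an adjacency dict."""
    n = len(graph)
    if n == 1:
        return [], set(graph)          # single vertices go to the set S
    cut_size, side = minimum_cut(graph)
    if cut_size > n / 2:               # highly connected: report as a cluster
        return [set(graph)], set()
    clusters, singletons = [], set()
    for part in (side, set(graph) - side):
        sub = {u: graph[u] & part for u in part}   # induced subgraph
        found, single = hcs(sub)
        clusters += found
        singletons |= single
    return clusters, singletons

# Two triangles joined by the edge 2-3, plus vertex 6 hanging off vertex 5.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5},
           4: {3, 5}, 5: {3, 4, 6}, 6: {5}}
clusters, singletons = hcs(example)
```

On this example the two triangles come out as clusters and vertex 6 ends up in the singletons set.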

Example

The following animation shows how the HCS clustering algorithm partitions a similarity graph into three clusters.

Pseudocode

function HCS(G(V,E))
  (H1,H2,C) ← MINIMUMCUT(G)    // C is the set of cut edges separating H1 from H2
  if G is highly connected       // that is, |C| > |V|/2
    then return (G)
  else
    HCS(H1)
    HCS(H2)
  end if
end

The step of finding the minimum cut on graph G is a subroutine that can be implemented using different algorithms for this problem. One option is a randomized algorithm such as Karger's contraction algorithm.
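As an illustration of a randomized minimum cut subroutine, the following sketch implements Karger's contraction algorithm on an edge list. It is a sketch under assumptions: the function name `karger_min_cut`, the number of trials, and the fixed seed are illustrative choices, the input graph is assumed to be connected, and the result is only correct with high probability. Contracting edges in a uniformly random order while skipping edges inside an already merged group is equivalent to repeatedly contracting a random remaining edge.

```python
import random

def karger_min_cut(edges, trials=200, seed=0):
    """Estimate a minimum cut of a connected graph given as an edge list.

    Returns (cut_size, side) for the smallest cut seen over `trials` runs.
    """
    rng = random.Random(seed)
    vertices = {v for edge in edges for v in edge}
    best_size, best_side = float("inf"), None
    for _ in range(trials):
        parent = {v: v for v in vertices}      # union-find forest

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        remaining = len(vertices)
        order = edges[:]
        rng.shuffle(order)
        for u, v in order:
            if remaining == 2:                 # two super-vertices left
                break
            ru, rv = find(u), find(v)
            if ru != rv:                       # contract the edge (u, v)
                parent[ru] = rv
                remaining -= 1
        # Edges whose endpoints lie in different super-vertices form the cut.
        cut = [(u, v) for u, v in edges if find(u) != find(v)]
        if len(cut) < best_size:
            root = find(next(iter(vertices)))
            best_size = len(cut)
            best_side = {v for v in vertices if find(v) == root}
    return best_size, best_side

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
size, side = karger_min_cut(edges)
```

Because a single run finds a minimum cut only with some probability, the sketch repeats the contraction many times and keeps the smallest cut it sees.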

Complexity

The running time of the HCS clustering algorithm is bounded by 2N × f(n,m), where f(n,m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges, and N is the number of clusters found. In many applications N ≪ n.

For fast algorithms for finding a minimum cut in an unweighted graph:

Proof of correctness

The clusters produced by the HCS clustering algorithm possess several properties, which can demonstrate the homogeneity and the separation of the solution.

Theorem 1 The diameter of every highly connected graph is at most two.

Proof: The number of edges in a minimum cut is at most the minimum degree of the graph, because removing all edges incident to a vertex of minimum degree disconnects that vertex. If the graph G is highly connected, then the number of edges in a minimum cut is greater than the number of vertices divided by 2. So the degree of every vertex in the highly connected graph G must be greater than half the number of vertices. Therefore any two non-adjacent vertices in G must have at least one common neighbor, so the distance between them is two.
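The argument can be written out symbolically. The notation is an assumption introduced here for illustration: k(G) is the size of a minimum cut, δ(G) the minimum degree, N(v) the set of neighbors of v, and d(u,v) the distance between u and v.

```latex
\begin{align*}
k(G) &\le \delta(G) \\
k(G) > \tfrac{n}{2} &\implies \delta(G) > \tfrac{n}{2} \\
|N(u)| + |N(v)| &> n > n - 2 = |V \setminus \{u, v\}|
  && \text{for non-adjacent } u, v \\
&\implies N(u) \cap N(v) \ne \emptyset \implies d(u, v) = 2
\end{align*}
```

The third line uses the fact that the neighbors of two non-adjacent vertices u and v all lie among the other n - 2 vertices, so the two neighbor sets must overlap.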

Theorem 2 (a) The number of edges in a highly connected subgraph is quadratic in its number of vertices. (b) The number of edges removed by each iteration of the HCS algorithm is at most linear in the number of vertices.

Proof: (For a) From the proof of Theorem 1 we know that every vertex has more than half of the total vertices as neighbors. Therefore the total number of edges in a highly connected subgraph must be more than (n/2) x n x 1/2, where we sum the degrees of all vertices and divide by 2.

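(For b) In each iteration, the HCS algorithm separates a graph containing n vertices into two subgraphs only if the graph is not highly connected, that is, only if its minimum cut contains at most n/2 edges. So the number of edges removed between those two components is at most n/2.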
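Both bounds can be stated as a short calculation. The symbols deg(v) for the degree of v, C for the set of removed cut edges, and k(G) for the minimum cut size are notation assumed here for illustration.

```latex
\begin{align*}
|E| &= \frac{1}{2} \sum_{v \in V} \deg(v)
     > \frac{1}{2} \cdot n \cdot \frac{n}{2} = \frac{n^2}{4}
     && \text{(a), } G \text{ highly connected} \\
|C| &= k(G) \le \frac{n}{2}
     && \text{(b), } G \text{ not highly connected, so it is split}
\end{align*}
```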

Theorems 1 and 2a provide a strong indication of homogeneity, as the only better possibility in terms of the diameter is that every two vertices of a cluster are connected by an edge, which is both too stringent a requirement and an NP-hard problem.

Theorem 2b also indicates separation, since the number of edges removed by each iteration of the HCS algorithm is at most linear in the size of the underlying subgraph, in contrast to the quadratic number of edges within the final clusters.

Applications

Gene expression analysis. The hybridization of synthetic oligonucleotides to arrayed cDNAs yields a fingerprint for each cDNA clone. Running the HCS algorithm on these fingerprints can identify clones corresponding to the same gene.

PPI network structure discovery. HCS clustering can be used to detect dense subnetworks in protein-protein interaction (PPI) networks that may have biological meaning and represent biological processes.

Variations

Singletons adoption: Elements left as singletons by the initial clustering process can be "adopted" by clusters based on their similarity to the cluster. If the largest number of neighbors that a singleton has in any one cluster is large enough, the singleton is added to that cluster.
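One adoption pass might look like the following sketch. The text does not say how many neighbors count as "large enough", so the parameter `min_neighbors` is a hypothetical threshold chosen by the caller, and the function name `adopt_singletons` and the adjacency-dictionary input are likewise assumptions.

```python
def adopt_singletons(graph, clusters, singletons, min_neighbors):
    """Run one adoption pass and return (clusters, remaining_singletons).

    A singleton joins the cluster in which it has the most neighbours,
    provided that count is at least `min_neighbors` (an assumed threshold).
    """
    clusters = [set(c) for c in clusters]       # do not modify the input
    remaining = set()
    for v in sorted(singletons):
        counts = [len(graph[v] & c) for c in clusters]
        best = max(range(len(clusters)), key=counts.__getitem__, default=None)
        if best is not None and counts[best] >= min_neighbors:
            clusters[best].add(v)               # v is adopted
        else:
            remaining.add(v)
    return clusters, remaining

# Triangle {0, 1, 2}; vertex 3 touches 0 and 1; vertex 4 touches only 3.
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1, 4}, 4: {3}}
new_clusters, left_over = adopt_singletons(graph, [{0, 1, 2}], {3, 4},
                                           min_neighbors=2)
```

In this sketch a cluster grows as soon as it adopts a vertex, so later singletons are compared against the enlarged cluster. Repeating the pass until nothing changes is a possible extension.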

Removing low degree vertices: When the input graph has vertices with low degrees, running the algorithm on them is not worthwhile, since it is computationally expensive and not informative. As a refinement, the algorithm can first remove all vertices with a degree lower than a certain threshold.
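This preprocessing step can be sketched as a single filtering pass over an adjacency dictionary. The function name `remove_low_degree` and the parameter `min_degree` are hypothetical, and whether the removal should be repeated after degrees drop is left open here.

```python
def remove_low_degree(graph, min_degree):
    """Drop vertices whose degree is below `min_degree` (single pass).

    Returns (pruned_graph, removed_vertices).
    """
    keep = {v for v, nbrs in graph.items() if len(nbrs) >= min_degree}
    pruned = {v: graph[v] & keep for v in keep}  # induced subgraph on `keep`
    removed = set(graph) - keep
    return pruned, removed

# Two triangles joined by the edge 2-3, plus vertex 6 hanging off vertex 5.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5},
         4: {3, 5}, 5: {3, 4, 6}, 6: {5}}
pruned, removed = remove_low_degree(graph, min_degree=2)
```

Here only vertex 6 is removed, and the removed vertices can be placed directly in the singletons set.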

Related algorithms

CLICK (CLuster Identification via Connectivity Kernels) is an adaptation of the HCS algorithm to weighted similarity graphs, where the edge weights are assigned with a probabilistic flavor.

References

Hartuv, Erez, and Ron Shamir. "A clustering algorithm based on graph connectivity." Information Processing Letters 76, no. 4 (2000): 175-181.

Xu, Rui, and Donald Wunsch. "Survey of clustering algorithms." IEEE Transactions on Neural Networks 16, no. 3 (2005): 645-678.

Hartuv, E., A. O. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. "An algorithm for clustering cDNA fingerprints." Genomics 66, no. 3 (2000): 249-256.

Jurisica, Igor, and Dennis Wigle. Knowledge Discovery in Proteomics. Vol. 8. CRC Press, 2006.

Sharan, Roded, and Ron Shamir. "CLICK: a clustering algorithm with applications to gene expression analysis." In Proc Int Conf Intell Syst Mol Biol, vol. 8, pp. 307-316. 2000.
