What is agglomerative clustering?

quangngoc

Agglomerative clustering is a "bottom-up" approach to hierarchical clustering. Each observation starts in its cluster, and cluster pairs are merged as they move up the hierarchy.

Agglomerative clustering is a hierarchical clustering algorithm that follows a bottom-up approach to group similar data points into clusters. It starts by considering each data point as a separate cluster and iteratively merges the closest clusters until a desired number of clusters is reached or all the data points belong to a single cluster.

Here's a step-by-step explanation of the agglomerative clustering algorithm:

Initialization:
- Start by considering each data point as a separate cluster.
- The number of clusters initially equals the number of data points.
Similarity Computation:
- Calculate the pairwise distances or similarities between all the clusters.
- Common distance metrics include Euclidean distance, Manhattan distance, or cosine similarity, depending on the nature of the data.
Merging Clusters:
- Identify the two closest clusters based on the computed distances or similarities.
- Merge these two clusters into a single cluster.
- Update the number of clusters by decrementing it by one.
Updating Distances:
- Recompute the distances or similarities between the newly formed cluster and all the remaining clusters.
- The distance between clusters can be determined using different linkage criteria, such as:
  - Single Linkage: The distance between two clusters is the minimum distance between any two points in the clusters.
  - Complete Linkage: The distance between two clusters is the maximum distance between any two points in the clusters.
  - Average Linkage: The distance between two clusters is the average distance between all pairs of points in the clusters.
  - Ward's Method: The distance between two clusters is based on the increase in the sum of squared distances within clusters after merging.
Repeat:
- Repeat steps 3 and 4 until a desired number of clusters is reached or all the data points belong to a single cluster.
Dendrogram:
- The merging process can be visualized using a dendrogram, which is a tree-like diagram that shows the hierarchical structure of the clusters.
- The dendrogram illustrates how the clusters are merged at each step and the distances at which the merges occur.

Agglomerative clustering has several advantages:

It does not require specifying the number of clusters in advance, although it can be stopped at a desired number of clusters.
It can capture hierarchical relationships between clusters, which can be informative for understanding the structure of the data.
It is relatively simple to implement and understand.

However, agglomerative clustering also has some limitations:

It can be computationally expensive, especially for large datasets, as it requires calculating pairwise distances between all data points.
The choice of linkage criteria can affect the resulting clusters, and different linkage methods may produce different results.
Once a merge is made, it cannot be undone, which means that the algorithm cannot correct previous merging decisions.

Agglomerative clustering is commonly used in various domains, such as:

Document clustering based on text similarity
Customer segmentation based on purchasing behavior
Biological data analysis, such as gene expression clustering
Image segmentation and object recognition

It's important to preprocess the data, handle missing values, and scale the features appropriately before applying agglomerative clustering. The choice of distance metric and linkage criteria should be based on the characteristics of the data and the desired properties of the clusters.