How would you do it if the labels aren’t known?
When the true labels are not known, evaluating the performance of a k-means clustering algorithm becomes more challenging because you don't have ground truth information to compare the clusters to. In this case, you can use a variety of internal evaluation metrics that assess the quality and consistency of the clustering results without relying on external labels. Here are some common internal evaluation metrics for k-means clustering:
Inertia (Within-Cluster Sum of Squares):
- Inertia measures the sum of squared distances between data points and their assigned cluster centroids. Lower inertia values indicate tighter, more compact clusters.
- Inertia is exposed as the `inertia_` attribute of a fitted scikit-learn `KMeans` model.
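For example (using synthetic blob data purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Lower inertia means tighter clusters; compare across candidate values of k
print(km.inertia_)
```

Note that inertia always decreases as k grows, so it is usually compared across several values of k (the elbow method) rather than read in isolation.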
Silhouette Score:
- The silhouette score assesses cluster quality by comparing, for each data point, its cohesion (mean distance to the other points in its own cluster) with its separation (mean distance to the points in the nearest neighboring cluster).
- It ranges from -1 to 1, with higher values indicating better-defined clusters. You can calculate it with scikit-learn's `silhouette_score` function.
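For example, on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all points; in [-1, 1], higher is better
score = silhouette_score(X, labels)
print(score)
```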
Davies-Bouldin Index:
- The Davies-Bouldin index measures the average similarity between each cluster and its most similar cluster while considering the within-cluster dispersion. Lower values indicate better clustering.
- You can calculate the Davies-Bouldin index with scikit-learn's `davies_bouldin_score` function.
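For example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Lower is better; 0 is the best possible value
db = davies_bouldin_score(X, labels)
print(db)
```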
Calinski-Harabasz Index (Variance Ratio Criterion):
- This index evaluates the ratio of between-cluster variance to within-cluster variance. Higher values suggest better clustering.
- It can be computed with scikit-learn's `calinski_harabasz_score` function.
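For example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Ratio of between-cluster to within-cluster dispersion; higher is better
ch = calinski_harabasz_score(X, labels)
print(ch)
```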
Gap Statistics:
- The gap statistic compares the within-cluster dispersion of your clustering to the dispersion expected under a null reference distribution (typically data drawn uniformly over the same range). A large gap indicates clustering structure stronger than chance.
- You can implement the gap statistic with custom code built on scikit-learn's `KMeans`, or use a third-party package such as the gap-statistic library.
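A minimal custom sketch, assuming a uniform reference distribution over the data's bounding box (a common simplification of the original formulation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k, n_refs=10, random_state=0):
    """Gap = E[log(W_ref)] - log(W_data); larger gaps mean stronger structure."""
    rng = np.random.default_rng(random_state)

    # Within-cluster dispersion (inertia) of the actual data
    disp = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_

    # Dispersion of reference datasets drawn uniformly over the bounding box
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_disps = [
        KMeans(n_clusters=k, n_init=10, random_state=random_state)
        .fit(rng.uniform(mins, maxs, size=X.shape))
        .inertia_
        for _ in range(n_refs)
    ]
    return np.mean(np.log(ref_disps)) - np.log(disp)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print(gap_statistic(X, k=4))
```

In practice the gap is computed for a range of k values and the smallest k satisfying the gap criterion is chosen.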
Dunn Index:
- The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher values indicate better clustering.
- You can compute the Dunn index by measuring all pairwise inter-cluster and intra-cluster distances, then dividing the smallest inter-cluster distance by the largest intra-cluster distance (the widest cluster diameter).
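There is no built-in Dunn index in scikit-learn, but a small sketch using SciPy's pairwise distances might look like this (`dunn_index` is an illustrative helper, not a library function; the O(n²) distance computations limit it to modest dataset sizes):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Maximum intra-cluster distance: the largest cluster diameter
    max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Minimum inter-cluster distance: closest pair of points in different clusters
    min_inter = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_inter / max_intra

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(dunn_index(X, labels))
```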
Hopkins Statistic:
- The Hopkins statistic measures the clustering tendency of the data. It quantifies the likelihood that the data has meaningful clusters.
- The Hopkins statistic can be calculated using custom code or libraries that provide this metric.
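A rough custom sketch (`hopkins` is an illustrative helper; exact conventions vary between sources, and here a value near 0.5 suggests random data while a value near 1 suggests strong clustering tendency):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_size=30, random_state=0):
    """Hopkins statistic: ~0.5 for random data, approaching 1 for clustered data."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distances from sampled real points to their nearest *other* real point
    sample = X[rng.choice(len(X), sample_size, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    # u: distances from uniform random points to their nearest real point
    mins, maxs = X.min(axis=0), X.max(axis=0)
    uniform = rng.uniform(mins, maxs, size=(sample_size, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print(hopkins(X))
```

Unlike the metrics above, the Hopkins statistic is computed before clustering, to check whether the data is worth clustering at all.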
When evaluating k-means clustering without ground truth labels, it's often a good practice to use a combination of these internal metrics to assess the clustering quality from different angles. However, keep in mind that these metrics can have limitations, and the choice of the most appropriate metric may depend on the characteristics of your data and the specific goals of your analysis. Visualizations, such as cluster plots and dimensionality reduction techniques like PCA or t-SNE, can also provide insights into the quality of the clustering results.