To implement the k-means algorithm, you can follow these steps (see: https://www.youtube.com/watch?v=2lZZ_FzlIJY):
Initialize Centroids:
- Randomly select k data points from the dataset as the initial centroids.
- Alternatively, you can use a more advanced initialization technique such as k-means++ to choose better-separated initial centroids (a sketch follows below).
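As a rough illustration, here is a minimal sketch of k-means++ seeding (the helper name kmeans_pp_init is a hypothetical choice for this example, not part of any library):

import numpy as np

def kmeans_pp_init(data, k):
    # Pick the first centroid uniformly at random
    centroids = [data[np.random.choice(data.shape[0])]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((data[:, np.newaxis] - np.array(centroids)) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # which favors points far from the existing centroids
        centroids.append(data[np.random.choice(data.shape[0], p=d2 / d2.sum())])
    return np.array(centroids)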
Assign Data Points to Clusters:
- For each data point in the dataset, compute its distance to each centroid (e.g., Euclidean distance) or its similarity (e.g., cosine similarity).
- Assign each data point to the cluster with the nearest centroid.
Update Centroids:
- For each cluster, calculate the mean (centroid) of all the data points assigned to that cluster.
- Update the centroids to the newly calculated means.
Repeat the Assignment and Update Steps:
- Alternate between assigning data points to clusters and updating centroids until convergence.
- Convergence is typically declared when the centroids no longer change significantly (within a small tolerance), or after a maximum number of iterations.
Here's a Python implementation of the k-means algorithm using Euclidean distance:
import numpy as np

def kmeans(data, k, max_iterations=100):
    # Initialize centroids by sampling k distinct data points
    centroids = data[np.random.choice(data.shape[0], k, replace=False)]
    for _ in range(max_iterations):
        # Euclidean distance from every point to every centroid
        # (shape: k x n), then assign each point to its nearest centroid
        distances = np.sqrt(((data - centroids[:, np.newaxis]) ** 2).sum(axis=2))
        clusters = np.argmin(distances, axis=0)
        # Update each centroid to the mean of the points assigned to it
        new_centroids = np.array([data[clusters == i].mean(axis=0) for i in range(k)])
        # Stop once the centroids have (numerically) stopped moving;
        # np.allclose tolerates floating-point noise, unlike exact equality
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, clusters
To use the kmeans function:
# Assuming 'data' is your input dataset and 'k' is the desired number of clusters
centroids, clusters = kmeans(data, k)
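As a quick end-to-end check, here is a minimal sketch that runs kmeans on synthetic 2-D data (the blob locations, spread, and sample counts are arbitrary illustrative choices):

import numpy as np

# Three well-separated 2-D blobs, 50 points each
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

centroids, clusters = kmeans(data, k=3)
print(centroids)              # roughly (0,0), (5,5), (0,5), up to label order
print(np.bincount(clusters))  # roughly 50 points per cluster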
If you want to use cosine similarity instead of Euclidean distance for text clustering, you can modify the distance calculation step:
def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def kmeans_cosine(data, k, max_iterations=100):
    # Initialize centroids by sampling k distinct data points
    centroids = data[np.random.choice(data.shape[0], k, replace=False)]
    for _ in range(max_iterations):
        # Cosine similarity of every point to every centroid (shape: n x k),
        # then assign each point to the most similar centroid
        similarities = np.array([[cosine_similarity(data[i], centroid)
                                  for centroid in centroids]
                                 for i in range(data.shape[0])])
        clusters = np.argmax(similarities, axis=1)
        # Update each centroid to the mean of its assigned points, then
        # renormalize so centroids stay on the unit sphere (as in spherical k-means)
        new_centroids = np.array([data[clusters == i].mean(axis=0) for i in range(k)])
        new_centroids /= np.linalg.norm(new_centroids, axis=1, keepdims=True)
        # Stop once the centroids have (numerically) stopped moving
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, clusters
Note that this implementation assumes the input data has been L2-normalized before applying k-means with cosine similarity; the centroids are likewise renormalized after each update so they remain unit-length.
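A minimal sketch of that normalization step (assuming data holds one row vector per item and no row is all zeros):

import numpy as np

# L2-normalize each row so every vector has unit length
data = data / np.linalg.norm(data, axis=1, keepdims=True)

centroids, clusters = kmeans_cosine(data, k)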
Keep in mind that this is a basic implementation of k-means; there are various optimizations and variations (e.g., handling empty clusters, k-means++ seeding, mini-batch updates) that can be applied depending on the specific requirements of your problem.
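For real workloads, a library implementation is usually preferable to hand-rolled code; here is a minimal sketch using scikit-learn's KMeans (assuming scikit-learn is installed; n_init and random_state are arbitrary illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(data)   # cluster index for each row of data
centroids = km.cluster_centers_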