In this post we will implement the K-Means clustering algorithm in Python. You will not need to know any Python libraries such as numpy or pandas, as this will be done in vanilla Python. The key takeaways here are:
- The algorithm by itself is naive and is not guaranteed to converge on the best-fit centroid locations
- We need an optimizer of sorts to find an acceptable centroid fit
Setup
To generate your own fake data sets and try out a basic implementation of the k-means algorithm, download and run the following file: kmeans.py
The goal of the K-means clustering algorithm is to fit a given set of data points into K specified groups or clusters. For example, say we have the following xy-coordinate data:
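To make the walkthrough concrete, here is one way such toy data could be generated. The make_fake_clusters helper and its parameters below are assumptions for illustration, not the exact generator used in kmeans.py:

import random

def make_fake_clusters(n_clusters, points_per_cluster, spread,
                       xmin=0, xmax=100, ymin=0, ymax=100):
    # pick a random center for each cluster,
    # then scatter points around that center
    data = []
    for _ in range(n_clusters):
        cx = random.uniform(xmin, xmax)
        cy = random.uniform(ymin, ymax)
        for _ in range(points_per_cluster):
            x = cx + random.uniform(-spread, spread)
            y = cy + random.uniform(-spread, spread)
            data.append((x, y))
    return data

data = make_fake_clusters(n_clusters=3, points_per_cluster=50, spread=5)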
0) Initialize
The first step in the algorithm is to assign each centroid a random (x,y) coordinate from the data as its initial value.
import random

def get_n_random_samples(data, n):
    return random.sample(data, n)

# k is the number of clusters we want to fit
k_centroids = get_n_random_samples(data, k)
Steps one and two are followed until convergence:
1) For each data point
- Determine which centroid location is closest, centroid_closest
- Add data point to the collection of data points belonging to centroid_closest
# one empty bucket (list of assigned points) per centroid
buckets = []
for i in range(0, len(k_centroids)):
    buckets.append([])

# now we iterate across each data point x_i
# and determine which centroid it's closest to
# by finding which centroid has the least distance
for i in range(0, len(data)):
    x = data[i]
    # initialize distance to the first centroid k_0
    best_centroid_index = 0
    min_distance = distance(x, k_centroids[best_centroid_index])
    for j in range(0, len(k_centroids)):
        k_j = k_centroids[j]
        if distance(x, k_j) < min_distance:
            min_distance = distance(x, k_j)
            best_centroid_index = j
    buckets[best_centroid_index].append(x)
2) For each centroid point
- Update the centroids (x,y) location by taking the average coordinate values of its collection of data points
# now we take the average x and y values
# of the data points assigned to each centroid k
# which are stored in buckets[k]
for i in range(0, len(buckets)):
    k_centroids[i] = average_coords(buckets[i])
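The snippets above rely on two helpers, distance and average_coords, which are not shown here. A minimal sketch of what they could look like, assuming each data point is an (x, y) tuple:

import math

def distance(p, q):
    # Euclidean distance between two (x, y) points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def average_coords(points):
    # mean x and mean y of a list of (x, y) points
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

Note that average_coords as sketched assumes a non-empty bucket; a full implementation has to decide what to do when a centroid ends up with no assigned points.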
3) Repeat
Steps one and two are now executed in a while-loop that runs until the centroid locations are no longer changing.
def kmeans(k, data, xmin = 0, xmax = 100, ymin = 0, ymax = 100):
    # we start off by making a guess
    # for the starting point of each centroid, k_i
    # by assigning it the (x,y) coordinate of a random data point
    k_centroids = get_n_random_samples(data, k)
    buckets = []
    # we need a variable to keep track of whether or not we should keep iterating
    keep_going = True
    while keep_going:
        buckets = []
        for i in range(0, k):
            # re-initialize empty buckets for each centroid
            buckets.append([])
        # now we iterate across each data point x_i
        # and determine which centroid it's closest to
        # by finding which centroid has the least distance
        for i in range(0, len(data)):
            x = data[i]
            # initialize distance to the first centroid k_0
            best_centroid_index = 0
            min_distance = distance(x, k_centroids[best_centroid_index])
            for j in range(0, len(k_centroids)):
                k_j = k_centroids[j]
                if distance(x, k_j) < min_distance:
                    min_distance = distance(x, k_j)
                    best_centroid_index = j
            buckets[best_centroid_index].append(x)
        # copy the current centroid coordinates so we can
        # compare against them after the update below
        old_centroids = list(k_centroids)
        # now we take the average x and y values
        # of the data points assigned to each centroid k
        # which are stored in buckets[k]
        for i in range(0, len(buckets)):
            k_centroids[i] = average_coords(buckets[i])
        # now we compare each of the previous centroid coordinates (old_centroids)
        # with the new ones and check to see if even one of them has changed position,
        # since we want to stop only when none of the centroids have moved
        # ^ this is called convergence
        did_anything_change = False
        for i in range(0, len(old_centroids)):
            old = old_centroids[i]
            new = k_centroids[i]
            if old != new:
                did_anything_change = True
        # if any of the centroids have moved locations
        # then we need to keep iterating
        keep_going = did_anything_change
    return k_centroids
This procedure is guaranteed to converge, meaning that additional iterations of the algorithm no longer result in location updates for any of the centroids. However, the resulting fitted centroid locations may not be well suited to the actual clusters in the data. This is because the procedure we just defined is highly reliant on the initial centroid positions chosen in step 0. This procedure is demonstrated in the figure below.
Results
I encourage you to test out variations of the parameters K-Clusters, N points per cluster, and Cluster spread. You will quickly see that the centroid locations the algorithm converges on are quite often far from ideal. Some centroids end up with no points in their cluster group, while other centroids end up in the middle between two or more clusters that have been assigned to them.
The occurrence of misfits observed in the above plot makes it quite clear how dependent this algorithm is on the randomized starting points for each of the centroids. One way to address this would be to run hundreds, if not thousands, of trials and score each trial with some measurement in order to pick the best fit. Depending on the size of the data and the number of presumed k-clusters, this procedure can prove costly and impractical in many cases.
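As a rough sketch of that brute-force idea, one could call the kmeans function above repeatedly and keep the run with the lowest total within-cluster distance. The scoring function and the trial count below are assumptions for illustration, not part of the original implementation:

def total_within_cluster_distance(k_centroids, data):
    # sum, over all points, of the distance to the nearest centroid;
    # a lower total means a tighter fit
    total = 0.0
    for x in data:
        total += min(distance(x, c) for c in k_centroids)
    return total

def kmeans_with_restarts(k, data, n_trials=100):
    best_centroids = None
    best_score = float("inf")
    for _ in range(n_trials):
        centroids = kmeans(k, data)
        score = total_within_cluster_distance(centroids, data)
        if score < best_score:
            best_score = score
            best_centroids = centroids
    return best_centroids

Even then, the cost grows linearly with the number of trials, which is exactly the impracticality described above.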
It is clear that we need some way to optimize this procedure that does not involve running many instances of the k-convergence algorithm and hoping for the best. To do so requires that we have an optimizer at our disposal, and as it so happens, there are already a number of "good" optimizer algorithms that have been worked out for us. We need only apply them to the problem at hand.
In the next post I will introduce the Adam optimizer and apply it to a much simpler problem before tackling this one. Thanks for reading, and until then!
NOTE: If you’re interested in checking out a numpy based implementation check out this post from a colleague covering the same topics.