namespace tf {
/** @page kmeans k-means Clustering
We study a fundamental clustering problem in unsupervised learning, k-means clustering.
We will begin by discussing the problem formulation and then learn how to
write a parallel k-means algorithm.
@tableofcontents
@section KMeansProblemFormulation Problem Formulation
k-means clustering uses @em centroids, k randomly initialized points in the data, and assigns every data point to its nearest centroid.
After every point has been assigned, each centroid is moved to the average of all the points assigned to it.
We describe the k-means algorithm in the following steps:
- Step 1: initialize k random centroids
- Step 2: for every data point, find the nearest centroid (by L2 distance or another metric) and assign the point to it
- Step 3: for every centroid, move the centroid to the average of the points assigned to it
- Step 4: go back to Step 2 until the centroids converge (no changes over the last few iterations) or the maximum number of iterations is reached
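Step 4 leaves the exact convergence test open. One possible sketch, assuming convergence means that no centroid moved by more than a small tolerance between two consecutive iterations, is shown below. The helper name `centroids_converged` and the tolerance `eps` are illustrative and are not part of the code on this page.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every centroid moved by at most `eps` along each axis
// between the previous iteration (old_mx/old_my) and the current one (mx/my).
// Assumes all four vectors have the same length K.
bool centroids_converged(
  const std::vector<float>& old_mx, const std::vector<float>& old_my,
  const std::vector<float>& mx, const std::vector<float>& my,
  float eps
) {
  for (std::size_t k = 0; k < mx.size(); ++k) {
    if (std::fabs(mx[k] - old_mx[k]) > eps || std::fabs(my[k] - old_my[k]) > eps) {
      return false;
    }
  }
  return true;
}
```

A caller would keep a copy of the centroids from the previous iteration and stop iterating once this check returns true.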
The algorithm is illustrated as follows:
@image html images/kmeans_1.png
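The listings on this page call a helper named `L2` without defining it. A minimal sketch, assuming it returns the squared Euclidean distance between two 2D points, might look like the following. Squaring preserves the nearest-centroid ordering and avoids a square root. `nearest_centroid` is a hypothetical wrapper around the inner loop of Step 2.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean (L2) distance between (x1, y1) and (x2, y2).
// Assumption: the page's L2 helper behaves like this; its definition is not shown.
inline float L2(float x1, float y1, float x2, float y2) {
  return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2);
}

// Hypothetical helper: index of the centroid in (mx, my) closest to (x, y).
inline int nearest_centroid(
  float x, float y, const std::vector<float>& mx, const std::vector<float>& my
) {
  float best_d = std::numeric_limits<float>::max();
  int best_k = 0;
  for (std::size_t k = 0; k < mx.size(); ++k) {
    const float d = L2(x, y, mx[k], my[k]);
    if (d < best_d) {
      best_d = d;
      best_k = static_cast<int>(k);
    }
  }
  return best_k;
}
```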
A sequential implementation of k-means is described as follows:
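@code{.cpp}
#include <algorithm>
#include <iostream>
#include <limits>
#include <vector>
// sequential implementation of k-means on a CPU
// N: number of points
// K: number of clusters
// M: number of iterations
// px/py: 2D point vector
// L2: distance between two 2D points (helper assumed to be defined elsewhere)
void kmeans_seq(
  int N, int K, int M, const std::vector<float>& px, const std::vector<float>& py
) {
  std::vector<int> c(K);
  std::vector<float> sx(K), sy(K), mx(K), my(K);
  // initial centroids
  std::copy_n(px.begin(), K, mx.begin());
  std::copy_n(py.begin(), K, my.begin());
  // k-means iteration
  for(int m=0; m<M; m++) {
    // clear the storage
    std::fill_n(sx.begin(), K, 0.0f);
    std::fill_n(sy.begin(), K, 0.0f);
    std::fill_n(c.begin(), K, 0);
    // find the best k (cluster id) for each point
    for(int i=0; i<N; ++i) {
      float x = px[i];
      float y = py[i];
      float best_d = std::numeric_limits<float>::max();
      int best_k = 0;
      for (int k = 0; k < K; ++k) {
        const float d = L2(x, y, mx[k], my[k]);
        if (d < best_d) {
          best_d = d;
          best_k = k;
        }
      }
      sx[best_k] += x;
      sy[best_k] += y;
      c [best_k] += 1;
    }
    // update the centroid
    for(int k=0; k<K; k++) {
      const int count = std::max(1, c[k]);  // turn 0/0 to 0/1
      mx[k] = sx[k] / count;
      my[k] = sy[k] / count;
    }
  }
  // print the k centroids found
  for(int k=0; k<K; ++k) {
    std::cout << "centroid " << k << ": " << mx[k] << ' ' << my[k] << '\n';
  }
}
@endcode
@section ParallelKMeansUsingCPUs Parallel k-means using CPUs
The second step of the k-means algorithm, assigning every point to the nearest centroid,
is highly parallelizable across individual points.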
We can create a @em parallel-for task to run parallel iterations.
@code{.cpp}
std::vector<int> best_ks(N);  // nearest centroid of each point
unsigned P = 12;  // 12 partitioned tasks
// update cluster
taskflow.for_each_index(0, N, 1, [&](int i){
  float x = px[i];
  float y = py[i];
  float best_d = std::numeric_limits<float>::max();
  int best_k = 0;
  for (int k = 0; k < K; ++k) {
    const float d = L2(x, y, mx[k], my[k]);
    if (d < best_d) {
      best_d = d;
      best_k = k;
    }
  }
  best_ks[i] = best_k;
});
@endcode
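To give a feel for what partitioning the index range means, the following sketch splits the range [0, N) into P contiguous chunks of near-equal size. It is an illustration only: `chunk_bounds` is a hypothetical helper, and Taskflow's own partitioning strategy may differ.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: half-open range [begin, end) of chunk p (0 <= p < P)
// when N indices are split into P contiguous chunks. The first N % P chunks
// receive one extra index, so chunk sizes differ by at most one.
inline std::pair<int, int> chunk_bounds(int N, int P, int p) {
  const int base = N / P;
  const int extra = N % P;
  const int begin = p * base + std::min(p, extra);
  const int end = begin + base + (p < extra ? 1 : 0);
  return {begin, end};
}
```

Each chunk can be handed to a separate task without synchronization, because iteration `i` writes only `best_ks[i]` and reads the centroids without modifying them.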
The third step, moving every centroid to the average of its assigned points, is also parallelizable
across individual centroids.
However, since k is typically small, a single task is sufficient for this update.
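@code{.cpp}
taskflow.emplace([&](){
  // sum of points
  for(int i=0; i<N; i++) {
    sx[best_ks[i]] += px[i];
    sy[best_ks[i]] += py[i];
    c [best_ks[i]] += 1;
  }
  // average of points
  for(int k=0; k<K; ++k) {
    auto count = std::max(1, c[k]);  // turn 0/0 to 0/1
    mx[k] = sx[k] / count;
    my[k] = sy[k] / count;
  }
});
@endcode
The entire code of the CPU-parallel k-means is shown below.
An additional storage, @c best_ks, records the nearest centroid of each point at each iteration.
@code{.cpp}
// N: number of points
// K: number of clusters
// M: number of iterations
// px/py: 2D point vector
void kmeans_par(
  int N, int K, int M, const std::vector<float>& px, const std::vector<float>& py
) {
  unsigned P = 12;  // 12 partitions of the parallel-for graph
  tf::Executor executor;
  tf::Taskflow taskflow("K-Means");
  std::vector<int> c(K), best_ks(N);
  std::vector<float> sx(K), sy(K), mx(K), my(K);
  // initial centroids
  tf::Task init = taskflow.emplace([&](){
    for(int i=0; i<K; ++i) {
      mx[i] = px[i];
      my[i] = py[i];
    }
  }).name("init");
  // clear the storage
  tf::Task clean_up = taskflow.emplace([&](){
    for(int k=0; k<K; ++k) {
      sx[k] = 0.0f;
      sy[k] = 0.0f;
      c [k] = 0;
    }
  }).name("clean_up");
  // update cluster
  tf::Task pf = taskflow.for_each_index(0, N, 1, [&](int i){
    float x = px[i];
    float y = py[i];
    float best_d = std::numeric_limits<float>::max();
    int best_k = 0;
    for (int k = 0; k < K; ++k) {
      const float d = L2(x, y, mx[k], my[k]);
      if (d < best_d) {
        best_d = d;
        best_k = k;
      }
    }
    best_ks[i] = best_k;
  }).name("parallel-for");
  tf::Task update_cluster = taskflow.emplace([&](){
    for(int i=0; i<N; i++) {
      sx[best_ks[i]] += px[i];
      sy[best_ks[i]] += py[i];
      c [best_ks[i]] += 1;
    }
    for(int k=0; k<K; ++k) {
      auto count = std::max(1, c[k]);  // turn 0/0 to 0/1
      mx[k] = sx[k] / count;
      my[k] = sy[k] / count;
    }
  }).name("update_cluster");
  // condition task: return 0 to loop back to clean_up until M iterations have run
  tf::Task condition = taskflow.emplace([m=0, M]() mutable {
    return (++m < M) ? 0 : 1;
  }).name("converged?");
  init.precede(clean_up);
  clean_up.precede(pf);
  pf.precede(update_cluster);
  condition.precede(clean_up).succeed(update_cluster);
  executor.run(taskflow).wait();
}
@endcode
The taskflow graph is illustrated below: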
@dotfile images/kmeans_2.dot
The scheduler starts with @c init, moves on to @c clean_up, and then enters the
parallel-for task @c parallel-for, which spawns a subflow of 12 workers to perform
parallel iterations.
When @c parallel-for completes, the scheduler updates the cluster centroids and checks whether
they have converged through a condition task.
If not, the condition task directs the scheduler back to @c clean_up and then
@c parallel-for; otherwise, it returns an index that has no successor task, which stops the scheduler.
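The loop described above can be mimicked in plain C++ to see how a condition's return value steers execution. The sketch below does not use Taskflow. `run_loop` and `make_iteration_condition` are hypothetical names, and the sketch assumes the condition simply counts up to M iterations rather than testing centroid movement.

```cpp
#include <functional>

// Hypothetical stand-in for a condition-task loop: `body` plays the role of
// clean_up -> parallel-for -> update_cluster, and `condition` returns the
// index of the successor to run next (0 = loop back, anything else = stop).
// Returns the number of times the body ran.
inline int run_loop(
  const std::function<void()>& body, const std::function<int()>& condition
) {
  int passes = 0;
  do {
    body();
    ++passes;
  } while (condition() == 0);
  return passes;
}

// Iteration-count condition: returns 0 until M passes have completed, then 1.
inline std::function<int()> make_iteration_condition(int M) {
  return [m = 0, M]() mutable { return (++m < M) ? 0 : 1; };
}
```

A convergence test on centroid movement would slot into the same place: the condition would return 0 while the centroids are still moving and a non-zero index once they have settled.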
@section KMeansBenchmarking Benchmarking
Based on the discussion above, we compare the runtime of the sequential and
parallel CPU implementations across various k-means problem sizes
on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz.
| N | K | M | CPU Sequential | CPU Parallel |
| :-: | :-: | :-: | :-: | :-: |
| 10 | 5 | 10 | 0.14 ms | 77 ms |
| 100 | 10 | 100 | 0.56 ms | 86 ms |
| 1000 | 10 | 1000 | 10 ms | 98 ms |
| 10000 | 10 | 10000 | 1006 ms | 713 ms |
| 100000 | 10 | 100000 | 102483 ms | 49966 ms |
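Once the number of points reaches 10K,
the parallel CPU implementation starts to outperform the sequential CPU
implementation.
For smaller inputs, the parallel runtime stays between 77 ms and 98 ms regardless of problem size, so the sequential version is faster.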
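The crossover can be quantified as a speedup, the sequential runtime divided by the parallel runtime. The helper below is an illustrative sketch (the name `speedup` is not part of the benchmark code) that can be applied to the rows of the table.

```cpp
// Speedup of the parallel run over the sequential run. Values above 1 mean
// the parallel implementation is faster; values below 1 mean it is slower.
inline double speedup(double sequential_ms, double parallel_ms) {
  return sequential_ms / parallel_ms;
}
```

For example, `speedup(1006, 713)` is about 1.41 and `speedup(102483, 49966)` is about 2.05, while `speedup(0.14, 77)` is below 0.002, which suggests that a fixed setup cost dominates the parallel version on tiny inputs.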
*/
}