Data Clustering¶
This post is based on (Gan).
Data clustering is the process of dividing a set of records into subsets, called clusters, such that records in the same cluster are similar and records in different clusters are quite distinct.
A typical clustering process involves the following five steps:
- pattern representation;
- dissimilarity measure definition;
- clustering;
- data abstraction;
- assessment of output.
In this interactive session, we will be reviewing the k-means algorithm.
The k-means Algorithm¶
The k-means algorithm is the most popular and the simplest partitional clustering algorithm. It has many variations. This exercise will review the standard algorithm and several implementations (possibly for different variations).
Description of the Algorithm¶
Let $X = \left\{ \mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{n - 1} \right\}$ be a numeric dataset containing $n$ records and $k$ be an integer in $\{1, 2, \cdots, n\}$. The k-means algorithm tries to divide the dataset into $k$ clusters $C_0$, $C_1$, $\cdots$, and $C_{k - 1}$ by minimizing the following objective function: $$E = \sum_{i = 0}^{k - 1}\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i),$$ where $D(\cdot, \cdot)$ is a distance measure and $\boldsymbol{\mu}_i$ is the mean of cluster $C_i$, i.e., $$\boldsymbol{\mu}_i = \frac{1}{\lvert C_i \rvert} \sum_{\mathbf{x} \in C_i}\mathbf{x}$$
Let $\gamma_i$ be the cluster membership of record $\mathbf{x}_i$ for $i = 0$, $1$, $\cdots$, $n - 1$. That is, $\gamma_i = j$ if $\mathbf{x}_i$ belongs to cluster $C_j$. Then the objective function can be rewritten as: $$E = \sum_{i = 0}^{n - 1}D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$$
To minimize the objective function, the k-means algorithm employs an iterative process. At the beginning, the k-means algorithm selects $k$ random records from the dataset $X$ as initial cluster centers.
Suppose $\boldsymbol{\mu}_0^{(0)}$, $\boldsymbol{\mu}_1^{(0)}$, $\cdots$, and $\boldsymbol{\mu}_{k - 1}^{(0)}$ are the initial cluster centers. Based on these cluster centers, the k-means algorithm updates the cluster memberships $\gamma_0^{(0)}$, $\gamma_1^{(0)}$, $\cdots$, $\gamma_{n - 1}^{(0)}$ as follows: \begin{equation} \label{membership-update} \gamma_i^{(0)} = \underset{0 \leq j \leq k - 1}{\operatorname{argmin}} D(\mathbf{x}_i, \boldsymbol{\mu}_j^{(0)}) \end{equation} where $\operatorname{argmin}$ is the argument that minimizes the distance. That is, $\gamma_i^{(0)}$ is set to the index of the cluster to which $\mathbf{x}_i$ has the smallest distance.
Based on the cluster memberships $\gamma_0^{(0)}$, $\gamma_1^{(0)}$, $\cdots$, $\gamma_{n - 1}^{(0)}$, the k-means algorithm updates the cluster centers as follows: \begin{equation} \label{center-update} \boldsymbol{\mu}_j^{(1)} = \frac{1}{\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|} \sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i, \qquad\text{$j = 0$, $1$, $\cdots$, $k - 1$} \end{equation}
The k-means algorithm then keeps alternating between the membership update \ref{membership-update} and the center update \ref{center-update}, stopping when the memberships no longer change or a maximum number of iterations is reached.
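Before unpacking the notation piece by piece, here is a minimal NumPy sketch of the whole loop just described. It is only an illustration: the function name kmeans, the use of squared Euclidean distance for $D$, the stopping rule (memberships stop changing), and the assumption that no cluster ever becomes empty are choices made for this sketch, not requirements of the algorithm as stated above.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means sketch: random initial centers, then alternate the
    membership update and the center update until memberships stabilize."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Select k random records from X as the initial cluster centers.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    memberships = None
    for _ in range(max_iter):
        # Membership update: squared distance from every record to every center.
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_memberships = distances.argmin(axis=1)
        if memberships is not None and np.array_equal(new_memberships, memberships):
            break  # memberships stopped changing
        memberships = new_memberships
        # Center update: component-wise mean of the records in each cluster
        # (assumes every cluster keeps at least one record).
        centers = np.array([X[memberships == j].mean(axis=0) for j in range(k)])
    return centers, memberships

The rest of this post walks through each piece of this loop on a tiny example.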
Reviewing the math above with some examples¶
Everything above can be understood if we take it apart piece by piece and focus on the ideas behind the math, so don't be scared by the symbols. Let's review some notation first.
$\lvert C_{i} \rvert$, in this context, means the cardinality of the set $C_{i}$, that is, the size of the set. In mathematics, the cardinality of a set is a measure of the "number of elements of the set".
S_1 = set([1, 2, 3, 4, 5])  # or {1, 2, 3, 4, 5}
len(S_1)  # the cardinality |S_1| is 5
$\mathbf{x} \in C_1$ means that the vector (notice the bold font face) $\mathbf{x}$ belongs to (or is in) the set $C_1$. For example,
$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \in C_1$$ means that the vector with two components $\left[1, 2\right]^T$ belongs to the cluster $C_1$.
Centroid definition¶
# Numpy ndarrays are often used in python to represent vectors.
import numpy as np
x = np.array([1, 2])
# Notice we are informally using a 2-dimensional numpy array
# to represent all the elements of the Set C_1
C_1 = np.array([[1, 2],
[2, 5],
[4, 7]])
x in C_1  # informal membership check; numpy's `in` is not a strict row-by-row test
$$\sum_{\mathbf{x} \in C_1}\mathbf{x}$$ means: the sum (component-wise) of all the vectors that belong to the first cluster $C_1$.
# Use help(np.sum), or run np.sum? in a Jupyter cell, to get more information about np.sum
np.sum(C_1, 0) # or C_1.sum(0)
The centroid (or cluster center) of the first cluster $C_1$ is computed with a scalar multiplication of this sum by $\frac{1}{\lvert C_1 \rvert}$. For our example:
1 / len(C_1)  # the scalar 1 / |C_1|
1 / len(C_1) * C_1.sum(0)  # the centroid of C_1
In other words, the centroid formula is simply computing the arithmetic mean (component-wise) of all the vectors of a given cluster:
- First component: $$\frac{1 + 2 + 4}{3} = 2.33333333$$
- Second component: $$\frac{2 + 5 + 7}{3} = 4.66666667$$
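NumPy can also compute this component-wise mean directly (continuing with the C_1 array defined above):

# The centroid of C_1 is the component-wise arithmetic mean of its rows.
np.mean(C_1, axis=0)  # same result as 1 / len(C_1) * C_1.sum(0)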
Objective function¶
First form of the objective function¶
Let's review: $$E = \sum_{i = 0}^{k - 1}\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i)$$
The inner summation, $$\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i)$$ says something like: compute the sum of all the distances for the i-th cluster. Notice the requirement in the summation index: $\mathbf{x} \in C_i$ (only consider those vectors that belong to cluster $C_i$).
The outer summation, $$E = \sum_{i = 0}^{k - 1} \text{sum of distances for the $i$-th cluster}$$ simply says, that you have to accumulate the inner computation for all the clusters.
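As a tiny illustration of the double sum, the snippet below computes $E$ for two clusters, using the C_1 array defined earlier plus a second, made-up cluster C_2, and taking $D$ to be the squared Euclidean distance (both C_2 and that choice of $D$ are assumptions made only for this example):

# A second, made-up cluster so the example has k = 2 clusters.
C_2 = np.array([[10, 10],
                [12, 11]])
clusters = [C_1, C_2]

def D(x, mu):
    # Squared Euclidean distance between a record and a cluster center.
    return np.sum((x - mu) ** 2)

E = 0.0
for C_i in clusters:              # outer sum: over the k clusters
    mu_i = C_i.mean(axis=0)       # centroid of the i-th cluster
    for x in C_i:                 # inner sum: records belonging to C_i
        E += D(x, mu_i)
E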
Second form of the objective function and cluster membership notation¶
The second form of the objective function, uses more compact notation:
$$E = \sum_{i = 0}^{n - 1}D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$$ It says: compute the sum of the distances between each vector and its corresponding cluster center, for all the vectors in the dataset. Notice that $D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$ implies that we have a way to compute the cluster membership function $\gamma_i$.
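The compact form gives the same value of $E$ once the memberships are stored explicitly, one index $\gamma_i$ per record (this cell continues the small two-cluster example started above):

# All records stacked in one array, plus one membership index per record.
X_example = np.vstack([C_1, C_2])
gamma = np.array([0, 0, 0, 1, 1])      # record i belongs to cluster gamma[i]
centers = np.array([C_1.mean(axis=0), C_2.mean(axis=0)])

# E = sum over all records i of D(x_i, mu_{gamma_i})
E = sum(D(X_example[i], centers[gamma[i]]) for i in range(len(X_example)))
E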
Cluster membership update¶
Let's review \ref{membership-update}:
$$\gamma_i^{(0)} = \underset{0 \leq j \leq k - 1}{\operatorname{argmin}} D(\mathbf{x}_i, \boldsymbol{\mu}_j^{(0)})$$ The text above says:
$\gamma_{i}^{(0)}$ is set to the index of the cluster to which $\mathbf{x}_{i}$ has the smallest distance.
Notice that we are assuming an iterative process. The superscript $^{(0)}$ introduced here refers to the iteration number: this is cluster membership update number $0$.
The notation $$\underset{0 \leq j \leq k - 1}{\operatorname{argmin}} \text{distance to $j$-th cluster}$$ means that we have to choose the cluster index $j$ (that could be any of the integer values from $0$ to $k - 1$) that minimizes the distance to current record $\mathbf{x}_i$.
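In code, the argmin is just the index of the smallest entry in the array of distances from $\mathbf{x}_i$ to every current center (again continuing the small example; record 3 is used only as an illustration):

# Distances from one record x_i to every current cluster center.
x_i = X_example[3]
distances_to_centers = np.array([D(x_i, centers[j]) for j in range(len(centers))])
gamma_i = distances_to_centers.argmin()   # index of the closest center
gamma_i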
Cluster center update¶
Now, let's review cluster center update formula \ref{center-update} notation:
$$\boldsymbol{\mu}_j^{(1)} = \frac{1}{\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|} \sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i, \qquad\text{$j = 0$, $1$, $\cdots$, $k - 1$}$$We are again using the membership function $\gamma_i$ to simplify notation. The outer index $j$ says that we have to re-compute the cluster center for each one of the $k$ clusters.
For a specific value of $j$ (we are re-computing the center of the $j$-th cluster), we simply apply the cluster center formula, but considering only those records that belong to this cluster:
$$\sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i$$
Notice the requirement in the summation index: sum every record, as long as $\gamma_{i} = j$ (as long as the record belongs to the $j$-th cluster).
$\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|$ is simply the number of elements of the $j$-th cluster. It says: the cardinality of the set of all indexes $i$ such that the $i$-th record belongs to cluster $j$.
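With the memberships stored in an array, the whole center update is a boolean selection followed by a component-wise mean (continuing the example, and assuming no cluster ends up empty):

# Re-compute every center from the records currently assigned to it.
k = 2
new_centers = np.array([X_example[gamma == j].mean(axis=0) for j in range(k)])
# Note: len(X_example[gamma == j]) is exactly |{i | gamma_i = j}|.
new_centers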
%matplotlib inline
import pandas as pd
input_file = "data/600points.csv"
# Input file has labeled records. 'label' is one of 1, 2 or 3.
labeled_df = pd.read_csv(input_file,
header=None,
names=['id', 'x', 'y', 'label'])
labeled_df.dtypes
# To visualize "named" colors see:
# https://matplotlib.org/1.4.3/examples/color/named_colors.html
label_to_color = {1: 'red', 2: 'deepskyblue', 3: 'green'}
colors = labeled_df.loc[:, 'label'].map(label_to_color)
labeled_df.plot.scatter('x', 'y', grid=True, c=colors);
# Unlabeled data (prior to "clustering").
df = labeled_df.loc[:,['x', 'y']]
df.plot.scatter('x', 'y', grid=True);
To practice your Python¶
- Use the k-means implementation from scikit-learn to run the algorithm for k = 3 clusters and a maximum of 100 iterations. You can install it with pip ($ pip install scikit-learn) or anaconda ($ conda install scikit-learn).
- Use Python's docopt or argparse (available in the standard library) to parse the following command line options (see the sketch after this list):
  Allowed options:
    --help                produce help message
    --datafile arg        the data file
    --k arg (=3)          number of clusters
    --maxiter arg (=100)  maximum number of iterations
- Implement the algorithm yourself and compare your results with those obtained with scikit-learn.
- Use numba to try to improve the performance of your implementation.
- Use matplotlib (or another library of your preference) to visualize the data and your results. Fine-tune your plot so that assessment of the results is easier.
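Below is a minimal sketch of the first two items, assuming the column layout of data/600points.csv shown earlier; the script name kmeans_cli.py, the argparse choices, and the fixed random_state are illustrative, not a reference solution:

# kmeans_cli.py -- illustrative sketch combining argparse and scikit-learn.
import argparse
import pandas as pd
from sklearn.cluster import KMeans

parser = argparse.ArgumentParser(description="Run k-means on a CSV of 2-D points")
parser.add_argument("--datafile", required=True, help="the data file")
parser.add_argument("--k", type=int, default=3, help="number of clusters")
parser.add_argument("--maxiter", type=int, default=100, help="maximum number of iterations")
args = parser.parse_args()

points = pd.read_csv(args.datafile, header=None, names=["id", "x", "y", "label"])
model = KMeans(n_clusters=args.k, max_iter=args.maxiter, n_init=10, random_state=0)
assignments = model.fit_predict(points[["x", "y"]])
print(model.cluster_centers_)

It could be run as, e.g., python kmeans_cli.py --datafile data/600points.csv --k 3 --maxiter 100; comparing model.cluster_centers_ and assignments with the output of your own implementation (up to a permutation of the cluster labels) covers the remaining item.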
To practice with .NET (C# or F#)¶
- Use the k-means implementation from version 0.2 of ML.NET as a reference implementation for comparison (correctness of results, performance, etc).
- Use .NET's docopt port to parse command line options for your program (see analogous section for Python).
- Implement your own version of k-means using your preferred .NET language (C#, F#, VB.NET).
- Visualize your results with XPlot, FSharp.Charting, Live-Charts or any other good quality library of your preference.
To practice with GPUs and Python/R/C¶
- Take a look at kmcuda. See what the possibilities are for your favorite programming language.
- Visualize your results with the best library for your language of choice.
To practice with GPUs or Multicore CPUs and Haskell¶
- Take a look at Accelerate and the different supported backends (e.g. Multicore CPUs and GPUs).
- Study and explain the k-means sample.
- Compare the implementations and performance.
- Visualize using matplotlib bindings for Haskell.
To practice with Scala and Spark¶
Try to use the existing Scala implementations:
- With Spark's MLlib
- With Spire
- Study and explain.
- Hard: study if it is possible to improve performance using sbt-javacpp bindings for CUDA and previous implementation code/algorithms.
- Visualize using Vegas or any other good quality library of your preference.
Bibliography
Gan, G., Data Clustering in C++: An Object-Oriented Approach, Chapman and Hall/CRC.