Saturday, November 8, 2008

Clustering - The EM algorithm

In the Gaussian mixture model-based clustering, each cluster is represented by a Gaussian distribution. The entire dataset is modeled by a mixture (a linear combination) of these distributions.

The EM (Expectation Maximization) algorithm is used in practice to find the “optimal” parameters of the distributions that maximize the likelihood function.

The number of clusters is a parameter of the algorithm. But we can also detect the “optimal” number of clusters by evaluating several values, i.e. testing 1 cluster, 2 clusters, etc. and choosing the best one (which maximizes the likelihood or another criterion such as AIC or BIC).

Keywords: clustering, expectation maximization algorithm, gaussian mixture model
Components: EM-Clustering, K-Means, EM-Selection, scatterplot
Tutorial: en_Tanagra_EM_Clustering.pdf
Dataset: two_gaussians.xls
Reference:
Wikipédia (en) -- Expectation-maximization algorithm