Friday, July 14, 2017

Kohonen map with R

This tutorial complements the course material on the Kohonen map, or self-organizing map (June 2017). First, we highlight two important aspects of the approach: its ability to summarize the available information in a two-dimensional space; and its combination with a cluster analysis method, which links the topological representation (and the reading one can make of it) to the interpretation of the groups obtained from the clustering algorithm. We use the R software and the “kohonen” package (Wehrens and Buydens, 2007). Second, we carry out a comparative study of the quality of the partitioning against the one obtained with the K-means algorithm. We use an external evaluation, i.e. we compare the clustering results with pre-established classes. This procedure is often used in research to evaluate the performance of clustering methods. It is meaningful when applied to artificial data for which the true class membership is known. We use the K-Means and Kohonen-Som components of Tanagra.
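To make the idea of external evaluation concrete, here is a minimal sketch in Python. It cross-tabulates cluster labels against known classes and computes the purity, which is only one common external measure; the tutorial itself may rely on a different one. The toy labels and the function names are hypothetical and serve only as an illustration.

```python
from collections import Counter

def contingency_table(true_classes, cluster_labels):
    """Count how many instances of each true class fall in each cluster."""
    table = {}
    for cls, clu in zip(true_classes, cluster_labels):
        table.setdefault(clu, Counter())[cls] += 1
    return table

def purity(true_classes, cluster_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    table = contingency_table(true_classes, cluster_labels)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(true_classes)

# Hypothetical toy data: 9 instances with known classes (assumption, not the waveform data).
true_classes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
cluster_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]

table = contingency_table(true_classes, cluster_labels)
score = purity(true_classes, cluster_labels)

for cluster_id in sorted(table):
    print(cluster_id, dict(table[cluster_id]))
print("purity:", round(score, 3))
```

A purity close to 1 means that each cluster is dominated by a single known class. Purity increases mechanically with the number of clusters, so it is best used to compare partitions that have the same number of groups.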

This tutorial is based on Shane Lynn's article on the R-bloggers website (Lynn, 2014). I extended it by introducing the intermediate calculations, which help to understand the meaning of the charts, and by conducting the comparative study.

Keywords: som, self organizing map, kohonen network, data visualization, dimensionality reduction, cluster analysis, clustering, hierarchical agglomerative clustering, hac, two-step clustering, R software, kohonen package, k-means, external evaluation, heatmaps
Components: KOHONEN-SOM
Tutorial: Kohonen map with R
Program and dataset: waveform - som
References:
Tanagra tutorial, "Self-organizing map (slides)", June 2017.
Tanagra Tutorial, "Self-organizing map (with Tanagra)", July 2009.

Saturday, July 8, 2017

Cluster analysis with Python - HAC and K-Means

This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). The aim is to identify groups of cheeses that are homogeneous with respect to these properties. We inspect and test two approaches using two Python procedures: the Hierarchical Agglomerative Clustering algorithm (SciPy package); and the K-Means algorithm (scikit-learn package).
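As a rough illustration of the two procedures, the sketch below runs HAC with SciPy and K-Means with scikit-learn, then cross-tabulates the two partitions. Several points are assumptions rather than the tutorial's actual choices: the data are synthetic (29 rows, 9 columns, standing in for the cheese table), the variables are standardized, the Ward criterion is used, and three groups are requested.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cheese table (assumption): 29 rows, 9 numeric columns,
# drawn around three well-separated centers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(10, 9)),
    rng.normal(loc=6.0, scale=1.0, size=(10, 9)),
    rng.normal(loc=12.0, scale=1.0, size=(9, 9)),
])

# Standardize so that every variable has the same weight in the distances.
Z = StandardScaler().fit_transform(X)

# HAC: build the tree with the Ward criterion, then cut it into 3 groups.
tree = linkage(Z, method="ward")
hac_labels = fcluster(tree, t=3, criterion="maxclust")  # labels are 1, 2, 3

# K-Means with the same number of groups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
km_labels = km.labels_  # labels are 0, 1, 2

# Cross-tabulate the two partitions (rows: HAC groups, columns: K-Means groups).
crosstab = np.zeros((3, 3), dtype=int)
for h, k in zip(hac_labels, km_labels):
    crosstab[h - 1, k] += 1
print(crosstab)
```

When the two methods agree, each row of the cross-tabulation has a single non-zero cell, up to a permutation of the group numbers. On real data the agreement is usually only partial, and the off-diagonal cells show which instances move between groups.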

One of the contributions of this tutorial is that we previously conducted the same analysis with R, following the same steps. We can therefore compare the commands used and the results provided by the available procedures. We observe that these tools behave comparably and are interchangeable in this context.

Keywords: python, scipy, scikit-learn, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, principal component analysis, PCA
Tutorial: hac and k-means with Python
Dataset and source code: hac_kmeans_with_python.zip
References:
Marie Chavent, Teaching Page, University of Bordeaux.
Tanagra Tutorials, "Cluster analysis with R - HAC and K-Means", July 2017.

Thursday, July 6, 2017

Cluster analysis with R - HAC and K-Means

This tutorial describes a cluster analysis process. We deal with a set of cheeses (29 instances) characterized by their nutritional properties (9 variables). The aim is to identify groups of cheeses that are homogeneous with respect to these properties.

We inspect and test two approaches using two procedures of the R software: the Hierarchical Agglomerative Clustering algorithm (hclust); and the K-Means algorithm (kmeans).

The data file "fromage.txt" comes from the teaching page of Marie Chavent of the University of Bordeaux. The excellent course materials and corrected exercises (commented R code) available on her website complement this tutorial, which is intended primarily as a simple guide to using the R software for cluster analysis.

Keywords: R software, cluster analysis, clustering, hac, hierarchical agglomerative clustering, k-means, fpc package, principal component analysis, PCA
Components: hclust, kmeans, kmeansruns
Tutorial: hac and k-means with R
Dataset and source code: hac_kmeans_with_r.zip
References:
Marie Chavent, Teaching Page, University of Bordeaux.

Monday, July 3, 2017

k-medoids clustering (slides)

K-medoids is a partitioning-based clustering algorithm. It is related to k-means but, instead of using the centroid as the reference point of a cluster, it uses the medoid, i.e. the individual whose total distance to the other points of its cluster is the smallest. One of the main consequences of this approach is that the resulting partition is less sensitive to outliers.
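The difference between the two reference points can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the five points are hypothetical, the Euclidean distance is used, and the code computes the reference point of a single cluster rather than running the full k-medoids algorithm.

```python
import math

def centroid(points):
    """Mean of the coordinates; usually not an actual data point."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points):
    """Actual data point with the smallest total distance to all the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster: four close points and one outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (50.0, 50.0)]

center = centroid(cluster)
representative = medoid(cluster)

print("centroid:", center)
print("medoid:", representative)
```

The outlier drags the centroid far away from the four close points, whereas the medoid remains one of them. This is the intuition behind the lower sensitivity to outliers.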

This course material describes the algorithm. Then, we focus on the silhouette tool which can be used to determine the right number of clusters, a recurring open problem in cluster analysis.
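For reference, the silhouette of an individual i is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes it in plain Python. The points and the two candidate partitions are hypothetical, the Euclidean distance is an assumption, and singletons receive the value 0 by the usual convention.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values

def mean_silhouette(points, labels):
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)

# Hypothetical data: two compact groups, and two candidate partitions of them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]
poor = [0, 0, 1, 1, 1, 0]

print("good partition:", round(mean_silhouette(points, good), 3))
print("poor partition:", round(mean_silhouette(points, poor), 3))
```

To select the number of clusters, one common heuristic is to run the clustering for several candidate values and keep the one with the highest mean silhouette, while also inspecting the silhouette plot of each solution.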

Keywords: cluster analysis, clustering, unsupervised learning, partitioning method, relocation approach, medoid, PAM, partitioning around medoids, CLARA, clustering large applications, silhouette, silhouette plot
Slides: Cluster analysis - k-medoids algorithm
References:
Wikipedia, "k-medoids".