Data analysis -- TD 2: Clustering

# Exercise 1: Partition and matrix

Consider the Iris data set. Write an R program which produces the partition matrix. Compute the gravity centers of the quantitative variables in the three classes using a matrix formula.

```{r}
library(tidyverse)
data(iris)
```

The partition matrix $C$ is the $n \times K$ indicator matrix whose entry $(i,k)$ equals 1 when observation $i$ belongs to class $k$. The class sizes are the diagonal of $C^\top C$, so the gravity centers are the class-wise column sums $X^\top C$ divided by those sizes.

```{r}
data(iris)
X <- iris[, 1:4]
library(nnet)
C <- class.ind(iris$Species)  # partition (indicator) matrix
# Gravity centers: class-wise sums divided by the class sizes
t(t(X) %*% C) / diag(t(C) %*% C)
```

```{r}
image(C)
```

```{r}
kmeans.res <- iris %>%
  select(c(-Species, -Sepal.Length, -Sepal.Width)) %>%
  kmeans(3, nstart = 10)
cluster <- as.factor(kmeans.res$cluster)
centers <- as.data.frame(kmeans.res$centers)
```

```{r}
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, color = "coral", size = 4, pch = 21) +
  geom_point(data = centers, color = "coral", size = 50, alpha = 0.2)
```

# Exercise 2: The Bell number

1. Show that the number of partitions of $n$ objects satisfies
    $$
    B_{n+1} = \sum_{k=0}^n \binom{n}{k} B_k \, , \qquad B_0 = 1 .
    $$
    (In a partition of $n+1$ objects, choose the $k$ objects that do not share a block with object $n+1$, in $\binom{n}{k}$ ways, then partition them in $B_k$ ways.)
2. Compute manually the Bell number for 1, 2, 3, 4, 5, 6 objects: 1, 2, 5, 15, 52, 203.
3. Write an R program which computes the Bell number for $n$ objects.

```{r}
bell_number <- function(n) {
  if (n < 0) stop("n must be a non-negative integer")
  if (n <= 1) return(1)
  res <- 0
  # Recurrence: B_n = sum_{k=0}^{n-1} choose(n-1, k) * B_k
  for (k in 0:(n - 1)) {
    res <- res + choose(n = n - 1, k = k) * bell_number(k)
  }
  return(res)
}
```

```{r}
bell_number(17)
```

# Exercise 3: Between-Within variance relation

Consider $n$ points of $\mathbb{R}^p$ with a partition into $K$ classes of sizes $n_1, \dots, n_K$. Let $\hat\mu_k$ denote the gravity center of class $k$ and $\hat\mu$ the gravity center of the entire cloud of points. Show that
$$
\sum_k \sum_{i \in k} \| x_i - \hat\mu_k \|^2 + \sum_k n_k \| \hat\mu_k - \hat\mu \|^2 = \sum_i \| x_i - \hat\mu \|^2 .
$$
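A sketch of the proof: write $x_i - \hat\mu = (x_i - \hat\mu_k) + (\hat\mu_k - \hat\mu)$ for $i \in k$ and expand the squared norm:
$$
\sum_i \| x_i - \hat\mu \|^2
= \sum_k \sum_{i \in k} \| x_i - \hat\mu_k \|^2
+ \sum_k n_k \| \hat\mu_k - \hat\mu \|^2
+ 2 \sum_k \Big\langle \sum_{i \in k} (x_i - \hat\mu_k), \; \hat\mu_k - \hat\mu \Big\rangle .
$$
The cross term vanishes because $\sum_{i \in k} (x_i - \hat\mu_k) = 0$ by definition of the gravity center $\hat\mu_k$, which gives the total = within + between decomposition.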
# Exercise 4: Clustering of the crabs (library MASS)

1. Load the crabs dataset from library MASS.

```{r}
library(MASS)
data(crabs)
```

2. Plot the dataset using pairs(), with one color per species and a different symbol per sex.

```{r}
pairs(crabs[, 4:8], col = as.numeric(crabs$sp), pch = as.numeric(crabs$sex) + 16)
```

```{r}
pairs(crabs[, 4:8], col = c("Yellow", "Red")[crabs$sp], pch = as.numeric(crabs$sex) + 16)
```

3. Cluster the dataset, reduced to its quantitative variables, into four clusters using k-means.

```{r}
Kmeans.res <- crabs[, 4:8] %>% kmeans(4, nstart = 1)
Kmeans.res
```

```{r}
# Encode the 'natural' partition (species x sex) as labels 1 to 4
TrueClasses <- matrix(1:4, 2, 2)
colnames(TrueClasses) <- levels(crabs$sex)
rownames(TrueClasses) <- levels(crabs$sp)
TrueClasses <- diag(TrueClasses[crabs$sp, crabs$sex])
```

```{r}
table(Kmeans.res$cluster, TrueClasses)
```

4. Run the algorithm with 1000 different initializations and keep track of the within sum of squares.

```{r}
WSS <- rep(0, 1000)
for (i in 1:1000) {
  WSS[i] <- (crabs[, 4:8] %>% kmeans(4, nstart = 1))$tot.withinss
}
plot(WSS)
```

```{r}
summary(WSS)
```

```{r}
hist(WSS, breaks = 50, freq = TRUE)
```

5. Comment on the result.

6. Divide all quantitative variables by the most correlated variable to produce a new dataset.

```{r}
crabs2 <- crabs[, -c(1, 2, 3)]
crabs2 <- (crabs2 / crabs[, 6])[, -3]  # divide by CL, then drop the constant CL/CL column
colnames(crabs2) <- c("FL / CL", "RW / CL", "CW / CL", "BD / CL")
head(crabs2)
```

```{r}
pairs(crabs2)
```

7. Compare the partition obtained using k-means with the 'natural' partition. Comment.

```{r}
Kmeans.res2 <- crabs2 %>% kmeans(4, nstart = 1)
Kmeans.res2
```

```{r}
table("Prediction" = Kmeans.res2$cluster, "True label" = TrueClasses)
```

```{r}
pairs(crabs2, col = TrueClasses)
pairs(crabs2, col = Kmeans.res2$cluster)
```

```{r}
res <- kmeans(crabs2, 4)
# Plot the data in the plane of variables 2 and 4, with a density estimate and the centers
z <- kde2d(crabs2[, 2], crabs2[, 4])
contour(z)
# Add the points
points(crabs2[, c(2, 4)], col = c("blue", "orange")[crabs$sp], pch = c(20, 21)[crabs$sex])
# Add the class centers
points(res$centers[, c(2, 4)], cex = 3, pch = 21, bg = "red")
```

8. Try to cluster the data into 1 to 20 groups. Plot the within sum of squares as a function of the number of clusters. Comment on the figure.

```{r}
WSSkcluster <- rep(0, 20)
for (k in 1:20) {
  WSSmax <- Inf
  # Run 10 different initializations and keep the best (smallest) WSS
  for (i in 1:10) {
    res <- kmeans(crabs2, k)
    if (res$tot.withinss < WSSmax) {
      WSSmax <- res$tot.withinss
    }
  }
  WSSkcluster[k] <- WSSmax
}
plot(1:20, WSSkcluster, type = "b",
     xlab = "Number of clusters", ylab = "Within sum of squares")
```
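As a complement to question 7, the agreement between the k-means partition and the 'natural' one can also be quantified with the adjusted Rand index, which equals 1 for identical partitions and is close to 0 for random agreement. A minimal sketch, assuming the mclust package is installed:

```{r}
# Sketch (assumes the mclust package is available)
library(mclust)
# Compare the k-means labels on the rescaled data with the species x sex classes
adjustedRandIndex(Kmeans.res2$cluster, TrueClasses)
```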