# Exercise 1 : Partition matrix
Consider the iris data set. Write R code which produces the partition matrix. Compute the gravity centers of the quantitative variables in the three classes using a matrix formula.
```{r}
library(tidyverse)
data(iris)
```
```{r}
X <- iris[, 1:4]              # quantitative variables
library(nnet)
C <- class.ind(iris$Species)  # partition matrix (one 0/1 column per class)
# Gravity centers: rows of (X'C)' divided by the class sizes diag(C'C)
t(t(X) %*% C) / diag(t(C) %*% C)
```
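The chunk above implements the matrix formula $M = (C^\top C)^{-1} C^\top X$, where $C^\top C$ is the diagonal matrix of class sizes. As a quick sanity check (a sketch using base R's `aggregate`, not part of the exercise), the result should match the per-class column means:
```{r}
# Sanity check: per-class means computed directly should match the matrix formula
aggregate(. ~ Species, data = iris, FUN = mean)
```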
```{r}
image(C)  # visualize the n x K partition matrix (rows = observations, columns = classes)
```
```{r}
# k-means (k = 3) on the two petal variables only
kmeans.res <- iris %>% select(c(-Species, -Sepal.Length, -Sepal.Width)) %>% kmeans(3, nstart = 10)
cluster <- as.factor(kmeans.res$cluster)
centers <- as.data.frame(kmeans.res$centers)
```
```{r}
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() +
  geom_point(data = centers, color = 'coral', size = 4, pch = 21) +
  geom_point(data = centers, color = 'coral', size = 50, alpha = 0.2)
```
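To see how well this two-variable k-means recovers the species, a quick cross-tabulation (not required by the exercise):
```{r}
# Contingency table of k-means clusters against the true species labels
table(cluster, iris$Species)
```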
# Exercise 2 : The Bell number
1. Show that the number of partitions of $n$ objects satisfies
$$
B_{n+1} = \sum_{k=0}^n \binom{n}{k} B_k \, , \quad B_0 = 1
$$
(Hint: condition on the block containing object $n+1$: if $k$ objects lie outside this block, there are $\binom{n}{k}$ ways to choose them and $B_k$ ways to partition them.)
2. Compute by hand the Bell numbers for $1, 2, 3, 4, 5, 6$ objects.
$B_1 = 1$, $B_2 = 2$, $B_3 = 5$, $B_4 = 15$, $B_5 = 52$, $B_6 = 203$.
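For instance, the recurrence gives
$$
B_3 = \binom{2}{0} B_0 + \binom{2}{1} B_1 + \binom{2}{2} B_2 = 1 + 2 + 2 = 5 \, .
$$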
3. Write an R program which computes the Bell number for $n$ objects.
```{r}
bell_number <- function(n){
  if(n < 0){stop("n must be a non-negative integer")}
  if(n <= 1){return(1)}  # B_0 = B_1 = 1
  res <- 0
  # Recurrence: B_n = sum_{k=0}^{n-1} choose(n-1, k) * B_k
  for(k in 0:(n-1)){
    res <- res + choose(n = n-1, k = k) * bell_number(k)
  }
  return(res)
}
```
```{r}
bell_number(17)
```
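The recursive version recomputes the same subproblems many times, so its running time grows exponentially with $n$. A minimal iterative sketch (my addition, using the same recurrence to tabulate $B_0,\dots,B_n$ bottom-up):
```{r}
bell_number_dp <- function(n){
  if(n < 0){stop("n must be a non-negative integer")}
  B <- numeric(n + 1)
  B[1] <- 1  # B[m + 1] stores B_m, so B[1] is B_0 = 1
  if(n >= 1){
    for(m in 1:n){
      # B_m = sum_{k=0}^{m-1} choose(m-1, k) * B_k
      B[m + 1] <- sum(choose(m - 1, 0:(m - 1)) * B[1:m])
    }
  }
  B[n + 1]
}
bell_number_dp(17)  # should agree with bell_number(17)
```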
# Exercise 3 : Between-Within Variance relation
Consider $n$ points from $\mathbb{R}^p$ with a partition into $K$ classes of sizes $n_1,\dots,n_K$. Let us note $\hat\mu_k$ the gravity center of class $k$ and $\hat\mu$ the gravity center of the entire cloud of points.
Show that
$$
\sum_k\sum_{i\in k} \| x_i- \hat\mu_k \|^2 + \sum_k n_k\| \hat\mu_k - \hat\mu \|^2 = \sum_i \|x_i - \hat\mu \|^2
$$
(Hint: write $x_i - \hat\mu = (x_i - \hat\mu_k) + (\hat\mu_k - \hat\mu)$, expand the squared norm, and note that the cross term vanishes when summed within class $k$ because $\sum_{i\in k}(x_i - \hat\mu_k) = 0$.)
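A numerical check of the identity on the iris data (a sketch, not part of the exercise; the `within`/`between`/`total` names are mine):
```{r}
X  <- as.matrix(iris[, 1:4])
cl <- iris$Species
mu <- colMeans(X)                        # global gravity center
within <- sum(sapply(levels(cl), function(k){
  Xk <- X[cl == k, , drop = FALSE]
  sum(sweep(Xk, 2, colMeans(Xk))^2)      # squared distances to the class center
}))
between <- sum(sapply(levels(cl), function(k){
  Xk <- X[cl == k, , drop = FALSE]
  nrow(Xk) * sum((colMeans(Xk) - mu)^2)  # n_k times squared distance between centers
}))
total <- sum(sweep(X, 2, mu)^2)          # total sum of squares
all.equal(within + between, total)       # should be TRUE
```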
# Exercise 4 : Clustering of the crabs (library MASS)
1. Load the crabs dataset from the MASS library.
```{r}
library(MASS)
data(crabs)
```
2. Plot the dataset using pairs() with a color for each species and a different symbol per sex.
```{r}
pairs(crabs[, 4:8], col = as.numeric(crabs$sp), pch = as.numeric(crabs$sex) + 16)  # color by species, symbol by sex
```
```{r}
pairs(crabs[, 4:8], col = c("yellow", "red")[crabs$sp], pch = as.numeric(crabs$sex) + 16)  # explicit colors per species
```
3. Cluster the dataset reduced to its quantitative variables into four clusters using k-means.
```{r}
Kmeans.res <- crabs[,4:8] %>% kmeans(4,nstart = 1)
Kmeans.res
```
```{r}
# Code the four 'natural' classes (sex x species) as labels 1 to 4:
# build a 2 x 2 lookup table, index it by the two factors, read the diagonal
TrueClasses <- matrix(1:4, 2, 2)
rownames(TrueClasses) <- levels(crabs$sex)  # rows are indexed by sex below
colnames(TrueClasses) <- levels(crabs$sp)   # columns are indexed by species
TrueClasses <- diag(TrueClasses[crabs$sex, crabs$sp])
```
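A quick check that each (species, sex) combination gets its own label:
```{r}
# Each column of the table should put all its mass on a single class label
table(TrueClasses, paste(crabs$sp, crabs$sex))
```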
```{r}
table(Kmeans.res$cluster,TrueClasses)
```
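Beyond the raw contingency table, an agreement index summarizes the match in a single number; a sketch assuming the `mclust` package is available for its `adjustedRandIndex` function:
```{r}
# Adjusted Rand index between the k-means partition and the natural classes
# (assumes the mclust package is installed)
library(mclust)
adjustedRandIndex(Kmeans.res$cluster, TrueClasses)
```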
4. Run the algorithm with 1000 different initializations and keep track of the within sum of squares.
```{r}
WSS <- rep(0, 1000)
for(i in 1:1000){
  # one k-means run with a single random initialization
  WSS[i] <- (crabs[, 4:8] %>% kmeans(4, nstart = 1))$tot.withinss
}
plot(WSS)
```
```{r}
summary(WSS)
```
```{r}
hist(WSS, breaks = 50, freq = TRUE)
```
5. Comment on the result.
6. Divide all quantitative variables by the most correlated variable to produce a new dataset.
```{r}
crabs2 <- crabs[, -c(1, 2, 3)]         # keep the five quantitative variables
crabs2 <- (crabs2 / crabs[, 6])[, -3]  # divide by CL (column 6), then drop the constant CL/CL column
colnames(crabs2) <- c("FL / CL", "RW / CL", "CW / CL", "BD / CL")
head(crabs2)
```
```{r}
pairs(crabs2)
```
7. Compare the partitions obtained using k-means with the 'natural' partition. Comment.
```{r}
Kmeans.res2 <- crabs2 %>% kmeans(4,nstart = 1)
Kmeans.res2
```
```{r}
table("Prediction" = Kmeans.res2$cluster,"True label"=TrueClasses)
```
```{r}
pairs(crabs2,col=TrueClasses)
pairs(crabs2,col=Kmeans.res2$cluster)
```
```{r}
res <- kmeans(crabs2, 4)
# Represent the data in the (2, 4) plane with density contours and class centers
z <- kde2d(crabs2[, 2], crabs2[, 4])
contour(z)
# Add the data points
points(crabs2[, c(2, 4)], col = c("blue", "orange")[crabs$sp], pch = c(20, 21)[crabs$sex])
# Add the class centers
points(res$centers[, c(2, 4)], cex = 3, pch = 21, bg = "red")
```
8. Try to cluster the data into 1 to 20 groups. Plot the within sum of squares as a function of the number of clusters. Comment on the figure.
```{r}
# Best within sum of squares over 10 initializations, for k = 1 to 20 clusters
WSSkcluster <- rep(0, 20)
for (k in 1:20) {
  WSSmin <- Inf
  # 10 different initializations
  for (i in 1:10) {
    res <- kmeans(crabs2, k)
    if (res$tot.withinss < WSSmin) {
      WSSmin <- res$tot.withinss
    }
  }
  WSSkcluster[k] <- WSSmin
}
plot(1:20, WSSkcluster, type = "b", xlab = "Number of clusters", ylab = "Within sum of squares")
```
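Note that `kmeans` performs this best-of-several-restarts loop internally via its `nstart` argument, so the inner loop above can be written as a one-liner (an equivalent sketch):
```{r}
# Equivalent: let kmeans keep the best of 10 random starts for each k
WSSkcluster2 <- sapply(1:20, function(k) kmeans(crabs2, k, nstart = 10)$tot.withinss)
```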