# Implementation of expectation-maximization algorithm in R

Suppose that a mixture distribution is composed of two underlying normal distributions. The component assignment of each point is hidden, and the unknown parameters are the mean and standard deviation of each component. In this tutorial, I will provide an example showing how the EM algorithm can recover these parameters from the mixed data.
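Concretely, for a two-component mixture with equal mixing proportions (which this tutorial assumes implicitly, since its weight formula omits them), EM alternates between these two updates, written here for the red component (the blue one is symmetric):

```latex
% E-step: responsibility (weight) of the red component for point x_i
w_i^{\mathrm{red}} = \frac{\mathcal{N}(x_i \mid \mu_{\mathrm{red}}, \sigma_{\mathrm{red}})}
                          {\mathcal{N}(x_i \mid \mu_{\mathrm{red}}, \sigma_{\mathrm{red}})
                         + \mathcal{N}(x_i \mid \mu_{\mathrm{blue}}, \sigma_{\mathrm{blue}})}

% M-step: weighted parameter updates
\mu_{\mathrm{red}} = \frac{\sum_i w_i^{\mathrm{red}} x_i}{\sum_i w_i^{\mathrm{red}}},
\qquad
\sigma_{\mathrm{red}}^2 = \frac{\sum_i w_i^{\mathrm{red}} (x_i - \mu_{\mathrm{red}})^2}{\sum_i w_i^{\mathrm{red}}}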

The code below relies on two packages, loaded up front:

```
library(dplyr)  # provides the %>% pipe used below
library(scales) # provides alpha(), used to fade colours across iterations
```

- Let these be the true parameters of the two normal distributions:

```
red_mean = 3
red_std = 0.8
blue_mean = 7
blue_std = 2
red = rnorm(200, red_mean, red_std)
blue = rnorm(200, blue_mean, blue_std)
both_colours = c(red, blue) %>% sort()
plot(red, rep(0, length(red)), col = "red", pch = 16, ylab = "distribution 1")
```

```
plot(blue, rep(0, length(blue)), col = "blue", pch = 16, ylab = "distribution 2")
```

```
plot(both_colours, rep(0, length(both_colours)), col = "purple", pch = 16, ylab = "mixed distribution")
```

- We need three helper functions: one to compute each point's membership weight, one to estimate a weighted mean, and one to estimate a weighted standard deviation. A fourth function plots the current guesses at each iteration.

```
# estimate the weight of a colour for each data point
weight_of_colour <- function(colour_likelihood, total_likelihood) {
  return(colour_likelihood / total_likelihood)
}

# estimate the mean as a weighted average of the data
estimate_mean <- function(data, weight) {
  return(sum(data * weight) / sum(weight))
}

# estimate the standard deviation from the weighted squared deviations
estimate_std <- function(data, weight, mean) {
  variance = sum(weight * (data - mean)^2) / sum(weight)
  return(sqrt(variance))
}

# draw the intermediate density curves for each iteration
plot_guesses <- function(red_mean_guess, blue_mean_guess, red_std_guess, blue_std_guess, alpha = 1) {
  points(x, dnorm(x, red_mean_guess, red_std_guess), col = alpha(alpha = alpha, colour = 'red'))
  points(x, dnorm(x, blue_mean_guess, blue_std_guess), col = alpha(alpha = alpha, colour = 'blue'))
  r_height = dnorm(red_mean_guess, red_mean_guess, red_std_guess)
  b_height = dnorm(blue_mean_guess, blue_mean_guess, blue_std_guess)
  segments(x0 = red_mean_guess, x1 = red_mean_guess, y0 = 0, y1 = r_height, col = alpha(alpha = alpha, colour = 'red'))
  segments(x0 = blue_mean_guess, x1 = blue_mean_guess, y0 = 0, y1 = b_height, col = alpha(alpha = alpha, colour = 'blue'))
}
```
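As a quick sanity check (not part of the original tutorial), the weighted estimators reduce to the ordinary mean and the population standard deviation when all weights are equal. The definitions are repeated here so the snippet runs on its own:

```r
# repeated definitions so this snippet is self-contained
estimate_mean <- function(data, weight) {
  sum(data * weight) / sum(weight)
}
estimate_std <- function(data, weight, mean) {
  sqrt(sum(weight * (data - mean)^2) / sum(weight))
}

data <- c(1, 2, 3, 4)
weight <- rep(1, 4)                 # equal weights
m <- estimate_mean(data, weight)    # 2.5, the ordinary mean
s <- estimate_std(data, weight, m)  # population sd: sqrt(1.25)
```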

- The first step is to initialize the parameters with starting guesses:

```
red_mean_guess = 1.5
blue_mean_guess = 8
ini_m=c(red_mean_guess,blue_mean_guess)
red_std_guess = 2
blue_std_guess = 1.7
ini_std=c(red_std_guess,blue_std_guess)
```

- We then iterate between the expectation and maximization steps in a loop:

```
N_ITER = 200
alphas = seq(0.1, 1, length.out = N_ITER)
lo = floor(min(both_colours)) - 1
hi = ceiling(max(both_colours)) + 1
x = seq(lo, hi, length.out = 500)
plot(both_colours, rep(0, length(both_colours)), pch = 16, col = 'purple', ylim = c(0, 0.6))
for (i in 1:N_ITER) {
  # Expectation step: how likely is each point under the current guesses?
  likelihood_of_red = dnorm(both_colours, red_mean_guess, red_std_guess)
  likelihood_of_blue = dnorm(both_colours, blue_mean_guess, blue_std_guess)
  red_weight = weight_of_colour(likelihood_of_red, likelihood_of_red + likelihood_of_blue)
  blue_weight = weight_of_colour(likelihood_of_blue, likelihood_of_red + likelihood_of_blue)
  # Maximization step: update the means first, then the standard
  # deviations around the new means
  red_mean_guess = estimate_mean(both_colours, red_weight)
  blue_mean_guess = estimate_mean(both_colours, blue_weight)
  red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)
  blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)
  plot_guesses(red_mean_guess, blue_mean_guess, red_std_guess, blue_std_guess, alpha = alphas[i])
}
```
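A fixed 200 iterations works fine here, but a common refinement is to stop once the log-likelihood stabilizes. Here is a minimal self-contained sketch of that idea (not part of the original tutorial), using the same mixture and starting guesses, equal mixing weights, and data simulated inline with a seed for reproducibility:

```r
set.seed(1)                                  # reproducible toy data
x <- c(rnorm(200, 3, 0.8), rnorm(200, 7, 2)) # same mixture as above

mu  <- c(1.5, 8)    # starting means
sig <- c(2, 1.7)    # starting standard deviations
old_ll <- -Inf
for (iter in 1:500) {
  # E-step: responsibilities under equal mixing weights
  l1 <- dnorm(x, mu[1], sig[1])
  l2 <- dnorm(x, mu[2], sig[2])
  w1 <- l1 / (l1 + l2)
  w2 <- 1 - w1
  # M-step: weighted mean and standard-deviation updates
  mu  <- c(sum(w1 * x) / sum(w1), sum(w2 * x) / sum(w2))
  sig <- c(sqrt(sum(w1 * (x - mu[1])^2) / sum(w1)),
           sqrt(sum(w2 * (x - mu[2])^2) / sum(w2)))
  # stop when the observed-data log-likelihood stops improving
  ll <- sum(log(0.5 * l1 + 0.5 * l2))
  if (abs(ll - old_ll) < 1e-8) break
  old_ll <- ll
}
```

On this data the loop typically converges well before 500 iterations, with `mu` and `sig` close to the true values (3, 7) and (0.8, 2).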

- After 200 iterations, here is a comparison of the true and estimated parameters:

```
true_r_m = mean(red)
true_b_m = mean(blue)
est_r_m = red_mean_guess
est_b_m = blue_mean_guess
true_r_s = sd(red)
true_b_s = sd(blue)
est_r_s = red_std_guess
est_b_s = blue_std_guess
df <- data.frame(true_mean = c(true_r_m, true_b_m),
                 initial_mean = ini_m,
                 estimated_mean = c(est_r_m, est_b_m),
                 true_std = c(true_r_s, true_b_s),
                 initial_std = ini_std,
                 estimated_std = c(est_r_s, est_b_s))
print(df)
```

```
##   true_mean initial_mean estimated_mean true_std initial_std estimated_std
## 1     2.915          1.5          2.944   0.7557         2.0         0.760
## 2     6.737          8.0          6.726   1.9773         1.7         2.013
```

The estimated means and standard deviations closely match the sample values computed from the true labels, a clear improvement over the initial guesses.