Implementation of expectation-maximization algorithm in R

An implemenation of expectation-maximization (EM) in R

Suppose that a mixed distribution consists of two underlying normal distributions. The hidden variables are the mean and variance of these two underlying distributions. In this tutorial, I will provide an example to show how we can use the EM algorithm to estimate the true values of these hidden parameters.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  1. Let these be the true parameters of the two normal distributions:
red_mean = 3
red_std = 0.8

blue_mean = 7
blue_std = 2

red = rnorm(200,red_mean, red_std)
blue = rnorm(200,blue_mean, blue_std)

both_colours = c(red, blue) %>% sort() 
plot(red,rep(0,length(red)),col="red",pch=16,ylab = "distribution 1")

plot(blue,rep(0,length(red)),col="blue",pch=16,ylab = "distribution 2")

plot(both_colours,rep(0,length(both_colours)),col="purple",pch=16,ylab = "mixed distribution")

  1. We need three functions to calculate the weighted probability, estimate the mean and the standard deviation.
# estimate the weight for each data point
weight_of_colour<-function(colour_likelihood, total_likelihood)
  {
    return(colour_likelihood/total_likelihood)
}

#estimate the mean
estimate_mean<-function(data, weight){
    return(sum(data*weight)/sum(weight))
}

#estimate the standard deviation
estimate_std<-function(data, weight, mean){
    variance = sum(weight*(data-mean)^2)/sum(weight)
    return(sqrt(variance))
}  

# make the intermediate plot for each iteration
plot_guesses<-function(red_mean_guess, blue_mean_guess, red_std_guess, blue_std_guess, alpha=1){
    points(x, dnorm(x,red_mean_guess, red_std_guess), col=alpha(alpha = alpha,colour = 'red'))
    points(x, dnorm(x,blue_mean_guess, blue_std_guess), col=alpha(alpha = alpha,colour = 'blue'))
    r_height = dnorm(red_mean_guess,red_mean_guess, red_std_guess)
    b_height = dnorm(blue_mean_guess,blue_mean_guess, blue_std_guess)
    segments(x0=red_mean_guess, x1=red_mean_guess, y=0, y1=r_height, col=alpha(alpha = alpha,colour = 'red'))
    segments(x0=blue_mean_guess, x1=blue_mean_guess, y0=0, y1=b_height, col=alpha(alpha = alpha,colour = 'blue'));

}
  1. The first step is to initiate starting parameters:
red_mean_guess = 1.5
blue_mean_guess = 8

ini_m=c(red_mean_guess,blue_mean_guess)

red_std_guess = 2
blue_std_guess = 1.7

ini_std=c(red_std_guess,blue_std_guess)
  1. We then iterate between the steps of expectation and maximization using a loop function.
N_ITER = 200

alphas = seq(0.1, 1, length.out=N_ITER) 
lo = floor(min(both_colours)) - 1
hi = ceiling(max(both_colours)) + 1
x = seq(lo, hi, length.out=500)
plot(both_colours, rep(0,length(both_colours)), pch=16, col='purple',ylim = c(0,0.6))



for(i in 0:N_ITER){
    
# Expectation step

    likelihood_of_red = dnorm(both_colours,red_mean_guess, red_std_guess)
    likelihood_of_blue = dnorm(both_colours,blue_mean_guess, blue_std_guess)

    red_weight = weight_of_colour(likelihood_of_red, likelihood_of_red+likelihood_of_blue)
    blue_weight = weight_of_colour(likelihood_of_blue, likelihood_of_red+likelihood_of_blue)
# maximization step
    
    red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)
    blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)

    red_mean_guess = estimate_mean(both_colours, red_weight)
    blue_mean_guess = estimate_mean(both_colours, blue_weight)

    plot_guesses(red_mean_guess, blue_mean_guess, red_std_guess, blue_std_guess, alpha=alphas[i])

}

5. After 200 loops, here is a comparison of the true and estimated parameters:

        true_r_m=mean(red)
        true_b_m=mean(blue)
        
        est_r_m=red_mean_guess
        est_b_m=blue_mean_guess
        
        true_r_s=sqrt(var(red))
        true_b_s=sqrt(var(blue))
        
        est_r_s=red_std_guess
        est_b_s=blue_std_guess
        
        df<-data.frame(true_mean=c(true_r_m,true_b_m),inital_mean=ini_m, estimated_mean=c(est_r_m,est_b_m),true_std=c(true_r_s,true_b_s),initial_std=ini_std, estimated_std=c(est_r_s,est_b_s))
        print(df)
##   true_mean inital_mean estimated_mean true_std initial_std estimated_std
## 1     2.915         1.5          2.944   0.7557         2.0         0.760
## 2     6.737         8.0          6.726   1.9773         1.7         2.013

The results clearly showed a significant improvement over the initial parameters.

Associate Professor

My research explores how different factors are involved in regulating gene transcription that leads to the development of acute myeloid leukemia and other human cancers.

comments powered by Disqus

Related