Homework 5. Statistics 771, Spring 09
Posted online Wednesday April 1/09
Due in class Monday April 13/09

1. The file `test.txt' holds records from a genomics study on a certain kind of DNA binding event. There are n=2808 records y_1,...,y_n from events ordered along one chromosome. To find interesting regions, consider an HMM involving a latent Markov chain Z_1,...,Z_n where each Z_i takes one of three values {1,2,3}, say, corresponding to the ordered means mu1 < mu2 < mu3 of a normal observation component (with common variance sigma^2 across the components).

a. Develop the Baum-Welch algorithm to estimate the transition matrix { p_jk } and the other parameters (mu1, mu2, mu3, sigma) by maximum likelihood. If, instead of estimating sigma, you fix it at some prespecified value, how are the estimates of the other parameters affected?

b. Implement the Viterbi algorithm to recover the latent states. Plot the findings. How is the reconstruction affected by the value of sigma?

2. Consider two random variables (X,Y). Their correlation is

   rho = E{ [X-E(X)][Y-E(Y)] } / sqrt{ var(X) var(Y) }.

It is well known that independence of X and Y implies rho=0, but rho=0 does not imply independence. (A notable exception is jointly normally distributed data.) Recently a new concept of correlation, the so-called `distance correlation', has been introduced by G Szekely; it equals zero if and only if X and Y are independent, regardless of the joint distribution. Analogous to Pearson's correlation coefficient for estimating rho, there is a sample-based distance correlation statistic. It is computed as follows from a sample of pairs { (X_i, Y_i): i=1,...,n }:

First consider all pairwise absolute differences

   a_{ij} = | X_i - X_j |
   b_{ij} = | Y_i - Y_j |

separately in the two variables.
Then form

   a_{i,dot}   = (1/n) sum_{j=1}^n a_{ij},
   a_{dot,j}   = (1/n) sum_{i=1}^n a_{ij},
   a_{dot,dot} = (1/n)^2 sum_{i=1}^n sum_{j=1}^n a_{ij},

and do so similarly for the b's, and then compute

   A_{ij} = a_{ij} - a_{i,dot} - a_{dot,j} + a_{dot,dot}
   B_{ij} = b_{ij} - b_{i,dot} - b_{dot,j} + b_{dot,dot}

The empirical distance covariance between X and Y is

   edcov(A,B) = [ (1/n)^2 sum_{i=1}^n sum_{j=1}^n A_{ij} B_{ij} ]^{1/2}

and the empirical distance correlation is

   edcorr(A,B) = edcov(A,B) / sqrt[ edcov(A,A) * edcov(B,B) ]

as long as the denominator is positive, else it equals 0.

Write an R function to compute distance correlation. Test it on several benchmark data sets.

Reference: Szekely et al. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35, 2769-2794.
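
For checking an R implementation of problem 2, the following is a minimal Python sketch of the same arithmetic, not a reference solution. It uses only the standard library and follows the formulas above term by term. The function names (double_centered, edcov, distance_correlation) are made up for this illustration and do not come from the paper or from any package. Because a_{ij} is symmetric, the row means and column means coincide, so only the row means are computed.

```python
import math


def double_centered(values):
    """Return A_{ij} = a_{ij} - a_{i,dot} - a_{dot,j} + a_{dot,dot}
    for the pairwise absolute differences a_{ij} = |v_i - v_j|."""
    n = len(values)
    dist = [[abs(vi - vj) for vj in values] for vi in values]
    row_means = [sum(row) / n for row in dist]
    # dist is symmetric, so column mean j equals row mean j.
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def edcov(A, B):
    """Empirical distance covariance: sqrt of (1/n)^2 * sum_ij A_ij * B_ij."""
    n = len(A)
    total = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))
    # The sum is nonnegative in exact arithmetic; guard against rounding error.
    return math.sqrt(max(total, 0.0)) / n


def distance_correlation(x, y):
    """Empirical distance correlation of two equal-length numeric samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    A = double_centered(x)
    B = double_centered(y)
    denom = math.sqrt(edcov(A, A) * edcov(B, B))
    if denom <= 0.0:
        return 0.0  # convention from the problem statement
    return edcov(A, B) / denom


if __name__ == "__main__":
    x = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(distance_correlation(x, [2.0 * v + 1.0 for v in x]))  # exact linear relation
    print(distance_correlation(x, [v * v for v in x]))          # Pearson correlation is 0 here
    print(distance_correlation(x, [3.0] * len(x)))              # constant y, zero denominator
```

Three small cases are useful sanity checks for any implementation: an exact linear relation should give a value of 1 up to rounding, a constant variable should give 0 through the zero-denominator rule, and y = x^2 on a grid symmetric about zero should give a strictly positive value even though Pearson's correlation is 0 there. The sketch stores full n-by-n matrices, which is fine for small benchmark sets but uses memory of order n^2.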