Statistical Computing

UW Madison Statistics 771 Spring 2011

Prof Michael Newton

 


 

Overview

This course introduces graduate students to some basic elements of statistical computing, and computational statistics, more broadly defined. There are no graduate level prerequisites, although students are expected to be familiar with essential features of probability and statistical inference as usually covered in an intermediate undergraduate course. 

Statistical computing concerns the relation between computers and statistical analysis.  Early problems were how to accurately, stably, and efficiently calculate simple statistics, such as sample moments, in the presence of inexact computer arithmetic.  Numerical analysis issues continue to arise in other important computational problems: function evaluation, optimization, integration, and the solution of equations.  One goal of this course is to study algorithms that effectively solve these problems. 

There is some truth to the statement that statistical theory regarding what should be calculated in a data analysis is intimately related to computational concerns about what can be calculated.  A second goal of this course is to present a number of modern, compute-intensive methods for data analysis.

A third goal of this course is to have students practice statistical computing by coding algorithms and studying their properties, and to have students become familiar with the R environment for statistical computing.

 

Format

Lectures TR 2:30-3:45

Fortnightly homework (25 pts)

Quizzes (50 pts)

Project (build R package) (25 pts)

Office hours: as requested

 

Textbooks are not required, but optional texts are:

Computational Statistics, G Givens and J Hoeting, Wiley, 2005

A First Course in Statistical Programming with R, W.J. Braun and D.J. Murdoch

The R Book, M. J. Crawley

Elements of Statistical Computing, R.A. Thisted, Chapman and Hall, 1988

Other materials linked as needed.

 

Syllabus

Numerical methods (basics; floating point issues, representation error, other errors, log-sum-exp trick; power-series and continued fraction approximations to probabilities; pnorm)

Numerical linear algebra (basics; decompositions (Cholesky, QR, SVD); solving linear systems (qr, chol, svd, backsolve) )

Nonlinear optimization (selected topics): motivation from maximum likelihood, Nelder-Mead, coordinate descent (nlminb, optim)

Root finding (selected topics): Newton-Raphson, iteratively reweighted least squares for generalized linear models (glm)

Expectation/Maximization (convex analysis, optimization transfer, mixture models, clustering/nonparametrics)

Probabilistic graphical models (selected topics): hidden Markov models, Baum-Welch algorithm, Viterbi algorithm, graphical representations, message-passing algorithms

Constrained optimization (selected topics): pool-adjacent-violators algorithm

Monte Carlo methods: static/dynamic, non-uniform sampling, multivariate sampling, Markov chain sampling, Gibbs sampling, Metropolis-Hastings, adaptive sampling

Building R packages (selected topics)

Other topics (dictated by student interest)

 

Homework

 

Homework 1 (Due in class Feb 3, 2011)

 

Homework 2 (Due Feb 10).   R data archive for homework (bir.RData)

 

Homework 3 (Due March 10)

 

Homework 4 (Due May 3)

 

Project information, part 2: By Thursday April 7 please send me further information on your R-package project.  Specifically, I want to see some functionality demonstrated in R language code.  It may be an illustration of a function that will be in your package, or some source code that implements elements of interest.  It should also include an R version of a data set that will be used by the package.

 

Lectures

 

Class

Date

Topic

Handouts/code

Citations

1

1/18/11

Introduction: overview of course; algorithms/methods/deployment; fixed precision arithmetic; log-sum-exp trick; power series and continued fractions to compute cumulative standard normal probabilities; R

try1.R (code for normal approximation); norm.pdf (output of R code)

Algorithm 715, Cody, 1993

Cody, 1969

Shenton, 1954

log-sum-exp

continued fractions
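
Below is a minimal R sketch of the log-sum-exp trick from Class 1 above; the function name and the example numbers are hypothetical, not taken from the course code (try1.R).

    # Stable evaluation of log( sum_i exp(lx_i) ): factor out the maximum
    logsumexp <- function(lx) {
      m <- max(lx)
      m + log(sum(exp(lx - m)))
    }

    lx <- c(-1000, -1001, -1002)   # exp() underflows to zero here
    log(sum(exp(lx)))              # -Inf: the naive computation fails
    logsumexp(lx)                  # about -999.59, computed stably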

2

1/20/11

Numerical linear algebra: QR and Cholesky decompositions, backsolve, forward solve and how these solve multiple linear regression with least squares

 

QR

Cholesky
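
A small R sketch related to Class 2 above: least squares computed via the QR decomposition and via the Cholesky factor of the normal equations, checked against lm().  The simulated data are hypothetical.

    set.seed(771)
    n <- 100
    X <- cbind(1, rnorm(n), rnorm(n))
    y <- X %*% c(1, 2, -1) + rnorm(n)

    # QR: X = QR, so least squares reduces to R b = Q'y, solved by backsolve
    qx <- qr(X)
    b_qr <- backsolve(qr.R(qx), crossprod(qr.Q(qx), y))

    # Cholesky of X'X = R'R: forward solve then back solve
    R <- chol(crossprod(X))
    b_chol <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))

    cbind(b_qr, b_chol, coef(lm(y ~ X - 1)))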

3

1/25/11

Singular value decomposition; the problem of finding a direction whose projections are close to the data is equivalent to finding the direction of maximal variance in the projections; the principal components problem; multidimensional scaling. (See the R sketch at the end of this entry.)

 

SVD, see Hastie, Tibshirani and Friedman, 2001, pages 60-62
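
A minimal R sketch for this entry: principal components obtained from the SVD of the centered data matrix, compared with prcomp.  The simulated data are hypothetical.

    set.seed(771)
    X  <- matrix(rnorm(50 * 4), 50, 4) %*% matrix(runif(16), 4, 4)  # correlated columns
    Xc <- scale(X, center = TRUE, scale = FALSE)                    # center the data

    s <- svd(Xc)
    # Right singular vectors are the principal component directions
    # (they agree with prcomp up to sign flips of the columns)
    cbind(svd = s$v[, 1], prcomp = prcomp(X)$rotation[, 1])
    scores <- s$u %*% diag(s$d)   # principal component scores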

4

1/27/11

Regression with many covariates; ridge regression as a shrinkage method, viewed from the SVD perspective; the lasso and coordinate descent by the shooting algorithm. (See the R sketch at the end of this entry.)

 

Tibshirani, 1996
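
A sketch of coordinate descent for the lasso (the shooting algorithm) mentioned in this entry, assuming the objective 0.5*||y - Xb||^2 + lam*sum(|b|); the function names are hypothetical and penalty scaling conventions vary.

    # Soft-thresholding operator
    soft <- function(z, lam) sign(z) * pmax(abs(z) - lam, 0)

    # Cycle through coordinates, each update solving a 1-d lasso problem
    lasso_shoot <- function(X, y, lam, maxit = 500, tol = 1e-8) {
      p <- ncol(X)
      b <- rep(0, p)
      a <- colSums(X^2)
      for (it in 1:maxit) {
        b_old <- b
        for (j in 1:p) {
          r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
          b[j] <- soft(sum(X[, j] * r_j), lam) / a[j]
        }
        if (max(abs(b - b_old)) < tol) break
      }
      b
    }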

5

2/1/11

Optimization by direct search, motivated by likelihood optimization (take logs!); simple grid search; golden section search; the Nelder-Mead algorithm (optim). (See the R sketch at the end of this entry.)

my old notes on NM

Nelder and Mead,  1965

Singer and Nelder, 2009
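
A small R sketch for this entry: direct-search maximum likelihood with optim's Nelder-Mead, working with the negative log-likelihood and a log-scale parameterization so the search is unconstrained.  The gamma example and data are hypothetical.

    set.seed(771)
    x <- rgamma(100, shape = 2, rate = 0.5)

    nll <- function(par, x)
      -sum(dgamma(x, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))

    fit <- optim(c(0, 0), nll, x = x, method = "Nelder-Mead")
    exp(fit$par)    # approximate MLE of (shape, rate)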

6

2/3/11

Solving for roots, motivated by likelihood equations; bisection; scaled iteration; Newton's method; secant method; multivariate Newton's method; quasi-Newton (mentioned); GLMs and iteratively reweighted least squares. (See the R sketch at the end of this entry.)

 

Green, 1984
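
A sketch of iteratively reweighted least squares for logistic regression, checked against glm(); the data are simulated and the function name is hypothetical.

    set.seed(771)
    n <- 200
    X <- cbind(1, rnorm(n))
    y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))

    irls_logit <- function(X, y, maxit = 25, tol = 1e-10) {
      beta <- rep(0, ncol(X))
      for (it in 1:maxit) {
        eta <- drop(X %*% beta)
        mu  <- plogis(eta)
        w   <- mu * (1 - mu)               # working weights
        z   <- eta + (y - mu) / w          # working response
        beta_new <- solve(crossprod(X, w * X), crossprod(X, w * z))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      drop(beta)
    }

    cbind(irls = irls_logit(X, y), glm = coef(glm(y ~ X - 1, family = binomial)))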

7

2/8/11

IRLS continued…this time in the mutant frequency problem with Bernoulli response and complementary log-log link.  Start EM.  First present the concept of an optimization transfer algorithm.  Then define the E and M steps…start the ZIP example.

 

 

8

2/10/11

Convex analysis: convex set in R^d; hyperplanes, the separating hyperplane theorem and the supporting hyperplane theorem; epigraph; convex function; Jensen's inequality; back to the EM algorithm; proof that it ascends, by showing the two optimization transfer properties. Complete the ZIP example. (See the R sketch at the end of this entry.)

 

 
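
A minimal R sketch of EM for the ZIP (zero-inflated Poisson) example completed in this entry; the starting values and simulated data are ad hoc and hypothetical.

    # Latent Z_i = 1 means observation i comes from the point mass at zero
    zip_em <- function(x, maxit = 500, tol = 1e-8) {
      p   <- 0.5 * mean(x == 0)     # crude starting values
      lam <- mean(x[x > 0])
      for (it in 1:maxit) {
        # E-step: posterior probability that each zero is a structural zero
        w <- ifelse(x == 0, p / (p + (1 - p) * exp(-lam)), 0)
        # M-step
        p_new   <- mean(w)
        lam_new <- sum((1 - w) * x) / sum(1 - w)
        done <- abs(p_new - p) + abs(lam_new - lam) < tol
        p <- p_new; lam <- lam_new
        if (done) break
      }
      c(pi = p, lambda = lam)
    }

    set.seed(771)
    z <- rbinom(500, 1, 0.3)
    x <- ifelse(z == 1, 0, rpois(500, 2))
    zip_em(x)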

9

2/15/11

Review for quiz

EM continued. Mixture modeling; self consistency equation.

 

 

10

2/17/11

Quiz 1

 


11

2/22/11

EM continued. Blood type example (multinomial). Show how the EM steps combine to create a mapping theta_{m+1} = Psi( theta_m ). Normal mixture models; issues with identifiability and an unbounded likelihood; shared vs separate component parameters; the use for clustering. Discuss mclust. Take the 1-d case with common variance, and show that when this variance gets small we get hard thresholding and the K-means algorithm. (See the R sketch at the end of this entry.)

 

 
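
A sketch in R of EM for a two-component 1-d normal mixture with common variance, as discussed in this entry; the starting values and simulated data are hypothetical.

    norm_mix_em <- function(x, maxit = 200, tol = 1e-8) {
      p  <- 0.5
      mu <- unname(quantile(x, c(0.25, 0.75)))   # rough starting values
      s2 <- var(x)
      ll_old <- -Inf
      for (it in 1:maxit) {
        # E-step: responsibilities for component 1
        d1 <- p * dnorm(x, mu[1], sqrt(s2))
        d2 <- (1 - p) * dnorm(x, mu[2], sqrt(s2))
        w  <- d1 / (d1 + d2)
        # M-step
        p  <- mean(w)
        mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))
        s2 <- sum(w * (x - mu[1])^2 + (1 - w) * (x - mu[2])^2) / length(x)
        ll <- sum(log(d1 + d2))   # observed-data log-likelihood at the E-step parameters
        if (ll - ll_old < tol) break
        ll_old <- ll
      }
      list(p = p, mu = mu, sigma2 = s2, loglik = ll)
    }

    set.seed(771)
    x <- c(rnorm(150, 0), rnorm(100, 3))
    norm_mix_em(x)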

12

2/24/11

EM continued. Computing standard errors.  First an aside on information in regular models (definition as the variance of the score, its value as negative curvature, the Cramer-Rao bound, MLE variance).  Missing information principle and Louis' method. [Recall i_{actual}(theta,x) = -S'(theta,x), where S is the score from the actual likelihood; i_{complete}(theta,x) = E{ -S'(theta,X,Z) | X=x }; and i_{missing}(theta,x) = E{ -d^2/dtheta^2 log f(Z|X=x,theta) | X=x }.]  Supplemental EM and the fact that the EM mapping Psi has a derivative (we did the 1-d case) that equals i_{missing}(theta_hat) / i_{complete}(theta_hat).  And finally the parametric bootstrap method.  Then we started a final EM example involving a hidden binary Markov chain Z_1,…,Z_n with two transition parameters p and q, and conditionally independent observables X_1,…,X_n having observation components f_0(x) and f_1(x).  We found the objects required in the E/M steps, and indicated possible recursive computation.

 

 

13

3/1/11

Baum-Welch and pushing sums.

First a story about a simple HMM. A factory with N machines making widgets of two types…Z(t) = number of machines making type A widgets at time t; exponential, independent lifetimes of machines plus iid replacement from a pool lead to a particular continuous-time process…indicate how this gives a discrete-time chain when sampled. Analogy to coin tossing in a room and then sampling with replacement periodically. Discuss as a model for the stem cell problem, with inference on the number of machines and machine lifetime of interest…a classic HMM.

Next an aside on numbers. Given a, b, c, ab + ac can be computed using 2 products and one sum; or we can write S=a(b+c), computed in one sum and one product, taking advantage of the particular structure.  Similarly if given a_i, and b_{i,j}, with indices in 1,2,…,m, we can naively compute S=sum_i sum_j a_i b_{ij} using m^2 products and m^2 – 1 sums. Writing it S=sum_i a_i B_i where B_i = sum_j b_{i,j} is slightly different, with m(m-1) sums to get all B_i’s, then m products and another m-1 sums to get S.  Both are O(m^2) [discuss meaning of order notation].  But now consider S= sum_i sum_j sum_k a_i b_{ij} c_{jk}.  Follow the same logic to get naïve computation in O(m^3) and `pushing sums’ in O(m^2).  In general, if z_i takes on m values, and we want to sum S=sum_all h_1(z_1) prod_{j=2}^n h_j( z_{j-1},z_j ) it takes O(m^n) naively [as both m,n->inf] and it takes O(n m^2) by pushing sums. 

Back to Baum-Welch.  Last time we introduced HMMs and showed how EM led to the need to compute things like P[ Z_{i-1}=a, Z_i = b | X=x]. Here we develop the forward and backward recursions on alpha_i(a) = P[ X_1^{i} , Z_i=a ] (joint, forward probability) and beta_i(b) = P[ X_{i+1}^n | Z_i=b ] (conditional, backward probability).  We show that the marginal P[X=x] is a sum requiring pushing sums, and we show how it is computed from the forward and backward probabilities.  And we introduce the smoothed probabilities P[Z_i=a|X=x], noting they are not components of a joint factorization. (See the R sketch at the end of this entry.)

 

Rabiner
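
A minimal R sketch of the scaled forward and backward recursions described in this entry; the function names, the two-state example, and the normal observation components are hypothetical, not course code.

    # P: transition matrix; pi0: initial distribution;
    # emis: function returning an n x m matrix of observation densities f_a(x_i)
    forward_backward <- function(x, P, pi0, emis) {
      n <- length(x); m <- nrow(P)
      e <- emis(x)
      alpha <- matrix(0, n, m); beta <- matrix(0, n, m); cvec <- numeric(n)
      a <- pi0 * e[1, ]; cvec[1] <- sum(a); alpha[1, ] <- a / cvec[1]
      for (i in 2:n) {                       # forward pass, scaled to avoid underflow
        a <- (alpha[i - 1, ] %*% P) * e[i, ]
        cvec[i] <- sum(a); alpha[i, ] <- a / cvec[i]
      }
      beta[n, ] <- 1
      for (i in (n - 1):1)                   # backward pass, on the same scale
        beta[i, ] <- (P %*% (e[i + 1, ] * beta[i + 1, ])) / cvec[i + 1]
      list(loglik = sum(log(cvec)),          # log P[X = x]
           smoothed = alpha * beta)          # row i gives P[Z_i = a | X = x]
    }

    P <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE)
    emis <- function(x) cbind(dnorm(x, 0, 1), dnorm(x, 3, 1))
    set.seed(771)
    x <- c(rnorm(50, 0), rnorm(50, 3))
    fb <- forward_backward(x, P, c(0.5, 0.5), emis)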

14

3/3/11

Optimization continued: Viterbi algorithm

First an aside on numbers: given a_{i,j} with each index in 1, 2, …, m, finding max_{i,j} a_{i,j} can be done in two steps, first maxing over j for each i, and then maxing over i.  Next recall the HMM context: hidden Z=(Z_1,…,Z_n) (each variable takes a finite set of, say, m possible values), modeled as a Markov chain, and data X_1,…,X_n modeled as conditionally independent given Z=z.  Last time we discussed Baum-Welch for estimating the parameters.  Suppose now we want the modal state zhat.  Then zhat is the argmax of h(z) = h_1(z_1) \prod_{j=2}^n h_j( z_{j-1}, z_j ), where each h_j involves an observation component and a transition component.  Since all h's > 0, zhat is also the argmax of g(z) where g = log(h).  Write g as a sum.  (Suppose we have estimated the parameters via BW, so h is only a function of the z's.)  Define T_k(z_1,…,z_k) as g_1(z_1) + sum_{j=2}^k g_j(z_{j-1}, z_j), and note two facts: T_n is g, the function of interest, and any T_k can be written as T_{k-1} + g_k.  Next define a sequence of vectors delta_k(y), y in 1:m and k=2,…,n, where delta_k(y) = max_{z_1,…,z_{k-1}} T_k( z_1,…,z_{k-1}, y ); i.e., lock the last variable and max over the others.  Show that delta_k(y) = max_{z_{k-1}} [ g_k(z_{k-1},y) + delta_{k-1}( z_{k-1} ) ], and thus the delta_k vectors can be computed recursively in linear time.  Keep track of psi_k(y) = the argmax in the definition of delta_k (over the z_{k-1}'s), and then observe that zhat_n = argmax_y delta_n(y), and zhat_{i} = psi_{i+1}( zhat_{i+1} ) for subsequent i's going down to i=1. Discuss wide usage. (See the R sketch at the end of this entry.)

Consider some basic statistical decision theory to provide additional context.  Loss, risk, Bayes risk, Bayes rule, posterior expected loss. Do so using a state of nature Z (possibly high dimensional, and discrete).  Show that the posterior mode provides a Bayes rule under 0-1 loss, and thus Viterbi provides such a rule.  Consider the Chip Lawrence work on RNA secondary structure that indicates problems with zhat_vit; this stems from the fact that the unknown z is not of primary interest, rather something simpler, like f(z), is; and f(zhat_vit) may be a bad estimator of f(z) even if zhat is a good estimator of z.  One issue is plug-in vs Bayes, but we won't discuss that today; another issue is the choice of loss.  Show that with Hamming loss L = sum_i 1[ z_i != zhat_i ] we get a Bayes rule (call it zhat_centered) whose i-th component is argmax_a P[ Z_i = a | data ], with components computable by BW smoothing.

 

Carvalho and Lawrence, PNAS, 2008, 105:3209-3214 (on examples in comp bio where having the mode isn’t the best thing…)
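
A minimal R sketch of the Viterbi recursion from this entry, in log space, using the same (hypothetical) P, pi0, and emis conventions as the forward-backward sketch above.

    viterbi <- function(x, P, pi0, emis) {
      n <- length(x); m <- nrow(P)
      le <- log(emis(x)); lP <- log(P)
      delta <- matrix(-Inf, n, m); psi <- matrix(0L, n, m)
      delta[1, ] <- log(pi0) + le[1, ]
      for (i in 2:n) {
        for (b in 1:m) {
          cand <- delta[i - 1, ] + lP[, b]
          psi[i, b]   <- which.max(cand)      # backpointer
          delta[i, b] <- max(cand) + le[i, b]
        }
      }
      zhat <- integer(n)
      zhat[n] <- which.max(delta[n, ])
      for (i in (n - 1):1) zhat[i] <- psi[i + 1, zhat[i + 1]]
      zhat
    }

    # using x, P, emis from the forward-backward sketch above
    zhat <- viterbi(x, P, c(0.5, 0.5), emis)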

15

3/8/11

Part 1 on probabilistic graphical models; factoring a joint distribution on a DAG; the conditional independence graph as the basis for a Markov random field and a Gibbs distribution. Baum Welch computations as message passing on a tree.

 

 

16

3/10/11

R tricks; HW 3 due; review for quiz

 

 

 

 

Spring break 3/12-3/20

 

 

17

3/22/11

Quiz 2

 

 

18

3/24/11

More on probabilistic graphical models.  Recall that if we have variables as nodes on a tree, some of which (wlog, the leaves) we condition on and some of which we want to sum out, we can arrange marginal posterior computations, such as P[ X_root=x | data ], by arranging the tree with directional edges pointing toward X_root.  Then we can pass messages, initially from the leaves in to each of their unique downstream nodes, and subsequently from any internal node that is ready (in that it has received messages from all of its upstream nodes), until we send messages into the root.  Each message can be expressed as a vector over values possibly taken by the receiving node; the entries in the vector are conditional probabilities of the upstream observations given the value at the receiving node.  I show the HMM example as one case.  Now the message delta_{i->j}(x_j) is a conditional probability (it could also be specified as a joint probability…but we do conditionals), and it can be expressed as a sum, over variables at the sending node, of a node-specific function psi_i times the product of all other incoming messages into i.

We said briefly on 3/8 that computing all marginal posteriors at once requires reframing the tree with different rootings; but there's a possible efficiency, noting that in all such reframings only two different messages ever pass over a given edge.  That's in fact how the BW algorithm works (sequentially getting the forward and backward probabilities by such message passing).  This message passing is more general than what we previously discussed for BW because it can handle any tree.  However, as constructed it does not handle graphs with cycles (we wouldn't know where to send the messages!).  In some cases we can borrow the tree-based algorithm on a tree of cliques from the original graph, e.g., a 7-node graph having 5 cliques.  Discuss the cliques and the cluster graph.  Indicate that one can sometimes find a cluster tree, i.e., a tree that's a subgraph of the cluster graph satisfying the running intersection property [if C_i and C_j both contain a node u, then u is contained in every clique on the unique intervening path].  Then I show how message passing works on the junction tree; we need to consider that this is a bit more general than the previous message passing, because connected cliques share variables (the so-called `sepsets').  A message from C_i to a neighboring C_j is a function of x_{C_j} and is a sum, over all variables in C_i minus sepset_{i,j}, of the initially assigned clique function psi_i(x_{C_i}) times a product of incoming messages over cliques upstream from the current clique.

Finally, note two things about actual graph computations.  One is a simple example from statistical genetics, where, as we often have, the modeling produces a DAG.  But this DAG isn't good for computation; we need to `moralize' it to turn it into a valid conditional independence graph.  Last is the idea that a given undirected graph may not admit a junction tree.  It is known that a junction tree exists iff the graph is triangulated (i.e., every cycle of four or more nodes has a chord, an edge connecting two non-adjacent nodes of the cycle).

 

Some helpful notes and links: Michael Jordan; Martin Wainwright

19

3/29/11

Monte Carlo Method: the use of pseudorandom numbers to approximate quantities of interest, Q, say.  Write Q = f{ E_P[ g(X) ] } for some functions f and g and for some random object X distributed according to P, where f, g, and P are all known to the analyst.  In standard Monte Carlo, we get the computer to realize X_1,…,X_n ~ iid P and then we take Qhat_n = f{ (1/n) sum_{i} g(X_i) } to approximate Q.  The LLN plus the continuous mapping theorem give consistency; the CLT and Slutsky give approximate normality, so Qhat_n ~ approx Normal[ Q, sigma^2/n ] where sigma^2 depends on P, g, and f.  The big benefit of MC is that it is so general; the big limitation is that the standard error of Qhat_n drops like 1/sqrt{n}, so, e.g., to get 10 times the precision we need 100 times the number of samples.  We note that variations of standard Monte Carlo include `Markov chain MC', in which we give up having iid copies, and quasi MC, in which samples are more regular than random.  Both affect the precision.  MCMC broadens the scope of MC by allowing many more P's, and hence Q's, that are not reachable by iid sampling; the primary expense is that sigma^2 is inflated, though the root-n rate remains.  Quasi MC is far more difficult to apply, and it has been developed for a much narrower range of problems, but where it is feasible it can improve on the root-n component.

Examples: (1) Buffon's needle; (2) Metropolis et al. and the hard core process (and the ideal gas law); (3) bootstrap standard errors; (4) permutation p-values (expanded a bit in the two-sample case, conditioning on order statistics); (5) marginal posterior distributions in Bayesian inference.

On how to proceed: first we need to study pseudo-random numbers, a deterministic sequence with the same relevant statistical properties as a truly random sequence (note that randomness is not constructable, so we can only ask whether some particular thing looks like it or not).  I discuss congruential generators and shift register generators, and the Mersenne Twister with period 2^19937-1 and uniformity in high dimensions.  And I discuss the need for uniformity in k-dimensional unit cubes and chi-squared tests for this.  Finally I discuss elements of R, including .Random.seed and set.seed, and the benefit of saving the seed to have a proper record of a MC computation. (See the R sketch at the end of this entry.)

 

B.D. Ripley, Stochastic Simulation, Wiley, 1987.
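
A short R sketch of standard Monte Carlo from this entry, estimating Q = P(X > 2) for X ~ N(0,1) and recording the seed; the numbers are illustrative only.

    set.seed(771)                     # saving the seed gives a proper record of the computation
    n <- 1e5
    g <- rnorm(n) > 2                 # g(X) = 1[X > 2]
    Qhat <- mean(g)
    se   <- sd(g) / sqrt(n)           # standard error drops like 1/sqrt(n)
    c(Qhat = Qhat, se = se, exact = pnorm(2, lower.tail = FALSE))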

20

3/31/11

Monte Carlo continued.  Non-uniform random variate generation.  Assume our computer can realize U_1, U_2, … ~ iid Unif(0,1).  How do we get other distributions?  E.g., Bernoulli, X_i = 1[ U_i <= p ]; any discrete distribution by F^(-1); indeed any 1-d distribution by F^(-1), though not necessarily efficiently.  Binomial by summing Bernoullis; Exponential by F^(-1); Poisson by counting exponentials; what about the normal?  Show the standard Box-Muller method.

Then on to rejection sampling (in steps).  1. The idea that sampling until a conditioning event occurs gives a draw from the conditional distribution (e.g., a normal given it falls in an interval).  2. The idea of a uniform distribution over a finite region A, say, in Euclidean space.  3. The idea of using 1 to get a uniform over a sub-region C subset A.  4. If A = { (x,y): x in R, 0 <= y <= h(x) } for a positive integrable function h, then X marginally has density proportional to h.  5. The idea that to get a random point in a set like A we could draw X from the density proportional to h and then draw a uniform height in (0, h(x)).  6. Finally the rejection idea.  Given a target density f of interest and a density g that's easy to simulate, such that there exists a finite k > 0 with f(x) <= k g(x), sample (X_n,Y_n) pairs with X_n ~ g and Y_n | X_n=x ~ Unif(0, k g(x)) until Y_n <= f(X_n).  At that first time N, report X_N.  Claim X_N ~ f.  Show the example of the normal density bounded by a Laplace envelope.  [Note N has a geometric distribution.]

Do the modified Box-Muller method, where you sample (U_n,V_n) uniformly in a box (side length 2, centered at 0) until the point is within the unit disc.  Then (U_N,V_N) makes a uniformly distributed angle theta with the x-axis, and T^2 = U_N^2 + V_N^2 is uniformly distributed and independent of theta.  Further, U_N/T and V_N/T give the cosine and sine of theta, and T, which is independent, yields a Gamma(1,1/2) variable R^2 = 2*log(1/T^2), whose square root R acts as the random magnitude of a normal pair (X,Y).  Thus you get X = R cos(theta), Y = R sin(theta) from U_N and V_N without using any trigonometry. (See the R sketch at the end of this entry.)

 

Luc Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
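
A minimal R sketch of the rejection idea (step 6) from this entry, sampling N(0,1) with a Laplace(0,1) envelope; the constant k = sqrt(2e/pi) comes from maximizing f/g at |x| = 1, and the function name is hypothetical.

    k <- sqrt(2 * exp(1) / pi)
    rnorm_reject <- function() {
      repeat {
        x  <- sample(c(-1, 1), 1) * rexp(1)   # draw X from the Laplace envelope g
        gx <- 0.5 * exp(-abs(x))
        y  <- runif(1, 0, k * gx)             # uniform height under k*g(x)
        if (y <= dnorm(x)) return(x)          # accept when the point falls under f
      }
    }

    z <- replicate(1e4, rnorm_reject())
    c(mean = mean(z), var = var(z))           # roughly 0 and 1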

21

4/5/11

Rejection sampling continued.  Example with a Gamma(a,1) target and a Weibull envelope: X = { log(1/U) }^{1/a} has density g(x) = a x^{a-1} exp(-x^a).  I work out r(x) = f(x)/g(x) for f the Gamma(a,1) density (with a < 1) and find a finite x* that maximizes r(x), so c = r(x*); then I review the algorithm.

Next I discuss the general ratio-of-uniforms method.  Start with a positive integrable h(x) (proportional to some density of interest) and consider the planar region C = { (u,v): 0 <= u <= sqrt{h(v/u)} }.  I work out the area of C using the transformation of variables (u,v) -> (u, x=v/u) and get area = int h(x) / 2.  Then I consider sampling (U,V) ~ Unif(C) and setting X = V/U.  I show, by the same change of variables formula, that X ~ h/int(h).  As an example I consider the normal again, with h(x) = exp(-x^2/2).  Show that C is contained in a box in the plane, since u is in (0,1) and v^2 <= -2 u^2 log(u^2).  Review the algorithm…rejection to get (U,V) in C, then return V/U.  Recall our algorithm for making normals without doing trigonometry…we note that the angle made by (U,V) is uniform in (0,2pi), and the ratio U/V is the tangent of the angle, so it has a Cauchy distribution (no trigonometry needed to get it).  This approach is very fast for t distributions. (See the R sketch at the end of this entry.)

Next I go on to discuss composition, wherein the distribution of interest f(x) is represented as either a discrete or continuous mixture over a latent Z (indicate that's how rnorm actually works, with a finite decomposition), but also indicate how Z ~ chisquare_k and X|Z=z ~ Normal[0, k/z] makes a t_k, and how Z ~ Gamma(k,k) and X|Z=z ~ Poisson(theta * z) makes a negative binomial, and, after discussing that Betas come from independent Gammas, discuss the Beta-binomial as an overdispersed binomial.  As an aside, discuss the Polya-urn method for sampling the beta-binomial, and indicate its connection to Bayes theory….next time on to multivariate simulation…

 

 

 
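
A minimal R sketch of the ratio-of-uniforms method from this entry for the normal case, using the bounding box worked out above; the function name is hypothetical.

    # With h(x) = exp(-x^2/2), C = {(u,v): 0 < u <= sqrt(h(v/u))} sits inside
    # the box (0,1] x [-sqrt(2/e), sqrt(2/e)]
    rnorm_rou <- function() {
      b <- sqrt(2 / exp(1))
      repeat {
        u <- runif(1)
        v <- runif(1, -b, b)
        if (v^2 <= -4 * u^2 * log(u)) return(v / u)   # (u,v) landed in C
      }
    }

    z <- replicate(1e4, rnorm_rou())
    c(mean = mean(z), var = var(z))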

22

4/7/11

Multivariate simulation.  0. The natural point that, given a joint distribution f(x,y) for (X,Y), one can simulate by getting X from its marginal and then Y from its conditional given X=x.  1. One can always get a multivariate draw in principle using the sequential factorization of a joint distribution (e.g., through a DAG), but (a) this is usually not available in simple form for inference (it may be for forward simulation) and (b) it may not be the most efficient way, even if it's available.  On to some examples of special distributions: Dirichlet in terms of independent Gammas; multinomial in terms of n iid simple multinomials (Andy's `multinoulli'); multivariate normals in terms of the Cholesky decomposition (Wisharts for homework). (See the R sketch at the end of this entry.)  But still most joint distributions are not amenable to direct sampling.

Another approach is MCMC.  First a few observations: note that if X ~ N(0,1) and Y|X=x ~ Normal[ rho x, 1-rho^2 ], then Y is a randomly modified version of the input X, but it has the same distribution (X =^d Y).  Now on to finite Markov chains…e.g., binary, with transition probabilities alpha for 0->1 and beta for 1->0, and start, say, at X_1=1.  Consider how easy it is to simulate a chain stepwise.  Show simulation of many realizations, considering the n-step distributions P[X_n=1|X_1=1]; show that they converge to a Bernoulli(alpha/(alpha+beta)) in this case [the `stationary distribution'], and were we to have started there we would stay there always.  Discuss aperiodicity and irreducibility and that this is sufficient to get convergence to a unique stationary distribution.  Also discuss generalizations of the LLN and CLT for Markov chains, using the example (bsim*.R).  Xbar still approximates mu, a property of the stationary distribution, but its rate of convergence may be slower.  MCMC works by representing the quantity of interest as a property of a distribution pi, and then by rigging up a Markov chain that has stationary distribution pi, in order to get approximants from the chain….more next time.

 

 
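
Two small R sketches for the special distributions in this entry: the multivariate normal via the Cholesky factor and the Dirichlet via independent Gammas.  Function names and the numerical example are hypothetical.

    # N(mu, Sigma) using Sigma = t(R) %*% R from chol()
    rmvnorm_chol <- function(n, mu, Sigma) {
      R <- chol(Sigma)
      z <- matrix(rnorm(n * length(mu)), n)
      sweep(z %*% R, 2, mu, "+")
    }

    # Dirichlet(alpha) from independent Gammas, normalized to sum to one
    rdirichlet <- function(n, alpha) {
      g <- matrix(rgamma(n * length(alpha), shape = alpha), n, byrow = TRUE)
      g / rowSums(g)
    }

    x <- rmvnorm_chol(1000, mu = c(1, 2), Sigma = matrix(c(2, 1, 1, 2), 2))
    cov(x)    # close to Sigma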

23

4/12/11

Metropolis-Hastings and Gibbs samplers.  Last time we reviewed Markov chains and the goal of MCMC: establish the quantity of interest as a feature of a target distribution, where the target distribution is the stationary distribution of a Markov chain that you cook up.  Start here with the idea of a reversed chain Y_1, Y_2,…,Y_n produced from an initial chain X_1,X_2,…,X_n, where the chain is stationary with stationary distribution pi(x), and where Y_i = X_{n-i+1}.  Work out the transition probabilities for the reversed chain as ptilde(a,b) = p(b,a) pi(b)/pi(a) for all states a,b.  Next define the chain to be reversible if ptilde(a,b) = p(a,b), the original transition probabilities, for all states.  This gives detailed balance: pi(a) p(a,b) = pi(b) p(b,a).  Further, if you're given pi and a set of transition probabilities, and if you can establish detailed balance, then you know the Markov chain having these transition probabilities will have pi as its stationary distribution [sum out a on both sides of detailed balance, e.g.].  This is the program for MH algorithms.

Set up the Metropolis algorithm (finite state version).  Target distribution pi(a) on a finite set.  You make a chain X_1, X_2,… by proposing Y from q(x,.), a proposal distribution over the space for each x, such that q(x,y) = q(y,x).  Then you compute r = pi(y)/pi(x) for Y=y, and make a uniformly distributed U.  X_{n+1}=Y if U <= r, else X_{n+1}=X_n.  Discuss a bit in terms of the magnitude of pi and the normalizing constant.  Discuss the acceptance rate.  Show that this chain satisfies detailed balance w.r.t. pi.  Give some continuous examples…simulating a normal using a Uniform[x-eps, x+eps] proposal (sketch typical output for big eps and small eps).  Recall the LLN/CLT for Markov chains to motivate estimates.  Give the original Metropolis example.  Discuss Hastings (1970) and the relaxation of symmetric q (homework).  E.g., an Exponential proposal for a normal target [note it's not irreducible…need multiple proposals; fine as long as each one maintains detailed balance].  Show a proposal for bivariate X=(U,V) that proposes Y=(U,Z), where Z comes from p(v|u).  Compute that the MH ratio equals 1, so the proposal is never rejected…the `Gibbs sampler'.  Indicate how it works in practice by collecting a bunch of proposal mechanisms into one algorithm. (See the R sketch at the end of this entry.)

 

 
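
A minimal R sketch of the random-walk Metropolis sampler described in this entry, with a symmetric Uniform(x - eps, x + eps) proposal and a target known only up to a constant (here proportional to a standard normal density); names are hypothetical.

    ltarget <- function(x) -x^2 / 2     # unnormalized log target

    rw_metropolis <- function(nsim, eps = 1, x0 = 0) {
      x <- numeric(nsim); x[1] <- x0; acc <- 0
      for (i in 2:nsim) {
        y <- runif(1, x[i - 1] - eps, x[i - 1] + eps)
        if (log(runif(1)) <= ltarget(y) - ltarget(x[i - 1])) {
          x[i] <- y; acc <- acc + 1     # accept the proposal
        } else x[i] <- x[i - 1]         # otherwise stay put
      }
      list(chain = x, accept = acc / (nsim - 1))
    }

    set.seed(771)
    out <- rw_metropolis(1e4, eps = 2)
    c(mean = mean(out$chain), accept = out$accept)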

24

4/14/11

Gibbs samplers, continued: Start with an aside about completing the square in a multivariate normal Bayes computation.  I.e., say Y = X theta + epsilon in matrix form, where Y is an n-vector, X is n x p, and theta is p x 1, with normal errors, so that Y | X ~ Normal[ X theta, I_n/lambda_y ].  If also theta ~ Normal[ 0, I_p/lambda_theta ] is the prior, then the posterior density has a certain form…write it out as proportional to exp{ -1/2 [ stuff ] } where stuff involves theta^T theta and other pieces.  Go through the completing-the-square computation, starting from this posterior and also simplifying the MVN density form; thus we get theta|Y,X ~ Normal[ posterior mean, posterior variance ].

Next go on to MCMC.  First Albert and Chib's probit regression model, Y_i = 1[ Z_i <= 0 ], where Z ~ Normal[ -X theta, I_n ].  Show the graphical representation of this model (DAG and undirected form) and show how to do Gibbs sampling via p(theta|z,y) and p(z|theta,y), involving the normal form from the beginning of class [using the Cholesky decomposition] and truncated normal sampling for z given everything else.  Reconsider the form of the Gibbs sampler output and how it's used/diagnosed. (See the R sketch at the end of this entry.)  Next start the Besag 95 field trials example and model…

 

 
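
A sketch in R of a Gibbs sampler for probit regression in the spirit of Albert and Chib, written in the common convention y_i = 1[z_i > 0] with z ~ N(X theta, I) (the lecture uses the equivalent Y_i = 1[Z_i <= 0] convention); the prior precision, simulated data, and function name are hypothetical.

    set.seed(771)
    n <- 200
    X <- cbind(1, rnorm(n))
    y <- as.integer(X %*% c(-0.3, 1) + rnorm(n) > 0)

    probit_gibbs <- function(y, X, lambda = 0.01, nsim = 2000) {
      n <- nrow(X); p <- ncol(X)
      V  <- solve(crossprod(X) + lambda * diag(p))   # posterior covariance of theta given z
      Rt <- t(chol(V))
      theta <- rep(0, p)
      out <- matrix(NA, nsim, p)
      for (s in 1:nsim) {
        mu <- drop(X %*% theta)
        # z_i | theta, y: N(mu_i, 1) truncated to (0, Inf) if y_i = 1 and to
        # (-Inf, 0] if y_i = 0, sampled by inverting the normal cdf
        lo <- ifelse(y == 1, pnorm(0, mu), 0)
        hi <- ifelse(y == 1, 1, pnorm(0, mu))
        z  <- qnorm(runif(n, lo, hi), mean = mu)
        # theta | z: N(V X'z, V), drawn via the Cholesky factor of V
        theta <- drop(V %*% crossprod(X, z) + Rt %*% rnorm(p))
        out[s, ] <- theta
      }
      out
    }

    draws <- probit_gibbs(y, X)
    colMeans(draws[-(1:500), ])    # posterior means after a burn-in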

25

4/19/11 (Passover)

Substitute lecturer: Prof Bates. R 1

 

 

26

4/21/11

Substitute lecturer: Prof Bates. R 2

 

 

27

4/26/11

 

 

 

28

4/28/11

 

 

 

29

5/3/11

 

 

 

30

5/5/11