Statistical
Computing
UW Madison
Statistics 771 Spring 2011
Prof
Michael Newton
Overview
This course
introduces graduate students to some basic elements of statistical computing,
and computational
statistics, more
broadly defined. There are no graduate level prerequisites, although students
are expected to be familiar with essential features of probability and
statistical inference as usually covered in an intermediate undergraduate
course.
Statistical
computing concerns the relation between computers and statistical
analysis. Early problems were how
to accurately, stably, and efficiently calculate simple statistics, such as
sample moments, in the presence of inexact computer arithmetic. Numerical analysis issues continue to
arise in other important computational problems: function evaluation,
optimization, integration, and the solution of equations. One goal of this course is to study algorithms that effectively solve these
problems.
There is some
truth to the statement that statistical theory regarding what should be
calculated in a data analysis is intimately related to computational concerns
about what can be calculated. A
second goal of this course is to present a number of modern, compute-intensive methods for data analysis.
A third goal
of this course is to have students practice statistical computing by coding algorithms and studying
their properties, and to have students become familiar with the R environment
for statistical computing.
Format
Lectures TR
2:30-3:45
Fortnightly
homework (25 pts)
Quizes (50
pts)
Project
(build R package) (25 pts)
Office hours:
as requested
Textbooks
are not required, but optional texts are:
Computational
Statistics, G
Givens and J Hoeting, Wiley, 2005
A
first course in statistical programming with R, W.J. Braun and D.J. Murdoch,
The
R Book, M. J.
Crawley
Elements
of Statistical Computing, R.A. Thisted, Chapman and Hall, 1988
Other
materials linked as needed.
Syllabus
Numerical
methods (basics; floating point issues, representation error, other errors,
log-sum-exp trick; power-series and continued fraction approximations to
probabilities; pnorm)
Numerical
linear algebra (basics; decompositions (Cholesky, QR, SVD); solving linear
systems (qr,
chol, svd, backsolve)
)
Nonlinear
optimization (selected topics): from maximum likelihood, Nelder-Mead,
coordinate descent (nlminb,
optim)
Root finding
(selected topics): Newton-Raphson, iteratively reweighted least squares for
generalized linear models (glm)
Expectation/Maximization
(convex analysis, optimization transfer, mixture models,
clustering/nonparametrics)
Probabilistic
graphical models (selected topics): hidden Markov models, Baum-Welch algorithm,
Viterbi algorithm, graphical representations, message-passing algorithms
Constrained
optimization (selected topics): pool-adjacent-violators algorithm
Monte Carlo
methods: static/dynamic, non-uniform sampling, multivariate sampling, Markov
chain sampling, Gibbs sampling, Metropolis Hastings, adaptive sampling
Building R
packages (selected topics)
Other topics
(dictated by student interest)
Homework
Homework 1. Due in class Feb 3/2011
Homework 2 (Due Feb 10). R data archive for
homework (bir.RData)
Project
information, part 2: By Thursday April 7 please send me further information on
your R-package project.
Specifically, I want to see some functionality demonstrated in R
language code. It may be the
illustration of a function that will be in your package, or some source code
that does elements of interest. It should include an R implementation of some
data set that will be used by the package.
Lectures
|
Class |
Date |
Topic |
Handouts/code |
Citations |
|
1 |
1/18/11 |
Introduction:
overview of course; algorithms/methods/deployment; fixed precision
arithmetic; log-sum-exp trick; power series and continued fractions to
compute cumulative standard normal probabilities; R |
try1.R code for normal approximation norm.pdf
output of R code |
|
|
2 |
1/20/11 |
Numerical
linear algebra: QR and Cholesky decompositions, backsolve, forward solve and
how these solve multiple linear regression with least squares |
|
|
|
3 |
1/25/11 |
Singular
value decomposition; problem to find a direction with projections close to
data equivalent to find direction of maximal variance in projections; the
principal components problem; multidimensional scaling |
|
SVD, see
Hastie, Tibshirini and Friedman, 2001, pages 60-62 |
|
4 |
1/27/11 |
Regression
with lots of covariates; ridge regression as shrinkage method; from SVD
perspective; lasso and coordinate descent by the shooting algorithm. |
|
|
|
5 |
2/1/11 |
Optimization
by direct search; motivated by likelihood optimization (take logs!); simple grid search; golden section
search; Nelder Mead algorithm (optim) |
||
|
6 |
2/3/11 |
Solving for
roots: motivated by likelihood equations; bisection; scaled iteration;
Newton’s method; secant method; multivariate Newton’s method; quasi-newton
(mentioned); GLMs and iteratively reweighted least squares. |
|
|
|
7 |
2/8/11 |
IRLS
continued…this time in the mutant frequency problem with Bernoulli response
and complementary log-log link.
Start EM. First present
the concept of an optimization transfer algorithm. Then define the E and M steps…start the ZIP example. |
|
|
|
8 |
2/10/11 |
Convex
analysis: convex set in R^d; hyperplane, separating hyperplane thm supporting
hyperplane thm; epigraph; convex function; Jensen’s inequality; back to EM algorithm; proof that it
ascends, i.e. by showing the two optimization transfer properties. Complete
the ZIP example. |
|
|
|
9 |
2/15/11 |
Review for
quiz EM
continued. Mixture modeling; self consistency equation. |
|
|
|
10 |
2/17/11 |
Quiz 1 |
|
. |
|
11 |
2/22/11 |
EM
continued. Blood type example (multinomial). Show how EM steps combine to
create a mapping theta_{m+1} = Psi( theta_m ). Normal mixture models; issues
with identifiability, bounded likelihood; shared vs separate component
parameters; the use for clustering.
Discuss mclust. Take 1-d
case, common variance, and show that when this variance gets small we have
hard thresholding and the K – means algorithm. |
|
|
|
12 |
2/24/11 |
EM
continued. Computing standard errors.
First an aside on information in regular models. (defn as variance of
score, value as neg curvature, CR bound; MLE variance.) Missing information
principle and Louis’ method. [Recall i_{actual}(theta,x) = -S’(theta,x),
where S is score from actual likelihood; i_{complete}(theta,x) = E{
-S’(theta,X,Z) | X=x }, and i_{missing}(theta,x) = E{ -d/dtheta log
f(Z|X=x,theta) | X=x }.]
Supplemental EM and the fact that the EM mapping Psi has a derivative
(we did the 1-d case) that equals i_{missing}(theta_hat) /
i_{complete}(theta_hat). And
finally the parametric bootstrap method. Then we started a final EM example involving a hidden
binary Markov chain Z_1,…,Z_n involving two transition parameters p and q,
and conditionally independent observables X_1,…,X_n having observation
components f_0(x) and f_1(x). We
found the objects required in the E/M steps, and indicated possible recursive
computation. |
|
|
|
13 |
3/1/11 |
Baum Welch
and a pushing sums. First a
story about a simple HMM. A factory with N machines making widgets of two
types…Z(t) = number of machines making type A widgets at time t;
exponential/ind lifetimes of machines + iid replacement from a pool leads to
particular continuous time process…indicate how this gives discrete time chain
when sampled. Analogy to coin tossing in room and then sampling with
replacement periodically. Discuss as model for stem cell problem, and
inference on number of machines and machine lifetime of interest….a classic
HMM. Next an
aside on numbers. Given a, b, c, ab + ac can be computed using 2 products and
one sum; or we can write S=a(b+c), computed in one sum and one product,
taking advantage of the particular structure. Similarly if given a_i, and b_{i,j}, with indices in
1,2,…,m, we can naively compute S=sum_i sum_j a_i b_{ij} using m^2 products
and m^2 – 1 sums. Writing it S=sum_i a_i B_i where B_i = sum_j b_{i,j} is
slightly different, with m(m-1) sums to get all B_i’s, then m products and
another m-1 sums to get S. Both
are O(m^2) [discuss meaning of order notation]. But now consider S= sum_i sum_j sum_k a_i b_{ij}
c_{jk}. Follow the same logic to
get naïve computation in O(m^3) and `pushing sums’ in O(m^2). In general, if z_i takes on m values,
and we want to sum S=sum_all h_1(z_1) prod_{j=2}^n h_j( z_{j-1},z_j ) it
takes O(m^n) naively [as both m,n->inf] and it takes O(n m^2) by pushing
sums. Back to
Baum-Welch. Last time we
introduced HMM and showed how EM lead to the need to compute things like P[
Z_{i-1}=a, Z_i = b | X=x]. Here we develop the forward and backward
recursions on alpha_i(a) = P[ X_1^{i} , Z_i=a ] (joint, forward prob) and
beta_i(b) = P[X_{i+1}^n|Z_i=b] (conditional, backward prob). We show that marginal P[X=x] is a sum
requiring pushing sums, and we show how it’s computed from forward, backward
probabilities. And we introduce
smoothed probabilities P[Z_i=a|X=x], noting they are not components of a
joint factorization. |
|
Rabiner |
|
14 |
3/3/11 |
Optimization
continued: Viterbi algorithm First an
aside on numbers: given a_{i,j} with each index in 1, 2, …, m, to find
max_{i,j} a_{i,j} can be done in two steps, first maxing over j for each i,
and then maxing over i. Next
recall HMM context; hidden Z=(Z_1,…,Z_n) (each variable takes a finite, say
m, set of possible values, modeled as a Markov chain, and data X_1,…,X_n
modeled as c.i. given Z=z. Last
time we discussed Baum Welch for estimating the parameters. Suppose now we want the modal state
zhat. Then we zhat is argmax of
h(z) = h_1(z_1) \prod_{j=2}^n h_j( z_{j-1}, z_j ) where each h_j involves an
obvservation component and a transition component. Since all h’s >0, zhat is also argmax g(z) where
g=log(h). Write g as a sum. (Suppose we have estimated the params
via BW, so h is only a function of the z’s). Define T_k(z_1,…,z_k) as g_1(z_1) + sum_{j=2}^k
g_j(z_{j-1}, z_j), and note two facts: T_n is g, of interest, and any T_k can
be written as T_{k-1} + g_k.
Next define a sequence of vectors delta_k(y), y in 1:m and k=2,…,n
where delta_k(y) = max_{z_1,…,z_{k-1}} T_k( z_1,…,z_{k-1}, y ); i.e. lock the
last variable and max over the others.
Show that delta_k(y) = max_{z_{k-1}} [ g_k(z_{k-1},y) + detla_{k-1}(
z_{k-1} ), and thus the delta_k vectors can be computed recursively in linear
time. Keep track of psi_k(y) =
argmax in defn of delta_k (over the z_{k-1}’s ), and then observe that zhat_k
= argmax delta_n(y), and zhat_{i} = psi_{i+1}( zhat_{i+1} ) for subsequent
i’s going down to i=1.Discuss wide usage. Consider
some basic statistical decision theory to provide some additional context. Loss, risk, bayes risk, bayes rule,
posterior expected loss. Do so using state of nature Z (possibly high
dimensional, and discrete).
Show that posterior mode provides a bayes rule under 0-1 loss, and
thus Viterbi provides such a rule.
Consider the Chip Lawrence work in RNA secondary structure that
indicates problems with zhat_vit; stems from fact that unknown z is not of
primary interest, rather something simpler, like f(z) is; and f(zhat_vit) may
be a bad estimator of f(z) even if zhat is a good estimator of z. --- one
issue is plug-in vs bayes, but we won’t discuss that today; another issue is
the choice of loss. Show that
with Hamming Loss L = sum_i 1[ z_i .neq. zhat_i ] we get a Bayes rule zhat_centered
= argmax_i P[ Z_i = zhat_i | data ], with components computable by BW
smoothing. |
|
Carvalho
and Lawrence, PNAS, 2008, 105:3209-3214 (on examples in comp bio where having
the mode isn’t the best thing…) |
|
15 |
3/8/11 |
Part 1 on
probabilistic graphical models; factoring a joint distribution on a DAG; the
conditional independence graph as the basis for a Markov random field and a
Gibbs distribution. Baum Welch computations as message passing on a tree. |
|
|
|
16 |
3/10/11 |
R tricks;
HW 3 due; review for quiz |
|
|
|
|
|
Spring
break 3/12-3/20 |
|
|
|
17 |
3/22/11 |
Quiz 2 |
|
|
|
18 |
3/24/11 |
More on
probabilistic graphical models.
Recall that if we have variables as nodes on a tree, some of which
(wlog, the leaves) we condition on and some of which we want to some out, we
can arrange marginal posterior computations, such as P[ X_root=x| data] by
arranging the tree with directional edges pointing toward X_root. Then we can pass messages, initially
from the leaves in to each of their unique downstream nodes, and subsequently
from any internal node that is ready (in that it has received messages from
all of its upstream nodes), until we send messages into the root. Each message can be expressed as a
vector over values possibly taken by the receiving node; the entries in the
vector are conditional probabilities of the upstream observations given the
value at the receiving node. I
show the HMM example as one case.
Now the message delta_{i->j} (x_j) is a conditional probability
(could also be specified as a joint prob…but we do conditionals), and it can
be expressed as a sum over variables at the sending node of a node-specific
function psi_i times the product of all other incoming messages into i. We said briefly on 3/8
that to compute all marginal posteriors at once requires reframing the tree
with different rootings; but there’s a possible efficiency, noting that in
all such reframings only two different messages ever pass over a given
edge. That’s in fact how the BW
algorithm works (sequentially getting the forward and backward probabilities
by such message passing).
Now the message-passing is more general than we previously discussed
from BW because it can handle any tree.
However as constructed it does not handle graphs with cycles. (we wouldn’t know where to send the
messages!) In some cases
we can borrow the tree-based algorithm on a tree of cliques from the original
graph. E.g. with a 7 node tree
having 5 cliques. Discuss the
cliques and the cluster graph.
Indicate a that one can sometimes find a cluster tree; i.e. a tree
that’s a subgraph of the cluster graph satisfying the running intersection
property. [if C_i and C_j both
contain a node u, then u is contained in every clique on the unique path
intervening path]. Then I show
how message passing works on the junction tree; we need to consider that this
is a bit more general than the previous message passing, because connected
cliques share variables (the so-called `sepsets’). A message from C_i to a neighboring C_j is a function of
x_{C_j} and is a sum over all variables in C_i minus sepset_{i,j} of the
initially assigned clique function psi_i(x_{c_i}) times a product of incoming
messages over cliques upstream from the current clique. Finally note two things about actual
graph computations. One is a
simple example from statistical genetics, where, as we often have, the
modeling produces a DAG. But
this DAG isn’t good for computation; we need to `moralize’ to turn it into a
valid conditional independence graph.
Lastly is the idea that a given undirected graph may not admit a
junction tree. It is known that
a junction tree exists iff the graph is triangulated (i.e., every cycle of 4
or more nodes has an edge connecting non-adjacent nodes). |
|
Some
helpful notes and links: Michael
Jordan; Martin
Wainwright |
|
19 |
3/29/11 |
Monte Carlo
Method: the use of pseudorandom
numbers to approximate quantities of interest, Q, say. Write Q = f{ E_P[ g(X) ] } for some
functions f and g and for some random object X distributed according to P,
where all f, g, P are known to the analyst. In standard Monte Carlo, we get the computer to realize
X_1,…,X_n ~ iid P and then we get Qhat_n = f{ 1/n sum_{i} g(X_i) } to
approximate Q. LLN + Continuous
mapping theorem gives consistency; CLT and Slutsky given approximate
normality, so Qhat_n ~ approx Normal[ Q, sigma2/n ] where sigma2 depends on
P, g, and f. The big benefit of
MC is that it is so general; the big limitation is that the standard error of
Qhat_n drops like 1/sqrt{n}, so, e.g. to get 10 times the precision we need
100 times the number of sample. We note that variations of standard Monte
Carlo include `Markov chain MC’, in which we give up having iid copies, and
quasi MC, in which samples are more regular than random. Both affect the precision. MCMC broadens the scope of MC by
allowing many more P’s and hence Q’s that are not reachable by iid sampling;
but the expense, the primary expense, is that sigma^2 is inflated, though
root n remains. Quasi MC is far
more difficult to apply, and it has been developed in a much narrower range
of problems, but where it is feasible it can speed up the root n component. On examples: Eg. 1; Buffon’s needle;
Eg 2 Metropolis et al and the hard core process (and the ideal gas law); Eg 3
bootstrap standard errors; Eg 4 permutation p-values (expand on this a bit in
the two-sample case and conditioning on order statistics); Eg 5 and marginal
posterior distributions in Bayesian inference. On how to proceed; first we need to study pseudo-random
numbers; a deterministic sequence with the same relevant statistical
properties as a truly random sequence (note that randomness is not constructable,
so we can only ask if some particular thing looks like it or not). I discuss congruential generators and
shift register generators, and the Mersenne Twister with period 2^19937-1 and
uniformity in high dimensions.
And I discuss the need for uniformity in k-dimensional unit squares
and chi-squared tests for this. Finally I discuss elements of R, including
.Random.seed and set.seed, and the benefit of saving the seed to have a
proper record of a MC computation. |
|
B.D.
Ripley, Stochastic Simulation, Wiley, 1987. |
|
20 |
3/31/11 |
Monte Carlo
continued. Non-uniform random
variate generation. Assume our
computer can realize U_1, U_2, … ~iid Unif(0,1). How do we get other distributions? E.g. Bernoulli, X_i = 1[ U_i <= p
]; any discrete by F^(-1), indeed any 1-d distribution by F^(-1), though not particularly
efficient. Binomial by summing
Bernoulli’s; Exponential by F^(-1); Poisson by counting exponentials; what
about normal? Show the standard
Box-Muller method. Then on to
Rejection sampling (in steps).
1. The idea of sampling until a conditioning event occurs gives a draw
from the conditional distribution (e.g. Normal given in interval). 2. The idea of having a uniform
distribution over a finite region in Euclidean space, A, say. 3. The idea of
using 1 to get uniform over a sub-region C subset A. 4. If A ={x,y: x in R,
0<=y<= h(x) } for a positive integrable function h then X marginally
has density proportional to h. 5. The idea that to get a random point in a
set like A we could draw from density proportional to h and then draw a uniform height in (0, h(x)). 6. Finally the rejection idea. Given a target density f of interest
and a density g that’s easy to simulate and also such that there exists k
>0 and k < infty such that f(x) <= k g(x). Then sample (X_n,Y_n) pairs with X_n ~g and Y_n|X_n=x ~
Unif(0,k g(x) ) until Y_n <= f(X_n).
At that first time N, report X_N. Claim X_N ~ f. Show example of normal density
by bounding by Laplace. [Note N has geometric distribution] Do the modified
Box-Muller method where you sample (U_n,V_n) Uniformly in a box (side length 2; centered at 0) until
the point is within a unit disc. Then (U_N,V_N) makes a uniformly distributed
angle theta with the x –axis, and T^2 = U_N^2 + V_N^2 is uniformly
distributed and independent of the theta. Further U_N/T and V_N/T make the cosine and sine of theta,
and T, which is independent, yields a Gamma(1,1/2) variable via
R^2=2*log(1/T^2), which acts as the random magnitude of a normal pair
(X,Y). Thus you get X=R
cos(theta) , Y=R sin(theta) from U_N and V_N without using any trigonometry. |
|
Luc
Devroye. Non-Uniform Random Variate Generation. Springer, 1986. |
|
21 |
4/5/11 |
Rejection
sampling continued. Example with Gamma(a,1) and Weibull envelope; X = {
log(1/U) }^{1/a} has density
g(x) = a x^{a-1} exp(-x^a). I
work out r(x) = f(x)/g(x) for f the Gamma(a,1) (with a < 1 ) and find a
finite x* that maximizes r(x), so c = r(x*); then I review the
algorithm. Next I discuss the
general ratio of uniforms method. Start with a positive integrable h(x)
(proportional to some density of interest) and consider the planer region C={
(u,v): 0 <= u <= sqrt{h(v/u)} }.
I work out area of C using transformation of variables (u,v) ->
(u,x=v/u) and get area = int h(x) / 2.
Then I consider sampling (U,V) ~ Unif(C), and X=V/U. I show, by the same change of
variables formula, that X ~ h/int(h).
As an example I consider the normal again, and h(x) =
exp(-x^2/2). Show that C is
contained in a box in the plane, since u is in (0,1) and v^2 <= -2 u^2
log(u^2). Review the
algorithm…rejection to get (U,V) in C, then return (V/U). Recall our algorithm for making
normals without doing trigonometry…we note that the angle made by (U,V) is
uniform in (0,2pi), and the ratio U/V is tangent of the angle, so it has a
Cauchy distribution (no need for trig to get). This approach is very fast for t distributions. Next I go on to discuss composition,
where in the distribution of interest f(x) is represented as either a discrete
or continuous mixture over a latent Z.
(indicate that’s how rnorm actually works, with a finite
decomposition), but also indicate how Z ~ chisquare_k and X|Z=z ~ Normal[0,
k/z ] makes a t_k, and how Z~Gamma(k,k) and X|Z=z ~ Poisson(theta * z ) makes
a negative binomial, and, after discussing that Beta’s come from independent
Gammas discuss the Beta-binomial as an overdispersed binomial. As an aside discuss the Polya-urn
method for sampling the beta binomial, and indicate its connection to Bayes
theory….next time on to multivariate simulation… |
|
|
|
22 |
4/7/11 |
Multivariate
simulation. 0. The natural point
that given a joint distribution f(x,y) for X,Y, one can simulate by getting X
from its marginal and then Y from its conditional given X=x. 1. Can always get a multivariate draw in principle using the
sequential factorization of a joint distribution (e.g. through a DAG), but a.
this is usually not available in simple form for inference (it may be for
forward simulation) and b. it may not be the most efficient way, even if it’s
available. On to some examples
of special distributions; Dirichlet in terms of independent Gamma’s. Multinomial in terms of n iid simple
multinomials (Andy’s `multinoulli’); Multivariate normals in terms of
Cholesky decomposition.
(Wishart’s for homework).
But still most joint distributions are not amenable to direct
sampling. Another approach is
MCMC. First a few observations;
note that if X~N(0,1) and Y|X=x ~ Normal[ rho X, 1-rho^2 ], then Y is a randomly
modified version of the input X, but it has the same distribution (X=^d
Y). Now on to finite Markov
chains…e.g. binary, with transition probs alpha 0->1 and beta 1->0, and
start, say at X1=1. Consider how
easy it is to simulate a chain stepwise. Show simulation of many realizations, considering the
n-step distributions P[X_n=1|X_1=1], show that they converge to
alpha/(alpha+beta) Bernoulli in this case [`stationary distribution’], and
were we to have started there we would stay there always. Discuss aperiodicity and
irreducibility and that this is sufficient to get convergence to a unique
stationary distribution. Also
discuss generalizations of the LLN and CLT for Markov chains, also using the
example (bsim*.R). Xbar still
approximates mu, a property of the stationary distribution, but its rate of
convergence may be slower. MCMC
works by representing the quantity
of interest as a property of a distribution pi, and then by rigging up
a Markov chain that has stationary distribution pi, in order to get
approximants from the chain….more next time. |
|
|
|
23 |
4/12/11 |
Metropolis
Hastings and Gibbs samplers. Last time we reviewed Markov chains and the goal
of MCMC to establish the quantity of interest as a feature of a target
distribution, wherein the target distribution is the stationary distribution
of a Markov chain that you cook up . Start here with the idea of a reversed chain Y_1,
Y_2,…,Y_n produced from an initial chain X_1,X_2,..,X_n, where X_n is
stationary, with stationary distribution pi(x), and where Y_i =
X_{n-i+1}. Work out the transition
probabilities for the reversed chain as ptilde(a,b) = p(b,a) pi(b)/pi(a) for
all states a,b. Next define the
chain to be reversible if ptilde(a,b) = p(a,b), the original transition
probabilities, for all states.
This defines detailed balance pi(a) p(a,b) = pi(b) p(b,a). Further, if you’re given pi and a set
of transition probs, and if you can establish detailed balance, then you know
the MC having these transition probabilities will have pi as its stationary
distribution [sum out a on both sides of detailed balance,e.g.]. This is the program for MH
algorithms. Set up the
Metropolis algorithm (finite state version). Target distribution pi(a) on finite set. You’ll make a chain X_1, X_2,…by
proposing Y from q(x,.), a proposal distribution over the space for each x,
such that q(x,y)=q(y,x). Then
you compute r = pi(y)/pi(x) for Y=y, and make a uniformly distributed U. X_{n+1}=Y if U <=r, else
X_{n+1}=X_n. Discuss a bit in
terms of magnitude of pi and normalizing constant. Discuss acceptance rate. Show that this chain satisfies detailed balance wrt
pi. Give some continuous
examples…simulating a normal
using a Uniform[x-eps,x+eps] proposal.
(Sketch typical output for big eps and small eps.) Recall LLN/CLT for M chains to
motivate estimates. Give the
original Metropolis example.
Discuss Hastings (1970) and the relaxation of symmetrix
q..(homework). E.g. with
Exponential proposal for normal [note it’s not irreducible…need multiple
proposals; as long as each one maintains detailed balance]. Show a proposal that for bivariate
X=(U,V) which involves Y=(U,Z), where Z comes from p(v|u). Compute MH ratio equal 1, so never
reject proposal…`Gibbs sampler’.
Indicate how it works in practice by collecting a bunch of proposal
mechanisms into one algorithm. |
|
|
|
24 |
4/14/11 |
Gibbs
samplers, continued: Start with an aside about completing the square in a
multivariate normal Bayes computation. I.e., say Y=X theta + epsilon in
matrix form Y is an n vector, X is nxp, and theta is px1, with normal errors,
and say Y |X ~ Normal[ X theta, I_n/lambda_y ]. If also theta ~ Normal[ 0, I_p/lambda_theta] is a prior,
then the posterior density has a certain form…write out as proportional to
exp{ -1/2[ stuff] } where stuff involves theta^T theta and other pieces. Go through the completing the square
computation, by starting this
posterior and then by also simplifying the MVN density form; thus we
get theta|Y,X ~ Normal[ posterior mean, posterior precision ]. Next go on to MCMC. First Albert and
Chib’s probit regression model Y_i = 1 [ Z_i <= 0 ], where Z ~ Normal[ - Xtheta, I_n ]. Show the graphical representation of
this model (DAG and undirected form) and show how to do Gibbs sampling by
p(theta|z,y) and p(z|theta,y) involving the normal form from the beginning of
class [using Cholesky decomposition] and using truncated normal sampling for
z given everything else.
Reconsider the form of the Gibbs sampler output and how it’s
used/diagnosed. Next start the
Besag 95 field trials example and model… |
|
|
|
25 |
4/19/11
(Passover) |
Substitute
lecturer: Prof Bates. R 1 |
|
|
|
26 |
4/21/11 |
Substitute
lecturer: Prof Bates. R 2 |
|
|
|
27 |
4/26/11 |
|
|
|
|
28 |
4/28/11 |
|
|
|
|
29 |
5/3/11 |
|
|
|
|
30 |
5/5/11 |
|
|
|