MSRC Abstracts

Session I: Theory

Title: Bayesian model choice and information criteria in sparse generalized linear models

Mathias Drton
Department of Statistics
University of Chicago

Abstract: We consider Bayesian model selection in generalized linear models that are high-dimensional, with the number of covariates p being large relative to the sample size n, but sparse in that the number of active covariates is small compared to p. Treating the covariates as random and adopting an asymptotic scenario in which p increases with n, we show that Bayesian model selection using certain priors on the set of models is asymptotically equivalent to selecting a model using an extended Bayesian information criterion. Moreover, we prove that the smallest true model is selected by either of these methods with probability tending to one. Having addressed random covariates, we are also able to give a consistency result for pseudo-likelihood approaches to high-dimensional sparse graphical modeling. Experiments on real data demonstrate good performance of the extended Bayesian information criterion for regression and for graphical models.
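As a rough illustration of the criterion mentioned in this abstract, the sketch below scores a candidate model with one common form of the extended Bayesian information criterion. The penalty form k*log(n) + 2*gamma*k*log(p), the function name ebic, and the tuning parameter gamma are assumptions made here for illustration; the abstract does not state the exact form or the priors used in the talk.

```python
import math

def ebic(log_likelihood, k, n, p, gamma=0.5):
    """Extended BIC score for a model with k active covariates (smaller is better).

    Assumed penalty: k*log(n) + 2*gamma*k*log(p).
    Setting gamma = 0 recovers the classical BIC.
    """
    return -2.0 * log_likelihood + k * math.log(n) + 2.0 * gamma * k * math.log(p)

# Hypothetical comparison: the larger model fits slightly better,
# but pays a heavier penalty when p is large relative to n.
small_model = ebic(log_likelihood=-120.0, k=2, n=100, p=1000)
large_model = ebic(log_likelihood=-117.0, k=5, n=100, p=1000)
preferred = "small" if small_model < large_model else "large"
```

With gamma = 0 the extra term vanishes, so the only difference from the classical BIC is the additional charge of 2*gamma*log(p) per active covariate.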

Title: Simultaneous Confidence Bands for Functional Regression Curves

Jing Wang
Dept of Mathematics, Statistics and Computer Science
University of Illinois at Chicago

Abstract: A new procedure is developed to construct simultaneous confidence bands for regression curves in functional data analysis. Specifically, polynomial spline estimators are proposed to approximate the derivatives of the mean functions, the covariance functions and the associated eigenfunctions. Desirable statistical properties of the proposed procedure include semiparametric efficiency of the estimated curve derivatives, and asymptotic consistency of the derivatives of the covariance function and eigenfunctions. The proposed spline confidence bands are shown to be asymptotically efficient as if all random trajectories were observed correctly. The confidence band procedure is illustrated through numerical simulation studies and a real-life example. This is joint work with Guanqun Cao, Dr. Li Wang, and Dr. David Todem.

Session II: Statistical Computing

Title: Bayesian Inference for Irreducible Diffusions

Osnat Stramer
Department of Statistics and Actuarial Science
University of Iowa

Abstract: In this talk we examine two relatively new MCMC methods which allow for Bayesian inference in diffusion models. First, the Monte Carlo within Metropolis (MCWM) algorithm (O’Neill, Balding, Becker, Eerola and Mollison, 2000) uses an importance sampling approximation for the likelihood and yields a Markov chain. Our simulation study shows that there exists a limiting stationary distribution that can be made arbitrarily “close” to the posterior distribution. The second method, described in Beaumont (2003) and generalized in Andrieu and Roberts (2009), introduces auxiliary variables and utilizes a standard Metropolis-Hastings algorithm on the enlarged space; this method preserves the original posterior distribution. When applied to diffusion models, this pseudo-marginal (PM) approach can be viewed as a generalization of the popular data augmentation schemes that sample jointly from the missing paths and the parameters of the diffusion volatility. The efficacy of the PM approach is demonstrated in a simulation study of the popular Heston model. We also define a parametric stochastic variance model for asset prices that is more general than the stochastic variance model of Heston. Our model is based on a continuous-time version of the smooth transition autoregressive (STAR) models introduced in Chan and Tong (1986). We apply the generalized Heston model to the bivariate S&P 500 and VIX dataset using the PM algorithm. A comparison is made with the approach of Golightly and Wilkinson (2008).
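To make the pseudo-marginal idea concrete, here is a minimal sketch on a toy latent-variable model rather than a diffusion. Everything in it is an assumption chosen for illustration: each latent z is drawn from N(theta, 1), each observation y from N(z, 1), theta has a N(0, 10^2) prior, and the likelihood is replaced by an unbiased Monte Carlo average over latent draws. The names likelihood_estimate and pseudo_marginal_mh are hypothetical. The defining feature of the PM scheme is that the estimate attached to the current state is recycled, not recomputed; recomputing it at every iteration would give an MCWM-style chain instead.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_estimate(theta, data, n_draws, rng):
    """Unbiased Monte Carlo estimate of the likelihood of theta.

    For each observation y, average N(y; z, 1) over fresh draws z ~ N(theta, 1).
    """
    estimate = 1.0
    for y in data:
        draws = [rng.gauss(theta, 1.0) for _ in range(n_draws)]
        estimate *= sum(normal_pdf(y, z, 1.0) for z in draws) / n_draws
    return estimate

def pseudo_marginal_mh(data, n_iter=2000, n_draws=20, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with an estimated likelihood."""
    rng = random.Random(seed)
    theta = 0.0
    # Estimated likelihood times prior density at the current state.
    weight = likelihood_estimate(theta, data, n_draws, rng) * normal_pdf(theta, 0.0, 10.0)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        new_weight = (likelihood_estimate(proposal, data, n_draws, rng)
                      * normal_pdf(proposal, 0.0, 10.0))
        # The current weight is reused as-is; only the proposal gets a new estimate.
        if rng.random() * weight < new_weight:
            theta, weight = proposal, new_weight
        chain.append(theta)
    return chain

chain = pseudo_marginal_mh([1.0, 1.2, 0.8, 1.1, 0.9])
posterior_mean = sum(chain[500:]) / len(chain[500:])
```

In this toy model the exact posterior is available in closed form (the marginal of each y is N(theta, 2)), which makes it easy to check that the chain targets the right distribution.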

Title: Inference of functional clusters from non-functional data

Long Nguyen, Department of Statistics, University of Michigan

Abstract: The problem of functional clustering when the data are not available as functions or samples of functions will be discussed. This problem commonly arises when it is not possible to obtain functional samples due to, for example, measurement limitations or confidentiality constraints. We propose a Bayesian nonparametric method based on a nested hierarchy of Dirichlet processes, where the functional clusters of interest are taken as latent random functions in a Bayesian nonparametric hierarchy. We shall discuss an efficient and rather intuitive MCMC algorithm for posterior inference and demonstrate its effectiveness in a number of data examples. Finally, time permitting, we present new results regarding the identifiability, posterior consistency and convergence rates of the latent random clusters in a nonparametric Bayesian setting.

Session III: Applications of Statistics

Title: Modeling the longitudinal change in speech recognition ability of cochlear implant users

Jacob Oleson
Assistant Professor
Department of Biostatistics
University of Iowa

Abstract: Practitioners often ask if a treatment successfully improved performance. Many times this question is directed towards the outcome of a single individual as opposed to a specific population. For instance, cochlear implant researchers would like to know how much individual patients will improve in their ability to recognize speech after receiving a cochlear implant. In this talk, we will discuss methods to assess the growth trajectory of a single individual who is administered a test where the result is the percentage of words identified correctly. One criterion for improvement is the change from pre-treatment to post-treatment, which will be demonstrated using a credible interval derived from two correlated binomial draws. We will then extend the single-subject improvement to longitudinal models, since the patients return for yearly checkups and thus have multiple performance outcomes. Although many longitudinal models exist to help investigators track and analyze their patients over time, many of these models are restricted in the form they can take and thus may not accurately represent an individual's true trajectory over a period of time. One particularly important development in analyzing longitudinal data is the functional linear mixed effects model, which relaxes many of the restrictions traditional longitudinal models place on the structure of a given growth curve. We will demonstrate a Bayesian hierarchical model for subject-specific growth curves that accounts for missing observations when the outcomes are binomial in nature.

Title: Quantile-based Permutation Thresholds for QTL Hotspots

Brian S. Yandell and Elias Chaibub Neto
Dept. of Statistics
University of Wisconsin-Madison

Abstract: One important idea in genetical genomics studies is to infer how genotypes (DNA markers for an individual) affect phenotypes (traits measured on an individual, including thousands of mRNA expression levels). QTL hotspots (genomic locations affecting many traits) are a common feature in genetical genomics studies, and are biologically interesting since they may harbor critical regulators. But are these hotspots real? Or are they spurious, a result of non-genetic correlation from uncontrolled environmental factors or unmeasured variables? A recently proposed empirical test (Breitling et al. 2008) uses the number of traits that pass a predetermined LOD threshold (LOD is a rescaled likelihood ratio, similar to an F statistic), assessing the null distribution with an extension of Churchill and Doerge's (1994) permutation test, which is itself an extension of Fisher's permutation t-test. That is, we permute the phenotype traits together while keeping the original genotypes intact. This breaks the genotype-phenotype bond while preserving the correlation structure separately among phenotypes and among genotypes. This seems to solve the problem, but it only considers the number of traits above one threshold, without accounting for the magnitude of the LOD scores. Relevant information is lost. In particular, biologically interesting hotspots composed of a moderate to small number of traits with strong LOD scores may be neglected as non-significant. In this talk we propose a quantile-based permutation approach that simultaneously accounts for the number and the LOD scores of traits within the hotspots. By considering sliding thresholds, our method can assess the statistical significance of both small and large hotspots. We assess performance with simulations and illustrate how our approach can effectively assess the significance of moderate and small hotspots with strong LOD scores in two experimental crosses, budding yeast and a mouse model for type II diabetes.
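The permutation scheme described in this abstract can be sketched as follows. This is only the single-threshold version that the abstract critiques, not the quantile-based method proposed in the talk, and it rests on simplifying assumptions: genotypes are coded numerically, the LOD score comes from a single-marker regression (so LOD = -(n/2) log10(1 - r^2)), and all function names are hypothetical.

```python
import math
import random

def lod(marker, trait):
    """Single-marker regression LOD score: -(n/2) * log10(1 - r^2)."""
    n = len(marker)
    mx, my = sum(marker) / n, sum(trait) / n
    sxx = sum((a - mx) ** 2 for a in marker)
    syy = sum((b - my) ** 2 for b in trait)
    sxy = sum((a - mx) * (b - my) for a, b in zip(marker, trait))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -0.5 * n * math.log10(1.0 - r2)

def max_hotspot_size(genotypes, phenotypes, lod_threshold):
    """Largest number of traits exceeding the LOD threshold at any one marker.

    genotypes: list of markers, each a list over individuals.
    phenotypes: list of traits, each a list over the same individuals.
    """
    return max(sum(lod(marker, trait) > lod_threshold for trait in phenotypes)
               for marker in genotypes)

def permutation_threshold(genotypes, phenotypes, lod_threshold,
                          n_perm=200, level=0.95, seed=0):
    """Empirical quantile of the maximum hotspot size under permutation."""
    rng = random.Random(seed)
    n = len(genotypes[0])
    null_sizes = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # One permutation is applied to all traits together, so the
        # correlation among traits is preserved while the link to the
        # genotypes is broken.
        permuted = [[trait[i] for i in order] for trait in phenotypes]
        null_sizes.append(max_hotspot_size(genotypes, permuted, lod_threshold))
    null_sizes.sort()
    return null_sizes[min(n_perm - 1, math.ceil(level * n_perm) - 1)]
```

An observed hotspot would be called significant at the chosen level if its trait count exceeds the returned value. Because only the count above one fixed threshold is used, the magnitude of the LOD scores plays no role here, which is the limitation the talk addresses.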

Session IV: Graduate Student Talks

Zhihua Su, University of Minnesota
Title: "Envelope Models: Efficient Estimation in Multivariate Linear Regression"

Abstract: This talk presents an introduction to a new statistical concept called an envelope. An envelope has the potential to achieve substantial efficiency gains in multivariate analysis by identifying and cleaning up immaterial information in the data. The efficiency gains will be demonstrated both by theory and example. If time permits, some recent developments in this area, including partial envelopes and heteroscedastic envelopes, will also be discussed. They refine and extend the enveloping idea, adapting it to more data types and increasing the potential to achieve efficiency gains. Applications of envelopes and their connection to other fields will also be mentioned.

Hyun Keun Cho, University of Illinois at Urbana-Champaign
Title: "Efficient moment selection from high-dimensional moment conditions"

Abstract: For high-dimensional correlated data with a large cluster size, it is feasible to generate many valid moment conditions such as in dynamic panel data models. The generalized method of moments (GMM) (Hansen, 1982) approach has the advantage of obtaining an estimator through optimally combining valid moment conditions. However, the GMM estimator could be infeasible when the number of moment conditions exceeds the sample size. We propose an objective criterion which includes a set of important moment conditions in addition to selecting optimal linear combinations of the remaining moment conditions. This is in contrast to existing methods which only select a subset of the valid moment conditions. Monte Carlo simulation studies and a real data example show that the proposed method performs better when important moment conditions are included in addition to linear combinations of the remaining moment conditions. It outperforms existing methods in the sense of reducing bias and improving the efficiency of the estimation.

Rina Foygel, University of Chicago
Title: "Matrix reconstruction with the local max norm"

Abstract: In the low-rank matrix reconstruction problem, we observe some entries of an n × m matrix Y and would like to accurately estimate the unobserved entries, based on the assumption that Y is approximately low-rank. This problem is relevant in many modern high-dimensional applications, including recommender systems such as Netflix, as well as applications where vectors Y_1, ..., Y_m ∈ Rⁿ have similarities arising from temporal or spatial structure, such as video data (where the m frames of a video are similar due to stationary objects appearing in many frames) and weather data (where m locations have similar weather patterns due to geographic proximity). A common approach to the low-rank matrix reconstruction problem is to regularize with the matrix trace-norm. This approach can be viewed as a convex surrogate for a low-rank constraint, and is known to guarantee low error in estimating the unobserved entries of Y under certain assumptions (work by Recht, Candes & Tao, Negahban & Wainwright, and others). An alternative is the max-norm, which provides stricter regularization than the trace-norm (Srebro & Shraibman). Depending on the nature of the data, either one of these norms might be better than the other at reconstructing the matrix. With this in mind, we propose a family of matrix norms that generalizes the trace-norm and the max-norm, which we call the local max-norm. This family of norms can be viewed as a way of interpolating between the trace-norm and the max-norm. We show that these norms are computationally tractable, and show improved empirical performance on the task of reconstructing weather pattern data (daily maximum temperatures at 30 weather stations).
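For readers unfamiliar with the two norms being interpolated, the standard factorization characterizations are given below. They are not taken from the abstract; the symbols X, A, B and the row-norm notation are introduced here for illustration, and the definition of the local max-norm itself is not reproduced because the abstract does not state it.

```latex
% Trace-norm: sum of singular values, equivalently the best Frobenius-norm factorization.
\|X\|_{\mathrm{tr}} \;=\; \sum_{i} \sigma_i(X)
  \;=\; \min_{X = A B^{\top}} \|A\|_{F}\,\|B\|_{F}

% Max-norm: the same minimization with the Frobenius norm replaced by the largest row norm.
\|X\|_{\max} \;=\; \min_{X = A B^{\top}} \|A\|_{2,\infty}\,\|B\|_{2,\infty},
\qquad
\|A\|_{2,\infty} \;=\; \max_{i} \Big( \sum_{k} A_{ik}^{2} \Big)^{1/2}
```

The trace-norm constrains the factor rows on average, while the max-norm constrains every row uniformly, which is one way to read the abstract's remark that the max-norm is the stricter regularizer.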