MSRC Abstracts

Session I: Theory

Gaussian Extreme Values and Optimal Thresholding in Sparse Signal
Recovery - Anirban DasGupta, Purdue University

In the problem of doing formal inference with microarray data, Donoho and Jin (2004) provided a Gaussian mixture model to represent a small fraction of a total of n observations which contain a possibly detectable signal. A simple thresholding rule is to flag an observation if it exceeds σ√(2 log n), where σ is the inherent scaling constant (standard deviation). We prove that with the σ√(2 log n) thresholding sequence, not only the false discovery rate (FDR), but even the total number of false discoveries (F), converges in probability to zero. But a perfect score on the false discovery front is obtained in exchange for missing all the true signals asymptotically. This leads us to seek adjusted thresholding sequences that strike a better balance between the false discovery rate (FDR) and the signal recovery rate (RSR). We prove that by adjusting the thresholding sequence to c_n = √(2 log n) − (log log n + C)/(2√(2 log n)) for suitable C we can achieve such a balance. As a sample, we prove that we can choose C so that F is asymptotically Poisson; this improves the convergence rate of RSR in a precise sense, which is made explicit. In general, inside the Donoho-Jin detection boundary, we cannot have a nonzero limit for RSR unless we let the FDR converge to one, which is unacceptable. With more radically adjusted thresholding sequences, we can attain the best possible convergence rate for the RSR, subject to the constraint that the FDR converges to zero at a user-provided asymptotic rate. This is also made explicit. These adjusted thresholding sequences arise out of work on Gaussian extremes in DasGupta, Lahiri, and Stoyanov (2013).
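A minimal simulation sketch of the two thresholding rules discussed above. All numeric choices (signal fraction, signal mean, the constant C) are illustrative, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
eps = 0.01      # fraction of observations carrying a signal (illustrative)
mu = 4.0        # signal mean (illustrative)
sigma = 1.0

is_signal = rng.random(n) < eps
x = rng.normal(0.0, sigma, n) + mu * is_signal

# Universal threshold sigma * sqrt(2 log n): flags almost no nulls,
# but asymptotically misses the true signals as well.
t_univ = sigma * np.sqrt(2 * np.log(n))

# Adjusted threshold c_n = sqrt(2 log n) - (log log n + C) / (2 sqrt(2 log n))
C = 1.0
r = np.sqrt(2 * np.log(n))
t_adj = sigma * (r - (np.log(np.log(n)) + C) / (2 * r))

for name, t in [("universal", t_univ), ("adjusted", t_adj)]:
    flagged = x > t
    false_disc = int(np.sum(flagged & ~is_signal))     # the quantity F
    recovery = np.sum(flagged & is_signal) / max(np.sum(is_signal), 1)
    print(name, "F =", false_disc, "recovery rate =", round(recovery, 3))
```

The adjusted threshold sits slightly below the universal one, trading a few extra false discoveries for a better signal recovery rate.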

Self-normalization - Xiaofeng Shao, University of Illinois

Self-normalization has a long history in statistics, dating back to the work of "Student" (1908), who introduced the celebrated t-statistic. In this talk, we will focus on some recent developments of self-normalization in the time series setting. Specifically, we will talk about two inference problems for time series data: confidence interval construction and change point detection. The main features of self-normalization will be highlighted and its performance will be compared to some existing approaches through theory and simulations. If time permits, we will briefly mention some extensions and possible future work.
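As a concrete illustration of the idea, here is a sketch of one self-normalized statistic for the mean of a time series: the normalizer is built from recursive sample means, so no long-run variance estimate or bandwidth choice is needed. The AR(1) example data and the function name are illustrative:

```python
import numpy as np

def sn_statistic(x, mu0=0.0):
    """Self-normalized statistic for testing mean = mu0 in a time series.
    The normalizer W uses recursive sample means, so the unknown long-run
    variance cancels asymptotically without being estimated."""
    n = len(x)
    xbar = x.mean()
    # recursive means: mean of the first t observations, t = 1..n
    rec_means = np.cumsum(x) / np.arange(1, n + 1)
    t = np.arange(1, n + 1)
    W = np.sum(t**2 * (rec_means - xbar) ** 2) / n**2
    return n * (xbar - mu0) ** 2 / W

# Illustrative AR(1) series with mean zero
rng = np.random.default_rng(1)
e = rng.normal(size=2000)
x = np.empty(2000)
x[0] = e[0]
for i in range(1, 2000):
    x[i] = 0.5 * x[i - 1] + e[i]
print(sn_statistic(x))  # compared against a nonstandard (pivotal) limit law
```

The price of self-normalization is a nonstandard limiting distribution, whose critical values are tabulated once and for all.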

Session II: Statistical Computing

Aster Models with Random Effects - Charles J. Geyer, University of Minnesota

Aster models (Geyer, et al., Biometrika, 2007) generalize generalized linear models (GLMs), allowing different components of the response vector to have different distributions (some Bernoulli, some Poisson, some zero-truncated Poisson, some normal) and components of the response vector to have dependence specified by simple graphical models. They are used for life history analysis of plants and animals, or whenever survival is part of the response but interest is in what happens after survival. The R package aster has been on CRAN since 2005.

GLMs with random effects (generalized linear mixed models, GLMMs) are very popular, but problematic. The likelihood can be evaluated by numerical integration for simple GLMMs but not for more complicated ones, which require various approximations, none very reliable. When one bootstraps a GLMM, an appreciable fraction of iterations fail to converge, so one cannot get P-values.

Aster models with random effects inherit all these problems, but version 0.8-20 of the R package added random effects because scientists need them. The implementation is mostly based on Laplace approximation ideas in Breslow and Clayton (JASA, 1993), with two innovations: calculating approximate observed Fisher information using the implicit function theorem, and a test for zero variance components that uses the theory of constrained optimization.
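A toy sketch of the Laplace-approximation idea for a marginal likelihood: a single random intercept in a logistic model, integrated out by expanding the joint log density around its mode. Function and variable names are hypothetical; this is the generic Breslow-Clayton-style device, not the aster implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_loglik(y, sigma2):
    """Laplace approximation to log p(y) for the toy model
    y_i ~ Bernoulli(logit^-1(b)), b ~ N(0, sigma2)."""
    def neg_joint(b):
        # -log p(y | b) - log p(b)  (dropping the Gaussian constant)
        return np.sum(np.log1p(np.exp(b)) - y * b) + b**2 / (2 * sigma2)

    res = minimize_scalar(neg_joint)          # mode of the joint density
    b_hat = res.x
    p = 1 / (1 + np.exp(-b_hat))
    hess = len(y) * p * (1 - p) + 1 / sigma2  # curvature at the mode
    # log ∫ exp(-neg_joint) db ≈ -neg_joint(b_hat) + 0.5 log(2π / hess),
    # restoring the Gaussian prior constant:
    return -res.fun + 0.5 * np.log(2 * np.pi / hess) - 0.5 * np.log(2 * np.pi * sigma2)

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(laplace_loglik(y, 1.0))
```

Maximizing such an approximate marginal likelihood over the variance components is the basic strategy; the innovations in the talk concern the Fisher information of, and boundary tests on, exactly these components.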

Particle learning for low counts in disease outbreaks - Jarad Niemi, Iowa State University

We consider a measles outbreak in Harare, Zimbabwe in 2009-2010, in which a total of 156 individual cases were confirmed in the weekly recorded surveillance system over the course of a year. We model the data with an epidemiological compartment model with observations only on the transitions to infectious, i.e. showing symptoms and contagious. We present methodology called particle learning for sequentially estimating both the state of the system, i.e. the current number of infectious individuals, and the static parameters, i.e. the rate of infectivity. Particle learning is a sequential Monte Carlo method, a.k.a. particle filter, that utilizes the sufficient statistic structure contained in the model to combat particle degeneracy. We conclude with an application of this methodology to the measles outbreak in Harare.
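For intuition, a minimal bootstrap particle filter for a discrete-time stochastic SIR model with Poisson-observed weekly case counts. This is the plain filter only: particle learning additionally carries sufficient statistics so the static parameters can be updated, whereas here beta and gamma are fixed; all parameter values and the case counts are made up for illustration:

```python
import numpy as np

def bootstrap_filter(y, n_particles=1000, beta=0.5, gamma=0.3, pop=10_000, seed=0):
    """Track the number of infectious individuals given weekly counts y."""
    rng = np.random.default_rng(seed)
    S = np.full(n_particles, pop - 10.0)   # susceptible
    I = np.full(n_particles, 10.0)         # infectious
    filtered_I = []
    for obs in y:
        # propagate: binomial transitions S -> I and I -> R
        p_inf = 1 - np.exp(-beta * I / pop)
        new_inf = rng.binomial(S.astype(int), p_inf)
        new_rec = rng.binomial(I.astype(int), 1 - np.exp(-gamma))
        S = S - new_inf
        I = I + new_inf - new_rec
        # weight: Poisson log-density of the observed count (constant dropped)
        lam = np.maximum(new_inf, 1e-9)
        logw = obs * np.log(lam) - lam
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # resample to combat particle degeneracy
        idx = rng.choice(n_particles, n_particles, p=w)
        S, I = S[idx], I[idx]
        filtered_I.append(I.mean())
    return np.array(filtered_I)

cases = [2, 5, 9, 14, 20, 18, 12, 7]   # made-up weekly counts
print(bootstrap_filter(cases))
```

With low counts, many particles receive near-zero weight at each step, which is exactly the degeneracy that the sufficient-statistic structure in particle learning is designed to mitigate.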

Session III: Applications of Statistics

Clustering Cancer Mortality Curves of U.S. States Based On Change Points - Chae Young Lim, Michigan State University

Statistical methods for analyzing disease incidence or mortality data over geographical regions and time have gained considerable interest in recent years due to increasing concerns about public health, health disparities, and legitimate resource allocation. Trend analysis of cancer incidence or mortality rates is needed for subsequent investigation in public health. For example, the National Cancer Institute provides software to fit statistical models that track changes in cancer curves. The currently available models for detecting changes over time are designed for a single curve. When there are multiple curves, the current methods can be applied multiple times; however, this may not be efficient in a statistical sense. Further, the interest could be in grouping multiple curves based on their change points and capturing possible spatial dependence when the curves are observed over geographical regions. This talk introduces a statistical model that allows concurrent estimation of the change points of multiple curves and grouping of those curves based on common changes over time, while incorporating heterogeneous variability over time and possible spatial dependence among geographically observed curves. The Bayesian analysis is carried out by eliciting a Dirichlet process prior on the relevant functional space to model change points. The resulting posterior is shown to be valid and proper. The age-adjusted lung cancer mortality rates of U.S. states are analyzed to detect change points and rates of change on each curve, as well as clusters of states that share a similar trend over time.
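As a deliberately simplified baseline, here is the single-curve version of the problem: a least-squares estimate of one change point in one rate curve, the setting that the talk's joint Bayesian model generalizes to many curves with clustering and spatial dependence. The data are made up:

```python
import numpy as np

def single_change_point(y):
    """Return the split index k minimizing the within-segment sum of
    squares when the curve is modeled as piecewise-constant with one
    change point between positions k-1 and k."""
    y = np.asarray(y, float)
    n = len(y)
    best_k, best_sse = None, np.inf
    for k in range(1, n):
        sse = (((y[:k] - y[:k].mean()) ** 2).sum()
               + ((y[k:] - y[k:].mean()) ** 2).sum())
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

rates = [50, 51, 49, 50, 44, 43, 42, 43]   # made-up age-adjusted rates
print(single_change_point(rates))  # -> 4
```

Running such a procedure separately on each state's curve ignores shared change points and spatial dependence, which is precisely the inefficiency the joint model addresses.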

Advances in simulation-based inference for stochastic dynamic systems - Edward Ionides, University of Michigan

Characteristic features of biological dynamic systems include stochasticity, nonlinearity, measurement error, unobserved variables, unknown system parameters, and even unknown system mechanisms. The resulting inferential challenges will be discussed, with specific applications to ecology and epidemiology. Examples will include transmission of malaria and measles, and a longitudinal study of dynamic variation in sexual behaviors. It is convenient to use statistical inference methodology which is based on simulations from a numerical model. Methodology with this property is said to be plug-and-play. Several plug-and-play approaches have recently been developed for statistically efficient inference on partially observed stochastic dynamic systems. Plug-and-play methodology frees the scientist from an obligation to work with models for which transition probabilities are analytically tractable. The examples will demonstrate how this framework for modeling and inference facilitates asking and answering questions about biological systems.

Session IV: Graduate Student Talks

Network Granger Causality with Inherent Grouping Structure - Sumanta Basu, University of Michigan

The problem of estimating high-dimensional network models arises naturally in the analysis of many biological and socio-economic systems. In this work, we aim to learn a network structure from temporal panel data, employing the framework of Granger causal models under the assumptions of sparsity of its edges and an inherent grouping structure among its nodes. To that end, we introduce a group lasso regression regularization framework, and also examine a thresholded variant to address the issue of group misspecification. Further, the norm consistency and variable selection consistency of the estimates are established, the latter under the novel concept of direction consistency. The performance of the proposed methodology is assessed through an extensive set of simulation studies and comparisons with existing techniques. The study is illustrated with two motivating examples coming from functional genomics and financial econometrics.
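A sketch of the generic group-lasso idea in the Granger setting: regress one series on lagged values of every series, with the lags of each candidate parent forming one group, so a node's incoming edge is selected or zeroed as a whole. This is an illustrative proximal-gradient solver, not the paper's estimator or its thresholded variant:

```python
import numpy as np

def group_lasso_granger(Y, target, lam, lags=2, n_iter=300):
    """Group-lasso estimate of the Granger parents of one node."""
    T, p = Y.shape
    # design matrix: columns j*lags..(j+1)*lags-1 hold the lags of node j
    cols = [Y[lags - l:T - l, j] for j in range(p) for l in range(1, lags + 1)]
    X = np.column_stack(cols)
    y = Y[lags:, target]
    b = np.zeros(p * lags)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    for _ in range(n_iter):
        b -= step * (X.T @ (X @ b - y))      # gradient step on squared loss
        for j in range(p):                   # block soft-threshold (the prox)
            g = slice(j * lags, (j + 1) * lags)
            nrm = np.linalg.norm(b[g])
            b[g] *= max(0.0, 1 - step * lam / nrm) if nrm > 0 else 0.0
    return b

# Toy network: series 0 is driven by series 1 only
rng = np.random.default_rng(2)
T, p = 400, 4
Y = rng.normal(scale=0.5, size=(T, p))
for t in range(1, T):
    Y[t, 0] += 0.8 * Y[t - 1, 1]
b = group_lasso_granger(Y, target=0, lam=20.0)
group_norms = np.linalg.norm(b.reshape(p, 2), axis=1)
print(np.nonzero(group_norms > 1e-8)[0])   # estimated parents of node 0
```

Because the penalty acts on whole groups, either all lags of a candidate parent enter the model or none do, which is what makes the recovered edge set interpretable as a network.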

Optimal Sparse Volatility Matrix Estimation for High Dimensional Itô Processes With Measurement Errors - Minjing Tao, University of Wisconsin-Madison

Stochastic processes are often used to model complex scientific problems in fields ranging from biology and finance to engineering and physical science. This talk investigates rate-optimal estimation of the volatility matrix of a high dimensional Itô process observed with measurement errors at discrete time points. The minimax rate of convergence is established for estimating sparse volatility matrices. By combining the multi-scale and threshold approaches, we construct a volatility matrix estimator that achieves the optimal convergence rate. The minimax lower bound is derived by considering a subclass of Itô processes for which the lower bound is obtained through a novel equivalent model of covariance matrix estimation for independent but non-identically distributed observations, and through a delicate construction of the least favorable parameters. In addition, a simulation study is conducted to examine the finite-sample performance of the optimal estimator, and the results support the established asymptotic theory.
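A simplified sketch of the thresholding step alone: realized covariance computed from discretely observed paths, with small off-diagonal entries set to zero to exploit sparsity. The multi-scale averaging that corrects for measurement error in the talk's estimator is omitted, and the simulated paths are purely illustrative:

```python
import numpy as np

def thresholded_realized_cov(Y, tau):
    """Realized covariance of the columns of Y (rows = time points),
    with entries smaller than tau in absolute value zeroed out."""
    dY = np.diff(Y, axis=0)            # increments of the observed paths
    V = dY.T @ dY                      # realized (co)variation matrix
    Vt = np.where(np.abs(V) >= tau, V, 0.0)
    np.fill_diagonal(Vt, np.diag(V))   # keep the variances untouched
    return Vt

# Toy example: 5 nearly independent Brownian paths on [0, 1]
rng = np.random.default_rng(3)
n, p = 1000, 5
Y = np.cumsum(rng.normal(scale=np.sqrt(1 / n), size=(n, p)), axis=0)
V = thresholded_realized_cov(Y, tau=0.05)
print(np.round(V, 2))
```

Choosing the threshold tau of the right asymptotic order, together with the noise correction, is what yields the minimax-optimal rate in the sparse regime.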