Statistics Technical Reports

Title:Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments
Author(s):Miratrix, Luke W.; Sekhon, Jasjeet S.; Yu, Bin; 
Date issued:November 2011 (PDF)
Abstract:Experimenters often use post-stratification to adjust estimates. Post-stratification is akin to blocking, except that the number of treated units in each stratum is a random variable because stratification occurs after treatment assignment. We analyze both post-stratification and blocking under the Neyman model and compare the efficiency of these designs. We derive the variances for a post-stratified estimator and a simple difference-in-means estimator under different randomization schemes. Post-stratification is nearly as efficient as blocking: the difference in their variances is on the order of $1/n^2$, provided the treatment proportion is not too close to 0 or 1. Post-stratification is therefore a reasonable alternative to blocking when the latter is not feasible. However, in finite samples, post-stratification can increase variance if the number of strata is large and the strata are poorly chosen. To examine why the estimators’ variances are different, we extend our results by conditioning on the observed number of treated units in each stratum. Conditioning also provides more accurate variance estimates because it takes into account how close (or far) a realized random sample is from a comparable blocked experiment. We then show that the practical substance of our results remains under an infinite population sampling model. Finally, we provide an analysis of an actual experiment to illustrate our analytical results.
Keyword note:Miratrix__Luke Sekhon__Jasjeet_S Yu__Bin
Report ID:809
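Illustration for Report 809: the two estimators compared in the abstract above can be sketched in a few lines. This is a minimal sketch, not the authors' code. The function names and the toy data are hypothetical, and the post-stratified estimator is taken to be the usual stratum-size-weighted average of within-stratum differences in means, which is an assumption about the exact form used in the report.

```python
from collections import defaultdict
from statistics import mean


def difference_in_means(y, z):
    """Simple difference in means: mean treated outcome minus mean control outcome."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    # statistics.mean raises StatisticsError if either group is empty.
    return mean(treated) - mean(control)


def post_stratified(y, z, strata):
    """Weighted average of within-stratum differences in means.

    Weights are stratum sizes divided by the total sample size (assumed form).
    The estimate is undefined if a stratum lacks treated or control units.
    """
    groups = defaultdict(list)
    for yi, zi, si in zip(y, z, strata):
        groups[si].append((yi, zi))
    n = len(y)
    estimate = 0.0
    for units in groups.values():
        ys = [u[0] for u in units]
        zs = [u[1] for u in units]
        estimate += len(units) / n * difference_in_means(ys, zs)
    return estimate


# Hypothetical toy data: control outcome is 1 in stratum "a" and 10 in
# stratum "b"; treatment adds exactly 2 to every unit.
strata = ["a", "a", "a", "b", "b", "b", "b", "b"]
z = [1, 0, 0, 1, 1, 1, 0, 0]
y = [3, 1, 1, 12, 12, 12, 10, 10]

sdim = difference_in_means(y, z)
ps = post_stratified(y, z, strata)
print(sdim, ps)
```

In this toy assignment the treated group over-represents stratum "b", so the simple difference in means is pulled away from the true effect of 2 while the post-stratified estimate recovers it. This shows a single realization only; it says nothing about the variance comparison the report derives.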

Title:Killed Brownian motion with a prescribed lifetime distribution and models of default
Author(s):Ettinger, Boris; Evans, Steven N.; Hening, Alexandru; 
Date issued:November 2011 (PDF)
Abstract:The inverse first passage time problem asks whether, for a Brownian motion $B$ and a nonnegative random variable $\zeta$, there exists a time-varying barrier $b$ such that $\mathbb{P}\{B_s > b(s), \, 0 \le s \le t\} = \mathbb{P}\{\zeta > t\}$. We study a "smoothed" version of this problem and ask whether there is a "barrier" $b$ such that $\mathbb{E}[\exp(-\lambda \int_0^t \psi(B_s - b(s)) \, ds)] = \mathbb{P}\{\zeta > t\}$, where $\lambda$ is a killing rate parameter and $\psi: \mathbb{R} \to [0,1]$ is a non-increasing function. We prove that if $\psi$ is suitably smooth, the function $t \mapsto \mathbb{P}\{\zeta > t\}$ is twice continuously differentiable, and the condition $0 < -\frac{d \log \mathbb{P}\{\zeta > t\}}{dt} < \lambda$ holds for the hazard rate of $\zeta$, then there exists a unique continuously differentiable function $b$ solving the smoothed problem. We show how this result leads to flexible models of default for which it is possible to compute expected values of contingent claims.
Keyword note:Ettinger__Boris Evans__Steven_N Hening__Alexandru
Report ID:808

Title:Phylogenetic analyses of alignments with gaps
Author(s):Evans, Steven N.; Warnow, Tandy; 
Date issued:October 2011 (PDF)
Abstract:Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated. We prove that treating indels as missing data can be inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. We also prove that the true tree can be identified solely from the pattern of gaps in the true alignment (that is, character states can be ignored). Our results show that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. Moreover, the pattern of gaps in an accurate alignment may give substantial information about the underlying phylogeny, over and above what is present in the character states. These observations suggest that the recent focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.
Keyword note:Evans__Steven_N Warnow__Tandy
Report ID:807

Title:Lipschitz minorants of Brownian Motion and Levy processes
Author(s):Abramson, Joshua; Evans, Steven N.; 
Date issued:October 2011 (PDF)
Abstract:For $\alpha > 0$, the $\alpha$-Lipschitz minorant of a function $f: \mathbb{R} \to \mathbb{R}$ is the greatest function $m : \mathbb{R} \to \mathbb{R}$ such that $m \leq f$ and $|m(s)-m(t)| \le \alpha |s-t|$ for all $s,t \in \mathbb{R}$, should such a function exist. If $X=(X_t)_{t \in \mathbb{R}}$ is a real-valued L\'evy process that is not pure linear drift with slope $\pm \alpha$, then the sample paths of $X$ have an $\alpha$-Lipschitz minorant almost surely if and only if $| \mathbb{E}[X_1] | < \alpha$. Denoting the minorant by $M$, we investigate properties of the random closed set $\mathcal{Z} := \{t \in \mathbb{R} : M_t = X_t \wedge X_{t-}\}$, which, since it is regenerative and stationary, has the distribution of the closed range of some subordinator "made stationary" in a suitable sense. We give conditions for the contact set $\mathcal{Z}$ to be countable or to have zero Lebesgue measure, and we obtain formulas that characterize the L\'evy measure of the associated subordinator. We study the limit of $\mathcal{Z}$ as $\alpha \to \infty$ and find for the so-called abrupt L\'evy processes introduced by Vigon that this limit is the set of local infima of $X$. When $X$ is a Brownian motion with drift $\beta$ such that $|\beta| < \alpha$, we calculate explicitly the densities of various random variables related to the minorant.
Keyword note:Abramson__Joshua Evans__Steven_N
Report ID:806
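Illustration for Report 806: the definition of the $\alpha$-Lipschitz minorant in the abstract above can be made concrete on a finite grid. This is a minimal sketch under stated assumptions: the function is sampled at finitely many points (the report works on all of $\mathbb{R}$), and the minorant is computed with the standard inf-convolution formula $m(t) = \min_s \{ f(s) + \alpha |t-s| \}$, which is a general fact about Lipschitz minorants and is not quoted from the report. The function name and sample values are hypothetical.

```python
def lipschitz_minorant(ts, fs, alpha):
    """Greatest alpha-Lipschitz function on the grid `ts` lying below `fs`.

    Uses m(t) = min over grid points s of f(s) + alpha * |t - s|.
    Runs in O(n^2) time, which is fine for a small illustrative grid.
    """
    return [min(f + alpha * abs(t - s) for s, f in zip(ts, fs)) for t in ts]


# Hypothetical sample path on an integer grid.
ts = [0, 1, 2, 3, 4]
fs = [0, 5, 1, 4, 0]
m = lipschitz_minorant(ts, fs, alpha=1)

# Grid analogue of the contact set: points where the minorant touches f.
contact = [t for t, f, mi in zip(ts, fs, m) if f == mi]
print(m, contact)
```

The `contact` list is only a discrete stand-in for the random closed set $\mathcal{Z}$ studied in the report, which is defined using both $X_t$ and its left limit $X_{t-}$.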

Title:A limit theorem for occupation measures of Levy processes in compact groups
Author(s):Berger, Arno; Evans, Steven N.; 
Date issued:September 2011 (PDF)
Abstract:A short proof is given of a necessary and sufficient condition for the normalized occupation measure of a Levy process in a metrizable compact group to be asymptotically uniform with probability one.
Keyword note:Berger__Arno Evans__Steven_N
Report ID:805

Title:Estimation and correction for GC-content bias in high throughput sequencing
Author(s):Benjamini, Yuval; Speed, Terence P.; 
Date issued:June 2011 (PDF)
Abstract:GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing assays, particularly the Illumina Genome Analyzer technology. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation. The bias is not consistent between samples, and current methods to remove it in a single sample do not assume any knowledge of the curve shape or scale. In this work we analyze regularities in the GC-bias patterns, and find a compact description for this curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC rich fragments and AT rich fragments are under-represented in the sequencing results. Based on these findings, we propose a new method to calculate predicted coverage and correct for the bias. This parsimonious model produces single bp prediction which suffices to predict the GC effect on fragment coverage at all scales, all chromosomes and for both strands; this allows optimal GC-effect correction regardless of the downstream smoothing or binning. We demonstrate our model's potential for improving on current approaches to copy-number estimation. These GC-modeling considerations can also inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq. Finally, our analysis provides empirical evidence strengthening the hypothesis that PCR is the most important cause of the GC bias.
Keyword note:Benjamini__Yuval Speed__Terry_P
Report ID:804

Title:Stochastic equations on projective systems of groups
Author(s):Evans, Steven N.; Gordeeva, Tatyana; 
Date issued:June 2011 (PDF)
Abstract:We consider stochastic equations of the form $X_k = \phi_k(X_{k+1}) Z_k$, $k \in \mathbb{N}$, where $X_k$ and $Z_k$ are random variables taking values in a compact group $G_k$, $\phi_k: G_{k+1} \to G_k$ is a continuous homomorphism, and the noise $(Z_k)_{k \in \mathbb{N}}$ is a sequence of independent random variables. We take the sequence of homomorphisms and the sequence of noise distributions as given, and investigate what conditions on these objects result in a unique distribution for the "solution" sequence $(X_k)_{k \in \mathbb{N}}$ and what conditions permit the existence of a solution sequence that is a function of the noise alone (that is, the solution does not incorporate extra input randomness "at infinity"). Our results extend previous work on stochastic equations on a single group that was originally motivated by Tsirelson's example of a stochastic differential equation that has a unique solution in law but no strong solutions.
Keyword note:Evans__Steven_N Gordeeva__Tatyana
Report ID:803

Title:Stochastic population growth in spatially heterogeneous environments
Author(s):Evans, Steven N.; Ralph, Peter L.; Schreiber, Sebastian J.; Sen, Arnab; 
Date issued:May 2011 (PDF)
Abstract:Classical ecological theory predicts that environmental stochasticity increases extinction risk by reducing the average per-capita growth rate of populations. To understand the interactive effects of environmental stochasticity, spatial heterogeneity, and dispersal on population growth, we study the following model for population abundances in $n$ patches: the conditional law of $X_{t+dt}$ given $X_t=x$ is such that when $dt$ is small the conditional mean of $X_{t+dt}^i-X_t^i$ is approximately $[x^i\mu_i+\sum_j(x^j D_{ji}-x^i D_{ij})]dt$, where $X_t^i$ and $\mu_i$ are the abundance and per-capita growth rate in the $i$-th patch, respectively, and $D_{ij}$ is the dispersal rate from the $i$-th to the $j$-th patch, and the conditional covariance of $X_{t+dt}^i-X_t^i$ and $X_{t+dt}^j-X_t^j$ is approximately $x^i x^j \sigma_{ij}dt$. We show for such a spatially extended population that if $S_t=(X_t^1+\cdots+X_t^n)$ is the total population abundance, then $Y_t=X_t/S_t$, the vector of patch proportions, converges in law to a random vector $Y_\infty$ as $t\to\infty$, and the stochastic growth rate $\lim_{t\to\infty}t^{-1}\log S_t$ equals the space-time average per-capita growth rate $\sum_i\mu_i\mathbb{E}[Y_\infty^i]$ experienced by the population minus half of the space-time average temporal variation $\mathbb{E}[\sum_{i,j}\sigma_{ij}Y_\infty^i Y_\infty^j]$ experienced by the population. We derive analytic results for the law of $Y_\infty$, find which choice of the dispersal mechanism $D$ produces an optimal stochastic growth rate for a freely dispersing population, and investigate the effect on the stochastic growth rate of constraints on dispersal rates. Our results provide fundamental insights into "ideal free" movement in the face of uncertainty, the persistence of coupled sink populations, the evolution of dispersal rates, and the single large or several small (SLOSS) debate in conservation biology.
Keyword note:Evans__Steven_N Ralph__Peter Schreiber__Sebastian_J Sen__Arnab
Report ID:802
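Illustration for Report 802: a quick sanity check of the growth-rate formula in the abstract above is the single-patch case. The sketch below assumes the model can be read as an Itô stochastic differential equation driven by a standard Brownian motion $B$; the abstract itself only specifies the conditional mean and covariance, so this reading is an assumption. With $n = 1$ there is no dispersal and $Y_\infty = 1$, and the formula reduces to the familiar growth rate of geometric Brownian motion.

```latex
% Single patch (n = 1): S_t = X_t^1, Y_\infty \equiv 1, no dispersal terms.
% Assumed SDE reading of the model, with \sigma_{11} the variance parameter:
\[
  dS_t = \mu_1 S_t \, dt + \sqrt{\sigma_{11}} \, S_t \, dB_t .
\]
% Ito's formula applied to \log S_t gives
\[
  \log S_t = \log S_0 + \Bigl( \mu_1 - \tfrac{1}{2}\sigma_{11} \Bigr) t
             + \sqrt{\sigma_{11}} \, B_t ,
\]
% and since B_t / t \to 0 almost surely,
\[
  \lim_{t \to \infty} t^{-1} \log S_t
    = \mu_1 - \tfrac{1}{2}\sigma_{11}
    = \sum_i \mu_i \, \mathbb{E}[Y_\infty^i]
      - \tfrac{1}{2} \, \mathbb{E}\Bigl[ \sum_{i,j} \sigma_{ij} Y_\infty^i Y_\infty^j \Bigr]
  \quad \text{with } n = 1 .
\]
```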

Title:Summarizing large-scale, multiple-document news data: sparse methods & human validation
Author(s):Miratrix, Luke; Jia, Jinzhu; Gawalt, Brian; Yu, Bin; El Ghaoui, Laurent; 
Date issued:May 2011 (PDF)
Abstract:News media significantly drives the course of events. Understanding how has long been an active and important area of research. Now, as the amount of online news media available grows, there is even more information calling for analysis, an ever increasing range of inquiry that one might conduct. We believe subject-specific summarization of multiple news documents at once can help. In this paper we adapt scalable statistical techniques to perform this summarization under a predictive framework using a vector space model of documents. We reduce corpora of many millions of words to a few representative key-phrases that describe a specified subject of interest. We propose this as a tool for news media study. We consider the efficacies of four different feature selection approaches---phrase co-occurrence, phrase correlation, $L^1$ regularized logistic regression (L1LR), and $L^1$ regularized linear regression (Lasso)---under many different pre-processing choices. To evaluate these different summarizers we establish a survey by which non-expert human readers rate generated summaries. Data pre-processing decisions are important; we also study the impact of several different techniques for vectorizing the documents, and identifying which documents concern a subject. We find that the Lasso, which consistently produces high-quality summaries across the many pre-processing schemes and subjects, is the best choice of feature selection engine. Our findings also reinforce the many years of work suggesting the tf-idf representation is a strong choice of vector space, but only for longer units of text. Though we focus here on print media (newspapers), our methods are general and could be applied to any corpora, even ones of considerable size.
Keyword note:Miratrix__Luke Jia__Jinzhu Gawalt__Brian Yu__Bin El__Ghaoui__Laurent
Report ID:801

Title:Using Control Genes to Correct for Unwanted Variation in Microarray Data
Author(s):Gagnon-Bartsch, Johann A.; Speed, Terence P.; 
Date issued:March 2011 (PDF)
Abstract:Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method, and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and SVA. We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression, and conclude that there may be promise, but substantial challenges remain.
Keyword note:Gagnon-Bartsch__Johann_A Speed__Terry_P
Report ID:800

Title:Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison
Author(s):Matsen, Frederick A.; Evans, Steven N.; 
Date issued:March 2011 (PDF)
Abstract:It is becoming increasingly common to analyze collections of sequence reads by first assigning each read to a location on a phylogenetic tree. In parallel, quantitative methods are being developed to compare samples of reads using the information provided by such phylogenetic placements: one example is the phylogenetic Kantorovich-Rubinstein (KR) metric which calculates a distance between pairs of samples using the evolutionary distances between the assigned positions of the reads on the phylogenetic tree. The KR distance generalizes the weighted UniFrac metric. Classical, general-purpose ordination and clustering methods can be applied to KR distances, but we argue that more interesting and interpretable results are produced by two new methods that leverage the special structure of phylogenetic placement data. Edge principal components analysis enables the detection of important differences between samples containing closely related taxa and allows the visualization of the principal component axes in terms of edges of the phylogenetic tree. Squash clustering produces informative internal edge lengths for clustering trees by incorporating distances between averages of samples, rather than the averages of distances between samples used in general-purpose procedures such as UPGMA. We present these methods and illustrate their use with data from the microbiome of the human vagina.
Keyword note:Matsen__Frederick_A Evans__Steven_N
Report ID:799

Title:Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability
Author(s):Boettiger, Alistair N.; Ralph, Peter L.; Evans, Steven N.; 
Date issued:March 2011 (PDF)
Abstract:Recent whole genome polymerase binding assays in the Drosophila embryo have shown that a large proportion of unexpressed genes have pre-assembled RNA pol II transcription initiation complex stably bound to their promoters. These constitute a subset of promoter proximally paused genes which are regulated at transcription elongation rather than at initiation, and it has been proposed that this difference allows these genes to both express faster and achieve more synchronous expression across populations of cells, thus overcoming the molecular "noise" arising from low copy number factors. Promoter-proximal pausing is observed mainly in metazoans, in accord with its posited role in synchrony. Regulating gene expression by controlling release from a promoter paused state instead of by regulating access of the polymerase to the promoter DNA can be described as a rearrangement of the regulatory topology so that it controls transcriptional elongation rather than transcriptional initiation. It has been established experimentally that genes which are regulated at elongation tend to express faster and more synchronously; however, it has not been shown directly whether or not it is the change in the regulated step per se that causes this increase in speed and synchrony. We investigate this question by proposing and analyzing a continuous-time Markov chain model of polymerase complex assembly regulated at one of two steps: initial polymerase association with DNA, or release from a paused, transcribing state. Our analysis demonstrates that, over a wide range of physical parameters, increased speed and synchrony are functional consequences of elongation control. Further, we make new predictions about the effect of elongation regulation on the consistent control of total transcript number between cells, and identify which elements in the transcription induction pathway are most sensitive to molecular noise and thus may be most evolutionarily constrained. 
Our methods produce symbolic expressions for quantities of interest with reasonable computational effort and can be used to explore the interplay between interaction topology and molecular noise in a broader class of biochemical networks. We provide general-purpose code implementing these methods.
Keyword note:Boettiger__Alistair_N Ralph__Peter Evans__Steven_N
Report ID:798