Statistics Technical Reports

Title:Unseparated pairs and fixed points in random permutations
Author(s):Diaconis, Persi; Evans, Steven N.; Graham, Ron; 
Date issued:August 2013 (PDF)
Abstract:In a uniform random permutation \Pi of [n] := {1,2,...,n}, the set of elements k in [n-1] such that \Pi(k+1) = \Pi(k) + 1 has the same distribution as the set of fixed points of \Pi that lie in [n-1]. We give three different proofs of this fact using, respectively, an enumeration relying on the inclusion-exclusion principle, the introduction of two different Markov chains to generate uniform random permutations, and the construction of a combinatorial bijection. We also obtain the distribution of the analogous set for circular permutations that consists of those k in [n] such that \Pi(k+1 mod n) = \Pi(k) + 1 mod n. This latter random set is just the set of fixed points of the commutator [\rho, \Pi], where \rho is the n-cycle (1,2,...,n). We show for a general permutation \eta that, under weak conditions on the number of fixed points and 2-cycles of \eta, the total variation distance between the distribution of the number of fixed points of [\eta,\Pi] and a Poisson distribution with expected value 1 is small when n is large.
Keyword note:Diaconis__Persi Evans__Steven_N Graham__Ron
Report ID:822
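
The first claim in the abstract above (report 822) can be checked by brute force for small n. The following is a minimal sketch, not one of the report's three proofs: it enumerates all n! permutations and compares how often each subset of [n-1] arises as the set of unseparated pairs and as the set of fixed points in [n-1]. The function names are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def unseparated_pairs(perm):
    """Positions k in [n-1] (1-based) with perm(k+1) = perm(k) + 1."""
    n = len(perm)
    # perm is a 0-indexed tuple, so perm[k - 1] is Pi(k) and perm[k] is Pi(k+1).
    return frozenset(k for k in range(1, n) if perm[k] == perm[k - 1] + 1)


def fixed_points_below_n(perm):
    """Fixed points k of perm (1-based) that lie in [n-1]."""
    n = len(perm)
    return frozenset(k for k in range(1, n) if perm[k - 1] == k)


def set_distributions(n):
    """Count, over all n! permutations of [n], how often each set occurs."""
    pairs = Counter()
    fixed = Counter()
    for perm in permutations(range(1, n + 1)):
        pairs[unseparated_pairs(perm)] += 1
        fixed[fixed_points_below_n(perm)] += 1
    return pairs, fixed


if __name__ == "__main__":
    for n in range(2, 8):
        pairs, fixed = set_distributions(n)
        # Equal counters mean the two random sets have the same distribution.
        print(n, pairs == fixed)
```

Exhaustive enumeration is only feasible for small n, but it compares the full distribution of the random set rather than just the number of its elements.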

Title:Analysis and rejection sampling of Wright-Fisher diffusion bridges
Author(s):Schraiber, Joshua G.; Griffiths, Robert C.; Evans, Steven N.; 
Date issued:June 2013 (PDF)
Abstract:We investigate the properties of a Wright–Fisher diffusion process starting at frequency x at time 0 and conditioned to be at frequency y at time T. Such a process is called a bridge. Bridges arise naturally in the analysis of selection acting on standing variation and in the inference of selection from allele frequency time series. We establish a number of results about the distribution of neutral Wright–Fisher bridges and develop a novel rejection-sampling scheme for bridges under selection that we use to study their behavior.
Keyword note:Schraiber__Joshua_G Griffiths__Robert_C Evans__Steven_N
Report ID:821
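
To make the notion of a bridge in the abstract above (report 821) concrete, here is a naive sketch that is explicitly not the rejection-sampling scheme developed in the report. It assumes a discrete-generation neutral Wright-Fisher chain with a hypothetical population size `pop_size` in place of the diffusion, simulates unconditioned paths, and keeps the first one that ends at the required allele count. All names and parameters are illustrative assumptions.

```python
import random


def wright_fisher_path(start_count, pop_size, generations, rng):
    """Simulate allele counts of a neutral Wright-Fisher chain."""
    path = [start_count]
    count = start_count
    for _ in range(generations):
        freq = count / pop_size
        # Binomial resampling: each offspring carries the allele with probability freq.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        path.append(count)
    return path


def naive_bridge(start_count, end_count, pop_size, generations, rng,
                 max_tries=100_000):
    """Return the first simulated path ending at end_count, or None."""
    for _ in range(max_tries):
        path = wright_fisher_path(start_count, pop_size, generations, rng)
        if path[-1] == end_count:
            # Accepted: the path is a draw from the chain conditioned on its endpoint.
            return path
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    print(naive_bridge(5, 7, 20, 10, rng))
```

Accepting only paths that hit the endpoint becomes very inefficient when that endpoint is unlikely, which is one reason more refined sampling schemes are of interest.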

Title:Removing unwanted variation from high dimensional data with negative controls
Author(s):Gagnon-Bartsch, Johann A.; Jacob, Laurent; Speed, Terence P.; 
Date issued:December 2013 (PDF)
Abstract:High dimensional data suffer from unwanted variation, such as the batch effects common in microarray data. Unwanted variation complicates the analysis of high dimensional data, leading to high rates of false discoveries, high rates of missed discoveries, or both. In many cases the factors causing the unwanted variation are unknown and must be inferred from the data. In such cases, negative controls may be used to identify the unwanted variation and separate it from the wanted variation. We present a new method, RUV-4, to adjust for unwanted variation in high dimensional data with negative controls. RUV-4 may be used when the goal of the analysis is to determine which of the features are truly associated with a given factor of interest. One nice property of RUV-4 is that it is relatively insensitive to the number of unwanted factors included in the model; this makes estimating the number of factors less critical. We also present a novel method for estimating the features' variances that may be used even when a large number of unwanted factors are included in the model and the design matrix is full rank. We name this the "inverse method for estimating variances." By combining RUV-4 with the inverse method, it is no longer necessary to estimate the number of unwanted factors at all. Using both real and simulated data we compare the performance of RUV-4 with that of other adjustment methods such as SVA, LEAPP, ICE, and RUV-2. We find that RUV-4 and its variants perform as well as or better than other methods.
Keyword note:Gagnon-Bartsch__Johann_A Jacob__Laurent Speed__Terry_P
Report ID:820

Title:Comparing somatic mutation-callers
Author(s):Kim, Su Yeon; Speed, Terence P.; 
Date issued:February 2013 (PDF)
Abstract:Background: Somatic mutation-calling based on DNA from matched tumor-normal patient samples is one of the key tasks carried out by many cancer genome projects. One such large-scale project is The Cancer Genome Atlas (TCGA), which is now routinely compiling catalogs of somatic mutations from hundreds of paired tumor-normal DNA exome-sequence data sets. Nonetheless, mutation calling is still very challenging. TCGA benchmark studies revealed that even relatively recent mutation callers from major centers showed substantial discrepancies. Evaluating the mutation callers or understanding the sources of discrepancies is not straightforward, since for most tumor studies validation data based on independent whole-exome DNA sequencing are not available; only partial validation data for a selected (ascertained) subset of sites exist. Results: We have analyzed two sets of mutation-calling data from multiple centers and their partial validation data. Various aspects of the mutation-calling outputs were explored to characterize the discrepancies in detail. To assess the performance of multiple callers, we introduce four approaches that use external sequence data to varying degrees: independent DNA-seq pairs, RNA-seq for tumor samples only, the original exome-seq pairs only, or none of those. Conclusions: Our analyses provide guidelines for visualizing and understanding the discrepancies among the outputs from multiple callers. Furthermore, applying the four evaluation approaches to the whole exome data, we illustrate the challenges and highlight the various circumstances that require extra caution in assessing the performance of multiple callers.
Keyword note:Kim__Su_Yeon Speed__Terry_P
Report ID:819