Statistics Technical Reports

Title:Annotation-free estimates of gene-expression from mRNA-Seq
Author(s):Purdom, Elizabeth; 
Date issued:September 2012 (PDF)
Abstract:Motivation: mRNA-Seq experiments provide an impressive array of information about the transcriptome of a sample. Yet in organisms that undergo alternative splicing, correctly estimating the standard measures of gene expression can be a complex problem. The simple estimate based on the number of fragments aligning to a gene has the potential to be biased. Many methods now exist that estimate the expression of individual isoforms, and these estimates can then be combined to give accurate gene expression estimates. However, isoform estimates require either knowledge of the transcriptome or the ability to accurately predict it. Yet many mRNA-Seq experiments are run on organisms with no known genome, much less a transcriptome. In addition, these methods are computationally intensive and usually require access to the raw reads, making them difficult to use for researchers who want to analyze large numbers of samples. Results: We examine estimates based on summaries that are easy to obtain and analyze, specifically methods based on counting the number of sequenced fragments that overlap exons. We compare these methods to isoform-based gene estimates. We show that in simulated data our gene estimation methods based on exon counts give reasonable gene estimates in the presence of moderate alternative splicing. We compare all of these methods on two mRNA-Seq datasets and observe little difference between any of the methods. In such cases, simple count-based methods can be sufficient and allow the experimenter to make use of statistical techniques that appropriately account for the biological variation between samples.
Keyword note:Purdom__Elizabeth
Report ID:825

Title:Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed
Author(s):Jacob, Laurent; Gagnon-Bartsch, Johann; Speed, Terence P.; 
Date issued:November 2012 (PDF)
Abstract:When dealing with large scale gene expression studies, observations are commonly contaminated by unwanted variation factors such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state of the art corrections.
Keyword note:Jacob__Laurent Gagnon-Bartsch__Johann_A Speed__Terry_P
Report ID:818

Title:Stochastic flights of propellers
Author(s):Pan, Margaret; Rein, Hanno; Chiang, Eugene; Evans, Steven N.; 
Date issued:June 2012 (PDF)
Abstract:Kilometer-sized moonlets in Saturn's A ring create S-shaped wakes called "propellers" in surrounding material. The Cassini spacecraft has tracked the motions of propellers for several years and finds that they deviate from Keplerian orbits having constant semimajor axes. The inferred orbital migration is known to switch sign. We show using a statistical test that the time series of orbital longitudes of the propeller Bleriot is consistent with that of a time-integrated Gaussian random walk. That is, Bleriot's observed migration pattern is consistent with being stochastic. We further show, using a combination of analytic estimates and collisional N-body simulations, that stochastic migration of the right magnitude to explain the Cassini observations can be driven by encounters with ring particles 10–20 m in radius. That the local ring mass is concentrated in decameter-sized particles is supported on independent grounds by occultation analyses.
Keyword note:Pan__Margaret Rein__Hanno Chiang__Eugene Evans__Steven_N
Report ID:817

Title:Supervised feature selection in graphs with path coding penalties and network flows
Author(s):Mairal, Julien; Yu, Bin; 
Date issued:April 2012 (PDF)
Abstract:We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to take into account the problem structure, and automatically select a subgraph with a small number of connected components. By exploiting prior knowledge, one can indeed improve the prediction performance and/or obtain better interpretable results. Regularization or penalty functions for selecting features in graphs have recently been proposed but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and "well connected" subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called "path coding" penalties. Unlike existing regularization functions, path coding penalties can both model long range interactions between features in the graph and be tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs.
Keyword note:Mairal__Julien Yu__Bin
Report ID:816

Title:A likelihood method for jointly estimating the selection coefficient and the allele age for time serial data
Author(s):Malaspinas, Anna-Sapfo; Malaspinas, Orestis; Evans, Steven N.; Slatkin, Montgomery; 
Date issued:April 2012 (PDF)
Abstract:Recent advances in sequencing technologies have made available an ever-increasing amount of ancient genomic data. In particular, it is now possible to target specific single nucleotide polymorphisms in several samples at different time points. Such time series data is also available in the context of experimental or viral evolution. Time-series data should allow for more precise inference of population genetic parameters and for testing hypotheses about the recent action of natural selection. In this manuscript, we develop a likelihood method to jointly estimate the selection coefficient and the age of an allele from time serial data. Our method can be used for allele frequencies sampled from a single diallelic locus. The transition probabilities are calculated by approximating the standard diffusion equation of the Wright-Fisher model with a one step process. We show that our method produces unbiased estimates. The power of the method is tested via simulations. Finally, the utility of the method is illustrated with an application to several loci encoding coat color in horses, a pattern that has previously been linked with domestication. Importantly, given our ability to estimate the age of the allele, it is possible to gain traction on the important problem of distinguishing selection on new mutations from selection on standing variation. In this coat color example, for instance, we estimate the age of this allele, which is found to predate domestication.
Keyword note:Malaspinas__Anna_Sapfo Malaspinas__Orestis Evans__Steven_N Slatkin__Montgomery
Report ID:815

Title:A stochastic smoothing algorithm for semidefinite programming
Author(s):d'Aspremont, Alexandre; El Karoui, Noureddine; 
Date issued:April 2012 (PDF)
Abstract:We use a rank one Gaussian perturbation to derive a smooth stochastic approximation of the maximum eigenvalue function. We then combine this smoothing result with an optimal smooth stochastic optimization algorithm to produce an efficient method for solving maximum eigenvalue minimization problems. We show that the complexity of this new method is lower than that of deterministic smoothing algorithms in certain precision/dimension regimes.
Keyword note:d_Aspremont__Alexandre El__Karoui__Noureddine
Report ID:814

Title:Penalized robust regression in high-dimension
Author(s):Bean, Derek; Bickel, Peter; El Karoui, Noureddine; Lim, Chinghway; Yu, Bin; 
Date issued:April 2012 (PDF)
Abstract:We discuss the behavior of penalized robust regression estimators in high-dimension and compare our theoretical predictions to simulations. Our results show the importance of the geometry of the dataset and shed light on the theoretical behavior of LASSO and much more involved methods.
Keyword note:Bean__Derek Bickel__Peter_John El__Karoui__Noureddine Lim__Chinghway Yu__Bin
Report ID:813

Title:On robust regression with high-dimensional predictors
Author(s):El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin; 
Date issued:April 2012 (PDF)
Abstract:We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both large but p<=n. This is the short version of the paper.
Keyword note:El__Karoui__Noureddine Bean__Derek Bickel__Peter_John Lim__Chinghway Yu__Bin
Report ID:812

Title:On robust regression with high-dimensional predictors
Author(s):El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin; 
Date issued:April 2012 (PDF)
Abstract:We consider the problem of understanding the properties of robust regression estimators in the high-dimensional setting. This is the long version of the paper.
Keyword note:El__Karoui__Noureddine Bean__Derek Bickel__Peter_John Lim__Chinghway Yu__Bin
Report ID:811

Title:Optimal objective function in high-dimensional regression
Author(s):El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Yu, Bin; 
Date issued:April 2012 (PDF)
Abstract:We consider, for the first time in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem.
Keyword note:Bean__Derek Bickel__Peter_John El__Karoui__Noureddine Yu__Bin
Report ID:810