**Statistics Technical Reports**

**Term(s):**2012**Results:**10

**Title:**Annotation-free estimates of gene-expression from mRNA-Seq**Author(s):**Purdom, Elizabeth; **Date issued:**September 2012

http://nma.berkeley.edu/ark:/28722/bk0016x8b02 (PDF) **Abstract:**Motivation: mRNA-Seq experiments provide an impressive array of information about the transcriptome of a sample. Yet in organisms
that undergo alternative splicing, correctly estimating the standard measures of gene expression can be a complex problem.
The simple estimate based on the number of fragments aligning to
a gene has the potential to be biased. Many methods now exist that produce individual isoform estimates, which can then
be combined to give accurate gene expression estimates. However, isoform estimates require either knowledge of the transcriptome
or the ability to accurately predict it. Yet many mRNA-Seq experiments are run on organisms with no known genome, much less
a transcriptome. In addition, these methods are computationally intensive and usually require access to the raw reads, making
them difficult to use for researchers who want to analyze large numbers of samples. Results: We examine estimates based
on summaries that are easy to obtain and analyze, specifically methods based on counting the number of sequenced fragments
that overlap exons. We compare these methods to isoform-based gene estimates. We show that in simulated data our gene estimation
methods based on exon counts give reasonable gene estimates in the presence of moderate alternative splicing. We compare all
of these methods on two mRNA-Seq datasets and observe little difference between them. In such cases, simple
count-based methods can be sufficient and allow the experimenter to make use of statistical techniques that appropriately
account for the biological variation between samples.**Keyword note:**Purdom__Elizabeth**Report ID:**825**Relevance:**100
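
The exon-overlap counting described in the abstract above can be illustrated with a minimal sketch. This is not the report's estimator: the interval representation (half-open `(start, end)` tuples), the function names, and the toy coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumed convention


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_counts(exons: List[Interval], fragments: List[Interval]) -> List[int]:
    """For each exon, count the fragments that overlap it."""
    return [sum(overlaps(e, f) for f in fragments) for e in exons]


def gene_union_count(exons: List[Interval], fragments: List[Interval]) -> int:
    """Count each fragment once if it overlaps any exon of the gene."""
    return sum(any(overlaps(e, f) for e in exons) for f in fragments)


# Hypothetical toy gene with three exons and four aligned fragments.
exons = [(100, 200), (300, 400), (500, 650)]
fragments = [(90, 120), (180, 320), (410, 450), (600, 700)]

per_exon = exon_counts(exons, fragments)      # [2, 1, 1]
union = gene_union_count(exons, fragments)    # 3
```

The fragment `(180, 320)` spans two exons, so summing the per-exon counts (4) counts it twice, while the union count (3) does not. How to combine exon-level counts into a gene-level estimate is exactly the kind of choice the report examines; the sketch only shows the counting step.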

**Title:**Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed**Author(s):**Jacob, Laurent; Gagnon-Bartsch, Johann; Speed, Terence P.; **Date issued:**November 2012

http://nma.berkeley.edu/ark:/28722/bk0012h6d26 (PDF) **Abstract:**When dealing with large-scale gene expression studies, observations are commonly contaminated by unwanted variation factors
such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious
associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples
or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted
variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor
of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes
and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can
be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated
on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest
and compare favorably to state-of-the-art corrections.**Keyword note:**Jacob__Laurent Gagnon-Bartsch__Johann_A Speed__Terry_P**Report ID:**818**Relevance:**100

**Title:**Stochastic flights of propellers**Author(s):**Pan, Margaret; Rein, Hanno; Chiang, Eugene; Evans, Steven N.; **Date issued:**June 2012

http://nma.berkeley.edu/ark:/28722/bk001108m24 (PDF) **Abstract:**Kilometer-sized moonlets in Saturn's A ring create S-shaped wakes called "propellers" in surrounding material. The Cassini
spacecraft has tracked the motions of propellers for several years and finds that they deviate from Keplerian orbits having
constant semimajor axes. The inferred orbital migration is known to switch sign. We show using a statistical test that the
time series of orbital longitudes of the propeller Bleriot is consistent with that of a time-integrated Gaussian random walk.
That is, Bleriot's observed migration pattern is consistent with being stochastic. We further show, using a combination of
analytic estimates and collisional N-body simulations, that stochastic migration of the right magnitude to explain the Cassini
observations can be driven by encounters with ring particles 10–20 m in radius. That the local ring mass is concentrated in
decameter-sized particles is supported on independent grounds by occultation analyses.**Keyword note:**Pan__Margaret Rein__Hanno Chiang__Eugene Evans__Steven_N**Report ID:**817**Relevance:**100
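
As a rough illustration of what a "time-integrated Gaussian random walk" looks like, the sketch below simulates one under a simple reading: a drift rate receives independent Gaussian kicks at each step (a random walk), and the longitude offset is the running sum of that rate. The unit time step, the `sigma` parameter, and the function name are assumptions, not quantities taken from the report.

```python
import random
from itertools import accumulate
from typing import List, Tuple


def integrated_random_walk(
    n_steps: int, sigma: float = 1.0, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Return (rate, longitude) for a time-integrated Gaussian random walk.

    rate      : cumulative sum of i.i.d. N(0, sigma^2) kicks (a random walk)
    longitude : cumulative sum of rate, i.e. its integral with unit time step
    """
    rng = random.Random(seed)
    kicks = [rng.gauss(0.0, sigma) for _ in range(n_steps)]
    rate = list(accumulate(kicks))
    longitude = list(accumulate(rate))
    return rate, longitude


rate, longitude = integrated_random_walk(n_steps=500, sigma=0.1, seed=42)
```

Because the rate itself wanders, the integrated series can drift in one direction for a long stretch and then reverse, which is qualitatively the kind of sign-switching behaviour the abstract mentions.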

**Title:**Supervised feature selection in graphs with path coding penalties and network flows**Author(s):**Mairal, Julien; Yu, Bin; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w474v (PDF) **Abstract:**We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network.
In this context, it is of much interest to take into account the problem structure, and automatically select a subgraph with
a small number of connected components. By exploiting prior knowledge, one can indeed improve the prediction performance and/or
obtain better interpretable results. Regularization or penalty functions for selecting features in graphs have recently been
proposed but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection
problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and
"well connected" subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties
over paths on a DAG called "path coding" penalties. Unlike existing regularization functions, path coding penalties can both
model long range interactions between features in the graph and be tractable. The penalties and their proximal operators involve
path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic,
image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions
for graphs.**Keyword note:**Mairal__Julien Yu__Bin**Report ID:**816**Relevance:**100

**Title:**A likelihood method for jointly estimating the selection coefficient and the allele age for time serial data**Author(s):**Malaspinas, Anna-Sapfo; Malaspinas, Orestis; Evans, Steven N.; Slatkin, Montgomery; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w470n (PDF) **Abstract:**Recent advances in sequencing technologies have made available an ever-increasing amount of ancient genomic data. In particular,
it is now possible to target specific single nucleotide polymorphisms in several samples at different time points. Such time
series data is also available in the context of experimental or viral evolution. Time series data should allow more
precise inference of population genetic parameters and tests of hypotheses about the recent action of natural selection. In
this manuscript, we develop a likelihood method to jointly estimate the selection coefficient and the age of an allele from
time serial data. Our method can be used for allele frequencies sampled from a single diallelic locus. The transition probabilities
are calculated by approximating the standard diffusion equation of the Wright-Fisher model with a one step process. We show
that our method produces unbiased estimates. The power of the method is tested via simulations. Finally, the utility of the
method is illustrated with an application to several loci encoding coat color in horses, a pattern that has previously been
linked with domestication. Importantly, given our ability to estimate the age of the allele, it is possible to gain traction
on the important problem of distinguishing selection on new mutations from selection on standing variation. In this coat color
example for instance, we estimate the age of this allele, which is found to predate domestication.**Keyword note:**Malaspinas__Anna_Sapfo Malaspinas__Orestis Evans__Steven_N Slatkin__Montgomery**Report ID:**815**Relevance:**100
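
For readers unfamiliar with the underlying model, the sketch below is a plain forward simulation of a Wright-Fisher population with selection at a single diallelic locus. It is not the likelihood method or the one-step approximation of the report; the haploid fitness scheme (relative fitness `1 + s`), the parameter names, and the example values are assumptions chosen for illustration.

```python
import random
from typing import List


def wright_fisher_trajectory(
    n_pop: int, s: float, generations: int, x0: float, seed: int = 0
) -> List[float]:
    """Simulate allele frequencies under Wright-Fisher drift with selection.

    n_pop : number of gene copies resampled each generation
    s     : selection coefficient (allele has relative fitness 1 + s, s > -1)
    x0    : starting allele frequency in [0, 1]
    """
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for _ in range(generations):
        # Selection: expected frequency after weighting by fitness.
        p = x * (1 + s) / (x * (1 + s) + (1 - x))
        # Drift: binomial sampling of n_pop gene copies.
        count = sum(rng.random() < p for _ in range(n_pop))
        x = count / n_pop
        trajectory.append(x)
    return trajectory


# Hypothetical example: a weakly favoured allele starting at 10% frequency.
traj = wright_fisher_trajectory(n_pop=200, s=0.05, generations=50, x0=0.1, seed=1)
```

Time serial data of the kind the abstract describes would correspond to observing noisy samples from such a trajectory at a few generations only.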

**Title:**A stochastic smoothing algorithm for semidefinite programming**Author(s):**d'Aspremont, Alexandre; El Karoui, Noureddine; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w472r (PDF) **Abstract:**We use a rank one Gaussian perturbation to derive a smooth stochastic approximation of the maximum eigenvalue function. We
then combine this smoothing result with an optimal smooth stochastic optimization algorithm to produce an efficient method
for solving maximum eigenvalue minimization problems. We show that the complexity of this new method is lower than that of
deterministic smoothing algorithms in certain precision/dimension regimes.**Keyword note:**d_Aspremont__Alexandre El__Karoui__Noureddine**Report ID:**814**Relevance:**100
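
The idea of smoothing the maximum eigenvalue with a rank one Gaussian perturbation can be sketched by Monte Carlo: average `λ_max(X + (ε/n) z zᵀ)` over standard Gaussian vectors `z`. The `ε/n` scaling, the sample count, and the function name are assumptions made for this sketch; the report's exact construction and its optimization algorithm are not reproduced here.

```python
import numpy as np


def smoothed_max_eig(X: np.ndarray, eps: float, n_samples: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of E[lambda_max(X + (eps / n) * z z^T)], z ~ N(0, I).

    X must be a symmetric n-by-n matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    total = 0.0
    for _ in range(n_samples):
        z = rng.standard_normal(n)
        perturbed = X + (eps / n) * np.outer(z, z)
        # eigvalsh returns eigenvalues in ascending order; take the largest.
        total += np.linalg.eigvalsh(perturbed)[-1]
    return total / n_samples


X = np.diag([1.0, 2.0, 3.0])
eps = 0.1
lam_max = np.linalg.eigvalsh(X)[-1]
smooth_val = smoothed_max_eig(X, eps)
```

Since the perturbation is positive semidefinite, each sample is at least `λ_max(X)` and at most `λ_max(X) + (ε/n)‖z‖²`, so the average stays within roughly `ε` of the unsmoothed value.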

**Title:**Penalized robust regression in high-dimension**Author(s):**Bean, Derek; Bickel, Peter; El Karoui, Noureddine; Lim, Chinghway; Yu, Bin; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w4627 (PDF) **Abstract:**We discuss the behavior of penalized robust regression estimators in high-dimension and compare our theoretical predictions
to simulations. Our results show the importance of the geometry of the dataset and shed light on the theoretical behavior
of LASSO and much more involved methods.**Keyword note:**Bean__Derek Bickel__Peter_John El__Karoui__Noureddine Lim__Chinghway Yu__Bin**Report ID:**813**Relevance:**100

**Title:**On robust regression with high-dimensional predictors**Author(s):**El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w464b (PDF) **Abstract:**We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both
large but p<=n. This is the short version of the paper.**Keyword note:**El__Karoui__Noureddine Bean__Derek Bickel__Peter_John Lim__Chinghway Yu__Bin**Report ID:**812**Relevance:**100

**Title:**On robust regression with high-dimensional predictors**Author(s):**El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w466f (PDF) **Abstract:**We consider the problem of understanding the properties of robust regression estimators in the high-dimensional setting. This
is the long version of the paper.**Keyword note:**El__Karoui__Noureddine Bean__Derek Bickel__Peter_John Lim__Chinghway Yu__Bin**Report ID:**811**Relevance:**100

**Title:**Optimal objective function in high-dimensional regression**Author(s):**El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Yu, Bin; **Date issued:**April 2012

http://nma.berkeley.edu/ark:/28722/bk0010w468j (PDF) **Abstract:**We consider, for the first time in the modern setting of high-dimensional statistics, the classic problem of optimizing the
objective function in regression. We propose an algorithm to compute this optimal objective function that takes into account
the dimensionality of the problem.**Keyword note:**Bean__Derek Bickel__Peter_John El__Karoui__Noureddine Yu__Bin**Report ID:**810**Relevance:**100