Title: Annotation-free estimates of gene-expression from mRNA-Seq
Author(s): Purdom, Elizabeth
Date issued: September 2012

Abstract: Motivation: mRNA-Seq experiments provide an impressive array of information about the transcriptome of a sample. Yet in organisms
that undergo alternative splicing, correctly estimating the standard measures of gene expression can be a complex problem.
The simple estimate based on the number of fragments aligning to
a gene has the potential to be biased. Many methods now exist that produce individual isoform estimates, which can then
be combined to give accurate gene expression estimates. However, isoform estimates require either knowledge of the transcriptome
or the ability to accurately predict it. Yet many mRNA-Seq experiments are run on organisms with no known genome, much less
a transcriptome. In addition, these methods are computationally intensive and usually require access to the raw reads, making
them difficult to use for researchers who want to analyze large numbers of samples. Results: We examine estimates based
on summaries that are easy to obtain and analyze, specifically methods based on counting the number of sequenced fragments
that overlap exons. We compare these methods to isoform-based gene estimates. We show that in simulated data our gene estimation
methods based on exon counts give reasonable gene estimates in the presence of moderate alternative splicing. We compare all
of these methods on two mRNA-Seq datasets and observe little difference between them. In such cases, simple
count-based methods can be sufficient and allow the experimenter to make use of statistical techniques that appropriately
account for the biological variation between samples.
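
Illustrative sketch (not taken from the report): the abstract refers to summaries based on counting the sequenced fragments that overlap exons. The Python below shows one way such counts might be computed. It assumes fragments and exons are given as half-open coordinate intervals on a single strand. The gene, its exons, the fragments, and the function names are all hypothetical, and the report's actual estimators are not reproduced here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_counts(exons: List[Interval], fragments: List[Interval]) -> List[int]:
    """Number of fragments overlapping each exon; a fragment may hit several exons."""
    return [sum(overlaps(exon, frag) for frag in fragments) for exon in exons]


def union_count(exons: List[Interval], fragments: List[Interval]) -> int:
    """Number of fragments overlapping at least one exon, each counted once."""
    return sum(any(overlaps(exon, frag) for exon in exons) for frag in fragments)


# Hypothetical gene with three exons and five aligned fragments.
exons = [(100, 200), (300, 400), (500, 600)]
fragments = [(90, 150), (180, 320), (350, 380), (420, 480), (590, 650)]

per_exon = exon_counts(exons, fragments)     # one count per exon
gene_total = union_count(exons, fragments)   # one count for the whole gene
```

In this toy input the per-exon counts sum to more than the gene-level count, because the fragment spanning the first two exons is counted once for each exon it touches.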

Title: Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed
Author(s): Jacob, Laurent; Gagnon-Bartsch, Johann; Speed, Terence P.
Date issued: November 2012

Abstract: When dealing with large scale gene expression studies, observations are commonly contaminated by unwanted variation factors
such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious
associations and to missing important signals. When the analysis is unsupervised, e.g. when the goal is to cluster the samples
or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted
variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor
of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes
and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can
be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated
on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest
and compare favorably to state-of-the-art corrections.

Title: Stochastic flights of propellers
Author(s): Pan, Margaret; Rein, Hanno; Chiang, Eugene; Evans, Steven N.
Date issued: June 2012

Abstract: Kilometer-sized moonlets in Saturn's A ring create S-shaped wakes called "propellers" in surrounding material. The Cassini
spacecraft has tracked the motions of propellers for several years and finds that they deviate from Keplerian orbits having
constant semimajor axes. The inferred orbital migration is known to switch sign. We show using a statistical test that the
time series of orbital longitudes of the propeller Bleriot is consistent with that of a time-integrated Gaussian random walk.
That is, Bleriot's observed migration pattern is consistent with being stochastic. We further show, using a combination of
analytic estimates and collisional N-body simulations, that stochastic migration of the right magnitude to explain the Cassini
observations can be driven by encounters with ring particles 10–20 m in radius. That the local ring mass is concentrated in
decameter-sized particles is supported on independent grounds by occultation analyses.
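
Illustrative sketch (not taken from the report): a time-integrated Gaussian random walk can be simulated by accumulating Gaussian increments into a walk and then accumulating the walk itself. The Python below assumes a unit time step and arbitrary units. The function name and the parameters `sigma` and `seed` are hypothetical, and the statistical test applied to Bleriot in the report is not reproduced.

```python
import random
from typing import List


def integrated_random_walk(n_steps: int, sigma: float, seed: int = 0) -> List[float]:
    """Return the running sum of a Gaussian random walk.

    walk[k]      = walk[k-1] + N(0, sigma^2)   (the random walk itself)
    longitude[k] = longitude[k-1] + walk[k]    (its time integral, unit time step)
    """
    rng = random.Random(seed)
    walk = 0.0
    longitude = 0.0
    path = []
    for _ in range(n_steps):
        walk += rng.gauss(0.0, sigma)   # Gaussian increment of the walk
        longitude += walk               # integrate the walk over one time step
        path.append(longitude)
    return path


# Hypothetical parameters, chosen only for illustration.
path = integrated_random_walk(n_steps=1000, sigma=1.0, seed=42)
```

Under these assumptions the second differences of `path` recover the independent Gaussian increments, which is the property that distinguishes an integrated random walk from a plain one.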

Title: Supervised feature selection in graphs with path coding penalties and network flows
Author(s): Mairal, Julien; Yu, Bin
Date issued: April 2012

Abstract: We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network.
In this context, it is of much interest to take into account the problem structure, and automatically select a subgraph with
a small number of connected components. By exploiting prior knowledge, one can indeed improve the prediction performance and/or
obtain more interpretable results. Regularization or penalty functions for selecting features in graphs have recently been
proposed but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection
problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and
"well connected" subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties
over paths on a DAG called "path coding" penalties. Unlike existing regularization functions, path coding penalties can both
model long range interactions between features in the graph and be tractable. The penalties and their proximal operators involve
path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic,
image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions
for graphs.

Title: A likelihood method for jointly estimating the selection coefficient and the allele age for time serial data
Author(s): Malaspinas, Anna-Sapfo; Malaspinas, Orestis; Evans, Steven N.; Slatkin, Montgomery
Date issued: April 2012

Abstract: Recent advances in sequencing technologies have made available an ever-increasing amount of ancient genomic data. In particular,
it is now possible to target specific single nucleotide polymorphisms in several samples at different time points. Such time
series data is also available in the context of experimental or viral evolution. Time-series data should allow for more
precise inference of population genetic parameters and for testing hypotheses about the recent action of natural selection. In
this manuscript, we develop a likelihood method to jointly estimate the selection coefficient and the age of an allele from
time serial data. Our method can be used for allele frequencies sampled from a single diallelic locus. The transition probabilities
are calculated by approximating the standard diffusion equation of the Wright-Fisher model with a one step process. We show
that our method produces unbiased estimates. The power of the method is tested via simulations. Finally, the utility of the
method is illustrated with an application to several loci encoding coat color in horses, a pattern that has previously been
linked with domestication. Importantly, given our ability to estimate the age of the allele, it is possible to gain traction
on the important problem of distinguishing selection on new mutations from selection on standing variation. In this coat color
example, for instance, we estimate the age of the allele, which is found to predate domestication.
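
Illustrative sketch (not taken from the report): time serial allele-frequency data of the kind described above can be simulated forward in time under a Wright-Fisher model. The Python below assumes a haploid-style selection step, p' = p(1 + s) / (1 + sp), followed by binomial resampling, and then draws binomial samples at chosen generations. All function names and parameter values are hypothetical. This is a forward simulation only: it does not implement the report's likelihood method or its one-step approximation of the diffusion.

```python
import random
from typing import List


def binomial(rng: random.Random, n: int, p: float) -> int:
    """Draw from Binomial(n, p) by summing n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def wright_fisher(pop_size: int, s: float, generations: int,
                  start_count: int = 1, seed: int = 0) -> List[float]:
    """Allele-frequency trajectory under Wright-Fisher drift with selection.

    With start_count=1 the allele enters as a single new copy at generation 0,
    so the generation index is the allele's age.
    """
    rng = random.Random(seed)
    count = start_count
    freqs = [count / pop_size]
    for _ in range(generations):
        p = count / pop_size
        p_sel = p * (1.0 + s) / (1.0 + s * p)    # deterministic selection step
        count = binomial(rng, pop_size, p_sel)   # binomial resampling (drift)
        freqs.append(count / pop_size)
    return freqs


def sample_time_series(freqs: List[float], times: List[int],
                       sample_size: int, seed: int = 1) -> List[int]:
    """Observed allele counts in samples of fixed size taken at given generations."""
    rng = random.Random(seed)
    return [binomial(rng, sample_size, freqs[t]) for t in times]


# Hypothetical settings, chosen only for illustration.
trajectory = wright_fisher(pop_size=200, s=0.05, generations=100, start_count=20)
samples = sample_time_series(trajectory, times=[0, 50, 100], sample_size=30)
```

A likelihood method would work in the opposite direction, asking which selection coefficient and allele age make observed counts like `samples` most probable.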

Title: A stochastic smoothing algorithm for semidefinite programming
Author(s): d'Aspremont, Alexandre; El Karoui, Noureddine
Date issued: April 2012

Abstract: We use a rank one Gaussian perturbation to derive a smooth stochastic approximation of the maximum eigenvalue function. We
then combine this smoothing result with an optimal smooth stochastic optimization algorithm to produce an efficient method
for solving maximum eigenvalue minimization problems. We show that the complexity of this new method is lower than that of
deterministic smoothing algorithms in certain precision/dimension regimes.

Title: Penalized robust regression in high-dimension
Author(s): Bean, Derek; Bickel, Peter; El Karoui, Noureddine; Lim, Chinghway; Yu, Bin
Date issued: April 2012

Abstract: We discuss the behavior of penalized robust regression estimators in high-dimension and compare our theoretical predictions
to simulations. Our results show the importance of the geometry of the dataset and shed light on the theoretical behavior
of LASSO and much more involved methods.

Title: On robust regression with high-dimensional predictors
Author(s): El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin
Date issued: April 2012

Abstract: We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both
large but p<=n. This is the short version of the paper.

Title: On robust regression with high-dimensional predictors
Author(s): El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Lim, Chinghway; Yu, Bin
Date issued: April 2012

Abstract: We consider the problem of understanding the properties of robust regression estimators in the high-dimensional setting. This
is the long version of the paper.

Title: Optimal objective function in high-dimensional regression
Author(s): El Karoui, Noureddine; Bean, Derek; Bickel, Peter; Yu, Bin
Date issued: April 2012

Abstract: We consider, for the first time in the modern setting of high-dimensional statistics, the classic problem of optimizing the
objective function in regression. We propose an algorithm to compute this optimal objective function that takes into account
the dimensionality of the problem.