**Statistics Technical Reports**

**Term(s):** 2010 **Results:** 12

**Title:** A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers **Author(s):** Negahban, Sahand; Ravikumar, Pradeep; Wainwright, Martin J.; Yu, Bin **Date issued:** October 2010

http://nma.berkeley.edu/ark:/28722/bk00077105b (PDF) **Abstract:** High-dimensional statistical inference deals with models in which the number of parameters $p$ is comparable to or larger
than the sample size $n$. Since it is usually impossible to obtain consistent procedures unless $p/n \rightarrow 0$, a line
of recent work has studied models with various types of low-dimensional structure (e.g., sparse vectors; block-structured
matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized
convex program (known as a regularized $M$-estimator) which combines a loss function (measuring how well the model fits the
data) with some regularization function that encourages the assumed structure. This paper provides a unified framework for
establishing consistency and convergence rates for such regularized $M$-estimators under high-dimensional scaling. We state
one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results
on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions,
referred to as restricted strong convexity and decomposability, that ensure corresponding regularized $M$-estimators have
fast convergence rates, and which are optimal in many well-studied cases. **Report ID:** 797
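The lasso is the canonical instance of such a regularized $M$-estimator: squared-error loss plus the decomposable $\ell_1$ regularizer. A minimal proximal-gradient (ISTA) sketch, illustrative only and not taken from the paper:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Regularized M-estimator: minimize (1/2n)||y - Xb||^2 + lam*||b||_1
    by proximal gradient descent (ISTA) with fixed step 1/L,
    where L = ||X||_2^2 / n is the Lipschitz constant of the loss gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n   # gradient of the smooth loss
        z = b - step * grad
        # proximal map of lam*||.||_1: coordinate-wise soft-thresholding
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b

# noiseless sparse recovery: true coefficient vector is (1, -2, 0, ..., 0)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta = np.zeros(10)
beta[0], beta[1] = 1.0, -2.0
y = X @ beta
b_hat = lasso_ista(X, y, lam=0.01)
```

With a small regularization level and noiseless data, the estimate lands near the true sparse vector, with the off-support coordinates driven to (near) zero by the soft-thresholding step.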

**Title:** Trickle-down processes and their boundaries **Author(s):** Evans, Steven N.; Gruebel, Rudolf; Wakolbinger, Anton **Date issued:** October 2010

http://nma.berkeley.edu/ark:/28722/bk00077107f (PDF) **Abstract:** It is possible to represent each of a number of Markov chains as an evolving sequence of connected subsets of a directed acyclic
graph that grow in the following way: initially, all vertices of the graph are unoccupied, particles are fed in one-by-one
at a distinguished source vertex, successive particles proceed along directed edges according to an appropriate stochastic
mechanism, and each particle comes to rest once it encounters an unoccupied vertex. Examples include the binary and digital
search tree processes, the random recursive tree process and generalizations of it arising from nested instances of Pitman's
two-parameter Chinese restaurant process, tree-growth models associated with Mallows' phi model of random permutations and
with Schuetzenberger's non-commutative q-binomial theorem, and a construction due to Luczak and Winkler that grows uniform
random binary trees in a Markovian manner. We introduce a framework that encompasses such Markov chains, and we characterize
their asymptotic behavior by analyzing in detail their Doob-Martin compactifications, Poisson boundaries and tail sigma-fields. **Report ID:** 796
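The growth mechanism the abstract describes is easy to simulate. With uniformly random routing on the rooted binary tree, this is the digital search tree process, one special case of the framework (the encoding of vertices as binary strings is just for illustration):

```python
import random

def trickle_down(n_particles, seed=0):
    """Feed particles one by one into the infinite rooted binary tree:
    each particle starts at the source (root), moves past occupied
    vertices along a uniformly random out-edge, and comes to rest at
    the first unoccupied vertex it reaches.  Vertices are encoded as
    binary strings, '' being the root."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_particles):
        v = ""
        while v in occupied:
            v += rng.choice("01")   # proceed along a random out-edge
        occupied.add(v)
    return occupied

occupied = trickle_down(10, seed=1)
```

By construction the occupied set is always a connected subtree containing the source, which is the structural invariant the framework exploits.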

**Title:** Minimax-optimal rates for sparse additive models over kernel classes via convex programming **Author(s):** Raskutti, Garvesh; Wainwright, Martin J.; Yu, Bin **Date issued:** August 2010

http://nma.berkeley.edu/ark:/28722/bk000770z14 (PDF) **Abstract:** Sparse additive models are families of $d$-variate functions that have the additive decomposition \mbox{$f^* = \sum_{j \in S} f^*_j$,} where $S$ is an unknown subset of cardinality $s \ll d$. We consider the case where each component function $f^*_j$
lies in a reproducing kernel Hilbert space, and analyze a simple kernel-based convex program for estimating the unknown function
$f^*$. Working within a high-dimensional framework that allows both the dimension $d$ and sparsity $s$ to scale, we derive
convergence rates in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms. These rates consist of two terms: a \emph{subset
selection term} of the order $\frac{s \log d}{n}$, corresponding to the difficulty of finding the unknown $s$-sized subset,
and an \emph{estimation error} term of the order $s \, \nu_n^2$, where $\nu_n^2$ is the optimal rate for estimating a univariate
function within the RKHS. We complement these achievable results by deriving minimax lower bounds on the $L^2(\mathbb{P})$
error, thereby showing that our method is optimal up to constant factors for sub-linear sparsity $s = o(d)$. Thus, we obtain
optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, finite-rank
kernel classes, as well as Sobolev smoothness classes. **Report ID:** 795

**Title:** Non-existence of Markovian time dynamics for graphical models of correlated default **Author(s):** Evans, Steven N.; Hening, Alexandru **Date issued:** August 2010

http://nma.berkeley.edu/ark:/28722/bk000770z5b (PDF) **Abstract:** Filiz et al. (2008) proposed a model for the pattern of defaults seen among a group of firms at the end of a given time period.
The ingredients in the model are a graph, where the vertices correspond to the firms and the edges describe the network of
interdependencies between the firms, a parameter for each vertex that captures the individual propensity of that firm to default,
and a parameter for each edge that captures the joint propensity of the two connected firms to default. The correlated default
model can be rewritten as a standard Ising model on the graph by identifying the set of defaulting firms in the default
model with the set of sites in the Ising model for which the spin is +1. We ask whether there is a suitable continuous time
Markov chain taking values in the subsets of the vertex set such that the initial state of the chain is the empty set, each
jump of the chain involves the inclusion of a single extra vertex, the distribution of the chain at some fixed time horizon
is the one given by the default model, and the distribution of the chain for other times is described by a probability
distribution in the same family as the default model. We show for three simple but financially natural special cases that
this is not possible outside of the trivial case where there is complete independence between the firms. **Report ID:** 794

**Title:** Padé approximants and exact two-locus sampling distributions **Author(s):** Jenkins, Paul A.; Song, Yun S. **Date issued:** July 2010

http://nma.berkeley.edu/ark:/28722/bk000770x91 (PDF) **Abstract:** For population genetics models with recombination, obtaining an exact, analytic sampling distribution has remained a challenging
open problem for several decades. Recently, a new perspective based on asymptotic series has been introduced to make progress
on this problem. Specifically, closed-form expressions have been derived for the first few terms in an asymptotic expansion
of the two-locus sampling distribution when the recombination rate $\rho$ is moderate to large. In this paper, a new computational
technique is developed for finding the asymptotic expansion to an arbitrary order. Computation in this new approach can be
automated easily. Furthermore, it is proved here that only a finite number of terms in the asymptotic expansion is needed
to recover (via the method of Padé approximants) the exact two-locus sampling distribution as an analytic function of $\rho$;
this function is exact for all values of $\rho$ from 0 to infinity. It is also shown that the new computational framework presented
here is flexible enough to incorporate natural selection. **Report ID:** 793
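As an illustration of the Padé step only (the generic textbook $[m/n]$ construction from Taylor coefficients, not the paper's two-locus computation), using exact rational arithmetic:

```python
from fractions import Fraction

def pade(c, m, n):
    """Textbook [m/n] Pade approximant p(x)/q(x) from Taylor coefficients
    c[0..m+n], normalized so q[0] = 1.  The denominator coefficients solve
    sum_{j=1}^{n} q_j c_{m+k-j} = -c_{m+k} for k = 1..n; the system is
    solved exactly over the rationals by Gaussian elimination (the
    approximant is assumed to exist, i.e. the system is nonsingular)."""
    c = [Fraction(x) for x in c]
    A = [[c[m + k - j] if m + k - j >= 0 else Fraction(0)
          for j in range(1, n + 1)] for k in range(1, n + 1)]
    b = [-c[m + k] for k in range(1, n + 1)]
    for i in range(n):                      # forward elimination with pivoting
        piv = next(r for r in range(i, n) if A[r][i] != 0)
        A[i], A[piv], b[i], b[piv] = A[piv], A[i], b[piv], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for col in range(i, n):
                A[r][col] -= f * A[i][col]
            b[r] -= f * b[i]
    q = [Fraction(0)] * n
    for i in reversed(range(n)):            # back substitution
        q[i] = (b[i] - sum(A[i][j] * q[j] for j in range(i + 1, n))) / A[i][i]
    q = [Fraction(1)] + q
    # numerator: p_i = sum_{j=0}^{min(i,n)} q_j c_{i-j}
    p = [sum(q[j] * c[i - j] for j in range(min(i, n) + 1)) for i in range(m + 1)]
    return p, q

# [1/1] Pade approximant of exp(x): (1 + x/2) / (1 - x/2)
p, q = pade([1, 1, Fraction(1, 2)], 1, 1)
```

The exact rational arithmetic matters here: the recovery result concerns the approximant as an analytic function, so floating-point denominators would muddy the picture.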

**Title:** An Analysis of the Convergence of Graph Laplacians **Author(s):** Ting, Daniel; Huang, Ling; Jordan, Michael **Date issued:** July 2010

http://nma.berkeley.edu/ark:/28722/bk000770z37 (PDF) **Abstract:** Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with
smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previously
unstudied graphs including kNN graphs. We also introduce a kernel-free framework to analyze graph constructions with shrinking
neighborhoods in general and apply it to analyze locally linear embedding (LLE). We also describe how, for a given limiting
Laplacian operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate
graph construction. **Report ID:** 792
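For concreteness, a minimal sketch of the kNN graph construction whose Laplacian is among the objects analyzed (the "or" symmetrization rule and the unnormalized $L = D - W$ form are one common choice, not the paper's only setting):

```python
import numpy as np

def knn_graph_laplacian(X, k):
    """Unnormalized graph Laplacian L = D - W of a symmetrized kNN graph
    on the rows of X: w_ij = 1 when i is among the k nearest neighbors
    of j or vice versa."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    np.fill_diagonal(D2, np.inf)                         # no self-edges
    W = np.zeros((n, n))
    nbrs = np.argsort(D2, axis=1)[:, :k]                 # k nearest per row
    for i in range(n):
        W[i, nbrs[i]] = 1.0
    W = np.maximum(W, W.T)                               # "or" symmetrization
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
L = knn_graph_laplacian(X, k=4)
```

The resulting matrix is symmetric and positive semidefinite with rows summing to zero, the basic properties any graph Laplacian convergence analysis starts from.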

**Title:** Spectral clustering and the high-dimensional Stochastic Block Model **Author(s):** Rohe, Karl; Chatterjee, Sourav; Yu, Bin **Date issued:** July 2010

http://nma.berkeley.edu/ark:/28722/bk000770z7f (PDF) **Abstract:** Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors.
Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected
actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally
feasible method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a social network model
with well defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model,
we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering
results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In
order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space
model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized
graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique.
Our method of studying the eigenvectors of random matrices is original. **Report ID:** 791
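A two-community toy version of the method the abstract describes, on two cliques joined by a single edge (a sketch only; the paper's general algorithm runs $k$-means on several eigenvectors of the normalized Laplacian, and this assumes no isolated nodes):

```python
import numpy as np

def two_way_spectral(A):
    """Two-community spectral clustering sketch: form the symmetric
    normalized matrix L = D^{-1/2} A D^{-1/2} and split the nodes by
    the sign of the eigenvector for its second-largest eigenvalue."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vecs = np.linalg.eigh(L)[1]        # eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)

# planted two-block network: two 4-cliques joined by one edge
A = np.zeros((8, 8))
for block in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0
labels = two_way_spectral(A)
```

On this graph the eigenvector for the second-largest eigenvalue takes one sign on each clique, so the sign split recovers the two planted communities exactly.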

**Title:** Measuring reproducibility of high-throughput experiments **Author(s):** Li, Qunhua; Brown, James B.; Huang, Haiyan; Bickel, Peter J. **Date issued:** May 2010

http://nma.berkeley.edu/ark:/28722/bk000770z9j (PDF) **Abstract:** Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work, we propose a unified
approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries
using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively
assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from
which we derive a quantitative reproducibility score, the "irreproducible discovery rate" (IDR), analogous to
the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds
both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate,
it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm
using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method
in a ChIP-seq experiment. **Report ID:** 790

**Title:** The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples **Author(s):** Evans, Steven N.; Matsen, Frederick A. **Date issued:** May 2010

http://nma.berkeley.edu/ark:/28722/bk000771014 (PDF) **Abstract:** We employ the Kantorovich-Rubinstein (KR) metric and $L^p$ generalizations to compare probability distributions on a given
phylogenetic tree. Such distributions arise in the context of metagenomics, where a sample of environmental sequences may
be treated as a collection of weighted points on a reference phylogenetic tree of known sequences. In contrast to many applications
of Kantorovich-Rubinstein ideas, the phylogenetic KR metric can be written in a closed form and calculated in linear time.
Using Monte Carlo resampling of the data, we assign a statistical significance level to the observed distance between two
distributions under a null hypothesis of no clustering. We also approximate the significance level using a functional of a
suitable Gaussian process; in the $L^2$ generalized case this functional is distributed as a linear combination of $\chi_1^2$
random variables weighted by the eigenvalues of an associated matrix. We conclude with an example application using our software
implementation of the KR metric and its generalizations. **Report ID:** 789
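The closed form behind the linear-time claim: on a tree, the KR ($L^1$) distance is a sum over edges of the edge length times the absolute difference of the probability mass in the subtree below that edge. A sketch on a toy tree (the dict-based tree encoding and vertex names are illustrative, not the paper's software interface):

```python
def kr_distance(parents, lengths, p, q):
    """Closed-form KR distance between distributions p and q on a rooted
    tree:  d(p, q) = sum over edges e of
    length(e) * |p(subtree below e) - q(subtree below e)|.
    `parents` maps vertex -> parent (None for the root) and must list
    parents before their children; `lengths[v]` is the length of the
    edge from v up to its parent; p, q map vertex -> mass (default 0)."""
    diff = {v: p.get(v, 0.0) - q.get(v, 0.0) for v in parents}
    dist = 0.0
    for v in reversed(list(parents)):      # visit children before parents
        if parents[v] is not None:
            dist += lengths[v] * abs(diff[v])
            diff[parents[v]] += diff[v]    # propagate subtree mass difference
    return dist

# toy tree: root r with children a, b; a has leaves x, y
parents = {"r": None, "a": "r", "b": "r", "x": "a", "y": "a"}
lengths = {"a": 1.0, "b": 1.0, "x": 2.0, "y": 2.0}
```

A single bottom-up pass over the edges suffices, which is exactly the linear-time computation the abstract mentions; placing unit masses at two leaves recovers the path length between them.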

**Title:** Shape-based peak identification for ChIP-Seq **Author(s):** Hower, Valerie; Evans, Steven N.; Pachter, Lior **Date issued:** May 2010

http://nma.berkeley.edu/ark:/28722/bk000771037 (PDF) **Abstract:** We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically
significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a
non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem
to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets,
and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events.
The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html. **Report ID:** 788

**Title:** Coverage statistics for sequence census methods **Author(s):** Evans, Steven N.; Hower, Valerie; Pachter, Lior **Date issued:** April 2010

http://nma.berkeley.edu/ark:/28722/bk000713912 (PDF) **Abstract:** We study the shape of coverage functions resulting from the sequencing of random genome fragments, and show that they can
be described by Galton-Watson trees. This extends standard analyses of shotgun sequencing that focus on coverage statistics
at individual sites, and provides a null model for detecting deviations from random coverage in high-throughput sequence-census-based experiments such as ChIP-Seq. **Report ID:** 787
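The random-coverage null can be simulated directly; the excursions of this coverage function above zero are the shapes the Galton-Watson description captures (fixed read length and uniform start positions are simplifying assumptions):

```python
import random

def coverage_depths(genome_len, n_reads, read_len, seed=0):
    """Coverage function from shotgun sequencing of random fragments:
    n_reads fragments of fixed length read_len, with uniformly random
    start positions, are piled up on a genome of length genome_len."""
    rng = random.Random(seed)
    cov = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            cov[pos] += 1
    return cov

cov = coverage_depths(1000, 200, 50, seed=3)
```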

**Title:** Second order accurate distributed eigenvector computation for extremely large matrices **Author(s):** El Karoui, Noureddine; d'Aspremont, Alexandre **Date issued:** February 2010

http://nma.berkeley.edu/ark:/28722/bk00071379f (PDF) **Abstract:** We propose a second-order accurate method to estimate the eigenvectors of extremely large matrices, thereby addressing a problem
of relevance to statisticians working in the analysis of very large datasets. More specifically, we show that averaging eigenvectors
of randomly subsampled matrices efficiently approximates the true eigenvectors of the original matrix under certain conditions
on the incoherence of the spectral decomposition. This incoherence assumption is typically milder than those made in matrix
completion and allows eigenvectors to be sparse. We discuss applications to spectral methods in dimensionality reduction and
information retrieval. **Report ID:** 786
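A toy version of the averaging idea (a sketch under our own parameter choices, not the paper's exact scheme): subsample the entries of a symmetric matrix, rescale so each subsample is unbiased, and average the sign-aligned leading eigenvectors of the subsamples.

```python
import numpy as np

def averaged_top_eigenvector(M, n_reps=50, keep=0.5, seed=0):
    """Estimate the leading eigenvector of symmetric M by averaging over
    random entrywise subsamples: each entry is kept independently with
    probability `keep` (rescaled by 1/keep for unbiasedness), and the
    top eigenvectors of the subsamples are sign-aligned and averaged.
    Illustrative parameters; not the paper's exact scheme."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    acc = np.zeros(n)
    for _ in range(n_reps):
        mask = rng.random((n, n)) < keep
        mask = np.triu(mask) | np.triu(mask, 1).T    # keep symmetry
        S = np.where(mask, M / keep, 0.0)
        v = np.linalg.eigh(S)[1][:, -1]              # top eigenvector
        if acc @ v < 0:                              # eigenvectors are defined
            v = -v                                   # only up to sign: align
        acc += v
    return acc / np.linalg.norm(acc)

# rank-one test matrix with a maximally incoherent (uniform) eigenvector
u = np.ones(20) / np.sqrt(20.0)
v = averaged_top_eigenvector(np.outer(u, u), n_reps=30, keep=0.5, seed=1)
```

The uniform eigenvector is the friendliest case for the incoherence condition in the abstract: its mass is spread evenly, so each subsample's top eigenvector stays well correlated with the truth and the average sharpens the estimate.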