**Statistics Technical Reports**

**Term(s):** 2007 **Results:** 21

**Title:** The spectrum of kernel random matrices **Author(s):** El Karoui, Noureddine **Date issued:** December 2007

http://nma.berkeley.edu/ark:/28722/bk0005d085d (PDF) **Abstract:** We place ourselves in the setting of high-dimensional statistical inference, where the number of variables p in a dataset
of interest is of the same order of magnitude as the number of observations n. We consider the spectrum of certain kernel
random matrices, in particular n x n matrices whose (i,j)-th entry is f(X_i'X_j/p) or f(\norm{X_i-X_j}^2/p), where p is the
dimension of the data, and X_i are independent data vectors. Here f is assumed to be a locally smooth function. The study
is motivated by questions arising in statistics and computer science, where these matrices are used to perform, among other
things, non-linear versions of principal component analysis. Surprisingly, we show that in high dimensions, and for the models
we analyze, the problem becomes essentially linear, which is at odds with heuristics sometimes used to justify the usage
of these methods. The analysis also highlights certain peculiarities of models widely studied in random matrix theory and
raises some questions about their relevance as tools to model high-dimensional data encountered in practice. **Keyword note:** El__Karoui__Noureddine **Report ID:** 748
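The object studied in this abstract is straightforward to simulate. A minimal sketch (not the paper's analysis; the kernel function, dimensions, and seed are illustrative choices) that forms a kernel random matrix with (i,j)-th entry f(X_i'X_j/p) and extracts its spectrum:

```python
import numpy as np

def kernel_matrix(X, f):
    """Form the n x n matrix with (i,j) entry f(X_i' X_j / p).

    X is an n x p data matrix; f is a scalar function applied entrywise.
    """
    n, p = X.shape
    return f(X @ X.T / p)

rng = np.random.default_rng(0)
n, p = 200, 200                     # high-dimensional regime: p of the same order as n
X = rng.standard_normal((n, p))     # independent data vectors as rows

# A locally smooth f, e.g. f(x) = exp(x)
M = kernel_matrix(X, np.exp)
eigvals = np.linalg.eigvalsh(M)     # the spectrum under study
```

In this regime one can compare `eigvals` against the spectrum of the linearization of f around its typical argument, which is the kind of comparison the abstract's "essentially linear" conclusion concerns.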

**Title:** The Effectiveness of Internet Content Filters **Author(s):** Stark, P.B. **Date issued:** November 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0905 (PDF) **Abstract:** As part of its defense of the Child Online Protection Act, which seeks to prevent minors from viewing commercially published
harmful-to-minors material on the World Wide Web, the U.S. Department of Justice commissioned a study of the prevalence of
adult materials and the effectiveness of Internet content filters in blocking them. As of 2005-6, about 1.1% of webpages
indexed by Google and MSN were adult - hundreds of millions of pages. About 6% of a set of 1.3 billion searches executed
on AOL, MSN and Yahoo! in summer 2005 retrieved at least one adult webpage among the first ten results, and about 1.7% of the
first ten results were adult webpages. These estimates are based on simple random samples of webpages indexed by search engines
and a stratified random sample of searches. Webpages with sexually explicit content intended for adult entertainment (i.e.,
not in an educational, medical or artistic context) were used to test a variety of Internet content filters for underblocking:
failing to block webpages that they are intended to block. A random sample of clean webpages with no sexual content or reference
to sex was used to test the filters for overblocking: blocking webpages they are not intended to block. Webpages retrieved
by the most popular searches according to Wordtracker were also categorized and used to test filters. Generally, filters
with lower rates of underblocking had higher rates of overblocking. If the filter most effective at blocking adult materials
were applied to search indexes, typical query results or the results of popular queries, the number of clean pages blocked
in error would exceed the number of adult pages blocked correctly. **Keyword note:** Stark__Philip_B **Report ID:** 746

**Title:** Covariance regularization by thresholding **Author(s):** Bickel, Peter J.; Levina, Elizaveta **Date issued:** October 2007

http://nma.berkeley.edu/ark:/28722/bk0005d094c (PDF) **Abstract:** This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding.
We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in
a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are
uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling
scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding
to other covariance estimators in simulations and on an example from climate data. **Keyword note:** Bickel__Peter_John Levina__Elizaveta **Report ID:** 744
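Hard thresholding as described here is a one-line operation on the sample covariance. A minimal sketch, assuming a threshold of order sqrt(log p / n) as in the rate quoted above; the constant 2 and the simulation setup are illustrative, not the paper's:

```python
import numpy as np

def threshold_cov(X, t):
    """Hard-threshold the sample covariance of X (rows = observations):
    zero out off-diagonal entries with |s_ij| <= t."""
    S = np.cov(X, rowvar=False)
    T = np.where(np.abs(S) > t, S, 0.0)
    np.fill_diagonal(T, np.diag(S))      # keep the diagonal regardless
    return T

rng = np.random.default_rng(1)
n, p = 100, 50
Sigma = np.eye(p)                        # a sparse truth: the identity
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

t = 2 * np.sqrt(np.log(p) / n)           # illustrative constant
S_hat = threshold_cov(X, t)
err = np.linalg.norm(S_hat - Sigma, 2)   # operator norm error
```

For this sparse truth, the thresholded estimate has smaller operator norm error than the raw sample covariance, which is the consistency phenomenon the abstract describes.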

**Title:** Joint covariate selection for grouped classification **Author(s):** Obozinski, Guillaume; Taskar, Ben; Jordan, Michael **Date issued:** October 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0971 (PDF) **Abstract:** We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification
problems. We propose a joint measure of complexity for the group of problems that couples covariate selection. By penalizing
the sum of L2-norms of the blocks of coefficients associated with each covariate across different classification problems,
we encourage similar sparsity patterns in all models. To fit parameters under this regularization, we propose a blockwise
boosting scheme that follows the regularization path. As the regularization coefficient decreases, the algorithm maintains
and updates concurrently a growing set of covariates that are simultaneously active for all problems. We show empirically
that this approach outperforms independent L1-based covariate selection on several data sets, both in accuracy and number
of selected covariates. **Keyword note:** Obozinski__Guillaume Taskar__Ben Jordan__Michael_I **Report ID:** 743
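The penalty described here, a sum over covariates of the L2 norm of that covariate's coefficients across problems, is simple to write down. A sketch with an illustrative coefficient matrix (the shapes and values are made up for demonstration):

```python
import numpy as np

def block_l2_penalty(W):
    """Sum over covariates of the L2 norm of each covariate's block of
    coefficients across tasks. W has shape (p, K): p covariates, K
    classification problems; row j collects covariate j's coefficients."""
    return np.linalg.norm(W, axis=1).sum()

# Covariate 0 active in all tasks, covariate 1 inactive everywhere,
# covariate 2 active in two of three tasks.
W = np.array([[1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0],
              [3.0, 4.0, 0.0]])
pen = block_l2_penalty(W)   # sqrt(3) + 0 + 5
```

Because the penalty is a sum of row norms, minimizing it drives entire rows of W to zero at once, which is exactly the shared sparsity pattern across models that the abstract describes.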

**Title:** Conservative Statistical Post-Election Audits **Author(s):** Stark, Philip B. **Date issued:** October 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0c4d (PDF) **Abstract:** There are many sources of error in counting votes on election day: the apparent winner might not be the rightful winner. Hand
tallies of the votes in a random sample of precincts can be used to test the hypothesis that the wrong candidate was named
the winner. This paper develops a conservative sequential test based on the vote-counting errors found in a hand tally of
a simple or stratified random sample of precincts. **Keyword note:** Stark__Philip_B **Report ID:** 741

**Title:** On spectral properties of large dimensional correlation matrices and covariance matrices computed from elliptically distributed
data **Author(s):** El Karoui, Noureddine **Date issued:** October 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0d3c (PDF) **Abstract:** We place ourselves in the setting of high-dimensional statistical inference, where the number of variables p in a dataset
of interest is of the same order of magnitude as the number of observations n. More formally we study the asymptotic properties
of correlation and covariance matrices under the setting that $p/n$ tends to $\rho \in (0,\infty)$, for general population covariance.
We show that spectral properties for large dimensional correlation matrices are similar to those of large dimensional covariance
matrices, for a large class of models studied in random matrix theory. We also derive a Marcenko-Pastur-type system of equations
for the limiting spectral distribution of covariance matrices computed from elliptically distributed data. The motivation
for this study comes from the relevance of such distributional assumptions to problems in econometrics and portfolio optimization.
A mathematical theme of the paper is the important use we make of concentration inequalities. **Keyword note:** El__Karoui__Noureddine **Report ID:** 740

**Title:** Consistent Estimates of Deformed Isotropic Gaussian Random Fields on the Plane **Author(s):** Anderes, Ethan; Chatterjee, Sourav **Date issued:** October 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0f2b (PDF) **Abstract:** This paper proves fixed domain asymptotic results for estimating a smooth invertible transformation $f\colon \Bbb R^2\rightarrow
\Bbb R^2$ when observing the deformed random field $Z\circ f$ on a dense grid in a bounded simply connected domain $\Omega$
where $ Z$ is assumed to be an isotropic Gaussian random field on $\Bbb R^2$. The estimate, $\hat f$, is constructed on a
simply connected domain $U$ such that $\overline U\subset\Omega$ and is defined using kernel smoothed quadratic variations,
Bergman projections and results from quasiconformal theory. We show under mild assumptions on the random field $Z$ and the
deformation $f$ that $\hat f\rightarrow R_\theta f+c$ uniformly on compact subsets of $U$ with probability one as the grid
spacing goes to zero, where $R_\theta$ is an unidentifiable rotation and $c$ is an unidentifiable translation. **Keyword note:** Anderes__Ethan Chatterjee__Sourav **Report ID:** 739

**Title:** Convergence Analysis of Reweighted Sum-Product Algorithms **Author(s):** Roosta, Tanya; Wainwright, Martin J.; Sastry, Shankar **Date issued:** August 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0g7m (PDF) **Abstract:** Markov random fields are designed to represent structured dependencies among large collections of random variables, and are
well-suited to capture the structure of real-world signals. Many fundamental tasks in signal processing (e.g., smoothing,
denoising, segmentation etc.) require efficient methods for computing (approximate) marginal probabilities over subsets of
nodes in the graph. The marginalization problem, though solvable in linear time for graphs without cycles, is computationally
intractable for general graphs with cycles. This intractability motivates the use of approximate "message-passing" algorithms.
This paper studies the convergence and stability properties of the family of *reweighted sum-product algorithms*, a generalization
of the widely-used sum-product or belief propagation algorithm, in which messages are adjusted with graph-dependent weights.
For homogeneous models, we provide a complete characterization of the potential settings and message weightings that guarantee
uniqueness of fixed points, and convergence of the updates. For more general inhomogeneous models, we derive a set of sufficient
conditions that ensure convergence, and provide bounds on convergence rates. Experimental simulations on various classes
of graphs validate our theoretical results. **Keyword note:** Roosta__Tanya Wainwright__Martin Sastry__Shankar **Report ID:** 737

**Title:** Spectra of random linear combinations of matrices defined via representations and Coxeter generators of the symmetric group **Author(s):** Evans, Steven N. **Date issued:** August 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0h1t (PDF) **Abstract:** We consider the asymptotic behavior as $n \rightarrow \infty$ of the spectra of random matrices of the form \[\frac{1}{\sqrt{n-1}}
\sum_{k=1}^{n-1} Z_{nk} \rho_n((k,k+1)),\] where for each $n$ the random variables $Z_{nk}$ are i.i.d. standard Gaussian and
the matrices $\rho_n((k,k+1))$ are obtained by applying an irreducible unitary representation $\rho_n$ of the symmetric group
on $\{1,2,\ldots,n\}$ to the transposition $(k,k+1)$ that interchanges $k$ and $k+1$ (thus $\rho_n((k,k+1))$ is both unitary
and self-adjoint, with all eigenvalues either $+1$ or $-1$). Irreducible representations of the symmetric group on $\{1,2,\ldots,n\}$
are indexed by partitions $\lambda_n$ of $n$. A consequence of the results we establish is that if $\lambda_{n,1} \ge \lambda_{n,2}
\ge \cdots \ge 0$ is the partition of $n$ corresponding to $\rho_n$, $\mu_{n,1} \ge \mu_{n,2} \ge \cdots \ge 0$ is the corresponding
conjugate partition of $n$ (that is, the Young diagram of $\mu_n$ is the transpose of the Young diagram of $\lambda_n$),
$\lim_{n \rightarrow \infty} \frac{\lambda_{n,i}}{n} = p_i$ for each $i \ge 1$, and $\lim_{n \rightarrow \infty} \frac{\mu_{n,j}}{n}
= q_j$ for each $j \ge 1$, then the spectral measure of the resulting random matrix converges in distribution to a random
probability measure that is Gaussian with mean $\theta Z$ and variance $1 - \theta^2$, where $\theta$ is the constant $\sum_i
p_i^2 - \sum_j q_j^2$ and $Z$ is a standard Gaussian random variable. **Keyword note:** Evans__Steven_N **Report ID:** 736

**Title:** On the gene ranking of replicated microarray time course data **Author(s):** Tai, Yu Chuan; Speed, Terence P. **Date issued:** August 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0h51 (PDF) **Abstract:** Consider the gene ranking problem of replicated microarray time course experiments where there are multiple biological conditions,
and genes of interest are those whose temporal profiles are different across conditions. We derive the multi-sample multivariate
empirical Bayes statistic for ranking genes in the order of differential expression, from both longitudinal and cross-sectional
replicated developmental microarray time course data. Our longitudinal multi-sample model assumes that time course replicates
are i.i.d. multivariate normal vectors. On the other hand, we construct our cross-sectional model using a normal regression
framework with any appropriate basis for the design matrices. In both cases, we use natural conjugate priors in our empirical
Bayes setting which guarantee closed form solutions for the posterior odds. Our simulations and two case studies using published
worm and mouse microarray time course datasets indicate that the proposed approaches work well. **Keyword note:** Tai__Yu_Chuan Speed__Terry_P **Report ID:** 735

**Title:** Operator norm consistent estimation of large dimensional sparse covariance matrices **Author(s):** El Karoui, Noureddine **Date issued:** July 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0h97 (PDF) **Abstract:** Estimating covariance matrices is a problem of fundamental importance in multivariate statistics. In practice it is increasingly
frequent to work with data matrices $X$ of dimension $n\times p$, where $p$ and $n$ are both large. Results from random matrix
theory show very clearly that in this setting, standard estimators like the sample covariance matrix perform in general very
poorly. In this "large $n$, large $p$" setting, it is sometimes the case that practitioners are willing to assume that many
elements of the population covariance matrix are equal to 0, and hence this matrix is sparse. We develop an estimator to handle
this situation. The estimator is shown to be consistent in operator norm, when $p/n \to l \neq 0$, where $l$ is generally
finite, as $p \to \infty$. In other words, the largest eigenvalue of the difference between the estimator and the population
covariance matrix goes to zero. This implies consistency of all the eigenvalues and consistency of eigenspaces associated
to isolated eigenvalues. We also propose a notion of sparsity for matrices that is "compatible" with spectral analysis and
is independent of the ordering of the variables. **Keyword note:** El__Karoui__Noureddine **Report ID:** 734

**Title:** Event Weighted Tests for Detecting Periodicity in Photon Arrival Times **Author(s):** Bickel, Peter; Kleijn, Bas; Rice, John **Date issued:** June 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0j3f (PDF) **Abstract:** This paper treats the problem of detecting periodicity in a sequence of photon arrival times, which occurs, for example,
in attempting to detect gamma-ray pulsars. A particular focus is on how auxiliary information, typically source intensity,
background intensity, and incidence angles and energies associated with each photon arrival should be used to maximize the
detection power. We construct a class of likelihood-based tests, score tests, which give rise to event weighting in a principled
and natural way, and derive expressions quantifying the power of the tests. These results can be used to compare the efficacies
of different weight functions, including cuts in energy and incidence angle. The test is targeted toward a template for the
periodic lightcurve, and we quantify how deviation from that template affects the power of detection. **Keyword note:** Bickel__Peter_John Kleijn__Bas Rice__John_Andrew **Report ID:** 733

**Title:** An experimental study comparing linguistic phylogenetic reconstruction methods **Author(s):** Barbancon, Francois; Warnow, Tandy; Evans, Steven N.; Ringe, Donald; Nakhleh, Luay **Date issued:** May 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0j9r (PDF) **Abstract:** The estimation of linguistic evolution has intrigued many researchers for centuries, and in just the last few years, several
new methods for constructing phylogenies from languages have been produced and used to analyze a number of language families.
These analyses have led to a great deal of excitement, both within the field of historical linguistics and in related fields
such as archaeology and human genetics. They have also been controversial, since the analyses have not always been consistent
with each other, and the differences between different reconstructions have been potentially critical to the claims made by
the different groups. In this paper, we report on a simulation study we performed in order to help resolve this controversy,
which compares some of the main phylogeny reconstruction methods currently being used in linguistic cladistics. Our simulated
datasets varied in the number of contact edges, the degree of homoplasy, the deviation from a lexical clock, and the deviation
from the rates-across-sites assumption. We find the accuracy of the unweighted methods maximum parsimony, neighbor joining,
lexico-statistics, and the method of Gray & Atkinson, to be remarkably consistent across all the model conditions we studied,
with maximum parsimony being the best, followed (often closely) by Gray & Atkinson's method, then neighbor joining, and finally
lexico-statistics (UPGMA). The accuracy of the two weighted methods (weighted maximum parsimony and weighted maximum compatibility)
depends upon the appropriateness of the weighting scheme, and so depends upon the homoplasy levels produced by the model conditions;
for low-homoplasy levels, however, the weighted methods generally produce the most accurate results of all methods, while
the use of inappropriate weighting schemes can make for poorer results than maximum parsimony and Gray & Atkinson's method
under moderate to high homoplasy levels. **Keyword note:** Barbancon__Francois Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay **Report ID:** 732

**Title:** Sparse, noisy Boolean functions **Author(s):** Mukherjee, S.; Speed, T. P. **Date issued:** April 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0k1v (PDF) **Abstract:** This paper addresses the question of making inferences regarding Boolean functions under conditions of (i) noise, or stochastic
variation in observed data, and (ii) sparsity, by which we mean that the number of inputs or predictors far exceeds the arity
of the underlying Boolean function. We put forward a simple probability model for such observations, and discuss model selection,
parameter estimation and prediction. We present results on synthetic data and on a proteomic dataset from a study in cancer
systems biology. **Keyword note:** Mukherjee__Sach Speed__Terry_P **Report ID:** 731
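A toy version of the setting in this abstract: a Boolean function of small arity, many candidate inputs, and observations corrupted by flip noise. The generative model, noise level, and the selection-by-agreement score below are illustrative assumptions, not the authors' probability model:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p, eps = 2000, 10, 0.1            # observations, candidate inputs, flip probability

X = rng.integers(0, 2, size=(n, p))
truth = X[:, 3] | X[:, 7]            # sparse truth: an OR of just inputs 3 and 7
flip = rng.random(n) < eps
y = np.where(flip, 1 - truth, truth) # noisy observations of the function value

def agreement(i, j):
    """Fraction of observations explained by an OR of inputs i and j."""
    return np.mean((X[:, i] | X[:, j]) == y)

# Model selection by exhaustive search over arity-2 candidates
best = max(itertools.combinations(range(p), 2), key=lambda ij: agreement(*ij))
```

With enough observations relative to the noise level, the true input pair scores markedly higher than any competitor, so the search recovers the relevant inputs despite the noise.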

**Title:** Low density graph codes that are optimal for binning and coding with side information **Author(s):** Wainwright, Martin J.; Martinian, Emin **Date issued:** April 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0k52 (PDF) **Abstract:** We describe and analyze the joint source/channel coding properties of a class of sparse graphical codes based on compounding
a low-density generator matrix (LDGM) code with a low-density parity check (LDPC) code. Our first pair of theorems establishes
that there exist codes from this ensemble, with all degrees remaining bounded independently of block length, that are simultaneously
optimal as both source and channel codes when encoding and decoding are performed optimally. More precisely, in the context
of lossy compression, we prove that finite degree constructions can achieve any pair $(R, D)$ on the rate-distortion
curve of the binary symmetric source. In the context of channel coding, we prove that finite degree codes can achieve any
pair $(C, \epsilon)$ on the capacity-noise curve of the binary symmetric channel. Next, we show that our compound construction
has a nested structure that can be exploited to achieve the Wyner-Ziv bound for source coding with side information (SCSI),
as well as the Gelfand-Pinsker bound for channel coding with side information (CCSI). Although the current results are based
on optimal encoding and decoding, the proposed graphical codes have sparse structure and high girth that renders them well-suited
to message-passing and other efficient decoding procedures. **Keyword note:** Wainwright__Martin Martinian__Emin **Report ID:** 730

**Title:** Markov chain Monte Carlo for Structural Inference with Prior Information **Author(s):** Mukherjee, Sach; Speed, Terence P. **Date issued:** April 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0m0t (PDF) **Abstract:** This paper addresses the question of making inferences regarding features of conditional independence graphs in settings characterized
by the availability of rich prior information regarding such features. We focus on Bayesian networks, and use Markov chain
Monte Carlo to draw samples from the relevant posterior over graphs. We introduce a class of "locally-informative priors"
which are highly flexible and capable of taking account of specific information regarding graph features, and are, in addition,
informative at a scale appropriate to local sampling moves. We present examples of such priors for beliefs regarding edges,
groups and classes of edges, degree distributions and sparsity, applying our methods to challenging synthetic data as well
as data obtained from a biological network in cancer. **Keyword note:** Mukherjee__Sach Speed__Terry_P **Report ID:** 729

**Title:** Fast Rates for Estimation Error and Oracle Inequalities for Model Selection **Author(s):** Bartlett, Peter L. **Date issued:** March 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0m41 (PDF) **Abstract:** We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off
estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known
that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of
our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where
complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from
a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between
the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal
deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity
penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion,
in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even
in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation
error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model
selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty. **Keyword note:** Bartlett__Peter **Report ID:** 728
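The recipe in this abstract, minimizing an empirical risk term plus a complexity penalty over models ordered by inclusion, can be sketched on nested polynomial models. The penalty form and constant below are illustrative stand-ins, not the tight estimation-error penalties the paper constructs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(n)   # true model: degree 1

def empirical_risk(deg):
    """In-sample mean squared error of the degree-deg polynomial fit."""
    coef = np.polyfit(x, y, deg)
    return np.mean((np.polyval(coef, x) - y) ** 2)

# Nested models ordered by inclusion: degree 0 < 1 < ... < 8.
# Penalize each model by a term growing with its parameter count;
# the constant c is an illustrative choice.
def penalized(deg, c=0.05):
    return empirical_risk(deg) + c * (deg + 1) * np.log(n) / n

best = min(range(9), key=penalized)
```

Empirical risk alone always prefers the largest model; the penalty makes the small gains from extra degrees unprofitable, so the selected degree matches the truth.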

**Title:** Coverage Adjusted Entropy Estimation **Author(s):** Vu, Vincent Q.; Yu, Bin; Kass, Robert E. **Date issued:** March 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0m87 (PDF) **Abstract:** Data on "neural coding" have frequently been analyzed using information-theoretic measures. These formulations involve the
fundamental, and generally difficult statistical problem of estimating entropy. In this paper we review briefly several methods
that have been advanced to estimate entropy, and highlight a method due to Chao and Shen that appeared recently in the environmental
statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite
population and adjusts for the potential new species that have not yet been observed in the sample---these become the new
patterns or "words" in a spike train that have not yet been observed. The adjustment is due to I.J. Good, and is called the
Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator,
which shrinks the MLE (the naive or "direct" plug-in estimate) toward zero. We prove that the coverage adjusted estimator,
due to Chao and Shen, is consistent and first-order optimal, with rate $O_P(1/\log n)$, in the class of distributions with
finite entropy variance and that within the class of distributions with finite $q$th moment of the log-likelihood, the Good-Turing
coverage estimate and the total probability of unobserved words converge at rate $O_P(1/(\log n)^q)$. We then provide a simulation
study of the estimator with standard distributions and examples from neuronal data, where observations are dependent. The
results show that, with a minor modification, the coverage adjusted estimator performs much better than the MLE and is better
than the Best Upper Bound estimator, due to Paninski, when the number of possible words $m$ is unknown or infinite. **Keyword note:** Vu__Vincent_Q Yu__Bin Kass__Robert_E **Report ID:** 727
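The coverage-adjusted estimator sketched in this abstract has a short closed form: shrink the empirical frequencies by the Good-Turing coverage estimate, then correct each term by the probability the word was observed at all (the Horvitz-Thompson adjustment). A minimal implementation of the standard Chao-Shen formula; the simulation, alphabet size, and the guard for the all-singletons corner case are illustrative choices, not the authors':

```python
import numpy as np
from collections import Counter

def chao_shen_entropy(samples):
    """Coverage-adjusted (Chao-Shen) entropy estimate, in nats."""
    n = len(samples)
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p_mle = counts / n                       # naive plug-in frequencies
    f1 = np.sum(counts == 1)                 # number of words seen exactly once
    # Good-Turing coverage estimate; crude guard avoids C = 0
    C = 1.0 - f1 / n if f1 < n else 1.0 - (n - 1) / n
    p = C * p_mle                            # shrink the MLE toward zero
    # Horvitz-Thompson correction: divide by P(word observed at least once)
    return -np.sum(p * np.log(p) / (1.0 - (1.0 - p) ** n))

rng = np.random.default_rng(4)
samples = rng.integers(0, 8, size=500).tolist()   # uniform on 8 "words"
H_hat = chao_shen_entropy(samples)
H_true = np.log(8)
```

When every word is well sampled the coverage is near 1 and the estimate is close to the plug-in value; the adjustment matters most when many words remain unobserved.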

**Title:** The Impact of an Inaccurate Diagnostic Biomarker on Phase II Clinical Trials in The Development of Targeted Therapy **Author(s):** Wang, Nancy N.; Rabbee, Nusrat **Date issued:** March 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0n30 (PDF) **Abstract:** Current research in oncology aims at developing targeted therapies to treat the heterogeneous patient population. Successful
development of a targeted therapy requires a biomarker that identifies patients who are most likely to benefit from the treatment.
However, most biomarkers are inherently inaccurate. We present a simulation study to examine how the sensitivity and specificity
of a single, binary biomarker influences the Cox estimates of hazard ratios in phase II clinical trials. We discuss how the
bias introduced by marker inaccuracy impacts the decision of whether to carry a drug forward to a phase III clinical trial.
Finally, we propose a bootstrap-based method for reducing the bias of the Cox estimator, in the presence of an inaccurate
marker. **Keyword note:** Wang__Nancy_N Rabbee__Nusrat **Report ID:** 726

**Title:** Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting **Author(s):** Wainwright, Martin J. **Date issued:** January 2007

http://nma.berkeley.edu/ark:/28722/bk0005d0n76 (PDF) **Abstract:** The problem of recovering the sparsity pattern of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$ based on a set of $n$ noisy
observations arises in a variety of settings, including subset selection in regression, graphical model selection, signal
denoising, compressive sensing, and constructive approximation. Of interest are conditions on the model dimension $p$, the
sparsity index $s$ (number of non-zero entries in $\beta^*$), and the number of observations $n$ that are necessary and/or
sufficient to ensure asymptotically perfect recovery of the sparsity pattern. **Keyword note:** Wainwright__Martin **Report ID:** 725
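The recovery problem this abstract sets up is easy to instantiate. A sketch using the simplest decoder, keeping the $s$ largest marginal correlations; this decoder and all constants are illustrative choices, not ones analyzed in the paper, which studies limits that hold regardless of decoder:

```python
import numpy as np

rng = np.random.default_rng(5)
p, s, n = 100, 3, 400          # model dimension, sparsity index, observations
support = [10, 40, 70]          # illustrative sparsity pattern

beta = np.zeros(p)
beta[support] = 1.0
X = rng.standard_normal((n, p))           # i.i.d. Gaussian design
y = X @ beta + 0.5 * rng.standard_normal(n)   # noisy observations

# With an i.i.d. design, X'y/n concentrates around beta, so the s
# largest marginal correlations identify the sparsity pattern.
corr = X.T @ y / n
est_support = np.sort(np.argsort(np.abs(corr))[-s:])
```

Shrinking `n` toward the information-theoretic threshold makes this (and every) decoder fail, which is the regime the abstract's necessary conditions describe.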