Statistics Technical Reports:Search | Browse by year

Sorted by:
Page: 1 2  Next

Title:Boosted Lasso and Reverse Boosting
Author(s):Zhao, Peng; Yu, Bin; 
Date issued:December 2004 (PDF)
Abstract:This paper introduces the concept of "backward" step in contrast with forward fashion algorithms like Boosting and Forward Stagewise Fitting. Like classical elimination methods, this "backward" step works by shrinking the model complexity of an ensemble learner. Through a step analysis, we show that this additional step is necessary for minimizing $L_1$ penalized loss (Lasso loss). We also propose a BLasso algorithm as a combination of both backward and forward steps which is able to produce the complete regularization path for Lasso problems. Moreover, BLasso can be generalized to solve problems with general convex loss with general convex penalty.
Keyword note:Zhao__Peng Yu__Bin
Report ID:678

Title:Estimating the Proportion of False Null Hypotheses among a Large Number of Independently Tested Hypotheses
Author(s):Meinshausen, Nicolai; Rice, John; 
Date issued:October 2004 (PDF)
Abstract:We consider the problem of estimating the number of false null hypotheses among a very large number of independently tested hypotheses, focusing on the situation in which the proportion of false null hypotheses is very small. We propose a family of methods for establishing lower $100(1-\alpha)\%$ confidence bounds for this proportion, based on the empirical distribution of the p-values of the tests. Methods in this family are then compared in terms of ability to consistently estimate the proportion by letting $\alpha \rightarrow 0$ as the number of hypothesis tests increases and the proportion decreases. This work is motivated by a signal detection problem occurring in astronomy.
Keyword note:Meinshausen__Nicolai Rice__John_Andrew
Report ID:676

Title:Variational inference for Dirichlet process mixtures
Author(s):Blei, David M.; Jordan, Michael I.; 
Date issued:October 2004 (PDF)
Abstract:Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled their applications to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper & Saad, 2001; Wainwright & Jordan, 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias, 2000; Ghahramani & Beal, 2001; Blei, et al., 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.
Keyword note:Blei__David_M Jordan__Michael_I
Report ID:674

Title:A stochastic model of language evolution that incorporates homoplasy and borrowing
Author(s):Warnow, Tandy; Evans, Steven N.; Ringe, Don; Nakhleh, Luay; 
Date issued:September 2004 (PDF) (PostScript)
Abstract:We propose a stochastic model of language evolution that permits both homoplasy and non-tree evolution (so as to reflect borrowing between lineages). We discuss the issues involved in reconstructing evolutionary histories under the model, specifically showing that the tree component of the model is identifiable, that efficient methods exist for reconstructing the tree, and that full likelihoods can be calculated in linear time. We conclude with a discussion of data selection and analysis, and compare our stochastic model for language evolution to existing models in molecular evolution.
Keyword note:Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay
Report ID:673

Title:A comparison of phylogenetic reconstruction methods on an IE dataset
Author(s):Nakhleh, Luay; Warnow, Tandy; Ringe, Don; Evans, Steven N.; 
Date issued:September 2004 (PDF) (PostScript)
Abstract:Researchers interested in the history of the Indo-European family of languages have used a variety of methods to estimate the phylogeny of the family, and have obtained widely differing results. In this paper we explore the reconstructions of Indo-European phylogeny obtained by using the major phylogeny estimation procedures on an existing database for 24 Indo-European languages compiled by linguists Don Ringe and Ann Taylor. Our study finds that the different methods agree in part, but that there also several striking differences. We discuss the reasons for these differences, and make proposals with respect to phylogenetic reconstruction in historical linguistics.
Keyword note:Nakhleh__Luay Warnow__Tandy Ringe__Don Evans__Steven_N
Report ID:672

Title:Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations
Author(s):Wainwright, M. J.; Jordan, M. I.; 
Date issued:September 2004 (PDF)
Abstract:The Sherali-Adams (SA) and Lasserre (LA) approaches are "lift-and-project" methods that generate nested sequences of linear and/or semidefinite relaxations of an arbitrary 0-1 polytope $P \subseteq [0,1]^n$. Although both procedures are known to terminate with an exact description of $P$ after $n$ steps, there are various open questions associated with characterizing, for particular problem classes, whether exactness is obtained at some step $s < n$. This paper provides sufficient conditions for exactness of these relaxations based on the hypergraph-theoretic notion of treewidth. More specifically, we relate the combinatorial structure of a given polynomial system to an underlying hypergraph. We prove that the complexity of assessing the global validity of moment sequences, and hence the tightness of the SA and LA relaxations, is determined by the \emph(treewidth) of this hypergraph. We provide some examples to illustrate this characterization.
Keyword note:Wainwright__Martin Jordan__Michael_I
Report ID:671

Title:A New Look at Survey Propagation and its Generalizations
Author(s):Maneva, E.; Mossel, E.; Wainwright, M. J.; 
Date issued:September 2004 (PDF)
Abstract:We study the survey propagation algorithm~\cite(MZ02, BMZ03,BMWZ03), which is an iterative technique that appears to be very effective in solving random $k$-SAT problems even with densities close to threshold. We first describe how any SAT formula can be associated with a novel family of Markov random fields (MRFs), parameterized by a real number $\rho \in [0,1]$. We then show that applying belief propagation---a well-known "message-passing" technique---to this family of MRFs recovers various algorithms, ranging from pure survey propagation at one extreme ($\rho = 1$) to standard belief propagation on the uniform distribution over SAT assignments at the other extreme ($\rho = 0$). Configurations in these MRFs have a natural interpretation as generalized satisfiability assignments, on which a partial order can be defined. We isolate \emph(cores) as minimal elements in this partial ordering, and prove that any core is a fixed point of survey propagation. We investigate the associated lattice structure, and prove a weight-preserving identity that shows how any MRF with $\rho > 0$ can be viewed as a "smoothed" version of the naive factor graph representation of the $k$-SAT problem ($\rho = 0$). Our experimental results suggest that random formulas typically do not possess non-trivial cores. This result and additional experiments indicate that message-passing on our family of MRFs is most effective for values of $\rho \neq 1$ (i.e., distinct from survey propagation). Finally, we isolate properties of Gibbs sampling and message-passing algorithms that are typical for an ensemble of $k$-SAT problems.
Keyword note:Maneva__E Mossel__Elchanan Wainwright__Martin
Report ID:669

Title:Unidentifiable divergence times in rates-across-sites models
Author(s):Evans, Steven N.; Warnow, Tandy; 
Date issued:August 2004 (PDF) (PostScript)
Abstract:The rates-across--sites assumption in phylogenetic inference posits that the rate matrix governing the Markovian evolution of a character on an edge of the putative phylogenetic tree is the product of a character-specific scale factor and a rate matrix that is particular to that edge. Thus, evolution follows basically the same process for all characters, except that it occurs faster for some characters than others. To allow estimation of tree topologies and edge lengths for such models, it is commonly assumed that the scale factors are not arbitrary unknown constants, but rather unobserved, independent, identically distributed draws from a member of some parametric family of distributions. A popular choice is the gamma family. We consider an example of a clock-like tree with three taxa, one unknown edge length, and a parametric family of scale factor distributions that contain the gamma family. This model has the property that, for a generic choice of unknown edge length and scale factor distribution, there is another edge length and scale factor distribution which generates data with exactly the same distribution, so that even with infinitely many data it will be typically impossible to make correct inferences about the unknown edge length.
Keyword note:Evans__Steven_N Warnow__Tandy
Report ID:668

Title:A multivariate empirical Bayes statistic for replicated microarray time course data
Author(s):Tai, Yu Chuan; Speed, Terence P.; 
Date issued:August 2004 (PDF)
Abstract:In this paper we derive a one-sample multivariate empirical Bayes statistic (the $MB$-statistic) to rank genes in the order of differential expression from replicated microarray time course experiments . We do this by testing the null hypothesis that the expectation of a $k$-vector of a gene's expression levels is a multiple of $1_k$, the vector of $k$ $1$s. The importance of moderation in this context is explained. Together with the $MB$-statistic we have the one-sample $\widetilde(T)^2$ statistic, a variant of the one-sample Hotelling $T^2$. Both the $MB$-statistic and $\widetilde(T)^2$ statistic can be used to rank genes in the order of evidence of nonconstancy, incorporating any correlation structure among time point samples and the replication. In a simulation study we show that the one-sample $MB$-statistic, $\widetilde(T)^2$ statistic, and moderated Hotelling $T^2$ statistic achieve the smallest number of false positives and false negatives, and all perform equally well. Several special and limiting cases of the $MB$-statistic are derived, and two-sample versions are described.
Keyword note:Tai__Yu_Chuan Speed__Terry_P
Report ID:667

Title:Using Random Forest to Learn Imbalanced Data
Author(s):Chen, Chao; Liaw, Andy; Breiman, Leo; 
Date issued:July 2004 (PDF)
Abstract:In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, $F$-measure and weighted accuracy are computed. Both methods are shown to improve the prediction accuracy of the minority class, and have favorable performance compared to the existing algorithms.
Keyword note:Chen__Chao Liaw__Andy Breiman__Leo
Report ID:666

Title:Estimating velocity fields on a freeway from low resolution video recordings
Author(s):Cho, Young; Rice, John; 
Date issued:July 2004 (PDF)
Abstract:We present an algorithm to estimate velocity fields from low resolution video recordings. The algorithm does not attempt to identify and track individual vehicles, nor does it attempt to estimate derivatives of the field of pixel intensities. Rather, we compress a frame by obtaining an intensity profile in each lane along the direction of traffic flow. The speed estimate is then computed by searching for a best matching profile in a frame at a later time. Because the algorithm does not need high quality images, it is directly applicable to a compressed format digital video stream, such as mpeg, from conventional traffic video cameras. We illustrate the procedure using a 15 minute long VHS recording to obtain speed estimates on a one mile stretch of highway I-80 in Berkeley, California.
Keyword note:Cho__Young Rice__John_Andrew
Report ID:665

Title:Measuring Traffic
Author(s):Bickel, Peter; Chen, Chao; Kwon, Jaimyoung; Rice, John; van Zwet, Erik; Varaiya, Pravin; 
Date issued:June 2004 (PDF)
Abstract:A traffic performance measurement system, PeMS, currently functions as a statewide repository for traffic data gathered by thousands of automatic sensors. It has integrated data collection, processing, and communications infrastructure with data storage and analytical tools. In this paper, we discuss statistical issues that have emerged as we attempt to process a data stream of two GB per day of wildly varying quality. In particular, we focus on detecting sensor malfunction, imputation of missing or bad data, estimation of velocity, and forecasting of travel times on freeway networks.
Keyword note:Bickel__Peter_John Chen__Chao Kwon__Jaimyoung Rice__John_Andrew van_Zwet__Erik Varaiya__Pravin
Report ID:664

Title:Cloud Detection over Snow and Ice Using MISR Data
Author(s):Shi, Tao; Yu, Bin; Clothiaux, Eugene E.; Braverman, Amy J.; 
Date issued:June 2004 (PDF)
Abstract:Clouds play a major role in Earth's climate and cloud detection is a crucial step in the processing of satellite observations in support of radiation budget, numerical weather prediction and global climate model studies. To advance the observational capabilities of detecting clouds and retrieving their cloud-top altitudes, NASA launched the Multi-angle Imaging SpectroRadiometer (MISR) in 1999, which provides data in nine different views of the same scene using four spectral channels. Cloud detection is particularly difficult in the snow- and ice-covered polar regions and availability of the novel MISR angle-dependent radiances motivates the current study on cloud detection using statistical methods. Three schemes using MISR data for polar cloud detection are investigated in this study. Using domain knowledge, three physical features are developed for detecting clouds in daylight polar regions. The features measure the correlations between MISR angle-dependent radiances, the smoothness of the reflecting surfaces, and the amount of forward scattering of radiances. The three features are the basis of the the first scheme, called Enhanced Linear Correlation Matching Classification (ELCMC). The ELCMC algorithm thresholds on three features and the thresholds are either fixed or found through the EM algorithm based on a mixture of two 1-dim Gaussians. The ELCMC algorithm results are subsequently used as training data in the development of two additional schemes, one Fisher's Quadratic Discriminate Analysis (ELCMC-QDA) and the other a Gaussian kernel Support Vector Machine(ELCMC-SVM). For both QDA- and SVM-based experiments two types of inputs are tested, the set of three physical features and the red radiances of the nine MISR cameras. All three schemes are applied to two polar regions where expert labels show that the MISR operational cloud detection algorithm does not work well, with a 53% misclassification rate in one region and a 92% nonretrieval rate in the other region. The ELCMC algorithm produces misclassification rates of 6.05% and 6.28% relative to expert labelled regions across the two polar scenes. The misclassification rates are reduced to approximately 4% by ELMCM-QDA and ELCMC-SVM in one region and approximately 2% in the other. Overall, all three schemes provided significantly more accurate results and greater spatial coverage than the MISR operational stereo-based cloud detection algorithm. Compared with ELCMC-QDA, ELCMC-SVM is more robust against mislabels in the ELCMC results and provide slightly better results, but it is computationally slower.
Keyword note:Shi__Tao Yu__Bin Clothiaux__Eugene_E Braverman__Amy_J
Report ID:663

Title:Balls-in-boxes duality for coalescing random walks and coalescing Brownian motion
Author(s):Evans, Steven N.; Zhou, Xiaowen; 
Date issued:June 2004 (PDF) (PostScript)
Abstract:We present a duality between two systems of coalescing random walks and an analogous duality between two systems of coalescing Brownian motions. Our results extend previous work in the literature and we apply them to the study of a system of coalescing Brownian motions with Poisson immigration.
Keyword note:Evans__Steven_N Zhou__Xiaowen
Report ID:662

Title:Sparse Gaussian process classification with multiple classes
Author(s):Seeger, Matthias; Jordan, Michael I.; 
Date issued:April 2004 (PDF) (PostScript)
Abstract:Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the binary classification informative vector machine (IVM) to multiple classes. In contrast to earlier efficient approaches to kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter estimation via marginal likelihood maximization. While most earlier proposals suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.
Keyword note:Seeger__Matthias Jordan__Michael_I
Report ID:661

Title:Inference of divergence times as a statistical inverse problem
Author(s):Evans, Steven N.; Ringe, Don; Warnow, Tandy; 
Date issued:April 2004 (PDF) (PostScript)
Abstract:Dating of divergence times in historical linguistics is an instance of a statistical inverse problem, and many of the issues that complicate the proper treatment of other inverse problems are also present there.
Keyword note:Evans__Steven_N Ringe__Don Warnow__Tandy
Report ID:660

Title:Stochastic models of language evolution and an application to the Indo-European family of languages
Author(s):Warnow, Tandy; Evans, Steven N.; Ringe, Don; Nakhleh, Luay; 
Date issued:April 2004 (PDF) (PostScript)
Abstract:We propose several models of how languages evolve, and discuss statistical estimation of evolution under these models.
Keyword note:Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay
Report ID:659

Title:Decentralized Detection and Classification using Kernel Methods
Author(s):Nguyen, XuanLong; Wainwright, Martin J.; Jordan, Michael I.; 
Date issued:April 2004 (PDF) (PostScript)
Abstract:We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels, and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm, and demonstrate its performance on both simulated and real data sets.
Keyword note:Nguyen__XuanLong Wainwright__Martin Jordan__Michael_I
Report ID:658

Title:A generalized model of mutation-selection balance with applications to aging
Author(s):Steinsaltz, David; Evans, Steven N.; Wachter, Kenneth W.; 
Date issued:March 2004 (PDF) (PostScript)
Abstract:A probability model is presented for the dynamics of mutation-selection balance in a haploid infinite-population infinite-sites setting sufficiently general to cover mutation-driven changes in full age-specific demographic schedules. The model accommodates epistatic as well as additive selective costs. Closed form characterizations are obtained for solutions in finite time, along with proofs of convergence to stationary distributions and a proof of the uniqueness of solutions in a restricted case. Examples are given of applications to the biodemography of aging, including instabilities in current formulations of mutation accumulation.
Keyword note:Steinsaltz__David Evans__Steven_N Wachter__Kenneth
Report ID:657

Title:Supplement to "Consistent Independent Component Analysis and Prewhitening"
Author(s):Chen, A.; Bickel, P. J.; 
Date issued:March 2004 (PDF)
Abstract:In this paper we study the statistical properties of a characteristic-function based algorithm for independent component analysis (ICA), which was proposed by Eriksson et. al. (2003) and Chen & Bickel (2003) independently. First, statistical consistency of this algorithm with prewhitening is analyzed, especially in existence of heavy-tailed sources. Second, without prewhitening this algorithm is shown to be robust against small additive noise.
Keyword note:Chen__Aiyou Bickel__Peter_John
Report ID:656

Page: 1 2  Next