Title:Structured Prediction, Dual Extragradient and Bregman Projections
Author(s):Taskar, Ben; Lacoste-Julien, Simon; Jordan, Michael I.;
Date issued:November 2005
Abstract:We present a simple and scalable algorithm for maximum-margin estimation of structured prediction models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
Keyword note:Taskar__Ben Lacoste-Julien__Simon Jordan__Michael_I
Title:Curse-of-dimensionality revisited: Collapse of importance sampling in very high-dimensional systems.
Author(s):Li, Bo; Bengtsson, Thomas; Bickel, Peter;
Date issued:November 2005
Abstract:It has been widely realized that Monte Carlo methods (approximation via a sample ensemble) may fail in large scale systems. This work offers some theoretical insight into this phenomenon. In the context of a particle filter (as well as in general importance samplers), we demonstrate that the maximum of the weights associated with the sample ensemble members converges to one as both sample size and system dimension tends to infinity. Under fairly weak assumptions, this convergence is shown to hold for both a Gaussian case and for a more general case with iid kernels. Similar singularity behavior is also shown to hold for non-Gaussian, spherically symmetric kernels (e.g. multivariate Cauchy distribution). In addition, in certain large scale settings, we show that the estimator of an expectation based on importance sampling converges weakly to a law, rather than the target constant. Our work is presented and discussed in the context of atmospheric data assimilation for numerical weather prediction.
Keyword note:Li__Bo Bengtsson__Thomas Bickel__Peter_John
Title:On divergences, surrogate loss functions and decentralized detection
Author(s):Nguyen, X.; Wainwright, M. J.; Jordan, M. I.;
Date issued:October 2005
Abstract:We develop a general correspondence between a family of loss functions that act as surrogates to 0-1 loss, and the class of Ali-Silvey or $f$-divergence functionals. This correspondence provides the basis for choosing and evaluating various surrogate losses frequently used in statistical learning (e.g., hinge loss, exponential loss, logistic loss); conversely, it provides a decision-theoretic framework for the choice of divergences in signal processing and quantization theory. We exploit this correspondence to characterize the statistical behavior of a nonparametric decentralized hypothesis testing algorithms that operate by minimizing convex surrogate loss functions. In particular, we specify the family of loss functions that are equivalent to 0-1 loss in the sense of producing the same quantization rules and discriminant functions.
Keyword note:Nguyen__XuanLong Wainwright__Martin Jordan__Michael_I
Title:Brownian motion on time scales, basic hypergeometric functions, and some continued fractions of Ramanujan
Author(s):Bhamidi, Shankar; Evans, Steven N.; Peled, Ron; Ralph, Peter;
Date issued:September 2005
Abstract:Motivated by L\'evy's characterization of Brownian motion on the line, we propose an analogue of Brownian motion that has as its state space an arbitrary unbounded closed subset of the line: such a process will a martingale, has the identity function as its quadratic variation process, and is "continuous" in the sense that its sample paths don't skip over points. We show that there is a unique such process, which turns out to be automatically a Feller-Dynkin Markov process. We find its generator, which is a natural generalization of the operator $f \mapsto \frac(1)(2) f''$. We then consider the special case where the state space is the self-similar set $\(\pm q^k : k \in \Z\) \cup \(0\)$ for some $q>1$. Using the scaling properties of the process, we represent the Laplace transforms of various hitting times as certain continued fractions that appear in Ramanujan's "lost" notebook and evaluate these continued fractions in terms of basic hypergeometric functions (that is, $q$-analogues of classical hypergeometric functions). The process has $0$ as a regular instantaneous point, and hence its sample paths can be decomposed into a Poisson process of excursions from $0$ using the associated continuous local time. Using the reversibility of the process with respect to the natural measure on the state space, we find the entrance laws of the corresponding It\^o excursion measure and the Laplace exponent of the inverse local time -- both again in terms of basic hypergeometric functions. By combining these ingredients, we obtain explicit formulae for the resolvent of the process. We also compute the moments of the process in closed form. Some of our results involve $q$-analogues of classical distributions such as the Poisson distribution that have appeared elsewhere in the literature.
Keyword note:Bhamidi__Shankar Evans__Steven_N Peled__Ron Ralph__Peter
Title:A genotype calling algorithm for Affymetrix SNP arrays
Author(s):Rabbee, Nusrat; Speed, Terence P.;
Date issued:August 2005
Abstract:Motivation: A classification algorithm, based on a multi-chip, multi-SNP approach is proposed for Affymetrix SNP arrays. Current procedures for calling genotypes on SNP arrays process all the features associated with one chip and one SNP at a time. Using a large training sample where the genotype labels are known, we develop a supervised learning algorithm to obtain more accurate classification results on new data. The method we propose, RLMM, is based on a robustly fitted, linear model and uses the Mahalanobis distance for classification. The chip-to-chip non-biological variance is reduced through normalization. This model-based algorithm captures the similarities across genotype groups and probes, as well as across thousands of SNPs for accurate classification. In this paper, we apply RLMM to Affymetrix 100K SNP array data, present classification results and compare them to genotype calls obtained from the Affymetrix procedure DM, as well as to the publicly available genotype calls from the HapMap project. Availability: The RLMM software is implemented in R and is available from the first author at email@example.com.
Keyword note:Rabbee__Nusrat Speed__Terry_P
Title:Bivariate variable selection for classification problem
Author(s):Ng, Vivian; Breiman, Leo;
Date issued:August 2005
Abstract:In recent years, large amount of attention has been placed on variable or feature selection in various domains. Varieties of variable selection methods have been proposed in the literature. However, most of them are focused on univariate variable selection -- method that selects relevant variables one by one. Currently, there is not much emphasis on variable selection on pairs of variables. It is not unreasonable, as researchers in industries have been asked to identify pairs of variables that are relevant. All is well using univariate variable selection for identifying independently significant variables, but pairs of independently important variables are not the same as pairs of variables that have joint effect. Therefore, univariate variable selection methods are not applicable in selecting pairs of linked variables. To overcome this obstacle, Professor Breiman and I propose a bivariate variable selection method that detects linked pairs of variables. It is equally important to learn the relationship between each linked pair with the response variable. To this end, a graphical tool is designed for visualizing the relationship uncovered by the proposed bivariate variable selection method.
Keyword note:Ng__Vivian_Wai_Ying Breiman__Leo
Title:Binning in Gaussian Kernel Regularization
Author(s):Shi, Tao; Yu, Bin;
Date issued:April 2005
Abstract:Gaussian kernel regularization is widely used in the machine learning literature and proven successful in many empirical experiments. The periodic version of the Gaussian kernel regularization has been shown to be minimax rate optimal in estimating functions in any finite order Sobolev spaces. However, for a data set with $n$ points, the computation complexity of the Gaussian kernel regularization method is of order $O$($n^3$). In this paper we propose to use binning to reduce the computation of Gaussian kernel regularization in both regression and classification. For the periodic Gaussian kernel regression, we show that the binned estimator achieves the same minimax rates of the unbinned estimator, but the computation is reduced to $O(m^3)$ with $m$ as the number of bins. To achieve the minimax rate in the $k$-th order Sobolev space, $m$ needs to be in the order of $O(kn^(1/(2k+1)))$, which makes the binned estimator computation of order $O(n)$ for $k=1$ and even less for larger $k$. Our simulations show that the binned estimator (binning 120 data points into 20 bins in our simulation) provides almost the same accuracy with only 0.4\% of computation time. For classification, binning with the $L2$-loss Gaussian kernel regularization and the Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem. With basically the same computation time, the $L2$-loss Gaussian kernel regularization on 966 bins achieves better test classification rate (79.22\%) than that (71.40\%) on 966 randomly sampled data. Using the OSU-SVM Matlab package, the SVM trained on 966 bins has a comparable test classification rate as the SVM trained on 27,179 samples, but reduces the training time from 5.99 hours to 2.56 minutes. The SVM trained on 966 randomly selected samples has a similar training time as and a slightly worse test classification rate than the SVM on 966 bins, but has 67\% more support vectors so takes 67\% longer to predict on a new data point. The SVM trained on 512 cluster centers from the k-mean algorithm reports almost the same test classification rate and a similar number of support vectors as the SVM on 512 bins, but the k-mean clustering itself takes 375 times more computation time than binning.
Keyword note:Shi__Tao Yu__Bin
Title:kA Probabilistic Interpretation of Canonical Correlation Analysis
Author(s):Bach, Francis R.; Jordan, Michael I.;
Date issued:April 2005
Abstract:We give a probabilistic interpretation of canonical correlation (CCA) analysis as a latent variable model for two Gaussian random vectors. Our interpretation is similar to the probabilistic interpretation of principal component analysis (Tipping and Bishop, 1999; Roweis, 1998). In addition, we cast Fisher linear discriminant analysis (LDA) within the CCA framework.
Keyword note:Bach__Francis_R Jordan__Michael_I
Title:Subtree prune and re-graft: a reversible real tree valued Markov process
Author(s):Evans, Steven N.; Winter, Anita;
Date issued:February 2005
Abstract:We use Dirichlet form methods to construct and analyze a reversible Markov process, the stationary distribution of which is the Brownian continuum random tree. This process is inspired by the subtree prune and re-graft (SPR) Markov chains that appear in phylogenetic analysis. A key technical ingredient in this work is the use of a novel Gromov--Hausdorff type distance to metrize the space whose elements are compact real trees equipped with a probability measure. Also, the investigation of the Dirichlet form hinges on a new path decomposition of the Brownian excursion.
Keyword note:Evans__Steven_N Winter__Anita