Communication-avoiding Parallel and Sequential QR Factorizations
Hoemmen, Mark Frederick
Technical Report Identifier: EECS-2008-74
May 29, 2008
Abstract: We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case), and data movement between slow and fast memory (in either the sequential or parallel cases).
Our first algorithm, Tall Skinny QR (TSQR), factors m-by-n matrices in a one-dimensional (1-D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires only Theta(log P) messages between P processors, which is the minimum possible. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Theta(mn/W) block reads and writes (as a function of the fast memory size W), which is within a constant factor of the theoretical lower bound. In contrast, the conventional parallel algorithm as implemented in ScaLAPACK requires Theta(n log P) messages, a factor of n times more, and the analogous sequential algorithm transfers Theta(mn^2) words between slow and fast memory, also a factor of n times more. TSQR uses only orthogonal transforms, so it is just as stable as standard Householder QR. Both parallel and sequential performance results show that TSQR outperforms competing methods.
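The core idea of TSQR can be sketched in a few lines: factor each row block independently, then combine the small R factors with a second QR. This is a minimal sequential illustration, not the report's implementation; the function name, the flat (single-level) combining step, and the choice to return only R (dropping the implicit tree of Q blocks) are simplifications for clarity.

```python
import numpy as np

def tsqr_r(A, block_rows):
    """Simplified TSQR sketch: QR each row block of a tall-skinny
    matrix, then one combining QR on the stacked R factors.
    Returns only the n-by-n R factor; the tree of Householder
    blocks representing Q is omitted for brevity."""
    m, n = A.shape
    # Leaf level: factor each row block independently.
    Rs = []
    for i in range(0, m, block_rows):
        _, R = np.linalg.qr(A[i:i + block_rows, :])
        Rs.append(R)
    # Combining level: stack the small R factors and factor again.
    # (In parallel, a binary reduction tree over P processors gives
    # the Theta(log P) message count.)
    _, R = np.linalg.qr(np.vstack(Rs))
    return R

# Usage: the TSQR R agrees with standard Householder QR's R up to
# the sign of each row (QR is unique up to diagonal sign choices).
A = np.random.default_rng(0).normal(size=(1000, 5))
R1 = tsqr_r(A, 250)
R2 = np.linalg.qr(A)[1]
assert np.allclose(np.abs(R1), np.abs(R2))
```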
Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization, which removes a latency bottleneck in ScaLAPACK's current parallel approach, and removes both bandwidth and latency bottlenecks in ScaLAPACK's out-of-core QR factorization. CAQR achieves modeled speedups of 2.1x on an IBM POWER5 cluster, 3.0x on a future petascale machine, and 3.8x on the Grid.