Advanced Computation Group

Technical Papers Archive

The following papers by Apple’s Advanced Computation Group were originally developed for the PowerPC G4/G5 architectures. However, many of the concepts described below also apply to Intel architectures.

Gigaelement FFTs on Apple G5 clusters

Abstract: This paper describes PowerPC G5-based cluster configurations and software suitable for performing 1-D, 2-D, and 3-D fast Fourier transforms (FFTs) with 2^30 complex elements. Using the Accelerate framework, Velocity Engine, Xgrid, and various MPI implementations, we are able to sustain around 2 gigaflops (double-precision) and 4 gigaflops (single-precision) using four dual-CPU 2.0 GHz Power Mac G5s connected via Gigabit Ethernet. Using Myrinet, for single-precision 2-D transforms we achieved over 5 gigaflops with four machines, and nearly 9 gigaflops with eight. — August 2004

Supercomputer-style FFT Library for PowerPC G4

Abstract: A gigaflop G4 FFT is used within a recursive FFT framework to provide excellent, cache-friendly performance for very long signals. Various algorithmic frameworks are compared with respect to G4 performance. Also applicable to the G5 processor’s Velocity Engine. — January 2000
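
The general idea of wrapping a fast small-transform kernel in a recursive framework can be sketched as follows. This is a minimal illustration in Python, not the paper's library: the names `small_dft` and `recursive_fft` and the `kernel_size` parameter are hypothetical, the kernel is a plain direct DFT standing in for a vectorized one, and a simple radix-2 split is assumed.

```python
import cmath


def small_dft(x):
    """Direct O(n^2) DFT; stands in for a fast fixed-size kernel (assumption)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def recursive_fft(x, kernel_size=8):
    """Radix-2 decimation-in-time FFT that hands small blocks to small_dft."""
    n = len(x)
    # Base case: the block is small enough for the kernel, or cannot be halved.
    if n <= kernel_size or n % 2:
        return small_dft(x)
    even = recursive_fft(x[0::2], kernel_size)
    odd = recursive_fft(x[1::2], kernel_size)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Twiddle factor combining the two half-length transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

In a sketch like this, the recursion keeps each kernel call working on a block small enough to stay in cache, which is the property the abstract highlights for very long signals.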

Vector Implementation of Multiprecision Arithmetic

Abstract: An implementation of G4 multiprecision arithmetic is described. We show how vector arithmetic can be used to provide extremely fast multiplication for any operand size, with division and other operations following suit. Performance measurements are reported. Also applicable to the G5 processor’s Velocity Engine. — October 1999

Vector Implementation of Color-Image Wavelet Transform

Abstract: We describe two types of very fast wavelet transforms: a four-channel (RGBA) direct transform via the Daubechies D4 wavelet, and a YUV-based biorthogonal wavelet transform. We indicate how a clever scheme of "shift-rational" arithmetic can exploit the Velocity Engine vector capability. Performance results are reported for the various algorithmic modes. Also applicable to the G5 processor’s Velocity Engine. — October 1999
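
For readers unfamiliar with the D4 wavelet, one level of a four-channel transform over a row of pixels might look like the sketch below. It uses ordinary floating-point arithmetic with periodic wrap-around at the row ends, not the "shift-rational" scheme the paper describes, and the function names `d4_forward` and `d4_inverse` are hypothetical.

```python
import math

_S3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
# Daubechies D4 low-pass (scaling) coefficients.
H = ((1 + _S3) / _NORM, (3 + _S3) / _NORM, (3 - _S3) / _NORM, (1 - _S3) / _NORM)
# Matching high-pass (wavelet) coefficients.
G = (H[3], -H[2], H[1], -H[0])


def d4_forward(pixels):
    """One D4 level over a row of equal-length channel tuples, e.g. (R, G, B, A)."""
    n = len(pixels)
    if n < 4 or n % 2:
        raise ValueError("row length must be even and at least 4")
    channels = range(len(pixels[0]))
    smooth, detail = [], []
    for i in range(n // 2):
        # Four neighbouring pixels, wrapping around at the end of the row.
        taps = [pixels[(2 * i + k) % n] for k in range(4)]
        smooth.append(tuple(sum(H[k] * taps[k][c] for k in range(4)) for c in channels))
        detail.append(tuple(sum(G[k] * taps[k][c] for k in range(4)) for c in channels))
    return smooth, detail


def d4_inverse(smooth, detail):
    """Invert d4_forward; the transform is orthogonal, so this is its transpose."""
    half = len(smooth)
    n = 2 * half
    channels = len(smooth[0])
    out = [[0.0] * channels for _ in range(n)]
    for i in range(half):
        for k in range(4):
            j = (2 * i + k) % n
            for c in range(channels):
                out[j][c] += H[k] * smooth[i][c] + G[k] * detail[i][c]
    return [tuple(p) for p in out]
```

Because the same four filter taps are applied to every channel of a pixel, the four channel values can be processed together, which suggests why an RGBA layout maps naturally onto a four-wide vector unit.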

Fast Matrix Algebra on PowerPC G4

Abstract: We describe, first, how small-matrix multiply can proceed above gigaflop rates on G4; and, second, how to use such a core matrix operation in a larger Strassen recursion. We find the breakover point between classical and Strassen matrix recursion is roughly at 128-by-128 float entries. The overall effect is that square matrices with 6 megaentries can be multiplied several times faster than was once possible with optimized Cray-2 libraries. Also applicable to the G5 processor’s Velocity Engine. — Updated July 2004
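
The combination of a core small-matrix multiply with a Strassen recursion can be sketched as follows. This is an illustrative Python version under stated assumptions, not the paper's code: the helper names are hypothetical, `classical_multiply` is a plain nested-loop product standing in for the vectorized kernel, matrices are square lists of lists, and the default `cutoff` of 128 simply echoes the breakover reported in the abstract.

```python
def classical_multiply(a, b):
    """Plain row-by-column product; stands in for a fast core kernel (assumption)."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def _add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _split(m):
    """Return the four quadrants of a square matrix with even dimension."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=128):
    """Multiply square matrices, switching to the classical kernel at the cutoff."""
    n = len(a)
    # Small or odd-sized blocks go straight to the classical kernel.
    if n <= cutoff or n % 2:
        return classical_multiply(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # Seven recursive products instead of the eight a naive block split needs.
    m1 = strassen(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen(_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, _sub(b12, b22), cutoff)
    m4 = strassen(a22, _sub(b21, b11), cutoff)
    m5 = strassen(_add(a11, a12), b22, cutoff)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The cutoff exists because Strassen's extra additions and data movement outweigh the saved multiplication on small blocks; the best value depends on the machine and the kernel, so 128 should be read as the figure reported for the G4 rather than a general constant.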

Research from Outside Laboratories, Developers and Users

Octuple-Precision Floating-Point on PowerPC G4

by R. Crandall, Apple ACG, and J. Papadopoulos, University of Maryland College Park — May 2002

On the implementation of AKS-class primality tests

by R. Crandall, Apple ACG, and J. Papadopoulos, University of Maryland College Park — March 2003

Special Applications of 64-bit Arithmetic: Acceleration on the Apple G5

by S. Noble, Apple ACG, and J. Papadopoulos, University of Maryland College Park — May 2006