Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

Gregorio Bernabé


This paper presents several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we use CUDA and OpenCL to develop methods that map the transform efficiently onto manycore hardware. On multicore CPUs, OpenMP and Pthreads serve as counterparts to maximize parallelism, and well-known techniques such as tiling and blocking are exploited to optimize memory use. We evaluate these proposals by comparing a new Fermi-based Tesla C2050 with an Intel Core 2 Quad Q6700. The CUDA version achieves the best results, with speedups over the CPU ranging from 5.3x to 7.4x for different image sizes, and up to 81x when communications are neglected. Meanwhile, OpenCL obtains solid gains, ranging from 2x on small frame sizes to 3x on larger ones.


3D Fast Wavelet Transform; parallel programming; multicore; CUDA; OpenCL


