Publication result detail

Implementation of 3D FFTs across Multiple GPUs in Shared Memory Environments

NANDAPALAN, N.; JAROŠ, J.; TREEBY, B.; RENDELL, A.

Original Title

Implementation of 3D FFTs across Multiple GPUs in Shared Memory Environments

English Title

Implementation of 3D FFTs across Multiple GPUs in Shared Memory Environments

Type

Paper in proceedings (conference paper)

Original Abstract

In this paper, we present a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA. The 3D FFT lies at the core of many simulation methods, so its fast calculation is critical. The main bottleneck of the distributed 3D FFT is the global data exchange that must be performed. The latest version of CUDA introduces direct GPU-to-GPU transfers using a Unified Virtual Address space (UVA), which opens new possibilities for optimising the communication part of the FFT. Here, we propose several implementations of the distributed 3D FFT, investigate their behaviour, and compare their performance with the single-GPU CUFFT and CPU-based FFTW libraries.
In particular, we demonstrate the advantage of direct GPU-to-GPU transfers over data exchanges staged through host main memory. Our preliminary results show that running the distributed 3D FFT on four GPUs brings a 12% speedup over the single-node (CUFFT) baseline while also enabling the calculation of 3D FFTs on larger datasets. Replacing the global data exchange via shared host memory with direct GPU-to-GPU transfers reduces the execution time by up to 49%. This clearly shows that direct GPU-to-GPU transfers are the key factor in obtaining good performance on multi-GPU systems.
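
The direct GPU-to-GPU transfers discussed in the abstract rely on CUDA peer access within a Unified Virtual Address space. The following CUDA snippet is a minimal illustrative sketch of that idea, not the authors' code: the device IDs, buffer size, and host-staged fallback path are assumptions added for the example.

// Sketch: move one exchange buffer directly from GPU 0 to GPU 1 when peer
// access is available, otherwise stage the copy through pinned host memory.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   /* illustrative 64 MB buffer */
    float *src = NULL, *dst = NULL;

    cudaSetDevice(0);
    cudaMalloc((void**)&src, bytes);         /* buffer on GPU 0 */
    cudaSetDevice(1);
    cudaMalloc((void**)&dst, bytes);         /* buffer on GPU 1 */

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        /* Enable peer access and copy directly between the GPUs. */
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    } else {
        /* Fallback: exchange via host main memory (the slower path the
           abstract compares against). UVA lets cudaMemcpyDefault infer
           the direction of each copy. */
        float *host = NULL;
        cudaMallocHost((void**)&host, bytes);
        cudaMemcpy(host, src, bytes, cudaMemcpyDefault);
        cudaMemcpy(dst, host, bytes, cudaMemcpyDefault);
        cudaFreeHost(host);
    }

    cudaDeviceSynchronize();
    printf("exchange finished\n");

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}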

English abstract

In this paper, we present a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA. The 3D FFT lies at the core of many simulation methods, so its fast calculation is critical. The main bottleneck of the distributed 3D FFT is the global data exchange that must be performed. The latest version of CUDA introduces direct GPU-to-GPU transfers using a Unified Virtual Address space (UVA), which opens new possibilities for optimising the communication part of the FFT. Here, we propose several implementations of the distributed 3D FFT, investigate their behaviour, and compare their performance with the single-GPU CUFFT and CPU-based FFTW libraries.
In particular, we demonstrate the advantage of direct GPU-to-GPU transfers over data exchanges staged through host main memory. Our preliminary results show that running the distributed 3D FFT on four GPUs brings a 12% speedup over the single-node (CUFFT) baseline while also enabling the calculation of 3D FFTs on larger datasets. Replacing the global data exchange via shared host memory with direct GPU-to-GPU transfers reduces the execution time by up to 49%. This clearly shows that direct GPU-to-GPU transfers are the key factor in obtaining good performance on multi-GPU systems.
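
A distributed 3D FFT of the kind described above is commonly organised as batched lower-dimensional FFTs on each GPU's local slab, a global data exchange, and a final batched transform along the remaining axis. The cuFFT plan setup below is a hypothetical sketch of one such scheme, assuming an NX x NY x NZ complex array split along X over P GPUs and a particular pencil layout after the exchange; it is not taken from the paper.

// Hypothetical per-GPU plan setup for a slab-decomposed 3D FFT
// (row-major NX x NY x NZ complex data, split along X over P GPUs).
#include <cufft.h>

void create_plans(int NX, int NY, int NZ, int P,
                  cufftHandle *plan2d, cufftHandle *plan1d)
{
    /* Stage 1: each GPU holds NX/P contiguous planes and applies a
       batched 2D (NY x NZ) FFT to every plane it owns. */
    int n2[2] = { NY, NZ };
    cufftPlanMany(plan2d, 2, n2,
                  NULL, 1, NY * NZ,    /* contiguous input planes  */
                  NULL, 1, NY * NZ,    /* contiguous output planes */
                  CUFFT_C2C, NX / P);

    /* Stage 2 happens between the two plans: the global exchange,
       either staged through host memory or done with direct
       GPU-to-GPU copies, regroups the data so that each GPU holds
       (NY*NZ)/P complete pencils along X. */

    /* Stage 3: batched strided 1D FFTs of length NX over the local
       pencils, assuming the exchanged layout is [NX][pencils] with
       x as the slow index. */
    int pencils = (NY * NZ) / P;
    int n1[1] = { NX };
    cufftPlanMany(plan1d, 1, n1,
                  n1, pencils, 1,      /* stride = #pencils, dist = 1 */
                  n1, pencils, 1,
                  CUFFT_C2C, pencils);
}

Executing the transform would then interleave cufftExecC2C calls on these plans with the exchange step, which is where the choice between host-staged and direct GPU-to-GPU transfers matters.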

Keywords

GPU; UVA; unified-virtual-address; multi-GPU; FFT; distributed; shared-memory;

Key words in English

GPU; UVA; unified-virtual-address; multi-GPU; FFT; distributed; shared-memory;

Authors

NANDAPALAN, N.; JAROŠ, J.; TREEBY, B.; RENDELL, A.

Released

14.12.2012

Location

Beijing

ISBN

978-0-7695-4879-1

Book

Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings

Pages from

167

Pages to

172

Pages count

6

URL

http://ieeexplore.ieee.org/document/6589258/

BibTeX

@inproceedings{BUT193918,
  author="NANDAPALAN, N. and JAROŠ, J. and TREEBY, B. and RENDELL, A.",
  title="Implementation of 3D FFTs across Multiple GPUs in Shared Memory Environments",
  booktitle="Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings",
  year="2012",
  pages="167--172",
  address="Beijing",
  doi="10.1109/PDCAT.2012.79",
  isbn="978-0-7695-4879-1",
  url="http://ieeexplore.ieee.org/document/6589258/"
}
