I also visited the NVIDIA booth and spoke with a CUDA expert, who confirmed what I had been suspecting after thinking it through a little more: the GPGPU may not be the right fit for my FFT-heavy program. The problem is that the program does a lot of small FFTs (a bunch of 15x15 2D FFTs, definitely not the 128x128 or 256x256 FFTs that the NVIDIA CUDA expert says would be necessary to start seeing a benefit from offloading to the GPU). There is latency in transferring the data to and from the GPU, so you want a larger FFT so that you don't end up spending more time waiting for the data to move across the PCI bus than you save by offloading the FFT. The good news is that their CUDA-based FFT library is very similar to the FFTW library I am using, so it would require very few coding changes. He said it may be possible to batch multiple small FFTs together, but I don't know whether that will fit in well with this code. Perhaps I can get someone to donate a Tesla-equipped workstation to do some testing on.
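For reference, the batching he described would go through cuFFT's `cufftPlanMany` interface, which plans many identical small transforms at once so the PCI bus transfer and launch overhead is amortized over the whole batch instead of paid per 15x15 FFT. Here is a minimal sketch of what that might look like; the function name and layout are my assumptions, not tested code:

```
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical sketch: run `batch` in-place 15x15 complex 2D FFTs with a
// single cuFFT plan, one host-to-device and one device-to-host copy.
int run_batched_ffts(cufftComplex *host_data, int batch)
{
    int n[2] = {15, 15};                 /* size of each 2D transform  */
    size_t bytes = sizeof(cufftComplex) * 15 * 15 * batch;

    cufftComplex *dev_data;
    cudaMalloc((void **)&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    /* NULL embed arguments mean the transforms are packed contiguously,
       one 15*15 block after another, with distance 15*15 between them. */
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, 15 * 15,      /* input layout  */
                  NULL, 1, 15 * 15,      /* output layout */
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD);

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(dev_data);
    return 0;
}
```

Whether this helps still depends on how many transforms the program can gather into one batch before it needs the results back.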
I also spent some time in a breakout room at the Cluster Resources booth with Chris Samuel of VPAC and Scott Jackson of CRI discussing TORQUE. We agreed that getting job arrays finished and solid is a high-priority task for the coming year. Hopefully by SC in Portland, job arrays will be in wide use among TORQUE users.