src/fft_kernelstring.cpp · 48a3c01992950756934d2a148c7952a20ce9f09e · einsteinathome / libclfft

Bug #1608: clFFT use of native_sin, native_cos can cause validation problems · 48a3c019

Heinz-Bernd Eggenstein authored 12 years ago

Still experimental: replace calls to native_sin in clFFT
This change explores the performance impact of using a set of LUTs (lookup tables), precomputed
on the CPU, to evaluate sin(x_i) and cos(x_i) on a grid x_i = +/- 2*pi*i/N, with N fixed.
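
A minimal sketch of how such tables could be precomputed on the host, assuming single-precision
storage and evaluation in double precision before rounding. The names TwiddleLut and
make_twiddle_lut are hypothetical and are not taken from the libclfft sources:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container (not from libclfft): separate sin and cos tables.
struct TwiddleLut {
    std::vector<float> sin_lut;
    std::vector<float> cos_lut;
};

// Tabulates sin(x_i) and cos(x_i) on the grid x_i = sign * 2*pi*i/N,
// i = 0..N-1. Values are computed in double and rounded to float once.
TwiddleLut make_twiddle_lut(std::size_t n, int sign) {
    assert(n > 0);
    assert(sign == 1 || sign == -1);
    const double two_pi = 6.283185307179586;
    TwiddleLut lut;
    lut.sin_lut.resize(n);
    lut.cos_lut.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x =
            sign * two_pi * static_cast<double>(i) / static_cast<double>(n);
        lut.sin_lut[i] = static_cast<float>(std::sin(x));
        lut.cos_lut[i] = static_cast<float>(std::cos(x));
    }
    return lut;
}
```

A kernel would then read lut.sin_lut[i] and lut.cos_lut[i] by grid index instead of calling
native_sin and native_cos; how the tables are passed to the device is left out here.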

On a 6770M, this code is still ca. 3% slower than the original native_sin/native_cos variant
for a BRP4-like transform.

This variant should have very high accuracy; versions with lower accuracy but
higher performance should be explored next. Eventually the method should be selectable
by a parameter to the plan creator, as suggested by Bernd.

TODO: - remove some diagnostic code
      - optimize the total size of the LUTs, perhaps by using
        cos(x) = sin(x+pi/2), so there is no need to keep separate LUTs for sin and cos,
        just one slightly longer LUT with an additional alias pointer
      - try caching the LUTs in shared memory (using constant memory didn't help)
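
The single-table idea from the TODO list could look roughly like this. It is a sketch only,
assuming N is divisible by 4 and covering just the positive grid x_i = 2*pi*i/N; the class
SharedSinCosLut is hypothetical and does not exist in libclfft:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one sin table of length N + N/4 serves both functions.
// Since cos(x) = sin(x + pi/2) and pi/2 is N/4 grid steps, the cos "table"
// is just a pointer into the sin table, offset by N/4 entries.
class SharedSinCosLut {
public:
    explicit SharedSinCosLut(std::size_t n) : n_(n), table_(n + n / 4) {
        assert(n > 0 && n % 4 == 0);  // assumption: quarter period is a whole step
        const double two_pi = 6.283185307179586;
        for (std::size_t i = 0; i < table_.size(); ++i) {
            table_[i] = static_cast<float>(
                std::sin(two_pi * static_cast<double>(i) / static_cast<double>(n)));
        }
    }

    // sin_lut()[i] == sin(2*pi*i/N) for i = 0..N-1
    const float* sin_lut() const { return table_.data(); }
    // Alias pointer: cos_lut()[i] == sin(2*pi*(i + N/4)/N) == cos(2*pi*i/N)
    const float* cos_lut() const { return table_.data() + n_ / 4; }

    std::size_t size() const { return n_; }                 // logical grid size N
    std::size_t storage() const { return table_.size(); }   // floats actually stored

private:
    std::size_t n_;
    std::vector<float> table_;
};
```

Compared with two separate tables of N entries each, this stores N + N/4 values, at the cost
of requiring N to be a multiple of 4.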
