The clFFT library is an OpenCL library implementation of discrete Fast Fourier Transforms. It:
clFFT library pre-compiled packages for recent versions of Microsoft Windows operating systems and several flavors of Linux are available.
The downloadable binary packages are freely available at https://github.com/clMathLibraries/clFFT/releases
Once the appropriate package for the respective OS has finished downloading, uncompress the package using the native tools available on the platform in a directory of the user's choice. Everything needed to build a program using clFFT is included in the directory tree, including documentation, header files, binary library components, and sample programs for programming illustration.
After the clFFT package is uncompressed on the user's hard drive, a samples directory exists with source code, but no Visual Studio project files, Unix makefiles, or other native build system exist. Instead, it contains a CMakeLists.txt
file. clFFT uses CMake as its build system, and other build files, such as Visual Studio projects, nmake makefiles, or Unix makefiles, are generated by the CMake build system, during configuration. CMake is freely available for download from: http://www.cmake.org/
CMake is written to pull compiler information from environment variables, and to look in default install directories for tools. Once installed, a popular interface to control the process of creating native build files is CMake-gui. When the GUI is launched, two text boxes appear at the top of the dialog: a path to source and a separate path to generate binaries. For the browse
source... box, find the path to where you unzipped clFFT, and select the root samples
directory that contains the CMakeLists.txt; for clFFT, this should be clFFT/samples
. For browse
build
..., select an appropriate directory where the build environment generates build files; a convenient location is a sibling directory to the source. This makes it easy to wipe all the binaries and start a fresh build. For instance, for a debug configuration of NMake, an example directory could be clFFT/bin/NMakeDebug
. This is where the generated makefile, native build files, and intermediate object files are built. These generated files are kept separate from the source; this is referred to as 'out-of-source' builds, and is very similar in concept to what 'autotools' does for Linux. To build using NMake, simply type NMake in the build directory containing the makefile. To build using Visual Studio, generate the solution and project files into a directory such as clFFT/bin/vs10
, find the generated .sln file, and open the solution.
The first time the configure
button near the bottom of the screen is clicked, it causes CMake to prompt for what type of native build files to make. Various properties appear in red in the properties
box. Red indicates that the value has changed since last time configure
was clicked. (The first time configure is clicked, everything is red.) CMake tries to configure itself automatically to the client's system by looking at a systems environment variables and by searching through default install locations for project dependencies. Take a moment to verify the settings and paths that are displayed on the configuration screen; if any changes must be made, you can provide correct paths or adjust settings by typing directly into the CMake configuration screen. Click the configure
button a second time to 'bake' those settings and serialize them to disk.
Options relevant to the clFFT project include:
'AMDAPPSDKROOT'
: Location of the Stream SDK installation. This value is already populated if CMake could determine the location by looking at the environment variables. If not, the user must provide a path to the root installation of the Stream SDK here.'BOOST_ROOT'
: Location of the Boost SDK installation. This value is already populated if CMake could determine the location by looking at the environment variables or default install locations. If not, the user must provide a path to the root installation of the Stream SDK here. This dependency is only relevant to the sample client; the FFT library does not depend on Boost.'CMAKE_BUILD_TYPE'
: Defines the build type (default is debug). For Visual Studio projects, this does not appear (modifiable in IDE); for makefile-based builds, this is set in CMake.'CMAKE_INSTALL_PREFIX'
: The path to install all binaries and headers generated from the build. This is used when the user types make
install
or builds the INSTALL project in Visual Studio. All generated binaries and headers are copied into the path prefixed with CMAKE_INSTALL_PREFIX
. The Visual Studio projects are self explanatory, but a few other projects are autogenerated; these might be unfamiliar.The Visual Studio projects are self explanatory, but a few other projects are autogenerated; these might be unfamiliar.
'ALL_BUILD'
: A project that is empty of files, but since it depends on all user projects, it provides a convenient way to rebuild everything.'ZERO_CHECK'
: A CMake-specific project that checks to see if the generated solution and project files are in sync with the CMakeLists.txt
file. If these files are modified, the solutions and projects are now out-of-sync, and this project prompts the user to regenerate their environment.clFFT includes one sample project that has source dependencies on Boost: the sample client project. Boost is freely available from: http://www.boost.org/.
The command-line clFFT sample client links with the program_options
library, which provides functionality for parsing command-line parameters and .ini files in a cross-platform manner. Once Boost is downloaded and extracted on the hard drive, the
program_options
library must be compiled. The Boost build system uses the BJam builder (a project for a CMake-based Boost build is available for separate download). This is available for download from the Boost website, or the user can build BJam; Boost includes the source to BJam in its distribution, and the user can execute bootstrap.bat
(located in the root boost directory) to build it.
After BJam is either built or installed, an example BJam command-line is given below for building a 64-bit program_options
binary, for both static and dynamic linking:
The last step to make boost readily available and usable by CMake and the native compiler is to add an environment variable to the system called BOOST_ROOT
. In Windows, right click on the computer icon and go to
Remember to relaunch any new processes that are open, in order to inherit the new environment variable. On Linux, consider modifying the .bash_rc file (or shell equivalent) to export a new environment variable every time you log in.
If you are on a Linux system and have used a package manager to install Boost, you may have to confirm where the Boost include
and library
files have been placed. For example, after installing Boost with the Ubuntu Synaptic Package Manager, the Boost include
files are in /usr/include/boost
, and the library files either /usr/lib
or /usr/lib64
. The CMakeLists.txt
file in this project defaults the BOOST_ROOT
value to /usr
on Linux; so, if the system is set up similarly, no further action is necessary. If the system is set up differently, you may have to set the BOOST_ROOT
environmental variable accordingly.
libboost_module_name.so.x.xx.x
file (where x.xx.x is the version of Boost), the user may need to manually create a soft link called libboost_module_name.so
to the versioned libboost_module_name.so.x.xx.x
. See the clFFT binary artifacts in the install directory for an example.The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the FFT definition to reduce the mathematical intensity required from O( \(N^2\)) to O( \( N \log N\)) when the sequence length, N
, is the product of small prime factors. Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems: no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that are optimized for AMD graphics processors, and that also functional across CPU and other compute devices.
clFFT supports powers of 2, 3 and 5 sizes. This means that the vector lengths that can be configured through a plan can be any length that is a power of two, three, and five; examples include \(2^7, 2^1*3^1, 3^2*5^4, 2^2*3^3*5^5\), up to the limit that the device can support.
Currently, there is an upper bound on the transform size the library supports. This limit is \(2^{24}\) for single precision and \(2^{22}\) for double precision. This means that the product of transform lengths must not exceed these values. As an example, a 1D single-precision FFT of size 1024 is valid since 1024 \(<= 2^{24}\). Similarly, a 2D double-precision FFT of size 1024x1024 is also valid, since 1024*1024 \(<= 2^{22}\). But, a 2D single-precision FFT of size 4096x8192 is not valid because 4096*8192 > 224.
clFFT currently supports FFTs of up to three dimensions, given by the enum clFFT-Dim
. This enum is a required parameter into clfftCreateDefaultPlan()
to create an initial plan; there is no default for this parameter. Depending on the dimensionality that the client requests, clFFT uses the formulations shown below to compute the DFT.
The definition of a 1D complex DFT used by clFFT is given by:
\[ {\tilde{x}}_j = {{1}\over{scale}}\sum_{k=0}^{n-1}x_k\exp\left({\pm i}{{2\pi jk}\over{n}}\right)\hbox{ for } j=0,1,\ldots,n-1 \]
where \(x_k\) are the complex data to be transformed, \(\tilde{x}_j\) are the transformed data, and the sign of \(\pm\) determines the direction of the transform: \(-\) for forward and \(+\) for backward. Note that the user must provided the scaling factor. Typically, the scale is set to 1 for forward transforms, and \({{1}\over{N}}\) for backwards transforms.
The definition of a complex 2D DFT used by clFFT is given by:
\[ {\tilde{x}}_{jk} = {{1}\over{scale}}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rq}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right) \]
for \(j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\), where \(x_{rq}\) are the complex data to be transformed, \(\tilde{x}_{jk}\) are the transformed data, and the sign of \(\pm\) determines the direction of the transform. Typically, the scale is set to 1 for forwards transforms and \({{1}\over{M \cdot N}}\) for backwards transforms.
The definition of a complex 3D DFT used by clFFT is given by:
\[ \tilde{x}_{jkl} = {{1}\over{scale}}\sum_{s=0}^{p-1}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1} x_{rqs}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)\exp\left({\pm i}{{2\pi ls}\over{p}}\right) \]
for \(j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1\), where \(x_{rqs}\) are the complex data to be transformed, \(\tilde{x}_{jkl}\) are the transformed data, and the sign of \(\pm\) determines the direction of the transform. Typically, the scale is set to 1 for forwards transforms and \({{1}\over{M \cdot N \cdot P}}\) for backwards transforms.
clFFT is initialized by a call to clfftSetup()
, which must be called before any other API exported from clFFT. This allows the library to create resources used to manage the plans that are created and destroyed by the user. This API also takes a structure clfftInitSetupData
that is initialized by the client to control the behavior of the library. The corresponding clfftTeardown()
method must be called by the client when it is done using the library. This instructs clFFT to release all resources, including any acquired references to any OpenCL objects that may have been allocated or passed to it through the API.
The clFFT API is designed to be thread-safe. It is safe to create plans from multiple threads, and to destroy those plans in separate threads. Multiple threads can call clfftEnqueueTransform()
to place work into a command queue at the same time. clFFT does not provide a single-threaded version of the library. It is expected that the overhead of the synchronization mechanisms inside of clFFT thread safe is minor.
Currently, multi-device operation must be managed by the user. OpenCL contexts can be created that are associated with multiple devices, but clFFT only uses a single device from that context to transform the data. Multi-device operation can be managed by the user by creating multiple contexts, where each context contains a different device, and the user is responsible for scheduling and partitioning the work across multiple devices and contexts.
clFFT expects all multi-dimensional input passed to it to be in row-major format. This is compatible with C-based languages. However, clFFT is very flexible in the input and output data organization it accepts by allowing the user to specify a stride for each dimension. This feature can be used to process data in column major arrays, and other non-contiguous data formats. See clfftSetPlanInStride and clfftSetPlanOutStride.
OpenCL objects, such as contexts, cl_mem
buffers, and command queues, are the responsibility of the user application to allocate and manage. All of the clFFT interfaces that must interact with OpenCL objects take those objects as references through the API. Specifically, the plan creation function clfftCreateDefaultPlan() takes an OpenCL context as a parameter reference, increments the reference count on that object, and keeps the object alive until the corresponding plan has been destroyed through a call to clfftDestroyPlan().
The clFFT API operates asynchronously, and with the exception of thread safety locking with multiple threads, all APIs return immediately. Specifically, the clfftEnqueueTransform() API does not explicitly flush the command queues that are passed by reference to it; it pushes the transform work onto the command queues and returns the modified queues to the client. The client is free to issue its own blocking logic, using OpenCL synchronization mechanisms, or push further work onto the queue to continue processing.
A plan is the collection of (almost) all of the parameters needed to specify an FFT computation. This includes:
The plan does not include:
These are specified when the plan is executed.
When a new plan is created by calling clfftCreateDefaultPlan, its parameters are initialized as follows:
CLFFT_SINGLE
. CLFFT_INPLACE
. CLFFT_COMPLEX_INTERLEAVED
. CLFFT_COMPLEX_INTERLEAVED
. Writing client programs that depend on these initial values is not recommended.
There are two main families of Discrete Fourier Transform (DFT):
RRRRR
IIIII
RIRIRIRIRIRI
For one-dimensional data, if clStrides[0] = strideX = 1, successive elements in the first dimension are stored contiguously in memory. If strideX is an integral value greater than 1, gaps in memory exist between each element of the vectors.
For multi-dimensional data, if clStrides[1] = strideY = LenX for 2 dimensional data and clStrides[2] = strideZ = LenX*LenY for 3 dimensional data, no gaps exist in memory between each element, and all vectors are stored tightly packed in memory. Here, LenX, LenY, and LenZ denote the transform lengths clLengths[0], clLengths[1], and clLengths[2], respectively, which are used to set up the plan.
By specifying non-default strides, it is possible to process either row-major or column-major arrays. Data can be extracted from arrays of structures. Almost any regular data storage pattern can be accommodated.
Distance is the amount of memory that exists between corresponding elements in an FFT primitive in a batch. Distance is measured in the units of the FFT primitive; complex data measures in complex units, and real data measures in real data. Stride between tightly packed elements is 1 in either case. Typically, one can measure the distance between any two elements in a batch primitive, be it 1D, 2D, or 3D data. For tightly packed data, the distance between FFT primitives is the size of the FFT primitive, such that dist=LenX for 1D data, dist=LenX*LenY for 2D data, and dist=LenX*LenY*LenZ for 3D data. It is possible to set the distance of a plan to be less than the size of the FFT vector; most often 1 for this case. When computing a batch of 1D FFT vectors, if distance == 1, and strideX == length( vector ), a transposed output is produced for a batch of 1D vectors. It is left to the user to verify that the distance and strides are valid (not intersecting); if not valid, undefined results can occur.
A simple example is to perform a 1D length 4096 on each row of an array of 1024 rows x 4096 columns of values stored in a column-major array, such as a FORTRAN program might provide. (This would be equivalent to a C or C++ program that had an array of 4096 rows x 1024 columns stored in a row-major manner, and you wanted to perform a 1-D length 4096 transform on each column.) In this case, specify the strides [1024, 1].
For a more complex example, an input buffer contained a raster grid of 1024 x 1024 monochrome pixel values, and you want to compute a 2D FFT for each 64 x 64 subtile of the grid. Specifying strides allows you to treat each horizontal band of 1024 x 64 pixels as an array of 16 64 x 64 matrixes, and process an entire band with a single call to clfftEnqueueTransform. (Specifying strides is not quite flexible enough to transform the entire grid of this example with a single kernel execution.) It is possible to create a Plan to compute arrays of 64 x 64 2D FFTs, then specify three strides: [1, 1024, 64]. The first stride, 1, indicates that the rows of each matrix are stored consecutively; the second stride, 1024, gives the distance between rows, and the third stride, 64, defines the distance from matrix to matrix. Then call clfftEnqueueTransform 16 times: once for each horizontal band of pixels.
Both CLFFT_SINGLE
and CLFFT_DOUBLE
precisions are supported by the library for all supported radices. With both of these enums the host computer's math functions are used to produce tables of sines and cosines for use by the OpenCL kernel.
Both CLFFT_SINGLE_FAST
and CLFFT_DOUBLE_FAST
are meant to generate faster kernels with reduced accuracy, but are disabled in the current build..
See clfftPrecision, clfftSetPlanPrecision, and clfftGetPlanPrecision.
The direction of the transform is not baked into the plan; the same plan can be used to specify both forward and backward transforms. Instead, clfftDirection is passed as a parameter into clfftEnqueueTransform.
The clFFT API supports both in-place and out-of-place transforms. With inplace transforms, only input buffers are provided to the clfftEnqueueTransform() API, and the resulting data is written in the same buffers, overwriting the input data. With out-of-place transforms, distinct output buffers are provided to the clfftEnqueueTransform() API, and the inputdata is preserved. In-place transforms require that the cl_mem
objects the client creates have both read
and write
permissions. This is given in the nature of the in-place algorithm. Out-of-place transforms require that the destination buffers have read
and write
permissions, but input buffers can still be created with read-only permissions. This is a clFFT requirement because internally the algorithms may go back and forth between the destination buffers and internally allocated temp buffers. For out-of-place transforms, clFFT never writes back to the input buffers.
The efficiency of clFFT is improved by utilizing transforms in batches. Sending as much data as possible in a single transform call leverages the parallel compute capabilities of OpenCL devices (and GPU devices in particular), and minimizes the penalty of transfer overhead. It's best to think of an OpenCL device as a high-throughput, high-latency device. Using a networking analogy as an example, it's similar to having a massively high-bandwidth pipe with very high ping response times. If the client is ready to send data to the device for compute, it should be sent in as few API calls as possible. This can be done by batching. clFFT plans have a parameter to describe the number of transforms being batched: clfftSetPlanBatchSize(), and to describe how those batches are laid out and spaced in memory: clfftSetPlanDistance(). 1D, 2D, or 3D transforms can be batched.
To perform FFT calculations using clFFT, the client program must:
For each distinct type of FFT needed:
clAmdFFtSet_____
. Optionally, "bake" or finalize the plan, calling clfftBakePlan. This signals to the library the end of the specification phase, and causes it to generate and compile the exact OpenCL kernels needed to perform the specified FFT on the OpenCL device provided.
At this point, all performance-enhancing optimizations are applied, possibly including executing benchmark kernels on the OpenCL device context in order to maximize runtime performance.
Although this step is optional, most users probably want to include it so that they can control when this work is done. Usually, this time consuming step is done when the application is initialized. If the user does not call clfftBakePlan, this work is done during the first call to clfftEnqueueTransform.
Call clfftEnqueueTransform. At this point, specify whether you want to execute a forward or reverse transform; also, provide the OpenCL cl_mem
handles for the input buffer(s), output buffer(s)–unless you want the transformed data to overwrite the input buffers, and (optionally) scratch buffer.
clfftEnqueueTransform performs one or more calls to the OpenCL function clEnqueueNDRangeKernel. Like clEnqueueNDRangeKernel, clfftEnqueueTransform is a non-blocking call. The commands to execute the FFT compute kernel(s) are added to the OpenCL context queue to be executed asynchronously. An OpenCL event handle is returned to the caller. If multiple NDRangeKernel operations are queued, the final event handle is returned.
When real data is subject to DFT transformation, the resulting complex output follows a special property. About half of the output is redundant because they are complex conjugates of the other half. This is called the Hermitian redundancy. So, for space and performance considerations, it is only necessary to store the non-redundant part of the data. Most FFT libraries use this property to offer specific storage layouts for FFTs involving real data. clFFT provides 3 enumerated types to deal with real data FFTs:
CLFFT_REAL
CLFFT_HERMITIAN_INTERLEAVED
CLFFT_HERMITIAN_PLANAR
The first enum specifies that the data is purely real. This can be used to feed real input or get back real output. The second and third enums specify layouts for storing FFT output. They are similar to the corresponding full complex enums in the way they store real and imaginary components. The difference is that they store only about half of the complex output. Client applications can do just a forward transform and analyze the output. Or they can do some processing of the output and do a backward transform to get back real data. This is illustrated in the following figure.
Let us consider a 1D real FFT of length N. The full output looks as shown in following figure.
Here, C* denotes the complex conjugate of. Since the values at indices greater than N/2 can be deduced from the first half of the array, clFFT stores data only up to the index N/2. This means that the output contains only 1 + N/2 complex elements, where the division N/2 is rounded down. Examples for even and odd lengths are given below.
Example for N = 8 is shown in following figure.
Example for N = 7 is shown in following figure.
For length 8, only (1 + 8/2) = 5 of the output complex numbers are stored, with the index ranging from 0 through 4. Similarly for length 7, only (1 + 7/2) = 4 of the output complex numbers are stored, with the index ranging from 0 through 3.
For 2D and 3D FFTs, the FFT length along the least dimension is used to compute the (1 + N/2) value. This is because the FFT along the least dimension is what is computed first and is logically a real-to-hermitian transform. The FFTs along other dimensions are computed afterwards; they are simply 'complex-tocomplex' transforms. For example, assuming clLengths[2] is used to set up a 2D real FFT, let N1 = clLengths[1], and N0 = clLengths[0]. The output FFT has N1*(1 + N0/2) complex elements. Similarly, for a 3D FFT with clLengths[3] and N2 = clLengths[2], N1 = clLengths[1], and N0 = clLengths[0], the output has N2*N1*(1 + N0/2) complex elements.
Out-of-place transforms:
CLFFT_REAL
to CLFFT_HERMITIAN_INTERLEAVED
CLFFT_REAL
to CLFFT_HERMITIAN_PLANAR
CLFFT_HERMITIAN_INTERLEAVED
to CLFFT_REAL
CLFFT_
CLFFT_HERMITIAN_PLANAR to CLFFT_REAL
In-place transforms:
CLFFT_REAL
to CLFFT_HERMITIAN_INTERLEAVED
CLFFT_HERMITIAN_INTERLEAVED
to CLFFT_REAL
The library currently requires the user to explicitly set input and output strides for real transforms. See the following examples to understand what values to use for input and output strides under different scenarios. The examples only show typical usages. The user has flexibility in allocating their buffers and laying out data according to their needs.
The following pages provide figures and examples to explain in detail the real FFT features of this library.