Re: [Projekt] StoneQuest lebt noch!
Verfasst: 21.01.2017, 11:49
Man führt eine horizontale und eine vertikale FFT durch. Die Horizontale arbeitet auf 64 (oder was eben die Wavefront-Größe ist) Zeilen parallel; die vertikale auf entsprechend vielen Spalten.
Problematisch ist vor allem das Anordnen der Reads und Writes – ich hatte da Unterschiede von 50:1(!) je nach Speicher-Layout der Daten. Da gibt es aber heute bestimmt Papers zur optimalen Anordnung. Aber schneller als die CPU ist es in jedem Fall; ich habe anno 2010 auf einer 11.0-Karte 4096×4096×7×30 (Texturgröße × Farbkanäle × fps) Pixel pro Sekunde gefaltet. Die entsprechende CPU-Implementierung möchte ich mal sehen.
Erfahrungen von mir:
Hier ist mein alter Quelltext, falls er jemanden interessiert. Cooley-Tukey-FFT mit radix-4-Butterfly, wenn ich das recht sehe. Eine Gruppe pro Zeile, Synchronisierung nach jedem Radix. Umsortieren der Befehle hat stark unterschiedlichen Code produziert, darum so obskure Folgen wie Radix-4-4-2-4-4-4 usw. Diese spezielle Datei behandelt Bilder von 2048² Pixeln. Ich wette, dass man das heute mindestens doppelt so schnell implementieren kann.
Problematisch ist vor allem das Anordnen der Reads und Writes – ich hatte da Unterschiede von 50:1(!) je nach Speicher-Layout der Daten. Da gibt es aber heute bestimmt Papers zur optimalen Anordnung. Aber schneller als die CPU ist es in jedem Fall; ich habe anno 2010 auf einer 11.0-Karte 4096×4096×7×30 (Texturgröße × Farbkanäle × fps) Pixel pro Sekunde gefaltet. Die entsprechende CPU-Implementierung möchte ich mal sehen.
Erfahrungen von mir:
- die FFT ist periodisch; damit ein heller Punkt am linken Bildrand keinen Blur am rechten Bildrand auslöst, muss man die Auflösung verdoppeln. Mathe-Freaks finden da vielleicht eine bessere Lösung.
- Der Shared Memory limitiert arg; reicht in D3D gerade mal für eine 4096×4096-DCT aus. (Zusammen mit 1. bedeutet das, dass die Bildauflösung auf 2048 Pixel Seitenlänge limitiert ist; und tatsächlich hört meine Sternendemo bei größeren Auflösungen auf, zu funktionieren.)
- Indem man eine FFT auf reellen Zahlen durchführt, kann man fast die Hälfte des Speichers sparen. Dann kriegen aber alle Texturen krumme Auflösungen (2049 Pixel!) und ich hatte keine Zeit, das weiter zu erforschen.
Hier ist mein alter Quelltext, falls er jemanden interessiert. Cooley-Tukey-FFT mit radix-4-Butterfly, wenn ich das recht sehe. Eine Gruppe pro Zeile, Synchronisierung nach jedem Radix. Umsortieren der Befehle hat stark unterschiedlichen Code produziert, darum so obskure Folgen wie Radix-4-4-2-4-4-4 usw. Diese spezielle Datei behandelt Bilder von 2048² Pixeln. Ich wette, dass man das heute mindestens doppelt so schnell implementieren kann.
Code: Alles auswählen
//==============================================================================================================================
// convolution 2048.hlsl
// public domain by Krishty, 2008–2011
//==============================================================================================================================
//
// !ARCHITECTURE! marks architecture-dependent optimization opportunities. They should be enclosed in #ifdef #elif #else #endif
// blocks, so the code base remains the same for all architectures. Check for the following #defines:
// • AMD_RADEON_HD AMD Radeon HD 5xxx or higher architecture (4xxx doesn't offer compute shaders)
//
// Some notes on the many discrete Fourier transformations being done here:
// • All DFTs are realized as fast Fourier transformations with different radixes. Higher radixes require less arithmetic
// instructions and less group synchronizations, but they also require more registers and read/write scattering and they
// are more complex. Lower-radix FFTs perform better on current GPUs (2011) for the following reasons:
// — Compilers fail on large shaders. This applies currently (early 2011) to Microsofts HLSL shader compiler as well as
// to AMD's shader bytecode compiler. Both produce extremely poor code for shaders with hundreds of instructions:
// 20 % of all instructions are useless; ALU utilization is below 50 %; unneeded scratch registers both in local
// and global memory are a huge problem.
// — Read/write scattering is awfully slow and memory bandwidth is a bottleneck. Currently, it is unknown whether this
// is by concept or a driver bug.
// — Simple shaders in many threads perform faster than complex shaders in not-so-many threads, even if the total
// amount of work is bigger (probably because the scheduler can hide latencies better).
// — Although the number of available GPU registers is quite high (AMD provides at least 32 and at most 512 registers,
// depending on the number of threads per group), using more than a few registers (6 on AMD hardware) reduces
// performance drastically.
// For the future, however, we expect higher-radix FFTs to perform better:
// — Read/write scattering performance is constantly improving (GPGPU).
// — GPU performance on complex shaders is constantly improving (more registers, better compilers due to GPGPU).
// • We use the Stockham auto-sort algorithm. This is an out-of-place algorithm which reads from one source, performs an FFT
// with an arbitrary radix and writes the result to another array. It has been proposed in "High Performance Discrete
// Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John
// Manferdelli, Microsoft Corporation). Basic functionality is copied from the paper.
// • Most papers propose to invert a FFT by negating the complex exponent. This would, however, break all higher-radix
// butterflies (because their implementation depends on the twiddle factors being used and therefore on the sign of the
// complex exponent). It is — far — easier to swap the real and imaginary components of the operands before and after
// applying an ordinary FFT. This is fully compatible to all FFT butterflies, requires little code change and has nearly no
// cost when implemented as register swizzling. The final division by the number of samples remains. Source:
// "Algorithms for Programmers" (Jörg Arndt), chapter 1.7: "Inverse FFT for free".
// • The group-shared memory in Direct3D 11 is limited to 32 KiB; this is
// — 8192 'float' values
// — 4096 complex ('float2') values
// • For input and output of the FFT algorithm, there are two possibilies:
// — Using two group-shared arrays ("bank 0" and "bank 1") in a ping-pong pattern. Because different FFTs access different
// elements, a group thread synchronization is necessary before the banks are swapped. This doubles the space
// requirement of the algorithm — FFTs are limited to 2048 samples.
// — Using only one group-shared array and local registers as a cache. This consumes less group-shared memory (now FFTs on
// 4096 samples are possible) but requires twice as many group synchronizations.
// Although the first possibility has proven to be slightly faster (5 %) on AMD GPUs, it also takes thrice as much time to
// compile and performs 13.8 % worse on Nvidia GPUs.
// • The red, green and blue components (where needed) are treated sequentially. This greatly simplifies the shaders, reduces
// read/write scattering and allows high parallelization.
// For horizontal FFTs, the input color channels should be stacked vertically and written out horizontally (this behaviour
// is swapped on inverse FFTs). Rationale: the Fourier transformation of a zero signal is zero, again — therefore, all FFTs
// on padding can be omitted. As a fortunate coincidence, out-of-bounds reads from DXGI resources return 0, too — the
// padding can be omitted completely and the texture can be cropped down to the input's actual size. This saves a lot of
// texture space and bandwidth (73 % on vertical passes with full HD — 1080 pixels are read and written instead of 4096)
// and the need to clear the texture with zeroes (which limits the bandwidth, too).
//
//==============================================================================================================================
//==============================================================================================================================
// GENERIC FUNCTIONS
// This code handles mostly FFTs and is the same for all convolution implementations. It can be re-used by simply adjusting the
// "IMAGE_SIDELENGTH" constant.
//==============================================================================================================================
// #define AMD_RADEON_HD
#define IMAGE_SIDELENGTH 2048 // group size attributes require literal constants
static const uint imageSidelength = IMAGE_SIDELENGTH;
// !ARCHITECTURE! Scratch memory for the current DFT. Never read or write it directly — use "loadGroupShared()" and
// "writeGroupShared" to read and write, respectively. This allows quick adjustion of the group-shared memory layout for
// different architectures, e.g. to avoid bank conflicts.
#if defined(AMD_RADEON_HD)
// • on AMD Radeon HD architecture, storing the values as 'float2's has proven to be 5.8 % faster than seperating real and
// imaginary components
// • on AMD Radeon HD architecture, the ping-pong pattern is >5 % faster than using a single group-shared array (but it
// requires more than twice the time to compile) — this is probably due to the compiler having problems transferring
// instructions over group thread barriers (a ping-pong pattern halves the number of required synchronizations)
groupshared float2 groupSharedValues[2][imageSidelength];
static uint currentGSMOutBanksIndex = 1; // first writing to bank 1 saves two cycles; I don't know why
float2 loadGroupShared(
const uint index
) {
return groupSharedValues[currentGSMOutBanksIndex ^ 1][index];
}
//..........................................................................................................................
void writeGroupShared(
const uint index,
const float2 value
) {
groupSharedValues[currentGSMOutBanksIndex][index] = value;
return;
}
//..........................................................................................................................
void switchGSMBank() {
currentGSMOutBanksIndex ^= 1;
return;
}
#else // not AMD Radeon HD:
// • on Nvidia GeForce GT architecture, a single group-shared array is 13.8 % faster than the ping-pong pattern — this is
// probably due to a group-shared memory size limitation there; halving the GSM consumption doubles the effective
// wavefront size
groupshared float2 groupSharedValues[imageSidelength];
float2 loadGroupShared(
const uint index
) {
return groupSharedValues[index];
}
//..........................................................................................................................
void writeGroupShared(
const uint index,
const float2 value
) {
groupSharedValues[index] = value;
return;
}
#endif // default hardware
// For better readability of function parameters.
static const bool invert = true;
static const bool dontInvert = false;
static const bool horizontally = false;
static const bool vertically = true;
// The input and output buffers. Their registers overlap because all shaders expect only one input and write to one output (with
// the exception of the convolution itself, which reads from the image's DFT as well as from the kernel's DFT).
// Stores the glare with its three color channels vertically stacked. At the beginning of the glare operator, this is the
// resolved (and possibly downsampled) scene.
RWTexture2DArray<float> glareRW : register(u0);
Texture2DArray<float> glareRO : register(t0);
// Stores the point spread function with its three color channels vertically stacked.
Texture2DArray<float> PSFRO : register(t0);
// Stores the DFT of the aperture's point spread function. The real and imaginary components are stored in the X and Y component
// and its three color channels are vertically stacked.
RWTexture2DArray<float2> kernelsDFTRW : register(u0);
Texture2DArray<float2> kernelsDFTRO : register(t1);
// Stores the DFT of the scene in the same layout as the aperture PSF DFT's texture.
RWTexture2DArray<float2> imagesDFTRW : register(u0);
Texture2DArray<float2> imagesDFTRO : register(t0);
//------------------------------------------------------------------------------------------------------------------------------
// Swaps the real and imaginary components of the given complex number if the given boolean expression evaluates TRUE.
// Allows re-use of FFT functions for iFFT by just switching a boolean expression.
//------------------------------------------------------------------------------------------------------------------------------
float2 swapIf(
const bool doOrDont,
const float2 value
) {
return doOrDont ? value.yx : value.xy;
}
//------------------------------------------------------------------------------------------------------------------------------
// Multiplies the two given complex numbers.
//------------------------------------------------------------------------------------------------------------------------------
float2 multiplyComplex(
const float2 a,
const float2 b
) {
// !ARCHITECTURE! There are several ways to express a complex multiplication:
return float2(
# if defined(AMD_RADEON_HD)
// Saves one out of 100 instructions and — sometimes — one register. Since the main problem on AMD Radeon HD
// hardware is register pressure, this performs best.
dot(float2(a.x, -a.y), float2(b.x, b.y)),
dot(float2(a.x, a.y), float2(b.y, b.x))
# else
// Saves another instruction, but does not lower register pressure. Decide after profiling.
mad(a.x, b.x, -a.y * b.y),
mad(a.x, b.y, a.y * b.x)
// The default procedure. Source: http://en.wikipedia.org/wiki/Complex_numbers#Multiplication_and_division.
// a.x * b.x - a.y * b.y,
// a.x * b.y + a.y * b.x
# endif
);
}
//------------------------------------------------------------------------------------------------------------------------------
// Computes the base position for an FFT's output.
//------------------------------------------------------------------------------------------------------------------------------
uint baseIndexFor(
const uint radixsIndex,
const uint stepSizeInSamples,
const uint radix
) {
// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
// Dotsenko, Burton Smith, and John Manferdelli; Microsoft Corporation), fig. 2.
return (radixsIndex / stepSizeInSamples) * stepSizeInSamples * radix + (radixsIndex % stepSizeInSamples);
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation on group-shared memory.
// • Radix 2 or 4.
// • Can be used for an inverse FFT.
// This function can be executed in parallel — subsequent calls with the same step size but different radix indices will not
// overlap. It reads and writes out-of-place — after the input has been written, the group-shared memory must be
// synchronized. This is automatically done before this routine returns.
//------------------------------------------------------------------------------------------------------------------------------
void FFTOnGroupSharedMemory(
const uint radix,
const bool inverse,
const uint radixsIndex,
const uint stepSizeInSamples
) {
// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
// Dotsenko, Burton Smith, and John Manferdelli, Microsoft Corporation), fig. 2.
const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix);
# if defined(AMD_RADEON_HD)
// On AMD Radeon HD hardware, a ping-pong pattern is faster.
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
// to (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(angle), sin(angle))
)
};
// Use a radix-2 butterfly to transform the samples. Source: http://en.wikipedia.org/wiki/Butterfly_diagram.
const float2 value0 = cache[0];
const float2 value1 = cache[1];
cache[0] = value0 + value1;
cache[1] = value0 - value1;
// Write the result out-of-order back.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
switchGSMBank();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
// to (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// Use a radix-4 butterfly to transform the samples. Source: unknown; found it somewhere on the internet.
const float2 value0 = cache[0];
const float2 value1 = cache[1];
const float2 value2 = cache[2];
const float2 value3 = cache[3];
cache[0] = float2(
value0.x + value1.x + value2.x + value3.x,
value0.y + value1.y + value2.y + value3.y
);
cache[1] = float2(
value0.x + value1.y - value2.x - value3.y,
value0.y - value1.x - value2.y + value3.x
);
cache[2] = float2(
value0.x - value1.x + value2.x - value3.x,
value0.y - value1.y + value2.y - value3.y
);
cache[3] = float2(
value0.x - value1.y - value2.x + value3.y,
value0.y + value1.x - value2.y - value3.x
);
// Write the result out-of-order back.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
writeGroupShared(mad(2, stepSizeInSamples, baseIndex), swapIf(inverse, cache[2]));
writeGroupShared(mad(3, stepSizeInSamples, baseIndex), swapIf(inverse, cache[3]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
switchGSMBank();
}
# else // default hardware:
// Cache all group-shared values in registers before performing the transformation.
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
// to (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(angle), sin(angle))
)
};
// Wait for all threads to complete their reading before the result is written back.
GroupMemoryBarrierWithGroupSync();
// Use a radix-2 butterfly to transform the samples. Source: http://en.wikipedia.org/wiki/Butterfly_diagram.
const float2 value0 = cache[0];
const float2 value1 = cache[1];
cache[0] = value0 + value1;
cache[1] = value0 - value1;
// Write the result out-of-order back.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
// to (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// Wait for all threads to complete their reading before the result is written back.
GroupMemoryBarrierWithGroupSync();
// Use a radix-4 butterfly to transform the samples. Source: unknown; found it somewhere on the internet.
const float2 value0 = cache[0];
const float2 value1 = cache[1];
const float2 value2 = cache[2];
const float2 value3 = cache[3];
cache[0] = float2(
value0.x + value1.x + value2.x + value3.x,
value0.y + value1.y + value2.y + value3.y
);
cache[1] = float2(
value0.x + value1.y - value2.x - value3.y,
value0.y - value1.x - value2.y + value3.x
);
cache[2] = float2(
value0.x - value1.x + value2.x - value3.x,
value0.y - value1.y + value2.y - value3.y
);
cache[3] = float2(
value0.x - value1.y - value2.x + value3.y,
value0.y + value1.x - value2.y - value3.x
);
// Write the result out-of-order back.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
writeGroupShared(mad(2, stepSizeInSamples, baseIndex), swapIf(inverse, cache[2]));
writeGroupShared(mad(3, stepSizeInSamples, baseIndex), swapIf(inverse, cache[3]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
}
# endif // default hardware
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs two parallel FFTs on group-shared memory.
// This function is necessary because some powers of two can not be expressed by a single radix — e.g. 512 must be transformed
// with radix 4-4-4-4-2. Since the shader cannot switch its thread group size while in execution, two lower-radix FFTs must be
// performed. This would, however, enforce two additional group synchronizations (although both lower-radix FFTs do not
// overlap). This function's purpose is to offer two parallel lower-radix FFTs without unnecessary synchronizations.
// • Radix 2.
// • Can be used for an inverse FFT.
//------------------------------------------------------------------------------------------------------------------------------
void TwoFFTsOnGroupSharedMemory(
const uint radix,
const bool inverse,
const uint radixsSuperiorIndex,
const uint stepSizeInSamples
) {
const uint2 radixsIndex = uint2(
mad(2, radixsSuperiorIndex, 0),
mad(2, radixsSuperiorIndex, 1)
);
const float2 angle = float2(
-6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex[0] % stepSizeInSamples),
-6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex[1] % stepSizeInSamples)
);
const uint2 baseIndex = uint2(
baseIndexFor(radixsIndex[0], stepSizeInSamples, radix),
baseIndexFor(radixsIndex[1], stepSizeInSamples, radix)
);
# if defined(AMD_RADEON_HD)
// On AMD Radeon HD hardware, a ping-pong pattern is faster.
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
const float2 cacheA[radix] = {
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[0]))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[0]))),
float2(cos(angle[0]), sin(angle[0]))
)
};
const float2 cacheB[radix] = {
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[1]))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[1]))),
float2(cos(angle[1]), sin(angle[1]))
)
};
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] + cacheA[1]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] - cacheA[1]));
writeGroupShared(mad(0, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] + cacheB[1]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] - cacheB[1]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
switchGSMBank();
}
# else // default hardware:
// Cache all group-shared values in registers before performing the transformation.
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// Load two samples into a local cache and multiply them with their twiddle factors.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
const float2 cacheA[radix] = {
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[0]))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[0]))),
float2(cos(angle[0]), sin(angle[0]))
)
};
const float2 cacheB[radix] = {
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[1]))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[1]))),
float2(cos(angle[1]), sin(angle[1]))
)
};
// Wait for all threads to complete their reading before the result is written back.
GroupMemoryBarrierWithGroupSync();
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] + cacheA[1]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] - cacheA[1]));
writeGroupShared(mad(0, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] + cacheB[1]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] - cacheB[1]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
}
# endif // default hardware
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from a complex texture to group-shared memory.
// • Radix 2 or 4.
// • Can be used for an inverse FFT.
// • The input is read as complex numbers from a channel of the given source texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTFromComplexTexture(
const uint radix,
const bool inverse,
Texture2DArray<float2> source,
const uint channelsIndex,
const bool vertical,
const uint signalsIndex,
const uint radixsIndex
) {
const uint3 basePosition = vertical
? uint3(signalsIndex, radixsIndex, channelsIndex)
: uint3(radixsIndex, signalsIndex, channelsIndex);
const uint3 delta = vertical
? uint3(0, imageSidelength / radix, 0)
: uint3(imageSidelength / radix, 0, 0);
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
// cos(-2×Pi) + i × sin(-2×Pi)
// yields the identity.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
const float2 cache[radix] = {
swapIf(inverse, source[mad(0, delta, basePosition)]),
swapIf(inverse, source[mad(1, delta, basePosition)])
};
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(radix, radixsIndex, 0), swapIf(inverse, cache[0] + cache[1]));
writeGroupShared(mad(radix, radixsIndex, 1), swapIf(inverse, cache[0] - cache[1]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
// cos(-2×Pi) + i × sin(-2×Pi)
// yields the identity.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
const float2 cache[radix] = {
swapIf(inverse, source[mad(0, delta, basePosition)]),
swapIf(inverse, source[mad(1, delta, basePosition)]),
swapIf(inverse, source[mad(2, delta, basePosition)]),
swapIf(inverse, source[mad(3, delta, basePosition)])
};
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(radix, radixsIndex, 0), swapIf(inverse, float2(
cache[0].x + cache[1].x + cache[2].x + cache[3].x,
cache[0].y + cache[1].y + cache[2].y + cache[3].y
)));
writeGroupShared(mad(radix, radixsIndex, 1), swapIf(inverse, float2(
cache[0].x + cache[1].y - cache[2].x - cache[3].y,
cache[0].y - cache[1].x - cache[2].y + cache[3].x
)));
writeGroupShared(mad(radix, radixsIndex, 2), swapIf(inverse, float2(
cache[0].x - cache[1].x + cache[2].x - cache[3].x,
cache[0].y - cache[1].y + cache[2].y - cache[3].y
)));
writeGroupShared(mad(radix, radixsIndex, 3), swapIf(inverse, float2(
cache[0].x - cache[1].y - cache[2].x + cache[3].y,
cache[0].y + cache[1].x - cache[2].y - cache[3].x
)));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
}
# if defined(AMD_RADEON_HD)
switchGSMBank(); // on AMD Radeon HD hardware, a ping-pong pattern is faster
# endif
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from a packed real texture to group-shared memory.
// Used to perform DFTs on temporary results which had previously been stored in textures.
// • Radix 2 or 4.
// • The input is read as real numbers from a channel of the given texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTFromRealTexture(
const uint radix,
Texture2DArray<float> source,
const uint channelsIndex,
const bool vertical,
const uint signalsIndex,
const uint radixsIndex
) {
const uint3 basePosition = vertical
? uint3(signalsIndex, radixsIndex, channelsIndex)
: uint3(radixsIndex, signalsIndex, channelsIndex);
const uint3 delta = vertical
? uint3(0, imageSidelength / radix, 0)
: uint3(imageSidelength / radix, 0, 0);
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
// cos(-2×Pi) + i × sin(-2×Pi)
// yields the identity.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
const float cache[radix] = {
source[mad(0, delta, basePosition)],
source[mad(1, delta, basePosition)]
};
// Save many arithmetic instructions through complex multiplication with pure real numbers.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(radix, radixsIndex, 0), float2(cache[0] + cache[1], 0.0f));
writeGroupShared(mad(radix, radixsIndex, 1), float2(cache[0] - cache[1], 0.0f));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
// cos(-2×Pi) + i × sin(-2×Pi)
// yields the identity.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
const float cache[radix] = {
source[mad(0, delta, basePosition)],
source[mad(1, delta, basePosition)],
source[mad(2, delta, basePosition)],
source[mad(3, delta, basePosition)]
};
// Save many arithmetic instructions through complex multiplication with pure real numbers.
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
writeGroupShared(mad(radix, radixsIndex, 0), float2(cache[0] + cache[1] + cache[2] + cache[3], 0.0f));
writeGroupShared(mad(radix, radixsIndex, 1), float2(cache[0] - cache[2], -cache[1] + cache[3]));
writeGroupShared(mad(radix, radixsIndex, 2), float2(cache[0] - cache[1] + cache[2] - cache[3], 0.0f));
writeGroupShared(mad(radix, radixsIndex, 3), float2(cache[0] - cache[2], cache[1] - cache[3]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
}
# if defined(AMD_RADEON_HD)
switchGSMBank(); // on AMD Radeon HD hardware, a ping-pong pattern is faster
# endif
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from group-shared memory to a complex texture.
// Used to store temporal results like horizontal DFTs.
// • Radix 2 or 4.
// • The input is read from the "inBank" bank of group-shared memory.
// • The output is written as complex numbers to a channel of the given texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTToComplexTexture(
const uint radix,
const bool inverse,
RWTexture2DArray<float2> destination,
const uint channelsIndex,
const bool vertical,
const uint signalsIndex,
const uint radixsIndex,
const uint stepSizeInSamples
) {
const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
const uint3 basePosition = vertical
? uint3(signalsIndex, baseIndex, channelsIndex)
: uint3(baseIndex, signalsIndex, channelsIndex);
const uint3 delta = vertical
? uint3(0, stepSizeInSamples, 0)
: uint3(stepSizeInSamples, 0, 0);
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
const float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(1.0f * angle), sin(1.0f * angle))
)
};
// (Unroll manually — loop expressions with global memory access are not optimized well.)
destination[mad(0, delta, basePosition)] = swapIf(inverse, cache[0] + cache[1]);
destination[mad(1, delta, basePosition)] = swapIf(inverse, cache[0] - cache[1]);
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
const float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// (Unroll manually — loop expressions with global memory access are not optimized well.)
destination[mad(0, delta, basePosition)] = swapIf(inverse, float2(
cache[0].x + cache[1].x + cache[2].x + cache[3].x,
cache[0].y + cache[1].y + cache[2].y + cache[3].y
));
destination[mad(1, delta, basePosition)] = swapIf(inverse, float2(
cache[0].x + cache[1].y - cache[2].x - cache[3].y,
cache[0].y - cache[1].x - cache[2].y + cache[3].x
));
destination[mad(2, delta, basePosition)] = swapIf(inverse, float2(
cache[0].x - cache[1].x + cache[2].x - cache[3].x,
cache[0].y - cache[1].y + cache[2].y - cache[3].y
));
destination[mad(3, delta, basePosition)] = swapIf(inverse, float2(
cache[0].x - cache[1].y - cache[2].x + cache[3].y,
cache[0].y + cache[1].x - cache[2].y - cache[3].x
));
}
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation on group-shared memory and multiplies the result with complex numbers read from the
// PSF DFT's texture (vertically).
// Used to combine the scene's DFT with the PSF's DFT before transforming back.
// • Radix 2 or 4.
// • The input is read from group-shared memory.
// • The multiplicands are read vertically from the PSF DFT's texture according to the radix's index and the index in the
// signal.
//------------------------------------------------------------------------------------------------------------------------------
void FFTOnGroupSharedMemoryWithVerticalPSFMultiplication(
const uint radix,
const uint channelsIndex,
const uint signalsIndex,
const uint radixsIndex,
const uint stepSizeInSamples
) {
const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
const uint3 basePosition = uint3(signalsIndex, baseIndex, channelsIndex);
const uint3 delta = uint3(0, stepSizeInSamples, 0);
# if defined(AMD_RADEON_HD)
// On AMD Radeon HD hardware, a ping-pong pattern is faster.
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
float2(cos(1.0f * angle), sin(1.0f * angle))
)
};
// Multiply with the PSF's DFT and write to the scene's DFT texture.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(
cache[0] + cache[1],
kernelsDFTRO[mad(0, delta, basePosition)]
));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(
cache[0] - cache[1],
kernelsDFTRO[mad(1, delta, basePosition)]
));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
switchGSMBank();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
const float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)),
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)),
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// Multiply with the PSF's DFT and write to the scene's DFT texture.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x + cache[1].x + cache[2].x + cache[3].x,
cache[0].y + cache[1].y + cache[2].y + cache[3].y
), kernelsDFTRO[mad(0, delta, basePosition)]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x + cache[1].y - cache[2].x - cache[3].y,
cache[0].y - cache[1].x - cache[2].y + cache[3].x
), kernelsDFTRO[mad(1, delta, basePosition)]));
writeGroupShared(mad(2, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x - cache[1].x + cache[2].x - cache[3].x,
cache[0].y - cache[1].y + cache[2].y - cache[3].y
), kernelsDFTRO[mad(2, delta, basePosition)]));
writeGroupShared(mad(3, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x - cache[1].y - cache[2].x + cache[3].y,
cache[0].y + cache[1].x - cache[2].y - cache[3].x
), kernelsDFTRO[mad(3, delta, basePosition)]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
switchGSMBank();
}
# else // default hardware:
// Cache all values in registers before performing the transformation
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
float2(cos(1.0f * angle), sin(1.0f * angle))
)
};
// Wait for all threads to complete their reading.
GroupMemoryBarrierWithGroupSync();
// Multiply with the PSF's DFT and write to the scene's DFT texture.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(
cache[0] + cache[1],
kernelsDFTRO[mad(0, delta, basePosition)]
));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(
cache[0] - cache[1],
kernelsDFTRO[mad(1, delta, basePosition)]
));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
const float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)),
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)),
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// Wait for all threads to complete their reading.
GroupMemoryBarrierWithGroupSync();
// Multiply with the PSF's DFT and write to the scene's DFT texture.
// (Unroll manually — loop expressions with global memory access are not optimized well.)
writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x + cache[1].x + cache[2].x + cache[3].x,
cache[0].y + cache[1].y + cache[2].y + cache[3].y
), kernelsDFTRO[mad(0, delta, basePosition)]));
writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x + cache[1].y - cache[2].x - cache[3].y,
cache[0].y - cache[1].x - cache[2].y + cache[3].x
), kernelsDFTRO[mad(1, delta, basePosition)]));
writeGroupShared(mad(2, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x - cache[1].x + cache[2].x - cache[3].x,
cache[0].y - cache[1].y + cache[2].y - cache[3].y
), kernelsDFTRO[mad(2, delta, basePosition)]));
writeGroupShared(mad(3, stepSizeInSamples, baseIndex), multiplyComplex(float2(
cache[0].x - cache[1].y - cache[2].x + cache[3].y,
cache[0].y + cache[1].x - cache[2].y - cache[3].x
), kernelsDFTRO[mad(3, delta, basePosition)]));
// Wait for all threads to complete their writing.
GroupMemoryBarrierWithGroupSync();
}
# endif // default hardware
return;
}
//------------------------------------------------------------------------------------------------------------------------------
// Performs an inverse fast Fourier transformation from values in group-shared memory and writes the result as packed reals into
// a texture.
// Used to extract the real values from the inverse DFT and to finalize the inverse transformation.
// • Radix 2 or 4.
// • The input is read from the group-shared memory.
// • The result is written as a packed real number (using the real value of the complex result) to the given texture.
//------------------------------------------------------------------------------------------------------------------------------
void iFFTToRealTexture(
const uint radix,
RWTexture2DArray<float> destination,
const uint channelsIndex,
const bool vertical,
const uint signalsIndex,
const uint radixsIndex,
const uint stepSizeInSamples
) {
// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
// Dotsenko, Burton Smith, and John Manferdelli, Microsoft Corporation), fig. 2.
const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
const uint3 basePosition = vertical
? uint3(signalsIndex, baseIndex, channelsIndex)
: uint3(baseIndex, signalsIndex, channelsIndex);
const uint3 delta = vertical
? uint3(0, stepSizeInSamples, 0)
: uint3(stepSizeInSamples, 0, 0);
if(2 == radix) {
static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
float2(cos(angle), sin(angle))
)
};
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
destination[mad(0, delta, basePosition)] = (cache[0].y + cache[1].y) / imageSidelength;
destination[mad(1, delta, basePosition)] = (cache[0].y - cache[1].y) / imageSidelength;
} else if(4 == radix) {
static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
float2 cache[radix] = {
// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
// (1, 0) and the complex multiplication yields the identity.
loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
multiplyComplex(
loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
float2(cos(1.0f * angle), sin(1.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
float2(cos(2.0f * angle), sin(2.0f * angle))
),
multiplyComplex(
loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
float2(cos(3.0f * angle), sin(3.0f * angle))
)
};
// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
destination[mad(0, delta, basePosition)] = (cache[0].y + cache[1].y + cache[2].y + cache[3].y) / (imageSidelength * imageSidelength);
destination[mad(1, delta, basePosition)] = (cache[0].y - cache[1].x - cache[2].y + cache[3].x) / (imageSidelength * imageSidelength);
destination[mad(2, delta, basePosition)] = (cache[0].y - cache[1].y + cache[2].y - cache[3].y) / (imageSidelength * imageSidelength);
destination[mad(3, delta, basePosition)] = (cache[0].y + cache[1].x - cache[2].y - cache[3].x) / (imageSidelength * imageSidelength);
}
}
[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void kernelHorizontal(
const uint3 indexInSignal : SV_DispatchThreadID
) {
const uint currentChannelsIndex = indexInSignal.z;
FFTFromRealTexture (4, PSFRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 4);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 16);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 64);
TwoFFTsOnGroupSharedMemory (2, dontInvert, indexInSignal.x, 256);
FFTToComplexTexture (4, dontInvert, kernelsDFTRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);
return;
}
[numthreads(1, IMAGE_SIDELENGTH / 4, 1)] // 4 pixels per thread — best for radix 4
void kernelVertical(
const uint3 indexInSignal : SV_DispatchThreadID
) {
const uint currentChannelsIndex = indexInSignal.z;
FFTFromComplexTexture (4, dontInvert, kernelsDFTRO, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 4);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 16);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 64);
TwoFFTsOnGroupSharedMemory (2, dontInvert, indexInSignal.y, 256);
FFTToComplexTexture (4, dontInvert, kernelsDFTRW, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y, 512);
return;
}
[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void imageHorizontal(
const uint3 indexInSignal : SV_DispatchThreadID
) {
// !ARCHITECTURE!
// • AMD Radeon HD: Performing the FFT with radix 4-2-4-4-4-4 saves one register.
const uint currentChannelsIndex = indexInSignal.z;
FFTFromRealTexture (4, glareRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
TwoFFTsOnGroupSharedMemory (2, dontInvert, indexInSignal.x, 4);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 8);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 32);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.x, 128);
FFTToComplexTexture (4, dontInvert, imagesDFTRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);
return;
}
[numthreads(1, IMAGE_SIDELENGTH / 4, 1)] // 4 pixels per thread — best for radix 4
void imageVertical(
const uint3 indexInSignal : SV_DispatchThreadID
) {
// !ARCHITECTURE!
// • AMD Radeon HD: Performing the FFT with radix 4-4-4-4-2-4 and the iFFT with radix 2-4-4-4-4-4 saves ten registers and
// 37 scratch registers.
const uint currentChannelsIndex = indexInSignal.z;
FFTFromComplexTexture (4, dontInvert, imagesDFTRO, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 4);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 16);
FFTOnGroupSharedMemory (4, dontInvert, indexInSignal.y, 64);
TwoFFTsOnGroupSharedMemory (2, dontInvert, indexInSignal.y, 256);
FFTOnGroupSharedMemoryWithVerticalPSFMultiplication(4, currentChannelsIndex, indexInSignal.x, indexInSignal.y, 512);
TwoFFTsOnGroupSharedMemory (2, invert, indexInSignal.y, 1);
FFTOnGroupSharedMemory (4, invert, indexInSignal.y, 2);
FFTOnGroupSharedMemory (4, invert, indexInSignal.y, 8);
FFTOnGroupSharedMemory (4, invert, indexInSignal.y, 32);
FFTOnGroupSharedMemory (4, invert, indexInSignal.y, 128);
FFTToComplexTexture (4, invert, imagesDFTRW, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y, 512);
return;
}
[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void imageInverseHorizontal(
const uint3 indexInSignal : SV_DispatchThreadID
) {
// !ARCHITECTURE!
// • AMD Radeon HD: Performing the FFT with radix 4-2-4-4-4-4 saves one register.
const uint currentChannelsIndex = indexInSignal.z;
FFTFromComplexTexture (4, invert, imagesDFTRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
TwoFFTsOnGroupSharedMemory (2, invert, indexInSignal.x, 4);
FFTOnGroupSharedMemory (4, invert, indexInSignal.x, 8);
FFTOnGroupSharedMemory (4, invert, indexInSignal.x, 32);
FFTOnGroupSharedMemory (4, invert, indexInSignal.x, 128);
iFFTToRealTexture (4, glareRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);
return;
}