[Projekt] StoneQuest lebt noch!

Krishty · Beitrag von **Krishty** » 21.01.2017, 11:49

Man führt eine horizontale und eine vertikale FFT durch. Die Horizontale arbeitet auf 64 (oder was eben die Wavefront-Größe ist) Zeilen parallel; die vertikale auf entsprechend vielen Spalten.

Problematisch ist vor allem das Anordnen der Reads und Writes – ich hatte da Unterschiede von 50:1(!) je nach Speicher-Layout der Daten. Da gibt es aber heute bestimmt Papers zur optimalen Anordnung. Aber schneller als die CPU ist es in jedem Fall; ich habe anno 2010 auf einer 11.0-Karte 4096×4096×7×30 (Texturgröße × Farbkanäle × fps) Pixel pro Sekunde gefaltet. Die entsprechende CPU-Implementierung möchte ich mal sehen.

Erfahrungen von mir:

die FFT ist periodisch; damit ein heller Punkt am linken Bildrand keinen Blur am rechten Bildrand auslöst, muss man die Auflösung verdoppeln. Mathe-Freaks finden da vielleicht eine bessere Lösung.
Der Shared Memory limitiert arg; reicht in D3D gerade mal für eine 4096×4096-DCT aus. (Zusammen mit 1. bedeutet das, dass die Bildauflösung auf 2048 Pixel Seitenlänge limitiert ist; und tatsächlich hört meine Sternendemo bei größeren Auflösungen auf, zu funktionieren.)
Indem man eine FFT auf reellen Zahlen durchführt, kann man fast die Hälfte des Speichers sparen. Dann kriegen aber alle Texturen krumme Auflösungen (2049 Pixel!) und ich hatte keine Zeit, das weiter zu erforschen.

Nachtrag: Mir fällt wieder ein, warum ich das fallen gelassen habe – wegen dem scheiß HLSL-Compiler. Der war damals so buggy, dass man ihn kaum benutzen konnte. Mein Shader hat 45 Minuten kompiliert(!) …

Hier ist mein alter Quelltext, falls er jemanden interessiert. Cooley-Tukey-FFT mit radix-4-Butterfly, wenn ich das recht sehe. Eine Gruppe pro Zeile, Synchronisierung nach jedem Radix. Umsortieren der Befehle hat stark unterschiedlichen Code produziert, darum so obskure Folgen wie Radix-4-4-2-4-4-4 usw. Diese spezielle Datei behandelt Bilder von 2048² Pixeln. Ich wette, dass man das heute mindestens doppelt so schnell implementieren kann.

Code: Alles auswählen

//==============================================================================================================================
// convolution 2048.hlsl
// public domain by Krishty, 2008–2011
//==============================================================================================================================
//
// !ARCHITECTURE! marks architecture-dependent optimization opportunities. They should be enclosed in #ifdef #elif #else #endif
//	blocks, so the code base remains the same for all architectures. Check for the following #defines:
//		• AMD_RADEON_HD		AMD Radeon HD 5xxx or higher architecture (4xxx doesn't offer compute shaders)
//
// Some notes on the many discrete Fourier transformations being done here:
//	• All DFTs are realized as fast Fourier transformations with different radixes. Higher radixes require less arithmetic
//		instructions and less group synchronizations, but they also require more registers and read/write scattering and they
//		are more complex. Lower-radix FFTs perform better on current GPUs (2011) for the following reasons:
//			— Compilers fail on large shaders. This applies currently (early 2011) to Microsofts HLSL shader compiler as well as
//				to AMD's shader bytecode compiler. Both produce extremely poor code for shaders with hundreds of instructions:
//				20 % of all instructions are useless; ALU utilization is below 50 %; unneeded scratch registers both in local
//				and global memory are a huge problem.
//			— Read/write scattering is awfully slow and memory bandwidth is a bottleneck. Currently, it is unknown whether this
//				is by concept or a driver bug.
//			— Simple shaders in many threads perform faster than complex shaders in not-so-many threads, even if the total
//				amount of work is bigger (probably because the scheduler can hide latencies better).
//			— Although the number of available GPU registers is quite high (AMD provides at least 32 and at most 512 registers,
//				depending on the number of threads per group), using more than a few registers (6 on AMD hardware) reduces
//				performance drastically.
//		For the future, however, we expect higher-radix FFTs to perform better:
//			— Read/write scattering performance is constantly improving (GPGPU).
//			— GPU performance on complex shaders is constantly improving (more registers, better compilers due to GPGPU).
//	• We use the Stockham auto-sort algorithm. This is an out-of-place algorithm which reads from one source, performs an FFT
//		with an arbitrary radix and writes the result to another array. It has been proposed in "High Performance Discrete
//		Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John
//		Manferdelli, Microsoft Corporation). Basic functionality is copied from the paper.
//	• Most papers propose to invert a FFT by negating the complex exponent. This would, however, break all higher-radix
//		butterflies (because their implementation depends on the twiddle factors being used and therefore on the sign of the
//		complex exponent). It is — far — easier to swap the real and imaginary components of the operands before and after
//		applying an ordinary FFT. This is fully compatible to all FFT butterflies, requires little code change and has nearly no
//		cost when implemented as register swizzling. The final division by the number of samples remains. Source:
//		"Algorithms for Programmers" (Jörg Arndt), chapter 1.7: "Inverse FFT for free".
//	• The group-shared memory in Direct3D 11 is limited to 32 KiB; this is
//			— 8192 'float' values
//			— 4096 complex ('float2') values
//	• For input and output of the FFT algorithm, there are two possibilies:
//		— Using two group-shared arrays ("bank 0" and "bank 1") in a ping-pong pattern. Because different FFTs access different
//			elements, a group thread synchronization is necessary before the banks are swapped. This doubles the space
//			requirement of the algorithm — FFTs are limited to 2048 samples.
//		— Using only one group-shared array and local registers as a cache. This consumes less group-shared memory (now FFTs on
//			4096 samples are possible) but requires twice as many group synchronizations.
//		Although the first possibility has proven to be slightly faster (5 %) on AMD GPUs, it also takes thrice as much time to
//		compile and performs 13.8 % worse on Nvidia GPUs.
//	• The red, green and blue components (where needed) are treated sequentially. This greatly simplifies the shaders, reduces
//		read/write scattering and allows high parallelization.
//		For horizontal FFTs, the input color channels should be stacked vertically and written out horizontally (this behaviour
//		is swapped on inverse FFTs). Rationale: the Fourier transformation of a zero signal is zero, again — therefore, all FFTs
//		on padding can be omitted. As a fortunate coincidence, out-of-bounds reads from DXGI resources return 0, too — the
//		padding can be omitted completely and the texture can be cropped down to the input's actual size. This saves a lot of
//		texture space and bandwidth (73 % on vertical passes with full HD — 1080 pixels are read and written instead of 4096)
//		and the need to clear the texture with zeroes (which limits the bandwidth, too).
//
//==============================================================================================================================



//==============================================================================================================================
// GENERIC FUNCTIONS
// This code handles mostly FFTs and is the same for all convolution implementations. It can be re-used by simply adjusting the
//	"IMAGE_SIDELENGTH" constant.
//==============================================================================================================================
// #define AMD_RADEON_HD
#define IMAGE_SIDELENGTH 2048 // group size attributes require literal constants
static const uint imageSidelength = IMAGE_SIDELENGTH;



// !ARCHITECTURE! Scratch memory for the current DFT. Never read or write it directly — use "loadGroupShared()" and
//	"writeGroupShared" to read and write, respectively. This allows quick adjustion of the group-shared memory layout for
//	different architectures, e.g. to avoid bank conflicts.
#if defined(AMD_RADEON_HD)
	//	• on AMD Radeon HD architecture, storing the values as 'float2's has proven to be 5.8 % faster than seperating real and
	//		imaginary components
	//	• on AMD Radeon HD architecture, the ping-pong pattern is >5 % faster than using a single group-shared array (but it
	//		requires more than twice the time to compile) — this is probably due to the compiler having problems transferring
	//		instructions over group thread barriers (a ping-pong pattern halves the number of required synchronizations)

	groupshared float2 groupSharedValues[2][imageSidelength];
	static uint currentGSMOutBanksIndex = 1; // first writing to bank 1 saves two cycles; I don't know why

	float2 loadGroupShared(
		const uint index
	) {
		return groupSharedValues[currentGSMOutBanksIndex ^ 1][index];
	}
	//..........................................................................................................................
	void writeGroupShared(
		const uint index,
		const float2 value
	) {
		groupSharedValues[currentGSMOutBanksIndex][index] = value;
		return;
	}
	//..........................................................................................................................
	void switchGSMBank() {
		currentGSMOutBanksIndex ^= 1;
		return;
	}

#else // not AMD Radeon HD:
	//	• on Nvidia GeForce GT architecture, a single group-shared array is 13.8 % faster than the ping-pong pattern — this is
	//		probably due to a group-shared memory size limitation there; halving the GSM consumption doubles the effective
	//		wavefront size

	groupshared float2 groupSharedValues[imageSidelength];

	float2 loadGroupShared(
		const uint index
	) {
		return groupSharedValues[index];
	}
	//..........................................................................................................................
	void writeGroupShared(
		const uint index,
		const float2 value
	) {
		groupSharedValues[index] = value;
		return;
	}

#endif // default hardware



// For better readability of function parameters.
static const bool invert = true;
static const bool dontInvert = false;
static const bool horizontally = false;
static const bool vertically = true;



// The input and output buffers. Their registers overlap because all shaders expect only one input and write to one output (with
//	the exception of the convolution itself, which reads from the image's DFT as well as from the kernel's DFT).

// Stores the glare with its three color channels vertically stacked. At the beginning of the glare operator, this is the
//	resolved (and possibly downsampled) scene.
RWTexture2DArray<float>		glareRW			: register(u0);
Texture2DArray<float>		glareRO			: register(t0);

// Stores the point spread function with its three color channels vertically stacked.
Texture2DArray<float>		PSFRO			: register(t0);

// Stores the DFT of the aperture's point spread function. The real and imaginary components are stored in the X and Y component
//	and its three color channels are vertically stacked.
RWTexture2DArray<float2>	kernelsDFTRW	: register(u0);
Texture2DArray<float2>		kernelsDFTRO	: register(t1);

// Stores the DFT of the scene in the same layout as the aperture PSF DFT's texture.
RWTexture2DArray<float2>	imagesDFTRW		: register(u0);
Texture2DArray<float2>		imagesDFTRO		: register(t0);



//------------------------------------------------------------------------------------------------------------------------------
// Swaps the real and imaginary components of the given complex number if the given boolean expression evaluates TRUE.
// Allows re-use of FFT functions for iFFT by just switching a boolean expression.
//------------------------------------------------------------------------------------------------------------------------------
float2 swapIf(
	const bool		doOrDont,
	const float2	value
) {
	return doOrDont ? value.yx : value.xy;
}



//------------------------------------------------------------------------------------------------------------------------------
// Multiplies the two given complex numbers.
//------------------------------------------------------------------------------------------------------------------------------
float2 multiplyComplex(
	const float2 a,
	const float2 b
) {
	// !ARCHITECTURE! There are several ways to express a complex multiplication:
	return float2(

#		if defined(AMD_RADEON_HD)
			// Saves one out of 100 instructions and — sometimes — one register. Since the main problem on AMD Radeon HD
			//	hardware is register pressure, this performs best.
			dot(float2(a.x, -a.y), float2(b.x, b.y)),
			dot(float2(a.x, a.y), float2(b.y, b.x))
#		else
			// Saves another instruction, but does not lower register pressure. Decide after profiling.
			mad(a.x, b.x, -a.y * b.y),
			mad(a.x, b.y, a.y * b.x)
			// The default procedure. Source: http://en.wikipedia.org/wiki/Complex_numbers#Multiplication_and_division.
			//	a.x * b.x - a.y * b.y,
			//	a.x * b.y + a.y * b.x
#		endif
	);
}



//------------------------------------------------------------------------------------------------------------------------------
// Computes the base position for an FFT's output.
//------------------------------------------------------------------------------------------------------------------------------
uint baseIndexFor(
	const uint radixsIndex,
	const uint stepSizeInSamples,
	const uint radix
) {
	// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
	//	Dotsenko, Burton Smith, and John Manferdelli; Microsoft Corporation), fig. 2.
	return (radixsIndex / stepSizeInSamples) * stepSizeInSamples * radix + (radixsIndex % stepSizeInSamples);
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation on group-shared memory.
//	• Radix 2 or 4.
//	• Can be used for an inverse FFT.
// This function can be executed in parallel — subsequent calls with the same step size but different radix indices will not
//	overlap. It reads and writes out-of-place — after the input has been written, the group-shared memory must be
//	synchronized. This is automatically done before this routine returns.
//------------------------------------------------------------------------------------------------------------------------------
void FFTOnGroupSharedMemory(
	const uint radix,
	const bool inverse,
	const uint radixsIndex,
	const uint stepSizeInSamples
) {
	// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
	//	Dotsenko, Burton Smith, and John Manferdelli, Microsoft Corporation), fig. 2.
	const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
	const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix);

#	if defined(AMD_RADEON_HD)
		// On AMD Radeon HD hardware, a ping-pong pattern is faster.

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
				//	to (1, 0) and the complex multiplication yields the identity.
				swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
					float2(cos(angle), sin(angle))
				)
			};

			// Use a radix-2 butterfly to transform the samples. Source: http://en.wikipedia.org/wiki/Butterfly_diagram.
			const float2 value0 = cache[0];
			const float2 value1 = cache[1];
			cache[0] = value0 + value1;
			cache[1] = value0 - value1;

			// Write the result out-of-order back.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();
			switchGSMBank();

		} else if(4 == radix) {
			static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
				//	to (1, 0) and the complex multiplication yields the identity.
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
					float2(cos(2.0f * angle), sin(2.0f * angle))
				),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
					float2(cos(3.0f * angle), sin(3.0f * angle))
				)
			};

			// Use a radix-4 butterfly to transform the samples. Source: unknown; found it somewhere on the internet.
			const float2 value0 = cache[0];
			const float2 value1 = cache[1];
			const float2 value2 = cache[2];
			const float2 value3 = cache[3];

			cache[0] = float2(
				value0.x + value1.x + value2.x + value3.x,
				value0.y + value1.y + value2.y + value3.y
			);
			cache[1] = float2(
				value0.x + value1.y - value2.x - value3.y,
				value0.y - value1.x - value2.y + value3.x
			);
			cache[2] = float2(
				value0.x - value1.x + value2.x - value3.x,
				value0.y - value1.y + value2.y - value3.y
			);
			cache[3] = float2(
				value0.x - value1.y - value2.x + value3.y,
				value0.y + value1.x - value2.y - value3.x
			);

			// Write the result out-of-order back.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
			writeGroupShared(mad(2, stepSizeInSamples, baseIndex), swapIf(inverse, cache[2]));
			writeGroupShared(mad(3, stepSizeInSamples, baseIndex), swapIf(inverse, cache[3]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();
			switchGSMBank();

		}

#	else // default hardware:
		// Cache all group-shared values in registers before performing the transformation.

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
				//	to (1, 0) and the complex multiplication yields the identity.
				swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
					float2(cos(angle), sin(angle))
				)
			};

			// Wait for all threads to complete their reading before the result is written back.
			GroupMemoryBarrierWithGroupSync();

			// Use a radix-2 butterfly to transform the samples. Source: http://en.wikipedia.org/wiki/Butterfly_diagram.
			const float2 value0 = cache[0];
			const float2 value1 = cache[1];
			cache[0] = value0 + value1;
			cache[1] = value0 - value1;

			// Write the result out-of-order back.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();

		} else if(4 == radix) {
			static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates
				//	to (1, 0) and the complex multiplication yields the identity.
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
					float2(cos(2.0f * angle), sin(2.0f * angle))
				),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
					float2(cos(3.0f * angle), sin(3.0f * angle))
				)
			};

			// Wait for all threads to complete their reading before the result is written back.
			GroupMemoryBarrierWithGroupSync();

			// Use a radix-4 butterfly to transform the samples. Source: unknown; found it somewhere on the internet.
			const float2 value0 = cache[0];
			const float2 value1 = cache[1];
			const float2 value2 = cache[2];
			const float2 value3 = cache[3];

			cache[0] = float2(
				value0.x + value1.x + value2.x + value3.x,
				value0.y + value1.y + value2.y + value3.y
			);
			cache[1] = float2(
				value0.x + value1.y - value2.x - value3.y,
				value0.y - value1.x - value2.y + value3.x
			);
			cache[2] = float2(
				value0.x - value1.x + value2.x - value3.x,
				value0.y - value1.y + value2.y - value3.y
			);
			cache[3] = float2(
				value0.x - value1.y - value2.x + value3.y,
				value0.y + value1.x - value2.y - value3.x
			);

			// Write the result out-of-order back.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), swapIf(inverse, cache[0]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), swapIf(inverse, cache[1]));
			writeGroupShared(mad(2, stepSizeInSamples, baseIndex), swapIf(inverse, cache[2]));
			writeGroupShared(mad(3, stepSizeInSamples, baseIndex), swapIf(inverse, cache[3]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();

		}

#	endif // default hardware

	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs two parallel FFTs on group-shared memory.
// This function is necessary because some powers of two can not be expressed by a single radix — e.g. 512 must be transformed
//	with radix 4-4-4-4-2. Since the shader cannot switch its thread group size while in execution, two lower-radix FFTs must be
//	performed. This would, however, enforce two additional group synchronizations (although both lower-radix FFTs do not
//	overlap). This function's purpose is to offer two parallel lower-radix FFTs without unnecessary synchronizations.
//	• Radix 2.
//	• Can be used for an inverse FFT.
//------------------------------------------------------------------------------------------------------------------------------
void TwoFFTsOnGroupSharedMemory(
	const uint radix,
	const bool inverse,
	const uint radixsSuperiorIndex,
	const uint stepSizeInSamples
) {
	const uint2 radixsIndex = uint2(
		mad(2, radixsSuperiorIndex, 0),
		mad(2, radixsSuperiorIndex, 1)
	);
	const float2 angle = float2(
		-6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex[0] % stepSizeInSamples),
		-6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex[1] % stepSizeInSamples)
	);

	const uint2 baseIndex = uint2(
		baseIndexFor(radixsIndex[0], stepSizeInSamples, radix),
		baseIndexFor(radixsIndex[1], stepSizeInSamples, radix)
	);

#	if defined(AMD_RADEON_HD)
		// On AMD Radeon HD hardware, a ping-pong pattern is faster.

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
			const float2 cacheA[radix] = {
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[0]))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[0]))),
					float2(cos(angle[0]), sin(angle[0]))
				)
			};
			const float2 cacheB[radix] = {
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[1]))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[1]))),
					float2(cos(angle[1]), sin(angle[1]))
				)
			};

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] + cacheA[1]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] - cacheA[1]));
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] + cacheB[1]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] - cacheB[1]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();
			switchGSMBank();

		}

#	else // default hardware:
		// Cache all group-shared values in registers before performing the transformation.

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// Load two samples into a local cache and multiply them with their twiddle factors.
			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
			const float2 cacheA[radix] = {
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[0]))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[0]))),
					float2(cos(angle[0]), sin(angle[0]))
				)
			};
			const float2 cacheB[radix] = {
					swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex[1]))),
				multiplyComplex(
					swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex[1]))),
					float2(cos(angle[1]), sin(angle[1]))
				)
			};

			// Wait for all threads to complete their reading before the result is written back.
			GroupMemoryBarrierWithGroupSync();

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] + cacheA[1]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex[0]), swapIf(inverse, cacheA[0] - cacheA[1]));
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] + cacheB[1]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex[1]), swapIf(inverse, cacheB[0] - cacheB[1]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();

		}

#	endif // default hardware

	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from a complex texture to group-shared memory.
//	• Radix 2 or 4.
//	• Can be used for an inverse FFT.
//	• The input is read as complex numbers from a channel of the given source texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTFromComplexTexture(
	const uint				radix,
	const bool				inverse,
	Texture2DArray<float2>	source,
	const uint				channelsIndex,
	const bool				vertical,
	const uint				signalsIndex,
	const uint				radixsIndex
) {
	const uint3 basePosition = vertical
		? uint3(signalsIndex, radixsIndex, channelsIndex)
		: uint3(radixsIndex, signalsIndex, channelsIndex);
	const uint3 delta = vertical
		? uint3(0, imageSidelength / radix, 0)
		: uint3(imageSidelength / radix, 0, 0);

	if(2 == radix) {
		static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
		//		cos(-2×Pi) + i × sin(-2×Pi)
		//	yields the identity.
		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		const float2 cache[radix] = {
			swapIf(inverse, source[mad(0, delta, basePosition)]),
			swapIf(inverse, source[mad(1, delta, basePosition)])
		};

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		writeGroupShared(mad(radix, radixsIndex, 0), swapIf(inverse, cache[0] + cache[1]));
		writeGroupShared(mad(radix, radixsIndex, 1), swapIf(inverse, cache[0] - cache[1]));

		// Wait for all threads to complete their writing.
		GroupMemoryBarrierWithGroupSync();

	} else if(4 == radix) {
		static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
		//		cos(-2×Pi) + i × sin(-2×Pi)
		//	yields the identity.
		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		const float2 cache[radix] = {
			swapIf(inverse, source[mad(0, delta, basePosition)]),
			swapIf(inverse, source[mad(1, delta, basePosition)]),
			swapIf(inverse, source[mad(2, delta, basePosition)]),
			swapIf(inverse, source[mad(3, delta, basePosition)])
		};

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		writeGroupShared(mad(radix, radixsIndex, 0), swapIf(inverse, float2(
			cache[0].x + cache[1].x + cache[2].x + cache[3].x,
			cache[0].y + cache[1].y + cache[2].y + cache[3].y
		)));
		writeGroupShared(mad(radix, radixsIndex, 1), swapIf(inverse, float2(
			cache[0].x + cache[1].y - cache[2].x - cache[3].y,
			cache[0].y - cache[1].x - cache[2].y + cache[3].x
		)));
		writeGroupShared(mad(radix, radixsIndex, 2), swapIf(inverse, float2(
			cache[0].x - cache[1].x + cache[2].x - cache[3].x,
			cache[0].y - cache[1].y + cache[2].y - cache[3].y
		)));
		writeGroupShared(mad(radix, radixsIndex, 3), swapIf(inverse, float2(
			cache[0].x - cache[1].y - cache[2].x + cache[3].y,
			cache[0].y + cache[1].x - cache[2].y - cache[3].x
		)));

		// Wait for all threads to complete their writing.
		GroupMemoryBarrierWithGroupSync();

	}

#	if defined(AMD_RADEON_HD)
		switchGSMBank(); // on AMD Radeon HD hardware, a ping-pong pattern is faster
#	endif
	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from a packed real texture to group-shared memory.
// Used to perform DFTs on temporary results which had previously been stored in textures.
//	• Radix 2 or 4.
//	• The input is read as real numbers from a channel of the given texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTFromRealTexture(
	const uint				radix,
	Texture2DArray<float>	source,
	const uint				channelsIndex,
	const bool				vertical,
	const uint				signalsIndex,
	const uint				radixsIndex
) {
	const uint3 basePosition = vertical
		? uint3(signalsIndex, radixsIndex, channelsIndex)
		: uint3(radixsIndex, signalsIndex, channelsIndex);
	const uint3 delta = vertical
		? uint3(0, imageSidelength / radix, 0)
		: uint3(imageSidelength / radix, 0, 0);

	if(2 == radix) {
		static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
		//		cos(-2×Pi) + i × sin(-2×Pi)
		//	yields the identity.
		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		const float cache[radix] = {
			source[mad(0, delta, basePosition)],
			source[mad(1, delta, basePosition)]
		};

		// Save many arithmetic instructions through complex multiplication with pure real numbers.
		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		writeGroupShared(mad(radix, radixsIndex, 0), float2(cache[0] + cache[1], 0.0f));
		writeGroupShared(mad(radix, radixsIndex, 1), float2(cache[0] - cache[1], 0.0f));

		// Wait for all threads to complete their writing.
		GroupMemoryBarrierWithGroupSync();

	} else if(4 == radix) {
		static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// With a step size of 1, the twiddle factors can be omitted completely — a multiplication with the complex number
		//		cos(-2×Pi) + i × sin(-2×Pi)
		//	yields the identity.
		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		const float cache[radix] = {
			source[mad(0, delta, basePosition)],
			source[mad(1, delta, basePosition)],
			source[mad(2, delta, basePosition)],
			source[mad(3, delta, basePosition)]
		};

		// Save many arithmetic instructions through complex multiplication with pure real numbers.
		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		writeGroupShared(mad(radix, radixsIndex, 0), float2(cache[0] + cache[1] + cache[2] + cache[3], 0.0f));
		writeGroupShared(mad(radix, radixsIndex, 1), float2(cache[0] - cache[2], -cache[1] + cache[3]));
		writeGroupShared(mad(radix, radixsIndex, 2), float2(cache[0] - cache[1] + cache[2] - cache[3], 0.0f));
		writeGroupShared(mad(radix, radixsIndex, 3), float2(cache[0] - cache[2], cache[1] - cache[3]));

		// Wait for all threads to complete their writing.
		GroupMemoryBarrierWithGroupSync();

	}

#	if defined(AMD_RADEON_HD)
		switchGSMBank(); // on AMD Radeon HD hardware, a ping-pong pattern is faster
#	endif
	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation from group-shared memory to a complex texture.
// Used to store temporal results like horizontal DFTs.
//	• Radix 2 or 4.
//	• The input is read from the "inBank" bank of group-shared memory.
//	• The output is written as complex numbers to a channel of the given texture (either horizontally or vertically).
//------------------------------------------------------------------------------------------------------------------------------
void FFTToComplexTexture(
	const uint					radix,
	const bool					inverse,
	RWTexture2DArray<float2>	destination,
	const uint					channelsIndex,
	const bool					vertical,
	const uint					signalsIndex,
	const uint					radixsIndex,
	const uint					stepSizeInSamples
) {
	const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
	const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
	const uint3 basePosition = vertical
		? uint3(signalsIndex, baseIndex, channelsIndex)
		: uint3(baseIndex, signalsIndex, channelsIndex);
	const uint3 delta = vertical
		? uint3(0, stepSizeInSamples, 0)
		: uint3(stepSizeInSamples, 0, 0);

	if(2 == radix) {
		static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		const float2 cache[radix] = {
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
				swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
			multiplyComplex(
				swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
				float2(cos(1.0f * angle), sin(1.0f * angle))
			)
		};

		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		destination[mad(0, delta, basePosition)] = swapIf(inverse, cache[0] + cache[1]);
		destination[mad(1, delta, basePosition)] = swapIf(inverse, cache[0] - cache[1]);

	} else if(4 == radix) {
		static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		const float2 cache[radix] = {
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
				swapIf(inverse, loadGroupShared(mad(0, imageSidelength / radix, radixsIndex))),
			multiplyComplex(
				swapIf(inverse, loadGroupShared(mad(1, imageSidelength / radix, radixsIndex))),
				float2(cos(1.0f * angle), sin(1.0f * angle))
			),
			multiplyComplex(
				swapIf(inverse, loadGroupShared(mad(2, imageSidelength / radix, radixsIndex))),
				float2(cos(2.0f * angle), sin(2.0f * angle))
			),
			multiplyComplex(
				swapIf(inverse, loadGroupShared(mad(3, imageSidelength / radix, radixsIndex))),
				float2(cos(3.0f * angle), sin(3.0f * angle))
			)
		};

		// (Unroll manually — loop expressions with global memory access are not optimized well.)
		destination[mad(0, delta, basePosition)] = swapIf(inverse, float2(
														cache[0].x + cache[1].x + cache[2].x + cache[3].x,
														cache[0].y + cache[1].y + cache[2].y + cache[3].y
													));
		destination[mad(1, delta, basePosition)] = swapIf(inverse, float2(
														cache[0].x + cache[1].y - cache[2].x - cache[3].y,
														cache[0].y - cache[1].x - cache[2].y + cache[3].x
													));
		destination[mad(2, delta, basePosition)] = swapIf(inverse, float2(
														cache[0].x - cache[1].x + cache[2].x - cache[3].x,
														cache[0].y - cache[1].y + cache[2].y - cache[3].y
													));
		destination[mad(3, delta, basePosition)] = swapIf(inverse, float2(
														cache[0].x - cache[1].y - cache[2].x + cache[3].y,
														cache[0].y + cache[1].x - cache[2].y - cache[3].x
													));

	}

	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs a fast Fourier transformation on group-shared memory and multiplies the result with complex numbers read from the
//	PSF DFT's texture (vertically).
// Used to combine the scene's DFT with the PSF's DFT before transforming back.
//	• Radix 2 or 4.
//	• The input is read from group-shared memory.
//	• The multiplicands are read vertically from the PSF DFT's texture according to the radix's index and the index in the
//		signal.
//------------------------------------------------------------------------------------------------------------------------------
void FFTOnGroupSharedMemoryWithVerticalPSFMultiplication(
	const uint			radix,
	const uint			channelsIndex,
	const uint			signalsIndex,
	const uint			radixsIndex,
	const uint			stepSizeInSamples
) {
	const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
	const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
	const uint3 basePosition = uint3(signalsIndex, baseIndex, channelsIndex);
	const uint3 delta = uint3(0, stepSizeInSamples, 0);

#	if defined(AMD_RADEON_HD)
		// On AMD Radeon HD hardware, a ping-pong pattern is faster.

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
				//	(1, 0) and the complex multiplication yields the identity.
					loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
				multiplyComplex(
					loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				)
			};

			// Multiply with the PSF's DFT and write to the scene's DFT texture.
			// (Unroll manually — loop expressions with global memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(
															cache[0] + cache[1],
															kernelsDFTRO[mad(0, delta, basePosition)]
														));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(
															cache[0] - cache[1],
															kernelsDFTRO[mad(1, delta, basePosition)]
														));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();
			switchGSMBank();

		} else if(4 == radix) {
			static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			const float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
				//	(1, 0) and the complex multiplication yields the identity.
					loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
				multiplyComplex(
					loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				),
				multiplyComplex(
					loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)),
					float2(cos(2.0f * angle), sin(2.0f * angle))
				),
				multiplyComplex(
					loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)),
					float2(cos(3.0f * angle), sin(3.0f * angle))
				)
			};

			// Multiply with the PSF's DFT and write to the scene's DFT texture.
			// (Unroll manually — loop expressions with global memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x + cache[1].x + cache[2].x + cache[3].x,
															cache[0].y + cache[1].y + cache[2].y + cache[3].y
														), kernelsDFTRO[mad(0, delta, basePosition)]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x + cache[1].y - cache[2].x - cache[3].y,
															cache[0].y - cache[1].x - cache[2].y + cache[3].x
														), kernelsDFTRO[mad(1, delta, basePosition)]));
			writeGroupShared(mad(2, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x - cache[1].x + cache[2].x - cache[3].x,
															cache[0].y - cache[1].y + cache[2].y - cache[3].y
														), kernelsDFTRO[mad(2, delta, basePosition)]));
			writeGroupShared(mad(3, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x - cache[1].y - cache[2].x + cache[3].y,
															cache[0].y + cache[1].x - cache[2].y - cache[3].x
														), kernelsDFTRO[mad(3, delta, basePosition)]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();
			switchGSMBank();

		}

#	else // default hardware:
		// Cache all values in registers before performing the transformation

		if(2 == radix) {
			static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
				//	(1, 0) and the complex multiplication yields the identity.
					loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
				multiplyComplex(
					loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				)
			};

			// Wait for all threads to complete their reading.
			GroupMemoryBarrierWithGroupSync();

			// Multiply with the PSF's DFT and write to the scene's DFT texture.
			// (Unroll manually — loop expressions with global memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(
															cache[0] + cache[1],
															kernelsDFTRO[mad(0, delta, basePosition)]
														));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(
															cache[0] - cache[1],
															kernelsDFTRO[mad(1, delta, basePosition)]
														));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();

		} else if(4 == radix) {
			static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

			// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
			const float2 cache[radix] = {
				// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
				//	(1, 0) and the complex multiplication yields the identity.
					loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)),
				multiplyComplex(
					loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)),
					float2(cos(1.0f * angle), sin(1.0f * angle))
				),
				multiplyComplex(
					loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)),
					float2(cos(2.0f * angle), sin(2.0f * angle))
				),
				multiplyComplex(
					loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)),
					float2(cos(3.0f * angle), sin(3.0f * angle))
				)
			};

			// Wait for all threads to complete their reading.
			GroupMemoryBarrierWithGroupSync();

			// Multiply with the PSF's DFT and write to the scene's DFT texture.
			// (Unroll manually — loop expressions with global memory access are not optimized well.)
			writeGroupShared(mad(0, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x + cache[1].x + cache[2].x + cache[3].x,
															cache[0].y + cache[1].y + cache[2].y + cache[3].y
														), kernelsDFTRO[mad(0, delta, basePosition)]));
			writeGroupShared(mad(1, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x + cache[1].y - cache[2].x - cache[3].y,
															cache[0].y - cache[1].x - cache[2].y + cache[3].x
														), kernelsDFTRO[mad(1, delta, basePosition)]));
			writeGroupShared(mad(2, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x - cache[1].x + cache[2].x - cache[3].x,
															cache[0].y - cache[1].y + cache[2].y - cache[3].y
														), kernelsDFTRO[mad(2, delta, basePosition)]));
			writeGroupShared(mad(3, stepSizeInSamples, baseIndex), multiplyComplex(float2(
															cache[0].x - cache[1].y - cache[2].x + cache[3].y,
															cache[0].y + cache[1].x - cache[2].y - cache[3].x
														), kernelsDFTRO[mad(3, delta, basePosition)]));

			// Wait for all threads to complete their writing.
			GroupMemoryBarrierWithGroupSync();

		}
#	endif // default hardware

	return;
}



//------------------------------------------------------------------------------------------------------------------------------
// Performs an inverse fast Fourier transformation from values in group-shared memory and writes the result as packed reals into
//	a texture.
// Used to extract the real values from the inverse DFT and to finalize the inverse transformation.
//	• Radix 2 or 4.
//	• The input is read from the group-shared memory.
//	• The result is written as a packed real number (using the real value of the complex result) to the given texture.
//------------------------------------------------------------------------------------------------------------------------------
void iFFTToRealTexture(
	const uint				radix,
	RWTexture2DArray<float>	destination,
	const uint				channelsIndex,
	const bool				vertical,
	const uint				signalsIndex,
	const uint				radixsIndex,
	const uint				stepSizeInSamples
) {
	// Source: "High Performance Discrete Fourier Transforms on Graphics Processors" (Naga K. Govindaraju, Brandon Lloyd, Yuri
	//	Dotsenko, Burton Smith, and John Manferdelli, Microsoft Corporation), fig. 2.
	const float angle = -6.2831853071795865f / float(stepSizeInSamples * radix) * float(radixsIndex % stepSizeInSamples);
	const uint baseIndex = baseIndexFor(radixsIndex, stepSizeInSamples, radix); // will be optimized away
	const uint3 basePosition = vertical
		? uint3(signalsIndex, baseIndex, channelsIndex)
		: uint3(baseIndex, signalsIndex, channelsIndex);
	const uint3 delta = vertical
		? uint3(0, stepSizeInSamples, 0)
		: uint3(stepSizeInSamples, 0, 0);

	if(2 == radix) {
		static const uint radix = 2; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		float2 cache[radix] = {
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
				loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
			multiplyComplex(
				loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
				float2(cos(angle), sin(angle))
			)
		};

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		destination[mad(0, delta, basePosition)] = (cache[0].y + cache[1].y) / imageSidelength;
		destination[mad(1, delta, basePosition)] = (cache[0].y - cache[1].y) / imageSidelength;

	} else if(4 == radix) {
		static const uint radix = 4; // redeclare 'static' — HLSL doesn't accept parameters as array dimensions

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		float2 cache[radix] = {
			// The first complex multiplication can be omitted: The angle is always zero; the complex exponent evaluates to
			//	(1, 0) and the complex multiplication yields the identity.
				loadGroupShared(mad(0, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
			multiplyComplex(
				loadGroupShared(mad(1, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
				float2(cos(1.0f * angle), sin(1.0f * angle))
			),
			multiplyComplex(
				loadGroupShared(mad(2, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
				float2(cos(2.0f * angle), sin(2.0f * angle))
			),
			multiplyComplex(
				loadGroupShared(mad(3, imageSidelength / radix, radixsIndex)).yx, // swap because inverse
				float2(cos(3.0f * angle), sin(3.0f * angle))
			)
		};

		// (Unroll manually — loop expressions with group-shared memory access are not optimized well.)
		destination[mad(0, delta, basePosition)] = (cache[0].y + cache[1].y + cache[2].y + cache[3].y) / (imageSidelength * imageSidelength);
		destination[mad(1, delta, basePosition)] = (cache[0].y - cache[1].x - cache[2].y + cache[3].x) / (imageSidelength * imageSidelength);
		destination[mad(2, delta, basePosition)] = (cache[0].y - cache[1].y + cache[2].y - cache[3].y) / (imageSidelength * imageSidelength);
		destination[mad(3, delta, basePosition)] = (cache[0].y + cache[1].x - cache[2].y - cache[3].x) / (imageSidelength * imageSidelength);

	}

}











[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void kernelHorizontal(
	const uint3 indexInSignal : SV_DispatchThreadID
) {
	const uint currentChannelsIndex = indexInSignal.z;

	FFTFromRealTexture			(4, PSFRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x,   4);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x,  16);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x,  64);
	TwoFFTsOnGroupSharedMemory	(2, dontInvert, indexInSignal.x, 256);
	FFTToComplexTexture			(4, dontInvert, kernelsDFTRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);

	return;
}

[numthreads(1, IMAGE_SIDELENGTH / 4, 1)] // 4 pixels per thread — best for radix 4
void kernelVertical(
	const uint3 indexInSignal : SV_DispatchThreadID
) {
	const uint currentChannelsIndex = indexInSignal.z;

	FFTFromComplexTexture		(4, dontInvert, kernelsDFTRO, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,   4);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,  16);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,  64);
	TwoFFTsOnGroupSharedMemory	(2, dontInvert, indexInSignal.y, 256);
	FFTToComplexTexture			(4, dontInvert, kernelsDFTRW, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y, 512);

	return;
}

[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void imageHorizontal(
	const uint3 indexInSignal : SV_DispatchThreadID
) {
	// !ARCHITECTURE!
	//	• AMD Radeon HD: Performing the FFT with radix 4-2-4-4-4-4 saves one register.
	const uint currentChannelsIndex = indexInSignal.z;

	FFTFromRealTexture			(4, glareRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
	TwoFFTsOnGroupSharedMemory	(2, dontInvert, indexInSignal.x,   4);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x,   8);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x,  32);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.x, 128);
	FFTToComplexTexture			(4, dontInvert, imagesDFTRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);

	return;
}

[numthreads(1, IMAGE_SIDELENGTH / 4, 1)] // 4 pixels per thread — best for radix 4
void imageVertical(
	const uint3 indexInSignal : SV_DispatchThreadID
) {
	// !ARCHITECTURE!
	//	• AMD Radeon HD: Performing the FFT with radix 4-4-4-4-2-4 and the iFFT with radix 2-4-4-4-4-4 saves ten registers and
	//		37 scratch registers.
	const uint currentChannelsIndex = indexInSignal.z;

	FFTFromComplexTexture		(4, dontInvert, imagesDFTRO, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,   4);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,  16);
	FFTOnGroupSharedMemory		(4, dontInvert, indexInSignal.y,  64);
	TwoFFTsOnGroupSharedMemory	(2, dontInvert, indexInSignal.y, 256);
	FFTOnGroupSharedMemoryWithVerticalPSFMultiplication(4, currentChannelsIndex, indexInSignal.x, indexInSignal.y, 512);

	TwoFFTsOnGroupSharedMemory	(2, invert, indexInSignal.y,   1);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.y,   2);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.y,   8);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.y,  32);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.y, 128);
	FFTToComplexTexture			(4, invert, imagesDFTRW, currentChannelsIndex, vertically, indexInSignal.x, indexInSignal.y, 512);

	return;
}

[numthreads(IMAGE_SIDELENGTH / 4, 1, 1)] // 4 pixels per thread — best for radix 4
void imageInverseHorizontal(
	const uint3 indexInSignal : SV_DispatchThreadID
) {
	// !ARCHITECTURE!
	//	• AMD Radeon HD: Performing the FFT with radix 4-2-4-4-4-4 saves one register.
	const uint currentChannelsIndex = indexInSignal.z;

	FFTFromComplexTexture		(4, invert, imagesDFTRO, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x);
	TwoFFTsOnGroupSharedMemory	(2, invert, indexInSignal.x,   4);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.x,   8);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.x,  32);
	FFTOnGroupSharedMemory		(4, invert, indexInSignal.x, 128);
	iFFTToRealTexture			(4, glareRW, currentChannelsIndex, horizontally, indexInSignal.y, indexInSignal.x, 512);

	return;
}

Zudomon · Beitrag von **Zudomon** » 21.01.2017, 16:22

Krishty hat geschrieben:Der war damals so buggy, dass man ihn kaum benutzen konnte. Mein Shader hat 45 Minuten kompiliert(!) …

Ja das kenne ich auch nur zu gut. Vor allem wenn man anfängt zwei verschachtelte Schleifen zu bauen. Weil ich gerne schnell Feedback habe, lasse ich daher alle meine Shader ohen Optimierungen kompilieren. Das ist wirklich wesentlich schneller. Die Instruktionsgröße ist dadurch durchschnittlich verdoppelt aber zumindest was meine Erfahrungen angeht, läuft es dadurch nicht wirklich langsamer. Also fürs rumtesten ist das vielleicht einen Versuch wert.

Bisher würde mich das FFT Zeugs am ehesten noch für das Glare reizen. Aber dann denke ich mir, ich muss mir auch noch Dinge für später auf bewahren... Version 2.0 :D
Gibt es da eigentlich gute Quellen um das ganze irgendwie zu verstehen? Ich brauche immer lange, bis ich solche Abstrakten Dinge begreife. Klar könnte man sich durch fertigen Quellcode wühlen, aber mir wäre es wichtiger, wenn ich das ganze erstmal prinzipiell begreifen könnte.

Jonathan · Beitrag von **Jonathan** » 24.01.2017, 09:46

Zudomon hat geschrieben: Gibt es da eigentlich gute Quellen um das ganze irgendwie zu verstehen? Ich brauche immer lange, bis ich solche Abstrakten Dinge begreife. Klar könnte man sich durch fertigen Quellcode wühlen, aber mir wäre es wichtiger, wenn ich das ganze erstmal prinzipiell begreifen könnte.

Fourier Transformationen sind nicht unbedingt leicht, deshalb wird es immer eine Weile dauern, die zu verstehen. Aber wenn du ins Thema einsteigen willst, habe ich einen super Tipp für dich:

https://see.stanford.edu/materials/lsof ... all-07.pdf

Die allermeisten anderen Quellen die ich gefunden habe sind ziemlich oberflächlich und irgendwie liest man dann die selben Grundlagen immer und immer wieder anstatt etwas neues zu lernen. Mit dem Buch bin ich noch nicht durch, aber es ist sehr locker und witzig geschrieben, bemüht sich darum verständlich zu sein und ist wirklich umfassend - ich glaube wenn man das durchgearbeitet und verstanden hat, hat man eine gute Grundlage.

Alexander Kornrumpf · Beitrag von **Alexander Kornrumpf** » 24.01.2017, 10:50

http://www.springer.com/de/book/9783642049514

Zudomon · Beitrag von **Zudomon** » 24.01.2017, 15:36

Danke euch beiden für die Buchempfehlungen. Aber leider kenne ich mich. Ich lese da dann die ersten 30 Seiten akribisch durch, weil ich denke, wenn dann will ich es richtig machen. Und dann steige ich aus und fasse das nie mehr an. Bin nicht so die Leseratte. Das muss anders in meinen Kopf. Am besten per Animation oder App zum rumprobieren :D

Zudomon · Beitrag von **Zudomon** » 26.01.2017, 08:17

Ich glaube, meine Vegetation war doch etwas zu grün. Deswegen habe ich jetzt einfach mal ne Colorcorrection darüber gehauen.
FPS und HUD sind mit Absicht auf vielen Bildern aus... irgendwie stört das die Immersion extrem. FPS mäßig liege ich bei den Bildern bei 60 - 80

Noch ein bisschen was fürs Auge:

Krishty · Beitrag von **Krishty** » 26.01.2017, 09:34

Es kommt auch darauf an, welche Jahreszeit du darstellen willst. April/Mai können durchaus so grün sein wie vorher. Für Juni/Juli gefällt mir die neue Palette aber sehr gut!

joeydee · Beitrag von **joeydee** » 26.01.2017, 10:44

Die nahen Details sehen schon längst super aus - wird Zeit sich um die Ferne zu kümmern ;)
Die Felsen im Hintergrund sehen noch wie abgeschliffen und angemalt aus und passen nicht mehr so richtig in die Landschaft. Die dunkleren schattigen Strukturen des Vordergrunds fehlen einfach. Vielleicht auch eine dunkle Linie an der Grasnabe, dort wo man direkt draufblickt, damit diese als Schicht wirkt. Ich weiß wie kostenintensiv Details in der Ferne sind, aber vielleicht fällt dir noch eine günstige Lösung ein um die Optik mehr anzupassen.

Ansonsten Top, immer wieder schön anzuschauen :)

Zudomon · Beitrag von **Zudomon** » 27.01.2017, 05:39

Ich spiele gerade noch ein bisschen mit der Colorcorrection rum. Momentan ist es so, dass ich den Grün Kanal ein wenig auf die anderen übertrage. Nun ist das aber etwas radikal weil es aufs ganze Bild angewendet wird. Dabei ist mir ja eigentlich nur das Grün von der Vegetation zu krass. Die grauen Felsen sollten eigentlich grau bleiben und nicht ins rote kippen.
Die Frage ist also, wäre es nicht besser, statt nachträglich eine Korrektur der Farben global anzuwenden, lieber schon im Vorfeld die Texturen anzupassen?

joggel · Beitrag von **joggel** » 27.01.2017, 07:48

Ich finde das immer unheimlich interessant anzusehen was hier wird, und deine Landschaft vermittelt echt ne sehr schöne Stimmung....wann kommt eigentlich mal was spielbares?^^

Zudomon · Beitrag von **Zudomon** » 27.01.2017, 08:06

joggel hat geschrieben:Ich finde das immer unheimlich interessant anzusehen was hier wird, und deine Landschaft vermittelt echt ne sehr schöne Stimmung....wann kommt eigentlich mal was spielbares?^^

Danke! Ja eigentlich wollte ich ja mehr Gameplay machen. Aber mir würde es schon gefallen, wenn das auch besser auf dem Laptop laufen würde. Deswegen bin ich eigentlich immer noch mit der Optimiererei zugange. Das sich grafisch dann auch ab und zu was tut, ist eher so ne Nebengeschichte. Aber die kann man auch wesentlich besser zeigen.
Oder meintest du überhaupt was zum selbst rum testen?
StoneQuest ist eigentlich immer zugänglich. Aber so dass mal jemand da ist, ist doch recht selten. Ich denke, das liegt wohl am ehesten noch daran, das es schlecht läuft. Ich muss auch immer noch das aktualisieren der Texturen usw. parallelisieren, damit diese Microlags beim drehen und laufen aufhören. Tja. Irgendwie schiebt man das schwere gerne vor sich her.
Instanzing hatte ich jetzt auch eingebaut für die Objekte... damit auf HIGH die DrawCalls von etwa 2000 auf 800 reduziert... aber dadurch ist es wenn überhaupt eher etwas langsamer geworden. Auch da kann ich nach einigen Tests auch eher wieder raten, woran das liegt.

Was Objekte in der Ferne angeht, muss ich leider auch noch drauf verzichten. Das macht das ganze nämlich ziemlich langsam. Da ich für die Objekte noch kein wirklich gutes LOD habe, haben die "low"-Poly Bäume immer noch ein paar Tausend Dreiecke und wenn davon dann ein paar hundert gerendert werden, ist das auch nicht so pralle.
Auf ULTRA werden die Flächen aber auf größere Distanz noch mit Detail versorgt, was die mittlere Entfernung eigentlich doch noch sehr zugute kommt. Auf kosten der FPS versteht sich... hier ein Vergleich:

HIGH - 73 FPS

ULTRA - 39 FPS

joggel · Beitrag von **joggel** » 27.01.2017, 08:13

Oder meintest du überhaupt was zum selbst rum testen?

Na ich meinte, dass man spielen kann.
Im moment ist es ja so, das man "nur" die Welt erkunden kann...aber für ein Spiel ist das ja bissl zu wenig.
Oder soll es mal so wie bei Minecraft werden?

Zudomon · Beitrag von **Zudomon** » 27.01.2017, 08:57

Eigentlich war ja mein Hauptanliegen damals ein Minecraft mit realistischer Grafik. Momentan suchte ich aber Stardew Valley und würde da gerne Elemente übernehmen. Ich weiß nur nicht so recht wie sich Open World MMO damit vereinen lässt.

joggel · Beitrag von **joggel** » 27.01.2017, 10:18

Auf jeden Fall hat dein Stonequest echt potential um auch zu suchten ;)

Zudomon · Beitrag von **Zudomon** » 27.01.2017, 12:00

joggel hat geschrieben:Auf jeden Fall hat dein Stonequest echt potential um auch zu suchten ;)

Nur ein unscheinbarer Satz, der aber extrem motivierend ist. Um ehrlich zu sein habe ich ja ein leicht ungutes Gefühl, dass ich nach etwa 5,5 Jahren immer noch nicht mehr vorweisen kann. Wenn du Tipps hast, was du besonders "suchtend" findest würdest oder überhaupt... ich meine, ja, Minecraft. Aber im Gegensatz zu vielen anderen hat mich das Spiel eigentlich nur solange gefesselt, bis man eine halbwegs brauchbare Unterkunft hatte. Wirklich auf dauer habe ich dann die Lust verloren. Und irgendwie habe ich entsprechend noch gar keine wirkliche Vorstellung, wie mein Spiel sein müsste. Wie gesagt, Stardew Valley, hab ich mir jetzt vor ein paar Wochen geholt und muss sagen, das fesselt mich sehr. Auch wenn es ein abgespeckten 2D Minecraft ist, so ist die Grafik doch wesentlich schöner, weil es künstlerisch wirkt. Die Jahreszeiten da und die unzähligen Spielelemente begeistern mich. So viele Dinge, die da auch unerwartet passieren und immer hat man irgendeine Beschäftigung und kommt nicht mehr von los. Naja, zumindest wird mich das fesseln, bis ich das Gefühl habe, alles da erreicht zu haben. Nur geschickterweise dauert das ganze natürlich ewig... und wenn man das eine Geschafft hat, sind gleich wieder neue Sachen da, die einen dann auch nochmal reizen.
Und das es einer ganz alleine gemacht hat, ist mir natürlich auch sehr sympathisch. Also mich hat seine Leistung echt geflashed, wobei es mich etwas enttäuscht hat, dass auch er vieles an Mechaniken und Dinge, wo ich mir dachte, krass ist das klever gelöst, dann doch auch von anderen Spielen abgekupfert hat. Aber was wirklich neues schaffen ist wohl eh nur eine Wunschvorstellung, die man vielleicht nur Ansatzweise erfüllen kann. Die Kombination, alles in allem ist dann vielleicht was Neues.

Heute Nacht hatte ich den Tipp bekommen, ich sollte das Spiel extrem aufwendig machen... also kleines Inventar, so wie in Wirklichkeit... das man dann auch zum Abtransport wirklich Loren und alles erstmal ranschaffen muss. Und ihm würde es dann gefallen, wenn man eigentlich Wochenlang damit zu tun hat, alles besser und automatisierter zu machen und zu optimieren. Finde ich eigentlich auch nicht schlecht. Umso mehr man für Dinge, die man erschafft, arbeiten muss, umso wertvoller erscheint einem das auch. Bis dann der böse Zudo kommt und die Welt resetten muss :(

Alexander Kornrumpf · Beitrag von **Alexander Kornrumpf** » 27.01.2017, 12:36

Zudomon hat geschrieben: Heute Nacht hatte ich den Tipp bekommen, ich sollte das Spiel extrem aufwendig machen... also kleines Inventar, so wie in Wirklichkeit... das man dann auch zum Abtransport wirklich Loren und alles erstmal ranschaffen muss. Und ihm würde es dann gefallen, wenn man eigentlich Wochenlang damit zu tun hat, alles besser und automatisierter zu machen und zu optimieren. Finde ich eigentlich auch nicht schlecht. Umso mehr man für Dinge, die man erschafft, arbeiten muss, umso wertvoller erscheint einem das auch. Bis dann der böse Zudo kommt und die Welt resetten muss :(

Hat ja für No Man's Sky auch gut funktioniert :)

joggel · Beitrag von **joggel** » 27.01.2017, 13:03

Was ich „suchtend“ finden würde?
Mh…schwer zu sagen.
Ich finde die Atmosphäre in deinem Spiel sehr schön; richtig zum eintauchen. Ich kann mir da gut vorstellen, eine Fantasiewelt zu entdecken oder so.
Auch bin ich ein Freund von Strategie-/Wirtschaftsspielen.
Vlt kann man das irgendwie verbinden/vereinen. Vlt so etwas, dass man ein Dorf oder so hat, und man muss das Dorf wachsen lassen, indem man Rohstoffe besorgt, die dann von unterschiedlichen Dorfbewohnern verarbeitet werden….oder so.
Krieg führen wäre auch cool, aber wenn ich mir eine Armee in deiner Welt vorstelle, dann glaube ich eher, dass das die Rechenleistung überschreitet...

scheichs · Beitrag von **scheichs** » 27.01.2017, 18:05

Auf der Wii U fanden viele "Cube Life:Island Survival" ziemlich suchtend und zwar den Survival Modus, der -zwar ähnlich wie Minecraft- jedoch Storydriven ist. Heisst: Die Entwickler haben sich einen festen Seed ausgesucht und dann darin eine Welt gebaut und mit Gamelogik gefüllt. Das fanden viele Spieler cool. Das Game an sich ist eigentlich ziemlich scheisse gemacht, aber das hat die Leute nicht abgehalten.

Meine Empfehlung also:
1) Storymodus (im Endeffekt ist das bei Stardew Valley glaube ich auch so)
2) Creative-Modus

Ich werde das ähnlich machen, werde mir ein altes RPG nehmen und das nachbauen.

Oprah Noodlemantra · Beitrag von **Oprah Noodlemantra** » 27.01.2017, 21:03

Du könntest auch ein Kreuzfahrtschiff untergehen lassen und die Spieler sind die, die sich an Land gerettet haben. Neue Spieler können später noch per Rettungsboot angespült werden. Die Überlebenden sind dabei natürlich weniger auf Burgen und Diamantminen aus, sondern eher auf zweckmäßige Bauten und sie versuchen auch nicht Bäume mit bloßen Händen zu spalten. Werkzeug ist selten und kostbar, muss ja erstmal angespült werden. Die Spieler konkurrieren und vielleicht gibt es auch Nachbarinseln, auf denen sich ebenfalls Überlebende durchschlagen wollen. Die aber den Bewohnern der anderen Insel ihre schicke Insel ohne Krokodile neiden.

Nach einiger Zeit kommt dann Rettung an. Damit endet das Spiel eigentlich. Unglücklicherweise bringen die Retter schlechte Neuigkeiten und weils gar nicht mehr schlimmer werden kann, sind die Retter auch gar keine Retter, sondern die letzten Überlebenden eines nuklear geführten Kriegs.

Es vergeht viel Zeit des Nachdenkens.

Dann kommen doch nochmal Leute und erklären, dass das alles nur ein missglücktes Reality-TV Experiment war. Ihr eigentlicher Grund war aber, die Spieler darüber zu informieren, dass die Produktionsfirma beschlossen hat, die Sendung doch nicht auszustrahlen. Man habe konzeptionelle Bedenken.

Eben ein Survival-Spiel, in dem man eher kurzfristige Ziele verfolgt und keine Plantagen anlegen will und das auch ein Ende hat. Aber das wird vermutlich viel zu kompliziert: Richtiges Wetter gehört in so ein Spiel und Gegenstände müssen sich frei manipulieren lassen: Ein kaputtes Rettungsboot wird zum Dach umfunktioniert, sowas eben.

Zudomon · Beitrag von **Zudomon** » 27.01.2017, 22:57

Da habt ihr auf jeden Fall einige Denkanstöße. Vielleicht ist das ein oder andere wirklich umsetzbar.
Grundsätzlich sollte es bei StoneQuest allerdings schon so sein, dass man einen anderen Planeten kolonisieren soll. Um der Menschheit eine neue Heimat vorzubereiten. Allerdings wird man nur auf einen kleinen Teil des Planeten Fuss fassen können. Denn dieser ist auch nicht unbewohnt. Und man kann dann auch, zumindest wenn ich das hinterher so umgesetzt bekomme, in die Außenbezirke um gegen die Aliens kämpfen... und vielleicht neue Territorien erobern.
Aber das man vielleicht auch so eine Art Story mit rein baut, vielleicht dann sogar eine globale Story finde ich gar nicht schlecht.

joggel · Beitrag von **joggel** » 28.01.2017, 07:18

@Oprah
Beste Story!!! :D
Ich musste grinsen als ich das gelesen habe...

joeydee · Beitrag von **joeydee** » 28.01.2017, 11:11

Zudomon hat geschrieben:dass man einen anderen Planeten kolonisieren soll. Um der Menschheit eine neue Heimat vorzubereiten.

Weiterer Denkanstoß: Nach kurzer Survival-Zeit (in Kapitellänge, z.B. als Tutorial) stoßen deine Pioniere auf erste Artefakte. Da war schonmal jemand zu Besuch. Gleicher Plan. Und war gescheitert.
Sachen finden, nutzbar machen, vervollständigen, herausfinden wie sie funktionieren. Aufzeichnungen entziffern. Herausfinden wer das war und was passiert ist.
Dadurch werden Belohnungen freigeschaltet. Alle Teile einer Waffe/eines Werkzeugs finden, Schemazeichnungen finden damit man Dinge nutzen kann, Anlagen Monitore zum Laufen bringen die Pläne zeigen, Metalldetektoren um mehr Technikteile in der Landschaft zu finden, Dinge upgraden.
Langfristige Ziele: Eine kleine kaputte Basis finden, Stück für Stück wieder zum Laufen bringen, sie verstehen lernen, danach kann man neue nach gleichem Schema bauen (Komponenten notwendig wie Wasser, Strom, 3D-Drucker, Materialwandler, ..., ). Durch gesicherte Gebiete irgendwann Zugang zu einem anderen Tal bekommen etc. Einen Steinschmelzer entwickeln um schneller Höhlen zu bauen. Hover-Transportmittel. "Luftpost"-Tunnel für Schnellreisen. Hoffnung keimen lassen, dass man eines Tages ein Schiff bauen kann um wieder zu entkommen.
Direkt Hinweise auf die früheren Entdecker gibt es keine, sprich keine Überlebenden oder Leichen. Das treibt einen jedenfalls zu Entdeckungstouren, nicht nur Nahrungsbeschaffung/Aufbau, und man hat immer ein ungutes Gefühl im Nacken (Sind die noch da? Im nächsten Tal? Kommen sie zurück? Seuche, Angriff? Wie sehen die wohl aus? Warum keine Leichen? Droht dasselbe Schicksal?). Du kannst so eine beliebig lange Story erzählen. Immer nur ein paar Brocken hinwerfen die etwas aufdecken, aber gleichzeitig neue Fragen aufwerfen (siehe Serien wie "Lost"), Gerüchte in die Welt streuen (Die Firma verfolgt finstere Pläne und weiß was los war, die Besatzung nicht, siehe "Aliens").
Weiteres mögliches Element: nach einiger Zeit (Jahr) werden weitere Pioniere erwartet. Man sieht am Horizont ein Schiff abstürzen. Gibt es Überlebende? Kann man hingelangen? Eine Verbindung aufbauen? Was ist passiert, warum ging das ebenfalls schief? Usw...

joggel · Beitrag von **joggel** » 28.01.2017, 13:46

Was mir noch zu Oprahs Story einfällt:
Vielleicht sich an "Herr der Fliegen" so story-mäsig anlehnen. So wegen künstlerisch wertvoll.
Kommt vlt auch cool, wenn sich ein Spiel an einem stück KulturGut orientiert :)

Zudomon · Beitrag von **Zudomon** » 28.01.2017, 20:30

Irgendwie ist das blöde, dass sich das für mich gerade so anfühlt wie: Ich hab da gerade ein Loch in der Erde und das Fundament steht so halb und bis dann das Hochhaus oder Palast darauf steht, könnte noch ETWAS Zeit vergehen... mist. Also eure Ideen hören sich sehr gut an. Habe nur das Gefühl, allein bis ich da Ansatzweise wirklich sowas umgesetzt bekomme könnte noch es noch dauern.

Ich habe mal noch eine Frage was die Grafik angeht. Ich sitze hier vor nem 48" Bildschirm, etwa einen Meter von weg. Wenn ich nun so eine Nahaufnahme mache, dann wirkt das schon gut. Aber wenn ich mich dann wirklich mal 3 Meter weit weg stelle, dann wirkt es bestimmt drei mal so realistisch. Zumindest für mich :lol: . Meine Frage ist, woran liegt das? Sieht man dann die ganzen Imperfektionen im Bild nicht, oder hat das Bild noch nicht genug Detail? Vor Jahren wurde hier ja schon im Forum angemerkt, dass die Bilder krass aussehen, solange die nur in der kleinen Vorschau angezeigt werden aber man in groß dann schon sieht, dass es nach Computergeneriert aussieht.

Hier noch ein Testbild... da hab ich ja den Showroom wieder vollgespammt... sorry dafür... werde mal die anderen Beiträge editieren, damit nicht alle Bilder da landen.

EDIT: Oder liegt es jetzt einfach nur noch daran, dass die Szene unrealistisch ist, viel zu wenig verschiedene Pflanzen und anderes, also was zur "unvollkommenheit" beiträgt, Steine, Äste. Oder ist es einfach das gefaked Licht? Ist ja nur SSAO, Shadowmaps usw. im Gegensatz zum realistischen Pathtracing liegen da ja Welten.

Krishty · Beitrag von **Krishty** » 28.01.2017, 21:55

Zudomon hat geschrieben:Hier noch ein Testbild... da hab ich ja den Showroom wieder vollgespammt... sorry dafür... werde mal die anderen Beiträge editieren, damit nicht alle Bilder da landen.

Dafür ist der doch da!

Das mit der Skalierung wüsste ich auch gern. Ich persönlich empfinde deine Schnappschüsse immer als sehr verschwommen (die Chromatische Aberration hatte das noch verschlimmert, darum habe ich sehr dagegen gewettert) und ab 50–60 % Originalgröße wird’s scharf und sieht gut aus.

Du kannst ja alles intern in vierfacher Auflösung berechnen (doppelte Seitenlänge) und zum Anzeigen herunterskalieren. Vielleicht sogar mit Lanczos. Falls es dann auch aus der Nähe realistisch aussieht, hast du die Ursache.

xq · Beitrag von xq » 28.01.2017, 22:23

Ich finde irgendwie die neue Farbsättigung nicht so geil wie vorher. Vielleicht finde ich ja die Muse, das ganze mal nach meinem Geschmack abzustimmen, hab da ein paar passende Sommerbilder rumliegen, die die Farbstimmung liefern

Zudomon · Beitrag von **Zudomon** » 28.01.2017, 23:10

@Krishty
Ja ich werde mal in der Richtung testen.
Ich finde die chromatische Aberation eigentlich extrem krass, weil ich das Bild danach nicht mehr so sehr als Computergeneriert empfinde. Leider leider ist dieser Effekt extremst aufwendig. Klar, wie andere benutze ich wohl die Billigversion davon... das Ergebnisbild nochmal je nach Farbe skalieren. Vorher hatte ich das nur nach RGB gemacht, jetzt nehme ich R, RG, G, GB und B. Aber wenn man das exakt machen wollen würde, dann müsste man jeden Farbkanal auch rendern. Dann würde das auch nicht mehr so verschwommen, weil es beim Rastern ja skaliert würde und nicht erst das Ergebnisbild nachbearbeitet. Ich hoffe, ich hab das jetzt verständlich wiedergegeben... also statt Nachbearbeiten müsste man pro Kanal oder sogar öfter, wenn man weiter splittet, sprich pro Lichtwellenlänge seperat rendern.

@MasterQ32
Wäre cool, wenn du meine Bilder in der Nachbearbeitung verbessern kannst. Was ich auch gemerkt habe, dass man alles halt so relativ wahr nimmt erschwert das ganze ungemein. Man verändert was, im ersten Moment sieht es toll aus... auf dauer aber dann nicht. Oder man gewöhnt sich zu sehr dran oder es sieht wieder an anderen Stellen gut aus. Also ich bin da immer am Werte hin und her schupsen, so dass das Gesamterlebnis nach und nach besser wird (manchmal verschlimmbessert man da aber auch).

Krishty · Beitrag von **Krishty** » 28.01.2017, 23:40

Ich komme mit CA gut klar, wenn sie halt nur einen Pixel oder weniger dick ist (z.B. hier). Dein Bild ist aber eh schon verschwommen, und auf Schnappschüssen wie hier sieht mir das eher nach zwei bis drei Pixeln aus …

Zudomon · Beitrag von **Zudomon** » 29.01.2017, 00:14

Krishty hat geschrieben:Ich komme mit CA gut klar, wenn sie halt nur einen Pixel oder weniger dick ist (z.B. hier). Dein Bild ist aber eh schon verschwommen, und auf Schnappschüssen wie hier sieht mir das eher nach zwei bis drei Pixeln aus …

Also ich gebe dir recht. Ich mag das auch nicht, wenn das Bild zu verschwommen wird. Und ich kenne viele negativbeispiele, wo CA zu übertrieben angewendet wird...
z.B. Alien Isolation
Dagegen sind es bei mir eben nur ganz wenige Pixel. Aber ich finde, bei Witcher da sehe ich nur, dass sich das Bild nach außen hin um vielleicht 1 Pixel verschiebt, aber von der CA erkenne ich darauf gar nichts. Ich finde, da kann man den Effekt dann gleich weg lassen.

Zudomon · Beitrag von **Zudomon** » 29.01.2017, 00:31

Die Post-Effekte werden in Monitorauflösung berechnet, wenn ich mich jetzt nicht irre.

Hier mit 100% Auflösung (extra als PNG ;) ):

Und bei 200%:

Irgendwie macht das schon viel aus, oder bilde ich mir das nur ein?
Und ich glaube auch, am meisten ändert sich die Texturschärfe an sich.

ZFX + Developia

[Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!

Re: [Projekt] StoneQuest lebt noch!