Shader branching performance

Theodox · May 31, 2018, 10:07pm

View in #shaders on Slack

@sneddon.mark: Question: I’ve been told that branching in the pixel shader is expensive. I’ve also been told branching isn’t that expensive as long as all branches have the same or a similar number of instructions.
Which is it?

@spiderspy: It is
I have no idea about the latter, but I think it is smarter to lerp instead if you can.
Lerp works like an if statement without branching
Trickier to write and understand though

@theodox: a bit of both. A ‘soft’ branch, like an if statement, is basically like a lerp with an alpha of one or zero but with a little overhead – it runs both branches. A ‘hard’ branch (with the [branch] attribute) has more overhead, but only executes the code path that is actually taken. The cost is high enough that you should only use it to distinguish between significant things.
So, a regular if is basically a glorified lerp, and a [branch] if is only worth using to go down a radically different path - like not sampling a texture in the distance, or turning a shader feature off entirely
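To make the "glorified lerp" idea concrete, here is a minimal sketch in Python standing in for shader code. The names `lerp`, `cheap_path`, `costly_path`, `select_flattened` and `select_branched` are made up for illustration and are not from the thread. It only shows the difference in what gets evaluated; it says nothing about what a real shader compiler will emit.

```python
def lerp(a, b, t):
    """Linear interpolation: returns a when t == 0 and b when t == 1."""
    return a * (1.0 - t) + b * t


def cheap_path(x):
    # Hypothetical stand-in for one side of the if.
    return x * 0.5


def costly_path(x):
    # Hypothetical stand-in for the other side of the if.
    return x * x + 1.0


def select_flattened(condition, x):
    """'Soft' branch: both paths are always evaluated, then one is kept."""
    t = float(bool(condition))  # alpha of exactly 0.0 or 1.0
    return lerp(cheap_path(x), costly_path(x), t)


def select_branched(condition, x):
    """'Hard' branch: only the chosen path is evaluated."""
    if condition:
        return costly_path(x)
    return cheap_path(x)


# Both versions return the same value; they differ only in how much work runs.
print(select_flattened(True, 3.0), select_branched(True, 3.0))
print(select_flattened(False, 3.0), select_branched(False, 3.0))
```

The flattened version pays for `cheap_path` and `costly_path` every time, which is why it is a way to organize code and not a way to save work.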

@redmazer: Aren’t there some functions that can get messed up if it branches? Like derivatives

Also @theodox, your terminology confuses me a bit, would you mind giving some quick pseudo code of a hard and soft branch?

@theodox: Semantics-wise, it’s the difference between
if
and
[branch] if
in HLSL. But it’s a bit more complicated than that at runtime.
The accepted answer to this is closer to the real truth than my rule of thumb however:
https://stackoverflow.com/questions/37827216/does-if-statements-slow-down-my-shader

Stack Overflow: Does If-statements slow down my shader?

TLDR – a ‘soft’ branch always runs both sides, so it’s not an optimization trick but rather a code organization thing. A hard branch imposes a minimum cost, which can be higher or not so high depending on things like how your pixels get sent to the hardware and your compiler. I’ve been told – though I am certain this is implementation specific for different compilers and hardware – that a branch is roughly equivalent to a texture sample in perf cost, so it doesn’t win much unless it’s hiding multiple texture taps or very expensive math
Most non-mobile HW is more likely to be limited by texture taps or overdraw than by pure math
but as with all perf related stuff the truth is what you measure, all rules of thumb are only approximations and your results may vary
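The trade-off above can be put into a toy cost model. This is only a sketch under loud assumptions: costs are counted in made-up "texture sample equivalents", the branch overhead of 1 follows the rule of thumb quoted above, and a group of pixels that disagrees on the condition is assumed to pay for both sides. The function names and numbers are hypothetical, and real hardware should be measured, not modelled.

```python
def flattened_cost(cost_if_true, cost_if_false):
    """Soft branch: both sides always run, then one result is selected."""
    return cost_if_true + cost_if_false


def branched_cost(cost_if_true, cost_if_false, taken, branch_overhead=1):
    """Hard branch for one group of pixels that executes together.

    taken: one boolean per pixel in the group (True = condition holds).
    Assumption: a side is paid for if at least one pixel in the group needs it.
    """
    cost = branch_overhead
    if any(taken):
        cost += cost_if_true
    if not all(taken):
        cost += cost_if_false
    return cost


# Hypothetical shader: 8 texture taps when near, 1 unit of flat color when far.
near, far = 8, 1
print(flattened_cost(near, far))                            # both sides: 9
print(branched_cost(near, far, [False] * 64))               # all far: 2
print(branched_cost(near, far, [True] * 64))                # all near: 9
print(branched_cost(near, far, [True] * 32 + [False] * 32)) # mixed group: 10
```

Under these assumptions the hard branch only pays off when most groups agree on the condition and the skipped side is much more expensive than the overhead.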

@spiderspy: So you should know the hardware you are writing for? AKA it makes sense for consoles but not for pc?

@theodox: Not exactly. PC and console will be broadly similar, but there are implementation specific things in each platform compiler that might change the performance envelope. Mobile is usually rather different; most mobile hardware has a different mix of compute to texture sample ability than PC/Console does.
You should definitely figure out where the weak spot is. An Xbox is different from a Scorpio, for example, because the Xbox has slow memory, so many things that stress moving memory around, particularly deferred shading with big fat buffers, are not great on Xboxes but fine on Scorpios
Mobile is usually less math-friendly than console because of battery and heat constraints, but different mobile chips are different
So the ‘ideal’ shader for a given platform might not be ideal for another, but a lot depends on how much you care: 2% differences probably don’t matter but 15% does

@andrewamundrud: Out of interest, what did you need the branch for?

@theodox: I used it a lot for lodding the terrain shader in State of Decay – you had to see the same asset right at your feet and a mile away, and it covered half the screen. So it hard-branched to flat colors instead of textures after a certain distance – and since you could potentially have sampled 8-10 textures in a single pixel far away it was a huge perf win
but the wavefronts were pretty aligned since essentially there was a horizontal line in screen space where that effect turned on – not too many 4x4 pixel blocks straddled that line

@andrewamundrud: I was with you up to the wavefronts. Are you saying it was difficult to determine which lod to use on the horizon?

@theodox: a wavefront is roughly a unit of concurrent work being sent through the hardware. The number of wavefronts that can be in flight at once depends on the complexity of the shader. If all the pixels in a wavefront have the same structure (i.e., they all go through the same code path) things are better; if they don’t, things are worse
In this case the depth cutoff meant that more or less half of what I was drawing went down a single path, and the rest down another. Not exactly true, but way better than having different paths in every block of pixels
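A rough way to see why a horizontal cutoff is friendly to hard branches is to count how many pixel blocks contain both outcomes of the condition. The sketch below assumes 8x8 tiles (64 pixels, matching a 64-thread wavefront) purely for illustration; how pixels are really packed into wavefronts is hardware specific, and `count_divergent_tiles` is a hypothetical helper, not an API.

```python
def count_divergent_tiles(width, height, tile, predicate):
    """Return (divergent, total) tile counts for a screen of width x height.

    A tile is divergent when predicate(x, y) is True for some of its pixels
    and False for others, so a hard branch there has to run both paths.
    """
    divergent = 0
    total = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            outcomes = {
                predicate(x, y)
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width))
            }
            total += 1
            if len(outcomes) > 1:
                divergent += 1
    return divergent, total


# A horizontal cutoff at row 20: only the one row of tiles it crosses diverges.
print(count_divergent_tiles(64, 64, 8, lambda x, y: y >= 20))

# A per-pixel checkerboard condition: every tile diverges.
print(count_divergent_tiles(64, 64, 8, lambda x, y: (x + y) % 2 == 0))
```

With the cutoff, 8 of the 64 tiles straddle the line; with the checkerboard, all 64 do, which is the "different paths in every block of pixels" case.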
> A SIMD unit can have up to 10 wavefronts in flight at once. Each wavefront contains 64 threads. Hence a SIMD unit can have up to 640 threads in flight at once (in multiples of 64). The scheduler will take the pixels/vertices that need to be processed, allocate one thread per pixel/vertex, and then try to group up to 64 threads together into a wavefront. That bundle of threads is then given to a SIMD, which runs the shader code. The number of wavefronts that ‘fit’ into a SIMD depends on the complexity of the shader code. For simple shaders, you can squeeze 10 wavefronts at a time into a SIMD, but for complex shaders you may only be able to fit one or two wavefronts into a SIMD.
> This is because different shaders require different numbers of temporary registers, which are stored in the SIMD’s register array. Say the SIMD has 1000 registers in total – if a shader uses 100 or less, then you can fit 10 (or more) “instances” of that shader into the register array. If a shader uses 500 temporary registers, then only two “instances” of that shader will fit into the SIMD - so the SIMD will only accept two concurrent wavefronts. Each “register” actually contains 64 floats – which is why this calculation is done for wavefronts and not threads. One register is used by a wavefront to store a value for each of its threads.
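The arithmetic in the quote can be written down as a small sketch. The defaults (1000 registers, a cap of 10 wavefronts, 64 threads per wavefront) are the illustrative figures from the quote, not the specs of any particular GPU, and the function names are made up here.

```python
def wavefronts_in_flight(registers_per_shader, total_registers=1000, max_wavefronts=10):
    """How many wavefronts fit in one SIMD, limited by register use.

    Each register already holds one value per thread of a wavefront,
    so the budget is counted per wavefront, not per thread.
    """
    if registers_per_shader <= 0:
        raise ValueError("registers_per_shader must be positive")
    return min(max_wavefronts, total_registers // registers_per_shader)


def threads_in_flight(registers_per_shader, threads_per_wavefront=64):
    """Threads in flight on one SIMD under the same illustrative figures."""
    return wavefronts_in_flight(registers_per_shader) * threads_per_wavefront


print(wavefronts_in_flight(100), threads_in_flight(100))  # simple shader
print(wavefronts_in_flight(500), threads_in_flight(500))  # register-heavy shader
```

With these figures a shader using 100 registers reaches the cap of 10 wavefronts (640 threads), while one using 500 registers fits only 2 wavefronts (128 threads).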