| This blog is sponsored and supported by Voxgig. |
|---|
![]() |
Motivated and inspired by Decl. In development and soon to be open-sourced. |
|---|
Table Of Contents
- Switch Case: Peak Performance
About Author
My name is Aleksandar Milenkovic. I am a senior software engineer at Voxgig and a self-driven coder, an autodidact, that creates software and improves its quality at any level.
If you have any feedback or questions, feel free to reach out to me via LinkedIn.
Motivation
Having worked on the implementation of the programming language, Decl for quite a long time now - I needed to squeeze the water out of the stone in terms of all the complier optimizations for the switch case statement out there as any operation in Decl is exclusively determined by a switch case being a dynamic programming language where all the variables and declarations are essentially of any type.
For example, the following code in Decl requires as many as three JMP instructions before any other operations are done.
v = 1 + 1;
// Translates into
Value v = plus(makeNumber(1), makeNumber(1));
// plus(Value left, Value right); JMP, JMP // Is left a number? Is right a number?
// v.~Value(); JMP // Is v an object to be released?
Without the properly optimized switch case statement, this results in about 9 assembly instructions at face value!
In my opinion, I understood that we can do much better than that and that this is not feasible for millions upon millions of operations that can transpire in Decl, and so here we are!
I hope you find all of the information in this article useful and applicable to your code in order to reach that peak performance via micro optimization.
Source Code
The source code of this blog is all in one file, at the Github repository where it is also hosted. The code contains the main function running all the benchmarks in order.
Introduction
This article is the final conclusion of S2S Compilers: Understanding Switch Case Statements blog.
It is here that we finally get to fully and truly optimize the switch case statement so that the developer is confident of using such optimizations in their codebase without the excruciating pain of complexity, and nightmare maintenance problems and concerns each time there is, all but, a small addition to the code logic.
The main focus of the S2S Compilers: Understanding Switch Case Statements blog is to get the readers up to speed of what the switch case is and to contextualize various implementations that the developer can attempt in order to reach the peak performance.
Our conclusion of the previous blog was:
Although there is a lot to unpack in this blog, the best approach is to use the switch statement, as the GCC compiler has a hard time optimizing the high-level C implementations as discussed above. If you're determined to optimize further and want to remove any redundant
ASMinstructions that may impact your code, consider using 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands, which allows you to craft your assembly from the ground up - untouched by the optimizations such as-O1,-O2, or-O3.
Contrary to the statement above - after doing some research, I have finally managed to put together a jump table generated by the high-level switch case statement that truly uses the smallest number of assembly instructions overall without any necessity of crafting any assembly code from the group up as previously concluded.
Without further ado, let's dive in and speed up our code!
Switch Case Optimization
In the previous blog, we have primarily focused on the following switch case implementation.
void _switch_run(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
case 4: {
r = 4;
break;
}
case 5: {
r = 5;
break;
}
case 6: {
r = 6;
break;
}
case 7: {
r = 7;
break;
}
case 8: {
r = 8;
break;
}
case 9: {
r = 9;
break;
}
case 10: {
r = 10;
break;
}
default: {
r = 11;
break;
}
}
}
Generating the following assembly code:
_switch_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L2
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L14
.quad .L13
.quad .L12
.quad .L11
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
.L14:
mov DWORD PTR [rbp-4], 0
jmp .L15 # break
.L13:
mov DWORD PTR [rbp-4], 1
jmp .L15
.L12:
mov DWORD PTR [rbp-4], 2
jmp .L15
.L11:
mov DWORD PTR [rbp-4], 3
jmp .L15
.L10:
mov DWORD PTR [rbp-4], 4
jmp .L15
.L9:
mov DWORD PTR [rbp-4], 5
jmp .L15
.L8:
mov DWORD PTR [rbp-4], 6
jmp .L15
.L7:
mov DWORD PTR [rbp-4], 7
jmp .L15
.L6:
mov DWORD PTR [rbp-4], 8
jmp .L15
.L5:
mov DWORD PTR [rbp-4], 9
jmp .L15
.L3:
mov DWORD PTR [rbp-4], 10
jmp .L15
.L2: # Default Case Label
mov DWORD PTR [rbp-4], 11
nop
.L15: # Break Label
nop
pop rbp
ret
This switch case, while safe due to the default case, can be barely optimized further due to its dangerous nature.
For instance, we can never (even theoretically) optimize out the conditional JUMP that precedes the actual direct JMP of our switch case table.
cmp DWORD PTR [rbp-20], 10
ja .L2
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
This is due to the following concerns:
Removing the
ifstatement andCMPis more performant but the cost on the programmer's side of things is that the programmer has to make their code very strict with covering all the possible values and cases since this eliminates thedefaultcase. Note that if the subscript is out of bounds - the programmer is going to end up withSegmentation fault (core dumped)at runtime. All in all, this can lead to premature optimization.Here is the exact log from
valgrind, showing that the program cannot jump to the invalid address.valgrind -s ./out.out ==49387== Jump to the invalid address stated on the next line ==49387== at 0x0: ??? ==49387== by 0x1097BC: main (in /src/out.out) ==49387== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==49387== ==49387== ==49387== Process terminating with default action of signal 11 (SIGSEGV) ==49387== Bad permissions for mapped region at address 0x0 ==49387== at 0x0: ??? ==49387== by 0x1097BC: main (in /src/out.out)
We can effectively eliminate this concern by using a strict range of enumeration values where we cover each and every case.
For the safety of our code at runtime, this restriction is absolutely crucial for the elimination of the CMP and JA overhead, which can indeed be considered huge overhead in the grand scheme of things.
enum class V {
NUM,
STRING,
ARR,
DICT,
FUNC,
};
int __switch_case(V v) {
volatile int i = 0;
switch(v) {
case V::NUM: {
i = 1;
break;
}
case V::STRING: {
i = 2;
break;
}
case V::ARR: {
i = 3;
break;
}
case V::DICT: {
i = 4;
break;
}
case V::FUNC: {
i = 5;
break;
}
default: {
break;
}
}
return i;
}
Generating the following assembly output:
__switch_case(V):
mov DWORD PTR [rsp-4], 0
cmp edi, 4
ja .L22
mov edi, edi
jmp [QWORD PTR .L24[0+rdi*8]]
.L24:
.quad .L28
.quad .L27
.quad .L26
.quad .L25
.quad .L23
.L28:
mov DWORD PTR [rsp-4], 1
.L22:
mov eax, DWORD PTR [rsp-4]
ret
.L27:
mov DWORD PTR [rsp-4], 2
jmp .L22
.L26:
mov DWORD PTR [rsp-4], 3
jmp .L22
.L25:
mov DWORD PTR [rsp-4], 4
jmp .L22
.L23:
mov DWORD PTR [rsp-4], 5
jmp .L22
Direct JMP Instruction Optimization
Let's get to the crux of the matter and optimize out the following before our actual direct JMP.
mov DWORD PTR [rsp-4], 0
cmp edi, 4
ja .L22
This GCC-specific implementation will do the trick.
int _switch_case(V v) {
volatile int i = 0;
switch(v) {
case V::NUM: {
i = 1;
break;
}
case V::STRING: {
i = 2;
break;
}
case V::ARR: {
i = 3;
break;
}
case V::DICT: {
i = 4;
break;
}
case V::FUNC: {
i = 5;
break;
}
default: {
__builtin_unreachable();
}
}
return i;
}
🚧 Note
The
__builtin_unreachable()function is a non-standard, compiler-specific builtin function (available in GCC and Clang).For more information, refer to 7.12 Other Built-in Functions Provided by GCC.
What the __builtin_unreachable() function does is that it informs the compiler that the control flow will never reach that point in the program, which means that the compiler can safely remove dead code given that we will never reach the default (out-of-bounds) case having covered each and every enumeration value of enum class V.
📘 Note
The C++ Programming Language plays a huge role here as it gives us the ability to strictly specify the type of the enumeration range so that no
runtimefaults can be caused. Withenum classbeing of static type, this validation, needless to say, is done at compile time.For more detailed overview, refer to C++ Language | Enumeration declaration.
As a result, we get the following assembly output.
_switch_case(V):
mov DWORD PTR [rsp-4], 0
mov edi, edi
jmp [QWORD PTR .L15[0+rdi*8]]
.L15:
.quad .L19
.quad .L18
.quad .L17
.quad .L16
.quad .L14
.L19:
mov DWORD PTR [rsp-4], 1
.L20:
mov eax, DWORD PTR [rsp-4]
ret
.L18:
mov DWORD PTR [rsp-4], 2
jmp .L20
.L17:
mov DWORD PTR [rsp-4], 3
jmp .L20
.L16:
mov DWORD PTR [rsp-4], 4
jmp .L20
.L14:
mov DWORD PTR [rsp-4], 5
jmp .L20
It is absolutely safe to say that we have successfully eliminated the overhead of CMP and JA by literally removing those two assembly instructions altogether. Now, we jump straight to (no pun intended) the respective label of the jump table.
Break vs Return
While you may think that we are through with the optimization of our switch case, there is one last touch to hit the peak performance of the switch case statement.
This is the break label of the switch case statement that makes a JMP to get out of the switch case so that we don't possibly get stuck in an infinite loop.
This is the .L20 label that the code needs to jump to where there is the final RET instruction that pops the return address off the stack (of the initial CALL instruction).
_switch_case(V):
mov DWORD PTR [rsp-4], 0
mov edi, edi
jmp [QWORD PTR .L15[0+rdi*8]]
.L15:
.quad .L19
.quad .L18
.quad .L17
.quad .L16
.quad .L14
.L19:
mov DWORD PTR [rsp-4], 1
.L20:
mov eax, DWORD PTR [rsp-4]
ret
.L18:
mov DWORD PTR [rsp-4], 2
jmp .L20
.L17:
mov DWORD PTR [rsp-4], 3
jmp .L20
.L16:
mov DWORD PTR [rsp-4], 4
jmp .L20
.L14:
mov DWORD PTR [rsp-4], 5
jmp .L20
What if we can reduce the number of assembly instructions by removing the JMP where the compiler produces the assembly output similar to the following for each and every switch case label?
.L14:
mov DWORD PTR [rsp-4], 5
mov eax, DWORD PTR [rsp-4]
ret
This is made possible with the following C/C++ code implementation.
int _switch_case_return(V v) {
volatile int i = 0;
switch(v) {
case V::NUM: {
i = 1;
return i;
}
case V::STRING: {
i = 2;
return i;
}
case V::ARR: {
i = 3;
return i;
}
case V::DICT: {
i = 4;
return i;
}
case V::FUNC: {
i = 5;
return i;
}
default: {
__builtin_unreachable();
}
}
return i;
}
This produces cleaner assembly with no extra JMP instruction of the break label whatsoever.
_switch_case_return(V):
mov DWORD PTR [rsp-4], 0
mov edi, edi
jmp [QWORD PTR .L6[0+rdi*8]]
.L6:
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L5
.L10:
mov DWORD PTR [rsp-4], 1
mov eax, DWORD PTR [rsp-4]
ret
.L9:
mov DWORD PTR [rsp-4], 2
mov eax, DWORD PTR [rsp-4]
ret
.L8:
mov DWORD PTR [rsp-4], 3
mov eax, DWORD PTR [rsp-4]
ret
.L7:
mov DWORD PTR [rsp-4], 4
mov eax, DWORD PTR [rsp-4]
ret
.L5:
mov DWORD PTR [rsp-4], 5
mov eax, DWORD PTR [rsp-4]
ret
📘 Keep in mind!
It is important to keep in mind that the
returnstatement terminates the rest of the code execution (as it makes it unreachable) so make sure to code your logic carefully!
With all of our functions implemented, let's put them into practice by benchmarking its results under heavy scenarios by wrapping them into very extensive loops. This is usually a common practice to benchmark your code in general.
Benchmark Results and Conclusion
Assembly instruction count can matter, but latency, branching, and memory behavior are usually more important.
Let's benchmark all of the functions above and see the differences for ourselves!
📘 Note
Note that, as opposed to Computed GOTO: Goto Jump Table with Labels in the previous blog, all of our functions can now be inlined. So while we are safely using our switch case, we are also capitalizing on all the optimizations that the compiler itself can make.
#include <iostream>
#include <cassert>
#include <ctime>
#define start_time clock_t s_t_a_r_t = clock();
int main() {
const size_t TEST_BOUNDARY = 1000000;
const size_t RANGE = 100;
std::cout << "START: SWITCH_CASE_CONDITION DEFAULT BREAK" << std::endl;
{
double _SUM = 0;
volatile V _t = V::NUM;
for(size_t i = 0; i < RANGE; i++) {
start_time;
for(size_t i = 0; i < TEST_BOUNDARY; i++) {
assert(__switch_case(_t) != 0);
}
_SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
}
std::cout << _SUM/RANGE << std::endl;
}
std::cout << "END: SWITCH_CASE_CONDITION DEFAULT BREAK" << std::endl;
std::cout << "START: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE" << std::endl;
{
double _SUM = 0;
volatile V _t = V::NUM;
for(size_t i = 0; i < RANGE; i++) {
start_time;
for(size_t i = 0; i < TEST_BOUNDARY; i++) {
assert(_switch_case(_t) != 0);
}
_SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
}
std::cout << _SUM/RANGE << std::endl;
}
std::cout << "END: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE" << std::endl;
std::cout << "START: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK" << std::endl;
{
double _SUM = 0;
volatile V _t = V::NUM;
for(size_t i = 0; i < RANGE; i++) {
start_time;
for(size_t i = 0; i < TEST_BOUNDARY; i++) {
assert(_switch_case_return(_t) != 0);
}
_SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
}
std::cout << _SUM/RANGE << std::endl;
}
std::cout << "END: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK" << std::endl;
return 0;
}
This benchmark produces the following output after we compile it with g++ and the -O1 optimizations:
g++ switch_case_performance_peak.cpp -o out.out -O1 && ./out.out
START: SWITCH_CASE_CONDITION DEFAULT BREAK
0.00063072
END: SWITCH_CASE_CONDITION DEFAULT BREAK
START: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE
0.00055233
END: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE
START: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK
0.00052462
END: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK
As our ultimate reward, we get to break down the great results of our micro optimization efforts!
📘 Note
The difference between using
breakandreturnmight seem negligible but it is, after all, an improvement in performance as we have optimized out the redundantJMPfor thebreaklabel as detailed out above.

Top comments (0)