Aleksandar Milenkovic

Posted on Mar 9

Switch Case: Peak Performance

#programming #assembly #c #performance

	This blog is sponsored and supported by Voxgig.

	Motivated and inspired by Decl. In development and soon to be open-sourced.

Switch Case: Peak Performance

About Author

My name is Aleksandar Milenkovic. I am a senior software engineer at Voxgig and a self-driven coder, an autodidact, that creates software and improves its quality at any level.

If you have any feedback or questions, feel free to reach out to me via LinkedIn.

Motivation

Having worked on the implementation of the programming language, Decl for quite a long time now - I needed to squeeze the water out of the stone in terms of all the complier optimizations for the switch case statement out there as any operation in Decl is exclusively determined by a switch case being a dynamic programming language where all the variables and declarations are essentially of any type.

For example, the following code in Decl requires as many as three JMP instructions before any other operations are done.

v = 1 + 1;

// Translates into

Value v = plus(makeNumber(1), makeNumber(1));

// plus(Value left, Value right); JMP, JMP // Is left a number? Is right a number?
// v.~Value(); JMP // Is v an object to be released?

Without the properly optimized switch case statement, this results in about 9 assembly instructions at face value!

In my opinion, I understood that we can do much better than that and that this is not feasible for millions upon millions of operations that can transpire in Decl, and so here we are!

I hope you find all of the information in this article useful and applicable to your code in order to reach that peak performance via micro optimization.

Source Code

The source code of this blog is all in one file, at the Github repository where it is also hosted. The code contains the main function running all the benchmarks in order.

Introduction

This article is the final conclusion of S2S Compilers: Understanding Switch Case Statements blog.

It is here that we finally get to fully and truly optimize the switch case statement so that the developer is confident of using such optimizations in their codebase without the excruciating pain of complexity, and nightmare maintenance problems and concerns each time there is, all but, a small addition to the code logic.

The main focus of the S2S Compilers: Understanding Switch Case Statements blog is to get the readers up to speed of what the switch case is and to contextualize various implementations that the developer can attempt in order to reach the peak performance.

Our conclusion of the previous blog was:

Although there is a lot to unpack in this blog, the best approach is to use the switch statement, as the GCC compiler has a hard time optimizing the high-level C implementations as discussed above. If you're determined to optimize further and want to remove any redundant ASM instructions that may impact your code, consider using 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands, which allows you to craft your assembly from the ground up - untouched by the optimizations such as -O1, -O2, or -O3.

Contrary to the statement above - after doing some research, I have finally managed to put together a jump table generated by the high-level switch case statement that truly uses the smallest number of assembly instructions overall without any necessity of crafting any assembly code from the group up as previously concluded.

Without further ado, let's dive in and speed up our code!

Switch Case Optimization

In the previous blog, we have primarily focused on the following switch case implementation.

void _switch_run(unsigned int type) {

  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
      r = 3;
      break;
    }
    case 4: {
      r = 4;
      break;
    }
    case 5: {
      r = 5;
      break;
    }
    case 6: {
      r = 6;
      break;
    }
    case 7: {
      r = 7;
      break;
    }
    case 8: {
      r = 8;
      break;
    }
    case 9: {
      r = 9;
      break;
    }
    case 10: {
      r = 10;
      break;
    }
    default: {
      r = 11;
      break;
    }
  }

}

Generating the following assembly code:

_switch_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L2
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L4[0+rax*8]
        jmp     rax
.L4:
        .quad   .L14
        .quad   .L13
        .quad   .L12
        .quad   .L11
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3
.L14:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L15 # break
.L13:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L15
.L12:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L15
.L11:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L15
.L10:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L15
.L9:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L15
.L8:
        mov     DWORD PTR [rbp-4], 6
        jmp     .L15
.L7:
        mov     DWORD PTR [rbp-4], 7
        jmp     .L15
.L6:
        mov     DWORD PTR [rbp-4], 8
        jmp     .L15
.L5:
        mov     DWORD PTR [rbp-4], 9
        jmp     .L15
.L3:
        mov     DWORD PTR [rbp-4], 10
        jmp     .L15
.L2: # Default Case Label
        mov     DWORD PTR [rbp-4], 11
        nop
.L15: # Break Label
        nop
        pop     rbp
        ret

This switch case, while safe due to the default case, can be barely optimized further due to its dangerous nature.

For instance, we can never (even theoretically) optimize out the conditional JUMP that precedes the actual direct JMP of our switch case table.

        cmp     DWORD PTR [rbp-20], 10
        ja      .L2
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L4[0+rax*8]
        jmp     rax

This is due to the following concerns:

Removing the if statement and CMP is more performant but the cost on the programmer's side of things is that the programmer has to make their code very strict with covering all the possible values and cases since this eliminates the default case. Note that if the subscript is out of bounds - the programmer is going to end up with Segmentation fault (core dumped) at runtime. All in all, this can lead to premature optimization.

Here is the exact log from valgrind, showing that the program cannot jump to the invalid address.
valgrind -s ./out.out
==49387== Jump to the invalid address stated on the next line
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)
==49387==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==49387== 
==49387== 
==49387== Process terminating with default action of signal 11 (SIGSEGV)
==49387==  Bad permissions for mapped region at address 0x0
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)

We can effectively eliminate this concern by using a strict range of enumeration values where we cover each and every case.

For the safety of our code at runtime, this restriction is absolutely crucial for the elimination of the CMP and JA overhead, which can indeed be considered huge overhead in the grand scheme of things.

enum class V {
    NUM,
    STRING,
    ARR,
    DICT,
    FUNC,
};

int __switch_case(V v) {
    volatile int i = 0;

    switch(v) {
        case V::NUM: {
            i = 1;
            break;
        }
        case V::STRING: {
            i = 2;
            break;
        }
        case V::ARR: {
            i = 3;
            break;
        }
        case V::DICT: {
            i = 4;
            break;
        }
        case V::FUNC: {
            i = 5;
            break;
        }
        default: {
            break;
        }
    }
    return i;

}

Generating the following assembly output:

__switch_case(V):
        mov     DWORD PTR [rsp-4], 0
        cmp     edi, 4
        ja      .L22
        mov     edi, edi
        jmp     [QWORD PTR .L24[0+rdi*8]]
.L24:
        .quad   .L28
        .quad   .L27
        .quad   .L26
        .quad   .L25
        .quad   .L23
.L28:
        mov     DWORD PTR [rsp-4], 1
.L22:
        mov     eax, DWORD PTR [rsp-4]
        ret
.L27:
        mov     DWORD PTR [rsp-4], 2
        jmp     .L22
.L26:
        mov     DWORD PTR [rsp-4], 3
        jmp     .L22
.L25:
        mov     DWORD PTR [rsp-4], 4
        jmp     .L22
.L23:
        mov     DWORD PTR [rsp-4], 5
        jmp     .L22

Direct JMP Instruction Optimization

Let's get to the crux of the matter and optimize out the following before our actual direct JMP.

        mov     DWORD PTR [rsp-4], 0
        cmp     edi, 4
        ja      .L22

This GCC-specific implementation will do the trick.

int _switch_case(V v) {
    volatile int i = 0;

    switch(v) {
        case V::NUM: {
            i = 1;
            break;
        }
        case V::STRING: {
            i = 2;
            break;
        }
        case V::ARR: {
            i = 3;
            break;
        }
        case V::DICT: {
            i = 4;
            break;
        }
        case V::FUNC: {
            i = 5;
            break;
        }
        default: {
            __builtin_unreachable();
        }
    }

    return i;

}

🚧 Note

The __builtin_unreachable() function is a non-standard, compiler-specific builtin function (available in GCC and Clang).

For more information, refer to 7.12 Other Built-in Functions Provided by GCC.

What the __builtin_unreachable() function does is that it informs the compiler that the control flow will never reach that point in the program, which means that the compiler can safely remove dead code given that we will never reach the default (out-of-bounds) case having covered each and every enumeration value of enum class V.

📘 Note

The C++ Programming Language plays a huge role here as it gives us the ability to strictly specify the type of the enumeration range so that no runtime faults can be caused. With enum class being of static type, this validation, needless to say, is done at compile time.

For more detailed overview, refer to C++ Language | Enumeration declaration.

As a result, we get the following assembly output.

_switch_case(V):
        mov     DWORD PTR [rsp-4], 0
        mov     edi, edi
        jmp     [QWORD PTR .L15[0+rdi*8]]
.L15:
        .quad   .L19
        .quad   .L18
        .quad   .L17
        .quad   .L16
        .quad   .L14
.L19:
        mov     DWORD PTR [rsp-4], 1
.L20:
        mov     eax, DWORD PTR [rsp-4]
        ret
.L18:
        mov     DWORD PTR [rsp-4], 2
        jmp     .L20
.L17:
        mov     DWORD PTR [rsp-4], 3
        jmp     .L20
.L16:
        mov     DWORD PTR [rsp-4], 4
        jmp     .L20
.L14:
        mov     DWORD PTR [rsp-4], 5
        jmp     .L20

It is absolutely safe to say that we have successfully eliminated the overhead of CMP and JA by literally removing those two assembly instructions altogether. Now, we jump straight to (no pun intended) the respective label of the jump table.

Break vs Return

While you may think that we are through with the optimization of our switch case, there is one last touch to hit the peak performance of the switch case statement.

This is the break label of the switch case statement that makes a JMP to get out of the switch case so that we don't possibly get stuck in an infinite loop.

This is the .L20 label that the code needs to jump to where there is the final RET instruction that pops the return address off the stack (of the initial CALL instruction).

_switch_case(V):
        mov     DWORD PTR [rsp-4], 0
        mov     edi, edi
        jmp     [QWORD PTR .L15[0+rdi*8]]
.L15:
        .quad   .L19
        .quad   .L18
        .quad   .L17
        .quad   .L16
        .quad   .L14
.L19:
        mov     DWORD PTR [rsp-4], 1
.L20:
        mov     eax, DWORD PTR [rsp-4]
        ret
.L18:
        mov     DWORD PTR [rsp-4], 2
        jmp     .L20
.L17:
        mov     DWORD PTR [rsp-4], 3
        jmp     .L20
.L16:
        mov     DWORD PTR [rsp-4], 4
        jmp     .L20
.L14:
        mov     DWORD PTR [rsp-4], 5
        jmp     .L20

What if we can reduce the number of assembly instructions by removing the JMP where the compiler produces the assembly output similar to the following for each and every switch case label?

.L14:
        mov     DWORD PTR [rsp-4], 5
        mov     eax, DWORD PTR [rsp-4]
        ret

This is made possible with the following C/C++ code implementation.

int _switch_case_return(V v) {
    volatile int i = 0;

    switch(v) {
        case V::NUM: {
            i = 1;
            return i;
        }
        case V::STRING: {
            i = 2;
            return i;
        }
        case V::ARR: {
            i = 3;
            return i;
        }
        case V::DICT: {
            i = 4;
            return i;

        }
        case V::FUNC: {
            i = 5;
            return i;
        }
        default: {
            __builtin_unreachable();
        }
    }
    return i;
}

This produces cleaner assembly with no extra JMP instruction of the break label whatsoever.

_switch_case_return(V):
        mov     DWORD PTR [rsp-4], 0
        mov     edi, edi
        jmp     [QWORD PTR .L6[0+rdi*8]]
.L6:
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L5
.L10:
        mov     DWORD PTR [rsp-4], 1
        mov     eax, DWORD PTR [rsp-4]
        ret
.L9:
        mov     DWORD PTR [rsp-4], 2
        mov     eax, DWORD PTR [rsp-4]
        ret
.L8:
        mov     DWORD PTR [rsp-4], 3
        mov     eax, DWORD PTR [rsp-4]
        ret
.L7:
        mov     DWORD PTR [rsp-4], 4
        mov     eax, DWORD PTR [rsp-4]
        ret
.L5:
        mov     DWORD PTR [rsp-4], 5
        mov     eax, DWORD PTR [rsp-4]
        ret

📘 Keep in mind!

It is important to keep in mind that the return statement terminates the rest of the code execution (as it makes it unreachable) so make sure to code your logic carefully!

With all of our functions implemented, let's put them into practice by benchmarking its results under heavy scenarios by wrapping them into very extensive loops. This is usually a common practice to benchmark your code in general.

Benchmark Results and Conclusion

Assembly instruction count can matter, but latency, branching, and memory behavior are usually more important.

Let's benchmark all of the functions above and see the differences for ourselves!

📘 Note

Note that, as opposed to Computed GOTO: Goto Jump Table with Labels in the previous blog, all of our functions can now be inlined. So while we are safely using our switch case, we are also capitalizing on all the optimizations that the compiler itself can make.

#include <iostream>

#include <cassert>

#include <ctime>

#define start_time clock_t s_t_a_r_t = clock();

int main() {

  const size_t TEST_BOUNDARY = 1000000;

  const size_t RANGE = 100;


  std::cout << "START: SWITCH_CASE_CONDITION DEFAULT BREAK" << std::endl;
  {

    double _SUM = 0;
    volatile V _t = V::NUM;

    for(size_t i = 0; i < RANGE; i++) {
        start_time;
        for(size_t i = 0; i < TEST_BOUNDARY; i++) {
            assert(__switch_case(_t) != 0);
        }
        _SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
    }

    std::cout << _SUM/RANGE << std::endl;

  }
  std::cout << "END: SWITCH_CASE_CONDITION DEFAULT BREAK" << std::endl;


  std::cout << "START: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE" << std::endl;
  {

    double _SUM = 0;
    volatile V _t = V::NUM;

    for(size_t i = 0; i < RANGE; i++) {
        start_time;
        for(size_t i = 0; i < TEST_BOUNDARY; i++) {
            assert(_switch_case(_t) != 0);
        }
        _SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
    }

    std::cout << _SUM/RANGE << std::endl;

  }
  std::cout << "END: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE" << std::endl;

  std::cout << "START: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK" << std::endl;
  {

    double _SUM = 0;
    volatile V _t = V::NUM;

    for(size_t i = 0; i < RANGE; i++) {
        start_time;
        for(size_t i = 0; i < TEST_BOUNDARY; i++) {
            assert(_switch_case_return(_t) != 0);
        }
        _SUM += static_cast<double>(clock() - s_t_a_r_t) / CLOCKS_PER_SEC;
    }

    std::cout << _SUM/RANGE << std::endl;

  }
  std::cout << "END: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK" << std::endl;

  return 0;
}

This benchmark produces the following output after we compile it with g++ and the -O1 optimizations:

g++ switch_case_performance_peak.cpp -o out.out -O1 && ./out.out

START: SWITCH_CASE_CONDITION DEFAULT BREAK
0.00063072
END: SWITCH_CASE_CONDITION DEFAULT BREAK
START: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE
0.00055233
END: SWITCH_CASE_CONDITION DEFAULT UNREACHABLE
START: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK
0.00052462
END: SWITCH_CASE_CONDITION UNREACHABLE AND RETURN AS OPPOSED TO BREAK

As our ultimate reward, we get to break down the great results of our micro optimization efforts!

📘 Note

The difference between using break and return might seem negligible but it is, after all, an improvement in performance as we have optimized out the redundant JMP for the break label as detailed out above.

DEV Community