**2024-01-14**

We have seen in a previous post that switch statements are more complex than they seem. The compiler is able to implement them as a decision tree (even though maybe it shouldn't). Is that everything it can do?

*In this post, I'll focus on gcc. All optimizations, except the bitset lowering, translate to clang. I'll be using the -O3 optimization flag unless indicated otherwise.*

*I've also slightly edited gcc code and removed certain assertions for brevity.*

Let us start with a simple example: a function that takes two integers `x` and `val`. Depending on the value of `x`, it calls a corresponding function (`f0`, `f1`, and so forth) with `val` as an argument and returns the result.
Were we to implement this manually, we'd use an array of function pointers, where the input serves as the index into the array.

```
int f0(int x);
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int compact(int x, int val) {
    switch (x) {
    case 0: return f0(val);
    case 1: return f1(val);
    case 2: return f2(val);
    case 3: return f3(val);
    case 4: return f4(val);
    default: return 0;
    }
}
```

```
compact:
mov eax, edi
mov edi, esi
cmp eax, 4
ja .L2
jmp [QWORD PTR .L4[0+rax*8]]
.L4:
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
.L5:
jmp f3
.L3:
jmp f4
.L8:
jmp f0
.L7:
jmp f1
.L6:
jmp f2
.L2:
xor eax, eax
ret
```

The compiler agrees with this approach, and transforms the switch statement into a jump table.

Moreover, it is able to identify when the values are offset by a constant:

```
int f0(int x);
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int compact(int x, int val) {
    switch (x) {
    case 'a': return f0(val);
    case 'b': return f1(val);
    case 'c': return f2(val);
    case 'd': return f3(val);
    case 'e': return f4(val);
    default: return 0;
    }
}
```

```
compact:
lea eax, [rdi-97]
cmp eax, 4
ja .L2
mov edi, esi
jmp [QWORD PTR .L4[0+rax*8]]
.L4:
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
.L5:
jmp f3
.L3:
jmp f4
.L8:
jmp f0
.L7:
jmp f1
.L6:
jmp f2
.L2:
xor eax, eax
ret
```

Our intuition suggests that the more sparse the values, the less advantageous the jump table conversion is.

Through experimentation, we find that `gcc` considers the threshold to be around eight jump table entries per `case` label.

In other words, the following function compiles to a jump table:

```
int f0(int x);
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int compact_7(int x, int val) {
    switch (x) { /* five labels spread over 0..34: seven entries per label */
    case 0: return f0(val);
    /* ... */
    case 34: return f4(val);
    default: return 0;
    }
}
```

```
compact_7:
mov eax, edi
mov edi, esi
cmp eax, 34
ja .L2
jmp [QWORD PTR .L4[0+rax*8]]
.L4:
.quad .L8
.quad .L2
; ...
.quad .L2
.quad .L7
.quad .L2
; ...
.quad .L2
.quad .L6
.quad .L2
; ...
.quad .L2
.quad .L5
.quad .L2
; ...
.quad .L2
.quad .L3
.L2:
xor eax, eax
ret
.L8:
jmp f0
.L7:
jmp f1
.L6:
jmp f2
.L5:
jmp f3
.L3:
jmp f4
```

But this other function is expanded as a tree of comparisons:

```
int f0(int x);
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int sparse_8(int x, int val) {
    switch (x) {
    case 0:  return f0(val);
    case 8:  return f1(val);
    case 16: return f2(val);
    case 24: return f3(val);
    case 40: return f4(val);
    default: return 0;
    }
}
```

```
sparse_8:
mov eax, edi
mov edi, esi
cmp eax, 16
je .L2
jle .L10
cmp eax, 24
je .L7
cmp eax, 40
jne .L1
jmp f4
.L10:
test eax, eax
je .L4
cmp eax, 8
jne .L1
jmp f1
.L1:
xor eax, eax
ret
.L4:
jmp f0
.L7:
jmp f3
.L2:
jmp f2
```

Here's the comparison tree, visualized for clarity:

```
               x == 16 → f2
              /          \
         x < 16           x > 16
         /                     \
   x == 0 → f0            x == 24 → f3
   x == 8 → f1            x == 40 → f4
   otherwise → 0          otherwise → 0
```

A few observations:

1. We've only tried equally spaced values so far. Can the compiler identify transformations that are profitable on a subset of all cases?
2. So far each `case` led to a different result. What if we had different cases with the same result?
3. The switch in `sparse_8` could be simplified by dividing `x` by eight before the jump. Interestingly, `gcc` *doesn't* find this simplification.

Let's dig into 1. and 2.

Altering our first example by including an outlier value shows that the compiler is able to recognize dense subsets of cases.

```
int f0(int x);
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int f5(int x);
int outlier(int x, int val) {
    switch (x) {
    case 1:   return f0(val);
    case 2:   return f1(val);
    case 3:   return f2(val);
    case 4:   return f3(val);
    case 5:   return f4(val);
    case 100: return f5(val);
    default:  return 0;
    }
}
```

```
outlier:
mov eax, edi
mov edi, esi
cmp eax, 5
jg .L2
test eax, eax
jle .L1
cmp eax, 5
ja .L4
jmp [QWORD PTR .L6[0+rax*8]]
.L6:
.quad .L4
.quad .L4
.quad .L9
.quad .L8
.quad .L7
.quad .L5
.L2:
cmp eax, 100
jne .L1
jmp f5
.L7:
jmp f3
.L5:
jmp f4
.L9:
jmp f1
.L8:
jmp f2
.L4:
jmp f0
.L1:
xor eax, eax
ret
```

The compiler emits a branch for case 100, and a jump table for cases 1 through 5.

In our examples so far, each case label led to a different code block. Let's test what happens when multiple `case` labels invoke the same function.

```
int f1(int x);
int f2(int x);
int f3(int x);
int f4(int x);
int f5(int x);
int blocks_2(int x, int val) {
    switch (x) {
    case 0:  case 1:  case 2:  case 3:  return f1(val);
    case 21: case 22: case 23: case 24: return f2(val);
    case 42: case 43: case 44: case 45: return f3(val);
    case 63: case 64: case 65: case 66: return f4(val);
    case 84: case 85: case 86: case 87: return f5(val);
    default: return 0;
    }
}
```

```
blocks_2:
mov eax, edi
mov edi, esi
cmp eax, 45
jg .L2
cmp eax, 41
jg .L3
cmp eax, 3
jg .L4
test eax, eax
js .L1
jmp f1
.L2:
cmp eax, 66
jg .L8
cmp eax, 62
jle .L1
jmp f4
.L4:
sub eax, 21
cmp eax, 3
ja .L1
jmp f2
.L8:
sub eax, 84
cmp eax, 3
ja .L1
jmp f5
.L3:
jmp f3
.L1:
xor eax, eax
ret
```

We have 20 case labels, with values between 0 and 87. The compiler could lower this `switch` to a jump table with 88 entries.
Our hypothesis for the heuristic is that a jump table conversion is profitable when its size is smaller than `8 * num_case_labels`.
Since `88 < 20 * 8`, the conversion should be profitable in this case; however, `gcc` opted to lower the switch to a sequence of comparisons.

To better understand why, let's examine a specific block of `case` labels.

```
int blocks_2(int x, int val) {
    switch (x) {
    /* ... */
    case 42: case 43: case 44: case 45: return f3(val);
    /* ... */
    }
}
```

```
blocks_2:
mov eax, edi ; Move `x` into eax
mov edi, esi ; Move `val` into the first-argument register
cmp eax, 45
; If x <= 45, do _not_ jump
jg .L2
cmp eax, 41
; If x >= 42, jump to L3
jg .L3
; ...
.L2:
; ...
.L3:
; We reach L3 only if 42 <= x <= 45
jmp f3 ; Tail-call f3(val)
; ...
```

The compiler realizes that the cases are contiguous and can be implemented with two comparisons instead of four.
If the input of the jump table heuristic is the number of *comparisons* instead of the number of *case labels*, things work out:

We have a total of 10 comparisons. Since `10 * 8 < 88`, our revised heuristic suggests that a jump table transformation is unprofitable.

Let us try a non-contiguous case.

```
int f1(int x);
int bittest(int x) {
    switch (x) {
    case '0': case '2': case '4': case '5': case '7': case '9':
    case 'A': case 'C': case 'E':
        return f1(x);
    default:
        return 0;
    }
}
```

```
bittest:
lea eax, [rdi-48]
cmp eax, 21
ja .L1
mov edx, 2753205 ; ???
bt rdx, rax
jc .L7
.L1:
xor eax, eax
ret
.L7:
jmp f1
```

The compiler emits non-intuitive machine code. Why is there no chain of comparisons, nor a jump table? Where does the constant `2753205` come from?

A hint comes from the `bt` instruction. According to the Intel docs, `bt`:

> Selects the bit in a bit string (specified with the first operand, called the bit base) at the bit-position designated by the bit offset (specified by the second operand) and stores the value of the bit in the CF flag.

If we examine the binary representation of `2753205`, we find that it is in fact a bitset. The bits set to one correspond to the ASCII values of the case labels, each reduced by 48.

We've encountered several distinct lowering strategies employed by gcc. For a more in-depth understanding, we can delve into the compiler's source code.

The compiler pass responsible for lowering `switch` statements is the (appropriately named) `pass_lower_switch`, which in turn calls `analyze_switch_statement`.

Since `gcc` source can be daunting to follow, I'll provide a step-by-step commented version below.
Following this, I will break down each section to explain its functionality.

```
bool
switch_decision_tree::analyze_switch_statement ()
{
  unsigned l = gimple_switch_num_labels (m_switch);
  basic_block bb = gimple_bb (m_switch);
  auto_vec<cluster *> clusters;
  clusters.create (l - 1);

  basic_block default_bb = gimple_switch_default_bb (cfun, m_switch);
  m_case_bbs.reserve (l);
  m_case_bbs.quick_push (default_bb);

  compute_cases_per_edge ();
```

Some preamble code. `m_switch` points to a gimple object; gimple is the internal representation used by `gcc`.
Through this, we extract the overall count of cases and the individual counts of cases for each outgoing edge.

We also allocate a vector that will record the lowering strategy chosen for each case label.

```
for (unsigned i = 1; i < l; i++)
  {
    tree elt = gimple_switch_label (m_switch, i);
    tree lab = CASE_LABEL (elt);
    basic_block case_bb = label_to_block (cfun, lab);
    edge case_edge = find_edge (bb, case_bb);
    tree low = CASE_LOW (elt);
    tree high = CASE_HIGH (elt);

    profile_probability p
      = case_edge->probability.apply_scale (1, (intptr_t) (case_edge->aux));
    clusters.quick_push (new simple_cluster (low, high, elt,
                                             case_edge->dest, p));
  }
```

More setup code. We notice that gimple has a concept of high and low value for each label; contiguous `case` labels have already been joined together at this stage.

For each case label, we allocate a `simple_cluster`. This cluster serves as a record, capturing the range of values associated with the label, the relevant outgoing edge, and the likelihood of this cluster being chosen. This probability comes from profile-guided optimization; by default it is uniform among case labels.

```
// ...
vec<cluster *> output = bit_test_cluster::find_bit_tests (clusters);
```

We transform the vector of clusters we have created. The name `bit_test_cluster` hints at the bitset lowering strategy we saw before.

```
/* Find jump table clusters. */
vec<cluster *> output2;
auto_vec<cluster *> tmp;
output2.create (1);
tmp.create (1);

for (unsigned i = 0; i < output.length (); i++)
  {
    cluster *c = output[i];
    if (c->get_type () != SIMPLE_CASE)
      {
        /* ... flush `tmp`, as in the next snippet ... */
        output2.safe_push (c);
      }
    else
      tmp.safe_push (c);
  }
```

This section allocates a vector for the transformed cases and a scratch vector. Any case that wasn't altered during the bitset transformation is gathered in this scratch vector. These cases are then processed and cleared whenever we come across an optimized cluster.

Essentially, in this section we are looking for sequences of case labels that couldn't be converted into bitset clusters.

```
vec<cluster *> n = ;
output2.;
n.;
```

We attempt to convert these consecutive labels into a jump table and add the outcome to the output.

```
/* We still can have a temporary vector to test. */
if (!tmp.is_empty ())
  {
    vec<cluster *> n = jump_table_cluster::find_jump_tables (tmp);
    output2.safe_splice (n);
    n.release ();
  }
```

This is a coda for the loop above.

```
// ...
bool expanded = try_switch_expansion (output2);
release_clusters (output2);
return expanded;
}
```

Whatever is left goes through another pass, which turns the switch statement into a decision tree. The bitsets and jump tables are leaves of the decision tree.

Let's dig further into `bit_test_cluster::find_bit_tests`. The high-level structure of the code is as follows.

Initially, we aim to find the optimal way to bundle all clusters into bitsets. Optimality, in this context, means minimizing the number of bitsets used. We do so using dynamic programming.

Subsequently, we iterate over the bitsets and apply a profitability heuristic. Bitsets that meet this heuristic are emitted as a `bit_test_cluster`; otherwise, we keep the original `simple_cluster`s.

```
vec<cluster *>
bit_test_cluster::find_bit_tests (vec<cluster *> &clusters)
{
  unsigned l = clusters.length ();
  auto_vec<min_cluster_item> min;
  min.reserve (l + 1);
  min.quick_push (min_cluster_item (0, 0, 0));

  for (unsigned i = 1; i <= l; i++)
    {
      /* Set minimal # of clusters with i-th item to infinite. */
      min.quick_push (min_cluster_item (INT_MAX, INT_MAX, INT_MAX));

      for (unsigned j = 0; j < i; j++)
        {
          if (min[j].m_count + 1 < min[i].m_count
              && can_be_handled (clusters, j, i - 1))
            min[i] = min_cluster_item (min[j].m_count + 1, j, INT_MAX);
        }
    }
```

For the dynamic programming, we'll use a vector of `min_cluster_item` structures.
Each `min_cluster_item` has three components:

- The first records how many bitsets we need to represent the first *i* case labels, where *i* is the index in the vector.
- The second records the **start** of the last bitset.
- The third is unused.

The dynamic programming is as follows. We want to find the minimum number of bitsets to represent the first *i* case labels.

The base case is straightforward: a single case label can be represented by a single bitset.

Suppose we have already found the optimal number of bitsets for the first *j* case labels, for all *j < i*.
We define `m(j)` to be the optimal number of bitsets for the first *j* elements. This means we have already computed `m(1), ..., m(i-1)`, and are now seeking `m(i)`.

If the range of case labels *j+1, ..., i* can be represented as a single bitset cluster, it must be true that

    m(i) <= m(j) + 1

Conversely, assume we have the optimal assignment for the first *i* labels. If the last bitset starts at label *j+1*, this assignment is also optimal when restricted to the first *j* labels. Were this not the case, we could improve the assignment on the first *i* labels as well.

Therefore, it must be that:

    m(i) = min { m(j) + 1 : j < i, labels j+1, ..., i fit in a single bitset }

These conditions are exactly what the `if` statement in the code above checks for.

The `can_be_handled` function controls whether a range of `case` labels can be turned into a bitset test. It tests two criteria:

- The *combined* range of all the case labels must be smaller than the width of a register. On contemporary systems, the combined range must be less than 64.
- The case labels must have at most 3 *unique* outgoing edges.

If there is more than one edge, we emit a cascade of bit tests.

```
vec<cluster *> output;
// ...
/* Find and build the clusters. */
for (unsigned end = l;;)
  {
    int start = min[end].m_start;

    if (is_beneficial (clusters, start, end - 1))
      output.safe_push (new bit_test_cluster (clusters, start, end - 1,
                                              start == 0 && end == l));
    else
      for (int i = end - 1; i >= start; i--)
        output.safe_push (clusters[i]);

    end = start;
    if (start <= 0)
      break;
  }
output.reverse ();
return output;
}
```
```

Now we convert only bitsets that satisfy the `is_beneficial` test.
In the code, `count` represents the number of case labels handled by the cluster, while `uniq` denotes the number of distinct outgoing edges.

```
bool
bit_test_cluster::is_beneficial (unsigned count, unsigned uniq)
{
  return (((uniq == 1 && count >= 3)
           || (uniq == 2 && count >= 5)
           || (uniq == 3 && count >= 6)));
}
```

The logic for jump table conversion mirrors the one used for bit test conversion. It leverages dynamic programming to identify "clusters" that can be profitably converted to a jump table.

Let's delve into the code.

```
vec<cluster *>
jump_table_cluster::find_jump_tables (vec<cluster *> &clusters)
{
  unsigned l = clusters.length ();
  auto_vec<min_cluster_item> min;
  min.reserve (l + 1);
```

We start by preallocating a vector of `min_cluster_item`. We'll use this vector to keep track of the best jump table assignments for consecutive `case` clusters.

```
min.;
```

We also push the base case for dynamic programming. This corresponds
to using a single jump table for the whole `switch`

.

Let's focus now on the main loop.

```
unsigned HOST_WIDE_INT max_ratio = /* Used in the main heuristic. See next snippet. */;
for (unsigned i = 1; i <= l; i++)
  {
    /* Set minimal # of clusters with i-th item to infinite. */
    min.quick_push (min_cluster_item (INT_MAX, INT_MAX, INT_MAX));

    /* Pre-calculate number of comparisons for the clusters. */
    HOST_WIDE_INT comparison_count = 0;
    for (unsigned k = 0; k < i; k++)
      comparison_count
        += static_cast<simple_cluster *> (clusters[k])->get_comparison_count ();

    for (unsigned j = 0; j < i; j++)
      {
        /* Count the cases that would end up outside a jump table. */
        unsigned HOST_WIDE_INT s = min[j].m_non_jt_cases;
        if (i - j < case_values_threshold ())
          s += i - j;

        if ((min[j].m_count + 1 < min[i].m_count
             || (min[j].m_count + 1 == min[i].m_count
                 && s < min[i].m_non_jt_cases))
            && can_be_handled (clusters, j, i - 1, max_ratio, comparison_count))
          min[i] = min_cluster_item (min[j].m_count + 1, j, s);

        comparison_count
          -= static_cast<simple_cluster *> (clusters[j])->get_comparison_count ();
      }
  }
```

The dynamic programming loop mirrors the one we used for bitsets. We want to find the minimum number of jump tables to represent the first *i* case labels.

We iterate over splits at the `j`-th case label, for `j < i`. During each iteration we track the best-scoring split.
The special case `j == 0` represents a single jump table for the entire `switch` statement.

Inside this inner loop we also keep track of:

1. How many distinct jump tables we would need to lower the switch.
2. The **start** of the *last* jump table of the decomposition.
3. How many cases *cannot* be handled by a jump table. In the code, this is the accumulator `s`.

Points 1 and 2 are analogous to the bitset lowering pass, while 3 is new and specific to jump tables.

Namely, there is some overhead to dispatching through a jump table, due to the indirect jump and the need to check bounds.
Therefore `gcc` requires a minimum number of `case` labels to generate a jump table.

If a split would generate a jump table that is too small, the size of the jump table is added to the accumulator `s`, which is used as a tie-breaker between different split candidates.

To compute the scoring, we need the comparison count between the `j`-th and the `i`-th case label. We compute it by summing up the number of comparisons for all labels, and decreasing the count as we progress in the inner loop.

We also need the number of unhandled `case` labels. We determine it by comparing the size of the jump table to the result of the `case_values_threshold` function. This function is architecture-dependent and takes into account the presence of switch-specific instructions, but on modern x64 and Arm64 machines it defaults to 5.

```
unsigned int
default_case_values_threshold (void)
{
  return (targetm.have_casesi () ? 4 : 5);
}
```

Let's now focus on the main heuristic, starting with how `max_ratio` is computed. I've reproduced below the main loop, omitting lines unrelated to the main heuristic.

We've seen in the examples section that gcc uses the ratio between jump table entries and comparisons as the metric for its lowering heuristic. The code below shows that distinct ratios are used when optimizing for speed (flags `-O[123|fast]`) or size (flag `-Os`). Specifically, a ratio of 8 is used when optimizing for speed, while a ratio of 2 is used when optimizing for size.

```
unsigned HOST_WIDE_INT max_ratio = optimize_insn_for_size_p () ? 2 : 8; // This is `8`
for (unsigned i = 1; i <= l; i++)
  {
    // ...
    if (/* ... */ can_be_handled (clusters, j, i - 1, max_ratio, comparison_count))
      // ...
  }
}
```

Now onto the main loop. Given two splits, we prefer the one that results in fewer jump tables being emitted.
Or, if they would result in the same number of jump tables, we prefer the one that handles the most `case` labels.
We also ignore splits that would result in unprofitable jump tables, where profitability is checked by the `can_be_handled` function.

The `can_be_handled` condition checks for overflow and compares the ratio of jump table entries to comparisons:

```
bool
jump_table_cluster::can_be_handled (const vec<cluster *> &clusters,
                                    unsigned start, unsigned end,
                                    unsigned HOST_WIDE_INT max_ratio,
                                    unsigned HOST_WIDE_INT comparison_count)
{
  /* A single cluster can always be handled. */
  if (start == end)
    return true;

  unsigned HOST_WIDE_INT range
    = get_range (clusters[start]->get_low (), clusters[end]->get_high ());
  /* Check overflow. */
  if (range == 0)
    return false;

  return range <= max_ratio * comparison_count;
}
```

The coda of the function also mirrors the one for bitset clusters.
It iterates over all jump tables in reverse order, and prunes the ones that fail the `is_beneficial` check.

```
vec<cluster *> output;
output.create (4);
/* Find and build the clusters. */
for (unsigned end = l;;)
  {
    int start = min[end].m_start;
    if (is_beneficial (clusters, start, end - 1))
      output.safe_push (new jump_table_cluster (clusters, start, end - 1));
    else
      for (int i = end - 1; i >= start; i--)
        output.safe_push (clusters[i]);
    end = start;
    if (start <= 0)
      break;
  }
output.reverse ();
return output;
}
```

The `is_beneficial` check for jump tables aims to prune small jump tables. It compares the number of case labels to `case_values_threshold`.

```
bool
jump_table_cluster::is_beneficial (const vec<cluster *> &,
                                   unsigned start, unsigned end)
{
  /* Single case bail out. */
  if (start == end)
    return false;

  return end - start + 1 >= case_values_threshold ();
}
```

If `gcc` cannot emit a bitset cluster or a jump table, it falls back to a binary tree of comparisons.

In emitting the tree, it takes into account the probability of each case label cluster. The probability can be obtained via profile-guided optimization, or with the `__builtin_expect_with_probability` builtin.

The code is pretty straightforward.

```
void
switch_decision_tree::balance_case_nodes (case_tree_node **head,
                                          case_tree_node *parent)
{
  case_tree_node *np = *head;
  if (np)
    {
      int i = 0;
      int ranges = 0;
      case_tree_node **npp;
      case_tree_node *left;
      profile_probability prob = profile_probability::never ();
```

Sum the probability of all case labels.

```
/* Count the number of entries on branch. Also count the ranges. */
while (np)
  {
    if (!tree_int_cst_equal (np->m_c->get_low (), np->m_c->get_high ()))
      ranges++;
    i++;
    prob += np->m_c->m_prob;
    np = np->m_right;
  }
```

Split the case label list such that both halves have equal probability.

```
if (i > 2)
  {
    /* Split this list if it is long enough for that to help. */
    npp = head;
    left = *npp;
    profile_probability pivot_prob = prob.apply_scale (1, 2);

    /* Find the place in the list that bisects the list's total probability. */
    while (1)
      {
        prob -= (*npp)->m_c->m_prob;
        if ((pivot_prob - prob).nonzero_p ())
          break;
        npp = &(*npp)->m_right;
      }
```

Recurse in both halves to build the binary tree.

```
      np = *npp;
      *npp = 0;
      *head = np;
      np->m_parent = parent;
      np->m_left = left == np ? NULL : left;

      /* Optimize each of the two split parts. */
      balance_case_nodes (&np->m_left, np);
      balance_case_nodes (&np->m_right, np);
      np->m_c->m_subtree_prob = np->m_c->m_prob;
      if (np->m_left)
        np->m_c->m_subtree_prob += np->m_left->m_c->m_subtree_prob;
      if (np->m_right)
        np->m_c->m_subtree_prob += np->m_right->m_c->m_subtree_prob;
    } else {
      /* Leave this branch as one level, but fill in the parent fields. */
      // ...
    }
  }
}
```

We've seen the heuristics and lowering strategies `gcc` employs when compiling switch statements, including:

- Compressing multiple `case` labels into a bitset.
- Transforming the `switch` statement into a jump table.
- Transforming the `switch` statement into a binary decision tree.

For anyone interested in delving deeper, I'd suggest examining the `gcc` source code, and reading this paper.

*Finally, I want to thank B.X.V., C.P., and V.P. for their feedback when writing this post.*