2023-06-11
This post is a follow-up to two posts by Wojciech Muła: one on parsing HTTP verbs, and another on using pext for perfect hashing.
The first post analyzes alternate implementations of a function string_to_verb that translates an HTTP verb, represented by a std::string_view, into an enum.
It explores three different strategies:
- the reference implementation from the boost::beast library;
- a perfect hash generated with the gperf tool;
- SWAR (SIMD Within A Register) techniques.

The SWAR strategy turns out to be the least performant.
The three strategies are tested on two synthetic datasets: one containing all the HTTP verbs, and one containing only the common verbs GET, PUT and POST.
The second post introduces a new strategy. This strategy first branches on the length of the input.
Then it uses pext, an instruction from Intel's BMI2 extension, to extract a few discriminating bits from the input.
Finally, it interprets those bits as an index into a per-length lookup table.
We'll call this new strategy pext_by_len.
This second implementation allows each codepath to know at compile time how much data it needs to load from memory. We will see that this is critical for performance.
Wojciech Muła doesn't test this new strategy on the original problem, but he finds it to perform very well on a range of similar tasks.
In this post I will:
- backport the pext_by_len strategy to the original string_to_verb problem;
- introduce a new strategy based on pext as well, using a global table instead of per-length ones as in pext_by_len (we'll see how some characteristics of pext synergize with the use of SWAR techniques and a global table);
- replace memcmp with a more specialized implementation.

All benchmarks were run on my machine, a ThinkPad laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz.
Here’s what I did to set up the benchmark program:
- I isolated a CPU core with the isolcpus kernel boot parameter.
- I pinned the benchmark process to the isolated core with taskset.

To ensure accurate results and compensate for cache-warming effects, I ran each benchmark approximately 256'000 times. In the tables below, I report both the average timing in nanoseconds and in cycles, and the median absolute percent error across all runs. Additionally, I report the cache miss rate obtained through hardware performance counters, as it will be useful in the analysis.
As a start, I have reproduced the benchmarks on my machine, timing both in nanoseconds and cycles, and added an annotation for cache misses.
Of the strategies in the first post, we see that SWAR implementations lag behind the boost::beast
implementation, while using perfect hashing yields the best performance by far.
The ranking of the results matches that of the original post.
The backported pext_by_len
strategy shines on the three verbs benchmark, and is still the second best on the full benchmark.
We see that pext_by_len
has a lot more cache misses and takes more time on the full benchmark.
This is the first hint that the input size branch is difficult to predict, and that removing it might yield improvements.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
14.91 | 1.6% | 29.22 | 51.9% | noop |
63.68 | 0.1% | 126.02 | 21.7% | reference_impl |
72.16 | 0.1% | 142.79 | 10.6% | swar |
68.98 | 0.2% | 136.47 | 10.8% | swar32 |
67.39 | 0.2% | 133.39 | 10.6% | swar32 v2 |
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
14.77 | 1.2% | 29.10 | 46.5% | noop |
45.76 | 0.7% | 90.62 | 9.8% | reference_impl |
50.24 | 0.1% | 99.48 | 4.3% | swar |
49.45 | 0.1% | 97.92 | 3.6% | swar32 |
47.80 | 0.1% | 94.65 | 4.9% | swar32 v2 |
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
Let's analyze the SWAR 32 strategy. The code is roughly as follows:
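Here is a minimal sketch of the idea (my reconstruction, not the exact code from the original post; the verb enum and the fourcc helper are assumptions of mine):

```cpp
// Sketch of the swar32 idea: copy the input into a zero-padded buffer,
// reinterpret the first four characters as an integer, and switch on it.
#include <cstdint>
#include <cstring>
#include <string_view>

enum class verb { get, put, post, head, patch, /* ... one per verb ... */ unknown };

// Pack the first four characters of a literal into a little-endian uint32_t.
constexpr uint32_t fourcc(const char (&s)[5]) {
    return uint32_t(uint8_t(s[0])) | uint32_t(uint8_t(s[1])) << 8 |
           uint32_t(uint8_t(s[2])) << 16 | uint32_t(uint8_t(s[3])) << 24;
}

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    char buf[16] = {};                          // zero-padded stack buffer
    std::memcpy(buf, v.data(), v.size());       // v.size() <= 13
    uint32_t prefix;
    std::memcpy(&prefix, buf, sizeof(prefix));  // first 4 characters as an integer
    switch (prefix) {                           // lowered by GCC to a decision tree
    case fourcc("GET\0"): return v.size() == 3 ? verb::get : verb::unknown;
    case fourcc("PUT\0"): return v.size() == 3 ? verb::put : verb::unknown;
    case fourcc("POST"):  return v.size() == 4 ? verb::post : verb::unknown;
    case fourcc("HEAD"):  return v.size() == 4 ? verb::head : verb::unknown;
    case fourcc("PATC"):
        return v.size() == 5 && buf[4] == 'H' ? verb::patch : verb::unknown;
    // ... one case per 4-character prefix; verbs sharing a prefix
    // (e.g. PROPFIND / PROPPATCH) are disambiguated inside the same case ...
    default: return verb::unknown;
    }
}
```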
The performance table shows a very high level of cache misses. Given that the switch statement has 33 cases, we suspect it might be the culprit.
To confirm this, we can use Linux's perf
utility:
$ env NANOBENCH_ENDLESS=swar32 timeout 10 perf record -e branches,branch-misses taskset -c 1 ./benchmark_nb
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 3.583 MB perf.data (78219 samples) ]
An initial look at perf
's output doesn't show any particular hotspot.
To see why, let us look at how GCC compiles the switch
statement:
Samples lea rax,[rdi-0x3]
cmp rax,0xa
↓ ja 510
push rbp
mov rdx,rdi
mov rbp,rsi
push rbx
1 mov rbx,rdi
sub rsp,0x18 ; reserve space for `buf`
mov rdi,rsp
mov QWORD PTR [rsp],0x0 ; zero out the first 64 bits of `buf`
→ call memcpy@plt ; copy `v` into `buf`
6 mov eax,DWORD PTR [rsp] ; load the first 4 characters of `buf` into $eax
; This is the body of the switch statement. We compare the first 4 characters of `buf`
; lexicographically, creating an n-ary decision tree.
; Due to endianness, verbs are ordered from the last letter to the first.
31 cmp eax,0x49424e55 ; buf[:4] == "UNBI"
↓ je 4f0 ; Verb must be UNBIND or not a verb. Check the rest of buf
↓ ja d0
1 cmp eax,0x44414548 ; buf[:4] == "HEAD"
↓ je 4e0
117 ↓ ja 90
97 cmp eax,0x43414b4d ; buf[:4] == "MKAC"
↓ je 488
108 ↓ jbe 150
28 cmp eax,0x43454843 ; buf[:4] == "CHEC"
↓ je 3a0
cmp eax,0x43544150 ; buf[:4] == "PATC"
↓ jne 2a0
9 cmp BYTE PTR [rbp+0x4],0x48
47 mov edx,0x1c
↓ jne 180
↓ jmp 109
nop
; ...
The code listing above shows the output of perf
. On the left, under the Samples
column, we see how many branch misses there were for each instruction.
On the right is the assembly code of the string_to_verb function.
We see that the switch statement has been expanded to a series of comparisons.
Interestingly, we see not only jumps on equality (je), but also on greater-than (ja) and less-than-or-equal (jbe).
This suggests that GCC has turned the switch statement into a decision tree. With some post-processing, we could extract the structure of the decision tree and visualize it.
Since the switch statement has been expanded into multiple comparisons, we'll need to post-process the data and sum the branch-miss counts across multiple instructions.
We also see that not all misses are recorded on a jump instruction. For example:
;Samples Instructions ; Comment
; ...
97 cmp eax,0x43414b4d ; buf[:4] == "MKAC"
↓ je 488
108 ↓ jbe 150
; ...
This is a phenomenon known as skew. Namely, since there are multiple instructions in flight in an out-of-order processor, the CPU cannot precisely attribute the branch miss to a single instruction and can be off by an instruction or two.
To account for this, we can use the following heuristic to attribute branch misses to the switch statement:
- Attribute the misses recorded on each instruction to the closest preceding cmp instruction seen in the assembly.
- Group the misses by the kind of cmp instruction. Instructions of the kind cmp eax,0x[...] correspond to our switch statement.

When we do so, we see that more than 40% of the branch misses come from our switch statement:
kind | miss cnt | miss % |
---|---|---|
cmp eax | 1830 | 47.7 % |
mov | 718 | 18.7 % |
cmp DWORD PTR [rsp+0x4] | 642 | 16.7 % |
cmp DWORD PTR [rbp+0x4] | 193 | 5.0 % |
jmp | 149 | 3.9 % |
cmp BYTE PTR [rbp+0x4] | 128 | 3.3 % |
cmp rbx | 54 | 1.4 % |
lea | 49 | 1.3 % |
add | 23 | 0.6 % |
cmp WORD PTR [rbp+0x8] | 23 | 0.6 % |
xor | 19 | 0.5 % |
cmp rax | 3 | 0.1 % |
cmp BYTE PTR [rbp+0x8] | 3 | 0.1 % |
pext
We want to eliminate the switch statement and replace it with a branchless solution, such as a jump table. However, the compiler didn't automatically do this for us because the values we're comparing against are sparse, which would result in a large jump table. A naive jump table of pointers would be 4 megabytes.
But what if we ignored some of the bits in our comparison value? Most of our verbs consist of uppercase ASCII letters. By examining the binary representation of uppercase ASCII letters, we can see that the upper 3 bits of each letter are always the same.
C | Binary | C | Binary |
---|---|---|---|
A | 0b01000001 | N | 0b01001110 |
B | 0b01000010 | O | 0b01001111 |
C | 0b01000011 | P | 0b01010000 |
D | 0b01000100 | Q | 0b01010001 |
E | 0b01000101 | R | 0b01010010 |
F | 0b01000110 | S | 0b01010011 |
G | 0b01000111 | T | 0b01010100 |
H | 0b01001000 | U | 0b01010101 |
I | 0b01001001 | V | 0b01010110 |
J | 0b01001010 | W | 0b01010111 |
K | 0b01001011 | X | 0b01011000 |
L | 0b01001100 | Y | 0b01011001 |
M | 0b01001101 | Z | 0b01011010 |
Even discarding those 3 bits would reduce our jump table size to 1 kilobyte!
Ideally, we would want to discard even more bits, to maximally compress our jump table. To do this, we need:
- a fast way to extract an arbitrary subset of bits from a 32-bit word;
- a way to choose which bits to extract, so that the extracted bits still uniquely identify each verb.
The first requirement is straightforward. pext
accomplishes this in a single cycle.
The second requirement is a bit more complex. Instead of performing a naive search over all possible 32-bit unsigned integers, we can use a technique called "snoob" that allows us to enumerate only integers with a specific number of set bits.
For example, a brute force search for a bitmask that extracts 7 bits needs up to four billion iterations. One that uses snoob
needs only three million.
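For reference, here is a sketch of the snoob trick, in its classic Hacker's Delight formulation, together with one way to enumerate all masks with exactly 7 bits set:

```cpp
#include <cstdint>

// Given x, return the next larger integer with the same number of set bits
// (the "snoob" function from Hacker's Delight).
uint32_t snoob(uint32_t x) {
    uint32_t smallest = x & -x;        // lowest set bit of x
    uint32_t ripple   = x + smallest;  // clears the lowest run of ones, carries into the next bit
    uint32_t ones     = x ^ ripple;    // the bits that changed
    ones = (ones >> 2) / smallest;     // right-adjust the ones that were cleared
    return ripple | ones;
}

// Example: visit every 32-bit mask with exactly 7 bits set.
// The smallest such mask is 0x7F and the largest is 0xFE000000.
void for_each_7bit_mask() {
    for (uint32_t m = 0x7F; ; m = snoob(m)) {
        // ... try mask m ...
        if (m == 0xFE000000u) break;
    }
}
```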
However, even with this optimization, we still face a challenge with the verbs PROPFIND and PROPPATCH, as they share a common 4-character prefix. While we could add a branch to handle this particular case, our preference is to maintain a completely branchless solution.
To distinguish them we exploit the fact that PROPFIND
and PROPPATCH
have a different length.
A quick Python check confirms that the first four characters of a verb, together with its length, uniquely identify the verb.
Python 3.11.3 (main, May 24 2023, 00:00:00) [GCC 12.3.1 20230508 (Red Hat 12.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> verbs = [
"ACL", "BIND", "CHECKOUT", "CONNECT", "COPY", "DELETE",
"GET", "HEAD", "LINK", "LOCK", "M-SEARCH", "MERGE",
"MKACTIVITY", "MKCALENDAR", "MKCOL", "MOVE", "NOTIFY", "OPTIONS",
"PATCH", "POST", "PROPFIND", "PROPPATCH", "PURGE", "PUT",
"REBIND", "REPORT", "SEARCH", "SUBSCRIBE", "TRACE", "UNBIND",
"UNLINK", "UNLOCK", "UNSUBSCRIBE",
]
>>> prefix_and_len = {(len(v), v[:4]) for v in verbs}
>>> len(verbs) == len(prefix_and_len)
True
The length has another nice property. Our function takes an std::string_view
as a parameter.
A string_view is implemented as a struct holding a pointer to the data and a length.
Due to the x64 calling convention, the length will already be passed in a register (rdi
in this case). We can confirm this by examining the assembly code once more:
; Here we implement the check
; if (v.size() < 3 or v.size() > 13)
; return verb::unknown;
; $rdi is the length of `v`
lea rax,[rdi-0x3] ; $rax = v.size() - 3ul
; Notice that the previous subtraction *underflows* if `v.size() < 3`
cmp rax,0xa ; check if `$rax > 10`. If `v.size() < 3`, `$rax` is a very big number and this check still succeeds
ja 510
[...]
We can simply "mix in" the string length by xor-ing it with the output of pext:
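A sketch of what the mask-search program might look like (my reconstruction, reusing the snoob function from above; the real find_mask may differ):

```cpp
// Search for a 7-bit mask such that pext(first four chars, mask) ^ length
// is unique across all 33 verbs. Compile with -mbmi2 for _pext_u32.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <set>
#include <string_view>
#include <vector>
#include <immintrin.h>

uint32_t snoob(uint32_t x);  // defined in the sketch above

static const std::vector<std::string_view> verbs = {
    "ACL", "BIND", "CHECKOUT", "CONNECT", "COPY", "DELETE",
    "GET", "HEAD", "LINK", "LOCK", "M-SEARCH", "MERGE",
    "MKACTIVITY", "MKCALENDAR", "MKCOL", "MOVE", "NOTIFY", "OPTIONS",
    "PATCH", "POST", "PROPFIND", "PROPPATCH", "PURGE", "PUT",
    "REBIND", "REPORT", "SEARCH", "SUBSCRIBE", "TRACE", "UNBIND",
    "UNLINK", "UNLOCK", "UNSUBSCRIBE",
};

// First four characters of a verb, zero-padded, as a little-endian integer.
uint32_t prefix4(std::string_view v) {
    char buf[4] = {};
    std::memcpy(buf, v.data(), std::min<size_t>(v.size(), 4));
    uint32_t out;
    std::memcpy(&out, buf, sizeof(out));
    return out;
}

int main() {
    for (uint32_t mask = 0x7F; ; mask = snoob(mask)) {
        std::set<uint32_t> seen;
        bool ok = true;
        for (std::string_view v : verbs)
            ok &= seen.insert(_pext_u32(prefix4(v), mask) ^ uint32_t(v.size())).second;
        if (ok) {
            std::printf("Found mask: %u\n", mask);
            return 0;
        }
        if (mask == 0xFE000000u)   // largest mask with 7 bits set: give up
            break;
    }
    return 1;
}
```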
This runs in ~50ms and quickly finds an appropriate bitmask:
$ ./find_mask
Found mask: 33687137
The implementation is very straightforward. For simplicity, I generate the lookup table at startup. An alternative would be to use code generation.
static const uint32_t MASK = ;
uint32_t ;
using lut_elem_t = uint8_t; //verb;
static const std::array<lut_elem_t, > LOOKUP_TABLE = ;
verb
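Fleshing out the skeleton above, a minimal sketch might look like this (the verb enum and VERB_STRINGS, an array mapping each verb to its spelling, are assumptions of mine; MASK is the value printed by find_mask):

```cpp
// Sketch of the first pext-based implementation: hash the first four
// characters with pext, mix in the length, look up a candidate verb, and
// confirm it against the canonical spelling to rule out false positives.
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <immintrin.h>

static const uint32_t MASK = 33687137;             // 0x02020661, 7 bits set

static uint32_t hash(std::string_view v) {
    char buf[4] = {};
    std::memcpy(buf, v.data(), std::min<size_t>(v.size(), 4));
    uint32_t prefix;
    std::memcpy(&prefix, buf, sizeof(prefix));
    return _pext_u32(prefix, MASK) ^ uint32_t(v.size());
}

using lut_elem_t = uint8_t;                         // stores a verb
static std::array<lut_elem_t, 128> LOOKUP_TABLE;    // 2^7 entries, filled at startup

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    verb candidate = verb(LOOKUP_TABLE[hash(v)]);
    // The hash is only perfect on the 33 verbs; an arbitrary input can still
    // land on a valid slot, so confirm against the canonical spelling.
    if (candidate != verb::unknown && VERB_STRINGS[size_t(candidate)] == v)
        return candidate;
    return verb::unknown;
}
```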
With this change, we see greatly improved results. All branch misses are gone.
On the full benchmark, this implementation (dubbed pext) is the second fastest, beaten only by the hash table generated by gperf.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
63.68 | 0.1% | 126.02 | 21.7% | reference_impl |
68.98 | 0.2% | 136.47 | 10.8% | swar32 |
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
On the common-verbs benchmark the results are less impressive, but still improved.
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
45.76 | 0.7% | 90.62 | 9.8% | reference_impl |
49.45 | 0.1% | 97.92 | 3.6% | swar32 |
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
Here our solution is beaten both by gperf
and by the per-length pext
solution.
Despite the improvements in the previous section, our solution is still dominated by gperf
. Can we do better?
Let us profile again with perf
, this time focusing on cycles spent on each instruction.
; ...
0.05 │ mov eax,0x4
1.51 │ cmp rdi,rax
0.01 │ cmovbe rax,rdi ; Compute min(sv.size(), 4);
0.04 │ mov rbp,rsi
1.81 │ mov DWORD PTR [rsp+0xc],0x0 ; Zero out buf
0.01 │ mov rbx,rdi
0.01 │ mov esi,eax ; Put min(sv.size(), 4) into $esi
0.03 │ test eax,eax
│ ↓ je 45
2.63 │ xor eax,eax
; GCC has inlined std::memcpy. This is a for loop that copies the bytes
; from the input string one by one into `buf`
; $eax is the counter, $esi is min(sv.size(), 4), $rbp is a pointer to the input string
; and $rsp+0xc is a pointer to buf
2.35 │34: mov edx,eax
2.80 │ movzx ecx,BYTE PTR [rbp+rdx*1+0x0]
2.72 │ inc eax
2.62 │ mov BYTE PTR [rsp+rdx*1+0xc],cl
2.42 │ cmp eax,esi
0.05 │ ↑ jb 34
; After we have finished copying the input string into buf, copy buf back into a register
38.93 │45: mov eax,DWORD PTR [rsp+0xc]
0.08 │ mov edx,0x2020661
8.38 │ pext eax,eax,edx
2.96 │ xor eax,ebx
14.18 │ movzx r12d,BYTE PTR [rax+0x424620]
; ...
We see that gcc is loading data from memory in a suboptimal way: it has inlined memcpy into a byte-by-byte loop.
This is likely because we are passing a variable length to memcpy.
We can get gcc to generate better assembly by giving it a hint.
We can use a union to load the first three characters, and load the fourth character conditionally on the input length.
This is undefined behaviour in C++, but it is explicitly supported by gcc.
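A sketch of the hinted load (my reconstruction; the author's exact code may differ):

```cpp
// Read the first three characters unconditionally, and the fourth only when
// the input is long enough. Reading a union member other than the last one
// written is UB in standard C++, but gcc documents it as supported.
#include <cstdint>
#include <string_view>

static uint32_t load_prefix(std::string_view v) {   // caller guarantees v.size() >= 3
    union {
        char     bytes[4];
        uint32_t word;
    } buf = {};
    buf.bytes[0] = v[0];
    buf.bytes[1] = v[1];
    buf.bytes[2] = v[2];
    if (v.size() > 3)
        buf.bytes[3] = v[3];                         // fourth byte only if it exists
    return buf.word;                                 // top byte is zero for length-3 inputs
}
```

The rest of the lookup is unchanged: the loaded word is fed to pext and xor-ed with the length.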
With this change, our implementation is competitive with gperf
on both benchmarks.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
It is still behind the per-length pext
solution on the common verbs benchmark though:
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
A look at the assembly shows that we are still using 3 different movs to load the beginning of our string into a register.
; Samples
8.11 │ movzx edx,BYTE PTR [rsi+0x2] ; Load the third byte
0.38 │ movzx eax,WORD PTR [rsi] ; Load the first two bytes
1.02 │ shl edx,0x10
1.28 │ or eax,edx ; Combine the first three bytes
5.70 │ mov rbx,rdi
│ mov rbp,rsi
│ xor edx,edx
2.11 │ cmp rdi,0x3 ; Check whether v.size() is exactly 3
0.10 │ ↓ je 2c ; If so, skip loading the fourth byte and leave $edx zero
6.14 │ movzx edx,BYTE PTR [rsi+0x3]
6.49 │2c: shl edx,0x18
1.05 │ or eax,edx ; Combine with the fourth byte
Ideally, we would like to load the four bytes with a single mov edx, DWORD PTR [rsi], but we can't.
The issue is the following: a string_view is not guaranteed to be null-terminated. Therefore, for inputs of length 3, a DWORD read would go past the end of the string's buffer, potentially into an unmapped page.
That said, we might be able to do better by borrowing some tricks from glibc
.
We've seen that the memcmp
implementation reads beyond the end of the array when the input pointers are far from a page boundary.
Since we are loading only 4 bytes from one pointer, the page boundary check simplifies to:
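In code, that check might look roughly like this (a sketch of mine, assuming 4 KiB pages):

```cpp
#include <cstdint>

// A 4-byte load starting at p stays within p's page unless p falls in the
// last 3 bytes of a 4 KiB page.
inline bool load4_may_cross_page(const char* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) > 4096u - 4u;
}
```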
But we can simplify even further. Notice that, by the time we load our input string into a register, we have already checked that its length is at least three. Therefore the only case where we could cause a page fault is when the input pointer is exactly 3 bytes before a page boundary.
This further refines the page boundary check to:
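Again as a sketch, with a helper name of my choosing:

```cpp
#include <cstdint>

// We already know v.size() >= 3, so the first three bytes are readable; only
// the fourth byte can land on a new page, and that happens only when p sits
// exactly 3 bytes before a page boundary.
inline bool fourth_byte_may_fault(const char* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) == 4096u - 3u;
}
```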
At first glance, inputs of length three still need extra handling: we load a byte beyond the end of the string, which needs to be zeroed out.
One way to do so branchlessly is the bzhi instruction:
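A sketch of that approach (assuming BMI2 and a hypothetical load4_unchecked helper that performs the raw 4-byte load from v.data()):

```cpp
#include <immintrin.h>   // _bzhi_u32, requires BMI2 (-mbmi2)

uint32_t data = load4_unchecked(v.data());          // may contain one junk byte
// Zero every bit at position >= v.size() * 8: for a length-3 input this
// clears the junk fourth byte, for longer inputs it leaves data untouched.
data = _bzhi_u32(data, uint32_t(v.size()) * 8);
```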
However, an even better approach is to do nothing. In other words, not masking the extra byte.
To understand why this works, observe that pext discards input bits, but does not mix them.
As a result, we can split the output bits of pext into two groups:
- bits extracted from the first three bytes of the input (the prefix);
- bits extracted from the fourth byte (the suffix).

We can also split the inputs into two kinds:
- strings of length three, for which the fourth byte we load is not part of the string;
- strings of length four or more, for which it is.
Bits extracted from the prefix are sufficient to distinguish strings of length three by construction. However, the fourth byte only affects bits in the suffix. Therefore it is enough to ensure that the bits extracted from the prefix distinguish strings of length three from keys of length four or more.
That guarantees that altering the suffix bits cannot cause a collision: we can simply add multiple entries to our lookup table for strings of length three, one for each combination of suffix bits.
This is true for our mask, and can be enforced by exhaustive checking in our find_mask
function.
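Concretely, the table construction might look like this sketch (my own; prefix4, MASK, LOOKUP_TABLE and lut_elem_t follow the earlier sketches, and VERB_STRINGS is an assumed array of the 33 spellings, in the same order as the verb enumerators):

```cpp
#include <cstdint>
#include <string_view>
#include <immintrin.h>

void fill_lookup_table() {
    for (size_t i = 0; i < VERB_STRINGS.size(); ++i) {
        std::string_view s = VERB_STRINGS[i];
        if (s.size() >= 4) {
            LOOKUP_TABLE[_pext_u32(prefix4(s), MASK) ^ uint32_t(s.size())] = lut_elem_t(i);
        } else {
            // Length-3 verb: the fourth loaded byte is junk, so insert one
            // entry per possible junk value. All of them map to the same verb.
            for (uint32_t junk = 0; junk < 256; ++junk) {
                uint32_t word = prefix4(s) | (junk << 24);
                LOOKUP_TABLE[_pext_u32(word, MASK) ^ 3u] = lut_elem_t(i);
            }
        }
    }
}
```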
With this improvement, we beat gperf
on the full benchmark!
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
26.25 | 0.7% | 51.71 | 0.0% | pext v3 |
We also beat pext_by_len
on the restricted benchmark.
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
26.06 | 0.2% | 51.51 | 0.0% | pext v3 |
Can we do even better?
Running the latest version of our benchmark under perf shows that the bottleneck is now memcmp.
That is, the bottleneck is not hashing our input string view, but checking that the hash has not given us a false positive, i.e. the final comparison of the input against the candidate verb's spelling.
We notice several inefficiencies in our false positive check:
- Perf shows that memcmp is not getting inlined. Our page-alignment check logic is inspired by memcmp's implementation, so by inlining memcmp the compiler would likely be able to elide redundant computations.
- memcmp has to branch on the size of the input arrays. But we know that our input string size is between 3 and 13 characters, so we can use 16-byte loads and omit all branching.
- memcmp needs to check that it is safe to read beyond the end of the array for both inputs. But we control one of the input arrays, and we can simply pad it with nulls.
- memcmp has to find the first differing byte; we don't. So we can use vptest instead of vpmovmskb and save one instruction.
- Our lookup table has an inefficient layout: we are using a lookup table of string_views, but a string_view is itself a (pointer, length) pair, which leads to a double indirection on lookup. A better implementation would keep the length and the data side by side, in the same cacheline (see the sketch below).
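Putting these observations together, the lookup and comparison might look like the following sketch (my reconstruction; verb and hash() follow the earlier sketches, and load16 is a hypothetical helper implementing the 16-byte load with the page-boundary slow path described above):

```cpp
// Each table slot keeps the verb tag, the length, the zero-padded spelling
// and a byte mask side by side, all within one cacheline.
#include <cstdint>
#include <string_view>
#include <immintrin.h>   // SSE4.1 for _mm_testz_si128 (vptest)

struct alignas(64) entry {
    uint8_t       tag;        // verb enumerator, verb::unknown for empty slots
    uint8_t       len;        // length of the verb, 0 for empty slots
    unsigned char text[16];   // verb spelling, zero-padded to 16 bytes
    unsigned char mask[16];   // 0xFF for the first `len` bytes, 0x00 elsewhere
};

static entry TABLE[128];      // filled at startup, indexed by the pext-based hash

__m128i load16(const char* p);  // hypothetical: 16-byte load, slow path near a page end

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    const entry& e = TABLE[hash(v)];
    __m128i input = load16(v.data());
    __m128i text  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(e.text));
    __m128i mask  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(e.mask));
    __m128i diff  = _mm_xor_si128(input, text);
    // vptest sets ZF iff (diff & mask) == 0, i.e. the first e.len bytes match;
    // unlike memcmp we never need the position of the first mismatch.
    bool same = _mm_testz_si128(diff, mask) && e.len == v.size();
    return same ? verb(e.tag) : verb::unknown;
}
```

With everything packed into one 64-byte entry, the spelling, its length and the comparison mask all arrive with the single cache line that the table lookup touches anyway.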
After implementing the changes above, we see a substantial speedup on both benchmarks:
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
26.25 | 0.7% | 51.71 | 0.0% | pext v3 |
22.04 | 1.1% | 43.40 | 0.0% | pext v4 |
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
26.06 | 0.2% | 51.51 | 0.0% | pext v3 |
21.98 | 1.0% | 43.28 | 0.0% | pext v4 |
When we began optimizing, we first examined the full benchmark and concentrated our efforts on reducing branch misses. This ultimately improved performance on the full benchmark.
However, when it came to the restricted benchmark, branch misses were not as significant an issue. We'd expect the restricted benchmark to be friendlier to a branchy implementation, as fewer cases mean easier-to-predict branches.
Then why did our branchless implementation improve the restricted benchmark, even surpassing pext_by_len
?
One way to interpret the results is the following: when using a well-performing perfect hash, hashing is no longer the limiting factor; rather, memory loads and memcmp become the bottlenecks.
To address this, the pext_by_len implementation branches on the input size, and specializes its memcpy for each size.
Our implementation takes a different approach. We specialize the comparison to always load and compare 16 bytes, and branch to a slow path only when the 16-byte load could cross a page boundary.
But a branch on pointer alignment is more predictable than one on input size. Assuming the input pointer is a random number, the slow path will be hit approximately 0.37% of the time:
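A back-of-the-envelope estimate (mine, assuming 4 KiB pages and the 16-byte loads described above): the load can cross into the next page only when the pointer falls within the last 15 bytes of a page, which happens with probability 15 / 4096 ≈ 0.37%.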
Therefore, the trade-off between the two options is a net win.
I want to thank A.Y., N.S., V.P. and in particular B.X.V. for their help in writing and editing this post. I also want to thank Wojciech Muła for his analysis of the problem.
All my code is available here.