Optimizing the `pext` perfect hash function


2023-06-11

This post is a follow-up to two posts by Wojciech Muła: one on parsing HTTP verbs, and another on using pext for perfect hashing.

The first post analyzes alternative implementations of a function string_to_verb that translates an HTTP verb, represented by a std::string_view, into an enum. It explores three different strategies:

  1. the reference implementation from boost::beast;
  2. a SWAR approach, which loads the first bytes of the verb into an integer and switches on it;
  3. a perfect hash function generated by gperf.

The SWAR strategy turns out to be the least performant.

The three strategies are tested on two synthetic datasets:

  1. one containing all 33 HTTP verbs ("all verbs");
  2. one containing only the three most common verbs ("GET/PUT/POST").

The second post introduces a new strategy. This strategy first branches on the length of the input. Then it uses the Intel-specific pext instruction to extract a few discriminating bits from the input. Finally, it interprets those bits as an index into a per-length lookup table. We'll call this new strategy pext_by_len.

This second implementation allows each codepath to know at compile-time how much data it needs to load from memory. We will see that this is critical for performance.

Wojciech Muła doesn't test this new strategy on the original problem, but finds it to perform very well on a range of similar tasks.
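
To give a feel for the shape of this strategy, here is a rough sketch (not Muła's actual code; MASK3 and LUT3 are hypothetical per-length tables, and the final verification of the candidate against the input is omitted):

#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <boost/beast/http/verb.hpp>

using boost::beast::http::verb;

// One pext mask and one lookup table per supported input length.
extern const uint32_t MASK3;
extern const verb LUT3[8];

verb string_to_verb_by_len(std::string_view v) {
    switch (v.size()) {
    case 3: {
        uint32_t w = 0;
        std::memcpy(&w, v.data(), 3);       // the load size is a compile-time constant
        return LUT3[_pext_u32(w, MASK3)];   // extract the discriminating bits, index the table
    }
    // ... one case per supported length (4 to 13) ...
    default:
        return verb::unknown;
    }
}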

In this post I will:

  1. reproduce the previous results on my machine, and backport the pext_by_len strategy to the original problem;
  2. profile the existing implementations to find their bottlenecks;
  3. build a branchless pext-based implementation and optimize it step by step until it beats gperf's perfect hash on both benchmarks.

Benchmarking setup

All benchmarks were run on my machine, a ThinkPad laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz.

Here’s what I did to set up the benchmark program:

To ensure accurate results and compensate for cache warming effects, I ran each benchmark approximately 256'000 times. In the tables below, I report the average timing, both in nanoseconds and in cycles, and the median absolute percent error across all runs. Additionally, I report the branch miss rate obtained through hardware performance counters, as it will be useful in the analysis.
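
For reference, registering one of the implementations with nanobench looks roughly like this (a sketch, not the actual harness; the input vector and counter setup are simplified):

#include <nanobench.h>
#include <cstddef>
#include <string_view>
#include <vector>

void bench_one(const std::vector<std::string_view>& inputs) {
    std::size_t i = 0;
    ankerl::nanobench::Bench()
        .minEpochIterations(256'000)        // roughly matches the run count above
        .run("swar32", [&] {
            auto r = string_to_verb(inputs[i++ % inputs.size()]);
            ankerl::nanobench::doNotOptimizeAway(r);
        });
}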

Reproducing previous results

As a start, I have reproduced the benchmarks on my machine, timing both in nanoseconds and cycles, and added an annotation for branch misses.

Of the strategies in the first post, we see that the SWAR implementations lag behind the boost::beast reference implementation, while perfect hashing yields the best performance by far. The ranking matches the original post.

The backported pext_by_len strategy shines on the three verbs benchmark, and is still the second best on the full benchmark.

We see that pext_by_len has many more branch misses and takes more time on the full benchmark. This is the first hint that the branch on the input size is difficult to predict, and that removing it might yield improvements.

ns/op    err%    cyc/op    miss%    all verbs
14.91    1.6%     29.22    51.9%    noop
63.68    0.1%    126.02    21.7%    reference_impl
72.16    0.1%    142.79    10.6%    swar
68.98    0.2%    136.47    10.8%    swar32
67.39    0.2%    133.39    10.6%    swar32 v2
29.22    0.4%     57.66     0.6%    perfect_hash
44.21    0.2%     87.50    14.9%    pext_by_len

ns/op    err%    cyc/op    miss%    GET/PUT/POST
14.77    1.2%     29.10    46.5%    noop
45.76    0.7%     90.62     9.8%    reference_impl
50.24    0.1%     99.48     4.3%    swar
49.45    0.1%     97.92     3.6%    swar32
47.80    0.1%     94.65     4.9%    swar32 v2
38.07    0.9%     75.15     2.7%    perfect_hash
31.65    0.2%     62.65     7.2%    pext_by_len

Profiling the SWAR implementation

Let's analyze the SWAR 32 strategy. The code is roughly as follows:

verb string_to_verb(std::string_view v)
{
    if (v.size() < 3 or v.size() > 13)
        return verb::unknown;

    // Type punning: fill `buf` with memcpy, then read its first 4 bytes through `value`
    union {
        char buf[13];
        uint32_t value;
    };
    value = 0;
    memcpy(buf, v.data(), v.size());

    switch (value) {
	// ...
        case STRING_CONST('M', 'K', 'A', 'C'):      // The STRING_CONST macro turns "MKAC" into a 32 bit integer.
            if (v.substr(4) == /*MKAC*/"TIVITY"_sv)
                return verb::mkactivity;
            break;
	// ...
    }

    return verb::unknown;
}

The performance table shows a very high rate of branch misses. Given that the switch statement has 33 cases, we suspect it might be the culprit.

To confirm this, we can use Linux's perf utility:

$ env NANOBENCH_ENDLESS=swar32 timeout 10 perf record -e branches,branch-misses taskset -c 1 ./benchmark_nb
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 3.583 MB perf.data (78219 samples) ]

An initial look at perf's output doesn't show any particular hotspot. To see why, let us look at how GCC compiles the switch statement:

Samples        lea  rax,[rdi-0x3]
               cmp  rax,0xa
               ja   510
               push rbp
               mov  rdx,rdi
               mov  rbp,rsi
               push rbx
     1         mov  rbx,rdi
               sub  rsp,0x18                         ; reserve space for `buf`
               mov  rdi,rsp
               mov  QWORD PTR [rsp],0x0              ; zero out the first 64 bits of `buf`
               call memcpy@plt                       ; copy `v` into `buf`
     6         mov  eax,DWORD PTR [rsp]              ; load the first 4 characters of `buf` into $eax
               ; This is the body of the switch statement. We compare the first 4 characters of `buf`
               ; lexicographically, to create an n-ary tree.
               ; Due to endianness, verbs are ordered from the last to the first letter.
    31         cmp  eax,0x49424e55                   ; buf[:4] == "UNBI"
               je   4f0                              ; Verb must be UNBIND or not a verb. Check the rest of buf
               ja   d0
     1         cmp  eax,0x44414548                   ; buf[:4] == "HEAD"
               je   4e0
   117         ja   90
    97         cmp  eax,0x43414b4d                   ; buf[:4] == "MKAC"
               je   488
   108         jbe  150
    28         cmp  eax,0x43454843                   ; buf[:4] == "CHEC"
               je   3a0
               cmp  eax,0x43544150                   ; buf[:4] == "PATC"
               jne  2a0
     9         cmp  BYTE PTR [rbp+0x4],0x48
    47         mov  edx,0x1c
               jne  180
               jmp  109
               nop
               ; ...

The code listing above shows the output of perf. On the left, under the Samples column, we see how many branch misses were recorded for each instruction. On the right is the assembly code of the string_to_verb function.

We see that the switch statement has been expanded into a series of comparisons. Interestingly, we see not only jumps on equality (je), but also jumps on above (ja) and below-or-equal (jbe). This suggests that GCC has turned the switch statement into a decision tree. With some post-processing, we can extract the structure of the decision tree and visualize it:

[Decision tree extracted from the compiled switch statement: the root block compares the 4-byte prefix against UNBI, HEAD, MKAC, CHEC and PATC, and branches via ja/jbe/jne into sub-blocks covering the remaining verbs (MOVE/PURG/REBI, UNLO/SUBS/UNSU/COPY, PUT/MKCA, and so on).]

Since the switch statement has been compiled into multiple comparisons, we'll need to postprocess the data and sum the branch miss counts across multiple instructions.

We also see that not all misses are recorded on a jump instruction. E.g.

;Samples       Instructions          ; Comment
; ...
    97         cmp  eax,0x43414b4d   ; buf[:4] == "MKAC"
               je   488
   108         jbe  150
; ...

This is a phenomenon known as skew. Namely, since there are multiple instructions in flight in an out-of-order processor, the CPU cannot precisely attribute the branch miss to a single instruction and can be off by an instruction or two.

To account for this, we can use a heuristic that attributes branch misses recorded on or near the switch's comparison instructions to the switch statement itself.

When we do so, we see that more than 40% of the branch misses come from our switch statement:

kind                        miss cnt    miss %
cmp eax                         1830    47.7 %
mov                              718    18.7 %
cmp DWORD PTR [rsp+0x4]          642    16.7 %
cmp DWORD PTR [rbp+0x4]          193     5.0 %
jmp                              149     3.9 %
cmp BYTE PTR [rbp+0x4]           128     3.3 %
cmp rbx                           54     1.4 %
lea                               49     1.3 %
add                               23     0.6 %
cmp WORD PTR [rbp+0x8]            23     0.6 %
xor                               19     0.5 %
cmp rax                            3     0.1 %
cmp BYTE PTR [rbp+0x8]             3     0.1 %

Branchless SWAR using pext

We want to eliminate the switch statement and replace it with a branchless solution, such as a jump table. However, the compiler didn't automatically do this for us because the values we're comparing against are sparse, which would result in a large jump table. A naive jump table of pointers would be 4 megabytes.

But what if we ignored some of the bits in our comparison value? Most of our verbs consist of uppercase ASCII letters. By examining the binary representation of uppercase ASCII letters, we can see that the upper 3 bits of each letter are always the same.

C    Binary          C    Binary
A    0b01000001      N    0b01001110
B    0b01000010      O    0b01001111
C    0b01000011      P    0b01010000
D    0b01000100      Q    0b01010001
E    0b01000101      R    0b01010010
F    0b01000110      S    0b01010011
G    0b01000111      T    0b01010100
H    0b01001000      U    0b01010101
I    0b01001001      V    0b01010110
J    0b01001010      W    0b01010111
K    0b01001011      X    0b01011000
L    0b01001100      Y    0b01011001
M    0b01001101      Z    0b01011010

Even discarding those 3 bits would reduce our jump table size to 1 kilobyte!

Ideally we would want to discard even more bits, to maximally compress our jump table. To do this, we need:

  1. A fast way to extract the bits.
  2. A fast way to search for a subset of bits that uniquely identifies a verb.

The first requirement is straightforward: pext extracts an arbitrary subset of bits in a few cycles (on Intel it has a latency of 3 cycles and a throughput of one per cycle).
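
Concretely, pext gathers the bits of its first operand that are selected by the mask and packs them, in order, into the low bits of the result. A toy example (not from the original posts):

#include <immintrin.h>
#include <cassert>

int main() {
    // The mask 0b11000011 selects bits 7, 6, 1 and 0 of the value.
    // pext packs them (highest to lowest) into the low bits: 1,0,0,0 == 0b1000.
    assert(_pext_u32(0b10110100u, 0b11000011u) == 0b1000u);
    return 0;
}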

The second requirement is a bit more complex. Instead of performing a naive search over all possible 32-bit unsigned integers, we can use a technique called "snoob" that lets us enumerate only the integers with a given number of set bits. For example, a brute force search for a bitmask that extracts 7 bits needs up to four billion iterations; one that uses snoob needs only about three million.
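
For reference, here is the classic snoob routine from Hacker's Delight: given an integer, it returns the next larger integer with the same number of set bits.

#include <cstdint>

uint32_t snoob(uint32_t x) {
    uint32_t smallest = x & (~x + 1u);  // isolate the rightmost set bit of x
    uint32_t ripple   = x + smallest;   // the rightmost run of ones becomes a single 1, one position to its left
    uint32_t ones     = x ^ ripple;     // the bits that changed
    ones = (ones >> 2) / smallest;      // right-justify the ones that were dropped
    return ripple | ones;               // recombine
}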

However, even with this optimization, we still face a challenge with the verbs PROPFIND and PROPPATCH, as they share a common 4-character prefix. While we could add a branch to handle this particular case, our preference is to maintain a completely branchless solution. To distinguish them, we exploit the fact that PROPFIND and PROPPATCH have different lengths.

A quick Python check confirms that the first four characters of a verb, together with its length, uniquely identify it.

Python 3.11.3 (main, May 24 2023, 00:00:00) [GCC 12.3.1 20230508 (Red Hat 12.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> verbs = [
    "ACL", "BIND", "CHECKOUT", "CONNECT", "COPY", "DELETE",
    "GET", "HEAD", "LINK", "LOCK", "M-SEARCH", "MERGE",
    "MKACTIVITY", "MKCALENDAR", "MKCOL", "MOVE", "NOTIFY", "OPTIONS",
    "PATCH", "POST", "PROPFIND", "PROPPATCH", "PURGE", "PUT",
    "REBIND", "REPORT", "SEARCH", "SUBSCRIBE", "TRACE", "UNBIND",
    "UNLINK", "UNLOCK", "UNSUBSCRIBE",
]

>>> prefix_and_len = {(len(v), v[:4]) for v in verbs}
>>> len(verbs) == len(prefix_and_len)
True

The length has another nice property. Our function takes a std::string_view as a parameter. A string_view is implemented as a struct holding a pointer to the data and a length. Due to the x86-64 calling convention, the length is already passed in a register (rdi here, with the data pointer in rsi). We can confirm this by examining the assembly code once more:

; Here we implement the check
; if (v.size() < 3 or v.size() > 13)
;       return verb::unknown;
; $rdi is the length of `v`
lea  rax,[rdi-0x3] 			; $rax = v.size() - 3ul
; Notice that the previous subtraction *underflows* if `v.size() < 3`
cmp  rax,0xa				; check if `$rax > 10`. If `v.size() < 3`, `$rax` is a very big number and this check still succeeds
ja   510
[...]

We can simply "mix in" the string length by xor-ing with the output of pext:

uint32_t find_mask() {
    uint32_t best_mask = 0u;
    for (uint32_t mask = 0b1'111'111u; snoob(mask) > mask; mask = snoob(mask)) {
        // Our lookup function. Reads up to the first 4 bytes, does `pext` with mask and then XORs with the size
        auto lam = [=](string_view sv) { 
            uint32_t val = 0u;
            memcpy(&val, sv.data(), min(sv.size(), sizeof(val)));
            return _pext_u32(val, mask) ^ sv.size();
        };

        // VERBS_MAP is a hash map from verb to number
        // `is_uniq` checks that all keys in VERBS_MAP hash to different indices
        if (is_uniq<128ul>(VERBS_MAP, lam)) {
            best_mask = mask;
            break;
        }
    }

    return best_mask; // Will be zero if no good mask is found
}

This runs in ~50ms and quickly finds an appropriate bitmask:

$ ./find_mask 
Found mask: 33687137

The implementation is very straightforward. For simplicity, I generate the lookup table at startup. An alternative would be to use code generation.

static const uint32_t MASK = 33687137u;
uint32_t lut_idx (std::string_view sv) { 
    uint32_t val = 0u;
    memcpy(&val, sv.data(), std::min(sv.size(), sizeof(val)));
    return _pext_u32(val, MASK) ^ sv.size();
};


using lut_elem_t = uint8_t; // could also be `verb` directly
static const std::array<lut_elem_t, 128> LOOKUP_TABLE = []() {
    using namespace boost::beast::http;
    std::array<lut_elem_t, 128> lookup_table {(lut_elem_t) verb::unknown};

    for (unsigned int as_int = 0; as_int < verb_count; ++as_int) {
        std::string_view sv = as_string((verb) as_int);
        lookup_table[lut_idx(sv)] = (lut_elem_t) as_int;
    }

    return lookup_table;
}();

verb
string_to_verb(std::string_view v)
{
    if (v.size() < 3 or v.size() > 13) {
        return verb::unknown;
    }

    verb res = (verb) LOOKUP_TABLE[lut_idx(v)];
    if (v == as_string(res)) {
        return res;
    }
    return verb::unknown;
}

With this change, we see greatly improved results. All branch misses are gone. On the full benchmark, this implementation (dubbed pext) is the second fastest, beaten only by the perfect hash generated by gperf.

ns/op    err%    cyc/op    miss%    all verbs
63.68    0.1%    126.02    21.7%    reference_impl
68.98    0.2%    136.47    10.8%    swar32
29.22    0.4%     57.66     0.6%    perfect_hash
44.21    0.2%     87.50    14.9%    pext_by_len
39.74    0.2%     78.70     0.4%    pext

On the common verbs benchmark the results are less impressive, but still better than the SWAR implementations.

ns/op    err%    cyc/op    miss%    GET/PUT/POST
45.76    0.7%     90.62     9.8%    reference_impl
49.45    0.1%     97.92     3.6%    swar32
38.07    0.9%     75.15     2.7%    perfect_hash
31.65    0.2%     62.65     7.2%    pext_by_len
43.15    0.2%     85.44     1.8%    pext

Here our solution is beaten both by gperf and by the per-length pext solution.

Memcpy woes

Despite the improvements in the previous section, our solution is still dominated by gperf. Can we do better?

Let us profile again with perf, this time focusing on cycles spent on each instruction.

; ...
  0.05         mov    eax,0x4
  1.51         cmp    rdi,rax
  0.01         cmovbe rax,rdi                      ; Compute min(sv.size(), 4)
  0.04         mov    rbp,rsi
  1.81         mov    DWORD PTR [rsp+0xc],0x0      ; Zero out buf
  0.01         mov    rbx,rdi
  0.01         mov    esi,eax                      ; Put min(sv.size(), 4) into $esi
  0.03         test   eax,eax
               je     45
  2.63         xor    eax,eax
               ; GCC has inlined std::memcpy. This is a for loop that copies the bytes
               ; from the input string one by one into `buf`.
               ; $eax is the counter, $esi is min(sv.size(), 4), $rbp is a pointer to the input string
               ; and $rsp+0xc is a pointer to buf.
  2.35    34:  mov    edx,eax
  2.80         movzx  ecx,BYTE PTR [rbp+rdx*1+0x0]
  2.72         inc    eax
  2.62         mov    BYTE PTR [rsp+rdx*1+0xc],cl
  2.42         cmp    eax,esi
  0.05         jb     34
               ; After we have finished copying the input string into buf, copy buf back into a register
 38.93    45:  mov    eax,DWORD PTR [rsp+0xc]
  0.08         mov    edx,0x2020661
  8.38         pext   eax,eax,edx
  2.96         xor    eax,ebx
 14.18         movzx  r12d,BYTE PTR [rax+0x424620]
; ...

We see that GCC is loading data from memory in a suboptimal way: it has inlined memcpy as a byte-by-byte copy loop. This is likely because we are passing a variable length to memcpy.

We can get GCC to generate better assembly by giving it a hint. We can use a union to load the first three characters, and load the fourth character conditionally on the input length. Reading from a union member other than the last one written is undefined behaviour in C++, but it is explicitly supported by gcc.

uint32_t load_sv(std::string_view sv) {
    union {
        char as_char[4];
        uint32_t res;
    };
    as_char[0] = sv[0];
    as_char[1] = sv[1];
    as_char[2] = sv[2];
    as_char[3] = (sv.size() > 3ul) ? sv[3] : '\0';
    return res;
}

With this change, our implementation is competitive with gperf on both benchmarks.

ns/op    err%    cyc/op    miss%    all verbs
29.22    0.4%     57.66     0.6%    perfect_hash
44.21    0.2%     87.50    14.9%    pext_by_len
39.74    0.2%     78.70     0.4%    pext
29.75    1.4%     58.71     0.5%    pext v2

It is still behind the per-length pext solution on the common verbs benchmark though:

ns/op    err%    cyc/op    miss%    GET/PUT/POST
38.07    0.9%     75.15     2.7%    perfect_hash
31.65    0.2%     62.65     7.2%    pext_by_len
43.15    0.2%     85.44     1.8%    pext
38.67    0.4%     76.47     2.5%    pext v2

Unsafe memcpy optimizations

A look at the assembly shows that we are still using 3 different movs to load the beginning of our string into a register.

; Samples
  8.11         movzx edx,BYTE PTR [rsi+0x2]  ; Load the third byte
  0.38         movzx eax,WORD PTR [rsi]      ; Load the first two bytes
  1.02         shl   edx,0x10
  1.28         or    eax,edx                 ; Combine the first three bytes
  5.70         mov   rbx,rdi
               mov   rbp,rsi
               xor   edx,edx
  2.11         cmp   rdi,0x3                 ; Check if v.size() is bigger than 3
  0.10         je    2c                      ; If v.size() is 3, skip the load of the fourth byte and leave $edx zero
  6.14         movzx edx,BYTE PTR [rsi+0x3]  ; Otherwise load the fourth byte
  6.49    2c:  shl   edx,0x18
  1.05         or    eax,edx                 ; Combine with the fourth byte

Ideally, we would like to load the four bytes with a single mov edx, DWORD PTR [rsi], but we can't. The issue is the following: a string_view is not guaranteed to be null-terminated, so for inputs of length 3 a DWORD read might go past the end of the buffer and touch an unmapped page.

That said, we might be able to do better by borrowing some tricks from glibc.

We've seen that the memcmp implementation only reads beyond the end of the array when the input pointers are far enough from a page boundary. Since we are loading only 4 bytes from one pointer, the page boundary check simplifies to:
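
A sketch of the check, assuming 4 KiB pages (not the post's exact code):

#include <cstdint>

bool load_crosses_page(const char* p) {
    // A 4-byte load starting at `p` stays within p's page unless `p` falls
    // in the last 3 bytes of that page.
    return (reinterpret_cast<std::uintptr_t>(p) & 0xFFF) > 0xFFC;
}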

But we can simplify even further. Notice that, by the time we load our input string into a register, we have already checked that its length is at least three. Therefore the only case in which we could cause a page fault is when the input pointer is exactly 3 bytes before a page boundary.

This further refines the page boundary check to:
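
Again as a sketch, under the same assumptions:

#include <cstdint>

bool needs_slow_path(const char* p) {
    // Bytes p[0..2] are known to be readable (v.size() >= 3), so only the load
    // of p[3] can fault, and only when `p` sits exactly 3 bytes before the end
    // of a page.
    return (reinterpret_cast<std::uintptr_t>(p) & 0xFFF) == 0xFFD;
}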

At first glance, inputs of length three still need extra handling: for them we load a byte beyond the end of the string, which needs to be zeroed out. One way to do so branchlessly is the bzhi instruction:

uint32_t data = _bzhi_u32(load(v.data()), 8ul * v.size());

However, an even better approach is to do nothing. In other words, not masking the extra byte.

To understand why this works, observe that pext discards input bits, but does not mix them. As a result, we can split the output bits of pext into two groups:

  1. bits extracted from the first three bytes of the input (the "prefix");
  2. bits extracted from the fourth byte (the "suffix").

We can also split the inputs into two kinds:

  1. strings of length three, for which the fourth loaded byte is garbage;
  2. strings of length four or more, for which the fourth byte is valid data.

Bits in the prefix are sufficient to distinguish strings of length three by construction. However, the fourth byte only affects bits in the suffix. Therefore it is enough to ensure that the bits extracted from the prefix distinguish strings of length three from keys of length four or more.

That guarantees that altering the suffix bits cannot cause a collision. We can then simply add multiple entries to our lookup table for each string of length three, one for each combination of the suffix bits.

This is true for our mask, and can be enforced by exhaustive checking in our find_mask function.
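
To make this concrete, here is a sketch of how the length-3 entries could be duplicated when building the table (illustrative code: SUFFIX_BITS, index_of and insert_len3 are names I made up, not the post's):

#include <immintrin.h>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

static const uint32_t MASK = 33687137u;                            // 0x2020661
// Positions, in the index space, of the bits that pext extracts from byte 3
// of the input (the "suffix").
static const uint32_t SUFFIX_BITS = _pext_u32(0xFF000000u, MASK);

static uint32_t index_of(uint32_t first_four_bytes, std::size_t len) {
    return _pext_u32(first_four_bytes, MASK) ^ (uint32_t) len;
}

// For a verb of length 3, insert one entry per possible value of the suffix
// bits, so that whatever garbage ends up in the fourth byte, the lookup still
// lands on the right slot.
static void insert_len3(std::array<uint8_t, 128>& lut, std::string_view sv, uint8_t id) {
    uint32_t val = 0;
    std::memcpy(&val, sv.data(), 3);
    uint32_t base = index_of(val, sv.size()) & ~SUFFIX_BITS;
    uint32_t sub = 0;
    do {                                            // enumerate every subset of SUFFIX_BITS
        lut[base | sub] = id;
        sub = (sub - SUFFIX_BITS) & SUFFIX_BITS;
    } while (sub != 0);
}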

With this improvement, we beat gperf on the full benchmark!

ns/op    err%    cyc/op    miss%    all verbs
29.22    0.4%     57.66     0.6%    perfect_hash
44.21    0.2%     87.50    14.9%    pext_by_len
39.74    0.2%     78.70     0.4%    pext
29.75    1.4%     58.71     0.5%    pext v2
26.25    0.7%     51.71     0.0%    pext v3

We also beat pext_by_len on the restricted benchmark.

ns/op    err%    cyc/op    miss%    GET/PUT/POST
38.07    0.9%     75.15     2.7%    perfect_hash
31.65    0.2%     62.65     7.2%    pext_by_len
43.15    0.2%     85.44     1.8%    pext
38.67    0.4%     76.47     2.5%    pext v2
26.06    0.2%     51.51     0.0%    pext v3

Optimizing the final check

Can we do even better? Running the latest version of our benchmark under perf shows that the bottleneck is now memcmp:

$ perf report -M intel --input=pext-v3-cycles-gcc.data
Samples: 118K of event 'cycles:pppu', Event count (approx.): 51634360904
Overhead  Command       Shared Object         Symbol
  29.87%  benchmark_nb  benchmark_nb          [.] ankerl::nanobench::Bench::run<run_benchmark<boost::beast::http:
  29.50%  benchmark_nb  libc.so.6             [.] __memcmp_avx2_movbe
  23.83%  benchmark_nb  benchmark_nb          [.] pext::string_to_verb_v3
  12.06%  benchmark_nb  benchmark_nb          [.] boost::beast::http::as_string
   2.02%  benchmark_nb  benchmark_nb          [.] memcmp@plt

That is, the bottleneck is not hashing our input string view, but checking that our hash has not given us a false positive:

verb
string_to_verb_v3(std::string_view v)
{
    if (v.size() < 3 or v.size() > 13) {
        return verb::unknown;
    }

    verb res = (verb) LOOKUP_TABLE[lut_idx_v3(v)];

    // string_view::operator!= calls `memcmp` under the hood
    if (v != as_string(res)) { // BOTTLENECK!
        return verb::unknown;
    } else {
        return res;
    }
}

We notice several inefficiencies in our false positive check:

  1. Perf shows that memcmp is not getting inlined. Our page alignment check logic is inspired by memcmp's implementation, so it is likely that, were memcmp inlined, the compiler would be able to elide redundant computations.

  2. memcmp has to branch on the size of the input arrays. But we know that our input string size is between 3 and 13 characters. Therefore we can use 16-byte loads and omit all branching.

  3. memcmp needs to check that it is safe to read beyond the end of the array for both inputs. But we control one of the input arrays, and we can simply pad it with nulls.

  4. memcmp has to find the first differing byte; we don't. So we can use vptest instead of vpmovmskb and save one instruction (see the sketch below).

  5. Our lookup table implementation has an inefficient layout. We are using a lookup table of string_views, but a string_view is itself a pair of (pointer, length). This leads to a double indirection on lookup. A better implementation would keep the length and the data side by side, in the same cache line.
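
For illustration, a branchless 16-byte comparison along these lines might look as follows (a sketch, not the post's exact code; it assumes both pointers are safe to read for 16 bytes, i.e. the table entries are null-padded and the input has already passed the page-boundary check):

#include <immintrin.h>
#include <cstddef>

bool equal_prefix16(const char* input, const char* padded_verb, std::size_t len) {
    const __m128i a    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input));
    const __m128i b    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(padded_verb));
    const __m128i diff = _mm_xor_si128(a, b);                 // zero wherever the bytes match

    // Mask that keeps only byte positions < len (len is between 3 and 13).
    const __m128i idx  = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
    const __m128i keep = _mm_cmpgt_epi8(_mm_set1_epi8((char) len), idx);

    // vptest: true iff no byte differs among the first `len` bytes.
    return _mm_testz_si128(diff, keep);
}

Keeping each verb's length and null-padded text side by side in the table also removes the double indirection mentioned in the last point.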

After implementing the changes above, we see a substantial speedup on both benchmarks:

ns/op    err%    cyc/op    miss%    all verbs
29.22    0.4%     57.66     0.6%    perfect_hash
44.21    0.2%     87.50    14.9%    pext_by_len
39.74    0.2%     78.70     0.4%    pext
29.75    1.4%     58.71     0.5%    pext v2
26.25    0.7%     51.71     0.0%    pext v3
22.04    1.1%     43.40     0.0%    pext v4

ns/op    err%    cyc/op    miss%    GET/PUT/POST
38.07    0.9%     75.15     2.7%    perfect_hash
31.65    0.2%     62.65     7.2%    pext_by_len
43.15    0.2%     85.44     1.8%    pext
38.67    0.4%     76.47     2.5%    pext v2
26.06    0.2%     51.51     0.0%    pext v3
21.98    1.0%     43.28     0.0%    pext v4

Interpreting the results

When we began optimizing, we first examined the full benchmark and concentrated our efforts on reducing branch misses. This ultimately improved performance on the full benchmark.

However, on the restricted benchmark branch misses were not as significant an issue. We'd expect the restricted benchmark to be more friendly to a branchy implementation, as fewer cases mean easier-to-predict branches.

Then why did our branchless implementation improve the restricted benchmark, even surpassing pext_by_len?

One way to interpret the results is the following: when using a well-performing perfect hash, hashing is no longer the limiting factor. Rather, memory loads and memcmp become the bottlenecks.

To address this, the pext_by_len implementation branches on the input size and specializes its memory loads for each size.

Our implementation takes a different approach. We specialize our loads and compares to always touch 16 bytes, and branch to a slow path only when the input lies too close to the end of a page.

A branch on pointer alignment is much more predictable than one on input size: assuming the input pointer is effectively random, the slow path is hit only about 15/4096 ≈ 0.37% of the time.

Therefore, the trade-off between the two options is a net win.

Acknowledgments

I want to thank A.Y., N.S., V.P. and in particular B.X.V. for their help in writing and editing this post. I also want to thank Wojciech Muła for his analysis of the problem.

All my code is available here.