2023-06-11
This post is a follow-up to two posts by Wojciech Muła: one on parsing HTTP verbs, and another on using pext for perfect hashing.
The first post analyzes alternate implementations of a function string_to_verb that translates an HTTP verb, represented by a std::string_view, into an enum.
It explores three different strategies:
- the reference implementation from the boost::beast library;
- a perfect hash generated with the gperf tool;
- SWAR (SIMD Within A Register) techniques.

The SWAR strategy turns out to be the least performant.
The three strategies are tested on two synthetic datasets: one containing all the HTTP verbs, and one containing only the common verbs GET, PUT and POST.
The second post introduces a new strategy. This strategy first branches on the length of the input.
Then it uses pext, an instruction from Intel's BMI2 extension, to extract a few discriminating bits from the input.
Finally, it interprets those bits as an index into a per-length lookup table.
We'll call this new strategy pext_by_len.
This second implementation allows each codepath to know at compile time how much data it needs to load from memory. We will see that this is critical for performance.
Wojciech Muła doesn't test this new strategy on the original problem, but he finds it to perform very well on a range of similar tasks.
In this post I will:
- backport the pext_by_len strategy to the original string_to_verb problem;
- introduce a new strategy based on pext as well, using a global table instead of per-length ones as in pext_by_len (we'll see how some characteristics of pext synergize with the use of SWAR techniques and a global table);
- replace memcmp with a more specialized implementation.

All benchmarks were run on my machine, a ThinkPad laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz.
Here’s what I did to set up the benchmark program:
- I isolated a CPU core with the isolcpus kernel boot parameter.
- I pinned the benchmark process to the isolated core with taskset.

To ensure accurate results and compensate for cache-warming effects, I ran each benchmark approximately 256'000 times. In the tables below, I report both the average timing in nanoseconds and in cycles, and the median absolute percent error across all runs. Additionally, I report the cache miss rate obtained through hardware performance counters, as it will be useful in the analysis.
As a start, I have reproduced the benchmarks on my machine, timing both in nanoseconds and cycles, and added an annotation for cache misses.
Of the strategies in the first post, we see that SWAR implementations lag behind the boost::beast
implementation, while using perfect hashing yields the best performance by far.
The ranking of the results matches that of the original post.
The backported pext_by_len
strategy shines on the three verbs benchmark, and is still the second best on the full benchmark.
We see that pext_by_len
has a lot more cache misses and takes more time on the full benchmark.
This is the first hint that the input size branch is difficult to predict, and that removing it might yield improvements.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
14.91 | 1.6% | 29.22 | 51.9% | noop |
63.68 | 0.1% | 126.02 | 21.7% | reference_impl |
72.16 | 0.1% | 142.79 | 10.6% | swar |
68.98 | 0.2% | 136.47 | 10.8% | swar32 |
67.39 | 0.2% | 133.39 | 10.6% | swar32 v2 |
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
14.77 | 1.2% | 29.10 | 46.5% | noop |
45.76 | 0.7% | 90.62 | 9.8% | reference_impl |
50.24 | 0.1% | 99.48 | 4.3% | swar |
49.45 | 0.1% | 97.92 | 3.6% | swar32 |
47.80 | 0.1% | 94.65 | 4.9% | swar32 v2 |
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
Let's analyze the SWAR 32 strategy. The code is roughly as follows:
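Here is a minimal sketch of the idea (my reconstruction, not the exact code from the original post; the verb enum and the fourcc helper are assumptions of mine):

```cpp
// Sketch of the swar32 idea: copy the input into a zero-padded buffer,
// reinterpret the first four characters as an integer, and switch on it.
#include <cstdint>
#include <cstring>
#include <string_view>

enum class verb { get, put, post, head, patch, /* ... one per verb ... */ unknown };

// Pack the first four characters of a literal into a little-endian uint32_t.
constexpr uint32_t fourcc(const char (&s)[5]) {
    return uint32_t(uint8_t(s[0])) | uint32_t(uint8_t(s[1])) << 8 |
           uint32_t(uint8_t(s[2])) << 16 | uint32_t(uint8_t(s[3])) << 24;
}

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    char buf[16] = {};                          // zero-padded stack buffer
    std::memcpy(buf, v.data(), v.size());       // v.size() <= 13
    uint32_t prefix;
    std::memcpy(&prefix, buf, sizeof(prefix));  // first 4 characters as an integer
    switch (prefix) {                           // lowered by GCC to a decision tree
    case fourcc("GET\0"): return v.size() == 3 ? verb::get : verb::unknown;
    case fourcc("PUT\0"): return v.size() == 3 ? verb::put : verb::unknown;
    case fourcc("POST"):  return v.size() == 4 ? verb::post : verb::unknown;
    case fourcc("HEAD"):  return v.size() == 4 ? verb::head : verb::unknown;
    case fourcc("PATC"):
        return v.size() == 5 && buf[4] == 'H' ? verb::patch : verb::unknown;
    // ... one case per 4-character prefix; verbs sharing a prefix
    // (e.g. PROPFIND / PROPPATCH) are disambiguated inside the same case ...
    default: return verb::unknown;
    }
}
```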
The performance table shows a very high level of cache misses. Given that the switch statement has 33 cases, we suspect it might be the culprit.
To confirm this, we can use Linux's perf
utility:
$ env NANOBENCH_ENDLESS=swar32 timeout 10 perf record -e branches,branch-misses taskset -c 1 ./benchmark_nb
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 3.583 MB perf.data (78219 samples) ]
An initial look at perf
's output doesn't show any particular hotspot.
To see why, let us look at how GCC compiles the switch
statement:
Samples lea rax,[rdi-0x3]
cmp rax,0xa
↓ ja 510
push rbp
mov rdx,rdi
mov rbp,rsi
push rbx
1 mov rbx,rdi
sub rsp,0x18 ; reserve space for `buf`
mov rdi,rsp
mov QWORD PTR [rsp],0x0 ; zero out the first 64 bits of `buf`
→ call memcpy@plt ; copy `v` into `buf`
6 mov eax,DWORD PTR [rsp] ; load the first 4 characters of `buf` into $eax
; This is the body of the switch statement. We compare the first 4 characters of `buf`
; lexicographically, creating an n-ary decision tree.
; Due to endianness, verbs are ordered from the last letter to the first.
31 cmp eax,0x49424e55 ; buf[:4] == "UNBI"
↓ je 4f0 ; Verb must be UNBIND or not a verb. Check the rest of buf
↓ ja d0
1 cmp eax,0x44414548 ; buf[:4] == "HEAD"
↓ je 4e0
117 ↓ ja 90
97 cmp eax,0x43414b4d ; buf[:4] == "MKAC"
↓ je 488
108 ↓ jbe 150
28 cmp eax,0x43454843 ; buf[:4] == "CHEC"
↓ je 3a0
cmp eax,0x43544150 ; buf[:4] == "PATC"
↓ jne 2a0
9 cmp BYTE PTR [rbp+0x4],0x48
47 mov edx,0x1c
↓ jne 180
↓ jmp 109
nop
; ...
The code listing above shows the output of perf
. On the left, under the Samples
column, we see how many branch misses there were for each instruction.
On the right is the assembly code of the string_to_verb function.
We see that the switch statement has been expanded to a series of comparisons.
Interestingly, we see not only jumps on equality (je), but also on greater-than (ja) and less-than-or-equal (jbe).
This suggests that GCC has turned the switch statement into a decision tree. With some post-processing, we could extract the structure of the decision tree and visualize it.
Since the switch statement has been expanded into multiple comparisons, we'll need to post-process the data and sum the branch-miss counts across multiple instructions.
We also see that not all misses are recorded on a jump instruction. For example:
;Samples Instructions ; Comment
; ...
97 cmp eax,0x43414b4d ; buf[:4] == "MKAC"
↓ je 488
108 ↓ jbe 150
; ...
This is a phenomenon known as skew. Namely, since there are multiple instructions in flight in an out-of-order processor, the CPU cannot precisely attribute the branch miss to a single instruction and can be off by an instruction or two.
To account for this, we can use the following heuristic to attribute branch misses to the switch statement:
- Attribute the misses recorded on each instruction to the closest preceding cmp instruction seen in the assembly.
- Group the misses by the kind of cmp instruction. Instructions of the kind cmp eax,0x[...] correspond to our switch statement.

When we do so, we see that more than 40% of the branch misses come from our switch statement:
kind | miss cnt | miss % |
---|---|---|
cmp eax | 1830 | 47.7 % |
mov | 718 | 18.7 % |
cmp DWORD PTR [rsp+0x4] | 642 | 16.7 % |
cmp DWORD PTR [rbp+0x4] | 193 | 5.0 % |
jmp | 149 | 3.9 % |
cmp BYTE PTR [rbp+0x4] | 128 | 3.3 % |
cmp rbx | 54 | 1.4 % |
lea | 49 | 1.3 % |
add | 23 | 0.6 % |
cmp WORD PTR [rbp+0x8] | 23 | 0.6 % |
xor | 19 | 0.5 % |
cmp rax | 3 | 0.1 % |
cmp BYTE PTR [rbp+0x8] | 3 | 0.1 % |
pext
We want to eliminate the switch statement and replace it with a branchless solution, such as a jump table. However, the compiler didn't automatically do this for us because the values we're comparing against are sparse, which would result in a large jump table. A naive jump table of pointers would be 4 megabytes.
But what if we ignored some of the bits in our comparison value? Most of our verbs consist of uppercase ASCII letters. By examining the binary representation of uppercase ASCII letters, we can see that the upper 3 bits of each letter are always the same.
C | Binary | C | Binary |
---|---|---|---|
A | 0b01000001 | N | 0b01001110 |
B | 0b01000010 | O | 0b01001111 |
C | 0b01000011 | P | 0b01010000 |
D | 0b01000100 | Q | 0b01010001 |
E | 0b01000101 | R | 0b01010010 |
F | 0b01000110 | S | 0b01010011 |
G | 0b01000111 | T | 0b01010100 |
H | 0b01001000 | U | 0b01010101 |
I | 0b01001001 | V | 0b01010110 |
J | 0b01001010 | W | 0b01010111 |
K | 0b01001011 | X | 0b01011000 |
L | 0b01001100 | Y | 0b01011001 |
M | 0b01001101 | Z | 0b01011010 |
Even discarding those 3 bits would reduce our jump table size to 1 kilobyte!
Ideally, we would want to discard even more bits, to maximally compress our jump table. To do this, we need:
- a fast way to extract an arbitrary subset of bits from a 32-bit word;
- a way to choose which bits to extract, so that the extracted bits still uniquely identify each verb.
The first requirement is straightforward. pext
accomplishes this in a single cycle.
The second requirement is a bit more complex. Instead of performing a naive search over all possible 32-bit unsigned integers, we can use a technique called "snoob" that allows us to enumerate only integers with a specific number of set bits.
For example, a brute force search for a bitmask that extracts 7 bits needs up to four billion iterations. One that uses snoob
needs only three million.
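For reference, here is a sketch of the snoob trick, in its classic Hacker's Delight formulation, together with one way to enumerate all masks with exactly 7 bits set:

```cpp
#include <cstdint>

// Given x, return the next larger integer with the same number of set bits
// (the "snoob" function from Hacker's Delight).
uint32_t snoob(uint32_t x) {
    uint32_t smallest = x & -x;        // lowest set bit of x
    uint32_t ripple   = x + smallest;  // clears the lowest run of ones, carries into the next bit
    uint32_t ones     = x ^ ripple;    // the bits that changed
    ones = (ones >> 2) / smallest;     // right-adjust the ones that were cleared
    return ripple | ones;
}

// Example: visit every 32-bit mask with exactly 7 bits set.
// The smallest such mask is 0x7F and the largest is 0xFE000000.
void for_each_7bit_mask() {
    for (uint32_t m = 0x7F; ; m = snoob(m)) {
        // ... try mask m ...
        if (m == 0xFE000000u) break;
    }
}
```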
However, even with this optimization, we still face a challenge with the verbs PROPFIND and PROPPATCH, as they share a common 4-character prefix. While we could add a branch to handle this particular case, our preference is to maintain a completely branchless solution.
To distinguish them we exploit the fact that PROPFIND
and PROPPATCH
have a different length.
A quick Python check confirms that the first four characters of a verb, together with its length, uniquely identify the verb.
Python 3.11.3 (main, May 24 2023, 00:00:00) [GCC 12.3.1 20230508 (Red Hat 12.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> verbs = [
"ACL", "BIND", "CHECKOUT", "CONNECT", "COPY", "DELETE",
"GET", "HEAD", "LINK", "LOCK", "M-SEARCH", "MERGE",
"MKACTIVITY", "MKCALENDAR", "MKCOL", "MOVE", "NOTIFY", "OPTIONS",
"PATCH", "POST", "PROPFIND", "PROPPATCH", "PURGE", "PUT",
"REBIND", "REPORT", "SEARCH", "SUBSCRIBE", "TRACE", "UNBIND",
"UNLINK", "UNLOCK", "UNSUBSCRIBE",
]
>>> prefix_and_len = {(len(v), v[:4]) for v in verbs}
>>> len(verbs) == len(prefix_and_len)
True
The length has another nice property. Our function takes an std::string_view
as a parameter.
A string_view is implemented as a struct holding a pointer to the data and a length.
Due to the x64 calling convention, the length will already be passed in a register (rdi
in this case). We can confirm this by examining the assembly code once more:
; Here we implement the check
; if (v.size() < 3 or v.size() > 13)
; return verb::unknown;
; $rdi is the length of `v`
lea rax,[rdi-0x3] ; $rax = v.size() - 3ul
; Notice that the previous subtraction *underflows* if `v.size() < 3`
cmp rax,0xa ; check if `$rax > 10`. If `v.size() < 3`, `$rax` is a very big number and this check still succeeds
ja 510
[...]
We can simply "mix in" the string length by xor-ing it with the output of pext:
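A sketch of what the mask-search program might look like (my reconstruction, reusing the snoob function from above; the real find_mask may differ):

```cpp
// Search for a 7-bit mask such that pext(first four chars, mask) ^ length
// is unique across all 33 verbs. Compile with -mbmi2 for _pext_u32.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <set>
#include <string_view>
#include <vector>
#include <immintrin.h>

uint32_t snoob(uint32_t x);  // defined in the sketch above

static const std::vector<std::string_view> verbs = {
    "ACL", "BIND", "CHECKOUT", "CONNECT", "COPY", "DELETE",
    "GET", "HEAD", "LINK", "LOCK", "M-SEARCH", "MERGE",
    "MKACTIVITY", "MKCALENDAR", "MKCOL", "MOVE", "NOTIFY", "OPTIONS",
    "PATCH", "POST", "PROPFIND", "PROPPATCH", "PURGE", "PUT",
    "REBIND", "REPORT", "SEARCH", "SUBSCRIBE", "TRACE", "UNBIND",
    "UNLINK", "UNLOCK", "UNSUBSCRIBE",
};

// First four characters of a verb, zero-padded, as a little-endian integer.
uint32_t prefix4(std::string_view v) {
    char buf[4] = {};
    std::memcpy(buf, v.data(), std::min<size_t>(v.size(), 4));
    uint32_t out;
    std::memcpy(&out, buf, sizeof(out));
    return out;
}

int main() {
    for (uint32_t mask = 0x7F; ; mask = snoob(mask)) {
        std::set<uint32_t> seen;
        bool ok = true;
        for (std::string_view v : verbs)
            ok &= seen.insert(_pext_u32(prefix4(v), mask) ^ uint32_t(v.size())).second;
        if (ok) {
            std::printf("Found mask: %u\n", mask);
            return 0;
        }
        if (mask == 0xFE000000u)   // largest mask with 7 bits set: give up
            break;
    }
    return 1;
}
```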
This runs in ~50ms and quickly finds an appropriate bitmask:
$ ./find_mask
Found mask: 33687137
The implementation is very straightforward. For simplicity, I generate the lookup table at startup. An alternative would be to use code generation.
static const uint32_t MASK = ;
uint32_t ;
using lut_elem_t = uint8_t; //verb;
static const std::array<lut_elem_t, > LOOKUP_TABLE = ;
verb
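Fleshing out the skeleton above, a minimal sketch might look like this (the verb enum and VERB_STRINGS, an array mapping each verb to its spelling, are assumptions of mine; MASK is the value printed by find_mask):

```cpp
// Sketch of the first pext-based implementation: hash the first four
// characters with pext, mix in the length, look up a candidate verb, and
// confirm it against the canonical spelling to rule out false positives.
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <immintrin.h>

static const uint32_t MASK = 33687137;             // 0x02020661, 7 bits set

static uint32_t hash(std::string_view v) {
    char buf[4] = {};
    std::memcpy(buf, v.data(), std::min<size_t>(v.size(), 4));
    uint32_t prefix;
    std::memcpy(&prefix, buf, sizeof(prefix));
    return _pext_u32(prefix, MASK) ^ uint32_t(v.size());
}

using lut_elem_t = uint8_t;                         // stores a verb
static std::array<lut_elem_t, 128> LOOKUP_TABLE;    // 2^7 entries, filled at startup

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    verb candidate = verb(LOOKUP_TABLE[hash(v)]);
    // The hash is only perfect on the 33 verbs; an arbitrary input can still
    // land on a valid slot, so confirm against the canonical spelling.
    if (candidate != verb::unknown && VERB_STRINGS[size_t(candidate)] == v)
        return candidate;
    return verb::unknown;
}
```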
With this change, we see greatly improved results. All branch misses are gone.
On the full benchmark, this implementation (dubbed pext) is the second fastest, beaten only by the hash table generated by gperf.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
63.68 | 0.1% | 126.02 | 21.7% | reference_impl |
68.98 | 0.2% | 136.47 | 10.8% | swar32 |
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
On the common-verbs benchmark the results are less impressive, but still improved.
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
45.76 | 0.7% | 90.62 | 9.8% | reference_impl |
49.45 | 0.1% | 97.92 | 3.6% | swar32 |
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
Here our solution is beaten both by gperf
and by the per-length pext
solution.
Despite the improvements in the previous section, our solution is still dominated by gperf
. Can we do better?
Let us profile again with perf
, this time focusing on cycles spent on each instruction.
; ...
0.05 │ mov eax,0x4
1.51 │ cmp rdi,rax
0.01 │ cmovbe rax,rdi ; Compute min(sv.size(), 4);
0.04 │ mov rbp,rsi
1.81 │ mov DWORD PTR [rsp+0xc],0x0 ; Zero out buf
0.01 │ mov rbx,rdi
0.01 │ mov esi,eax ; Put min(sv.size(), 4) into $esi
0.03 │ test eax,eax
│ ↓ je 45
2.63 │ xor eax,eax
; GCC has inlined std::memcpy. This is a for loop that copies the bytes
; from the input string one by one into `buf`
; $eax is the counter, $esi is min(sv.size(), 4), $rbp is a pointer to the input string
; and $rsp+0xc is a pointer to buf
2.35 │34: mov edx,eax
2.80 │ movzx ecx,BYTE PTR [rbp+rdx*1+0x0]
2.72 │ inc eax
2.62 │ mov BYTE PTR [rsp+rdx*1+0xc],cl
2.42 │ cmp eax,esi
0.05 │ ↑ jb 34
; After we have finished copying the input string into buf, copy buf back into a register
38.93 │45: mov eax,DWORD PTR [rsp+0xc]
0.08 │ mov edx,0x2020661
8.38 │ pext eax,eax,edx
2.96 │ xor eax,ebx
14.18 │ movzx r12d,BYTE PTR [rax+0x424620]
; ...
We see that gcc is loading data from memory in a suboptimal way: it has inlined memcpy into a byte-by-byte loop.
This is likely because we are passing a variable length to memcpy.
We can get gcc to generate better assembly by giving it a hint.
We can use a union to load the first three characters, and load the fourth character conditionally on the input length.
This is undefined behaviour in C++, but it is explicitly supported by gcc.
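A sketch of the hinted load (my reconstruction; the author's exact code may differ):

```cpp
// Read the first three characters unconditionally, and the fourth only when
// the input is long enough. Reading a union member other than the last one
// written is UB in standard C++, but gcc documents it as supported.
#include <cstdint>
#include <string_view>

static uint32_t load_prefix(std::string_view v) {   // caller guarantees v.size() >= 3
    union {
        char     bytes[4];
        uint32_t word;
    } buf = {};
    buf.bytes[0] = v[0];
    buf.bytes[1] = v[1];
    buf.bytes[2] = v[2];
    if (v.size() > 3)
        buf.bytes[3] = v[3];                         // fourth byte only if it exists
    return buf.word;                                 // top byte is zero for length-3 inputs
}
```

The rest of the lookup is unchanged: the loaded word is fed to pext and xor-ed with the length.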
With this change, our implementation is competitive with gperf
on both benchmarks.
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
It is still behind the per-length pext
solution on the common verbs benchmark though:
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
A look at the assembly shows that we are still using 3 different movs to load the beginning of our string into a register.
; Samples
8.11 │ movzx edx,BYTE PTR [rsi+0x2] ; Load the third byte
0.38 │ movzx eax,WORD PTR [rsi] ; Load the first two bytes
1.02 │ shl edx,0x10
1.28 │ or eax,edx ; Combine the first three bytes
5.70 │ mov rbx,rdi
│ mov rbp,rsi
│ xor edx,edx
2.11 │ cmp rdi,0x3 ; Check whether v.size() is exactly 3
0.10 │ ↓ je 2c ; If so, skip loading the fourth byte and leave $edx zero
6.14 │ movzx edx,BYTE PTR [rsi+0x3]
6.49 │2c: shl edx,0x18
1.05 │ or eax,edx ; Combine with the fourth byte
Ideally, we would like to load the four bytes with a single mov edx, DWORD PTR [rsi], but we can't.
The issue is the following: a string_view is not guaranteed to be null-terminated. Therefore, for inputs of length 3, a DWORD read would go past the end of the string's buffer, potentially into an unmapped page.
That said, we might be able to do better by borrowing some tricks from glibc
.
We've seen that the memcmp
implementation reads beyond the end of the array when the input pointers are far from a page boundary.
Since we are loading only 4 bytes from one pointer, the page boundary check simplifies to:
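In code, that check might look roughly like this (a sketch of mine, assuming 4 KiB pages):

```cpp
#include <cstdint>

// A 4-byte load starting at p stays within p's page unless p falls in the
// last 3 bytes of a 4 KiB page.
inline bool load4_may_cross_page(const char* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) > 4096u - 4u;
}
```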
But we can simplify even further. Notice that, by the time we load our input string into a register, we have already checked that its length is at least three. Therefore the only case where we could cause a page fault is when the input pointer is exactly 3 bytes before a page boundary.
This further refines the page boundary check to:
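Again as a sketch, with a helper name of my choosing:

```cpp
#include <cstdint>

// We already know v.size() >= 3, so the first three bytes are readable; only
// the fourth byte can land on a new page, and that happens only when p sits
// exactly 3 bytes before a page boundary.
inline bool fourth_byte_may_fault(const char* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) == 4096u - 3u;
}
```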
At first glance, inputs of length three still need extra handling: we load a byte beyond the end of the string, which needs to be zeroed out.
One way to do so branchlessly is the bzhi instruction:
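A sketch of that approach (assuming BMI2 and a hypothetical load4_unchecked helper that performs the raw 4-byte load from v.data()):

```cpp
#include <immintrin.h>   // _bzhi_u32, requires BMI2 (-mbmi2)

uint32_t data = load4_unchecked(v.data());          // may contain one junk byte
// Zero every bit at position >= v.size() * 8: for a length-3 input this
// clears the junk fourth byte, for longer inputs it leaves data untouched.
data = _bzhi_u32(data, uint32_t(v.size()) * 8);
```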
However, an even better approach is to do nothing. In other words, not masking the extra byte.
To understand why this works, observe that pext discards input bits, but does not mix them.
As a result, we can split the output bits of pext into two groups:
- bits extracted from the first three bytes of the input (the prefix);
- bits extracted from the fourth byte (the suffix).

We can also split the inputs into two kinds:
- strings of length three, for which the fourth byte we load is not part of the string;
- strings of length four or more, for which it is.
Bits extracted from the prefix are sufficient to distinguish strings of length three by construction. However, the fourth byte only affects bits in the suffix. Therefore it is enough to ensure that the bits extracted from the prefix distinguish strings of length three from keys of length four or more.
That guarantees that altering the suffix bits cannot cause a collision: we can simply add multiple entries to our lookup table for strings of length three, one for each combination of suffix bits.
This is true for our mask, and can be enforced by exhaustive checking in our find_mask
function.
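Concretely, the table construction might look like this sketch (my own; prefix4, MASK, LOOKUP_TABLE and lut_elem_t follow the earlier sketches, and VERB_STRINGS is an assumed array of the 33 spellings, in the same order as the verb enumerators):

```cpp
#include <cstdint>
#include <string_view>
#include <immintrin.h>

void fill_lookup_table() {
    for (size_t i = 0; i < VERB_STRINGS.size(); ++i) {
        std::string_view s = VERB_STRINGS[i];
        if (s.size() >= 4) {
            LOOKUP_TABLE[_pext_u32(prefix4(s), MASK) ^ uint32_t(s.size())] = lut_elem_t(i);
        } else {
            // Length-3 verb: the fourth loaded byte is junk, so insert one
            // entry per possible junk value. All of them map to the same verb.
            for (uint32_t junk = 0; junk < 256; ++junk) {
                uint32_t word = prefix4(s) | (junk << 24);
                LOOKUP_TABLE[_pext_u32(word, MASK) ^ 3u] = lut_elem_t(i);
            }
        }
    }
}
```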
With this improvement, we beat gperf
on the full benchmark!
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
26.25 | 0.7% | 51.71 | 0.0% | pext v3 |
We also beat pext_by_len
on the restricted benchmark.
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
26.06 | 0.2% | 51.51 | 0.0% | pext v3 |
Can we do even better?
Running the latest version of our benchmark under perf shows that the bottleneck is now memcmp.
That is, the bottleneck is not hashing our input string view, but checking that the hash has not given us a false positive, i.e. the final comparison of the input against the candidate verb's spelling.
We notice several inefficiencies in our false positive check:
- Perf shows that memcmp is not getting inlined. Our page-alignment check logic is inspired by memcmp's implementation, so by inlining memcmp the compiler would likely be able to elide redundant computations.
- memcmp has to branch on the size of the input arrays. But we know that our input string size is between 3 and 13 characters, so we can use 16-byte loads and omit all branching.
- memcmp needs to check that it is safe to read beyond the end of the array for both inputs. But we control one of the input arrays, and we can simply pad it with nulls.
- memcmp has to find the first differing byte; we don't. So we can use vptest instead of vpmovmskb and save one instruction.
- Our lookup table has an inefficient layout: we are using a lookup table of string_views, but a string_view is itself a (pointer, length) pair, which leads to a double indirection on lookup. A better implementation would keep the length and the data side by side, in the same cacheline (see the sketch below).
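Putting these observations together, the lookup and comparison might look like the following sketch (my reconstruction; verb and hash() follow the earlier sketches, and load16 is a hypothetical helper implementing the 16-byte load with the page-boundary slow path described above):

```cpp
// Each table slot keeps the verb tag, the length, the zero-padded spelling
// and a byte mask side by side, all within one cacheline.
#include <cstdint>
#include <string_view>
#include <immintrin.h>   // SSE4.1 for _mm_testz_si128 (vptest)

struct alignas(64) entry {
    uint8_t       tag;        // verb enumerator, verb::unknown for empty slots
    uint8_t       len;        // length of the verb, 0 for empty slots
    unsigned char text[16];   // verb spelling, zero-padded to 16 bytes
    unsigned char mask[16];   // 0xFF for the first `len` bytes, 0x00 elsewhere
};

static entry TABLE[128];      // filled at startup, indexed by the pext-based hash

__m128i load16(const char* p);  // hypothetical: 16-byte load, slow path near a page end

verb string_to_verb(std::string_view v) {
    if (v.size() < 3 || v.size() > 13)
        return verb::unknown;
    const entry& e = TABLE[hash(v)];
    __m128i input = load16(v.data());
    __m128i text  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(e.text));
    __m128i mask  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(e.mask));
    __m128i diff  = _mm_xor_si128(input, text);
    // vptest sets ZF iff (diff & mask) == 0, i.e. the first e.len bytes match;
    // unlike memcmp we never need the position of the first mismatch.
    bool same = _mm_testz_si128(diff, mask) && e.len == v.size();
    return same ? verb(e.tag) : verb::unknown;
}
```

With everything packed into one 64-byte entry, the spelling, its length and the comparison mask all arrive with the single cache line that the table lookup touches anyway.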
After implementing the changes above, we see a substantial speedup on both benchmarks:
ns/op | err% | cyc/op | miss% | all verbs |
---|---|---|---|---|
29.22 | 0.4% | 57.66 | 0.6% | perfect_hash |
44.21 | 0.2% | 87.50 | 14.9% | pext_by_len |
39.74 | 0.2% | 78.70 | 0.4% | pext |
29.75 | 1.4% | 58.71 | 0.5% | pext v2 |
26.25 | 0.7% | 51.71 | 0.0% | pext v3 |
22.04 | 1.1% | 43.40 | 0.0% | pext v4 |
ns/op | err% | cyc/op | miss% | GET/PUT/POST |
---|---|---|---|---|
38.07 | 0.9% | 75.15 | 2.7% | perfect_hash |
31.65 | 0.2% | 62.65 | 7.2% | pext_by_len |
43.15 | 0.2% | 85.44 | 1.8% | pext |
38.67 | 0.4% | 76.47 | 2.5% | pext v2 |
26.06 | 0.2% | 51.51 | 0.0% | pext v3 |
21.98 | 1.0% | 43.28 | 0.0% | pext v4 |
When we began optimizing, we first examined the full benchmark and concentrated our efforts on reducing branch misses. This ultimately improved performance on the full benchmark.
However, when it came to the restricted benchmark, branch misses were not as significant an issue. We'd expect the restricted benchmark to be friendlier to a branchy implementation, as fewer cases mean easier-to-predict branches.
Then why did our branchless implementation improve the restricted benchmark, even surpassing pext_by_len
?
One way to interpret the results is the following: when using a well-performing perfect hash, hashing is no longer the limiting factor; rather, memory loads and memcmp become the bottlenecks.
To address this, the pext_by_len implementation branches on the input size, and specializes its memcpy for each size.
Our implementation takes a different approach. We specialize the comparison to always load and compare 16 bytes, and branch to a slow path only when the 16-byte load could cross a page boundary.
But a branch on pointer alignment is more predictable than one on input size. Assuming the input pointer is a random number, the slow path will be hit approximately 0.37% of the time:
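A back-of-the-envelope estimate (mine, assuming 4 KiB pages and the 16-byte loads described above): the load can cross into the next page only when the pointer falls within the last 15 bytes of a page, which happens with probability 15 / 4096 ≈ 0.37%.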
Therefore, the trade-off between the two options is a net win.
I want to thank A.Y., N.S., V.P. and in particular B.X.V. for their help in writing and editing this post. I also want to thank Wojciech Muła for his analysis of the problem.
All my code is available here.