2023-04-27
memcmp is a C standard library function that compares two arrays lexicographically.
It will be familiar to C programmers, and it is often used as a building block for string operations in other languages. For instance, C++ std::string_view comparison uses memcmp under the hood in some implementations.
While it can be implemented in C with a simple for loop, glibc provides an optimized implementation written in assembly. By digging into this implementation, we can learn a lot about low-level optimization.
This post will proceed in four steps. First, I'll provide a quick summary of what memcmp does.
Then we will delve into the assembly specialization for x86-64 AVX2, dividing it into three logical sections:
- the code for small arrays (fewer than 32 bytes),
- the code for medium arrays (up to 256 bytes), and
- the main loop for large arrays (more than 256 bytes).
The section dedicated to small arrays is the longest, as the logic for small arrays is the least straightforward. Such arrays are smaller than a single AVX2 SIMD register, which holds 32 bytes, and handling them usually requires either:
- falling back to a scalar loop, or
- masking the loads so they don't read past the end of the arrays.
We will see how the implementation avoids the need for either.
memcmp
According to cppreference, memcmp:
Reinterprets the objects pointed to by lhs and rhs as arrays of unsigned char and compares the first count bytes of these arrays. The comparison is done lexicographically. The sign of the result is the sign of the difference between the values of the first pair of bytes (both interpreted as unsigned char) that differ in the objects being compared.
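For example, here is a small self-contained check of these semantics:

#include <assert.h>
#include <string.h>

int main(void) {
    assert(memcmp("abcd", "abcd", 4) == 0); // identical arrays compare equal
    assert(memcmp("abcd", "abcf", 4) < 0);  // 'd' (0x64) < 'f' (0x66)
    assert(memcmp("abz", "aba", 2) == 0);   // only the first 2 bytes are compared
    return 0;
}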
A pure C implementation of such a function is very straightforward: we just loop over the arrays until we find a difference. If we do find one, we return the signed difference of the two bytes:
#include <stddef.h>

int memcmp(const void *p1, const void *p2, size_t count) {
    const unsigned char *s1 = p1;
    const unsigned char *s2 = p2;
    for (size_t i = 0; i < count; i++) {
        if (s1[i] != s2[i])
            return s1[i] - s2[i];
    }
    return 0;
}
In the x86-64 System V calling convention, the first arguments are passed in registers. So p1 and p2 will be passed in $rdi and $rsi respectively, while the count will be passed in $rdx.
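For reference, here is the prototype annotated with the register each argument arrives in (the annotations are mine):

#include <stddef.h>

int memcmp(const void *p1,  /* $rdi */
           const void *p2,  /* $rsi */
           size_t count);   /* $rdx */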
Let's now dig into the assembly.
Immediately after the entry point, we branch to a specialized routine that handles arrays smaller than 32 bytes.
cmp rdx,0x20 ; 0x20 is 32 in hexadecimal
jb 2e0 ; If count < 32, jump to the small-array routine
Following the jump, we see some pretty unusual code.
; If count is less than or equal to 1, go to a specialized routine.
2e0: cmp edx,0x1
jbe 360
mov eax,edi
or eax,esi ; $rdi and $rsi hold p1 and p2
and eax,0xfff ; 0xfff is 4095, the page size minus one
cmp eax,0xfe0 ; 0xfe0 is 4064, i.e. 4096 - 32
jg 320 ; Possible page cross: go to a safe specialized routine
To unravel this code, let's consider a simple case: comparing two arrays which are exactly 32 bytes long. We can load each into a YMM register using an unaligned load (vmovdqu), then compare the two registers in a single instruction.
However, if count is less than 32, we still load 32 bytes of data, and we'll then have to "discard" the comparison results for the excess bytes.
Since we are reading beyond the end of the arrays, we need to be careful not to touch an unmapped page and cause a page fault.
To avoid this, we would have to check that the next page is mapped. It is simpler to ensure that the 32-byte loads never cross a page boundary, since the pages the arrays start in are certainly mapped. Pages are always aligned to the page size, so we just need to check that each pointer modulo the page size isn't too close to the end of the page. In math terms:

$p \bmod \mathrm{PAGE\_SIZE} \leq \mathrm{PAGE\_SIZE} - 32$ for $p \in \{p_1, p_2\}$.

We can express both conditions as a single one by taking the maximum of the two residues:

$\max(p_1 \bmod \mathrm{PAGE\_SIZE},\ p_2 \bmod \mathrm{PAGE\_SIZE}) \leq \mathrm{PAGE\_SIZE} - 32$

This has the advantage that the left-hand side can be bounded from above: since the page size is a power of two, $p \bmod \mathrm{PAGE\_SIZE} = p \mathbin{\&} (\mathrm{PAGE\_SIZE} - 1)$, and $\max(a, b) \leq a \mathbin{|} b$, so

$\max(p_1 \bmod \mathrm{PAGE\_SIZE},\ p_2 \bmod \mathrm{PAGE\_SIZE}) \leq (p_1 \mathbin{|} p_2) \mathbin{\&} (\mathrm{PAGE\_SIZE} - 1)$

On x86-64 hardware, only page sizes of 4 KiB, 2 MiB and 1 GiB are allowed. To be safe, we assume the smallest: 4096 bytes. With that in mind, the following condition is enough to prevent a page fault:

$(p_1 \mathbin{|} p_2) \mathbin{\&} 4095 \leq 4096 - 32$

This condition corresponds to the assembly above.
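In C, the check could be sketched as follows (may_cross_page is a hypothetical name; glibc performs this check directly in assembly):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* conservative: the smallest x86-64 page size */

/* Returns true if a 32-byte load from p1 or p2 might cross into the
 * next page and thus potentially fault. */
static bool may_cross_page(const void *p1, const void *p2) {
    uintptr_t a = (uintptr_t)p1;
    uintptr_t b = (uintptr_t)p2;
    /* (a | b) & (PAGE_SIZE - 1) bounds max(a, b) mod PAGE_SIZE from above. */
    return ((a | b) & (PAGE_SIZE - 1)) > PAGE_SIZE - 32;
}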
After loading the two arrays into ymm registers, we need to:
- compare them byte by byte, and
- discard the comparison results for the excess bytes beyond count.
AVX2 doesn't have byte-level masked loads, and loading a byte mask would likely require a lookup table. Therefore, using AVX2 instructions to mask the excess bytes is inefficient.
A better approach is to use vpcmpeqb for the comparison, followed by vpmovmskb to move the comparison result to a general-purpose register, and finally clear the excess bits using bit manipulation tricks.
vpcmpeqb compares each byte of its two inputs, setting the corresponding result byte to -1 if they are equal and to 0 if they differ.
vpmovmskb sets the n-th bit of its destination register to the sign bit of the n-th byte of its input ymm register.
Combining vpcmpeqb and vpmovmskb thus yields an integer where the n-th bit is one if the n-th bytes of the input arrays are equal, and zero otherwise.
However, since memcmp returns 0 if the inputs are equal, we need to invert the output of vpmovmskb. The assembly below does this with inc rather than not: if all 32 bytes match, the mask is 0xFFFFFFFF and adding one wraps it to zero (setting the zero flag for free); otherwise, the trailing ones are flipped to zeros, which is all we need to locate the first difference.
Finally, we need to mask the excess bits. The most straightforward way is to use a shift to create a bitmask, as in this C example:
#include <stdint.h>

uint32_t clear_excess_bits(uint32_t x, unsigned count) {
    // Keep only the lowest `count` bits (count is less than 32 here).
    return x & ((UINT32_C(1) << count) - 1);
}
Fortunately, most Intel and AMD CPUs provide an instruction that does all of the above in a single operation: bzhi, from the BMI2 extension. It has a throughput of 1 instruction per cycle on both Skylake and Zen, which is extremely good.
Also, while BMI2 and AVX2 are separate extensions, every CPU produced so far that supports AVX2 also supports BMI2.
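Using the corresponding intrinsic, the masking step becomes a one-liner (a sketch; the function name is mine):

#include <immintrin.h>
#include <stdint.h>

static uint32_t clear_excess_bits_bzhi(uint32_t x, uint32_t count) {
    /* Zero all bits of x at positions >= count, in a single instruction. */
    return _bzhi_u32(x, count);
}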
This leads to the following assembly:
vmovdqu ymm2,YMMWORD PTR [rsi]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi]
vpmovmskb eax,ymm2
inc eax ; Invert the mask: adding 1 flips the trailing ones to zeros
bzhi edx,eax,edx ; Clear the bits at positions >= count
; `bzhi` sets ZF according to its result.
jne e0 ; Non-zero result: difference found, go to the tzcnt stanza
xor eax,eax ; Arrays are equal: return 0
vzeroupper
ret
To see the entire process on an example: suppose count is 4 and the inputs differ only in their fourth byte. vpcmpeqb yields -1 for the first three bytes (and whatever the out-of-bounds bytes happen to compare to), so the mask from vpmovmskb ends in 0111. Incrementing flips those trailing ones, leaving a mask ending in 1000, and bzhi clears everything from bit 4 upwards, so the result is exactly 0b1000: non-zero, a difference was found.
If, after clearing out the extra bits, edx is not zero, we found a difference. In particular, if the first $n$ bytes are identical, edx will contain $n$ zeros followed by a 1 (counting from the least significant bit).
Fortunately, BMI contains another useful instruction that helps us: tzcnt, which counts the number of trailing zeros in a register.
Now that we have the index of the first differing byte, we load the two bytes and take their signed difference to obtain the result.
e0: tzcnt eax,eax ; Index of the first differing byte
movzx ecx,BYTE PTR [rsi+rax*1] ; Load p2[index]
movzx eax,BYTE PTR [rdi+rax*1] ; Load p1[index]
sub eax,ecx ; Return p1[index] - p2[index]
vzeroupper
ret
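To summarize the small-array path, here is a rough C equivalent using intrinsics (memcmp_small is my name; it assumes 2 <= count <= 32 and that the page-boundary check above has already passed):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static int memcmp_small(const uint8_t *p1, const uint8_t *p2, size_t count) {
    /* Over-reading loads: safe because neither load crosses a page. */
    __m256i a = _mm256_loadu_si256((const __m256i *)p1);
    __m256i b = _mm256_loadu_si256((const __m256i *)p2);
    /* eq has bit n set iff p1[n] == p2[n]. */
    uint32_t eq = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(a, b));
    /* eq + 1 flips the trailing ones; bzhi discards the excess bits. */
    uint32_t neq = _bzhi_u32(eq + 1, (uint32_t)count);
    if (neq == 0)
        return 0; /* the first count bytes are identical */
    unsigned i = (unsigned)_tzcnt_u32(neq); /* index of the first difference */
    return p1[i] - p2[i];
}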
Now that we know that the input arrays are bigger than 32 bytes, there's no need to worry about page faults or about clearing "extra" bits.
We can load, compare and vpmovmskb in 32-byte increments until fewer than 32 bytes remain to compare. The first four steps of this logic are unrolled inline.
cmp rdx,0x40
jbe 264 ; Go to EPILOGUE 1
vmovdqu ymm2,YMMWORD PTR [rsi+0x20]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
vpmovmskb eax,ymm2
inc eax
jne 100 ; Difference found
cmp rdx,0x80
jbe 250 ; Go to EPILOGUE 2
vmovdqu ymm3,YMMWORD PTR [rsi+0x40]
vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
vpmovmskb eax,ymm3
inc eax
jne 120 ; Go to EPILOGUE 3
vmovdqu ymm4,YMMWORD PTR [rsi+0x60]
vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
vpmovmskb ecx,ymm4
inc ecx
jne 15b ; Go to EPILOGUE 4
cmp rdx,0x100
ja 170 ; Go to main loop PROLOGUE
The assembly above can only handle arrays whose length is a multiple of 32. What can we do when that's not the case?
Two options come to mind. First, we can write a scalar loop to handle the remaining bytes. Alternatively, we can reuse the logic that is used for small arrays.
But there's an even better solution: read and test the last 32 bytes of the array. By doing this, we eliminate the need to read beyond the end of the array. Thus, unlike the small-array logic, we read only memory we know is allocated, and in particular, we don't need a page-alignment check.
It's important to note that although this final iteration overlaps the previous ones, the overlap doesn't affect the end result. Our objective is to locate the position of the first difference in the entire array (if any), and we already know that the overlapping bytes contain no difference.
The only place where we need to be careful is when a difference is found: in that case, tzcnt will give us the offset from end_of_array - 32, rather than from begin_of_array + num_iter * 32.
We can adjust the final stanza to account for that:
; EPILOGUE 1
264: vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x20]
vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx*1-0x20]
vpmovmskb eax,ymm1
inc eax
jne 2c0 ; If a difference is found, go to the tzcnt stanza
vzeroupper
ret ; No difference, return 0
; ...
2c0: tzcnt eax,eax
; $eax now contains the offset of the first difference from `end_of_array - 32`
add eax,edx ; $edx is the size of the array.
; $rsi is the beginning of the second array
; $rax now contains `size_of_array + offset`
; 0x20 is 32
; So $rsi + $rax is `begin_of_array + size_of_array + offset`
; But `begin_of_array + size_of_array = end_of_array`,
; so $rsi + $rax - 0x20 = end_of_array - 32 + offset,
; which is the position of the first differing byte
movzx ecx,BYTE PTR [rsi+rax*1-0x20]
; Ditto for the first array ($rdi)
movzx eax,BYTE PTR [rdi+rax*1-0x20]
sub eax,ecx
vzeroupper
ret ; Return difference of first unequal bytes
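Translated to rough C, this tail handling looks as follows (a sketch for 32 < count <= 64; neq32 and memcmp_medium are my names):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 iff all 32 bytes of a and b match; otherwise the position
 * of the lowest set bit is the index of the first differing byte. */
static uint32_t neq32(const uint8_t *a, const uint8_t *b) {
    __m256i eq = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)a),
                                   _mm256_loadu_si256((const __m256i *)b));
    return (uint32_t)_mm256_movemask_epi8(eq) + 1;
}

static int memcmp_medium(const uint8_t *p1, const uint8_t *p2, size_t count) {
    uint32_t m = neq32(p1, p2); /* first 32 bytes */
    if (m != 0) {
        unsigned i = (unsigned)_tzcnt_u32(m);
        return p1[i] - p2[i];
    }
    /* Last 32 bytes, overlapping the first chunk if count < 64.
     * The overlapping bytes are known to be equal, so they can't
     * produce a spurious difference. */
    size_t off = count - 32;
    m = neq32(p1 + off, p2 + off);
    if (m != 0) {
        unsigned i = (unsigned)_tzcnt_u32(m) + (unsigned)off;
        return p1[i] - p2[i];
    }
    return 0;
}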
Finally, we reach the main logic, which is used only for arrays bigger than 256 bytes. It is optimized under the assumption that many iterations will run before a difference is found. It consists of a prologue and a main loop.
; PROLOGUE
170: lea rdx,[rdi+rdx*1-0x80] ; rdx = p1 + count - 128: the loop bound
sub rsi,rdi ; Replace p2 with the difference p2 - p1
and rdi,0xffffffffffffffe0 ; Align p1 down to 32 bytes
sub rdi,0xffffffffffffff80 ; Subtract -128, i.e. skip the 128 bytes already compared
; LOOP
180: vmovdqu ymm1,YMMWORD PTR [rsi+rdi*1]
vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi]
vmovdqu ymm2,YMMWORD PTR [rsi+rdi*1+0x20]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
vmovdqu ymm3,YMMWORD PTR [rsi+rdi*1+0x40]
vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
vmovdqu ymm4,YMMWORD PTR [rsi+rdi*1+0x60]
vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
vpand ymm5,ymm2,ymm1
vpand ymm6,ymm4,ymm3
vpand ymm7,ymm6,ymm5 ; Tree reduction
vpmovmskb ecx,ymm7
inc ecx
jne 140 ; Found difference
sub rdi,0xffffffffffffff80 ; Add 32 * 4 to first array ptr
cmp rdi,rdx
jb 180 ; Go to LOOP
The loop optimizations are pretty standard. The loop is unrolled to perform four 32-byte comparisons per iteration, in order to offset the overhead of pointer manipulation and to better exploit instruction-level parallelism.
Also, instead of checking the result of each comparison individually, we bitwise-and them together first and check only the aggregated result.
This exploits the fact that vpand can run on three CPU ports, while vpmovmskb can only run on one.
More interesting is the prologue. There are two optimizations of note here:
The pointer to the second array is replaced by the difference between the second and the first pointer. Loads from the second array can still be expressed using x86-64 base+index memory addressing, e.g.
vmovdqu ymm1,YMMWORD PTR [p1 + (p2 - p1) * 1]
This saves us from having to increment the second pointer inside the loop.
The pointer to the first array is rounded down to a multiple of 32 bytes. This ensures that at least half of the memory loads are aligned, resulting in higher throughput, as sketched below.
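Here is a rough C rendering of the prologue and loop structure (has_diff is my name; on a hit, the real code re-examines the four comparison results to locate the difference, and the final partial block is handled with the overlapping trick described earlier — both omitted here):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the main loop for count > 256. Returns 1 if a difference is
 * found in the blocks it scans, 0 otherwise. */
static int has_diff(const uint8_t *p1, const uint8_t *p2, size_t count) {
    ptrdiff_t delta = p2 - p1;              /* replaces the second pointer */
    const uint8_t *end = p1 + count - 128;  /* loop bound */
    /* Round p1 down to 32 bytes, then skip the 128 bytes already compared. */
    const uint8_t *p = (const uint8_t *)((uintptr_t)p1 & ~(uintptr_t)31) + 128;
    while (p < end) {
        __m256i e1 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)p),            /* aligned */
            _mm256_loadu_si256((const __m256i *)(p + delta)));
        __m256i e2 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 32)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 32)));
        __m256i e3 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 64)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 64)));
        __m256i e4 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 96)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 96)));
        /* Tree reduction: AND the four results, test once. */
        __m256i all = _mm256_and_si256(_mm256_and_si256(e1, e2),
                                       _mm256_and_si256(e3, e4));
        if ((uint32_t)_mm256_movemask_epi8(all) + 1 != 0)
            return 1; /* some byte in this 128-byte block differs */
        p += 128;
    }
    return 0;
}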
We have investigated an optimized implementation of memcmp which makes extensive use of SIMD instructions. We have seen several techniques to:
- compare arrays 32 bytes at a time with AVX2 instructions,
- safely over-read small arrays without risking page faults,
- handle tails with overlapping loads, and
- structure a hot loop for maximum throughput.
For people who want to dig even deeper, I'd recommend looking at the assembly of various glibc routines.
For a reference on SIMD instructions on Intel, with their latencies and throughput, I would recommend the excellent uops.info.
Finally, I want to thank A.Y., B.X.V., C.P., N.S. and V.P. for their invaluable feedback in writing this post.
We used the inequality $\max(a, b) \leq a \mathbin{|} b$.
We used the equality $a \bmod 2^k = a \mathbin{\&} (2^k - 1)$ and the fact that the page size is a power of two.
We used the distributivity property $(a \mathbin{|} b) \mathbin{\&} m = (a \mathbin{\&} m) \mathbin{|} (b \mathbin{\&} m)$.