2023-04-27
memcmp is a C standard library function that compares two arrays lexicographically.
It will be familiar to C programmers, and it is often used as a building block for string operations in other languages. For instance, C++ std::string_view comparison uses memcmp under the hood in some implementations.
While it can be implemented in C with a simple for loop, glibc provides an optimized implementation written in assembly. By digging into this implementation, we can learn a lot about low-level optimization.
This post will proceed in four steps. First, I'll provide a quick summary of what memcmp does.
Then we will delve into the assembly specialization for x86-64 AVX2, dividing it into three logical sections:
- the code for small arrays (fewer than 32 bytes),
- the code for medium arrays (up to 256 bytes), and
- the main loop for large arrays (more than 256 bytes).
The section dedicated to small arrays is the longest, as the logic for small arrays is the least straightforward. Such arrays are smaller than a single AVX2 SIMD register, which holds 32 bytes, and handling them usually requires either:
- falling back to a scalar loop, or
- masking the loads so they don't read past the end of the arrays.
We will see how the implementation avoids the need for either.
memcmp
According to cppreference, memcmp:
Reinterprets the objects pointed to by lhs and rhs as arrays of unsigned char and compares the first count bytes of these arrays. The comparison is done lexicographically. The sign of the result is the sign of the difference between the values of the first pair of bytes (both interpreted as unsigned char) that differ in the objects being compared.
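For example, here is a small self-contained check of these semantics:

#include <assert.h>
#include <string.h>

int main(void) {
    assert(memcmp("abcd", "abcd", 4) == 0); // identical arrays compare equal
    assert(memcmp("abcd", "abcf", 4) < 0);  // 'd' (0x64) < 'f' (0x66)
    assert(memcmp("abz", "aba", 2) == 0);   // only the first 2 bytes are compared
    return 0;
}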
A pure C implementation of such a function is very straightforward: we just loop over the arrays until we find a difference. If we do find one, we return the signed difference of the two bytes:
#include <stddef.h>

int memcmp(const void *p1, const void *p2, size_t count) {
    const unsigned char *s1 = p1;
    const unsigned char *s2 = p2;
    for (size_t i = 0; i < count; i++) {
        if (s1[i] != s2[i])
            return s1[i] - s2[i];
    }
    return 0;
}
In the x86-64 System V calling convention, the first arguments are passed in registers. So p1 and p2 will be passed in $rdi and $rsi respectively, while the count will be passed in $rdx.
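For reference, here is the prototype annotated with the register each argument arrives in (the annotations are mine):

#include <stddef.h>

int memcmp(const void *p1,  /* $rdi */
           const void *p2,  /* $rsi */
           size_t count);   /* $rdx */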
Let's now dig into the assembly.
Immediately after the entry point, we branch to a specialized routine that handles arrays smaller than 32 bytes.
cmp rdx,0x20 ; 0x20 is 32 in hexadecimal
jb 2e0 ; If count < 32, jump to the small-array routine
Following the jump, we see some pretty unusual code.
; If count is less than or equal to 1, go to a specialized routine.
2e0: cmp edx,0x1
jbe 360
mov eax,edi
or eax,esi ; $rdi and $rsi hold p1 and p2
and eax,0xfff ; 0xfff is 4095, the page size minus one
cmp eax,0xfe0 ; 0xfe0 is 4064, i.e. 4096 - 32
jg 320 ; Possible page cross: go to a safe specialized routine
To unravel this code, let's consider a simple case: comparing two arrays which are exactly 32 bytes long. We can load each into a YMM register using an unaligned load (vmovdqu), then compare the two registers in a single instruction.
However, if count is less than 32, we still load 32 bytes of data, and we'll then have to "discard" the comparison results for the excess bytes.
Since we are reading beyond the end of the arrays, we need to be careful not to touch an unmapped page and cause a page fault.
To avoid this, we would have to check that the next page is mapped. It is simpler to ensure that the 32-byte loads never cross a page boundary, since the pages the arrays start in are certainly mapped. Pages are always aligned to the page size, so we just need to check that each pointer modulo the page size isn't too close to the end of the page. In math terms:

$p \bmod \mathrm{PAGE\_SIZE} \leq \mathrm{PAGE\_SIZE} - 32$ for $p \in \{p_1, p_2\}$.

We can express both conditions as a single one by taking the maximum of the two residues:

$\max(p_1 \bmod \mathrm{PAGE\_SIZE},\ p_2 \bmod \mathrm{PAGE\_SIZE}) \leq \mathrm{PAGE\_SIZE} - 32$

This has the advantage that the left-hand side can be bounded from above: since the page size is a power of two, $p \bmod \mathrm{PAGE\_SIZE} = p \mathbin{\&} (\mathrm{PAGE\_SIZE} - 1)$, and $\max(a, b) \leq a \mathbin{|} b$, so

$\max(p_1 \bmod \mathrm{PAGE\_SIZE},\ p_2 \bmod \mathrm{PAGE\_SIZE}) \leq (p_1 \mathbin{|} p_2) \mathbin{\&} (\mathrm{PAGE\_SIZE} - 1)$

On x86-64 hardware, only page sizes of 4 KiB, 2 MiB and 1 GiB are allowed. To be safe, we assume the smallest: 4096 bytes. With that in mind, the following condition is enough to prevent a page fault:

$(p_1 \mathbin{|} p_2) \mathbin{\&} 4095 \leq 4096 - 32$

This condition corresponds to the assembly above.
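In C, the check could be sketched as follows (may_cross_page is a hypothetical name; glibc performs this check directly in assembly):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* conservative: the smallest x86-64 page size */

/* Returns true if a 32-byte load from p1 or p2 might cross into the
 * next page and thus potentially fault. */
static bool may_cross_page(const void *p1, const void *p2) {
    uintptr_t a = (uintptr_t)p1;
    uintptr_t b = (uintptr_t)p2;
    /* (a | b) & (PAGE_SIZE - 1) bounds max(a, b) mod PAGE_SIZE from above. */
    return ((a | b) & (PAGE_SIZE - 1)) > PAGE_SIZE - 32;
}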
After loading the two arrays into ymm registers, we need to:
- compare them byte by byte, and
- discard the comparison results for the excess bytes beyond count.
AVX2 doesn't have byte-level masked loads, and loading a byte mask would likely require a lookup table. Therefore, using AVX2 instructions to mask the excess bytes is inefficient.
A better approach is to use vpcmpeqb for the comparison, followed by vpmovmskb to move the comparison result to a general-purpose register, and finally clear the excess bits using bit manipulation tricks.
vpcmpeqb compares each byte of its two inputs, setting the corresponding result byte to -1 if they are equal and to 0 if they differ.
vpmovmskb sets the n-th bit of its destination register to the sign bit of the n-th byte of its input ymm register.
Combining vpcmpeqb and vpmovmskb thus yields an integer where the n-th bit is one if the n-th bytes of the input arrays are equal, and zero otherwise.
However, since memcmp returns 0 if the inputs are equal, we need to invert the output of vpmovmskb. The assembly below does this with inc rather than not: if all 32 bytes match, the mask is 0xFFFFFFFF and adding one wraps it to zero (setting the zero flag for free); otherwise, the trailing ones are flipped to zeros, which is all we need to locate the first difference.
Finally, we need to mask the excess bits. The most straightforward way is to use a shift to create a bitmask, as in this C example:
#include <stdint.h>

uint32_t clear_excess_bits(uint32_t x, unsigned count) {
    // Keep only the lowest `count` bits (count is less than 32 here).
    return x & ((UINT32_C(1) << count) - 1);
}
Fortunately, most Intel and AMD CPUs provide an instruction that does all of the above in a single operation: bzhi, from the BMI2 extension. It has a throughput of 1 instruction per cycle on both Skylake and Zen, which is extremely good.
Also, while BMI2 and AVX2 are separate extensions, every CPU produced so far that supports AVX2 also supports BMI2.
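Using the corresponding intrinsic, the masking step becomes a one-liner (a sketch; the function name is mine):

#include <immintrin.h>
#include <stdint.h>

static uint32_t clear_excess_bits_bzhi(uint32_t x, uint32_t count) {
    /* Zero all bits of x at positions >= count, in a single instruction. */
    return _bzhi_u32(x, count);
}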
This leads to the following assembly:
vmovdqu ymm2,YMMWORD PTR [rsi]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi]
vpmovmskb eax,ymm2
inc eax ; Invert the mask: adding 1 flips the trailing ones to zeros
bzhi edx,eax,edx ; Clear the bits at positions >= count
; `bzhi` sets ZF according to its result.
jne e0 ; Non-zero result: difference found, go to the tzcnt stanza
xor eax,eax ; Arrays are equal: return 0
vzeroupper
ret
To see the entire process on an example: suppose count is 4 and the inputs differ only in their fourth byte. vpcmpeqb yields -1 for the first three bytes (and whatever the out-of-bounds bytes happen to compare to), so the mask from vpmovmskb ends in 0111. Incrementing flips those trailing ones, leaving a mask ending in 1000, and bzhi clears everything from bit 4 upwards, so the result is exactly 0b1000: non-zero, a difference was found.
If, after clearing out the extra bits, edx is not zero, we found a difference. In particular, if the first $n$ bytes are identical, edx will contain $n$ zeros followed by a 1 (counting from the least significant bit).
Fortunately, BMI contains another useful instruction that helps us: tzcnt, which counts the number of trailing zeros in a register.
Now that we have the index of the first differing byte, we load the two bytes and take their signed difference to obtain the result.
e0: tzcnt eax,eax ; Index of the first differing byte
movzx ecx,BYTE PTR [rsi+rax*1] ; Load p2[index]
movzx eax,BYTE PTR [rdi+rax*1] ; Load p1[index]
sub eax,ecx ; Return p1[index] - p2[index]
vzeroupper
ret
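To summarize the small-array path, here is a rough C equivalent using intrinsics (memcmp_small is my name; it assumes 2 <= count <= 32 and that the page-boundary check above has already passed):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static int memcmp_small(const uint8_t *p1, const uint8_t *p2, size_t count) {
    /* Over-reading loads: safe because neither load crosses a page. */
    __m256i a = _mm256_loadu_si256((const __m256i *)p1);
    __m256i b = _mm256_loadu_si256((const __m256i *)p2);
    /* eq has bit n set iff p1[n] == p2[n]. */
    uint32_t eq = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(a, b));
    /* eq + 1 flips the trailing ones; bzhi discards the excess bits. */
    uint32_t neq = _bzhi_u32(eq + 1, (uint32_t)count);
    if (neq == 0)
        return 0; /* the first count bytes are identical */
    unsigned i = (unsigned)_tzcnt_u32(neq); /* index of the first difference */
    return p1[i] - p2[i];
}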
Now that we know that the input arrays are bigger than 32 bytes, there's no need to worry about page faults or about clearing "extra" bits.
We can load, compare and vpmovmskb in 32-byte increments until fewer than 32 bytes remain to compare. The first four steps of this logic are unrolled inline.
cmp rdx,0x40
jbe 264 ; Go to EPILOGUE 1
vmovdqu ymm2,YMMWORD PTR [rsi+0x20]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
vpmovmskb eax,ymm2
inc eax
jne 100 ; Difference found
cmp rdx,0x80
jbe 250 ; Go to EPILOGUE 2
vmovdqu ymm3,YMMWORD PTR [rsi+0x40]
vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
vpmovmskb eax,ymm3
inc eax
jne 120 ; Go to EPILOGUE 3
vmovdqu ymm4,YMMWORD PTR [rsi+0x60]
vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
vpmovmskb ecx,ymm4
inc ecx
jne 15b ; Go to EPILOGUE 4
cmp rdx,0x100
ja 170 ; Go to main loop PROLOGUE
The assembly above can only handle arrays whose length is a multiple of 32. What can we do when that's not the case?
Two options come to mind. First, we can write a scalar loop to handle the remaining bytes. Alternatively, we can reuse the logic that is used for small arrays.
But there's an even better solution: read and test the last 32 bytes of the array. By doing this, we eliminate the need to read beyond the end of the array. Thus, unlike the small-array logic, we read only memory we know is allocated, and in particular, we don't need a page-alignment check.
It's important to note that although this final iteration overlaps the previous ones, the overlap doesn't affect the end result. Our objective is to locate the position of the first difference in the entire array (if any), and we already know that the overlapping bytes contain no difference.
The only place where we need to be careful is when a difference is found: in that case, tzcnt will give us the offset from end_of_array - 32, rather than from begin_of_array + num_iter * 32.
We can adjust the final stanza to account for that:
; EPILOGUE 1
264: vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x20]
vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx*1-0x20]
vpmovmskb eax,ymm1
inc eax
jne 2c0 ; If a difference is found, go to the tzcnt stanza
vzeroupper
ret ; No difference, return 0
; ...
2c0: tzcnt eax,eax
; $eax now contains the offset of the first difference from `end_of_array - 32`
add eax,edx ; $edx is the size of the array.
; $rsi is the beginning of the second array
; $rax now contains `size_of_array + offset`
; 0x20 is 32
; So $rsi + $rax is `begin_of_array + size_of_array + offset`
; But `begin_of_array + size_of_array = end_of_array`,
; so $rsi + $rax - 0x20 = end_of_array - 32 + offset,
; which is the position of the first differing byte
movzx ecx,BYTE PTR [rsi+rax*1-0x20]
; Ditto for the first array ($rdi)
movzx eax,BYTE PTR [rdi+rax*1-0x20]
sub eax,ecx
vzeroupper
ret ; Return difference of first unequal bytes
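Translated to rough C, this tail handling looks as follows (a sketch for 32 < count <= 64; neq32 and memcmp_medium are my names):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 iff all 32 bytes of a and b match; otherwise the position
 * of the lowest set bit is the index of the first differing byte. */
static uint32_t neq32(const uint8_t *a, const uint8_t *b) {
    __m256i eq = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)a),
                                   _mm256_loadu_si256((const __m256i *)b));
    return (uint32_t)_mm256_movemask_epi8(eq) + 1;
}

static int memcmp_medium(const uint8_t *p1, const uint8_t *p2, size_t count) {
    uint32_t m = neq32(p1, p2); /* first 32 bytes */
    if (m != 0) {
        unsigned i = (unsigned)_tzcnt_u32(m);
        return p1[i] - p2[i];
    }
    /* Last 32 bytes, overlapping the first chunk if count < 64.
     * The overlapping bytes are known to be equal, so they can't
     * produce a spurious difference. */
    size_t off = count - 32;
    m = neq32(p1 + off, p2 + off);
    if (m != 0) {
        unsigned i = (unsigned)_tzcnt_u32(m) + (unsigned)off;
        return p1[i] - p2[i];
    }
    return 0;
}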
Finally, we reach the main logic, which is used only for arrays bigger than 256 bytes. It is optimized under the assumption that many iterations will run before a difference is found. It consists of a prologue and a main loop.
; PROLOGUE
170: lea rdx,[rdi+rdx*1-0x80] ; rdx = p1 + count - 128: the loop bound
sub rsi,rdi ; Replace p2 with the difference p2 - p1
and rdi,0xffffffffffffffe0 ; Align p1 down to 32 bytes
sub rdi,0xffffffffffffff80 ; Subtract -128, i.e. skip the 128 bytes already compared
; LOOP
180: vmovdqu ymm1,YMMWORD PTR [rsi+rdi*1]
vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi]
vmovdqu ymm2,YMMWORD PTR [rsi+rdi*1+0x20]
vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
vmovdqu ymm3,YMMWORD PTR [rsi+rdi*1+0x40]
vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
vmovdqu ymm4,YMMWORD PTR [rsi+rdi*1+0x60]
vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
vpand ymm5,ymm2,ymm1
vpand ymm6,ymm4,ymm3
vpand ymm7,ymm6,ymm5 ; Tree reduction
vpmovmskb ecx,ymm7
inc ecx
jne 140 ; Found difference
sub rdi,0xffffffffffffff80 ; Add 32 * 4 to first array ptr
cmp rdi,rdx
jb 180 ; Go to LOOP
The loop optimizations are pretty standard. The loop is unrolled to perform four 32-byte comparisons per iteration, in order to offset the overhead of pointer manipulation and to better exploit instruction-level parallelism.
Also, instead of checking the result of each comparison individually, we bitwise-and them together first and check only the aggregated result.
This exploits the fact that vpand can run on three CPU ports, while vpmovmskb can only run on one.
More interesting is the prologue. There are two optimizations of note here:
The pointer to the second array is replaced by the difference between the second and the first pointer. Loads from the second array can still be expressed using x86-64 base+index memory addressing, e.g.
vmovdqu ymm1,YMMWORD PTR [p1 + (p2 - p1) * 1]
This saves us from having to increment the second pointer inside the loop.
The pointer to the first array is rounded down to a multiple of 32 bytes. This ensures that at least half of the memory loads are aligned, resulting in higher throughput, as sketched below.
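Here is a rough C rendering of the prologue and loop structure (has_diff is my name; on a hit, the real code re-examines the four comparison results to locate the difference, and the final partial block is handled with the overlapping trick described earlier — both omitted here):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the main loop for count > 256. Returns 1 if a difference is
 * found in the blocks it scans, 0 otherwise. */
static int has_diff(const uint8_t *p1, const uint8_t *p2, size_t count) {
    ptrdiff_t delta = p2 - p1;              /* replaces the second pointer */
    const uint8_t *end = p1 + count - 128;  /* loop bound */
    /* Round p1 down to 32 bytes, then skip the 128 bytes already compared. */
    const uint8_t *p = (const uint8_t *)((uintptr_t)p1 & ~(uintptr_t)31) + 128;
    while (p < end) {
        __m256i e1 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)p),            /* aligned */
            _mm256_loadu_si256((const __m256i *)(p + delta)));
        __m256i e2 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 32)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 32)));
        __m256i e3 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 64)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 64)));
        __m256i e4 = _mm256_cmpeq_epi8(
            _mm256_load_si256((const __m256i *)(p + 96)),
            _mm256_loadu_si256((const __m256i *)(p + delta + 96)));
        /* Tree reduction: AND the four results, test once. */
        __m256i all = _mm256_and_si256(_mm256_and_si256(e1, e2),
                                       _mm256_and_si256(e3, e4));
        if ((uint32_t)_mm256_movemask_epi8(all) + 1 != 0)
            return 1; /* some byte in this 128-byte block differs */
        p += 128;
    }
    return 0;
}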
We have investigated an optimized implementation of memcmp which makes extensive use of SIMD instructions. We have seen several techniques to:
- compare arrays 32 bytes at a time with AVX2 instructions,
- safely over-read small arrays without risking page faults,
- handle tails with overlapping loads, and
- structure a hot loop for maximum throughput.
For people who want to dig even deeper, I'd recommend looking at the assembly of various glibc routines.
For a reference on SIMD instructions on Intel, with their latencies and throughput, I would recommend the excellent uops.info.
Finally, I want to thank A.Y., B.X.V., C.P., N.S. and V.P. for their invaluable feedback in writing this post.
We used the inequality $\max(a, b) \leq a \mathbin{|} b$.
We used the equality $a \bmod 2^k = a \mathbin{\&} (2^k - 1)$ and the fact that the page size is a power of two.
We used the distributivity property $(a \mathbin{|} b) \mathbin{\&} m = (a \mathbin{\&} m) \mathbin{|} (b \mathbin{\&} m)$.