Binary search: doing it less wrong
![Page 1: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/1.jpg)
Binary search: doing it less wrong
Paul Khuong
October 30, 2014
Adserver Engineer @ AppNexus
![Page 2: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/2.jpg)
“Binary search is slow”
- “Linear search is faster for small n” (branches)
- “Fancy layouts scale better” (caches)
![Page 3: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/3.jpg)
No branch, no misprediction
Data dependent conditional move.
http://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/
![Page 4: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/4.jpg)
Don’t... look for early matches
    if (*mid == needle) {
        return mid;
    }
![Page 5: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/5.jpg)
Don’t... try to adjust bounds tightly
    if (needle < *mid) {
        len = half;
    } else {
        low = mid + 1;
        len -= half + 1;
    }
G++ STL
![Page 6: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/6.jpg)
Don’t... do both
    if (comparison < 0) {
        high = mid;
    } else if (comparison > 0) {
        low = mid + 1;
    } else {
        return mid;
    }
glibc, FreeBSD libc
![Page 7: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/7.jpg)
Simple binary search
midpoint (n = 5)
|
v
---------------------
| 0 | 1 | 2 | 3 | 4 |
---------------------
|___________|
n’ = 3
    while ((half = n / 2) > 0) {
        mid = low + half;
        low = (*mid < needle) ? mid : low;
        n -= half;
    }
So simple, it’s AVX2-able!
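The loop above leaves out declarations and the empty-array case. A minimal sketch of a complete function might look like the following. The name `simple_search`, the `uint32_t` element type, and the `NULL` return for an empty array are assumptions made for illustration. This version compares with `<=`, which matches the `cmovge` in the compiled loop on the next slide, so it ends on the last element that is less than or equal to the needle (or on the first element when every element is greater). Whether the ternary actually becomes a conditional move depends on the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns a pointer to the last element <= needle
 * in the sorted array base[0..n), or base itself when every element is
 * greater than needle. Returns NULL for an empty array.
 */
const uint32_t *
simple_search(const uint32_t *base, size_t n, uint32_t needle)
{
    const uint32_t *low = base;
    size_t half;

    if (n == 0) {
        return NULL;
    }

    /* Invariant: the answer lies in low[0..n). */
    while ((half = n / 2) > 0) {
        const uint32_t *mid = low + half;

        /* Data-dependent select, no early exit. */
        low = (*mid <= needle) ? mid : low;
        /* Same length update on both paths; the range is not tight. */
        n -= half;
    }

    return low;
}
```

The caller still has to compare `*low` against the needle to tell a hit from a miss; the loop itself never tests for equality.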
![Page 8: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/8.jpg)
Assume a decent compiler
loop:
    lea    (%rdx,%rcx,4), %rdi
    cmp    (%rdi), %esi
    cmovge %rdi, %rdx
    sub    %rcx, %rax
    mov    %rax, %rcx
    shr    %rcx
    jnz    loop
[Figure: instruction schedule across iterations 0 to 2, in two columns: lea, cmp/load, cmov alongside sub, shr.]
![Page 9: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/9.jpg)
Microbenchmark
Implementations:
- branch (STL)
- both (libc)
- early (only early termination)
- simple (cmov)
Input: 32-bit ints (random, ≈ 5% density)
- 8, 16, 32, ..., 1024
- 10, 50, 100, 200, ..., 1000
Report average of 128 lookups (median, 1st/99th percentile)
![Page 10: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/10.jpg)
[Figure: cycles/lookup (0 to 80) against size (n × 32-bit ints, 10 to 1000), one panel per workload (first, last, bimodal, random, intersection), one series per implementation (branch, both, early, simple).]
![Page 11: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/11.jpg)
Caching: n = 2ᵏ (or 2ᵏ − 1) ⇒ aliasing issues
Midpoints:
0x200000
0x100000
0x080000
0x040000
0x020000
0x010000
...
- Run proper microbenchmarks
- Offset “mid” point ((n / 2) + (n / 64))
- 3-way (“ternary”) search
In the wild: Bentley & Saxe dynamisation or 2ᵏ ≤ n < 2ᵏ⁺¹.
http://pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
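One way to read the offset midpoint suggestion is sketched below, under stated assumptions: the function name `offset_search` and the `uint32_t` element type are made up for the example, and the length update is my own adaptation. Because the probe now sits past the middle, the single `n -= half` update from the simple loop would drop elements on the "go left" path, so this sketch selects the new length with a second data-dependent select instead.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: binary search that probes at n/2 + n/64 so that
 * probe addresses for power-of-two sizes are not all multiples of large
 * powers of two. Returns the last element <= needle, or base when every
 * element is greater; NULL for an empty array.
 */
const uint32_t *
offset_search(const uint32_t *base, size_t n, uint32_t needle)
{
    const uint32_t *low = base;

    if (n == 0) {
        return NULL;
    }

    /* Invariant: the answer lies in low[0..n). */
    while (n > 1) {
        /* 1 <= half < n for n >= 2; same as n / 2 when n < 64. */
        size_t half = n / 2 + n / 64;
        const uint32_t *mid = low + half;
        int go_right = (*mid <= needle);

        low = go_right ? mid : low;
        /* Keep low[half..n) when going right, low[0..half) otherwise. */
        n = go_right ? n - half : half;
    }

    return low;
}
```

For arrays shorter than 64 elements the offset term is zero, so the probe positions match those of the plain midpoint search.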
![Page 12: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/12.jpg)
Practical use case: sparse bitmatrix mult + projection
[Figure: Sparse Bit Matrix × Bit Set (Sparse Bit Vector)]
![Page 13: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/13.jpg)
a.k.a. (pre)sorted equijoin
Inner loop:
- branch-free (simple)
- branches (STL)
- unrolled (3ary branch-free)
Reuse results
- roving lower bound
- galloping search
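As an illustration of the roving lower bound, here is a rough sketch of a sorted intersection in which each lookup starts from where the previous one ended. It assumes both inputs are strictly increasing `uint32_t` arrays and that `out` has room for `na` values; the name `intersect_roving` and the inlined branch-free loop are choices made for this example, not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: writes the values common to a[0..na) and
 * b[0..nb) into out and returns how many were written. Both inputs
 * are assumed sorted and strictly increasing.
 */
size_t
intersect_roving(const uint32_t *a, size_t na,
                 const uint32_t *b, size_t nb, uint32_t *out)
{
    const uint32_t *low = b;   /* roving lower bound into b */
    size_t remaining = nb;     /* elements in low[0..remaining) */
    size_t count = 0;

    if (nb == 0) {
        return 0;
    }

    for (size_t i = 0; i < na; i++) {
        const uint32_t *p = low;
        size_t n = remaining;
        size_t half;

        /* Branch-free search restricted to the suffix starting at low. */
        while ((half = n / 2) > 0) {
            const uint32_t *mid = p + half;

            p = (*mid <= a[i]) ? mid : p;
            n -= half;
        }

        if (*p == a[i]) {
            out[count++] = a[i];
        }

        /* Later keys are larger, so nothing before p can match them. */
        remaining -= (size_t)(p - low);
        low = p;
    }

    return count;
}
```

Each search still costs about lg of the remaining length; the roving bound only shrinks the range, which is why a galloping variant is listed as a separate option.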
![Page 14: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/14.jpg)
Gather phase around peak time
![Page 15: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/15.jpg)
Sorted array search: a decent finger search
Let ∆ = kᵢ − kᵢ₋₁
Galloping search: ≈ 2 lg ∆ comparisons
3ary search: ≈ 2 log₃ ∆ D$ misses
Roving search: ≈ lg n D$ misses
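The 2 lg ∆ figure for galloping comes from two phases of roughly lg ∆ comparisons each: doubling steps away from the finger until the probe overshoots, then a binary search inside the last doubling interval. A minimal sketch, assuming `uint32_t` keys and a hypothetical `gallop_search` that takes the finger position and the number of elements from it to the end of the array:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: returns the last element <= needle in the sorted
 * range finger[0..n), or finger itself when every element is greater.
 * Returns NULL for an empty range.
 */
const uint32_t *
gallop_search(const uint32_t *finger, size_t n, uint32_t needle)
{
    size_t lo = 0;
    size_t step = 1;

    if (n == 0) {
        return NULL;
    }

    /* Phase 1: double the step while the probe is still <= needle. */
    while (step < n && finger[step] <= needle) {
        lo = step;
        step *= 2;
    }

    /* Phase 2: the answer lies in finger[lo..hi). */
    size_t hi = (step < n) ? step : n;
    const uint32_t *low = finger + lo;
    size_t len = hi - lo;
    size_t half;

    while ((half = len / 2) > 0) {
        const uint32_t *mid = low + half;

        low = (*mid <= needle) ? mid : low;
        len -= half;
    }

    return low;
}
```

The doubling phase uses an ordinary branch here; it exits after about lg ∆ iterations, so the cost tracks the distance from the finger rather than the array size.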
![Page 16: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/16.jpg)
And some spooky action at a distance
![Page 17: Binary search doing it less wrong](https://reader034.vdocuments.us/reader034/viewer/2022051522/58a2d7321a28ab1f238bd04b/html5/thumbnails/17.jpg)
Sorted arrays work well on contemporary µarch
Don’t be (too) clever:
- Careful with branches
- Avoid cache aliasing/bad benchmarks
- Reuse bounds when repeating searches
Cleverness is useful, but getting simple right goes a long way.