On lists, cache, algorithms, and microarchitecture
I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest, were also the first ones I implemented. Once I started considering container-specific optimisations, made possible by specialising generic algorithms like for_each, transform, and accumulate, I realised that there is more to it than just the fact that “lists are slow”, and that a closer look at them can be quite an educational experience. After all, there’s a lot of code out there that is not a vectorised loop working on a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.
No matter what we do, lists will remain a hardware unfriendly data structure. However, they do have their uses, and they share some of their properties with other node-based data structures like trees, for example. Lists, being simple, can provide an interesting insight into how the performance of high-level algorithms that we write is affected by the processor microarchitecture, what we can do to improve that, and what the cost of those optimisations can be.
Doubly-linked list
In this article, I’m going to use an intrusive doubly-linked list. We are going to focus on iterating over that list, so insertion, erasure, and other operations that modify it are of little importance. Each element of the list needs a hook that contains pointers to the previous and the next element.
class list_hook {
list_hook* next_;
list_hook* prev_;
private:
list_hook(list_hook* prev, list_hook* next) noexcept
: next_(next), prev_(prev) {}
template<typename T, list_hook T::*> friend class list;
public:
list_hook() = default;
list_hook(list_hook const&) = delete;
list_hook(list_hook&&) = delete;
};
Even though the public interface more or less follows std::list and the list is presented as a linear one, internally it is circular. This simplifies some parts of the implementation, since list_hook::next_ and list_hook::prev_ are never null pointers. Inside the list object there is a root hook which marks the end of the list.
template<typename T, list_hook T::*Hook>
class list {
list_hook root_ = {&root_, &root_};
public:
class iterator {
list_hook* current_;
public:
iterator& operator++() noexcept {
current_ = current_->next_;
return *this;
}
reference operator*() const noexcept {
return container_of<value_type, list_hook>(Hook, *current_);
}
/* ... */
};
iterator begin() noexcept { return iterator(root_.next_); }
iterator end() noexcept { return iterator(&root_); }
/* ... */
};
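The iterator above relies on a container_of helper that recovers the object from the address of its hook; its implementation isn’t shown in this article. Purely as an illustration, and assuming a signature matching the call above, one common way to write such a helper is the offset-of-member trick below; the library’s real implementation may differ.
#include <cstdint>
// Hypothetical sketch of container_of: given a pointer to the hook member and
// a reference to a hook, step back to the enclosing object. It relies on the
// widespread (though technically non-standard) null-pointer offset idiom.
template<typename T, typename Member>
T& container_of(Member T::*member, Member& hook) noexcept {
    // Offset of the hook member within T.
    auto offset = reinterpret_cast<std::uintptr_t>(
        &(static_cast<T*>(nullptr)->*member));
    // Step back from the hook's address to the beginning of the object.
    return *reinterpret_cast<T*>(
        reinterpret_cast<std::uintptr_t>(&hook) - offset);
}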
Nothing unusual here. Most of the observations we are going to make will apply to all doubly-linked list implementations.
Sequential reads
The simplest way to iterate over any range in C++ is to use a range-based for loop. With the iterators implemented as described earlier, GCC1 compiles the following code:
struct value {
intrusive::list_hook hook_;
int value_ = 0;
};
using list_type = intrusive::list<value, &value::hook_>;
void do_something(value const&);
void for_each(list_type const& list) {
for (auto& element : list) {
do_something(element);
}
}
to:
.L3:
mov rdi, rbx
call do_something(value const&)
mov rbx, QWORD PTR [rbx]
cmp rbp, rbx
jne .L3
Clang does the same thing. There’s also no apparent difference between a range-based for loop and std::for_each. The latter, however, has the potential to be specialised with container-specific optimisations.
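To illustrate the shape such a specialisation could take (this is purely hypothetical and not part of the library discussed here), one option is a free function in the intrusive namespace, found by argument-dependent lookup, that takes the container itself and is therefore free to walk the hooks however it likes, for example with prefetching or from both ends, as explored later in this article:
namespace intrusive {
// Hypothetical container-aware overload. The trivial version just forwards
// to a range-based for loop, but it has the freedom to apply list-specific
// tricks internally without changing the callers.
template<typename T, list_hook T::*Hook, typename Fn>
Fn for_each(list<T, Hook> const& l, Fn fn) {
    for (auto& element : l) {
        fn(element);
    }
    return fn;
}
} // namespace intrusive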
Let’s take a closer look at a more concrete example. The following code computes the sum of all the value::value_ members in the list.
using iterator = list_type::const_iterator;
int sum_list(iterator first, iterator last) {
return std::accumulate(first, last, 0,
[] (int a, value const& b) { return a + b.value_; });
}
The compiler doesn’t disappoint:
.L3:
add eax, DWORD PTR [rdi+16]
mov rdi, QWORD PTR [rdi]
cmp rdi, rsi
jne .L3
The generated assembly looks okay, so it is time to see how well such code performs. Before we run the test, let’s first find out what LLVM MCA can tell us about it.
Iterations: 10000
Instructions: 40000
Total Cycles: 50005
Total uOps: 50000
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[1] [2] [3] [4] Instructions:
2 6 0.50 * add eax, dword ptr [rdi + 16]
1 5 0.50 * mov rdi, qword ptr [rdi]
1 1 0.25 cmp rdi, rsi
1 1 0.50 jne .L3
Each iteration consists of 4 instructions which translate to 5 µops. The source of the additional µop is add eax, DWORD PTR [rdi+16], which involves both a memory load and an ALU operation. According to MCA, we can expect, on average, 5 cycles per iteration.
The timeline shows in more detail how instruction-level parallelism works in this case.
[0,0] DeeeeeeER . . . add eax, dword ptr [rdi + 16]
[0,1] DeeeeeE-R . . . mov rdi, qword ptr [rdi]
[0,2] D=====eER . . . cmp rdi, rsi
[0,3] .D=====eER. . . jne .L3
[1,0] .D====eeeeeeER . . add eax, dword ptr [rdi + 16]
[1,1] .D====eeeeeE-R . . mov rdi, qword ptr [rdi]
[1,2] . D========eER . . cmp rdi, rsi
[1,3] . D=========eER. . jne .L3
[2,0] . D========eeeeeeER . add eax, dword ptr [rdi + 16]
[2,1] . D=======eeeeeE-R . mov rdi, qword ptr [rdi]
[2,2] . D============eER . cmp rdi, rsi
[2,3] . D=============eER. jne .L3
D : Instruction dispatched.
e : Instruction executing.
E : Instruction executed.
R : Instruction retired.
= : Instruction already dispatched, waiting to be executed.
- : Instruction executed, waiting to be retired.
Each iteration depends on the result of mov rdi, qword ptr [rdi] from the previous one. The mov and the add in the same iteration are independent, and since the L1 cache supports two loads per cycle, the processor can execute them at the same time. The compare and conditional jump have minimal effect on the performance.
IACA shows similar, though not identical, results:
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
Num Of | Ports pressure in cycles |
Uops | 0 | 1 | 2 - D | 3 - D | 5 | 6 |
---------------------------------------------------------
2^ | 0.2 | 0.3 | 1.0 1.0 | | 0.3 | 0.3 | add eax, dword ptr [rdi+16]
1 | | | | 1.0 1.0 | | | mov rdi, qword ptr [rdi]
1* | | | | | | | cmp rdi, rsi
0*F | | | | | | | jnz 0xfffffffffffffff7
Total Num Of Uops: 4
What MCA didn’t tell us was that cmp and jnz got macro-fused and the actual number of µops per iteration is 4.
0| 0|add eax, dword ptr [rdi+0x10] : | | |
0| 0| TYPE_LOAD (1 uops) :s---deeeew----R-------p |
0| 0| TYPE_OP (1 uops) :A--------sdw----R-------p |
0| 1|mov rdi, qword ptr [rdi] : | | |
0| 1| TYPE_LOAD (1 uops) :s---deeeew------R-------p |
0| 2|cmp rdi, rsi : | | |
0| 2| TYPE_OP (1 uops) :A--------sdw----R-------p |
0| 3|jnz 0xfffffffffffffff7 : | | |
0| 3| TYPE_OP (0 uops) :w---------------R-------p |
1| 0|add eax, dword ptr [rdi+0x10] : | | |
1| 0| TYPE_LOAD (1 uops) :A--------sdeeeew----R-------p |
1| 0| TYPE_OP (1 uops) :A--------------sdw----R-------p
1| 1|mov rdi, qword ptr [rdi] : | | |
1| 1| TYPE_LOAD (1 uops) : A-------sdeeeew------R-------p
1| 2|cmp rdi, rsi : | | |
1| 2| TYPE_OP (1 uops) : A-------------w------R-------p
1| 3|jnz 0xfffffffffffffff7 : | | |
1| 3| TYPE_OP (0 uops) : w--------------------R-------p
[A] – Allocated
[s] – Sources ready
[d] – Dispatched for execution
[e] – Execute
[w] – Writeback
[R] – Retired
[p] – Post Retire
[-] – pending
It’s time now to see what the numbers are if the code is actually executed2 and not just analysed. In this test, all elements of the list are stored sequentially in a contiguous block of memory. That’s the best case and is only a small part of the full picture, but it limits the memory effects and is closest to what the static analysers showed.
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
sum_list/1024 1378 ns 1376 ns 508484 ops=744.098M/s
sum_list/4096 6494 ns 6487 ns 107988 ops=631.396M/s
sum_list/16384 27602 ns 27560 ns 25404 ops=594.483M/s
sum_list/32768 55285 ns 55199 ns 12686 ops=593.629M/s
sum_list/65536 110544 ns 110374 ns 6346 ops=593.762M/s
sum_list/131072 220977 ns 220640 ns 3173 ops=594.053M/s
sum_list/262144 448409 ns 447463 ns 1565 ops=585.845M/s
sum_list/524288 901703 ns 899810 ns 777 ops=582.665M/s
sum_list/1048576 2627253 ns 2618203 ns 266 ops=400.495M/s
sum_list/2097152 6143158 ns 6122264 ns 114 ops=342.545M/s
sum_list/4194304 12309066 ns 12266764 ns 57 ops=341.924M/s
sum_list/8388608 24549560 ns 24477472 ns 29 ops=342.707M/s
sum_list/16777216 49105582 ns 48961525 ns 14 ops=342.661M/s
sum_list/33554432 98183344 ns 97894047 ns 7 ops=342.763M/s
The test case parameter is the size of the list. When the dataset fits in the L1 cache, we get around 4 cycles per list element. Once the cache becomes too small, the memory bandwidth comes into play. 342 million 32-bit integers per second is not much, but each list element also contains two 8-byte pointers, and its size needs to be 8-byte aligned. That brings the element size up to 24 bytes. Since the elements are in a contiguous array, the loop progresses through that array at a rate of around 8.2GB/s (24B × 342M/s). That’s still not the maximum memory bandwidth achievable in a single thread, but we are starting to get close to it.
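As a quick sanity check of that arithmetic, the layout of the benchmark element works out as follows on a typical LP64 target (these checks are mine, not part of the benchmark code):
// Two 8-byte pointers in the hook plus a 4-byte int, padded to the hook's
// 8-byte alignment: 16 + 4 + 4 = 24 bytes per element.
static_assert(sizeof(intrusive::list_hook) == 16);
static_assert(sizeof(value) == 24);
static_assert(alignof(value) == 8);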
Now that we know the best we can get from a straightforward list implementation, let’s see just how bad that “best” is. We can do that by comparing the results we’ve got against a data structure that is much more hardware-friendly: an array. I’m using the following test code:
using iterator = std::vector<int>::const_iterator;
int sum_array(iterator first, iterator last) {
return std::accumulate(first, last, 0, std::plus<>{});
}
The core loop that the compiler emits looks as follows:
.L4:
vpaddd ymm1, ymm1, YMMWORD PTR [rax]
add rax, 32
cmp rax, rdx
jne .L4
We can already see the first benefit of an array. The compiler can vectorise the loop. The LLVM MCA result shows how this compares to the same operation performed on the list.
Iterations: 10000
Instructions: 40000
Total Cycles: 13343
Total uOps: 50000
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[1] [2] [3] [4] Instructions:
2 8 0.50 * vpaddd ymm1, ymm1, ymmword ptr [rax]
1 1 0.25 add rax, 32
1 1 0.25 cmp rax, rdx
1 1 0.50 jne .L4
There are still 4 instructions and 5 µops per iteration. The average cost of a single iteration is just 1.33 cycles. Not only that, but each iteration processes eight elements of the array. In other words, instead of, on average, 4 cycles per integer, we now get 0.167.
Let’s see how this looks on the IACA timeline:
0| 0|vpaddd ymm1, ymm1, ymmword ptr [rax] : | |
0| 0| TYPE_LOAD (1 uops) :s---deeeeeew----R-------p
0| 0| TYPE_OP (1 uops) :A----------sdw----R-------p
0| 1|add rax, 0x20 : | |
0| 1| TYPE_OP (1 uops) :sdw---------------R-------p
0| 2|cmp rax, rdx : | |
0| 2| TYPE_OP (1 uops) :Asdw--------------R-------p
0| 3|jnz 0xfffffffffffffff5 : | |
0| 3| TYPE_OP (0 uops) :w-----------------R-------p
1| 0|vpaddd ymm1, ymm1, ymmword ptr [rax] : | |
1| 0| TYPE_LOAD (1 uops) :As--deeeeeew------R-------p
1| 0| TYPE_OP (1 uops) :A-----------sdw----R-------p
1| 1|add rax, 0x20 : | |
1| 1| TYPE_OP (1 uops) : sdw---------------R-------p
1| 2|cmp rax, rdx : | |
1| 2| TYPE_OP (1 uops) : Aw----------------R-------p
1| 3|jnz 0xfffffffffffffff5 : | |
1| 3| TYPE_OP (0 uops) : w-----------------R-------p
A significant difference is that the memory load in one iteration doesn’t depend on the memory load from the previous iteration. While there is a dependency between the vpaddd instructions, it only affects the µop doing the addition, not the memory load. That allows instruction-level parallelism to hide the latency of the load, which is by far the largest, and lets us achieve, on average, 1.33 cycles per iteration.
It is also worth noting that this loop could be further optimised by partially unrolling it and keeping multiple independent addition streams. Doing so would hide the latency of the address increment (rax) and the data addition (ymm1). However, this is not what this article is about, so let’s stick to the simple example.
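For reference only, here is a sketch of what that unrolling could look like in source form; the benchmarks below keep using the plain std::accumulate loop.
// Four independent partial sums remove the single-accumulator dependency,
// so the additions from consecutive iterations can overlap.
int sum_array_unrolled(int const* first, int const* last) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (; last - first >= 4; first += 4) {
        s0 += first[0];
        s1 += first[1];
        s2 += first[2];
        s3 += first[3];
    }
    for (; first != last; ++first) {
        s0 += *first;  // remaining 0-3 elements
    }
    return (s0 + s1) + (s2 + s3);
}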
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------
sum_array/1024 57.8 ns 57.7 ns 11883599 ops=17.7382G/s
sum_array/4096 245 ns 245 ns 3393065 ops=16.7102G/s
sum_array/8192 420 ns 419 ns 1769924 ops=19.5334G/s
sum_array/16384 878 ns 877 ns 796682 ops=18.6718G/s
sum_array/32768 1906 ns 1903 ns 377939 ops=17.2148G/s
sum_array/65536 5961 ns 5953 ns 116709 ops=11.0097G/s
sum_array/131072 14382 ns 14358 ns 48150 ops=9.12886G/s
sum_array/262144 28744 ns 28704 ns 24386 ops=9.13269G/s
sum_array/524288 57481 ns 57401 ns 12196 ops=9.13385G/s
sum_array/1048576 115200 ns 114983 ns 6088 ops=9.1194G/s
sum_array/2097152 232698 ns 232258 ns 3014 ops=9.02941G/s
sum_array/4194304 524663 ns 523159 ns 1329 ops=8.01726G/s
sum_array/8388608 2356814 ns 2348119 ns 301 ops=3.57248G/s
sum_array/16777216 4978426 ns 4957937 ns 141 ops=3.38391G/s
sum_array/33554432 9934662 ns 9899215 ns 71 ops=3.38961G/s
These results are much better than those for the list. Since the core does only a minimal amount of work and the memory bandwidth is the bottleneck in most cases, we can rather easily infer the cache sizes from these results.
As long as we fit in the L1 cache, we get the performance that the static analysis promised. Once memory bandwidth starts to become the limiting factor, the program processes around 3 billion elements per second, which translates to over 12GB/s, close to the maximum rate at which a single core of this CPU can read from DRAM. The list was able to achieve over 8GB/s, but since a significant portion of that was metadata, the actual element processing rate ended up much lower.
Random memory access
So far we have looked at a simple and, perhaps, not very realistic case: small objects stored sequentially. While we were already able to identify some performance problems caused by lists, we have yet to see what is perhaps the most important one: cache misses.
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
sum_list/1024 1379 ns 1378 ns 507422 ops=743.156M/s
sum_list/4096 20155 ns 20134 ns 34726 ops=203.438M/s
sum_list/16384 211483 ns 211150 ns 3317 ops=77.594M/s
sum_list/32768 577084 ns 576205 ns 1214 ops=56.8686M/s
sum_list/65536 1301668 ns 1299602 ns 538 ops=50.4277M/s
sum_list/131072 2745135 ns 2740843 ns 256 ops=47.8218M/s
sum_list/262144 6003913 ns 5990938 ns 117 ops=43.7568M/s
sum_list/524288 13040197 ns 13012240 ns 54 ops=40.2919M/s
sum_list/1048576 49060483 ns 48940961 ns 13 ops=21.4253M/s
sum_list/2097152 152734103 ns 152379010 ns 4 ops=13.7627M/s
sum_list/4194304 368973310 ns 368132055 ns 2 ops=11.3935M/s
sum_list/8388608 808360638 ns 806535445 ns 1 ops=10.4008M/s
sum_list/16777216 1693254531 ns 1689396273 ns 1 ops=9.93089M/s
sum_list/33554432 3470687291 ns 3462788715 ns 1 ops=9.69M/s
These are the results of the benchmark with objects randomly placed in memory. Since each element is at an unpredictable location, the hardware prefetchers aren’t able to hide the memory latency. As long as the data fits in the L1 cache, the results are very similar to the sequential dataset, but as soon as the memory loads need to go to the L2 cache there is a significant degradation. Once DRAM latency comes into play, the results get absolutely horrible.
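The benchmark harness isn’t shown in this article, but one way to reproduce this effect is to keep the elements in a single contiguous allocation and link them into the list in shuffled order, roughly like the sketch below (push_back is assumed to be part of the list interface, which the earlier excerpt doesn’t show):
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>
// Link contiguously stored elements in random order, so that following the
// next_ pointers jumps to unpredictable addresses.
void fill_shuffled(list_type& list, std::vector<value>& storage) {
    std::vector<std::size_t> order(storage.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    for (auto idx : order) {
        list.push_back(storage[idx]);
    }
}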
Prefetch
The tests we have run until now all involved a straightforward arithmetic operation and focused entirely on memory. That’s hardly representative of all possible uses. Let’s see what we get if each iteration is much more expensive. To make reasoning easier, I’m going to introduce more arithmetic operations but no additional memory accesses, even though this may be an oversimplification.
struct value {
intrusive::list_hook hook_;
int value_ = 23;
};
using iterator = list_type::const_iterator;
int is_prime(value const& v) {
for (auto i = 2; i < v.value_; i++) {
if (v.value_ % i == 0) {
return 0;
}
}
return 1;
}
int count_primes(iterator first, iterator last) {
return std::accumulate(first, last, 0,
[] (int a, value const& b) { return a + is_prime(b); });
}
There isn’t much we will be able to do if the list elements are stored sequentially. In those cases, the bottleneck was either the CPU core, which is something is_prime() only makes worse, or the memory bandwidth, which we can’t increase, nor can we reduce our needs: all elements have to be read.
Let’s have a look at the case when objects are at random locations in memory. Unsurprisingly, the results are much lower than before:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024 58141 ns 58079 ns 12042 ops=17.6313M/s
count_primes/4096 232530 ns 232275 ns 3013 ops=17.6342M/s
count_primes/16384 952761 ns 951288 ns 736 ops=17.223M/s
count_primes/32768 1988941 ns 1985711 ns 351 ops=16.5019M/s
count_primes/65536 4064617 ns 4058007 ns 173 ops=16.1498M/s
count_primes/131072 8226710 ns 8213428 ns 85 ops=15.9583M/s
count_primes/262144 16932561 ns 16896077 ns 41 ops=15.5151M/s
count_primes/524288 35003895 ns 34927001 ns 20 ops=15.011M/s
count_primes/1048576 98477086 ns 98256410 ns 7 ops=10.6718M/s
count_primes/2097152 265279864 ns 264683726 ns 3 ops=7.92324M/s
count_primes/4194304 602553289 ns 601206514 ns 1 ops=6.97648M/s
count_primes/8388608 1285006296 ns 1282135197 ns 1 ops=6.54269M/s
count_primes/16777216 2655921573 ns 2650113330 ns 1 ops=6.33075M/s
count_primes/33554432 5405048320 ns 5393033036 ns 1 ops=6.22181M/s
We can try to improve this. What we are dealing with is slow computation combined with slow memory loads. The memory load in one iteration doesn’t depend on the result of the computation from the previous one. As LLVM MCA showed us earlier, instruction-level parallelism attempts to take advantage of that, at least to some extent. However, because in the C++ code the iterator is incremented after the body of the iteration, and the compiler doesn’t change that, the processor’s ability to reorder those operations is limited. What we can do is implement our own accumulate that performs those operations in the order we want.
template<typename Iterator, typename T, typename Function>
T accumulate(Iterator first, Iterator last, T init, Function fn) {
while (first != last) {
// the iterator is advanced, loading the next node pointer, before fn is called
init = fn(init, *first++);
}
return init;
}
There is a small, but measurable improvement.
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024 57882 ns 57820 ns 12086 ops=17.7102M/s
count_primes/4096 232351 ns 232040 ns 3015 ops=17.6522M/s
count_primes/16384 933098 ns 931562 ns 751 ops=17.5877M/s
count_primes/32768 1891233 ns 1888107 ns 371 ops=17.3549M/s
count_primes/65536 3812290 ns 3806098 ns 184 ops=17.2187M/s
count_primes/131072 7660883 ns 7648319 ns 92 ops=17.1374M/s
count_primes/262144 15721946 ns 15687144 ns 45 ops=16.7108M/s
count_primes/524288 32515040 ns 32436334 ns 22 ops=16.1636M/s
count_primes/1048576 89039706 ns 88833419 ns 8 ops=11.8038M/s
count_primes/2097152 237240276 ns 236644783 ns 3 ops=8.86203M/s
count_primes/4194304 533267107 ns 532046120 ns 1 ops=7.88335M/s
count_primes/8388608 1130966829 ns 1128285050 ns 1 ops=7.43483M/s
count_primes/16777216 2333217583 ns 2327657784 ns 1 ops=7.20777M/s
count_primes/33554432 4740416511 ns 4729532527 ns 1 ops=7.09466M/s
Essentially, what we are trying to do here is prefetching the next element. The processor has four hardware prefetchers:
- DCU prefetcher – detects ascending accesses to recently loaded data and fetches the next cache line
- DCU IP-based prefetcher – tracks load instructions looking for regular stride and loads the next anticipated memory location, can handle both ascending and descending access patterns
- Spatial Prefetcher – loads the adjacent cache line to the L2 cache
- Streamer – monitors requests from the L1 caches looking for ascending and descending access patterns and loads the next anticipated memory location
These prefetchers cover sequential accesses and large objects spanning multiple cache lines, but there’s nothing that would help with lists of small objects. We need software prefetching.
There is a family of prefetch instructions that cause a cache line to be fetched into the caches. The different variants allow specifying whether the actual access is going to be a load or a store and what the expected temporal locality is. The benchmarks that we are running are not sophisticated enough to show differences between those variants, but we can see that a prefetch performs better than an explicit preload.
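With GCC and Clang those hints are reachable through the optional arguments of __builtin_prefetch; roughly speaking, on x86 the locality argument selects between prefetcht0/t1/t2 and prefetchnta, although the final instruction choice is up to the compiler:
// __builtin_prefetch(addr, rw, locality); both extra arguments are optional.
// rw:       0 = prefetch for a read (default), 1 = prefetch for a write
// locality: 3 = keep in all cache levels (default), down to 0 = non-temporal
void prefetch_examples(void const* p) {
    __builtin_prefetch(p);        // read, maximum temporal locality
    __builtin_prefetch(p, 0, 0);  // read, non-temporal hint
    __builtin_prefetch(p, 1, 3);  // write, maximum temporal locality
}
In the accumulate itself, the prefetch targets the node after the current one: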
template<typename Iterator, typename T, typename Function>
T accumulate(Iterator first, Iterator last, T init, Function fn) {
while (first != last) {
__builtin_prefetch(first.current_->next_);
init = fn(init, *first);
first++;
}
return init;
}
The performance continues improving.
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024 58509 ns 58432 ns 11968 ops=17.5247M/s
count_primes/4096 233916 ns 233654 ns 2996 ops=17.5302M/s
count_primes/16384 936182 ns 934700 ns 749 ops=17.5286M/s
count_primes/32768 1872717 ns 1869688 ns 374 ops=17.5259M/s
count_primes/65536 3745794 ns 3739701 ns 187 ops=17.5244M/s
count_primes/131072 7491735 ns 7479747 ns 94 ops=17.5236M/s
count_primes/262144 14999112 ns 14966690 ns 47 ops=17.5152M/s
count_primes/524288 30253345 ns 30185697 ns 23 ops=17.3688M/s
count_primes/1048576 68201001 ns 68038354 ns 10 ops=15.4115M/s
count_primes/2097152 171015642 ns 170615830 ns 4 ops=12.2917M/s
count_primes/4194304 383902949 ns 383021166 ns 2 ops=10.9506M/s
count_primes/8388608 825511961 ns 823440345 ns 1 ops=10.1873M/s
count_primes/16777216 1716217939 ns 1712077835 ns 1 ops=9.79933M/s
count_primes/33554432 3509215001 ns 3500849170 ns 1 ops=9.58466M/s
Order guarantees
Everything we have considered and done so far was at a low, hardware level. It is time to take a look at the actual algorithm and see if it can be changed to improve the performance.
std::for_each and std::accumulate guarantee the order in which they iterate over the range. Often the user doesn’t need that, and the guarantee becomes an unnecessary limitation. In reasonably simple cases, the compiler can see that the binary operation is associative and commutative and optimise the code accordingly. We saw that happen in the integer array sum case, where the compiler chose to use the vectorised vpaddd.
Unfortunately, we can’t rely on the optimiser to do similar optimisations for more complicated data structures. C++17 solved this problem by introducing std::reduce, which behaves just like std::accumulate but doesn’t specify the order in which it applies the provided binary operation.
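For the array case that relaxed guarantee is available directly from the standard library (plain C++17, nothing specific to this article’s code):
#include <numeric>
#include <vector>
// std::reduce may apply the binary operation in any order and grouping,
// which is what permits vectorised and otherwise reordered implementations.
int sum_array_reduce(std::vector<int> const& v) {
    return std::reduce(v.begin(), v.end(), 0);
}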
Let’s take advantage of that. For a doubly-linked list, the lack of any order guarantees means that we can visit the elements of the list in two entirely independent sequences: one starting from the beginning and one starting from the end. Since the processor and memory can handle more than one cache miss at a time, we can expect noticeably better benchmark results.
template<typename Iterator, typename T, typename Function>
T reduce(Iterator first, Iterator last, T init, Function fn) {
T a = init;
T b{}; // <- not quite right, but let's keep it simple
while (first != last) {
auto current = first;
first++;
__builtin_prefetch(first.current_->next_);
a = fn(a, *current);
if (first == last) {
break;
}
--last;
current = last;
__builtin_prefetch(last.current_->prev_);
b = fn(b, *current);
}
return a + b;
}
The results look promising. The additional complexity penalises the cases with smaller lists, however.
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024 61918 ns 61836 ns 11312 ops=16.5599M/s
count_primes/4096 247573 ns 247294 ns 2830 ops=16.5633M/s
count_primes/16384 992230 ns 990653 ns 707 ops=16.5386M/s
count_primes/32768 1986464 ns 1983268 ns 353 ops=16.5222M/s
count_primes/65536 3974927 ns 3967599 ns 176 ops=16.5178M/s
count_primes/131072 7948909 ns 7936133 ns 88 ops=16.5159M/s
count_primes/262144 15914174 ns 15880084 ns 44 ops=16.5077M/s
count_primes/524288 32182869 ns 32110430 ns 22 ops=16.3277M/s
count_primes/1048576 69216427 ns 69057303 ns 10 ops=15.1841M/s
count_primes/2097152 152311285 ns 151964885 ns 5 ops=13.8002M/s
count_primes/4194304 314740429 ns 314020894 ns 2 ops=13.3568M/s
count_primes/8388608 641017755 ns 639543661 ns 1 ops=13.1166M/s
count_primes/16777216 1293884116 ns 1290747747 ns 1 ops=12.9981M/s
count_primes/33554432 2599582839 ns 2593566559 ns 1 ops=12.9376M/s
It’s also worth noting that while prefetching helped only when there was also a fairly expensive compute operation, iterating over the list from both ends helps even if the binary operation is just a simple addition.
The reason is that one of the performance problems we identified earlier was the dependency between iterations: the processor needs to read from the current object to get the address of the next one. With two independent chains of objects, instruction-level parallelism has more opportunities to hide latencies.
Since we have already started changing high-level details, we may also want to consider making slight modifications to the data structure itself. If the performance of reduce-like operations is important, it may be worthwhile to keep a separate array of pointers to the list elements, in an unspecified order. When an element is inserted into the list, its pointer is appended to the array. On erase, the pointer to the removed element is swapped with the last pointer in the array (a sketch of that bookkeeping follows the reduce implementation below). There are several trade-offs involved here: an increased memory footprint, including a larger hook, and more expensive insert and erase. In exchange, we can implement reduce in the following way:
template<typename T, typename Function>
T reduce(std::vector<value*> const& v, T init, Function&& fn) {
auto current = v.data();
// prefetch distance of 4 pointers; near the end this reads slightly past the
// array, which is fine for this sketch but a real implementation should avoid
auto prefetch = v.data() + 4;
auto end = v.data() + v.size();
while (current != end) {
__builtin_prefetch(*prefetch++);
init = fn(init, **current++);
}
return init;
}
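As mentioned above, the side array has to be maintained on every insert and erase. The sketch below shows one hypothetical way of doing that; the extra index_ field in value, as well as push_back and iterator_to in the list interface, are my assumptions rather than parts of the library shown earlier.
#include <cstddef>
#include <vector>
// Hypothetical wrapper combining the list with the pointer array. Each
// element remembers its slot in the array so that erase can swap-with-last
// in O(1); that remembered slot is what grows the element size.
struct indexed_list {
    list_type list;
    std::vector<value*> index;  // pointers to all elements, unspecified order

    void insert(value& v) {
        v.index_ = index.size();
        index.push_back(&v);
        list.push_back(v);
    }

    void erase(value& v) {
        index[v.index_] = index.back();      // fill the hole with the last pointer
        index[v.index_]->index_ = v.index_;  // fix up its remembered slot
        index.pop_back();
        list.erase(list.iterator_to(v));
    }
};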
If fn is integer addition, the reduce above gets compiled to:
.L14:
mov rcx, QWORD PTR [rdx+32]
add rdx, 8
prefetcht0 [rcx]
mov rcx, QWORD PTR [rdx-8]
add eax, DWORD PTR [rcx+16]
cmp rsi, rdx
jne .L14
What we have gained is that the only dependency between iterations is rdx, and the only instruction that modifies its value is the cheap add rdx, 8. That can significantly improve the performance by enabling much better instruction-level parallelism than any other list-based solution we’ve tried before, as the LLVM MCA report below shows. On the other hand, we get an additional memory load and a prefetch in each iteration, which is not free.
Iterations: 10000
Instructions: 70000
Total Cycles: 20013
Total uOps: 80000
[0,0] DeeeeeER . . . mov rcx, qword ptr [rdx + 32]
[0,1] DeE----R . . . add rdx, 8
[0,2] D=====eeeeeER . . prefetcht0 byte ptr [rcx]
[0,3] D=eeeeeE----R . . mov rcx, qword ptr [rdx - 8]
[0,4] .D=====eeeeeeER. . add eax, dword ptr [rcx + 16]
[0,5] .DeE----------R. . cmp rsi, rdx
[0,6] .D=eE---------R. . jne .L14
[1,0] . DeeeeeE-----R. . mov rcx, qword ptr [rdx + 32]
[1,1] . DeE---------R. . add rdx, 8
[1,2] . D=====eeeeeER. . prefetcht0 byte ptr [rcx]
[1,3] . D=eeeeeE----R. . mov rcx, qword ptr [rdx - 8]
[1,4] . D=====eeeeeeER . add eax, dword ptr [rcx + 16]
[1,5] . DeE----------R . cmp rsi, rdx
[1,6] . D=eE---------R . jne .L14
The benchmark results are much better.
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
sum_list/1024 753 ns 752 ns 931223 ops=1.36227G/s
sum_list/4096 4003 ns 3998 ns 176082 ops=1024.41M/s
sum_list/16384 33795 ns 33743 ns 20828 ops=485.547M/s
sum_list/32768 73549 ns 73446 ns 9034 ops=446.15M/s
sum_list/65536 158379 ns 158158 ns 4428 ops=414.37M/s
sum_list/131072 328278 ns 327774 ns 2136 ops=399.886M/s
sum_list/262144 906197 ns 904299 ns 783 ops=289.887M/s
sum_list/524288 3287541 ns 3279609 ns 214 ops=159.863M/s
sum_list/1048576 10340606 ns 10308163 ns 68 ops=101.723M/s
sum_list/2097152 28847559 ns 28760461 ns 24 ops=72.9179M/s
sum_list/4194304 76017404 ns 75794048 ns 9 ops=55.3382M/s
sum_list/8388608 184997683 ns 184460327 ns 4 ops=45.4765M/s
sum_list/16777216 409853275 ns 408730117 ns 2 ops=41.0472M/s
sum_list/33554432 867062224 ns 864723310 ns 1 ops=38.8037M/s
The array-of-pointers reduce algorithm was still using prefetching. The reason is the same as before: if fn performs an expensive computation, prefetching helps ensure that this time is also spent loading subsequent elements into the L1 cache. It can hurt when the binary operation is cheap, though, and that is what’s happening now. Below are the results of the same benchmark, but without the explicit prefetch.
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
sum_list/1024 508 ns 508 ns 1000000 ops=2.01644G/s
sum_list/4096 3170 ns 3167 ns 222224 ops=1.29346G/s
sum_list/16384 29529 ns 29485 ns 23731 ops=555.678M/s
sum_list/32768 64005 ns 63911 ns 10472 ops=512.709M/s
sum_list/65536 134276 ns 134086 ns 5220 ops=488.762M/s
sum_list/131072 276732 ns 276299 ns 2533 ops=474.385M/s
sum_list/262144 863967 ns 862067 ns 814 ops=304.088M/s
sum_list/524288 3259658 ns 3251631 ns 215 ops=161.238M/s
sum_list/1048576 9678259 ns 9646640 ns 72 ops=108.699M/s
sum_list/2097152 25978633 ns 25902456 ns 27 ops=80.9634M/s
sum_list/4194304 71612908 ns 71414814 ns 10 ops=58.7316M/s
sum_list/8388608 178783902 ns 178306598 ns 4 ops=47.046M/s
sum_list/16777216 400214167 ns 399145315 ns 2 ops=42.0329M/s
sum_list/33554432 848662556 ns 846467448 ns 1 ops=39.6405M/s
Now, if we go back to the prime-counting test, things look different. With prefetch:
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024 58373 ns 58308 ns 11982 ops=17.5621M/s
count_primes/4096 230787 ns 230508 ns 3047 ops=17.7695M/s
count_primes/16384 911047 ns 909573 ns 772 ops=18.0128M/s
count_primes/32768 1823311 ns 1820351 ns 385 ops=18.0009M/s
count_primes/65536 3653772 ns 3647734 ns 192 ops=17.9662M/s
count_primes/131072 7314532 ns 7302256 ns 96 ops=17.9495M/s
count_primes/262144 14650767 ns 14618345 ns 48 ops=17.9325M/s
count_primes/524288 29483920 ns 29417635 ns 24 ops=17.8222M/s
count_primes/1048576 65331597 ns 65176522 ns 11 ops=16.0882M/s
count_primes/2097152 141598740 ns 141261338 ns 5 ops=14.8459M/s
count_primes/4194304 293117615 ns 292433879 ns 2 ops=14.3427M/s
count_primes/8388608 595831596 ns 594449906 ns 1 ops=14.1115M/s
count_primes/16777216 1201265965 ns 1198475633 ns 1 ops=13.9988M/s
count_primes/33554432 2413779177 ns 2408014847 ns 1 ops=13.9345M/s
Without prefetch:
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024 60592 ns 60511 ns 11626 ops=16.9225M/s
count_primes/4096 240139 ns 239875 ns 2919 ops=17.0755M/s
count_primes/16384 968054 ns 966523 ns 725 ops=16.9515M/s
count_primes/32768 1962921 ns 1959719 ns 357 ops=16.7208M/s
count_primes/65536 3958932 ns 3951584 ns 177 ops=16.5847M/s
count_primes/131072 7957449 ns 7944047 ns 88 ops=16.4994M/s
count_primes/262144 16322022 ns 16286870 ns 43 ops=16.0954M/s
count_primes/524288 33987005 ns 33912660 ns 21 ops=15.4599M/s
count_primes/1048576 99127792 ns 98909029 ns 7 ops=10.6014M/s
count_primes/2097152 244108086 ns 243574237 ns 3 ops=8.60991M/s
count_primes/4194304 541851480 ns 540651988 ns 1 ops=7.75786M/s
count_primes/8388608 1149760156 ns 1147266232 ns 1 ops=7.31182M/s
count_primes/16777216 2371160344 ns 2365748664 ns 1 ops=7.09172M/s
count_primes/33554432 4818974014 ns 4807875429 ns 1 ops=6.97906M/s
As we can see, explicit software prefetch can be beneficial, but it largely depends on the specifics of the code and the way memory accesses are balanced against computations.
Conclusion
This exploration of ways to optimise doubly-linked lists has shown us several things. The fact that lists perform poorly is hardly a surprise, but depending on the particularities of the dataset there were different reasons for it. When everything fitted in the L1 cache, the problem was the mere fact that std::accumulate on a list needs to do more work than the same operation on an array and can’t be vectorised. Once the dataset grew larger than the cache but the objects were still stored sequentially, memory bandwidth was the limiting factor. With randomly placed objects we saw a significant drop in performance as soon as the processor had to load data from the L2 cache, and things only got worse as the size of the data grew.
Another thing we saw is that many of the changes that can potentially improve performance are really just trade-offs. Sometimes they help, even a lot, but for other workloads and data they do nothing at best and are harmful at worst. Since we have run only simple benchmarks, there were still many phenomena that didn’t affect us at all. There are also other aspects that need consideration, like the readability and maintainability of the code. All this makes optimising generic algorithms harder and explains why more specific solutions can often offer better performance.
It’s still probably a good idea to avoid lists if possible, though.