Avoiding instruction cache misses
Modern processors are quite complex, with many parts having the potential to become a bottleneck. It is relatively easy to reason about the performance of short pieces of code, especially if memory effects are kept to a minimum. Both static analysis tools like LLVM MCA and microbenchmarks can provide a lot of information in such cases. However, the behaviour of the whole program is not just the sum of those small parts. As the code becomes larger and more complex, other effects start appearing. One such potential problem is excessive instruction cache misses.
Every program has different properties, and those large-scale effects will affect each one differently. However, if a program’s job is to execute complex logic on a small amount of data, the instruction cache is likely to become a problem at some point. The actual impact may vary significantly from codebase to codebase, which is why I won’t show any numbers in this article. Consider it just a collection of ideas; it is not easy to tell how much any of them will help a given application.
First, let’s have a quick look at the processor front end. Below is a simplified diagram of how it is arranged in Skylake; the numbers between the units are the per-cycle maxima.
Each cycle the processor fetches up to 16 bytes from the instruction cache, using information from the Branch Prediction Unit to predict the control flow. The pre-decode unit determines instruction lengths and puts up to five of them in the Instruction Queue. From the Instruction Queue, up to 5 instructions (with macro-fusion) are sent each cycle to the decoders. There is one complex decoder that can handle instructions translating to up to 4 µops and 3 simple decoders that can handle only single-µop instructions. In total, all decoders are limited to producing no more than 5 µops each cycle. Instructions that require more than 4 µops go through the Microcode Sequencer ROM, which emits 4 µops per cycle, and while it is active, the decoders are disabled. There is also the Decoded ICache (DSB), which caches decoded µops and can emit up to 6 µops each cycle. All µops, regardless of their source, end up in the Instruction Decode Queue (IDQ). The Loop Stream Detector (LSD) detects small loops and keeps them in the queue, so that no fetches, decodes, or DSB reads are needed for the duration of the loop. The IDQ is the last part of the front end, and the queued µops continue to the back end.
From the instruction cache point of view, the front end has two weaknesses. Firstly, instructions are processed in order, which can severely limit the processor’s ability to hide the latency of cache misses. HyperThreading can make sure that this part of the processor still does some useful work, but it is also the source of the second problem: all resources, including the L1 instruction cache and the µop cache, are shared between the hardware threads.
Modern processors provide various metrics that help monitor their behaviour. However, extracting the relevant data requires a proper approach if it is to be done efficiently. Top-down analysis is invaluable in helping to understand microarchitectural phenomena in large codebases. The idea is to monitor program behaviour with the PMU counters and identify the bottleneck, starting with the major functional parts of the CPU and then digging deeper, narrowing down on the exact source of the problem. It can be done in an automated way by tools like VTune or toplev.
FE Frontend_Bound: 39.48 +- 0.00 % Slots
BE Backend_Bound: 16.19 +- 0.00 % Slots
This category represents fraction of slots where no uops are
being delivered due to a lack of required resources for
accepting new uops in the Backend...
FE Frontend_Bound.Frontend_Latency: 24.92 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 13.45 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 14.45 +- 0.00 % Clocks <==
This metric represents fraction of cycles the CPU was
stalled due to instruction cache misses...
Sampling events: frontend_retired.l2_miss:pp frontend_retired.l1i_miss:pp
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 8.71 +- 0.00 % Clocks
This metric represents fraction of cycles the CPU was
stalled due to instruction TLB misses...
Sampling events: frontend_retired.stlb_miss:pp frontend_retired.itlb_miss:pp
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 8.42 +- 0.00 % Clocks_Est
This metric represents fraction of cycles the CPU was
stalled due to Branch Resteers...
Sampling events: br_misp_retired.all_branches:u
FE Frontend_Bound.Frontend_Bandwidth.MITE: 31.93 +- 0.00 % CoreClocks
This metric represents Core fraction of cycles in which CPU
was likely limited due to the MITE pipeline (Legacy Decode
Pipeline)...
Above is an example of a toplev result. We can see that instruction cache misses were the dominating bottleneck. Unsurprisingly, instruction TLB misses also show up. On the front-end bandwidth side of things, toplev points to the legacy decode pipeline. That makes perfect sense: if the instructions were supplied from the DSB or the LSD, there would be no instruction fetches and no cache misses.
Sometimes, the final summary may not provide sufficient information if there are changes in the code behaviour during the test. When that’s the case, a graph is likely to be a much more helpful way of presenting the results.
Tools like toplev are great for the initial identification of the problem, but once that’s done, what we need is a reliable way of comparing different solutions. Ultimately, the most important metric is the actual performance of the program in a real-life workload. toplev can still be helpful, as it shows the balance between different performance-limiting factors. perf stat, which shows performance counter statistics, can also be useful. The event most relevant for us is L1-icache-load-misses, though there are more model-specific events that may be of interest.
Now that we know how to diagnose excessive instruction cache misses, let’s see what can be done to deal with this problem.
Avoiding work
If the number of executed instructions is the problem, the most obvious solution is to try to reduce that number. Obviously, that’s much easier said than done, but there are some common patterns for dealing with this issue. One example would be prepared statements in databases. The general idea is that if a client knows it will send requests that have some commonality, it can tell the database engine early that the requests are going to match specific templates. This information allows the server to do as much work as possible during the preparation stage, thus reducing the amount of logic that needs to be executed for each individual request.
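As a toy sketch of that split (the PreparedQuery structure and the ‘?’ placeholder handling are invented for this example; real preparation would involve parsing, validation and query planning rather than string splitting), the idea is to pay the expensive part once per template and keep the per-request logic small:

#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical "plan": the query template split at '?' placeholders, so that
// execution only needs to interleave the bound arguments.
struct PreparedQuery {
    std::vector<std::string> fragments;
};

// Preparation: done once per template and allowed to be relatively expensive.
PreparedQuery prepare(std::string_view templ) {
    PreparedQuery query;
    std::string current;
    for (char c : templ) {
        if (c == '?') {
            query.fragments.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    query.fragments.push_back(current);
    return query;
}

// Execution: the per-request logic shrinks to interleaving the arguments.
std::string execute(const PreparedQuery& query, const std::vector<std::string>& args) {
    std::string result = query.fragments.front();
    for (std::size_t i = 0; i < args.size() && i + 1 < query.fragments.size(); ++i) {
        result += args[i];
        result += query.fragments[i + 1];
    }
    return result;
}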
Extracting common patterns doesn’t have to be explicit. A server, or any other kind of application whose actions depend on user input, could attempt to look for repeating patterns and cache some common parts. This is a very vague idea and most likely won’t be easily implementable in a lot of applications, but in some cases it may be quite a natural solution. It also shows the main problem with the “just do less” approach – it is very application-specific. On the plus side, if this can be done, it is likely to help with the overall performance, not just the instruction cache misses.
Another potential problem is making sure that the preparation phase can really do something to help during the execution phase. If that means pre-computing some values, then it’s simple. However, if the only thing the preparation gives us is the knowledge of which code paths are going to be exercised and which branches taken during the execution, it is going to be harder to benefit from it. One option is to have some specialised code for the most common paths; C++ templates may come in handy here. If it is not easy to determine what the most common paths may be, then a just-in-time compiler may be used to generate code in the preparation stage.1
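For instance, a sketch of such specialisation (the Request structure and its option flags are invented for this example): the preparation stage resolves the options once and, from then on, each request runs an instantiation that contains no branches for features known to be disabled.

#include <cstdint>
#include <vector>

// Hypothetical request: the options below stand in for whatever a prepared
// request template lets us know in advance.
struct Request {
    std::vector<std::uint8_t> payload;
};

// One instantiation per combination of options; inside each of them the
// branches on the options disappear at compile time.
template <bool Compressed, bool Checksummed>
std::uint64_t handle(const Request& r) {
    std::uint64_t result = 0;
    if constexpr (Checksummed) {
        for (auto b : r.payload) {
            result += b; // stand-in for checksum verification
        }
    }
    if constexpr (Compressed) {
        result += r.payload.size(); // stand-in for decompression
    }
    return result + r.payload.size();
}

using handler_fn = std::uint64_t (*)(const Request&);

// Preparation stage: resolve the options once and remember the specialised
// handler, so the per-request code stays small.
handler_fn select_handler(bool compressed, bool checksummed) {
    if (compressed) {
        return checksummed ? handle<true, true> : handle<true, false>;
    }
    return checksummed ? handle<false, true> : handle<false, false>;
}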
Batching work
So far, we have been trying to reduce the number of executed instructions by taking advantage of some earlier knowledge to avoid repeating the same work. In other words, we have introduced two stages:
- preparation, which is performed rarely and is, consequently, less performance-critical
- execution, which is done many times and is expected to dominate overall performance
The way this can help with instruction cache misses is that the execution stage, being smaller, is more likely to fit in the instruction cache. Whether this brings any measurable benefits depends highly on the type of the application and how much logic can be moved from the execution to the preparation stage.
We have already split our processing pipeline into preparation and execution stages. If the execution stage can fit in the instruction cache, we are done. Often, however, that’s not the case. What we can do to improve the situation further is to split the execution into more stages. This time the goal is not to reuse work, as it was with the preparation, but to group entities that need to have the same code executed for them. In other words, if the processing pipeline consists of steps A, B and C, the idea is to separate them, add a queue in front of each of those stages, and then cycle through the stages, each time handling multiple elements from the queue. Connections between the stages don’t have to be one-to-one; any directed graph is fine.
In the diagram above, there is one stage that feeds tasks to one of two subsequent stages. This could be, for example, the front end of a database server. The first stage does some initial request processing and then, depending on whether the request is a read or a write, puts it in the appropriate queue. Each stage processes requests in batches: the first request warms up the instruction cache so that the subsequent ones can take advantage of it and hopefully won’t see any instruction cache misses. To make this work well, we need to ensure that each stage fits in the cache and that the control flow doesn’t diverge too much – we won’t benefit from a warmed-up cache if each request takes a different path through the code.
The architecture I’ve just described is very similar to SEDA. For instruction cache purposes, there’s no need to run the execution stages on different threads. That may actually be inadvisable, since the increase in latency caused by excessive communication between threads is one of SEDA’s significant problems. It can be entirely avoided with an asynchronous thread-per-core architecture if tasks are batched per thread. That’s how I did it in Scylla, with quite good results.
Regardless of how batching fits in the overall program architecture, there’s still one crucial issue to watch out for. Unless the latency of an individual task is irrelevant, there needs to be a bound on how long it can stay in the queue. This is a trade-off between throughput and latency, and the right solution to this problem is going to be application-specific.
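A minimal, single-threaded sketch of the idea (the Stage class, its thresholds and the handler signature are all made up for this example, not taken from any particular framework): tasks accumulate in a per-stage queue and are processed in batches, but a batch is also flushed once its oldest task has waited long enough, which is exactly where the throughput/latency trade-off shows up.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// A single stage of the pipeline; poll() is meant to be called from the main
// loop of a thread-per-core design.
template <typename Task>
class Stage {
public:
    using clock = std::chrono::steady_clock;

    Stage(std::function<void(Task&)> handler, std::size_t max_batch, clock::duration max_delay)
        : _handler(std::move(handler)), _max_batch(max_batch), _max_delay(max_delay) {}

    void enqueue(Task task) {
        if (_queue.empty()) {
            _oldest = clock::now();
        }
        _queue.push_back(std::move(task));
    }

    // Runs the stage's code over a whole batch so that the instruction cache
    // stays warm between consecutive tasks, while (approximately) bounding how
    // long a task can sit in the queue.
    void poll() {
        bool full = _queue.size() >= _max_batch;
        bool overdue = !_queue.empty() && clock::now() - _oldest >= _max_delay;
        if (!full && !overdue) {
            return;
        }
        std::size_t n = std::min(_queue.size(), _max_batch);
        for (std::size_t i = 0; i != n; ++i) {
            _handler(_queue.front());
            _queue.pop_front();
        }
        _oldest = clock::now();
    }

private:
    std::function<void(Task&)> _handler;
    std::size_t _max_batch;
    clock::duration _max_delay;
    std::deque<Task> _queue;
    clock::time_point _oldest{};
};

A pipeline is then just a collection of such stages whose handlers push tasks into the next stage’s queue, with the event loop polling each of them in turn.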
Optimisation options
We have considered relatively high-level changes to the code and its architecture that may improve the instruction cache situation. It is time now to look at the lower-level side of things, and more specifically at what the compiler can do for us.
When it comes to instruction cache misses, one optimisation that can change a lot is function inlining. On the one hand, inserting a function body at its call sites may increase the code size and, with it, the number of instruction cache misses; on the other hand, it allows more optimisations, avoids calling-convention overhead, and improves code locality. The point is that inlining can make a significant difference in both directions, and it is definitely something worth checking.
It is also important to remember that compilers usually allow more control than just the relatively coarse-grained flags like -finline-functions or -finline-functions-called-once. There is also a set of tunable parameters that help control the compiler behaviour. For example, inline-unit-growth limits the growth of the compilation unit. The actual effect depends on how the source code is organised, which may lead to quite confusing compiler behaviour if a parameter like this one is what limits function inlining.
Another way of influencing inlining decisions is to use attributes like [[gnu::always_inline]] or [[gnu::noinline]] (and [[gnu::flatten]] for the brave). This may help in some cases, but it adds clutter to the code, and if the compiler is unable to make the right decisions without them, it probably means that its heuristics need more tuning.
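For example (the types and functions below are purely illustrative), the attributes can force a tiny accessor into its callers while keeping a rarely used error path out of line:

// Hypothetical connection type used only to illustrate the attributes.
struct Connection {
    int fd;
    bool connected;
};

// Tiny accessor: forcing inlining removes the call overhead at every use.
[[gnu::always_inline]] inline bool is_connected(const Connection& c) {
    return c.connected && c.fd >= 0;
}

// Rarely executed and potentially large: keeping it out of line keeps the
// callers small.
[[gnu::noinline]] void report_error(const Connection& c);

int send_or_fail(Connection& c, int bytes) {
    if (!is_connected(c)) {
        report_error(c);
        return -1;
    }
    return bytes; // stand-in for the actual send
}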
Function cloning is also a type of optimisation that can directly affect the code size. GCC may use it to do constant propagation across functions if the callee is not inlined.
[[gnu::noinline]] // disable inlining to show cloning
int callee(int x, int y) {
    return x + 1000 / y;
}

int caller(int x) {
    return callee(x, 4);
}
This gets compiled to:
callee(int, int) [clone .constprop.0]:
        lea eax, [rdi+250]
        ret
callee(int, int):
        mov eax, 1000
        cdq
        idiv esi
        add eax, edi
        ret
caller(int):
        jmp callee(int, int) [clone .constprop.0]
Function cloning will almost always increase the total binary size. That may not necessarily be a problem, though. If, in the hot path, the program executes only a few cloned versions that benefited significantly from the constant propagation, then we may expect a reduction in instruction cache misses. Like most things in this section and, indeed, the majority of this article, the effect depends on the codebase.
A function definition needs to be visible in the translation unit that contains the call site for the call to be eligible for inlining or cloning. This, again, means that the optimisation depends on the organisation of the source code, which isn’t really desirable. The solution to this problem is Link-Time Optimisation, which enables inlining functions across object files, as well as other optimisations.
Another kind of optimisation that may cause trouble in the CPU front end is loop unrolling. Ideally, a loop would qualify for being detected by the LSD; if not, we’d want it to at least fit in the DSB. However, that’s not necessarily what’s best for the back end. As a result, the compiler may be quite aggressive with loop unrolling and produce quite a lot of code. This is definitely something to watch out for.
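If a particular hot loop turns out to bloat the code, the unrolling can also be constrained locally; a sketch, assuming GCC 8 or later for the pragma (the kernel itself is made up):

#include <cstddef>

// The pragma caps how far the compiler may unroll this particular loop,
// trading a bit of back-end throughput for a smaller footprint in the DSB
// and the instruction cache.
void scale(float* values, std::size_t count, float factor) {
#pragma GCC unroll 4
    for (std::size_t i = 0; i != count; ++i) {
        values[i] *= factor;
    }
}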
Finally, the compiler can be asked to optimise the code for size. At first, it sounds like a good idea: a smaller program is more likely to fit in the instruction cache. However, this comes at the cost of not doing some performance-oriented optimisations. The end result depends highly on how much code-size reduction -Os can bring, and on how vital the optimisations that this flag doesn’t enable are for achieving good performance.
Coding practices
While compilers are prepared to do a wide range of quite aggressive optimisations, there are some limitations they can’t easily overcome: the behaviour mandated by the language standard (unless the “as-if” rule applies) or required by the ABI. To deal with that, the programmer needs to be aware of those restrictions and, ideally, follow practices that lead to code that’s easier to optimise.
For example, function calling conventions usually require non-trivial objects to be passed effectively by reference, regardless of their size. Similarly, such objects can’t be returned in registers; instead, the callee needs to write the object to a buffer provided by the caller.
Let’s have a look at what passing a std::unique_ptr<int> to a function looks like on x86_64-linux-gnu:
#include <memory>
#include <utility>

void takes_unique_pointer(std::unique_ptr<int>);
void takes_unique_pointer_noexcept(std::unique_ptr<int>) noexcept;
void takes_raw_pointer(int*);

std::unique_ptr<int> unique;

void gives_unique_pointer() {
    takes_unique_pointer(std::move(unique));
}

void gives_unique_pointer_noexcept() {
    takes_unique_pointer_noexcept(std::move(unique));
}

int* raw;

void gives_raw_pointer() {
    takes_raw_pointer(std::exchange(raw, {}));
}
This gets compiled2 to:
gives_unique_pointer():
        push rbp
        sub rsp, 16
        mov rax, QWORD PTR unique[rip]
        lea rdi, [rsp+8]
        mov QWORD PTR unique[rip], 0
        mov QWORD PTR [rsp+8], rax
        call takes_unique_pointer(std::unique_ptr<int>)
        mov rdi, QWORD PTR [rsp+8]
        test rdi, rdi
        je .L17
        mov esi, 4
        call operator delete(void*, unsigned long)
.L17:
        add rsp, 16
        pop rbp
        ret
        mov rbp, rax
        jmp .L7
gives_unique_pointer() [clone .cold]:
.L7:
        mov rdi, QWORD PTR [rsp+8]
        test rdi, rdi
        je .L16
        mov esi, 4
        vzeroupper
        call operator delete(void*, unsigned long)
.L8:
        mov rdi, rbp
        call _Unwind_Resume
.L16:
        vzeroupper
        jmp .L8
gives_unique_pointer_noexcept():
        sub rsp, 24
        mov rax, QWORD PTR unique[rip]
        lea rdi, [rsp+8]
        mov QWORD PTR unique[rip], 0
        mov QWORD PTR [rsp+8], rax
        call takes_unique_pointer_noexcept(std::unique_ptr<int>)
        mov rdi, QWORD PTR [rsp+8]
        test rdi, rdi
        je .L24
        mov esi, 4
        call operator delete(void*, unsigned long)
.L24:
        add rsp, 24
        ret
gives_raw_pointer():
        mov rdi, QWORD PTR raw[rip]
        mov QWORD PTR raw[rip], 0
        jmp takes_raw_pointer(int*)
That’s a lot of code. Even though a std::unique_ptr is just a single pointer with some extra logic associated with it, being non-trivial, it cannot be passed via registers. Moreover, the Itanium C++ ABI makes the caller responsible for destroying objects passed by value to the callee. What that means is that this clean-up needs to be repeated at each call site and can’t really be optimised, since at this point the compiler doesn’t know what the callee does. Additionally, because of exceptions, there is more than one way to leave a function: either a return or a throw. Unless the compiler can prove that no exception leaves the callee, because it is noexcept or its implementation is available and straightforward enough, an additional handler doing the appropriate clean-up needs to be emitted.
Obviously, many more things may noticeably contribute to code-size growth: restrictions on copy elision, virtual and non-virtual thunks adjusting the object pointer in calls to virtual functions, accessing thread-local variables – just to name a few. However, there are some general guidelines. Often, a lot of complexity is caused by non-trivial objects, so, for example, working with a std::string_view instead of a std::string may be a good idea – as long as it doesn’t result in dangling pointers, that is. Incidentally, the restrictions that constexpr imposes force the code to avoid the most expensive constructs, and attempting to make as much code as possible constexpr is very likely to result in the compiler producing simpler code that’s easier to optimise.
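As a trivial illustration (the function names are made up, and whether the call site really allocates depends on the string and the standard library implementation), compare what the two parameter types require of the caller:

#include <string>
#include <string_view>

// Non-trivial parameter: when passed by value, the caller materialises a
// std::string (possibly allocating), hands it over via a hidden reference,
// and is responsible for destroying it after the call.
bool has_http_prefix_string(std::string url) {
    return url.compare(0, 4, "http") == 0;
}

// Trivially copyable parameter: a std::string_view can travel in registers,
// with no allocation and no clean-up code at the call site.
bool has_http_prefix_view(std::string_view url) {
    return url.substr(0, 4) == "http";
}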
Annotating cold code
Compilers usually allow the programmer to provide some information about the expected characteristics of the code. For our purposes, the most interesting ones are the [[gnu::cold]] and [[gnu::hot]] attributes in GCC and Clang. They allow telling the compiler whether a function is unlikely to be executed or is a hot spot. This affects the optimisations as well as the placement of those functions in the output binary. Cold functions are going to be grouped together in a .text.cold section to make sure they stay out of the way of the rest of the code, and hot functions are going to be grouped together in .text.hot to improve instruction cache locality.
Moreover, with the -freorder-blocks-and-partition flag, GCC can split functions and put cold blocks in the .text.cold section. Blocks that contain calls to cold functions are considered cold, as well as some others, like exception handlers. Even though they end up at a different location in memory, those cold blocks are still effectively part of the same function, so control can flow to them with just a single jump instruction, without any need to obey the calling convention.
static constexpr int threshold = 100;

void hot_path(int value);

[[gnu::cold]]
void cold_path(int value);

void do_work(int value) {
    if (value > threshold) {
        hot_path(value / 5);
        return;
    }
    cold_path(value / 3);
}
The code above is compiled to:
do_work(int):
        cmp edi, 100
        jle .L2
        movsx rax, edi
        sar edi, 31
        imul rax, rax, 1717986919
        sar rax, 33
        sub eax, edi
        mov edi, eax
        jmp hot_path(int)
do_work(int) [clone .cold]:
.L2:
        mov eax, edi
        mov ecx, 3
        cdq
        idiv ecx
        mov edi, eax
        jmp cold_path(int)
As we can see, the main part of the function contains a comparison and a conditional jump to the cold region; the rest is the hot path. Moreover, in the cold section, the compiler doesn’t transform the division by a constant into cheaper operations as it does in the hot part.
C++20 is going to gain two new attributes that allow the programmer to annotate branches as likely or unlikely to be taken: [[likely]] and [[unlikely]]. Even now, some compilers provide extensions for that, such as __builtin_expect() and __builtin_expect_with_probability(). Their purpose is to rearrange the code in a way that optimises for the branches most likely to be taken. Apart from helping with branch prediction itself, this may also improve instruction locality and reduce cache misses.
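As a small illustration (the allocation wrappers are made up; the attribute form requires a C++20-capable compiler), the same hint can be written either way:

#include <cstdlib>
#include <new>

// C++20 attribute form: the failure branch is marked as unlikely, so the
// compiler can lay out the happy path contiguously.
void* allocate(std::size_t size) {
    void* ptr = std::malloc(size);
    if (ptr == nullptr) [[unlikely]] {
        throw std::bad_alloc();
    }
    return ptr;
}

// Equivalent hint expressed with the GCC/Clang extension available today.
void* allocate_with_builtin(std::size_t size) {
    void* ptr = std::malloc(size);
    if (__builtin_expect(ptr == nullptr, 0)) {
        throw std::bad_alloc();
    }
    return ptr;
}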
Profile-Guided Optimisation
Manually annotating code with hot/cold or likely/unlikely attributes can be quite a daunting task. There is also a matter of maintainability. As the code evolves, those properties may change, and there’s no easy way to identify the annotations that are no longer correct.
Fortunately, there is still a lot that the compiler can do for us. Profile-guided optimisation is a technique that instruments the program to monitor its behaviour and then uses the collected information to produce a highly optimised binary. The burden of having to maintain the annotations is gone. However, there is a risk of overfitting: if the training phase doesn’t exercise all possible workloads, those omitted may end up with unacceptable performance regressions. How much of a problem this is depends on the program, the range of workloads, and the differences between them. Another potential problem is the slowdown in the training phase caused by the instrumentation, which may reduce the accuracy of the collected data.
There’s also BOLT, a standalone tool that optimises a program, focusing mainly on the instruction cache and the instruction TLB, by rearranging code using information from a profile collected in a training phase. Unlike PGO, the profiles are created using sampling and the LBR, so the overhead is much lower. BOLT can be combined with PGO and often brings additional improvements. Unfortunately, at the moment, split functions are not supported, meaning that the previously discussed flag -freorder-blocks-and-partition has to be disabled.
Conclusion
There are multiple ways of improving the instruction cache miss situation. One can try to change the architecture of the program to make it more hardware-friendly. This is likely to be the most time-consuming option, but it may help in more ways than just improving the instruction cache hit ratio. Other options include helping the compiler help us: changing the optimisation options, using PGO, and adjusting coding practices may all have a noticeable effect. Often, it will take a combination of those changes before the instruction cache stops being the problem, or rather, before something else becomes a more critical issue.
In the end, all optimisations are some kind of a trade-off.
Related links
- WikiChip: Skylake
- toplev manual
- SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
- The Staged Event-Driven Architecture for Highly Concurrent Server Applications
- A Retrospective on SEDA
- Interprocedural optimization in GCC
- Clang User Manual on Profile-Guided Optimisation
- GCC documentation on Profile-Guided Optimisation
- Accelerate large-scale applications with BOLT
- BOLT: A Practical Binary Optimizer for Data Centers and Beyond
1. There is a paper proposing a [[jit]] attribute in C++. One may also use, for example, LLVM JIT and either generate the intermediate representation by hand or take advantage of higher-level wrappers such as my experiment. ↩︎
2. GCC 9.1.1 with -O3 -std=c++17 -march=skylake ↩︎