Matrix-vector multiplication with microwave signals

An RF signal at fixed frequency has amplitude and phase, so it can be represented or it can represent a complex number. Furthermore, microwave networks can be thought of as linear operations applied to those signals. When those networks are designed so that some of their parameters can be dynamically configured we can end up with a unitary matrix-complex vector multiplication. In this post I describe my implementation of 2×2 special unitary matrix by complex vector multiplication done (mostly) with microstrip and phase shifters.

On GPUs, ranges, latency, and superoptimisers

Not long ago, I put together a small library that provides an interface similar to the range-v3 library and uses CUDA to execute all operations on GPUs. I’ve implemented the basics: map, filter, reduce operations, just enough to serve as a proof of concept. In essence, it is quite similar to Thrust; just the interface is different, based on ranges instead of iterators. As one would expect, the initial implementation wasn’t heavily optimised, but the time has come to improve the performance. This was an excellent opportunity to look more closely at the architecture of modern GPUs – Turing in this case.

Avoiding instruction cache misses

Modern processors are quite complex, with many parts having the potential to become a bottleneck. It is relatively easy to reason about the performance of short pieces of code, especially if memory effects are kept to a minimum. Both static analysis tools like LLVM MCA and microbenchmarks can provide a lot of information in such cases. However, the behaviour of the whole program is not just the sum of those small parts. As the code becomes larger and more complex other effects start appearing. One of such potential problems are excessive instruction cache misses.

On lists, cache, algorithms, and microarchitecture

I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest were also the first ones I implemented. Once I considered the potential for container-specific optimisations by specialising generic algorithms like for_each, transform, and accumulate I’ve realised that there is more to it than just the fact that “lists are slow” and a closer look at them can be quite an educational experience. After all, there’s a lot of code that is not a vectorised loop working with a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.