Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C
While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.
Security Analysis Of AMD Predictive Store Forwarding
AMD “Zen3” processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting dependencies between loads and stores. Like technologies such as branch prediction, with PSF the processor “guesses” what the result of a load is likely to be, and speculatively executes subsequent instructions. In the event that the processor incorrectly speculated on the result of the load, it is designed to detect this and flush the incorrect results from the CPU pipeline.
Security research in recent years has examined the security implications of incorrect CPU speculation and how in some cases it may lead to side channel attacks. For instance, conditional branch speculation, indirect branch speculation, and store bypass speculation have been demonstrated to have the potential to be used in side-channel attacks (e.g., Spectre v1, v2, and v4 respectively).
Counting cycles and instructions on the Apple M1 processor
Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.
To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.
How I cut GTA Online loading times by 70%
Some debug-stepping later it turns out it’s… JSON!
Of course it is. But a really solid reversing effort. And a nice fix.
Block Profiling in Go
The block profile in Go lets you analyze how much time your program spends waiting on the blocking operations listed below:
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation
In this post I’ll explain how I configured my AMD ThreadRipper Pro workstation with 10 PCIe 4.0 SSDs to achieve 11M IOPS with 4kB random reads and 66 GiB/s throughput with larger IOs - and what bottlenecks & issues I fixed to get there. We’ll look into Linux block I/O internals and their interaction with modern hardware. We’ll use tools & techniques, old and new, for measuring bottlenecks - and other adventures in the kernel I/O stack.
Micro-Optimizing .tar.gz Archives by Changing File Order
A few weeks ago, I was doing something with a sizeable .tar.gz file, and wondered how the order of files affected the process. I’m not that knowledgable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well, if you give it random data, it will probably give you a bigger file than when you started. So reordering files seems like it could matter.
Against essential and accidental complexity
In the classic 1986 essay, No Silver Bullet, Fred Brooks argued that there is, in some sense, not that much that can be done to improve programmer productivity. His line of reasoning is that programming tasks contain a core of essential/conceptual1 complexity that’s fundamentally not amenable to attack by any potential advances in technology (such as languages or tooling). He then uses an Ahmdahl’s law argument, saying that because 1/X of complexity is essential, it’s impossible to ever get more than a factor of X improvement via technological improvements.
To summarize, Brooks states a bound on how much programmer productivity can improve. But, in practice, to state this bound correctly, one would have to be able to conceive of problems that no one would reasonably attempt to solve due to the amount of friction involved in solving the problem with current technologies.
Pointers Are Complicated II, or: We need better language specs
Below, I will show a series of three compiler transformations that each seem “intuitively justified”, but when taken together they lead to a clearly incorrect result. I will use LLVM for these examples, but the goal is not to pick on LLVM—other compilers suffer from similar issues. The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust, we need to take IR semantics (and specifically provenance) more seriously. I use LLVM for the examples because it is particularly easy to study with its single, extensively-documented IR that a lot of infrastructure evolved around. Let’s get started!
Dissecting Lemire’s nearly divisionless random
The idea was simple, I’ve always felt that code readability is undervalued so I figured I’d put cold hard cash up. I announced a $1,000 pot, divided into $500, $300, and $200 prizes for the most readable implementations of Daniel Lemire’s nearly divisionless algorithm for selecting a random number from an interval. I now have winners to announce and congratulate, and they’re in this blog post, but there’s more to this story.
Performance of Elixir's System.get_env/0 Function
At work I was debugging a performance issue in one of our Elixir applications and stumbled across the strange implementation of Elixir’s System.get_env/0 function. In this blog post I’ll show how it caused performance issues for the application I was debugging and I’ll also propose a better implementation of the function. I’ll conclude by explaining why the better implementation is not used yet.
.NET Memory Performance Analysis
This document aims to help folks who develop applications in .NET with how to think about memory performance analysis and finding the right approaches to perform such analysis if they need to. In this context .NET includes .NET Framework and .NET Core. In order to get the latest memory improvements in both the garbage collector and the rest of the framework I strongly encourage you to be on .NET Core if you are not already, because that’s where the active development happens.
When I was writing this document I intended to introduce concepts like concurrent GC or pinning as needed by the explanation of the analysis. So as you read it, you’ll gradually come across them. If you already kind of knew what they are and are looking for explanation on a specific concept here are the links to them
On Modern Hardware the Min-Max Heap beats a Binary Heap
The heap is a data structure that I use all the time and that others somehow use rarely. (I once had a coworker tell me that he knew some code was mine because it used a heap) Recently I was writing code that could really benefit from using a heap (as most code can) but I needed to be able to pop items from both ends. So I read up on double-ended priority queues and how to implement them. These are rare, but the most common implementation is the “Interval Heap” that can be explained quickly, has clean code and is only slightly slower than a binary heap. But there is an alternative called the “Min-Max Heap” that doesn’t have pretty code, but it has shorter dependency chains, which is important on modern hardware. As a result it often ends up faster than a binary heap, even though it allows you to pop from both ends. Which means there might be no reason to ever use a binary heap again.
How Go 1.15 improved converting small integer values to interfaces
The answer turns out to be pretty straightforward, and is in Go CL 216401 (merged in this commit, which may be easier to read). The Go runtime has a special static array of the first 256 integers (0 to 255), and when it would normally have to allocate memory to store an integer on the heap as part of converting it to an interface, it first checks to see if it can just return a pointer to the appropriate element in the array instead. This kind of static allocation of frequently used values is common in languages with lots of dynamic allocation; Python does something similar for small integers, for example (which can sometimes surprise you).
PopCount on ARM64 in Go Assembler
Apropos of Apple’s ARM announcment, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
Ice Lake Store Elimination
We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
What Outranks Thread Priority?
This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on. The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy. When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it I realized it was time to get serious.
A lot of effort for a rather unsatisfactory conclusion, but I won’t spoil the surprise.
This Goes to Eleven - Decimating Array.Sort with AVX2
Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing can be further done about it in 2020, except for waiting for newer, marginally faster machines to pop-up. However, that is not the case, and while I’m not the first to have thoughts about it; or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++2 code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop. Sounds interesting? If so, down the rabbit hole we go…
Very well done.
Speeding up Linux disk encryption
At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is: A significant amount of tail latency is due to queueing effects
Analysing .NET start-up time with Flamegraphs
Recently I gave a talk at the NYAN Conference called ‘From ‘dotnet run’ to ‘hello world’: In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending it’s time during start-up: