Dissecting Lemire’s nearly divisionless random
The idea was simple, I’ve always felt that code readability is undervalued so I figured I’d put cold hard cash up. I announced a $1,000 pot, divided into $500, $300, and $200 prizes for the most readable implementations of Daniel Lemire’s nearly divisionless algorithm for selecting a random number from an interval. I now have winners to announce and congratulate, and they’re in this blog post, but there’s more to this story.
Performance of Elixir's System.get_env/0 Function
At work I was debugging a performance issue in one of our Elixir applications and stumbled across the strange implementation of Elixir’s System.get_env/0 function. In this blog post I’ll show how it caused performance issues for the application I was debugging and I’ll also propose a better implementation of the function. I’ll conclude by explaining why the better implementation is not used yet.
.NET Memory Performance Analysis
This document aims to help folks who develop applications in .NET with how to think about memory performance analysis and finding the right approaches to perform such analysis if they need to. In this context .NET includes .NET Framework and .NET Core. In order to get the latest memory improvements in both the garbage collector and the rest of the framework I strongly encourage you to be on .NET Core if you are not already, because that’s where the active development happens.
When I was writing this document I intended to introduce concepts like concurrent GC or pinning as needed by the explanation of the analysis. So as you read it, you’ll gradually come across them. If you already kind of knew what they are and are looking for explanation on a specific concept here are the links to them
On Modern Hardware the Min-Max Heap beats a Binary Heap
The heap is a data structure that I use all the time and that others somehow use rarely. (I once had a coworker tell me that he knew some code was mine because it used a heap) Recently I was writing code that could really benefit from using a heap (as most code can) but I needed to be able to pop items from both ends. So I read up on double-ended priority queues and how to implement them. These are rare, but the most common implementation is the “Interval Heap” that can be explained quickly, has clean code and is only slightly slower than a binary heap. But there is an alternative called the “Min-Max Heap” that doesn’t have pretty code, but it has shorter dependency chains, which is important on modern hardware. As a result it often ends up faster than a binary heap, even though it allows you to pop from both ends. Which means there might be no reason to ever use a binary heap again.
How Go 1.15 improved converting small integer values to interfaces
The answer turns out to be pretty straightforward, and is in Go CL 216401 (merged in this commit, which may be easier to read). The Go runtime has a special static array of the first 256 integers (0 to 255), and when it would normally have to allocate memory to store an integer on the heap as part of converting it to an interface, it first checks to see if it can just return a pointer to the appropriate element in the array instead. This kind of static allocation of frequently used values is common in languages with lots of dynamic allocation; Python does something similar for small integers, for example (which can sometimes surprise you).
PopCount on ARM64 in Go Assembler
Apropos of Apple’s ARM announcment, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
Ice Lake Store Elimination
We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
What Outranks Thread Priority?
This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on. The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy. When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it I realized it was time to get serious.
A lot of effort for a rather unsatisfactory conclusion, but I won’t spoil the surprise.
This Goes to Eleven - Decimating Array.Sort with AVX2
Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing can be further done about it in 2020, except for waiting for newer, marginally faster machines to pop-up. However, that is not the case, and while I’m not the first to have thoughts about it; or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++2 code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop. Sounds interesting? If so, down the rabbit hole we go…
Very well done.
Speeding up Linux disk encryption
At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is: A significant amount of tail latency is due to queueing effects
Analysing .NET start-up time with Flamegraphs
Recently I gave a talk at the NYAN Conference called ‘From ‘dotnet run’ to ‘hello world’: In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending it’s time during start-up:
Elixir and Postgres: A Rarely Mentioned Problem
Last time, we talked about the magic trick to make your full text searches go fast. This time, I’ll tell you about another performance issue I encountered that probably also affects your performance, at least if you are using Ecto and PostgreSQL.
Precision Opportunities for Demanded Bits in LLVM
A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
Gathering Intel on Intel AVX-512 Transitions
This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
Too Much Crypto
We show that many symmetric cryptography primitives would not be less safe with significantly fewer rounds. To support this claim, we review the cryptanalysis progress in the last 20 years, examine the reasons behind the current number of rounds, and analyze the risk of doing fewer rounds. Advocating a rational and scientific approach to round numbers selection, we propose revised number of rounds for AES, BLAKE2, ChaCha, and SHA-3, which offer more consistent security margins across primitives and make them much faster, without increasing the security risk.
Chunking Optimizations: Let the Knife Do the Work
“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.
Go memory ballast: How I learnt to stop worrying and love the heap
The heap size is the total size of allocations on the heap. Therefore, if a ballast of 10 GiB is allocated, the next GC will only trigger when the heap size grows to 20 GiB. At that point, there will be roughly 10 GiB of ballast + 10 GiB of other allocations.
Boosting the Real Time Performance of Gnome Shell 3.34 in Ubuntu 19.10
As you may have read many times, Gnome 3.34 brings much improved desktop performance. In this article we will describe some of the improvements contributed by Canonical, how the problems were surprising, how they were approached and what other performance work is coming in future.
The thing is in the case of Gnome Shell its biggest performance problems of late were not hot spots at all. They were better characterised as cold spots where it was idle instead of updating the screen smoothly. Such cold spots are only apparent when you look at the real time usage of a program, and in not the CPU or GPU time consumed.
Nice write-up on addressing stuttering and lag.
Clang format tanks performance
Let’s benchmark toupper implementations.
Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
The day when starting a receiver fixed the transmitter
Have you ever tried to do something, but had it fail and weren’t really sure why? Did you then try to fall back to doing something you could actually measure in order to then get a handle on the problem? I had something like this happen quite a while back with some software defined radio stuff. Here’s how it went.