PopCount on ARM64 in Go Assembler
> Apropos of Apple’s ARM announcement, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
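The post dives into hand-written ARM64 assembly; as a point of reference, here is a minimal portable sketch of that hot loop in plain Go (names are illustrative, not the post’s code), using math/bits, which the compiler already turns into native population-count instructions on most targets:

```go
package popcount

import "math/bits"

// PopCount sums the set bits across a slice of 64-bit words: the portable
// baseline that hand-written assembly versions compete against.
func PopCount(words []uint64) int {
	total := 0
	for _, w := range words {
		total += bits.OnesCount64(w) // intrinsified by the Go compiler on most architectures
	}
	return total
}
```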
Ice Lake Store Elimination
> We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% of L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
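The fill benchmarks in question are native microbenchmarks; purely to sketch the shape of the workload being measured (buffer size and names are my own, not the article’s), a Go equivalent would look roughly like this:

```go
package fill

import "testing"

// BenchmarkFill repeatedly stores zeros over a buffer big enough that
// writebacks from L2 to L3, and from L3 to RAM, come into play.
func BenchmarkFill(b *testing.B) {
	buf := make([]uint64, 1<<21) // 16 MiB, illustrative
	b.SetBytes(int64(len(buf) * 8))
	for i := 0; i < b.N; i++ {
		for j := range buf {
			buf[j] = 0 // zero stores are the case observed to be eligible for elimination
		}
	}
}
```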
What Outranks Thread Priority?
> This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on. The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy. When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it I realized it was time to get serious.
A lot of effort for a rather unsatisfactory conclusion, but I won’t spoil the surprise.
This Goes to Eleven - Decimating Array.Sort with AVX2
> Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
> Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing further can be done about it in 2020, except for waiting for newer, marginally faster machines to pop up. However, that is not the case, and while I’m not the first to have thoughts about it, or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C#, that outperforms CoreCLR’s C++ code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop. Sounds interesting? If so, down the rabbit hole we go…
Very well done.
Speeding up Linux disk encryption
> At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed to be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
> To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering whether all this extra queueing could cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is that a significant amount of tail latency is due to queueing effects.
Analysing .NET start-up time with Flamegraphs
> Recently I gave a talk at the NYAN Conference called “From ‘dotnet run’ to ‘hello world’”. In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending its time during start-up.
Elixir and Postgres: A Rarely Mentioned Problem
> Last time, we talked about the magic trick to make your full text searches go fast. This time, I’ll tell you about another performance issue I encountered that probably also affects your performance, at least if you are using Ecto and PostgreSQL.
Precision Opportunities for Demanded Bits in LLVM
> A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
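The analysis itself runs over LLVM IR; as a hedged illustration of the truncation example in ordinary source code (Go here, function name mine), note that a narrowing cast makes the high bits of everything feeding it irrelevant:

```go
package demanded

// Only the low 8 bits of the product survive the truncation, so a
// demanded-bits style analysis may legally narrow the 32-bit multiply:
// multiplication modulo 2^8 depends only on the low 8 bits of each operand,
// i.e. uint8(x*y) == uint8(uint8(x)*uint8(y)).
func low8Product(x, y uint32) uint8 {
	return uint8(x * y)
}
```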
Gathering Intel on Intel AVX-512 Transitions
> This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
> Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
Too Much Crypto
> We show that many symmetric cryptography primitives would not be less safe with significantly fewer rounds. To support this claim, we review the cryptanalysis progress in the last 20 years, examine the reasons behind the current number of rounds, and analyze the risk of doing fewer rounds. Advocating a rational and scientific approach to round numbers selection, we propose revised number of rounds for AES, BLAKE2, ChaCha, and SHA-3, which offer more consistent security margins across primitives and make them much faster, without increasing the security risk.
Chunking Optimizations: Let the Knife Do the Work
> “Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.
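The article makes its case in C; the same idea carries over elsewhere. A hedged Go sketch of lifting an unnecessary constraint (clearing a buffer, names mine): hand-chunking fixes the stride, while the plain loop leaves the toolchain free to pick one.

```go
package chunking

import "encoding/binary"

// clearChunked fixes the chunk size at 8 bytes by hand, which the compiler
// must honour even when a wider or narrower stride would be better.
func clearChunked(b []byte) {
	i := 0
	for ; i+8 <= len(b); i += 8 {
		binary.LittleEndian.PutUint64(b[i:], 0)
	}
	for ; i < len(b); i++ {
		b[i] = 0
	}
}

// clearSimple states only the intent; the Go toolchain recognises this loop
// and lowers it to an optimised memclr, choosing whatever width suits the target.
func clearSimple(b []byte) {
	for i := range b {
		b[i] = 0
	}
}
```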
Go memory ballast: How I learnt to stop worrying and love the heap
> The heap size is the total size of allocations on the heap. Therefore, if a ballast of 10 GiB is allocated, the next GC will only trigger when the heap size grows to 20 GiB. At that point, there will be roughly 10 GiB of ballast + 10 GiB of other allocations.
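A minimal sketch of the idea as described, with the 10 GiB figure from the quote (variable names mine):

```go
package main

import "runtime"

func main() {
	// A large, never-touched allocation raises the heap size the GC pacer
	// sees, so the next collection is deferred until roughly twice as much
	// has been allocated. Untouched pages cost virtual memory, not RAM.
	ballast := make([]byte, 10<<30) // 10 GiB

	// ... run the real workload here ...

	// Keep the ballast reachable so the GC never frees it.
	runtime.KeepAlive(ballast)
}
```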
Boosting the Real Time Performance of Gnome Shell 3.34 in Ubuntu 19.10
> As you may have read many times, Gnome 3.34 brings much improved desktop performance. In this article we will describe some of the improvements contributed by Canonical, how the problems were surprising, how they were approached and what other performance work is coming in future.
> The thing is, in the case of Gnome Shell, its biggest performance problems of late were not hot spots at all. They were better characterised as cold spots, where it sat idle instead of updating the screen smoothly. Such cold spots are only apparent when you look at the real time usage of a program, and not in the CPU or GPU time consumed.
Nice write-up on addressing stuttering and lag.
Clang-format tanks performance
> Let’s benchmark toupper implementations.
> Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
The day when starting a receiver fixed the transmitter
> Have you ever tried to do something, but had it fail and weren’t really sure why? Did you then try to fall back to doing something you could actually measure in order to then get a handle on the problem? I had something like this happen quite a while back with some software defined radio stuff. Here’s how it went.
Snap: a microkernel approach to host networking
> This paper describes the networking stack, Snap, that has been running in production at Google for the last three years+. It’s been clear for a while that software designed explicitly for the data center environment will increasingly want/need to make different design trade-offs to e.g. general-purpose systems software that you might install on your own machines. But wow, I didn’t think we’d be at the point yet where we’d be abandoning TCP/IP! You need a lot of software engineers and the willingness to rewrite a lot of software to entertain that idea.
FreeBSD'fy ZFS zlib zalloc/zfree callbacks
> The previous code came from OpenSolaris, which to my understanding requires the allocation size to be known in order to free memory. To store that size, the previous code allocated an additional 8-byte header. But I noticed that zlib with the present settings allocates 64KB context buffers for each call, which could be efficiently cached by UMA; the addition of those 8 bytes makes them fall back to physical RAM allocations, which causes huge overhead and lock congestion on small blocks. Since FreeBSD’s free() does not take a size argument, switching to it solves the problem, increasing write speed to ZVOLs with 4KB block size and GZIP compression on my 40-thread test system from ~60MB/s to ~600MB/s.
An analysis of performance evolution of Linux’s core operations
> When you get into the details, though, I found it hard to come away with any strongly actionable takeaways. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
> “Red Hat and Suse normally required 6-18 months to optimise the performance of an upstream Linux kernel before it can be released as an enterprise distribution”, and
> “Google’s data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months.”
Dramatically reduced power usage in Firefox 70 on macOS with Core Animation
> In Firefox 70 we changed how pixels get to the screen on macOS. This allows us to do less work per frame when only small parts of the screen change. As a result, Firefox 70 drastically reduces the power usage during browsing.
> Every Firefox window contains one OpenGL context, which covers the entire window. Firefox 69 was using the API described above. So we were always redrawing the whole window on every change, and the window manager was always copying our entire window to the screen on every change. This turned out to be a problem despite the fact that these draws were fully hardware accelerated.
> Core Animation is the name of an Apple framework which lets you create a tree of layers (CALayer). These layers usually contain textures with some pixel content. The layer tree defines the positions, sizes, and order of the layers within the window. Starting with macOS 10.14, all windows use Core Animation by default, as a way to share their rendering with the window manager.
Mispredicted branches can multiply your running times
> Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
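As a hedged illustration of one such trick (Go, counting odd values; the example is mine, not the article’s), the condition can be folded into the sum so there is no branch left to mispredict:

```go
package branchless

// CountOddBranchy relies on a conditional branch; on random input the
// predictor is wrong roughly half the time.
func CountOddBranchy(values []uint64) int {
	n := 0
	for _, v := range values {
		if v&1 == 1 {
			n++
		}
	}
	return n
}

// CountOddBranchless folds the condition into arithmetic, leaving nothing
// for the predictor to guess.
func CountOddBranchless(values []uint64) int {
	n := 0
	for _, v := range values {
		n += int(v & 1)
	}
	return n
}
```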