Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C
While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.
Eliminating Data Races in Firefox – A Technical Report
We successfully deployed ThreadSanitizer in the Firefox project to eliminate data races in our remaining C/C++ components. In the process, we found several impactful bugs and can safely say that data races are often underestimated in terms of their impact on program correctness. We recommend that all multithreaded C/C++ projects adopt the ThreadSanitizer tool to enhance code quality.
Security Analysis Of AMD Predictive Store Forwarding
AMD “Zen3” processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting dependencies between loads and stores. Like technologies such as branch prediction, with PSF the processor “guesses” what the result of a load is likely to be, and speculatively executes subsequent instructions. In the event that the processor incorrectly speculated on the result of the load, it is designed to detect this and flush the incorrect results from the CPU pipeline.
Security research in recent years has examined the security implications of incorrect CPU speculation and how in some cases it may lead to side channel attacks. For instance, conditional branch speculation, indirect branch speculation, and store bypass speculation have been demonstrated to have the potential to be used in side-channel attacks (e.g., Spectre v1, v2, and v4 respectively).
signed char lotte
“signed char lotte” is a computer program written by Brian Westley and the winner of the “Best Layout” award in the 1990 International Obfuscated C Code Contest. The cleverness of the text is staggering. Superficially it reads as an epistolary exchange between two (possibly former) lovers, Charlotte and Charlie. At the same, it is an executable piece of code whose action is thematically related to its story.
How does Go know time.Now?
This post may be a little longer than usual, so grab your coffees, grab your teas and without further ado, let’s dive in and see what we can come up with.
Recovering A Full Pem Private Key When Half Of It Is Redacted
The @CryptoHack__ account was pinged today by ENOENT, with a CTF-like challenge found in the wild: Source tweet. Here’s a write-up covering how given a partially redacted PEM, the whole private key can be recovered. The Twitter user, SAXX, shared a partially redacted private RSA key in a tweet about a penetration test where they had recovered a private key. Precisely, a screenshot of a PEM was shared online with 31 of 51 total lines of the file redacted. As ENOENT correctly identified, the redaction they had offered wasn’t sufficient, and from the shared screenshot, it was possible to totally recover the private key.
Speculating The Entire X86-64 Instruction Set In Seconds With This One Weird Trick
As cheesy as the title sounds, I promise it cannot beat the cheesiness of the technique I’ll be telling you about in this post. The morning I saw Mark Ermolov’s tweet about the undocumented instruction reading from/writing to the CRBUS, I had a bit of free time in my hands and I knew I had to find out the opcode so I started theory-crafting right away. After a few hours of staring at numbers, I ended up coming up with a method of discovering practically every instruction in the processor using a side(?)-channel. It’s an interesting method involving even more interesting components of the processor so I figured I might as well write about it, so here it goes.
Counting cycles and instructions on the Apple M1 processor
Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.
To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.
Cranelift, Part 3: Correctness in Register Allocation
In this post, I will cover how we worked to ensure correctness in our register allocator, regalloc.rs, by developing a symbolic checker that uses abstract interpretation to prove correctness for a specific register allocation result. By using this checker as a fuzzing oracle, and driving just the register allocator with a focused fuzzing target, we have been able to uncover some very interesting and subtle bugs, and achieve a fairly high confidence in the allocator’s robustness.
Wrecking sandwich traders for fun and profit
However, nothing is risk-free on the blockchain, and exploitative trading strategies such as sandwich trading and front-running actually increase in risk the more the engineer attempts to generalise their ability to capture opportunities.
To illustrate to novice traders the risks of playing in the mempool, I have conducted a demonstration of a new trading alpha I call “Salmonella”, which involves intentionally exploiting the generalised nature of front-running setups. The goal of sandwich trading is to exploit the slippage of unintended victims, so this strategy turns the tables on the exploiters.
How I cut GTA Online loading times by 70%
Some debug-stepping later it turns out it’s… JSON!
Of course it is. But a really solid reversing effort. And a nice fix.
Bilinear down/upsampling, aligning pixel grids, and that infamous GPU half pixel offset
It’s been more than two decades of me using bilinear texture filtering, a few months since I’ve written about bilinear resampling, but only two days since I discovered a bug of mine related to it. 😅 Similarly, just last week a colleague asked for a very fast implementation of bilinear on a CPU and it caused a series of questions “which kind of bilinear?”.
So I figured it’s an opportunity for another short blog post – on bilinear filtering, but in context of down/upsampling. We will touch here on GPU half pixel offsets, aligning pixel grids, a bug / confusion in Tensorflow, deeper signal processing analysis of what’s going on during bilinear operations, and analysis of the magic of the famous “magic kernel”.
Uncovering a 24-year-old bug in the Linux Kernel
When one side’s receive buffer (Recv-Q) fills up (in this case because the rsync process is doing disk I/O at a speed slower than the network’s), it will send out a zero window advertisement, which will put that direction of the connection on hold. When buffer space eventually frees up, the kernel will send an unsolicited window update with a non-zero window size, and the data transfer continues. To be safe, just in case this unsolicited window update is lost, the other end will regularly poll the connection state using the so-called Zero Window Probes (the persist mode we are seeing here).
Apparently, the bug was in the bulk receiver fast-path, a code path that skips most of the expensive, strict TCP processing to optimize for the common case of bulk data reception. This is a significant optimization, outlined 28 years ago² by Van Jacobson in his “TCP receive in 30 instructions” email. Apparently the Linux implementation did not update snd_wl1 while in the receiver fast path. If a connection uses the fast path for too long, snd_wl1 will fall so far behind that ack_seq will wrap around with respect to it. And if this happens while the receive window is zero, there is no way to re-open the window, as demonstrated above. What’s more, this bug had been present in Linux since v2.1.8, dating back to 1996!
2020 Chrome Extension Performance Report
I tested how the 1000 most popular Chrome extensions affect browser performance. The main metrics I’ll consider are CPU consumption, memory consumption, and whether the extension makes pages render more slowly.
Some results are terrible. Some are worse.
Improving texture atlas allocation in WebRender
This is a longer version of the piece I published in the mozilla gfx team blog where I focus on the atlas allocation algorithms. In this one I’ll go into more details about the process and methodology behind these improvements. The first part is about the making of guillotiere, a crate that I first released in March 2019. In the second part we’ll have a look at more recent work building upon what I did with guillotiere, to improve texture memory usage in WebRender/Firefox.
Dissecting the Apple M1 GPU
Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!
And part II where it really takes off: https://rosenzweig.io/blog/asahi-gpu-part-2.html
donut.c without a math library
My little donut.c has been making the rounds again, after being featured in a couple YouTube videos (e.g., Lex Fridman and Joma Tech). If I had known how much attention this code would get over the years, I would have spent more time on it.
Introducing the In-the-Wild Series
Unsafe string interning in Go
The result of this work is the package go4.org/intern which uses some pretty neat unsafe tricks to implement efficient string interning using weak references and Go finalizers. We’ll start by showing off the safe implementation and gradually introduce the concepts needed to understand the unsafe one as well.
node.example.com Is An IP Address
This takes a bit to get to the punchline, but man, good old duck typing for the win.
It turns out that, under certain conditions, the ipaddress module can create IPv6 addresses from raw bytes. My assumption is that it offers this behavior as a convenient way to parse IP addresses from data fresh off the wire.
Does node.example.com meet those certain conditions? You bet it does. Because we’re using Python 2 it’s just bytes and it happens to be 16 characters long.