How are Unix pipes implemented?
> This article is about how pipes are implemented in the Unix kernel. I was a little disappointed that a recent article titled “How do Unix pipes work?” was not about the internals, and curious enough to go digging in some old sources to try to answer the question.
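For context (standard POSIX, not something from the article): the user-facing interface whose kernel-side plumbing the article digs into is just pipe(2), a pair of file descriptors joined through a kernel buffer. A minimal sketch:

```c
/* Minimal sketch of the user-facing pipe(2) API (not from the article):
 * the parent reads what the child writes, via a buffer held in the kernel. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];                       /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {                /* child: write into the pipe */
        close(fds[0]);
        const char *msg = "hello through a kernel buffer\n";
        write(fds[1], msg, strlen(msg));
        _exit(0);
    }

    close(fds[1]);                    /* parent: read from the pipe */
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof buf);
    if (n > 0) fwrite(buf, 1, (size_t)n, stdout);
    wait(NULL);
    return 0;
}
```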
A "living" Linux process with no memory
> This code gets a list of all memory maps from /proc/self/maps, then creates a new executable map where it jits some code that calls munmap() on each of the maps it just got, and finally on the map it’s on. This is just a quick example with no portability in mind, so the source code contains the actual bytes that would be emitted by an x64 compiler. After unmapping the final map, where the jit code lies, there’s no new instruction to execute and a segfault is raised.
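A rough sketch of just the first step (hypothetical, not the article's code): enumerating the current process's mappings from /proc/self/maps. The article's program then emits machine code into a fresh executable mapping that munmap()s each of these ranges and, last of all, the one it is running from.

```c
/* Sketch of step one only: list every mapping of this process.
 * The article's code goes further and JITs a routine that munmap()s
 * each (start, length) pair, unmapping its own code last. */
#include <stdio.h>

int main(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, maps)) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            /* a real reimplementation would record (start, end - start) here
             * and later emit mov/syscall bytes for munmap(start, len) */
            printf("mapping: %#lx-%#lx (%lu bytes)\n", start, end, end - start);
        }
    }
    fclose(maps);
    return 0;
}
```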
firefox's low-latency webassembly compiler
> The goals of high throughput and low latency conflict with each other. To get best throughput, a compiler needs to spend time on code motion, register allocation, and instruction selection; to get low latency, that’s exactly what a compiler should not do. Web browsers therefore take a two-pronged approach: they have a compiler optimized for throughput, and a compiler optimized for latency. As a WebAssembly file is being downloaded, it is first compiled by the quick-and-dirty low-latency compiler, with the goal of producing machine code as soon as possible. After that “baseline” compiler has run, the “optimizing” compiler works in the background to produce high-throughput code. The optimizing compiler can take more time because it runs on a separate thread. When the optimizing compiler is done, it replaces the baseline code. (The actual heuristics about whether to do baseline + optimizing (“tiering”) or just to go straight to the optimizing compiler are a bit hairy, but this is a summary.)
> This article is about the WebAssembly baseline compiler in Firefox. It’s a surprising bit of code and I learned a few things from it.
Speeding up Linux disk encryption
> At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed to be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
> To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing could cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is: a significant amount of tail latency is due to queueing effects.
How Much of a Genius-Level Move Was Using Binary Space Partitioning in Doom?
> A decade after Doom’s release, in 2003, journalist David Kushner published a book about id Software called Masters of Doom, which has since become the canonical account of Doom’s creation. I read Masters of Doom a few years ago and don’t remember much of it now, but there was one story in the book about lead programmer John Carmack that has stuck with me. This is a loose gloss of the story (see below for the full details), but essentially, early in the development of Doom, Carmack realized that the 3D renderer he had written for the game slowed to a crawl when trying to render certain levels. This was unacceptable, because Doom was supposed to be action-packed and frenetic. So Carmack, realizing the problem with his renderer was fundamental enough that he would need to find a better rendering algorithm, started reading research papers. He eventually implemented a technique called “binary space partitioning,” never before used in a video game, that dramatically sped up the Doom engine.
History of research into BSP.
Understanding X mouse cursors (and their several layers of history)
> The X protocol (and server) come with a pre-defined set of cursors. If your program is happy with one of these, you use it by telling the X server that you want cursor number N with XCreateFontCursor(). As mentioned in the manpage (and hinted at by the function name), the server loads these cursors from a specific X font, which is exposed to clients under the special font name ‘cursor’. Like the special ‘fixed’ font name, this isn’t even an XLFD font name and so there’s no way to specify what pixel size you want your cursors to be in; you get whatever (font) size the font is or the server decides on (if the X font the server is using is one where it can do that, and I’m not sure that the X server even supports resizable fonts for the special cursor font).
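A minimal sketch of the API being described, assuming a stock Xlib setup (my example, not from the post); compile with -lX11:

```c
/* Ask the server for one of its predefined cursors -- which it pulls out of
 * the special 'cursor' font -- and attach it to a window. */
#include <X11/Xlib.h>
#include <X11/cursorfont.h>   /* the XC_* shape numbers ("cursor number N") */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    Display *dpy = XOpenDisplay(NULL);
    if (!dpy) { fprintf(stderr, "cannot open display\n"); return 1; }

    Window win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy),
                                     0, 0, 200, 200, 1, 0, 0xffffff);
    XMapWindow(dpy, win);

    /* cursor number XC_crosshair, rendered at whatever size the server picks */
    Cursor cur = XCreateFontCursor(dpy, XC_crosshair);
    XDefineCursor(dpy, win, cur);
    XFlush(dpy);

    sleep(5);                 /* hover over the window to see the cursor */
    XFreeCursor(dpy, cur);
    XCloseDisplay(dpy);
    return 0;
}
```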
Recent and future pattern matching improvements
> Much of writing software revolves around checking if some data has some shape (“pattern”), extracting information from it, and then reacting if there was a match. To facilitate this, many modern languages, Rust included, support what is known as “pattern matching”.
Patching nVidia GPU driver for hot-unplug on Linux
> Anyway, I was kind of annoyed by rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.
Donald Knuth Was Framed
Knuth writes eight pages and McIlroy writes six lines.
> A damning counter. But neither of us had ever read the paper. And as you know, I’m all about primary sources. We pulled up the paper here and read through it together. And it left us with a very different understanding of literate programming, and the challenge, than the famous story gave.
Fixing memory leaks in web applications
> Part of the bargain we struck when we switched from building server-rendered websites to client-rendered SPAs is that we suddenly had to take a lot more care with the resources on the user’s device. Don’t block the UI thread, don’t make the laptop’s fan spin, don’t drain the phone’s battery, etc. We traded better interactivity and “app-like” behavior for a new class of problems that don’t really exist in the server-rendered world.
> One of these problems is memory leaks. A poorly-coded SPA can easily eat up megabytes or even gigabytes of memory, continuing to gobble up more and more resources, even as it’s sitting innocently in a background tab. At this point, the page might start to slow to a crawl, or the browser may just terminate the tab and you’ll see Chrome’s familiar “Aw, snap!” page.
> C++ “move” semantics are simple, but they are still widely misunderstood. This post is an attempt to shed light on that situation.
I like that the appendix is 3 times the article’s length.
Don't touch my clipboard
> You can (but shouldn’t) change how people copy text from your website.
Escaping the Chrome Sandbox with RIDL
> Vulnerabilities that leak cross-process memory can be exploited to escape the Chrome sandbox. An attacker is still required to compromise the renderer prior to mounting this attack. To protect against attacks on affected CPUs, make sure your microcode is up to date and disable hyper-threading (HT).
This is a pretty clear write-up and comes with a nice footnote:
> When I started working on this I was surprised that it’s still exploitable even though the vulnerabilities have been public for a while. If you read guidance on the topic, it will usually talk about how these vulnerabilities have been mitigated if your OS is up to date, with a note that you should disable hyper-threading to protect yourself fully. The focus on mitigations certainly gave me a false sense that the vulnerabilities had been addressed, and I think these articles could be clearer about the impact of leaving hyper-threading enabled.
DOM Clobbering strikes back
> As classic client-side vulnerabilities like XSS and CSRF get patched, CSP’d and SameSite’d into oblivion, niche attack techniques like DOM Clobbering are becoming ever more relevant. Michał Bentkowski recently used DOM Clobbering to exploit GMail, six years after I first introduced the technique in 2013. In this post, I’m going to quickly introduce DOM Clobbering, expand on my original research with some new techniques, and share two interactive labs so you can try the techniques out for yourself.
> This post explains how to implement heap allocators from scratch. It presents and discusses different allocator designs, including bump allocation, linked list allocation, and fixed-size block allocation. For each of the three designs, we will create a basic implementation that can be used for our kernel.
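Of the three designs, bump allocation is the simplest; here is a rough sketch of the idea in C (not the post's implementation). The obvious limitation is that nothing can be freed individually, only the whole arena at once, which is what motivates the other two designs.

```c
/* Rough sketch of bump allocation: a single offset "bumps" forward through
 * a fixed arena on every allocation; freeing means resetting the whole arena. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static uint8_t arena[1 << 16];   /* backing memory for the allocator */
static size_t  next_free = 0;    /* offset of the first unused byte */

static void *bump_alloc(size_t size, size_t align) {  /* align: power of two */
    size_t start = (next_free + align - 1) & ~(align - 1);  /* align up */
    if (start + size > sizeof arena)
        return NULL;             /* arena exhausted */
    next_free = start + size;
    return &arena[start];
}

static void bump_reset(void) {   /* frees everything at once */
    next_free = 0;
}

int main(void) {
    int *xs = bump_alloc(10 * sizeof *xs, _Alignof(int));
    for (int i = 0; xs && i < 10; i++) xs[i] = i;
    printf("allocated at %p, next_free = %zu\n", (void *)xs, next_free);
    bump_reset();
    return 0;
}
```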
OpenSMTPD advisory dissected
> Qualys contacted me by e-mail to tell me they had found a vulnerability in OpenSMTPD and would send me an encrypted draft of the advisory. Receiving this kind of e-mail when working on a daemon that can’t completely revoke privileges is not a thing you want to read, particularly when you know how efficient they are at spotting a small bug and leveraging it into a full-fledged clusterfuck.
Legacy code bad, even when it’s freshly written legacy code.
The Hidden Number Problem
> The Hidden Number Problem (HNP) poses the question: are the most significant bits of a Diffie-Hellman shared key as hard to compute as the entire secret? The original problem was defined in the paper “Hardness of computing the most significant bits of secret keys in Diffie-Hellman and related schemes” by Dan Boneh and Ramarathnam Venkatesan.
> In this paper Boneh and Venkatesan demonstrate that a bounded number of most significant bits of a shared secret are as hard to compute as the entire secret itself. They also demonstrate an efficient algorithm for recovering secrets given a significant enough bit leakage. This notebook walks through some of the paper and demonstrates some of the results.
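Roughly, the setup looks like this (my paraphrase and notation, not the notebook's): write MSB_k(x) for any integer within p/2^(k+1) of x, i.e. an approximation good to about the k leading bits.

```latex
% Hidden Number Problem, paraphrased (notation is mine, not the notebook's):
\[
  \text{given uniformly random } t_1, \dots, t_d \in \mathbb{Z}_p^{*}
  \text{ and the hints } \mathrm{MSB}_k(\alpha\, t_i \bmod p)
  \text{ for } i = 1, \dots, d,
  \quad \text{recover the hidden } \alpha \in \mathbb{Z}_p^{*}.
\]
```

Boneh and Venkatesan recover α by turning those hints into a closest-vector problem on a lattice and running LLL-style basis reduction; roughly √(log p) leaked bits per hint already suffice.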
Precision Opportunities for Demanded Bits in LLVM
> A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
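A contrived illustration of the sort of simplification this analysis is meant to justify (my example, not from the post):

```c
/* Only the low 8 bits of the result are ever used, so only those bits of
 * the 32-bit intermediate value are "demanded". */
#include <stdint.h>

uint8_t low_byte_times_five(uint32_t x) {
    /* The mask clears bits 24..31, but those bits cannot influence the low
     * 8 bits of the product, and the truncation below discards the rest --
     * so demanded-bits analysis can justify deleting the AND and computing
     * just (uint8_t)(x * 5). */
    uint32_t wide = (x & 0x00FFFFFFu) * 5u;
    return (uint8_t)wide;
}
```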
The Soundness Pledge
> This post is an opportunity to share some thoughts I’ve had about soundness, Rust, and open source community.
> I believe one of the most important contributions of Rust is the cultural ideal of perfect soundness: that code using a sound library, no matter how devious, is unable to trigger undefined behavior (which is often thought of in terms of crashes but can be far more insidious). Any deviation from this is a bug. The Rust language itself clearly subscribes to this ideal, even as it sometimes falls short of attaining it (at this writing, there are 44 I-unsound bugs, the oldest of which is more than 6 years old).
Gathering Intel on Intel AVX-512 Transitions
> This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
> Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.