Running the “Reflections on Trusting Trust” Compiler
In October 1983, 40 years ago this week, Ken Thompson chose supply chain security as the topic for his Turing award lecture, although the specific term wasn’t used back then. (The field of computer science was still young and small enough that the ACM conference where Ken spoke was the “Annual Conference on Computers.”) Ken’s lecture was later published in Communications of the ACM under the title “Reflections on Trusting Trust.” It is a classic paper, and a short one (3 pages); if you haven’t read it yet, you should. This post will still be here when you get back.
In the lecture, Ken explains in three steps how to modify a C compiler binary to insert a backdoor when compiling the “login” program, leaving no trace in the source code. In this post, we will run the backdoored compiler using Ken’s actual code. But first, a brief summary of the important parts of the lecture.
Polonius refers to a few things. It is a new formulation of the borrow checker. It is also a specific project that implemented that analysis, based on datalog. Our current plan does not make use of that datalog-based implementation, but uses what we learned implementing it to focus on reimplementing Polonius within rustc.
An instruction oddity in the ppc64 (PowerPC 64-bit) architecture
As Raymond Chen notes, ‘or rd, ra, ra’ has the effect of ‘move ra to rd’. Moving a register to itself is a NOP, but several Power versions (the Go code’s comment says Power8, 9, and 10) overload this particular version of a NOP (and some others) to signal that the priority of your hardware thread should be changed by the CPU; in the specific case of ‘or r1, r1, r1’ it drops you to low priority. That leaves us with the mystery of why such an instruction would be used by a compiler, instead of the official NOP (per Raymond Chen, this is ‘or r0, r0, 0’).
As covered in the specific ppc64 diff in the change that introduced this issue, Go wanted to artificially mark a particular runtime function this way (see CL 425396 and Go issue #54332 for more). To do this it needed to touch the stack pointer in a harmless way, which would trigger the toolchain’s weirdness detector. On ppc64, the stack pointer is in r1. So the obvious and natural thing to do is to move r1 to itself, which encodes as ‘or r1, r1, r1’, and which then triggers this special architectural behavior of lowering the priority of that hardware thread. Oops.
Building the fastest Lua interpreter.. automatically!
I have been working on a research project to make writing VMs easier. The idea arises from the following observation: writing a naive interpreter is not hard (just write a big switch-case), but writing a good interpreter (or JIT compiler) is hard, as it unavoidably involves hand-coding assembly. So why can’t we implement a special compiler to automatically generate a high-performance interpreter (and even the JIT) from “the big switch-case”, or more formally, a semantical description of what each bytecode does?
The Applesoft Compiler (TASC): We have the source code, in a sense
Chaining was a common technique when your program got too large to fit into memory all at once, so you broke it into multiple programs that each handed off control to each other.
As the author added features, he kept hitting the Apple ][‘s 48KB RAM limit and was forced to delete all the comments from the code, and when that wasn’t enough, he resorted to shortening all the important variable names to one character.
How to speed up the Rust compiler in April 2022
In my last post I introduced the Compiler performance roadmap for 2022. Let’s see how things are progressing.
Along the way I had to undo some optimizations I had added to this code a couple of years ago. Those optimizations turned out to be useful for one kind of expensive macro (with many rules but no metavariables) present in the html5ever benchmark. But such macros aren’t common in practice, and these optimizations were unhelpful for more typical expensive macros, which are recursive, have fewer rules, and use metavariables. This shows the value of a good benchmark suite.
Generics can make your Go code slower
Go 1.18 is here, and with it, the first release of the long-awaited implementation of Generics is finally ready for production usage. Generics are a frequently requested feature that has been highly contentious throughout the Go community. On the one side, vocal detractors worry about the added complexity. They fear the inescapable evolution of Go towards either a verbose and Enterprisey Java-lite with Generic Factories or, most terrifyingly, a degenerate HaskellScript that replaces ifs with Monads. In all fairness, both these fears may be overblown. On the other side, proponents of generics believe that they are a critical feature to implement clean and reusable code at scale.
This blog post does not take sides in that debate, or advise where and when to use Generics in Go. Instead, this blog post is about the third side of the generics conundrum: It’s about systems engineers who are not excited about generics per se, but about monomorphization and its performance implications. There are dozens of us! Dozens! And we’re all due for some serious disappointment.
PartialExecuter: Reducing WebAssembly size by exploring all executions in LLVM
Partial Executer is a brand-new LLVM optimization pass that uses an Interpreter-like engine to prove some code will never be executed, making it safe to eliminate it.
Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C
While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.
Eliminating Data Races in Firefox – A Technical Report
We successfully deployed ThreadSanitizer in the Firefox project to eliminate data races in our remaining C/C++ components. In the process, we found several impactful bugs and can safely say that data races are often underestimated in terms of their impact on program correctness. We recommend that all multithreaded C/C++ projects adopt the ThreadSanitizer tool to enhance code quality.
Cranelift, Part 3: Correctness in Register Allocation
In this post, I will cover how we worked to ensure correctness in our register allocator, regalloc.rs, by developing a symbolic checker that uses abstract interpretation to prove correctness for a specific register allocation result. By using this checker as a fuzzing oracle, and driving just the register allocator with a focused fuzzing target, we have been able to uncover some very interesting and subtle bugs, and achieve a fairly high confidence in the allocator’s robustness.
Pointers Are Complicated II, or: We need better language specs
Below, I will show a series of three compiler transformations that each seem “intuitively justified”, but when taken together they lead to a clearly incorrect result. I will use LLVM for these examples, but the goal is not to pick on LLVM—other compilers suffer from similar issues. The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust, we need to take IR semantics (and specifically provenance) more seriously. I use LLVM for the examples because it is particularly easy to study with its single, extensively-documented IR that a lot of infrastructure evolved around. Let’s get started!
Rust programming language exploit mitigations
This section documents the exploit mitigations applicable to the Rust compiler when building programs for the Linux operating system on the AMD64 architecture and equivalent.
Zig's New Relationship with LLVM
In the early days, Zig was but a thin frontend in front of LLVM. This was instrumental for getting started quickly and filling in gaps of Andrew’s knowledge as a compiler developer. Now, the training wheels of the bicycle are coming off, and LLVM is transitioning into an optional component.
The move to a self-hosted compiler for Zig has similar advantages for the core contributors, but it also makes LLVM an optional dependency, increases compilation speed (instead of losing it), and adds an amazing feature for debug builds of your code: incremental compilation with in-place binary patching, another unique Zig feature.
Proposal: Register-based Go calling convention
We propose switching the Go ABI from its current stack-based calling convention to a register-based calling convention. Preliminary experiments indicate this will achieve at least a 5–10% throughput improvement across a range of applications. This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism.
This also presents a very nice overview of existing calling conventions.
Ensmallening Go binaries by prohibiting comparisons
In this post I’ll dig into what equality, in the context of a Go program, means and why changes like this have a measurable impact on the size of a Go program.
Addendum: thanks to Brad’s prodding, Go 1.15 already has a bunch of improvements by Cherry Zhang and Keith Randall that fix the most egregious of the failures to eliminate unnecessary equality and hash functions (although I suspect it was also to avoid the proliferation of this class of CLs).
firefox's low-latency webassembly compiler
The goals of high throughput and low latency conflict with each other. To get best throughput, a compiler needs to spend time on code motion, register allocation, and instruction selection; to get low latency, that’s exactly what a compiler should not do. Web browsers therefore take a two-pronged approach: they have a compiler optimized for throughput, and a compiler optimized for latency. As a WebAssembly file is being downloaded, it is first compiled by the quick-and-dirty low-latency compiler, with the goal of producing machine code as soon as possible. After that “baseline” compiler has run, the “optimizing” compiler works in the background to produce high-throughput code. The optimizing compiler can take more time because it runs on a separate thread. When the optimizing compiler is done, it replaces the baseline code. (The actual heuristics about whether to do baseline + optimizing (“tiering“) or just to go straight to the optimizing compiler are a bit hairy, but this is a summary.)
This article is about the WebAssembly baseline compiler in Firefox. It’s a surprising bit of code and I learned a few things from it.
Precision Opportunities for Demanded Bits in LLVM
A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
Speculative Load Hardening
While several approaches are being actively pursued to mitigate specific branches and/or loads inside especially risky software (most notably various OS kernels), these approaches require manual and/or static analysis aided auditing of code and explicit source changes to apply the mitigation. They are unlikely to scale well to large applications. We are proposing a comprehensive mitigation approach that would apply automatically across an entire program rather than through manual changes to the code. While this is likely to have a high performance cost, some applications may be in a good position to take this performance / security tradeoff.
Unintuitive JSON Parsing
Based on the title, could be just about anything...
The parser will not complain about leading zeros because JSON has no concept of leading zeros.
Unexpectedly a parse error, not a lex error.