Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C
While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.
Eliminating Data Races in Firefox – A Technical Report
We successfully deployed ThreadSanitizer in the Firefox project to eliminate data races in our remaining C/C++ components. In the process, we found several impactful bugs and can safely say that data races are often underestimated in terms of their impact on program correctness. We recommend that all multithreaded C/C++ projects adopt the ThreadSanitizer tool to enhance code quality.
Cranelift, Part 3: Correctness in Register Allocation
In this post, I will cover how we worked to ensure correctness in our register allocator, regalloc.rs, by developing a symbolic checker that uses abstract interpretation to prove correctness for a specific register allocation result. By using this checker as a fuzzing oracle, and driving just the register allocator with a focused fuzzing target, we have been able to uncover some very interesting and subtle bugs, and achieve a fairly high confidence in the allocator’s robustness.
Pointers Are Complicated II, or: We need better language specs
Below, I will show a series of three compiler transformations that each seem “intuitively justified”, but when taken together they lead to a clearly incorrect result. I will use LLVM for these examples, but the goal is not to pick on LLVM—other compilers suffer from similar issues. The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust, we need to take IR semantics (and specifically provenance) more seriously. I use LLVM for the examples because it is particularly easy to study with its single, extensively-documented IR that a lot of infrastructure evolved around. Let’s get started!
Rust programming language exploit mitigations
This section documents the exploit mitigations applicable to the Rust compiler when building programs for the Linux operating system on the AMD64 architecture and equivalent.
Zig's New Relationship with LLVM
In the early days, Zig was but a thin frontend in front of LLVM. This was instrumental for getting started quickly and filling in gaps of Andrew’s knowledge as a compiler developer. Now, the training wheels of the bicycle are coming off, and LLVM is transitioning into an optional component.
The move to a self-hosted compiler for Zig has similar advantages for the core contributors, but it also makes LLVM an optional dependency, increases compilation speed (instead of losing it), and adds an amazing feature for debug builds of your code: incremental compilation with in-place binary patching, another unique Zig feature.
Proposal: Register-based Go calling convention
We propose switching the Go ABI from its current stack-based calling convention to a register-based calling convention. Preliminary experiments indicate this will achieve at least a 5–10% throughput improvement across a range of applications. This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism.
This also presents a very nice overview of existing calling conventions.
Ensmallening Go binaries by prohibiting comparisons
In this post I’ll dig into what equality, in the context of a Go program, means and why changes like this have a measurable impact on the size of a Go program.
Addendum: thanks to Brad’s prodding, Go 1.15 already has a bunch of improvements by Cherry Zhang and Keith Randall that fix the most egregious of the failures to eliminate unnecessary equality and hash functions (although I suspect it was also to avoid the proliferation of this class of CLs).
firefox's low-latency webassembly compiler
The goals of high throughput and low latency conflict with each other. To get best throughput, a compiler needs to spend time on code motion, register allocation, and instruction selection; to get low latency, that’s exactly what a compiler should not do. Web browsers therefore take a two-pronged approach: they have a compiler optimized for throughput, and a compiler optimized for latency. As a WebAssembly file is being downloaded, it is first compiled by the quick-and-dirty low-latency compiler, with the goal of producing machine code as soon as possible. After that “baseline” compiler has run, the “optimizing” compiler works in the background to produce high-throughput code. The optimizing compiler can take more time because it runs on a separate thread. When the optimizing compiler is done, it replaces the baseline code. (The actual heuristics about whether to do baseline + optimizing (“tiering“) or just to go straight to the optimizing compiler are a bit hairy, but this is a summary.)
This article is about the WebAssembly baseline compiler in Firefox. It’s a surprising bit of code and I learned a few things from it.
Precision Opportunities for Demanded Bits in LLVM
A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
Speculative Load Hardening
While several approaches are being actively pursued to mitigate specific branches and/or loads inside especially risky software (most notably various OS kernels), these approaches require manual and/or static analysis aided auditing of code and explicit source changes to apply the mitigation. They are unlikely to scale well to large applications. We are proposing a comprehensive mitigation approach that would apply automatically across an entire program rather than through manual changes to the code. While this is likely to have a high performance cost, some applications may be in a good position to take this performance / security tradeoff.
Unintuitive JSON Parsing
Based on the title, could be just about anything...
The parser will not complain about leading zeros because JSON has no concept of leading zeros.
Unexpectedly a parse error, not a lex error.
Introducing Glush: a robust, human readable, top-down parser compiler
It’s been 45 years since Stephen Johnson wrote Yacc (Yet another compiler-compiler), a parser generator that made it possible for anyone to write fast, efficient parsers. Yacc, and its many derivatives, quickly became popular and were included in many Unix distributions. You would imagine that in 45 years we would have further perfected the art of creating parsers and would have standardized on a single tool. A lot of progress has been made, but there are still annoyances and problems affecting every tool out there.
This is great, even just for the overview of parsing.
The CYK algorithm (named after Cocke–Younger–Kasami) is in my opinion of great theoretical importance when it comes to parsing context-free grammars. CYK will parse all context-free parsers in O(n3), including the “simple” grammars that LL/LR can parse in linear time. It accomplishes this by converting parsing into a different problem: CYK shows that parsing context-free languages is equivalent to doing a boolean matrix multiplication. Matrix multiplication can be done naively in cubic time, and as such parsing context-free languages can be done in cubic time. It’s a very satisfying theoretical result, and the actual algorithm is small and easy to understand.
Pointer authentication is a technology which offers strong probabilistic protection against exploiting a broad class of memory bugs to take control of program execution. When adopted consistently in a language ABI, it provides a form of relatively fine-grained control flow integrity (CFI) check that resists both return-oriented programming (ROP) and jump-oriented programming (JOP) attacks.
While pointer authentication can be implemented purely in software, direct hardware support (e.g. as provided by ARMv8.3) can dramatically lower the execution speed and code size costs. Similarly, while pointer authentication can be implemented on any architecture, taking advantage of the (typically) excess addressing range of a target with 64-bit pointers minimizes the impact on memory performance and can allow interoperation with existing code (by disabling pointer authentication dynamically). This document will generally attempt to present the pointer authentication feature independent of any hardware implementation or ABI. Considerations that are implementation-specific are clearly identified throughout.
So We Don'T Have A Solution For Catalina...Yet
With the release of macOS 10.15 (Catalina), Apple has dropped support for running 32-bit executables and removed the 32-bit versions of system frameworks and libraries. Most Windows applications our users run with CrossOver are 32-bit and CrossOver uses a 32-bit Mac executable, system frameworks, and libraries to run them. This will break with Catalina.
And then comes the fun part:
We have built a modified version of the standard C language compiler for macOS, Clang, to automate many of the changes we need to make to Wine’s behavior without pervasive changes to Wine’s source code.
First, our version of Clang understands both 32- and 64-bit pointers. We are able to control from a broad level down to a detailed level which pointers in Wine’s source code need to be 32-bit and which 64-bit. Any code which substitutes for Windows at the interface with the Windows app has to use 32-bit pointers. On the other hand, the interfaces to the system libraries are always 64-bit.
Chunking Optimizations: Let the Knife Do the Work
“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.
Tearing apart printf()
If ‘Hello World’ is the first program for C students, then printf() is probably the first function. I’ve had to answer questions about printf() many times over the years, so I’ve finally set aside time for an informal writeup.
This wild goose chase is not only a great learning experience, but also an interesting test for the dedicated beginner. Will they come back with an answer? If so, how detailed is it? What IS a good answer?
How Swift Achieved Dynamic Linking Where Rust Couldn't
For those who don’t follow Swift’s development, ABI stability has been one of its most ambitious projects and possibly it’s defining feature, and it finally shipped in Swift 5. The result is something I find endlessly fascinating, because I think Swift has pushed the notion of ABI stability farther than any language without much compromise.
AddressSanitizer (ASan) for Windows with MSVC
We are pleased to announce AddressSanitizer (ASan) support for the MSVC toolset. ASan is a fast memory error detector that can find runtime memory issues such as use-after-free and perform out of bounds checks. Support for sanitizers has been one of our more popular suggestions on Developer Community, and we can now say that we have an experience for ASan on Windows, in addition to our existing support for Linux projects.
MSVC support for ASan is available in our second Preview release of Visual Studio 2019 version 16.4.
Most functional compiler
One-letter variable names abound in IOCCC entries, and for good reason. These tiny pieces of confetti are hard to read, and leave room for more code. Then why not go further and use zero-letter variable names? That is, tacit programming or point-free style.
I had been playing with an algorithm devised by Oleg Kiselyov that effortlessly and efficiently eliminates those pesky variables, leaving behind terms composed from a small set of combinators. No need for lambda lifting or supercombinators.
By adding a handful of lines of mostly parsing code, we get a Haskell compiler, or rather, a compiler that accepts a subset of Haskell sufficiently large to self-host. You might say I wrote a tool for this contest, then ran it on itself to make an entry for it.
And more: https://www.ioccc.org/years.html#2019