Analyzing Starfield’s Performance on Nvidia’s 4090 and AMD’s 7900 XTX
We analyzed this scene using Nvidia’s Nsight Graphics and AMD’s Radeon GPU Profiler to get some insight into why Starfield performs the way it does. On the Nvidia side, we covered the last three generations of cards by testing the RTX 4090, RTX 3090, and Titan RTX. On AMD, we tested the RX 7900 XTX. The i9-13900K was used to collect data for all of these GPUs.
FreeBSD on Firecracker
Experiences porting FreeBSD 14 to run on the Firecracker VMM
When Good Correlation is Not Enough
Choosing to use a block range index (BRIN) to query a field with high correlation is a no-brainer for the optimizer. The small size of the index and the field’s correlation makes BRIN an ideal choice. However, a recent event taught us that correlation can be misleading. Under some easily reproducible circumstances, a BRIN index can result in significantly slower execution even when the indexed field has very high correlation.
Commander Keen's Adaptive Tile Refresh
I have been reading Doom Guy by John Romero. It is an excellent book which I highly recommend. In the ninth chapter, John describes being hit by lightning upon seeing Adaptive Tile Refresh (ATS). That made me realize I never took the time to understand how this crucial piece of tech powers the Commander Keen (CK) series.
At its heart the problem ATS solves is bandwidth. Writing 320x200 nibbles (32 KiB) per frame is too much for the ISA bus. There is no way to maintain a 60Hz framerate while refreshing the whole screen. If we were to run the following code, which simply fills all banks, it would run at 5 frames per seconds.
Shoot ’em up in style: the making of Gun Trails on Playdate
Enter Playdate. I had wanted to build a shmup for years, but for various reasons—primarily bad scoping—the efforts always sputtered out. This little yellow device could provide the constraints needed, with the added bonus of a programming challenge to hit consistently high framerates.
gotraceui - an efficient frontend for Go execution traces
Gotraceui is a tool for visualizing and analyzing Go execution traces. It is meant to be a faster, more accessible, and more powerful alternative to go tool trace. Unlike go tool trace, Gotraceui doesn’t use deprecated browser APIs (or a browser at all), and its UI is tuned specifically to the unique characteristics of Go traces.
The futex_waitv() syscall and gaming on Linux
The futex_waitv syscall is a new syscall through which the process can wait for multiple futexes. The task wakes up when any futex in the list is awakened. This can be used to implement wait on multiple locks and wait lists, etc, without the limitations imposed by using eventfd.
The 8086 processor's microcode pipeline from die analysis
Do Not Taunt Happy Fun Branch Predictor
I recently came up with a “clever” idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don’t fall victim in the future.
Building the fastest Lua interpreter.. automatically!
I have been working on a research project to make writing VMs easier. The idea arises from the following observation: writing a naive interpreter is not hard (just write a big switch-case), but writing a good interpreter (or JIT compiler) is hard, as it unavoidably involves hand-coding assembly. So why can’t we implement a special compiler to automatically generate a high-performance interpreter (and even the JIT) from “the big switch-case”, or more formally, a semantical description of what each bytecode does?
How fast are Linux pipes anyway?
In this post, we will explore how Unix pipes are implemented in Linux by iteratively optimizing a test program that writes and reads data through a pipe.
We will proceed as follows:
A first slow version of our pipe test bench;
How pipes are implemented internally, and why writing and reading from them is slow;
How the vmsplice and splice syscalls let us get around some (but not all!) of the slowness;
A description of Linux paging, leading up to a faster version using huge pages;
The final optimization, replacing polling with busy looping;
Some closing thoughts.
All About Libpas, Phil's Super Fast Malloc
Libpas is a fast and memory-efficient memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it’ll be used for comprehensive isoheaping of all malloc/new callsites in C/C++ programs.
Faster CRC32 on the Apple M1
CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
Speeding up sort performance in Postgres 15
Let’s explore each of the 4 improvements in PostgreSQL 15 that make sort performance go faster:
Change 1: Improvements sorting a single column
Change 2: Reduce memory consumption by using generation memory context
Change 3: Add specialized sort routines for common datatypes
Change 4: Replace polyphase merge algorithm with k-way merge
Fixing the entire SM64 Source Code (Insane N64 performance)
Recap of a lot of work, optimizing and rewriting code to squeeze out performance on limited hardware.
Changing std::sort at Google’s Scale and Beyond
We are changing std::sort in LLVM’s libcxx. That’s a long story of what it took us to get there and all possible consequences, bugs you might encounter with examples from open source. We provide some benchmarks, perspective, why we did this in the first place and what it cost us with exciting ideas from Hyrum’s Law to reinforcement learning. All changes went into open source and thus I can freely talk about all of them.
This article is split into 3 parts, the first is history with all details of recent (and not so) past of sorting in C++ standard libraries. Second part is about what it takes to switch from one sorting algorithm to another with various bugs. The final one is about the implementation we have chosen with all optimizations we have done.
What's new in CPUs since the 80s?
Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.
x86 chips have picked up a lot of new features and whiz-bang gadgets.
Overall, a pretty good introduction to modern CPUs, performance, and concurrency.
How to speed up the Rust compiler in April 2022
In my last post I introduced the Compiler performance roadmap for 2022. Let’s see how things are progressing.
Along the way I had to undo some optimizations I had added to this code a couple of years ago. Those optimizations turned out to be useful for one kind of expensive macro (with many rules but no metavariables) present in the html5ever benchmark. But such macros aren’t common in practice, and these optimizations were unhelpful for more typical expensive macros, which are recursive, have fewer rules, and use metavariables. This shows the value of a good benchmark suite.
Generics can make your Go code slower
Go 1.18 is here, and with it, the first release of the long-awaited implementation of Generics is finally ready for production usage. Generics are a frequently requested feature that has been highly contentious throughout the Go community. On the one side, vocal detractors worry about the added complexity. They fear the inescapable evolution of Go towards either a verbose and Enterprisey Java-lite with Generic Factories or, most terrifyingly, a degenerate HaskellScript that replaces ifs with Monads. In all fairness, both these fears may be overblown. On the other side, proponents of generics believe that they are a critical feature to implement clean and reusable code at scale.
This blog post does not take sides in that debate, or advise where and when to use Generics in Go. Instead, this blog post is about the third side of the generics conundrum: It’s about systems engineers who are not excited about generics per se, but about monomorphization and its performance implications. There are dozens of us! Dozens! And we’re all due for some serious disappointment.
PartialExecuter: Reducing WebAssembly size by exploring all executions in LLVM
Partial Executer is a brand-new LLVM optimization pass that uses an Interpreter-like engine to prove some code will never be executed, making it safe to eliminate it.