How I cut GTA Online loading times by 70%
Some debug-stepping later it turns out it’s… JSON!
Of course it is. But a really solid reversing effort. And a nice fix.
Bilinear down/upsampling, aligning pixel grids, and that infamous GPU half pixel offset
It’s been more than two decades of me using bilinear texture filtering, a few months since I’ve written about bilinear resampling, but only two days since I discovered a bug of mine related to it. 😅 Similarly, just last week a colleague asked for a very fast implementation of bilinear on a CPU and it caused a series of questions “which kind of bilinear?”.
So I figured it’s an opportunity for another short blog post – on bilinear filtering, but in context of down/upsampling. We will touch here on GPU half pixel offsets, aligning pixel grids, a bug / confusion in Tensorflow, deeper signal processing analysis of what’s going on during bilinear operations, and analysis of the magic of the famous “magic kernel”.
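To make the "which kind of bilinear?" question concrete, here is a minimal 1D sketch of two common coordinate conventions: half-pixel centers and aligned corners. The function names are made up for illustration and are not taken from the linked post or from any library.

```python
def lerp_sample(signal, x):
    """Linearly interpolate `signal` at fractional index x, clamping at the edges."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i0 = int(x)
    i1 = min(i0 + 1, len(signal) - 1)
    t = x - i0
    return signal[i0] * (1.0 - t) + signal[i1] * t


def resample_half_pixel(signal, out_len):
    """Treat samples as pixel centers: output i maps to (i + 0.5) * scale - 0.5."""
    scale = len(signal) / out_len
    return [lerp_sample(signal, (i + 0.5) * scale - 0.5) for i in range(out_len)]


def resample_align_corners(signal, out_len):
    """Map the first and last output samples exactly onto the first and last inputs."""
    if out_len == 1:
        return [lerp_sample(signal, 0.0)]
    scale = (len(signal) - 1) / (out_len - 1)
    return [lerp_sample(signal, i * scale) for i in range(out_len)]


ramp = [0.0, 1.0, 2.0, 3.0]
half_pixel = resample_half_pixel(ramp, 2)   # samples halfway between input pairs
aligned = resample_align_corners(ramp, 2)   # samples only the two end points
```

Same input, same "bilinear" label, two different answers: with half-pixel centers a 2x downsample averages neighbouring pairs, while aligned corners just keeps the end points.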
Uncovering a 24-year-old bug in the Linux Kernel
When one side’s receive buffer (Recv-Q) fills up (in this case because the rsync process is doing disk I/O at a speed slower than the network’s), it will send out a zero window advertisement, which will put that direction of the connection on hold. When buffer space eventually frees up, the kernel will send an unsolicited window update with a non-zero window size, and the data transfer continues. To be safe, just in case this unsolicited window update is lost, the other end will regularly poll the connection state using the so-called Zero Window Probes (the persist mode we are seeing here).
Apparently, the bug was in the bulk receiver fast-path, a code path that skips most of the expensive, strict TCP processing to optimize for the common case of bulk data reception. This is a significant optimization, outlined 28 years ago by Van Jacobson in his “TCP receive in 30 instructions” email. Apparently the Linux implementation did not update snd_wl1 while in the receiver fast path. If a connection uses the fast path for too long, snd_wl1 will fall so far behind that ack_seq will wrap around with respect to it. And if this happens while the receive window is zero, there is no way to re-open the window, as demonstrated above. What’s more, this bug had been present in Linux since v2.1.8, dating back to 1996!
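The wrap-around is easier to see with numbers. Below is a simplified model of a 32-bit modular sequence comparison; `seq_after` and the values are illustrative assumptions, not the kernel's actual code.

```python
MASK = 0xFFFFFFFF  # TCP sequence numbers are 32 bits wide


def seq_after(a, b):
    """True if a counts as 'after' b when (b - a) is read as a signed 32-bit value."""
    return ((b - a) & MASK) >= 0x80000000


stale_wl1 = 1000  # hypothetical snd_wl1 that stopped being updated

# A slightly newer ack_seq still looks newer than the stale value.
recent_ok = seq_after(stale_wl1 + 100, stale_wl1)

# Once more than 2**31 bytes have gone by, the same comparison flips:
# the genuinely newer ack_seq now looks older than the stale snd_wl1.
far_ahead = (stale_wl1 + 2**31 + 100) & MASK
wrapped_ok = seq_after(far_ahead, stale_wl1)
```

In this model `recent_ok` is true and `wrapped_ok` is false. That is the shape of the failure the excerpt describes: a window update checked against a reference that has fallen too far behind gets treated as old.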
2020 Chrome Extension Performance Report
I tested how the 1000 most popular Chrome extensions affect browser performance. The main metrics I’ll consider are CPU consumption, memory consumption, and whether the extension makes pages render more slowly.
Some results are terrible. Some are worse.
Improving texture atlas allocation in WebRender
This is a longer version of the piece I published in the mozilla gfx team blog where I focus on the atlas allocation algorithms. In this one I’ll go into more details about the process and methodology behind these improvements. The first part is about the making of guillotiere, a crate that I first released in March 2019. In the second part we’ll have a look at more recent work building upon what I did with guillotiere, to improve texture memory usage in WebRender/Firefox.
Dissecting the Apple M1 GPU
Apple’s latest line of Macs includes their in-house “M1” system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!
And part II where it really takes off: https://rosenzweig.io/blog/asahi-gpu-part-2.html
donut.c without a math library
My little donut.c has been making the rounds again, after being featured in a couple YouTube videos (e.g., Lex Fridman and Joma Tech). If I had known how much attention this code would get over the years, I would have spent more time on it.
Introducing the In-the-Wild Series
Unsafe string interning in Go
The result of this work is the package go4.org/intern which uses some pretty neat unsafe tricks to implement efficient string interning using weak references and Go finalizers. We’ll start by showing off the safe implementation and gradually introduce the concepts needed to understand the unsafe one as well.
node.example.com Is An IP Address
This takes a bit to get to the punchline, but man, good old duck typing for the win.
It turns out that, under certain conditions, the ipaddress module can create IPv6 addresses from raw bytes. My assumption is that it offers this behavior as a convenient way to parse IP addresses from data fresh off the wire.
Does node.example.com meet those certain conditions? You bet it does. Because we’re using Python 2 it’s just bytes and it happens to be 16 characters long.
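You can reproduce the effect in Python 3 by passing bytes explicitly. This is a small sketch using the standard library `ipaddress` module. The original story involves Python 2, where plain strings are already bytes.

```python
import ipaddress

hostname = b"node.example.com"  # exactly 16 bytes, the size of a packed IPv6 address

# 16 raw bytes are accepted as a packed IPv6 address.
addr = ipaddress.ip_address(hostname)
is_ipv6 = isinstance(addr, ipaddress.IPv6Address)

# The same hostname as text is rejected, because it does not parse as an address.
try:
    ipaddress.ip_address("node.example.com")
    text_accepted = True
except ValueError:
    text_accepted = False
```

The bytes form parses because only the length is checked, while the text form goes through real address parsing and fails.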
Against essential and accidental complexity
In the classic 1986 essay, No Silver Bullet, Fred Brooks argued that there is, in some sense, not that much that can be done to improve programmer productivity. His line of reasoning is that programming tasks contain a core of essential/conceptual complexity that’s fundamentally not amenable to attack by any potential advances in technology (such as languages or tooling). He then uses an Amdahl’s law argument, saying that because 1/X of complexity is essential, it’s impossible to ever get more than a factor of X improvement via technological improvements.
To summarize, Brooks states a bound on how much programmer productivity can improve. But, in practice, to state this bound correctly, one would have to be able to conceive of problems that no one would reasonably attempt to solve due to the amount of friction involved in solving the problem with current technologies.
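The Amdahl's law bound in Brooks's argument is easy to check numerically. A minimal sketch, with function and parameter names chosen here for illustration:

```python
def overall_speedup(essential_fraction, accidental_speedup):
    """Amdahl-style bound: only the non-essential share of the work gets faster."""
    accidental_fraction = 1.0 - essential_fraction
    return 1.0 / (essential_fraction + accidental_fraction / accidental_speedup)


# If 1/4 of the effort is essential, even a near-infinite improvement
# to everything else approaches, but never exceeds, a factor of 4.
modest = overall_speedup(0.25, 10.0)
extreme = overall_speedup(0.25, 1e12)
```

The bound is only as good as the estimate of the essential fraction, which is the part of the argument the linked essay pushes back on.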
Why are video games graphics (still) a challenge? Productionizing rendering algorithms
This post will cover challenges and aspects of production to consider when creating new rendering / graphics techniques and algorithms – especially in the context of applied research for real time rendering. I will base this on my personal experiences, working on Witcher 2, Assassin’s Creed 4: Black Flag, Far Cry 4, and God of War.
Many of those challenges are easily ignored – they are real problems in production, but not necessarily apparent if you only read about those techniques, or if you work on pure research, write papers, or create tech demos.
I have seen statements like “why is this brilliant research technique X not used in production?” both from gamers, but also from my colleagues with academic background. And there are always some good reasons!
This is quite extensive.
The Easy Ones – Three Bugs Hiding in the Open
If everyone on a project spends all of their time heads-down working on the features and known bugs then there are probably some easy bugs hiding in plain sight. Take some time to look through the logs, clean up compiler warnings (although, really, if you have compiler warnings you need to rethink your life choices), and spend a few minutes running a profiler. Extra points if you add custom logging, enable some new warnings, or use a profiler that nobody else does.
Pointers Are Complicated II, or: We need better language specs
Below, I will show a series of three compiler transformations that each seem “intuitively justified”, but when taken together they lead to a clearly incorrect result. I will use LLVM for these examples, but the goal is not to pick on LLVM—other compilers suffer from similar issues. The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust, we need to take IR semantics (and specifically provenance) more seriously. I use LLVM for the examples because it is particularly easy to study with its single, extensively-documented IR that a lot of infrastructure evolved around. Let’s get started!
ARM and Lock-Free Programming
This is intended to be a casual introduction to the perils of lock-free programming (which I last wrote about some fifteen years ago), but also some explanation of why ARM’s weak memory model breaks some code, and why that code was probably broken already. I also want to explain why C++11 made the lock-free situation strictly better (objections to the contrary notwithstanding).
retvals, terrible teaching, and admitting we have a problem
Really though, this is everywhere. It’s not just that one class. It’s not just that one school. It shows up all over the place. The vast majority of pages about this kind of stuff manage to convey it incorrectly. It’s clear that not only is the horse out of the barn, but the cat is out of the bag, and the whole damn menagerie has cut loose and is running down Broadway singing show tunes. You just can’t expect people to do the right thing when the right thing is implemented this way. Too many people have voted with their feet and have decreed that they are just going to not check, and whatever happens, happens.
Fixing a 3+ year old bug in NVIDIA GeForce Experience
So the issue is such: If you have a joystick plugged in, and the GeForce Experience overlay enabled, your display will not sleep. If you unplug the joystick, the display sleeps. If you disable the overlay, the display sleeps. You can have one or the other - but not both. People hadn’t just tracked the issue down - people tracked it down 3 years ago!
But now for the deep dive disassembly to find and fix the bug. Solid work.
What went wrong with the libdispatch. A tale of caution for the future of concurrency.
The future was multithreading and we had to use the libdispatch to get there. So we did.
As we went down that rabbit hole, things got progressively worse.
What they don’t tell you about demand paging in school
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
Good look at practical realities.
Floating Point in the Browser, Part 3: When x+y=x
That is, if you add a small number to a large number then if the small number is “too small” then the large number may (in the default/sane round-to-nearest mode) stay at the same value.
Because of this the loop spins endlessly and the push command runs until the array hits the size limits. If there were no size limits then the push command would keep running until the entire machine ran out of memory, so, yay?
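Python floats are IEEE 754 doubles, the same format as JavaScript numbers, so the effect is easy to reproduce. This is a sketch of the general failure mode with an iteration cap added so it terminates. The helper name is made up, and this is not the code from the linked post.

```python
def count_iterations(start, stop, step, limit):
    """Count loop steps from start to stop, giving up after `limit` iterations."""
    x = start
    n = 0
    while x < stop and n < limit:
        x += step  # has no effect once step is too small relative to x
        n += 1
    return n


big = 2.0 ** 53  # above this, not every integer is representable as a double

stuck = big + 1.0 == big                                   # the small addend is rounded away
normal = count_iterations(0.0, 10.0, 1.0, 1000)            # finishes after 10 steps
spinning = count_iterations(big, big + 10.0, 1.0, 1000)    # never advances, hits the cap
```

Without the cap, the second loop would run forever, which matches the runaway `push` described in the excerpt.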