The multi-generational LRU
One of the key tasks assigned to the memory-management subsystem is to optimize the system’s use of the available memory; that means pushing out pages containing unused data so that they can be put to better use elsewhere. Predicting which pages will be accessed in the near future is a tricky task, and the kernel has evolved a number of mechanisms designed to improve its chances of guessing right. But the kernel not only often gets it wrong, it also can expend a lot of CPU time to make the incorrect choice. The multi-generational LRU patch set posted by Yu Zhao is an attempt to improve that situation.
Uncovering a 24-year-old bug in the Linux Kernel
When one side’s receive buffer (Recv-Q) fills up (in this case because the rsync process is doing disk I/O at a speed slower than the network’s), it will send out a zero window advertisement, which will put that direction of the connection on hold. When buffer space eventually frees up, the kernel will send an unsolicited window update with a non-zero window size, and the data transfer continues. To be safe, just in case this unsolicited window update is lost, the other end will regularly poll the connection state using the so-called Zero Window Probes (the persist mode we are seeing here).
Apparently, the bug was in the bulk receiver fast-path, a code path that skips most of the expensive, strict TCP processing to optimize for the common case of bulk data reception. This is a significant optimization, outlined 28 years ago by Van Jacobson in his “TCP receive in 30 instructions” email. Apparently the Linux implementation did not update snd_wl1 while in the receiver fast path. If a connection uses the fast path for too long, snd_wl1 will fall so far behind that ack_seq will wrap around with respect to it. And if this happens while the receive window is zero, there is no way to re-open the window, as demonstrated above. What’s more, this bug had been present in Linux since v2.1.8, dating back to 1996!
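To make the wrap-around concrete, here is a minimal Python sketch of 32-bit sequence comparison and a simplified window-update check. It is modeled on my recollection of the kernel's `after()` helper and its window-update test, not copied from the linked article; the function names, parameter layout, and sample numbers are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def after(seq1, seq2):
    """True if seq1 is later than seq2 in 32-bit wrap-around arithmetic.

    Mirrors the signed-difference trick: (s32)(seq2 - seq1) < 0.
    """
    return ((seq2 - seq1) & MASK32) >= 0x80000000


def may_update_window(ack, ack_seq, nwin, snd_una, snd_wl1, snd_wnd):
    """Simplified model of the check that accepts a window update.

    snd_wl1 is the sequence number of the segment used for the last
    window update; a stale value makes the second test misfire.
    """
    return (after(ack, snd_una)
            or after(ack_seq, snd_wl1)
            or (ack_seq == snd_wl1 and nwin > snd_wnd))


# Hypothetical numbers: snd_wl1 was last refreshed at 1000, and the peer
# has since sent a little more than 2**31 bytes through the fast path.
stale_wl1 = 1000
ack_seq = (stale_wl1 + 2**31 + 5000) & MASK32

# Zero window (snd_wnd=0), no new data acknowledged (ack == snd_una).
stuck = may_update_window(ack=7, ack_seq=ack_seq, nwin=65535,
                          snd_una=7, snd_wl1=stale_wl1, snd_wnd=0)

# Same segment, but with snd_wl1 kept close behind ack_seq.
fresh_wl1 = (ack_seq - 1448) & MASK32
ok = may_update_window(ack=7, ack_seq=ack_seq, nwin=65535,
                       snd_una=7, snd_wl1=fresh_wl1, snd_wnd=0)

print(stuck, ok)
```

With the stale snd_wl1, the incoming segment looks "older" than the last window update, so the non-zero window is ignored and the connection stays in persist mode.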
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation
In this post I’ll explain how I configured my AMD ThreadRipper Pro workstation with 10 PCIe 4.0 SSDs to achieve 11M IOPS with 4kB random reads and 66 GiB/s throughput with larger IOs - and what bottlenecks & issues I fixed to get there. We’ll look into Linux block I/O internals and their interaction with modern hardware. We’ll use tools & techniques, old and new, for measuring bottlenecks - and other adventures in the kernel I/O stack.
How to make Bash fail badly on Ubuntu 16.04 by typo'ing a command name
The simple thing to say about this is that it only happens on Ubuntu 16.04, not on 18.04 or 20.04, and it happens because Ubuntu’s normal /etc/bash.bashrc defines a command_not_found_handle function that winds up running a helper program to produce this ‘did you mean’ report. The helper program comes from the command-not-found package, which is installed because it’s Recommended by ubuntu-standard.
GNOME has no thumbnails in the file picker (and my toilets are blocked)
The file picker is the pop-up box thingy that appears when you’re opening a file, usually when uploading something online. The GNOME desktop environment uses the file picker package GtkFileChooser. This file picker does not have a thumbnail view. It is broken software. Thumbnails are not a cute little extra, they are essential. This is as bad as a file picker that doesn’t list the name of the files, only their creation date, or inode serial number. It is broken software.
Personally, not a big deal, but fair point.
PAM Bypass: when null(is not)ok
The commit attempts to avoid a timing attack against PAM. An attacker can learn valid user names by timing how quickly PAM returns an error, so the fix is to always validate against a user that exists on the system, which keeps the timing consistent. But which user is always present on a Linux system? root!
The code does not check whether root has a valid password set. An invalid user would fail, fall through to root and try to validate. root has no password. It’s blank. We have nullok set. And we have pam_permit.so. The invalid user is authenticated. We have enough information to do a quick POC.
1 + 1 = 3.
What they don’t tell you about demand paging in school
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
Good look at practical realities.
Major Bug in glibc is Killing Applications With a Memory Limit
malloc() preallocates large chunks of memory, per thread. This is meant as a performance optimization, to reduce memory contention in highly threaded applications. On a typical physical server, say a dual Xeon machine with a terabyte of RAM, the core count is easily 40 or above: 10 cores * 2 CPUs * 2 for hyper-threading. This means a preallocation of up to 20 GB of RAM in the process.
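The 20 GB figure can be reproduced with a few lines of arithmetic. The excerpt does not state the per-core arena count or the per-arena size; the sketch below assumes the 64-bit glibc defaults as I understand them (up to 8 arenas per logical core, each reserving 64 MiB), so treat both constants as assumptions.

```python
# Assumed glibc defaults on 64-bit systems (not stated in the excerpt).
ARENAS_PER_CORE = 8       # assumed default cap: 8 arenas per logical core
ARENA_RESERVE_MIB = 64    # assumed reservation per arena, in MiB


def max_arena_reservation_gib(cores_per_cpu, cpus, threads_per_core):
    """Upper bound on memory reserved by malloc arenas, in GiB."""
    logical_cores = cores_per_cpu * cpus * threads_per_core
    arenas = logical_cores * ARENAS_PER_CORE
    return arenas * ARENA_RESERVE_MIB / 1024


# The server from the excerpt: 10 cores * 2 CPUs * 2 hyper-threads = 40.
print(max_arena_reservation_gib(10, 2, 2))  # 20.0
```

This is a worst case: it is only reached if enough threads contend on malloc() for every arena to be created.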
KVM host in a few lines of code
KVM is a virtualization technology that comes with the Linux kernel. In other words, it allows you to run multiple virtual machines (VMs) on a single Linux VM host. VMs in this case are known as guests. If you ever used QEMU or VirtualBox on Linux - you know what KVM is capable of.
But how does it work under the hood?
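As a taste of what "under the hood" means here: KVM is driven through ioctl() calls on /dev/kvm. The sketch below is mine, not the linked article's code, and only queries the API version. The request number 0xAE00 is what `KVM_GET_API_VERSION` works out to on x86-64 (an assumption on other architectures), and the helper name is made up for illustration.

```python
import fcntl
import os

# _IO(KVMIO, 0x00) with KVMIO = 0xAE; value assumed for x86-64.
KVM_GET_API_VERSION = 0xAE00


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if KVM is not accessible."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        # No KVM support, or no permission to open the device.
        return None
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl result.
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)


print(kvm_api_version())
```

On a host with KVM available this should print 12, the version the KVM API has been frozen at for years; creating a VM and vCPUs continues from the same file descriptor with further ioctls.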
The case of the missing DNS packets
Troubleshooting is both a science and an art. The first step is to make a hypothesis about why something is behaving in an unexpected way, and then prove whether or not the hypothesis is correct. But before you can formulate a hypothesis, you first need to clearly identify the problem, and express it with precision. If the issue is too vague, then you need to brainstorm in order to narrow down the problem—this is where the “artistic” part of the process comes in.
systemd, 10 years later: a historical and technical retrospective
10 years ago, systemd was announced and swiftly rose to become one of the most persistently controversial and polarizing pieces of software in recent history, and especially in the GNU/Linux world. The quality and nature of debate has not improved in the least from the major flame wars around 2012-2014, and systemd still remains poorly understood and understudied from both a technical and social level despite paradoxically having disproportionate levels of attention focused on it.
I am writing this essay both for my own solace, so I can finally lay it to rest, but also with the hopes that my analysis can provide some context to what has been a decade-long farce, and not, as in Benno Rice’s now famous characterization, tragedy.
Why strace doesn't work in Docker
But I wasn’t interested in fixing it, I wanted to know why it happens. So why does strace not work, and why does --cap-add=SYS_PTRACE fix it?
Hunting a Linux kernel bug
Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug!
Exploiting Race Conditions Using the Scheduler
This talk shows how two bugs involving somewhat narrow-looking race windows (https://crbug.com/project-zero/1695 in the Linux kernel, https://crbug.com/project-zero/1741 in Android userspace code) can be stretched wide enough to win the race conditions on a Google Pixel 2 phone, running a Linux 4.4 kernel, by making use of the unprivileged sched_*() syscalls.
Curiosity around 'exec_id' and some problems associated with it
The logic responsible for handling ->exit_signal has been changed a few times, and the current logic has been locked down since Linux kernel 3.3.5. However, it is not fully robust and it’s still possible for a malicious user to bypass it. Basically, it’s possible to send arbitrary signals to a privileged (suidroot) parent process (Problem I.). Nevertheless, it’s not trivial and is more limited compared to CVE-2009-1337.
A "living" Linux process with no memory
This code gets a list of all memory maps from /proc/self/maps, then creates a new executable map where it jits some code that calls munmap() on each of the maps it just got, and finally on the map it’s on. This is just a quick example with no portability in mind, so the source code contains the actual bytes that would be emitted by an x64 compiler. After unmapping the final map, where the jit code lies, there’s no new instruction to execute and a segfault is raised.
Speeding up Linux disk encryption
At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is: a significant amount of tail latency is due to queueing effects.
Another look at two Linux KASLR patches
In the end, this random number generator was quickly removed, and that was that. But one can still wonder—is this generator secure but unanalyzed, or would it have been broken just to prove a point?
A Compendium of Container Escapes
The goal of this talk is to broaden the awareness of how and why container escapes work, starting from a brief intro to what makes a process a container, and then spanning the gamut of escape techniques, covering exposed orchestrators, access to the Docker socket, exposed mount points, /proc, all the way down to overwriting/exploiting the kernel structures to leave the confines of the container.
The FreeBSD-linuxulator explained (for users)
First, the linuxulator is not an emulation. It is “just” a binary interface which is a little bit different from the FreeBSD-“native”-one. This means that the binary files in FreeBSD and Linux are both files which comply with the ELF specification.