What they don’t tell you about demand paging in school
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
Good look at practical realities.
Major Bug in glibc is Killing Applications With a Memory Limit
malloc() preallocates large chunks of memory, per thread. This is meant as a performance optimization, to reduce memory contention in highly threaded applications. On a typical physical server, dual Xeon CPU with a terabyte of RAM. The core count is easily 40 or above. 10 cores * 2 CPU * 2 for hyper threading. This means a preallocation of up to 20 GB of RAM in the process.
KVM host in a few lines of code
KVM is a virtualization technology that comes with the Linux kernel. In other words, it allows you to run multiple virtual machines (VMs) on a single Linux VM host. VMs in this case are known as guests. If you ever used QEMU or VirtualBox on Linux - you know what KVM is capable of.
But how does it work under the hood?
The case of the missing DNS packets
Troubleshooting is both a science and an art. The first step is to make a hypothesis about why something is behaving in an unexpected way, and then prove whether or not the hypothesis is correct. But before you can formulate a hypothesis, you first need to clearly identify the problem, and express it with precision. If the issue is too vague, then you need to brainstorm in order to narrow down the problem—this is where the “artistic” part of the process comes in.
systemd, 10 years later: a historical and technical retrospective
10 years ago, systemd was announced and swiftly rose to become one of the most persistently controversial and polarizing pieces of software in recent history, and especially in the GNU/Linux world. The quality and nature of debate has not improved in the least from the major flame wars around 2012-2014, and systemd still remains poorly understood and understudied from both a technical and social level despite paradoxically having disproportionate levels of attention focused on it.
I am writing this essay both for my own solace, so I can finally lay it to rest, but also with the hopes that my analysis can provide some context to what has been a decade-long farce, and not, as in Benno Rice’s now famous characterization, tragedy.
Why strace doesn't work in Docker
But I wasn’t interested in fixing it, I wanted to know why it happens. So why does strace not work, and why does --cap-add=SYS_PTRACE fix it?
Hunting a Linux kernel bug
Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug!
Exploiting Race Conditions Using the Scheduler
This talk shows how two bugs involving somewhat narrow-looking race windows (https://crbug.com/project-zero/1695 in the Linux kernel, https://crbug.com/project-zero/1741 in Android userspace code) can be stretched wide enough to win the race conditions on a Google Pixel 2 phone, running a Linux 4.4 kernel, by making use of the unprivileged sched_*() syscalls.
Curiosity around 'exec_id' and some problems associated with it
The logic responsible for handling ->exit_signal has been changed a few times and the current logic is locked down since Linux kernel 3.3.5. However, it is not fully robust and it’s still possible for the malicious user to bypass it. Basically, it’s possible to send arbitrary signals to a privileged (suidroot) parent process (Problem I.). Nevertheless, it’s not trivial and more limited comparing to the CVE-2009-1337.
A "living" Linux process with no memory
This code gets a list of all memory maps from /proc/self/maps, then creates a new executable map where it jits some code that calls munmap() on each of the maps it just got, and finally on the map it’s on. This is just a quick example with no portability in mind, so the source code contains the actual bytes that would be emitted by a x64 compiler. After unmapping the final map, where the jit code lies, there’s no new instruction to execute and a segfault is raised.
Speeding up Linux disk encryption
At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is: A significant amount of tail latency is due to queueing effects
Another look at two Linux KASLR patches
In the end, this random number generator was quickly removed, and that was that. But one can still wonder—is this generator secure but unanalyzed, or would it have been broken just to prove a point?
A Compendium of Container Escapes
The goal of this talk is to broaden the awareness of the how and why container escapes work, starting from a brief intro to what makes a process a container, and then spanning the gamut of escape techniques, covering exposed orchestrators, access to the Docker socket, exposed mount points, /proc, all the way down to overwriting/exploiting the kernel structures to leave the confines of the container.
The FreeBSD-linuxulator explained (for users)
First, the linuxulator is not an emulation. It is “just” a binary interface which is a little bit different from the FreeBSD-“native”-one. This means that the binary files in FreeBSD and Linux are both files which comply to the ELF specification.
Patching nVidia GPU driver for hot-unplug on Linux
Anyway, I was kind of annoyed of rebooting every time it happens, so I decided to reboot a few more dozen times instead while patching the driver. This has indeed worked, and left me with something similar to a functional hot-unplug, mildly crippled by the fact that nvidia-modeset is a completely opaque blob that keeps some internal state and tries to act on it, getting stuck when it tries to do something to the now-missing eGPU.
Avoiding gaps in IOMMU protection at boot
But setting things up in the OS isn’t sufficient. If an attacker is able to trigger arbitrary DMA before the OS has started then they can tamper with the system firmware or your bootloader and modify the kernel before it even starts running. So ideally you want your firmware to set up the IOMMU before it even enables any external devices, and newer firmware should actually do this automatically. It sounds like the problem is solved.
The Linux CSPRNG Is Now Good!
Oceans of ink and hours on stage have been spent to convince the world that the best random number generator is /dev/urandom, the kernel one. And it is, and it’s always been. However, an uncomfortable truth was that the Linux CSPRNG really could have been better than it was. Userspace CSPRNGs couldn’t be better than the kernel one, so our advice was still valid, but that space for improvement always frustrated me.
Good news everyone! In recent years, the Linux CSPRNG got a number of great incremental improvements, and I can now say in good conscience that it’s not only the best, it’s also good.
Purgeable Memory Allocations for Linux
It allocates anonymous pages using mmap(2). When the allocation is “unlocked” — i.e. the process isn’t actively using it — its pages are marked with MADV_FREE so that the kernel can reclaim them at any time. To lock the allocation so that the process can safely make use of them, the MADV_FREE is canceled. This is all a little trickier than it sounds, and that’s the subject of this article.
On Linux's Random Number Generation
I have been asked about the usefulness of security monitoring of entropy levels in the Linux kernel. This calls for some explanation of how random generation works in Linux systems.
So, randomness and the Linux kernel. This is an area where there is longstanding confusion, notably among some Linux kernel developers, including Linus Torvalds himself.
Sculpt OS on HP EliteBook 840 G5
Unfortunately, the first boot of a recent Sculpt OS USB flash drive just hanged after GRUB showing the GENODE boot logo. So, it was time to get my hands dirty and debug the boot process. From a debuggers point of view, the used i5-8350U CPU luckily comes with Intel vPRO support, which means enabling AMT Serial-Over-LAN is just a matter of some configuration tweaks. Additionally, I adapted the Sculpt configuration to use the core LOG service, which reflects all messages on the first UART - in our case (and thanks to bender) AMT SOL.