Security Analysis Of AMD Predictive Store Forwarding
AMD “Zen3” processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting dependencies between loads and stores. Like technologies such as branch prediction, with PSF the processor “guesses” what the result of a load is likely to be, and speculatively executes subsequent instructions. In the event that the processor incorrectly speculated on the result of the load, it is designed to detect this and flush the incorrect results from the CPU pipeline.
Security research in recent years has examined the security implications of incorrect CPU speculation and how in some cases it may lead to side channel attacks. For instance, conditional branch speculation, indirect branch speculation, and store bypass speculation have been demonstrated to have the potential to be used in side-channel attacks (e.g., Spectre v1, v2, and v4 respectively).
Speculating The Entire X86-64 Instruction Set In Seconds With This One Weird Trick
As cheesy as the title sounds, I promise it cannot beat the cheesiness of the technique I’ll be telling you about in this post. The morning I saw Mark Ermolov’s tweet about the undocumented instruction reading from/writing to the CRBUS, I had a bit of free time in my hands and I knew I had to find out the opcode so I started theory-crafting right away. After a few hours of staring at numbers, I ended up coming up with a method of discovering practically every instruction in the processor using a side(?)-channel. It’s an interesting method involving even more interesting components of the processor so I figured I might as well write about it, so here it goes.
Counting cycles and instructions on the Apple M1 processor
Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.
To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.
Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical
We introduce the first microarchitectural side channel attacks that leverage contention on the CPU ring interconnect. There are two challenges that make it uniquely difficult to exploit this channel. First, little is known about the ring interconnect’s functioning and architecture. Second, information that can be learned by an attacker through ring contention is noisy by nature and has coarse spatial granularity. To address the first challenge, we perform a thorough reverse engineering of the sophisticated protocols that handle communication on the ring interconnect. With this knowledge, we build a cross-core covert channel over the ring interconnect with a capacity of over 4 Mbps from a single thread, the largest to date for a cross-core channel not relying on shared memory. To address the second challenge, we leverage the fine-grained temporal patterns of ring contention to infer a victim program’s secrets. We demonstrate our attack by extracting key bits from vulnerable EdDSA and RSA implementations, as well as inferring the precise timing of keystrokes typed by a victim user.
In-depth dive into the security features of the Intel/Windows platform secure boot process
This blog post is an in-depth dive into the security features of the Intel/Windows platform boot process. In this post I’ll explain the startup process through security focused lenses, next post we’ll dive into several known attacks and how there were handled by Intel and Microsoft. My wish is to explain to technology professionals not deep into platform security why Microsoft’s SecureCore is so important and necessary.
Not exclusive to Windows systems, lots of PC platform details.
PLATYPUS With Great Power comes Great Leakage
With classical power side-channel attacks, an adversary typically attaches an oscilloscope to monitor the energy consumption of a device. Since Intel Sandy Bridge CPUs, the Intel Running Average Power Limit (RAPL) interface allows monitoring and controlling the power consumption of the CPU and DRAM in software. Hence, the CPU basically comes with its own power meter. With the current implementation of the Linux driver, every unprivileged user has access to its measurements.
Using PLATYPUS, we demonstrate that we can observe variations in the power consumption to distinguish different instructions and different Hamming weights of operands and memory loads, allowing inference of loaded values. PLATYPUS can further infer intra-cacheline control flow of applications, break KASLR, leak AES-NI keys from Intel SGX enclaves and the Linux kernel, and establish a timing-independent covert channel.
With SGX, Intel released a security feature to create isolated environments, so-called enclaves, that are secure even if the operating system is compromised. In our work, we combine PLATYPUS with precise execution control of SGX-Step. As a result, we overcome the hurdle of the limited measuring capabilities of Intel RAPL by repeatedly executing single instructions inside the SGX enclave. Using this technique, we recover RSA keys processed by mbed TLS from an SGX enclave.
Exploring mullender.c - A deep dive into the first IOCCC winner
I will discuss the code, how I got such old and obscure code to run, as well as include snippets from my conversations with one of the original authors (who was very helpful in figuring some of this out). If all that wasn’t enough I managed to obtain the original PDP and VAX source code, it will be hosted here with permission. I want to give a huge thank you to Sjoerd Mullender and Don Libes for their assistance and permission in reproducing some of this material.
Proposal: Register-based Go calling convention
We propose switching the Go ABI from its current stack-based calling convention to a register-based calling convention. Preliminary experiments indicate this will achieve at least a 5–10% throughput improvement across a range of applications. This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism.
This also presents a very nice overview of existing calling conventions.
Inside the 8086 processor, tiny charge pumps create a negative voltage
You might wonder how a charge pump can turn a positive voltage into a negative voltage. The trick is a “flying” capacitor, as shown below. On the left, the capacitor is charged to 5 volts. Now, disconnect the capacitor and connect the positive side to ground. The capacitor still has its 5-volt charge, so now the low side must be at -5 volts. By rapidly switching the capacitor between the two states, the charge pump produces a negative voltage.
PopCount on ARM64 in Go Assembler
Apropos of Apple’s ARM announcment, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
Die shrink: How Intel scaled down the 8086 processor
The revolutionary Intel 8086 microprocessor was introduced 42 years ago this month so I’ve been studying its die. I came across two 8086 dies with different sizes, which reveal details of how a die shrink works. The concept of a die shrink is that as technology improved, a manufacturer could shrink the silicon die, reducing costs and improving performance. But there’s more to it than simply scaling down the whole die. Although the internal circuitry can be directly scaled down, external-facing features can’t shrink as easily. For instance, the bonding pads need a minimum size so wires can be attached, and the power-distribution traces must be large enough for the current. The result is that Intel scaled the interior of the 8086 without change, but the circuitry and pads around the edge of the chip were redesigned.
Memory Ordering in Modern Microprocessors
A Technical Look at Intel’s Control-flow Enforcement Technology
Since ROP relies on RET instructions, where the address of the next instruction to execute is fetched from a stack, stack corruption plays a critical role in ROP attacks. CET enables the OS to create a Shadow Stack, which is designed to be protected from application code memory accesses, and stores CPU-stored copies of the return addresses. This helps ensure that even when an attacker is able to modify/corrupt the return addresses in the data stack for the purpose of carrying out a ROP attack, the attacker is not able to modify the Shadow Stack, and the CET state machine in the CPU detects mismatches between the address on the shadow and data stack to help prevent the attack via an exception reported to the OS.
Ice Lake Store Elimination
We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
This Goes to Eleven - Decimating Array.Sort with AVX2
Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing can be further done about it in 2020, except for waiting for newer, marginally faster machines to pop-up. However, that is not the case, and while I’m not the first to have thoughts about it; or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++2 code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop. Sounds interesting? If so, down the rabbit hole we go…
Very well done.
AWS re:Invent 2019: Speculation & leakage: Timing side channels & multi-tenant computing
In January 2018, the world learned about Spectre and Meltdown, a new class of issues that affects virtually all modern CPUs via nearly imperceptible changes to their micro-architectural states and can result in full access to physical RAM or leaking of state between threads, processes, or guests. In this session, we examine one of these side-channel attacks in detail and explore the implications for multi-tenant computing. We discuss AWS design decisions and what AWS does to protect your instances, containers, and function invocations. Finally, we discuss what the future looks like in the presence of this new class of issue.
This is a good recap. Specific defenses starts around 42:00.
LVI - Hijacking Transient Execution with Load Value Injection
LVI is a new class of transient-execution attacks exploiting microarchitectural flaws in modern processors to inject attacker data into a victim program and steal sensitive data and keys from Intel SGX, a secure vault in Intel processors for your personal data.
LVI turns previous data extraction attacks around, like Meltdown, Foreshadow, ZombieLoad, RIDL and Fallout, and defeats all existing mitigations. Instead of directly leaking data from the victim to the attacker, we proceed in the opposite direction: we smuggle — “inject” — the attacker’s data through hidden processor buffers into a victim program and hijack transient execution to acquire sensitive information, such as the victim’s fingerprints or passwords.
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors
In this paper, we are the first to exploit the cache way predictor. We reverse-engineered AMD’s L1D cache way predictor in microarchitectures from 2011 to 2019, resulting in two new attack techniques. With Collide+Probe, an attacker can monitor a victim’s memory accesses without knowledge of physical addresses or shared memory when time-sharing a logical core. With Load+ Reload, we exploit the way predictor to obtain highly-accurate memory-access traces of victims on the same physical core. While Load+Reload relies on shared memory, it does not invalidate the cache line, allowing stealthier attacks that do not induce any last-level-cache evictions.
KASLR: Break It, Fix It, Repeat
Escaping the Chrome Sandbox with RIDL
Vulnerabilities that leak cross process memory can be exploited to escape the Chrome sandbox. An attacker is still required to compromise the renderer prior to mounting this attack. To protect against attacks on affected CPUs make sure your microcode is up to date and disable hyper-threading (HT).
This is a pretty clear write-up and comes with a nice footnote:
When I started working on this I was surprised that it’s still exploitable even though the vulnerabilities have been public for a while. If you read guidance on the topic, they will usually talk about how these vulnerabilities have been mitigated if your OS is up to date with a note that you should disable hyper threading to protect yourself fully. The focus on mitigations certainly gave me a false sense that the vulnerabilities have been addressed and I think these articles could be more clear on the impact of leaving hyper threading enabled.