CPU Adventure – Unknown CPU Reversing
> We reverse-engineered a program written for a completely custom, unknown CPU architecture, without any documentation for the CPU (no emulator, no ISA reference, nothing) in the span of ten hours.
Position Independent Code (PIC) in shared libraries
> This article explained what position independent code is, and how it helps create shared libraries with shareable read-only text sections. There are some tradeoffs when choosing between PIC and its alternative (load-time relocation), and the eventual outcome really depends on a lot of factors, like the CPU architecture on which the program is going to run.
Go compiler intrinsics
> Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions1.
> An intrinsic function is Go code written in regular Go. These functions are known the the Go compiler which contains replacements which it can substitute during compilation.
Where do interrupts happen?
> For a simple 1-wide in-order, non-pipelined CPU the answer might be as simple as: the CPU is interrupted either before or after instruction that is currently running2. For anything more complicated it’s not going to be easy. On a modern out-of-order processor there may be hundreds of instructions in-flight at any time, some waiting to execute, a dozen or more currently executing, and others waiting to retire. From all these choices, which instruction will be chosen as the victim?
APRR - Access Protection ReRouting
> Almost a year ago I did a write-up on KTRR, first introduced in Apple’s A10 chip series. Now over the course of the last year, there has been a good bit of talk as well as confusion about the new mitigations shipped with Apple’s A12. One big change, PAC, has already been torn down in detail by Brandon Azad, so I’m gonna leave that out here. What’s left to cover is more than just APRR, but APRR is certainly the biggest chunk, hence the title of this post.
> APRR is a pretty cool feature, even if parts of it are kinda broke. What I really like about it (besides the fact that it is an efficient and elegant solution to switching privileges) is that it untangles EL1 and EL0 memory permissions, giving you more flexibility than a standard ARMv8 implementation. What I don’t like though is that it has clearly been designed as a lockdown feature, allowing you only to take permissions away rather than freely remap them.
> It’s also evident that Apple is really fond of post-exploit mitigations, or just mitigations in general. And on one hand, getting control over the physical address space is a good bit harder now. But on the other hand, Apple’s stacking of mitigations is taking a problematic turn when adding new mitigations actively creates vulnerabilities now.
Adopting the Arm Memory Tagging Extension in Android
> As part of our continuous commitment to improve the security of the Android ecosystem, we are partnering with Arm to design the memory tagging extension (MTE). Memory safety bugs, common in C and C++, remain one of the largest vulnerabilities in the Android platform and although there have been previous hardening efforts, memory safety bugs comprised more than half of the high priority security bugs in Android 9.
> We believe that memory tagging will detect the most common classes of memory safety bugs in the wild, helping vendors identify and fix them, discouraging malicious actors from exploiting them. During the past year, our team has been working to ensure readiness of the Android platform and application software for MTE. We have deployed HWASAN, a software implementation of the memory tagging concept, to test our entire platform and a few select apps. This deployment has uncovered close to 100 memory safety bugs. The majority of these bugs were detected on HWASAN enabled phones in everyday use. MTE will greatly improve upon this in terms of overhead, ease of deployment, and scale. In parallel, we have been working on supporting MTE in the LLVM compiler toolchain and in the Linux kernel. The Android platform support for MTE will be complete by the time of silicon availability.
7 Days To Virtualization: A Series On Hypervisor Development
Provide protection against starvation of the ll/sc loops when accessing userpace.
> Casueword(9) on ll/sc architectures must be prepared for userspace constantly modifying the same cache line as containing the CAS word, and not loop infinitely. Otherwise, rogue userspace livelock kernel.
Defending against transient execution attacks
> It is important to build up a systematic understanding of these attacks and possible defenses
In-DRAM Bulk Bitwise Execution Engine
> Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.
AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
> We have been teased with AMD’s next generation processor products for over a year. The new chiplet design has been heralded as a significant breakthrough in driving performance and scalability, especially as it becomes increasingly difficult to create large silicon with high frequencies on smaller and smaller process nodes. AMD is expected to deploy its chiplet paradigm across its processor line, through Ryzen and EPYC, with those chiplets each having eight next-generation Zen 2 cores. Today AMD went into more detail about the Zen 2 core, providing justification for the +15% clock-for-clock performance increase over the previous generation that the company presented at Computex last week.
If each thread’s TEB is referenced by the fs selector, does that mean that the 80386 is limited to 1024 threads?
> No, it doesn’t, because nobody said that the distinct values had to be different simultaneously.
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
> One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance, efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users reasons to upgrade, and the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put it in for a run through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
Faster arithmetic by flipping signs
> TL;DR: Flipping the sign of some arithmetic operations can lead to shorter and faster assembly.
XSA-294 - x86 shadow: Insufficient TLB flushing when using PCID
> Use of Process Context Identifiers (PCID) was introduced into Xen in order to improve performance after XSA-254 (and in particular its Meltdown sub-issue). This enablement implied changes to the TLB flushing logic. One aspect which was overlooked is the safety of switching between shadow pagetables, which previously relied on the unconditional flushing of a write to CR3.
> With PCID enabled, a switch of shadow pagetable for a 64bit PV guest fails to invalidate the linear mappings of the previous shadow pagetable. As a result, subsequent accesses to the shadow pagetables may be deemed to be safe by the shadow logic (based on the old shadow pagetable) but fault when made in practice.
The Go low-level calling convention on x86-64
> This article analyzes how the Go compiler generates code for function calls, argument passing and exception handling on x86-64 targets.
Tons of information. Very thorough.
Splitting atoms in XNU
> A locking bug in the XNU virtual memory subsystem allowed violation of the preconditions required for the correctness of an optimized virtual memory operation. This was abused to create shared memory where it wasn’t expected, allowing the creation of a time-of-check-time-of-use bug where one wouldn’t usually exist. This was exploited to cause a heap overflow in XPC, which was used to trigger the execution of a jump-oriented payload which chained together arbitrary function calls in an unsandboxed root process, even in the presence of Apple’s implementation of ARM’s latest Pointer Authentication Codes (PAC) hardware mitigation. The payload opened a privileged socket and sent the file descriptor back to the sandboxed process, where it was used to trigger a kernel heap overflow only reachable from outside the sandbox.
This is excellent work.
What has your microcode done for you lately?
> Did you ever wonder what is inside those microcode updates that get silently applied to your CPU via Windows update, BIOS upgrades, and various microcode packages on Linux? Well, you are in the wrong place, because this blog post won’t answer that question (you might like this though).
> In fact, the overwhelming majority of this this post is about the performance of scattered writes, and not very much at all about the details of CPU microcode. Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.
Gallery of Processor Cache Effects
> In this blog post, I will use code samples to illustrate various aspects of how caches work, and what is the impact on the performance of real-world programs.