Synthetic Memory Protections - An update on ROP mitigations
https://www.openbsd.org/papers/csw2023.pdf [www.openbsd.org]
2023-03-25 19:35
tags:
cpu
defense
malloc
openbsd
pdf
security
slides
systems
ROP methods have become increasingly sophisticated
But we can identify system behaviours which only ROP code requires
We can contrast this to what Regular Control Flow code needs
And then, find behaviours to block
source: HN
The 8086 processor's microcode pipeline from die analysis
http://www.righto.com/2023/01/the-8086-processors-microcode-pipeline.html [www.righto.com]
2023-01-27 18:28
tags:
cpu
hardware
investigation
perf
series
Do Not Taunt Happy Fun Branch Predictor
https://www.mattkeeter.com/blog/2023-01-25-branch/ [www.mattkeeter.com]
2023-01-25 20:09
tags:
cpu
perf
programming
I recently came up with a “clever” idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don’t fall victim in the future.
An instruction oddity in the ppc64 (PowerPC 64-bit) architecture
https://utcc.utoronto.ca/~cks/space/blog/tech/PowerPCInstructionOddity [utcc.utoronto.ca]
2023-01-21 19:45
tags:
bugfix
compiler
cpu
programming
turtles
As Raymond Chen notes, ‘or rd, ra, ra’ has the effect of ‘move ra to rd’. Moving a register to itself is a NOP, but several Power versions (the Go code’s comment says Power8, 9, and 10) overload this particular version of a NOP (and some others) to signal that the priority of your hardware thread should be changed by the CPU; in the specific case of ‘or r1, r1, r1’ it drops you to low priority. That leaves us with the mystery of why such an instruction would be used by a compiler, instead of the official NOP (per Raymond Chen, this is ‘or r0, r0, 0’).
As covered in the specific ppc64 diff in the change that introduced this issue, Go wanted to artificially mark a particular runtime function this way (see CL 425396 and Go issue #54332 for more). To do this it needed to touch the stack pointer in a harmless way, which would trigger the toolchain’s weirdness detector. On ppc64, the stack pointer is in r1. So the obvious and natural thing to do is to move r1 to itself, which encodes as ‘or r1, r1, r1’, and which then triggers this special architectural behavior of lowering the priority of that hardware thread. Oops.
https://devblogs.microsoft.com/oldnewthing/20180809-00/?p=99455
https://github.com/golang/go/issues/54332
Hertzbleed Attack
https://www.hertzbleed.com/ [www.hertzbleed.com]
2022-06-16 18:36
tags:
cpu
crypto
exploit
paper
security
sidechannel
Hertzbleed is a new family of side-channel attacks: frequency side channels. In the worst case, these attacks can allow an attacker to extract cryptographic keys from remote servers that were previously believed to be secure.
Hertzbleed takes advantage of our experiments showing that, under certain circumstances, the dynamic frequency scaling of modern x86 processors depends on the data being processed. This means that, on modern processors, the same program can run at a different CPU frequency (and therefore take a different wall time) when computing, for example, 2022 + 23823 compared to 2022 + 24436.
source: HN
Faster CRC32 on the Apple M1
https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-the-apple-m1/ [dougallj.wordpress.com]
2022-05-22 19:25
tags:
cpu
hash
perf
programming
CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
source: HN
What's new in CPUs since the 80s?
https://danluu.com/new-cpu-features/ [danluu.com]
2022-04-19 17:10
tags:
article
concurrency
cpu
perf
programming
systems
Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.
x86 chips have picked up a lot of new features and whiz-bang gadgets.
Overall, a pretty good introduction to modern CPUs, performance, and concurrency.
Introduction to Apple Silicon
https://github.com/AsahiLinux/docs/wiki/Introduction-to-Apple-Silicon [github.com]
2022-03-17 03:23
tags:
bios
cpu
development
mac
systems
This document attempts to explain the Apple Silicon (i.e. M1 and later) Mac boot ecosystem (henceforth “AS Macs“), as it pertains for how open OSes interoperate with the platform.
It is intended for developers and maintainers of Linux, BSD and other OS distributions and boot-related components, as well as users interested in the platform, and its goal is to cover the overall picture without delving into excessive technical detail. Specifics should be left to other wiki pages. It also omits details that only pertain to macOS (such as how kernel extensions work and are loaded).
source: HN
Emulating AMD Approximate Arithmetic Instructions On Intel
https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-etc-on-intel.html [robert.ocallahan.org]
2021-09-13 04:29
tags:
cpu
debugging
math
programming
virtualization
Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.
source: HN
Security Analysis Of AMD Predictive Store Forwarding
https://www.amd.com/system/files/documents/security-analysis-predictive-store-forwarding.pdf [www.amd.com]
2021-04-03 02:38
tags:
cpu
pdf
perf
programming
security
sidechannel
AMD “Zen3” processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting dependencies between loads and stores. Like technologies such as branch prediction, with PSF the processor “guesses” what the result of a load is likely to be, and speculatively executes subsequent instructions. In the event that the processor incorrectly speculated on the result of the load, it is designed to detect this and flush the incorrect results from the CPU pipeline.
Security research in recent years has examined the security implications of incorrect CPU speculation and how in some cases it may lead to side channel attacks. For instance, conditional branch speculation, indirect branch speculation, and store bypass speculation have been demonstrated to have the potential to be used in side-channel attacks (e.g., Spectre v1, v2, and v4 respectively).
Nice disclosure.
source: R
Speculating The Entire X86-64 Instruction Set In Seconds With This One Weird Trick
https://blog.can.ac/2021/03/22/speculating-x86-64-isa-with-one-weird-trick/ [blog.can.ac]
2021-03-25 02:23
tags:
cpu
investigation
programming
sidechannel
As cheesy as the title sounds, I promise it cannot beat the cheesiness of the technique I’ll be telling you about in this post. The morning I saw Mark Ermolov’s tweet about the undocumented instruction reading from/writing to the CRBUS, I had a bit of free time in my hands and I knew I had to find out the opcode so I started theory-crafting right away. After a few hours of staring at numbers, I ended up coming up with a method of discovering practically every instruction in the processor using a side(?)-channel. It’s an interesting method involving even more interesting components of the processor so I figured I might as well write about it, so here it goes.
source: L
Counting cycles and instructions on the Apple M1 processor
https://lemire.me/blog/2021/03/24/counting-cycles-and-instructions-on-the-apple-m1-processor/ [lemire.me]
2021-03-25 02:12
tags:
cpu
mac
perf
programming
Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.
To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.
source: HN
Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical
https://arxiv.org/abs/2103.03443 [arxiv.org]
2021-03-08 18:36
tags:
cpu
exploit
paper
security
sidechannel
We introduce the first microarchitectural side channel attacks that leverage contention on the CPU ring interconnect. There are two challenges that make it uniquely difficult to exploit this channel. First, little is known about the ring interconnect’s functioning and architecture. Second, information that can be learned by an attacker through ring contention is noisy by nature and has coarse spatial granularity. To address the first challenge, we perform a thorough reverse engineering of the sophisticated protocols that handle communication on the ring interconnect. With this knowledge, we build a cross-core covert channel over the ring interconnect with a capacity of over 4 Mbps from a single thread, the largest to date for a cross-core channel not relying on shared memory. To address the second challenge, we leverage the fine-grained temporal patterns of ring contention to infer a victim program’s secrets. We demonstrate our attack by extracting key bits from vulnerable EdDSA and RSA implementations, as well as inferring the precise timing of keystrokes typed by a victim user.
source: HN
In-depth dive into the security features of the Intel/Windows platform secure boot process
https://igor-blue.github.io/2021/02/04/secure-boot.html [igor-blue.github.io]
2021-02-15 18:19
tags:
bios
cpu
hardware
security
systems
windows
This blog post is an in-depth dive into the security features of the Intel/Windows platform boot process. In this post I’ll explain the startup process through security focused lenses, next post we’ll dive into several known attacks and how there were handled by Intel and Microsoft. My wish is to explain to technology professionals not deep into platform security why Microsoft’s SecureCore is so important and necessary.
Not exclusive to Windows systems, lots of PC platform details.
source: grugq
PLATYPUS With Great Power comes Great Leakage
https://platypusattack.com/ [platypusattack.com]
2020-12-11 06:55
tags:
cpu
energy
exploit
paper
security
sidechannel
With classical power side-channel attacks, an adversary typically attaches an oscilloscope to monitor the energy consumption of a device. Since Intel Sandy Bridge CPUs, the Intel Running Average Power Limit (RAPL) interface allows monitoring and controlling the power consumption of the CPU and DRAM in software. Hence, the CPU basically comes with its own power meter. With the current implementation of the Linux driver, every unprivileged user has access to its measurements.
Using PLATYPUS, we demonstrate that we can observe variations in the power consumption to distinguish different instructions and different Hamming weights of operands and memory loads, allowing inference of loaded values. PLATYPUS can further infer intra-cacheline control flow of applications, break KASLR, leak AES-NI keys from Intel SGX enclaves and the Linux kernel, and establish a timing-independent covert channel.
With SGX, Intel released a security feature to create isolated environments, so-called enclaves, that are secure even if the operating system is compromised. In our work, we combine PLATYPUS with precise execution control of SGX-Step. As a result, we overcome the hurdle of the limited measuring capabilities of Intel RAPL by repeatedly executing single instructions inside the SGX enclave. Using this technique, we recover RSA keys processed by mbed TLS from an SGX enclave.
source: trivium
Exploring mullender.c - A deep dive into the first IOCCC winner
https://lainsystems.com/posts/exploring-mullender-dot-c/ [lainsystems.com]
2020-08-30 18:03
tags:
c
cpu
programming
retro
I will discuss the code, how I got such old and obscure code to run, as well as include snippets from my conversations with one of the original authors (who was very helpful in figuring some of this out). If all that wasn’t enough I managed to obtain the original PDP and VAX source code, it will be hosted here with permission. I want to give a huge thank you to Sjoerd Mullender and Don Libes for their assistance and permission in reproducing some of this material.
source: HN
Proposal: Register-based Go calling convention
https://go.googlesource.com/proposal/+/refs/changes/78/248178/1/design/40724-register-calling.md [go.googlesource.com]
2020-08-17 04:29
tags:
compiler
cpu
go
programming
We propose switching the Go ABI from its current stack-based calling convention to a register-based calling convention. Preliminary experiments indicate this will achieve at least a 5–10% throughput improvement across a range of applications. This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism.
This also presents a very nice overview of existing calling conventions.
source: L
Inside the 8086 processor, tiny charge pumps create a negative voltage
http://www.righto.com/2020/07/inside-8086-processor-tiny-charge-pumps.html [www.righto.com]
2020-07-26 21:25
tags:
cpu
hardware
photos
retro
You might wonder how a charge pump can turn a positive voltage into a negative voltage. The trick is a “flying” capacitor, as shown below. On the left, the capacitor is charged to 5 volts. Now, disconnect the capacitor and connect the positive side to ground. The capacitor still has its 5-volt charge, so now the low side must be at -5 volts. By rapidly switching the capacitor between the two states, the charge pump produces a negative voltage.
PopCount on ARM64 in Go Assembler
https://barakmich.dev/posts/popcnt-arm64-go-asm/ [barakmich.dev]
2020-07-13 01:22
tags:
cpu
go
perf
programming
Apropos of Apple’s ARM announcment, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
source: L
Die shrink: How Intel scaled down the 8086 processor
http://www.righto.com/2020/06/die-shrink-how-intel-scaled-down-8086.html [www.righto.com]
2020-07-01 02:23
tags:
cpu
hardware
photos
retro
The revolutionary Intel 8086 microprocessor was introduced 42 years ago this month so I’ve been studying its die. I came across two 8086 dies with different sizes, which reveal details of how a die shrink works. The concept of a die shrink is that as technology improved, a manufacturer could shrink the silicon die, reducing costs and improving performance. But there’s more to it than simply scaling down the whole die. Although the internal circuitry can be directly scaled down, external-facing features can’t shrink as easily. For instance, the bonding pads need a minimum size so wires can be attached, and the power-distribution traces must be large enough for the current. The result is that Intel scaled the interior of the 8086 without change, but the circuitry and pads around the edge of the chip were redesigned.