Don't clobber the frame pointer
https://nsrip.com/posts/clobberfp.html [nsrip.com]
2025-01-05 09:34
tags:
bugfix
compiler
cpu
go
programming
Recently I diagnosed and fixed two frame pointer unwinding crashes in Go. The root causes were two flavors of the same problem: buggy assembly code clobbered a frame pointer. By “clobbered” I mean wrote over the value without saving & restoring it. One bug clobbered the frame pointer register. The other bug clobbered a frame pointer saved on the stack. This post explains the bugs, talks a bit about ABIs and calling conventions, and makes some recommendations for how to avoid the bugs.
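To make the failure mode concrete, here is a toy model of frame pointer unwinding. It is a sketch, not the Go runtime's code. It assumes the common x86-64 layout, where the word at the frame pointer holds the caller's frame pointer and the next word holds the return address. The `stack` dictionary, the addresses, and the `unwind` function are all hypothetical.

```python
def unwind(stack, fp, max_frames=16):
    """Walk the chain of saved frame pointers, collecting return addresses.

    stack: dict mapping an address to the 8-byte word stored there (toy model).
    fp:    current frame pointer; stack[fp] holds the caller's frame pointer
           and stack[fp + 8] holds the return address (assumed layout).
    """
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_frames:
        if fp not in stack or fp + 8 not in stack:
            # A real unwinder would fault or read garbage here.
            raise ValueError(f"bad frame pointer {fp:#x}")
        return_addresses.append(stack[fp + 8])
        fp = stack[fp]  # follow the link to the caller's frame
    return return_addresses


# Three nested frames: leaf -> middle -> main. A saved frame pointer of 0 ends the chain.
stack = {
    0x1000: 0x1040, 0x1008: 0xAAAA,  # leaf frame: saved FP, return address
    0x1040: 0x1080, 0x1048: 0xBBBB,  # middle frame
    0x1080: 0x0,    0x1088: 0xCCCC,  # main frame
}
good = unwind(stack, 0x1000)

# Simulate buggy assembly overwriting the frame pointer saved in the leaf frame.
clobbered = dict(stack)
clobbered[0x1000] = 0xDEADBEEF
try:
    unwind(clobbered, 0x1000)
    crashed = False
except ValueError:
    crashed = True
```

The intact stack yields all three return addresses. With one saved frame pointer overwritten, the walk follows a bogus link after the first frame and fails, which is roughly the shape of both bugs described in the post.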
source: L
It’s the Most Indispensable Machine in the World
https://www.wsj.com/tech/ai/asml-euv-machine-lithography-chips-967954d0 [www.wsj.com]
2025-01-04 07:12
tags:
article
business
cpu
tech
The piece of equipment that the entire world has come to rely on—and she is specially trained to handle—is called an extreme ultraviolet lithography machine. It’s the machine that produces the most advanced microchips on the planet. It was built with scientific technologies that sound more like science fiction—breakthroughs so improbable that they were once dismissed as impossible. And it has transformed wafers of silicon into the engines of modern life.
She’s one of the engineers assigned to the fabrication plants—or fabs—where ASML customers manufacture their semiconductors. Hall is based here in Boise, the headquarters of Micron Technology, where I hopped into a bunny suit of my own and followed her inside the chip fab. Then I got a rare, behind-the-scenes peek at what might just be the most important machine ever made.
source: DF
The Alder Lake SHLX anomaly
https://tavianator.com/2025/shlx.html [tavianator.com]
2025-01-03 09:54
tags:
benchmark
cpu
perf
programming
It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate). On the other hand, 32-bit instructions, and 64-bit instructions without immediates (even no-op ones), make it fast. All of these ways to initialize RCX lead to 1-cycle latency:
source: L
Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques
https://pwning.tech/nftables/ [pwning.tech]
2024-03-26 23:33
tags:
best
cpu
exploit
linux
malloc
paper
programming
security
systems
In this blogpost I present several novel techniques I used to exploit a 0-day double-free bug in hardened Linux kernels (i.e. KernelCTF mitigation instances) with a 93%-99% success rate. The underlying bug is an input sanitization failure of netfilter verdicts. Hence, the requirements for the exploit are that nf_tables is enabled and unprivileged user namespaces are enabled. The exploit is data-only and performs a kernel-space mirroring attack (KSMA) from userland with the novel Dirty Pagedirectory technique (pagetable confusion), where it is able to link any physical address (and its permissions) to virtual memory addresses by performing just read/writes to userland addresses.
Also: https://github.com/Notselwyn/CVE-2024-1086
source: HN
Reverse engineering standard cell logic in the Intel 386 processor
http://www.righto.com/2024/01/intel-386-standard-cells.html [www.righto.com]
2024-03-13 07:33
tags:
article
compsci
cpu
hardware
photos
tech
The 386 processor (1985) was Intel’s most complex processor at the time, with 285,000 transistors. Intel had scheduled 50 person-years to design the processor, but it was falling behind schedule. The design team decided to automate chunks of the layout, developing “automatic place and route” software. This was a risky decision since if the software couldn’t create a dense enough layout, the chip couldn’t be manufactured. But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.
In this article, I take a close look at the “standard cells” used in the 386, the logic blocks that were arranged and wired by software. Reverse-engineering these circuits shows how standard cells implement logic gates, latches, and other components with CMOS transistors. Modern integrated circuits still use standard cells, much smaller now, of course, but built from the same principles.
An improved chkstk function on Windows
https://nullprogram.com/blog/2024/02/05/ [nullprogram.com]
2024-02-06 23:47
tags:
compiler
cpu
programming
windows
If you’ve spent much time developing with Mingw-w64 you’ve likely seen the symbol ___chkstk_ms, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).
source: L
Operation Triangulation: What You Get When Attack iPhones of Researchers
https://securelist.com/operation-triangulation-the-last-hardware-mystery/111669/ [securelist.com]
2023-12-27 19:52
tags:
best
cpu
exploit
investigation
iphone
security
This presentation was also the first time we had publicly disclosed the details of all exploits and vulnerabilities that were used in the attack. We discover and analyze new exploits and attacks using these on a daily basis, and we have discovered and reported more than thirty in-the-wild zero-days in Adobe, Apple, Google, and Microsoft products, but this is definitely the most sophisticated attack chain we have ever seen.
source: HN
Zenbleed
https://lock.cmpxchg8b.com/zenbleed.html [lock.cmpxchg8b.com]
2023-07-25 01:47
tags:
cpu
exploit
programming
security
sidechannel
systems
What should happen if the processor speculatively executed a vzeroupper, but then discovers that there was a branch misprediction? Well, we will have to revert that operation and put things back the way they were… maybe we can just unset that z-bit?
If we return to the analogy of malloc and free, you can see that it can’t be that simple - that would be like calling free() on a pointer, and then changing your mind!
That would be a use-after-free vulnerability, but there is no such thing as a use-after-free in a CPU… or is there?
source: L
The complex history of the Intel i960 RISC processor
http://www.righto.com/2023/07/the-complex-history-of-intel-i960-risc.html [www.righto.com]
2023-07-02 01:13
tags:
cpu
hardware
retro
The Intel i960 was a remarkable 32-bit processor of the 1990s with a confusing set of versions. Although it is now mostly forgotten (outside the many people who used it as an embedded processor), it has a complex history. It had a shot at being Intel’s flagship processor until x86 overshadowed it. Later, it was the world’s best-selling RISC processor. One variant was a 33-bit processor with a decidedly non-RISC object-oriented instruction set; it became a military standard and was used in the F-22 fighter plane. Another version powered Intel’s short-lived Unix servers. In this blog post, I’ll take a look at the history of the i960, explain its different variants, and examine silicon dies. This chip has a lot of mythology and confusion (especially on Wikipedia), so I’ll try to clear things up.
source: HN
Understanding DeepMind's Sorting Algorithm
https://justine.lol/sorting/ [justine.lol]
2023-06-12 21:55
tags:
compsci
cpu
performance
sorting
A few days ago, DeepMind published a blog post talking about a paper they wrote, where they discovered tinier kernels for sorting algorithms. They did this by taking their deep learning wisdom, which they gained by building AlphaGo, and applying it to the discipline of superoptimization. That piqued my interest, since as a C library author, I’m always looking for opportunities to curate the best stuff. In some ways that’s really the whole purpose of the C library. There are so many functions that we as programmers take for granted, which are the finished product of decades of research, distilled into plain and portable code.
DeepMind earned a fair amount of well-deserved attention for this discovery, but unfortunately they could have done a much better job explaining it.
https://www.deepmind.com/blog/alphadev-discovers-faster-sorting-algorithms
source: HN
Epyc 7002 CPUs may hang after 1042 days of uptime
https://old.reddit.com/r/sysadmin/comments/13wmowy/psa_epyc_7002_cpus_may_hang_after_1042_days_of/ [old.reddit.com]
2023-06-01 18:27
tags:
admin
cpu
hardware
Note that your server will almost definitely hang, requiring a physical (or IPMI) reboot, because no interrupts, including NMIs, can be delivered to the zombie cores: this means no scheduler, no IPIs, nothing will work.
source: HN
Synthetic Memory Protections - An update on ROP mitigations
https://www.openbsd.org/papers/csw2023.pdf [www.openbsd.org]
2023-03-25 19:35
tags:
cpu
defense
malloc
openbsd
pdf
security
slides
systems
ROP methods have become increasingly sophisticated
But we can identify system behaviours which only ROP code requires
We can contrast this to what Regular Control Flow code needs
And then, find behaviours to block
source: HN
The 8086 processor's microcode pipeline from die analysis
http://www.righto.com/2023/01/the-8086-processors-microcode-pipeline.html [www.righto.com]
2023-01-27 18:28
tags:
cpu
hardware
investigation
perf
series
Do Not Taunt Happy Fun Branch Predictor
https://www.mattkeeter.com/blog/2023-01-25-branch/ [www.mattkeeter.com]
2023-01-25 20:09
tags:
cpu
perf
programming
I recently came up with a “clever” idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don’t fall victim in the future.
An instruction oddity in the ppc64 (PowerPC 64-bit) architecture
https://utcc.utoronto.ca/~cks/space/blog/tech/PowerPCInstructionOddity [utcc.utoronto.ca]
2023-01-21 19:45
tags:
bugfix
compiler
cpu
programming
turtles
As Raymond Chen notes, ‘or rd, ra, ra’ has the effect of ‘move ra to rd’. Moving a register to itself is a NOP, but several Power versions (the Go code’s comment says Power8, 9, and 10) overload this particular version of a NOP (and some others) to signal that the priority of your hardware thread should be changed by the CPU; in the specific case of ‘or r1, r1, r1’ it drops you to low priority. That leaves us with the mystery of why such an instruction would be used by a compiler, instead of the official NOP (per Raymond Chen, this is ‘ori r0, r0, 0’).
As covered in the specific ppc64 diff in the change that introduced this issue, Go wanted to artificially mark a particular runtime function this way (see CL 425396 and Go issue #54332 for more). To do this it needed to touch the stack pointer in a harmless way, which would trigger the toolchain’s weirdness detector. On ppc64, the stack pointer is in r1. So the obvious and natural thing to do is to move r1 to itself, which encodes as ‘or r1, r1, r1’, and which then triggers this special architectural behavior of lowering the priority of that hardware thread. Oops.
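The two instructions are different machine words, which is why the CPU can give one of them a special meaning. The sketch below packs the fields by hand. It assumes the standard Power ISA layouts: X-form `or` with primary opcode 31 and extended opcode 444, and D-form `ori` with primary opcode 24. The helper names are hypothetical.

```python
def encode_or(rs, ra, rb):
    """Encode X-form `or RA, RS, RB`: primary opcode 31, extended opcode 444, Rc=0."""
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1)


def encode_ori(rs, ra, imm):
    """Encode D-form `ori RA, RS, UI`: primary opcode 24, 16-bit unsigned immediate."""
    return (24 << 26) | (rs << 21) | (ra << 16) | (imm & 0xFFFF)


# Moving the stack pointer (r1) to itself: the form that lowers thread priority.
move_r1_to_itself = encode_or(1, 1, 1)

# The conventional no-op, written `ori r0, r0, 0` in assembler syntax.
official_nop = encode_ori(0, 0, 0)

print(f"or  r1, r1, r1 -> {move_r1_to_itself:#010x}")
print(f"ori r0, r0, 0  -> {official_nop:#010x}")
```

Under these assumptions the register-to-itself move encodes as 0x7c210b78 and the conventional no-op as 0x60000000, so a toolchain that wants a harmless instruction has to pick its encoding deliberately.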
https://devblogs.microsoft.com/oldnewthing/20180809-00/?p=99455
https://github.com/golang/go/issues/54332
Hertzbleed Attack
https://www.hertzbleed.com/ [www.hertzbleed.com]
2022-06-16 18:36
tags:
cpu
crypto
exploit
paper
security
sidechannel
Hertzbleed is a new family of side-channel attacks: frequency side channels. In the worst case, these attacks can allow an attacker to extract cryptographic keys from remote servers that were previously believed to be secure.
Hertzbleed takes advantage of our experiments showing that, under certain circumstances, the dynamic frequency scaling of modern x86 processors depends on the data being processed. This means that, on modern processors, the same program can run at a different CPU frequency (and therefore take a different wall time) when computing, for example, 2022 + 23823 compared to 2022 + 24436.
source: HN
Faster CRC32 on the Apple M1
https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-the-apple-m1/ [dougallj.wordpress.com]
2022-05-22 19:25
tags:
cpu
hash
perf
programming
CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
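For reference, here is the checksum at its simplest: a bit-at-a-time CRC32 using the reflected polynomial 0xEDB88320, the variant used by zip, png, and gzip. This is only the slow baseline that optimised implementations are measured against, not the technique from the post. The function name is hypothetical.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Compute CRC-32 one bit at a time (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out the low bit; fold in the polynomial if that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # standard final inversion


check = crc32_bitwise(b"123456789")
print(f"{check:#010x}")  # the well-known check value 0xcbf43926
assert check == zlib.crc32(b"123456789")
```

Each input byte costs eight dependent shift-and-xor steps here, which is the serial bottleneck that table-driven and hardware-assisted implementations work to remove.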
source: HN
What's new in CPUs since the 80s?
https://danluu.com/new-cpu-features/ [danluu.com]
2022-04-19 17:10
tags:
article
concurrency
cpu
perf
programming
systems
Everything below refers to x86 and linux, unless otherwise indicated. History has a tendency to repeat itself, and a lot of things that were new to x86 were old hat to supercomputing, mainframe, and workstation folks.
x86 chips have picked up a lot of new features and whiz-bang gadgets.
Overall, a pretty good introduction to modern CPUs, performance, and concurrency.
Introduction to Apple Silicon
https://github.com/AsahiLinux/docs/wiki/Introduction-to-Apple-Silicon [github.com]
2022-03-17 03:23
tags:
bios
cpu
development
mac
systems
This document attempts to explain the Apple Silicon (i.e. M1 and later) Mac boot ecosystem (henceforth “AS Macs”), as it pertains to how open OSes interoperate with the platform.
It is intended for developers and maintainers of Linux, BSD and other OS distributions and boot-related components, as well as users interested in the platform, and its goal is to cover the overall picture without delving into excessive technical detail. Specifics should be left to other wiki pages. It also omits details that only pertain to macOS (such as how kernel extensions work and are loaded).
source: HN
Emulating AMD Approximate Arithmetic Instructions On Intel
https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-etc-on-intel.html [robert.ocallahan.org]
2021-09-13 04:29
tags:
cpu
debugging
math
programming
virtualization
Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.
source: HN