The Alder Lake SHLX anomaly
https://tavianator.com/2025/shlx.html [tavianator.com]
2025-01-03 09:54
tags:
benchmark
cpu
perf
programming
It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate). On the other hand, 32-bit instructions, and 64-bit instructions without immediates (even no-op ones), make it fast. All of these ways to initialize RCX lead to 1-cycle latency:
source: L
Static search trees: 40x faster than binary search
https://curiouscoding.nl/posts/static-search-tree/ [curiouscoding.nl]
2025-01-02 01:18
tags:
compsci
perf
programming
rust
In this post, we will implement a static search tree (S+ tree) for high-throughput searching of sorted data, as introduced on Algorithmica. We’ll mostly take the code presented there as a starting point, and optimize it to its limits. For a large part, I’m simply taking the ‘future work’ ideas of that post and implementing them. And then there will be a bunch of looking at assembly code to shave off all the instructions we can. Lastly, there will be one big addition to optimize throughput: batching.
https://en.algorithmica.org/hpc/data-structures/s-tree/
source: HN
Blazingly Fast Shadow Stacks for Go
https://blog.felixge.de/blazingly-fast-shadow-stacks-for-go/ [blog.felixge.de]
2024-05-30 07:32
tags:
compiler
go
perf
programming
Software shadow stacks could deliver up to 8x faster stack trace capturing in the Go runtime when compared to the frame pointer unwinding that landed in go1.21. This doesn’t mean that this idea should escape from the laboratory right away, but it offers a fun glimpse into a potential future of hardware accelerated stack trace capturing via shadow stacks.
source: HN
Computing Adler32 Checksums at 41 GB/s
https://wooo.sh/adler32.html [wooo.sh]
2024-04-30 04:32
tags:
c
perf
programming
While looking through the fpng source code, I noticed that its vectorized adler32 implementation seemed somewhat complicated, especially given how simple the scalar version of adler32 is. I was curious to see if I could come up with a simpler method, and in doing so, I came up with an algorithm that can be up to 7x faster than fpng’s version, and 109x faster than the simple scalar version.
source: trivium
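For reference, the "simple scalar version" the excerpt alludes to can be sketched in a few lines. This is the textbook Adler-32 definition, not the fpng code or the post's vectorized algorithm; the function name `adler32_scalar` is made up for this illustration.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, as used by Adler-32


def adler32_scalar(data: bytes) -> int:
    """Byte-at-a-time Adler-32 (hypothetical helper, for illustration only)."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # running sum of the bytes, starting at 1
        b = (b + a) % MOD_ADLER     # running sum of the successive values of a
    return (b << 16) | a


if __name__ == "__main__":
    sample = b"Wikipedia"
    # Cross-check against the standard library implementation.
    assert adler32_scalar(sample) == zlib.adler32(sample)
    print(hex(adler32_scalar(sample)))
```

The two modulo operations per byte are what make the naive loop slow; fast implementations typically defer the reduction and process many bytes per step.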
Bending pause times to your will with Generational ZGC
https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b [netflixtechblog.com]
2024-03-16 00:20
tags:
garbage-collection
java
perf
The latest long term support release of the JDK delivers generational support for the Z Garbage Collector. Netflix has switched by default from G1 to Generational ZGC on JDK 21 and later, because of the significant benefits of concurrent garbage collection.
source: HN
Low-level thinking in high-level shading languages 2023
https://interplayoflight.wordpress.com/2023/12/29/low-level-thinking-in-high-level-shading-languages-2023/ [interplayoflight.wordpress.com]
2024-01-01 04:21
tags:
gl
perf
programming
This, and the followup, is a presentation that I recommend as required reading to people wanting to get deeper into shader programming, not just for the knowledge but also the attitude towards shader programming (check compiler output, never assume, always profile). It has been 10 years since it was released though; in those 10 years a lot of things have changed on the GPU/shader model/shader compiler front and not all the suggestions in those presentations are still valid. So I decided to do a refresh with a modern compiler and shader model to see what still holds true and what doesn’t. I will target the RDNA 2 GPU architecture on PC using HLSL, the 6.7 shader model and the DXC compiler (using https://godbolt.org/) in this blog post.
Analyzing Starfield’s Performance on Nvidia’s 4090 and AMD’s 7900 XTX
https://chipsandcheese.com/2023/09/14/analyzing-starfields-performance-on-nvidias-4090-and-amds-7900-xtx/ [chipsandcheese.com]
2023-09-15 21:19
tags:
gaming
graphics
investigation
perf
We analyzed this scene using Nvidia’s Nsight Graphics and AMD’s Radeon GPU Profiler to get some insight into why Starfield performs the way it does. On the Nvidia side, we covered the last three generations of cards by testing the RTX 4090, RTX 3090, and Titan RTX. On AMD, we tested the RX 7900 XTX. The i9-13900K was used to collect data for all of these GPUs.
source: HN
FreeBSD on Firecracker
https://www.usenix.org/publications/loginonline/freebsd-firecracker [www.usenix.org]
2023-08-24 15:14
tags:
freebsd
perf
programming
systems
virtualization
Experiences porting FreeBSD 14 to run on the Firecracker VMM
source: L
When Good Correlation is Not Enough
https://hakibenita.com/postgresql-correlation-brin-multi-minmax [hakibenita.com]
2023-07-28 02:39
tags:
database
development
perf
sql
Choosing to use a block range index (BRIN) to query a field with high correlation is a no-brainer for the optimizer. The small size of the index and the field’s correlation makes BRIN an ideal choice. However, a recent event taught us that correlation can be misleading. Under some easily reproducible circumstances, a BRIN index can result in significantly slower execution even when the indexed field has very high correlation.
source: HN
Commander Keen's Adaptive Tile Refresh
https://fabiensanglard.net/ega/ [fabiensanglard.net]
2023-07-27 21:53
tags:
gaming
graphics
perf
programming
retro
I have been reading Doom Guy by John Romero. It is an excellent book which I highly recommend. In the ninth chapter, John describes being hit by lightning upon seeing Adaptive Tile Refresh (ATR). That made me realize I never took the time to understand how this crucial piece of tech powers the Commander Keen (CK) series.
At its heart, the problem ATR solves is bandwidth. Writing 320x200 nibbles (32 KiB) per frame is too much for the ISA bus. There is no way to maintain a 60Hz framerate while refreshing the whole screen. If we were to run the following code, which simply fills all banks, it would run at 5 frames per second.
source: HN
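The bandwidth argument in the excerpt above can be checked with a little arithmetic. The sketch below takes the excerpt's figures at face value (320x200 pixels, 4 bits per pixel, 5 frames per second for a full refresh, a 60Hz target); the constant and function names are made up for this illustration.

```python
WIDTH, HEIGHT = 320, 200
BITS_PER_PIXEL = 4  # one nibble per pixel (16 colors)

# Bytes that must be written to refresh the whole screen once.
FRAME_BYTES = WIDTH * HEIGHT * BITS_PER_PIXEL // 8


def bytes_per_second(fps: int) -> int:
    """Write bandwidth needed to redraw every pixel `fps` times per second."""
    return FRAME_BYTES * fps


if __name__ == "__main__":
    achieved = bytes_per_second(5)    # what a full refresh manages, per the excerpt
    required = bytes_per_second(60)   # what a 60Hz full refresh would need
    print(FRAME_BYTES, achieved, required, required // achieved)
```

A frame is 32,000 bytes (about 31.25 KiB, which the excerpt rounds to 32 KiB). Full refreshes at 5 frames per second imply roughly 160,000 bytes per second of usable write bandwidth, while 60Hz would need 1,920,000 bytes per second, a factor of 12 more. That gap is why redrawing only the tiles that changed matters.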
Shoot ’em up in style: the making of Gun Trails on Playdate
https://news.play.date/news/gun-trails/ [news.play.date]
2023-07-21 21:04
tags:
c
development
gaming
perf
programming
retro
Enter Playdate. I had wanted to build a shmup for years, but for various reasons—primarily bad scoping—the efforts always sputtered out. This little yellow device could provide the constraints needed, with the added bonus of a programming challenge to hit consistently high framerates.
source: L
gotraceui - an efficient frontend for Go execution traces
https://github.com/dominikh/gotraceui [github.com]
2023-03-31 02:29
tags:
development
go
perf
swtools
Gotraceui is a tool for visualizing and analyzing Go execution traces. It is meant to be a faster, more accessible, and more powerful alternative to go tool trace. Unlike go tool trace, Gotraceui doesn’t use deprecated browser APIs (or a browser at all), and its UI is tuned specifically to the unique characteristics of Go traces.
source: L
The futex_waitv() syscall and gaming on Linux
https://www.collabora.com/news-and-blog/blog/2023/02/17/the-futex-waitv-syscall-gaming-on-linux/ [www.collabora.com]
2023-02-17 23:48
tags:
concurrency
gaming
linux
perf
programming
systems
The futex_waitv syscall is a new syscall through which a process can wait on multiple futexes. The task wakes up when any futex in the list is awakened. This can be used to implement waiting on multiple locks, wait lists, etc., without the limitations imposed by using eventfd.
source: L
The 8086 processor's microcode pipeline from die analysis
http://www.righto.com/2023/01/the-8086-processors-microcode-pipeline.html [www.righto.com]
2023-01-27 18:28
tags:
cpu
hardware
investigation
perf
series
Do Not Taunt Happy Fun Branch Predictor
https://www.mattkeeter.com/blog/2023-01-25-branch/ [www.mattkeeter.com]
2023-01-25 20:09
tags:
cpu
perf
programming
I recently came up with a “clever” idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don’t fall victim in the future.
Building the fastest Lua interpreter.. automatically!
https://sillycross.github.io/2022/11/22/2022-11-22/ [sillycross.github.io]
2022-11-22 23:10
tags:
compiler
jit
lua
perf
programming
I have been working on a research project to make writing VMs easier. The idea arises from the following observation: writing a naive interpreter is not hard (just write a big switch-case), but writing a good interpreter (or JIT compiler) is hard, as it unavoidably involves hand-coding assembly. So why can’t we implement a special compiler to automatically generate a high-performance interpreter (and even the JIT) from “the big switch-case”, or more formally, a semantical description of what each bytecode does?
source: HN
How fast are Linux pipes anyway?
https://mazzo.li/posts/fast-pipes.html [mazzo.li]
2022-06-02 22:56
tags:
concurrency
linux
malloc
perf
programming
systems
In this post, we will explore how Unix pipes are implemented in Linux by iteratively optimizing a test program that writes and reads data through a pipe.
We will proceed as follows:
A first slow version of our pipe test bench;
How pipes are implemented internally, and why writing and reading from them is slow;
How the vmsplice and splice syscalls let us get around some (but not all!) of the slowness;
A description of Linux paging, leading up to a faster version using huge pages;
The final optimization, replacing polling with busy looping;
Some closing thoughts.
source: L
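A rough idea of what a "first slow version" of a pipe test bench looks like, sketched in Python rather than the post's own code. It assumes a writer thread and a reader in the same process (the function name, buffer sizes, and the use of threads are choices made for this illustration), so the number it reports includes interpreter overhead and is not comparable to the post's results.

```python
import os
import threading
import time


def pipe_throughput(total_bytes: int = 64 * 1024 * 1024,
                    chunk_size: int = 64 * 1024) -> float:
    """Push total_bytes through an OS pipe and return bytes per second."""
    read_fd, write_fd = os.pipe()
    chunk = b"\0" * chunk_size

    def writer() -> None:
        remaining = total_bytes
        try:
            while remaining > 0:
                # os.write may write fewer bytes than requested; use its return value.
                remaining -= os.write(write_fd, chunk[:min(chunk_size, remaining)])
        finally:
            os.close(write_fd)  # closing the write end signals EOF to the reader

    thread = threading.Thread(target=writer)
    start = time.perf_counter()
    thread.start()

    received = 0
    while True:
        data = os.read(read_fd, chunk_size)
        if not data:  # empty read means the write end was closed
            break
        received += len(data)

    thread.join()
    elapsed = time.perf_counter() - start
    os.close(read_fd)
    assert received == total_bytes
    return received / elapsed


if __name__ == "__main__":
    print(f"{pipe_throughput() / 1e9:.2f} GB/s")
```

Every byte here is copied into the kernel by the write and copied out again by the read, which is the baseline cost the optimizations listed above aim to reduce.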
All About Libpas, Phil's Super Fast Malloc
https://github.com/WebKit/WebKit/blob/main/Source/bmalloc/libpas/Documentation.md [github.com]
2022-06-01 21:43
tags:
c
malloc
perf
programming
Libpas is a fast and memory-efficient memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it’ll be used for comprehensive isoheaping of all malloc/new callsites in C/C++ programs.
source: HN
Faster CRC32 on the Apple M1
https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-the-apple-m1/ [dougallj.wordpress.com]
2022-05-22 19:25
tags:
cpu
hash
perf
programming
CRC32 is a checksum first proposed in 1961, and now used in a wide variety of performance sensitive contexts, from file formats (zip, png, gzip) to filesystems (ext4, btrfs) and protocols (like ethernet and SATA). So, naturally, a lot of effort has gone into optimising it over the years. However, I discovered a simple update to a widely used technique that makes it possible to run twice as fast as existing solutions on the Apple M1.
source: HN
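As a baseline for what is being optimized, here is a bit-at-a-time sketch of the CRC32 variant used by zip, png and gzip. It is the slow textbook form, not the technique from the post; the names `crc32_bitwise` and `CRC32_POLY_REFLECTED` are made up for this illustration.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # bit-reversed form of the polynomial 0x04C11DB7


def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC32 (hypothetical helper, for illustration only)."""
    crc = 0xFFFFFFFF  # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; if it was set, fold in the polynomial.
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final inversion


if __name__ == "__main__":
    sample = b"123456789"
    # Cross-check against the standard library implementation.
    assert crc32_bitwise(sample) == zlib.crc32(sample)
    print(hex(crc32_bitwise(sample)))
```

Fast implementations replace the inner loop with table lookups, carry-less multiplication, or dedicated CRC instructions, and process many bytes per step.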
Speeding up sort performance in Postgres 15
https://www.citusdata.com/blog/2022/05/19/speeding-up-sort-performance-in-postgres-15/ [www.citusdata.com]
2022-05-20 23:02
tags:
database
perf
sorting
sql
update
Let’s explore each of the 4 improvements in PostgreSQL 15 that speed up sorting:
Change 1: Improvements sorting a single column
Change 2: Reduce memory consumption by using generation memory context
Change 3: Add specialized sort routines for common datatypes
Change 4: Replace polyphase merge algorithm with k-way merge
source: HN
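The general idea behind the k-way merge named in Change 4 can be sketched with a min-heap that holds one candidate from each sorted run. This is only an illustration of the technique; PostgreSQL's implementation is written in C and operates on sorted runs spilled to disk, and the function name `k_way_merge` is made up here.

```python
import heapq
from typing import Iterator, List


def k_way_merge(runs: List[List[int]]) -> Iterator[int]:
    """Merge already-sorted runs into one sorted stream in a single pass."""
    # Heap entries are (value, run index, position within that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        value, run_index, pos = heapq.heappop(heap)
        yield value
        next_pos = pos + 1
        if next_pos < len(runs[run_index]):
            # Refill the heap from the run that just produced the smallest value.
            heapq.heappush(heap, (runs[run_index][next_pos], run_index, next_pos))


if __name__ == "__main__":
    runs = [[1, 4, 9], [2, 3, 10], [], [5]]
    print(list(k_way_merge(runs)))
```

With k runs and n total elements, each element passes through the heap once, so the merge costs O(n log k) comparisons. Python's standard library offers the same behavior as `heapq.merge`.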