PopCount on ARM64 in Go Assembler
> Apropos of Apple’s ARM announcement, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware. I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.
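The post dives into hand-written ARM64 assembly; as a point of reference, here is a minimal portable sketch of that hot loop in plain Go (names are illustrative, not the post’s code), using math/bits, which the compiler already turns into native population-count instructions on most targets:

```go
package popcount

import "math/bits"

// PopCount sums the set bits across a slice of 64-bit words: the portable
// baseline that hand-written assembly versions compete against.
func PopCount(words []uint64) int {
	total := 0
	for _, w := range words {
		total += bits.OnesCount64(w) // intrinsified by the Go compiler on most architectures
	}
	return total
}
```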
Ice Lake Store Elimination
> We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% of L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
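The fill benchmarks in question are native microbenchmarks; purely to sketch the shape of the workload being measured (buffer size and names are my own, not the article’s), a Go equivalent would look roughly like this:

```go
package fill

import "testing"

// BenchmarkFill repeatedly stores zeros over a buffer big enough that
// writebacks from L2 to L3, and from L3 to RAM, come into play.
func BenchmarkFill(b *testing.B) {
	buf := make([]uint64, 1<<21) // 16 MiB, illustrative
	b.SetBytes(int64(len(buf) * 8))
	for i := 0; i < b.N; i++ {
		for j := range buf {
			buf[j] = 0 // zero stores are the case observed to be eligible for elimination
		}
	}
}
```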
What Outranks Thread Priority?
> This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on. The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy. When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it I realized it was time to get serious.
A lot of effort for a rather unsatisfactory conclusion, but I won’t spoil the surprise.
This Goes to Eleven - Decimating Array.Sort with AVX2
> Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.
> Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a solved problem and that nothing further can be done about it in 2020, except for waiting for newer, marginally faster machines to pop up. However, that is not the case, and while I’m not the first to have thoughts about it, or the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C#, that outperforms CoreCLR’s C++ code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop. Sounds interesting? If so, down the rabbit hole we go…
Very well done.
Speeding up Linux disk encryption
> At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed to be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.
> To be fair, the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering whether all this extra queueing could cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is that a significant amount of tail latency is due to queueing effects.
Analysing .NET start-up time with Flamegraphs
> Recently I gave a talk at the NYAN Conference called “From ‘dotnet run’ to ‘hello world’”. In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending its time during start-up.
Elixir and Postgres: A Rarely Mentioned Problem
> Last time, we talked about the magic trick to make your full text searches go fast. This time, I’ll tell you about another performance issue I encountered that probably also affects your performance, at least if you are using Ecto and PostgreSQL.
Precision Opportunities for Demanded Bits in LLVM
> A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
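The analysis itself runs over LLVM IR; as a hedged illustration of the truncation example in ordinary source code (Go here, function name mine), note that a narrowing cast makes the high bits of everything feeding it irrelevant:

```go
package demanded

// Only the low 8 bits of the product survive the truncation, so a
// demanded-bits style analysis may legally narrow the 32-bit multiply:
// multiplication modulo 2^8 depends only on the low 8 bits of each operand,
// i.e. uint8(x*y) == uint8(uint8(x)*uint8(y)).
func low8Product(x, y uint32) uint8 {
	return uint8(x * y)
}
```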
Gathering Intel on Intel AVX-512 Transitions
> This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
> Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
Too Much Crypto
> We show that many symmetric cryptography primitives would not be less safe with significantly fewer rounds. To support this claim, we review the cryptanalysis progress in the last 20 years, examine the reasons behind the current number of rounds, and analyze the risk of doing fewer rounds. Advocating a rational and scientific approach to round numbers selection, we propose revised number of rounds for AES, BLAKE2, ChaCha, and SHA-3, which offer more consistent security margins across primitives and make them much faster, without increasing the security risk.
Chunking Optimizations: Let the Knife Do the Work
> “Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.
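The article makes its case in C; the same idea carries over elsewhere. A hedged Go sketch of lifting an unnecessary constraint (clearing a buffer, names mine): hand-chunking fixes the stride, while the plain loop leaves the toolchain free to pick one.

```go
package chunking

import "encoding/binary"

// clearChunked fixes the chunk size at 8 bytes by hand, which the compiler
// must honour even when a wider or narrower stride would be better.
func clearChunked(b []byte) {
	i := 0
	for ; i+8 <= len(b); i += 8 {
		binary.LittleEndian.PutUint64(b[i:], 0)
	}
	for ; i < len(b); i++ {
		b[i] = 0
	}
}

// clearSimple states only the intent; the Go toolchain recognises this loop
// and lowers it to an optimised memclr, choosing whatever width suits the target.
func clearSimple(b []byte) {
	for i := range b {
		b[i] = 0
	}
}
```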
Go memory ballast: How I learnt to stop worrying and love the heap
> The heap size is the total size of allocations on the heap. Therefore, if a ballast of 10 GiB is allocated, the next GC will only trigger when the heap size grows to 20 GiB. At that point, there will be roughly 10 GiB of ballast + 10 GiB of other allocations.
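A minimal sketch of the idea as described, with the 10 GiB figure from the quote (variable names mine):

```go
package main

import "runtime"

func main() {
	// A large, never-touched allocation raises the heap size the GC pacer
	// sees, so the next collection is deferred until roughly twice as much
	// has been allocated. Untouched pages cost virtual memory, not RAM.
	ballast := make([]byte, 10<<30) // 10 GiB

	// ... run the real workload here ...

	// Keep the ballast reachable so the GC never frees it.
	runtime.KeepAlive(ballast)
}
```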
Boosting the Real Time Performance of Gnome Shell 3.34 in Ubuntu 19.10
> As you may have read many times, Gnome 3.34 brings much improved desktop performance. In this article we will describe some of the improvements contributed by Canonical, how the problems were surprising, how they were approached and what other performance work is coming in future.
> The thing is, in the case of Gnome Shell, its biggest performance problems of late were not hot spots at all. They were better characterised as cold spots, where it sat idle instead of updating the screen smoothly. Such cold spots are only apparent when you look at the real time usage of a program, and not in the CPU or GPU time consumed.
Nice write-up on addressing stuttering and lag.
Clang-format tanks performance
> Let’s benchmark toupper implementations.
> Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
The day when starting a receiver fixed the transmitter
> Have you ever tried to do something, but had it fail and weren’t really sure why? Did you then try to fall back to doing something you could actually measure in order to then get a handle on the problem? I had something like this happen quite a while back with some software defined radio stuff. Here’s how it went.
Snap: a microkernel approach to host networking
> This paper describes the networking stack, Snap, that has been running in production at Google for the last three years+. It’s been clear for a while that software designed explicitly for the data center environment will increasingly want/need to make different design trade-offs to e.g. general-purpose systems software that you might install on your own machines. But wow, I didn’t think we’d be at the point yet where we’d be abandoning TCP/IP! You need a lot of software engineers and the willingness to rewrite a lot of software to entertain that idea.
FreeBSD'fy ZFS zlib zalloc/zfree callbacks
> The previous code came from OpenSolaris, which to my understanding requires the allocation size to be known in order to free memory. To store that size, the previous code allocated an additional 8-byte header. But I noticed that zlib with the present settings allocates 64KB context buffers for each call, which could be efficiently cached by UMA; the addition of those 8 bytes makes them fall back to physical RAM allocations, which causes huge overhead and lock congestion on small blocks. Since FreeBSD’s free() does not take a size argument, switching to it solves the problem, increasing write speed to ZVOLs with 4KB block size and GZIP compression on my 40-thread test system from ~60MB/s to ~600MB/s.
An analysis of performance evolution of Linux’s core operations
> When you get into the details, though, I found it hard to come away with any strongly actionable takeaways. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
> “Red Hat and Suse normally required 6-18 months to optimise the performance of an upstream Linux kernel before it can be released as an enterprise distribution”, and
> “Google’s data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months.”
Dramatically reduced power usage in Firefox 70 on macOS with Core Animation
> In Firefox 70 we changed how pixels get to the screen on macOS. This allows us to do less work per frame when only small parts of the screen change. As a result, Firefox 70 drastically reduces the power usage during browsing.
> Every Firefox window contains one OpenGL context, which covers the entire window. Firefox 69 was using the API described above. So we were always redrawing the whole window on every change, and the window manager was always copying our entire window to the screen on every change. This turned out to be a problem despite the fact that these draws were fully hardware accelerated.
> Core Animation is the name of an Apple framework which lets you create a tree of layers (CALayer). These layers usually contain textures with some pixel content. The layer tree defines the positions, sizes, and order of the layers within the window. Starting with macOS 10.14, all windows use Core Animation by default, as a way to share their rendering with the window manager.
Mispredicted branches can multiply your running times
> Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
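As a hedged illustration of one such trick (Go, counting odd values; the example is mine, not the article’s), the condition can be folded into the sum so there is no branch left to mispredict:

```go
package branchless

// CountOddBranchy relies on a conditional branch; on random input the
// predictor is wrong roughly half the time.
func CountOddBranchy(values []uint64) int {
	n := 0
	for _, v := range values {
		if v&1 == 1 {
			n++
		}
	}
	return n
}

// CountOddBranchless folds the condition into arithmetic, leaving nothing
// for the predictor to guess.
func CountOddBranchless(values []uint64) int {
	n := 0
	for _, v := range values {
		n += int(v & 1)
	}
	return n
}
```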