The Alder Lake SHLX anomaly
https://tavianator.com/2025/shlx.html [tavianator.com]
2025-01-03 09:54
tags:
benchmark
cpu
perf
programming
It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate). On the other hand, 32-bit instructions, and 64-bit instructions without immediates (even no-op ones), make it fast. All of these ways to initialize RCX lead to 1-cycle latency:
source: L
The Biggest Scandal In Speed Typing History
https://www.youtube.com/watch?v=maCHHSussS4 [www.youtube.com]
2023-06-27 02:30
tags:
benchmark
factcheck
hoipolloi
investigation
retro
tty
video
Barbara Blackburn is often cited as the fastest typist in history. She even appears in the Guinness Book of World Records! She must be legit, right? Well, maybe not. I was supposed to make a video about the new typing speed world record, and instead got pulled into a Barbara Blackburn rabbit hole that I can’t seem to escape. TL;DR: She’s not that fast.
Block Profiling in Go
https://github.com/felixge/go-profiler-notes/blob/main/block.md [github.com]
2021-02-10 01:46
tags:
benchmark
development
go
perf
The block profile in Go lets you analyze how much time your program spends waiting on the blocking operations listed below:
source: HN
Micro-Optimizing .tar.gz Archives by Changing File Order
https://justinblank.com/experiments/optimizingtar.html [justinblank.com]
2021-01-20 06:50
tags:
benchmark
compression
perf
storage
A few weeks ago, I was doing something with a sizeable .tar.gz file, and wondered how the order of files affected the process. I’m not that knowledgeable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well; if you give it random data, it will probably give you a bigger file than when you started. So reordering files seems like it could matter.
source: danluu
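A minimal sketch of why file order could matter, assuming Python's standard tarfile and gzip modules and synthetic data (this is not the linked author's experiment; the helper make_tar_gz and the file names are hypothetical). DEFLATE can only reference matches within a 32 KB window, so a duplicate file placed right after its twin can be encoded as back-references, while one placed more than 32 KB away cannot:

```python
import gzip
import io
import random
import tarfile


def make_tar_gz(files):
    """Build a .tar.gz in memory from (name, data) pairs, in the given order."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # mtime=0 keeps the gzip header identical between runs.
    return gzip.compress(raw.getvalue(), mtime=0)


# Two distinct 20 kB blobs of seeded pseudo-random (incompressible) data.
rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(20_000))
b = bytes(rng.getrandbits(8) for _ in range(20_000))

# Grouped: each duplicate starts about 21 kB after its twin, inside the window.
grouped = [("a1", a), ("a2", a), ("b1", b), ("b2", b)]
# Interleaved: each duplicate starts about 42 kB after its twin, outside it.
interleaved = [("a1", a), ("b1", b), ("a2", a), ("b2", b)]

grouped_size = len(make_tar_gz(grouped))
interleaved_size = len(make_tar_gz(interleaved))
print(f"grouped: {grouped_size} bytes, interleaved: {interleaved_size} bytes")
```

The same four files are archived both times; only the order differs. Real archives rarely contain exact duplicates, so the effect there is likely to be far smaller than in this contrived case.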
An Obscure American Automaker Now Has the World’s Fastest Car
https://www.bloomberg.com/news/articles/2020-10-19/ssc-tuatara-is-world-s-fastest-production-car-new-top-speed-record [www.bloomberg.com]
2020-10-20 18:50
tags:
benchmark
cars
AVIF has landed
https://jakearchibald.com/2020/avif-has-landed/ [jakearchibald.com]
2020-09-09 20:52
tags:
benchmark
graphics
web
AVIF is a new image format derived from the keyframes of AV1 video. It’s a royalty-free format, and it’s already supported in Chrome 85 on desktop. Android support will be added soon, Firefox is working on an implementation, and although it took Safari 10 years to add WebP support, I don’t think we’ll see the same delay here, as Apple are a member of the group that created AV1.
Roughly speaking, at an acceptable quality, the WebP is almost half the size of JPEG, and AVIF is under half the size of WebP. I find it incredible that AVIF can do a good job of the image in just 18 kB.
source: L
Is WebP really better than JPEG?
https://siipo.la/blog/is-webp-really-better-than-jpeg [siipo.la]
2020-06-23 16:39
tags:
benchmark
graphics
web
I think Google’s result of 25-34% smaller files is mostly caused by the fact that they compared their WebP encoder to the JPEG reference implementation, Independent JPEG Group’s cjpeg, not Mozilla’s improved MozJPEG encoder. I decided to run some tests to see how cjpeg, MozJPEG and WebP compare. I also tested the new AVIF format, based on the open AV1 video codec. AVIF support is already in Firefox behind a flag and should be coming soon to Chrome if this ticket is to be believed.
source: HN
Ice Lake Store Elimination
https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html [travisdowns.github.io]
2020-05-18 20:25
tags:
benchmark
cpu
investigation
perf
systems
We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% of L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
source: HN
ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner
https://arstechnica.com/gadgets/2020/05/zfs-versus-raid-eight-ironwolf-disks-two-filesystems-one-winner/ [arstechnica.com]
2020-05-18 19:32
tags:
admin
benchmark
filesystem
hardware
storage
We exhaustively tested ZFS and RAID performance on our Storage Hot Rod server.
source: ars
Elixir and Postgres: A Rarely Mentioned Problem
https://blog.soykaf.com/post/postgresql-elixir-troubles/ [blog.soykaf.com]
2020-02-19 06:02
tags:
benchmark
database
perf
sql
Last time, we talked about the magic trick to make your full text searches go fast. This time, I’ll tell you about another performance issue I encountered that probably also affects your performance, at least if you are using Ecto and PostgreSQL.
Gathering Intel on Intel AVX-512 Transitions
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html [travisdowns.github.io]
2020-01-17 22:19
tags:
benchmark
cpu
investigation
perf
programming
This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
source: HN
Clang format tanks performance
https://travisdowns.github.io/blog/2019/11/19/toupper.html [travisdowns.github.io]
2019-11-19 22:54
tags:
benchmark
c
cxx
perf
programming
turtles
Let’s benchmark toupper implementations.
Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
source: L
An analysis of performance evolution of Linux’s core operations
https://blog.acolyer.org/2019/11/04/an-analysis-of-performance-evolution-of-linuxs-core-operations/ [blog.acolyer.org]
2019-11-04 21:40
tags:
benchmark
development
linux
paper
perf
systems
When you get into the details I found it hard to come away with any strongly actionable takeaways though. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
“Red Hat and Suse normally required 6-18 months to optimise the performance of an upstream Linux kernel before it can be released as an enterprise distribution”, and
“Google’s data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months.”
Real-world measurements of structured-lattices and supersingular isogenies in TLS
https://www.imperialviolet.org/2019/10/30/pqsivssl.html [www.imperialviolet.org]
2019-10-30 21:45
tags:
benchmark
browser
crypto
networking
quantum
security
This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft’s SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.
Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.
Also: https://blog.cloudflare.com/the-tls-post-quantum-experiment/
source: green
Making the Tokio scheduler 10x faster
https://tokio.rs/blog/2019-10-scheduler/ [tokio.rs]
2019-10-14 16:58
tags:
benchmark
concurrency
perf
programming
rust
systems
update
We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime. Today, a complete rewrite of the scheduler has been submitted as a pull request. The result is huge performance and latency improvements. Some benchmarks saw a 10x speed up! It is always unclear how much these kinds of improvements impact “full stack” use cases, so we’ve also tested how these scheduler improvements impacted use cases like Hyper and Tonic (spoiler: it’s really good).
In preparation for working on the new scheduler, I spent time searching for resources on scheduler implementations. Besides existing implementations, I did not find much. I also found the source of existing implementations difficult to navigate. To remedy this, I tried to keep Tokio’s new scheduler implementation as clean as possible. I also am writing this detailed article on implementing the scheduler in hope that others in similar positions find it useful.
The article starts with a high level overview of scheduler design, including work-stealing schedulers. It then gets into the details of specific optimizations made in the new Tokio scheduler.
source: HN
PyPy's new JSON parser
https://morepypy.blogspot.com/2019/10/pypys-new-json-parser.html [morepypy.blogspot.com]
2019-10-08 17:06
tags:
benchmark
jit
perf
programming
python
In the last year or two I have worked on and off on making PyPy’s JSON faster, particularly when parsing large JSON files. In this post I am going to document those techniques and measure their performance impact.
source: HN
Benchmarking Fibers, Threads and Processes
http://engineering.appfolio.com/appfolio-engineering/2019/9/13/benchmarking-fibers-threads-and-processes [engineering.appfolio.com]
2019-09-19 19:37
tags:
benchmark
concurrency
perf
programming
ruby
A while back, I set out to look at Fiber performance and how it’s improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby’s Fiber class by Samuel Williams.
It’s not hard to write a microbenchmark for something like Fiber.yield. But it’s harder, and more interesting, to write a benchmark that’s useful and representative.
source: L
Go compiler intrinsics
https://dave.cheney.net/2019/08/20/go-compiler-intrinsics [dave.cheney.net]
2019-08-22 05:33
tags:
benchmark
cpu
go
perf
programming
Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions.
An intrinsic function is Go code written in regular Go. These functions are known to the Go compiler, which contains replacements that it can substitute during compilation.
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
https://www.anandtech.com/show/14043/upgrading-from-an-intel-core-i7-2600k-testing-sandy-bridge-in-2019 [www.anandtech.com]
2019-05-11 01:00
tags:
benchmark
cpu
hardware
perf
retro
One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance and efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users reasons to upgrade, the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put it in for a run through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
source: HN
Who has the fastest website in F1?
https://jakearchibald.com/2019/f1-perf/ [jakearchibald.com]
2019-04-03 03:01
tags:
benchmark
development
html
perf
web
So, I’m going to make my predictions the only way I know how: by comparing the performance of their websites. That’ll work, right? If anything, it’ll be interesting to compare 10 sites that have been recently updated, perhaps even rebuilt, and see what the common issues are. I’ll also cover the tools and techniques I use to test web performance.
source: HN