Block Profiling in Go
The block profile in Go lets you analyze how much time your program spends waiting on the blocking operations listed below:
Micro-Optimizing .tar.gz Archives by Changing File Order
A few weeks ago, I was doing something with a sizeable .tar.gz file, and wondered how the order of files affected the process. I’m not that knowledgable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well, if you give it random data, it will probably give you a bigger file than when you started. So reordering files seems like it could matter.
An Obscure American Automaker Now Has the World’s Fastest Car
AVIF has landed
AVIF is a new image format derived from the keyframes of AV1 video. It’s a royalty-free format, and it’s already supported in Chrome 85 on desktop. Android support will be added soon, Firefox is working on an implementation, and although it took Safari 10 years to add WebP support, I don’t think we’ll see the same delay here, as Apple are a member of the group that created AV1.
Roughly speaking, at an acceptable quality, the WebP is almost half the size of JPEG, and AVIF is under half the size of WebP. I find it incredible that AVIF can do a good job of the image in just 18 kB.
Is WebP really better than JPEG?
I think Google’s result of 25-34% smaller files is mostly caused by the fact that they compared their WebP encoder to the JPEG reference implementation, Independent JPEG Group’s cjpeg, not Mozilla’s improved MozJPEG encoder. I decided to run some tests to see how cjpeg, MozJPEG and WebP compare. I also tested the new AVIF format, based on the open AV1 video codec. AVIF support is already in Firefox behind a flag and should be coming soon to Chrome if this ticket is to be believed.
Ice Lake Store Elimination
We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.
But there’s a lot of investigation work to get there.
ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner
We exhaustively tested ZFS and RAID performance on our Storage Hot Rod server.
Elixir and Postgres: A Rarely Mentioned Problem
Last time, we talked about the magic trick to make your full text searches go fast. This time, I’ll tell you about another performance issue I encountered that probably also affects your performance, at least if you are using Ecto and PostgreSQL.
Gathering Intel on Intel AVX-512 Transitions
This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
Clang format tanks performance
Let’s benchmark toupper implementations.
Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
An analysis of performance evolution of Linux’s core operations
When you get into the details I found it hard to come away with any strongly actionable takeaways though. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
“Red Hat and Suse normally required 6-18 months to optimise the performance an an upstream Linux kernel before it can be released as an enterprise distribution”, and
“Google’s data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months.”
Real-world measurements of structured-lattices and supersingular isogenies in TLS
This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft’s SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.
Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.
Making the Tokio scheduler 10x faster
We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime. Today, a complete rewrite of the scheduler has been submitted as a pull request. The result is huge performance and latency improvements. Some benchmarks saw a 10x speed up! It is always unclear how much these kinds of improvements impact “full stack” use cases, so we’ve also tested how these scheduler improvements impacted use cases like Hyper and Tonic (spoiler: it’s really good).
In preparation for working on the new scheduler, I spent time searching for resources on scheduler implementations. Besides existing implementations, I did not find much. I also found the source of existing implementations difficult to navigate. To remedy this, I tried to keep Tokio’s new scheduler implementation as clean as possible. I also am writing this detailed article on implementing the scheduler in hope that others in similar positions find it useful.
The article starts with a high level overview of scheduler design, including work-stealing schedulers. It then gets into the details of specific optimizations made in the new Tokio scheduler.
PyPy's new JSON parser
In the last year or two I have worked on and off on making PyPy’s JSON faster, particularly when parsing large JSON files. In this post I am going to document those techniques and measure their performance impact.
Benchmarking Fibers, Threads and Processes
Awhile back, I set out to look at Fiber performance and how it’s improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby’s Fiber class by Samuel Williams.
It’s not hard to write a microbenchmark for something like Fiber.yield. But it’s harder, and more interesting, to write a benchmark that’s useful and representative.
Go compiler intrinsics
Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions1.
An intrinsic function is Go code written in regular Go. These functions are known the the Go compiler which contains replacements which it can substitute during compilation.
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance, efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users reasons to upgrade, and the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put it in for a run through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
Who has the fastest website in F1?
So, I’m going to make my predictions the only way I know how: By comparing the performance of their websites. That’ll work right? If anything, it’ll be interesting to compare 10 sites that have been recently updated, perhaps even rebuilt, and see what the common issues are. I’ll also cover the tools and techniques I use to test web performance.
Which Programming Languages Use the Least Electricity?
Last year a team of six researchers in Portugal from three different universities decided to investigate this question, ultimately releasing a paper titled “Energy Efficiency Across Programming Languages.” They ran the solutions to 10 programming problems written in 27 different languages, while carefully monitoring how much electricity each one used — as well as its speed and memory usage.
Methodology may have flaws, but interesting topic.
What has your microcode done for you lately?
Did you ever wonder what is inside those microcode updates that get silently applied to your CPU via Windows update, BIOS upgrades, and various microcode packages on Linux? Well, you are in the wrong place, because this blog post won’t answer that question (you might like this though).
In fact, the overwhelming majority of this this post is about the performance of scattered writes, and not very much at all about the details of CPU microcode. Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.