Gathering Intel on Intel AVX-512 Transitions
> This is a post about AVX and AVX-512 related frequency scaling. Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?
> Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.
Clang format tanks performance
> Let’s benchmark toupper implementations.
> Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.
An analysis of performance evolution of Linux’s core operations
> When you get into the details I found it hard to come away with any strongly actionable takeaways though. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
> “Red Hat and Suse normally required 6-18 months to optimise the performance an an upstream Linux kernel before it can be released as an enterprise distribution”, and
> “Google’s data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months.”
Real-world measurements of structured-lattices and supersingular isogenies in TLS
> This is the third in a series of posts about running experiments on post-quantum confidentiality in TLS. The first detailed experiments that measured the estimated network overhead of three families of post-quantum key exchanges. The second detailed the choices behind a specific structured-lattice scheme. This one gives details of a full, end-to-end measurement of that scheme and a supersingular isogeny scheme, SIKE/p434. This was done in collaboration with Cloudflare, who integrated Microsoft’s SIKE code into BoringSSL for the tests, and ran the server-side of the experiment.
> Because optimised assembly implementations are labour-intensive to write, they were only available/written for AArch64 and x86-64. Because SIKE is computationally expensive, it wasn’t feasible to enable it without an assembly implementation, thus only AArch64 and x86-64 clients were included in the experiment and ARMv7 and x86 clients did not contribute to the results even if they were assigned to one of the experiment groups.
Making the Tokio scheduler 10x faster
> We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime. Today, a complete rewrite of the scheduler has been submitted as a pull request. The result is huge performance and latency improvements. Some benchmarks saw a 10x speed up! It is always unclear how much these kinds of improvements impact “full stack” use cases, so we’ve also tested how these scheduler improvements impacted use cases like Hyper and Tonic (spoiler: it’s really good).
> In preparation for working on the new scheduler, I spent time searching for resources on scheduler implementations. Besides existing implementations, I did not find much. I also found the source of existing implementations difficult to navigate. To remedy this, I tried to keep Tokio’s new scheduler implementation as clean as possible. I also am writing this detailed article on implementing the scheduler in hope that others in similar positions find it useful.
> The article starts with a high level overview of scheduler design, including work-stealing schedulers. It then gets into the details of specific optimizations made in the new Tokio scheduler.
PyPy's new JSON parser
> In the last year or two I have worked on and off on making PyPy’s JSON faster, particularly when parsing large JSON files. In this post I am going to document those techniques and measure their performance impact.
Benchmarking Fibers, Threads and Processes
> Awhile back, I set out to look at Fiber performance and how it’s improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby’s Fiber class by Samuel Williams.
> It’s not hard to write a microbenchmark for something like Fiber.yield. But it’s harder, and more interesting, to write a benchmark that’s useful and representative.
Go compiler intrinsics
> Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions1.
> An intrinsic function is Go code written in regular Go. These functions are known the the Go compiler which contains replacements which it can substitute during compilation.
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
> One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance, efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users reasons to upgrade, and the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put it in for a run through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
Who has the fastest website in F1?
> So, I’m going to make my predictions the only way I know how: By comparing the performance of their websites. That’ll work right? If anything, it’ll be interesting to compare 10 sites that have been recently updated, perhaps even rebuilt, and see what the common issues are. I’ll also cover the tools and techniques I use to test web performance.
Which Programming Languages Use the Least Electricity?
> Last year a team of six researchers in Portugal from three different universities decided to investigate this question, ultimately releasing a paper titled “Energy Efficiency Across Programming Languages.” They ran the solutions to 10 programming problems written in 27 different languages, while carefully monitoring how much electricity each one used — as well as its speed and memory usage.
Methodology may have flaws, but interesting topic.
What has your microcode done for you lately?
> Did you ever wonder what is inside those microcode updates that get silently applied to your CPU via Windows update, BIOS upgrades, and various microcode packages on Linux? Well, you are in the wrong place, because this blog post won’t answer that question (you might like this though).
> In fact, the overwhelming majority of this this post is about the performance of scattered writes, and not very much at all about the details of CPU microcode. Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.
Gallery of Processor Cache Effects
> In this blog post, I will use code samples to illustrate various aspects of how caches work, and what is the impact on the performance of real-world programs.
The State of Caching in Go
> In particular, Go lacks a concurrent LRU (or LFU) cache which can scale well enough to be a process-global cache. In this blog post, I will take you through the various attempts at workarounds that are typically advocated, including some which we have executed and learnt from within Dgraph. Aman will then present the design, performance and hit ratio comparison for the existing popular cache implementations in the Go ecosystem.
Achieving 100k connections per second with Elixir
> By analyzing the initial test results, proposing a theory, and confirming it by measuring against modified software, we were able to find two bottlenecks on the way to getting to 100k connections per second with Elixir and Ranch. The combination of multiple connection supervisors in Ranch and multiple listener sockets in the Linux kernel is necessary to achieve full utilization of the 36-core machine under the target workload.
Faster vlan(4) forwarding?
> Two years ago we observed that vlan(4) performances suffered from the locks added to the queueing API. At that time, the use of SRP was also pointed out as a possible responsible for the regression. Since dlg@ recently reworked if_enqueue() to allow pseudo-drivers to bypass the use of queues, and their associated locks, let’s dive into vlan(4) performances again
The Curious Case of BEAM CPU Usage
> Turns out, busy waiting in BEAM is an optimization that ensures maximum responsiveness. In essence, when waiting for a certain event, the virtual machine first enters a CPU-intensive tight loop, where it continuously checks to see if the event in question has occurred.
> In our test, we found that BEAM’s busy wait settings do have a significant impact on CPU usage. The highest impact was observed on the instance with the most available CPU capacity. At the same time, we did not observe any meaningful difference in performance between VMs with busy waiting enabled and disabled.
"Modern" C++ Lamentations
> This will be a long wall of text, and kinda random! My main points are:
> 1. C++ compile times are important,
> 2. Non-optimized build performance is important,
> 3. Cognitive load is important. I don’t expand much on this here, but if a programming language or a library makes me feel stupid, then I’m less likely to use it or like it. C++ does that a lot :)
Firefox 64 built with GCC and Clang
> Clang built Firefox is claimed to outperform GCC, but it is hard to get actual numbers. Firefox builds switched from GCC 6 builds (GCC 6 was released in 2016) with profile guided optimization (PGO) to Clang 7 builds (latest release) which in addition enable link time optimization (LTO). Link-time optimization can have important performance and code size impact.
Plus all the difficulties encountered trying to enable all the options.
More consistent LuaJIT performance
> So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we’ve been able to make a lot of programs (particularly large programs) have more consistent performance; and we’ve got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!