Mispredicted branches can multiply your running times
> Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
Making the Tokio scheduler 10x faster
> We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime. Today, a complete rewrite of the scheduler has been submitted as a pull request. The result is huge performance and latency improvements. Some benchmarks saw a 10x speed up! It is always unclear how much these kinds of improvements impact “full stack” use cases, so we’ve also tested how these scheduler improvements impacted use cases like Hyper and Tonic (spoiler: it’s really good).
> In preparation for working on the new scheduler, I spent time searching for resources on scheduler implementations. Besides existing implementations, I did not find much. I also found the source of existing implementations difficult to navigate. To remedy this, I tried to keep Tokio’s new scheduler implementation as clean as possible. I also am writing this detailed article on implementing the scheduler in hope that others in similar positions find it useful.
> The article starts with a high level overview of scheduler design, including work-stealing schedulers. It then gets into the details of specific optimizations made in the new Tokio scheduler.
PyPy's new JSON parser
> In the last year or two I have worked on and off on making PyPy’s JSON faster, particularly when parsing large JSON files. In this post I am going to document those techniques and measure their performance impact.
A fast alternative to the modulo reduction
> Assume that x and N are 32-bit integers, consider the 64-bit product x * N. You have that (x * N) div 2^32 is in the range, and it is a fair map.
A Couple of (Probabilistic) Worst-case Bounds for Robin Hood Linear Probing
> I like to think of Robin Hood hash tables with linear probing as arrays sorted on uniformly distributed keys, with gaps. That makes it clearer that we can use these tables to implement algorithms based on merging sorted streams in bulk, as well as ones that rely on fast point lookups. A key question with randomised data structures is how badly we can expect them to perform.
Introducing Ristretto: A High-Performance Go Cache
> With over six months of research and development, we’re proud to announce the initial release of Ristretto: A High Performance, Concurrent, Memory-Bound Go cache. It is contention-proof, scales well and provides consistently high hit-ratios.
Interesting read even if only for the links to prior art and research.
Common Systems Programming Optimizations & Tricks
> Today’s blog post is an overview of some common optimization techniques and neat tricks for doing “systems programming” – whatever that means today. We’ll walk through some methods to make your code run faster, be more efficient, and to squeeze just a little more juice from whatever you got.
Benchmarking Fibers, Threads and Processes
> Awhile back, I set out to look at Fiber performance and how it’s improved in recent Ruby versions. After all, concurrency is one of the three pillars of Ruby 3x3! Also, there have been some major speedups in Ruby’s Fiber class by Samuel Williams.
> It’s not hard to write a microbenchmark for something like Fiber.yield. But it’s harder, and more interesting, to write a benchmark that’s useful and representative.
Postgres Execution Plans - Field Glossary
> There are lots of guides out there to the basics of execution plans, but a lot are quite scarce on the details - how to interpret particular values, what they really mean, and where the pitfalls are.
> We’ve spent a lot of time over the last 18 months learning, clarifying, and downright misinterpreting how each of these fields work — and there’s still further for us to go on that.
> But we have come a long way, and I’d like to share the guide that I wish had existed when we started out — a glossary of the most common fields you’ll see on the operations in a query plan, and a detailed description of what each one means.
Taskbar Latency and Kernel Calls
> I work quickly on my computer and I get frustrated when I am forced to wait on an operation that should be fast. A persistent nuisance on my over-powered home laptop is that closing windows on the taskbar is slow. I right-click on an entry, wait for the menu to appear, and then select “Close window”. The mouse movement should be the slow part of this but instead I find that the delay before the menu appears is the longest component.
> What this says is that, over the course of two right-mouse clicks, RuntimeBroker.exe, thread 10,252, issued 229,604 ReadFile calls, reading a total of 15,686,586 bytes. That is an average read of 68 bytes each time.
The Synchronization of Periodic Routing Messages
> The paper considers a network with many apparently-independent periodic processes and discusses one method by which these processes can inadvertently become synchronized. In particular, we study the synchronization of periodic routing messages, and offer guidelines on how to avoid inadvertent synchronization. Using simulations and analysis, we study the process of synchronization and show that the transition from unsynchronized to synchronized traffic is not one of gradual degradation but is instead a very abrupt ‘phase transition’: in general, the addition of a single router will convert a completely unsynchronized traffic stream into a completely synchronized one. We show that synchronization can be avoided by the addition of randomization to the traffic sources and quantify how much randomization is necessary. In addition, we argue that the inadvertent synchronization of periodic processes is likely to become an increasing problem in computer networks.
The Baseline Interpreter: a faster JS interpreter in Firefox 70
> The Baseline Interpreter sits between the C++ interpreter and the Baseline JIT and has elements from both. It executes all bytecode instructions with a fixed interpreter loop (like the C++ interpreter). In addition, it uses Inline Caches to improve performance and collect type information (like the Baseline JIT).
The story of a V8 performance cliff in React
Position Independent Code (PIC) in shared libraries
> This article explained what position independent code is, and how it helps create shared libraries with shareable read-only text sections. There are some tradeoffs when choosing between PIC and its alternative (load-time relocation), and the eventual outcome really depends on a lot of factors, like the CPU architecture on which the program is going to run.
Avoid Premature Optimization
> How I fell into the trap of premature optimization, the root of all evil.
Go compiler intrinsics
> Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions1.
> An intrinsic function is Go code written in regular Go. These functions are known the the Go compiler which contains replacements which it can substitute during compilation.
Thoughts on Rust bloat
> I’m about to accept a PR that will increase druid’s compile time about 3x and its executable size almost 2x. In this case, I think the tradeoff is worth it (without localization, a GUI toolkit is strictly a toy), but the bloat makes me unhappy and I think there is room for improvement in the Rust ecosystem.
Security flaws caused by compiler optimizations
I think I would prefer “enabled” to “caused” since the flaw is in the code. A collection of bug patterns worth considering.
DROB (Dynamic Rewriter and Optimizer of Binary code)
> This library implements application-guided rewriting of binary functions at runtime. Binary functions can be optimized and specialized based on runtime information. In contrast to transparent binary optimization, only selected binary functions are rewritten. No metadata (e.g. debug information) is required.
High-performance input handling on the web
> There is a class of UI performance problems that arise from the following situation: An input event is firing faster than the browser can paint frames.
> In a previous post, I discussed Lodash’s debounce and throttle functions, which I find very useful for these kinds of situations. Recently however, I found a pattern I like even better, so I want to discuss that here.
Follow up: https://nolanlawson.com/2019/08/14/browsers-input-events-and-frame-throttling/