DROB (Dynamic Rewriter and Optimizer of Binary code)
> This library implements application-guided rewriting of binary functions at runtime. Binary functions can be optimized and specialized based on runtime information. In contrast to transparent binary optimization, only selected binary functions are rewritten. No metadata (e.g. debug information) is required.
High-performance input handling on the web
> There is a class of UI performance problems that arise from the following situation: an input event fires faster than the browser can paint frames.
> In a previous post, I discussed Lodash’s debounce and throttle functions, which I find very useful for these kinds of situations. Recently, however, I found a pattern I like even better, so I want to discuss that here.
Follow up: https://nolanlawson.com/2019/08/14/browsers-input-events-and-frame-throttling/
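The follow-up post is about frame throttling: rather than debouncing on a timer, you coalesce the event stream and handle only the latest event once per painted frame (via `requestAnimationFrame` in the browser). A minimal sketch of that idea, stripped of DOM specifics; the `FrameThrottler` class and `on_frame` hook are hypothetical names, not from the posts:

```python
class FrameThrottler:
    """Coalesce a fast event stream: remember only the latest
    event, and hand it to the handler at most once per frame."""

    def __init__(self, handler):
        self.handler = handler
        self.pending = None      # latest unprocessed event, if any

    def on_event(self, event):
        # Called as fast as events arrive; just overwrite.
        self.pending = event

    def on_frame(self):
        # Called once per painted frame; do the real work here.
        if self.pending is not None:
            self.handler(self.pending)
            self.pending = None

processed = []
t = FrameThrottler(processed.append)
for i in range(1000):   # 1000 events arrive between two frames...
    t.on_event(i)
t.on_frame()            # ...but only the last one is handled
```

The key property is that the handler can never fall behind: no matter how fast events fire, it runs at most once per frame, and always sees the freshest state.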
> Many ambulances now have electronic PCRs, which fix a lot of these problems. The report is automatically filed with the hospital. The software can enter timestamps and fill in necessary boilerplate. By spellchecking known medications, it saves time at the hospital. Nobody has to guess whether you scrawled “100mg” or “160mg”.
> The ambulance I shadowed had an ePCR. Nobody used it. I talked to the EMTs about this, and they said nobody they knew used it either. Lack of training? No, we all got trained. Crippling bugs? No, it worked fine. Paper was good enough? No, the ePCR was much better than paper PCRs in almost every way. It just had one problem: it was too slow.
The Ethics of Web Performance
> Advocates of any technique or technology can be a bit heavy-handed when it suits them if they’re not being careful–myself included. But I’m not sure if that’s the case here. When you stop to consider all the implications of poor performance, it’s hard not to come to the conclusion that poor performance is an ethical issue.
> Poor performance can, and does, lead to exclusion. This point is extremely well documented by now, but warrants repeating. Sites that use an excess of resources, whether on the network or on the device, don’t just cause slow experiences, but can leave entire groups of people out.
> The “best practice” solution is a technique known as suffix trees, which requires some moderately complex coding. However, we can get very reasonable performance for strings up to hundreds of thousands of characters long using a much simpler approach.
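The excerpt doesn’t name the exact problem, but a representative one in this space is finding the longest repeated substring. The “much simpler approach” is typically suffix sorting: sort all suffixes of the string, and any repeated substring appears as a shared prefix of two adjacent suffixes. A sketch of that idea (my illustration, not code from the article):

```python
def longest_repeated_substring(s):
    # Sort all suffixes; repeats become shared prefixes of neighbors.
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two adjacent suffixes.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best
```

As written this materializes every suffix, so it is quadratic in memory; the usual refinement for longer inputs is to sort suffix start indices instead of suffix strings. Either way it stays far simpler than a suffix tree.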
Two Performance Aesthetics: Never Miss a Frame and Do Almost Nothing
> I’ve noticed when I think about performance nowadays that I think in terms of two different aesthetics. One aesthetic, which I’ll call Never Miss a Frame, comes from the world of game development and is focused on writing code that has good worst case performance by making good use of the hardware. The other aesthetic, which I’ll call Do Almost Nothing, comes from a more academic world and is focused on algorithmically minimizing the work that needs to be done to the extent that there’s barely any work left, paying attention to the performance at all scales. In this post I’ll describe the two aesthetics, look at some case studies of pairs of programs in different domains that follow different aesthetics, and talk about the trade-offs involved and how to choose which direction to lean for a project.
Efficient Go APIs with the mid-stack inliner
> A common task in Go API design is returning a byte slice. In this post I will explore some old techniques and a new one.
Scrolling the main document is better for performance, accessibility, and usability
> This subscroller fix may be obvious to more experienced web devs, but to me it was a bit surprising. From a design standpoint, the two options seemed roughly equivalent, and it didn’t occur to me that one or the other would have such a big impact, especially on mobile browsers. Given the difference in performance, accessibility, and usability though, I’ll definitely think harder in the future about exactly which element I want to be the scrollable one.
Bringing service workers to Google Search
> The story of what shipped, how the impact was measured, and the tradeoffs that were made.
Quite long. Considers a variety of aspects.
> Guidelines like RAIL are popular in the web performance community. They often define time limits that must be respected, like 100ms for what feels instantaneous, or 1000ms for the limit of acceptable response time. Prominent people in the performance community keep telling us that there’s a lot of science behind those numbers.
> Nielsen essentially takes some of the numbers from the Miller paper, brushes the dust off them since they were pre-web and presents them in a simpler fashion that everyone understands, stating that they apply to the web. What Nielsen doesn’t do, however, is prove that those numbers are true with research of any kind. Jakob Nielsen is simply stating these limits as facts, but no science has been done to prove that they are true. And ever since, the entire web community has believed what a self-proclaimed expert said on the matter and turned it into guidelines. Surely, if an authoritative-looking man with glasses who holds a PhD in HCI states something very insistently, it must be true.
Trash talk: the Orinoco garbage collector
> Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.
In-DRAM Bulk Bitwise Execution Engine
> Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.
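The databases mentioned at the end are a good concrete case: bitmap indexes answer predicate queries with bulk ANDs and ORs over row-length bitvectors, which is exactly the pattern Ambit moves into DRAM. A software sketch of that workload (Python big integers stand in for bitvectors; the column names and set rows are made up):

```python
def make_bitvector(positions, size):
    """Bitvector of `size` rows with 1s at the given row positions."""
    v = 0
    for p in positions:
        assert p < size
        v |= 1 << p
    return v

N = 1_000_000                                  # one million rows
age_over_30 = make_bitvector([2, 5, 7], N)     # hypothetical predicate
city_is_nyc = make_bitvector([5, 7, 9], N)     # hypothetical predicate

# Answering "age > 30 AND city = NYC" is one bulk bitwise AND.
# On a CPU, every word of both bitvectors crosses the memory bus;
# Ambit performs this AND inside the DRAM arrays instead.
matches = age_over_30 & city_is_nyc

# Recover matching row ids by peeling off the lowest set bit.
rows = []
m = matches
while m:
    low = m & -m                 # isolate lowest set bit
    rows.append(low.bit_length() - 1)
    m ^= low
```

The computation itself is trivial; the paper’s point is that for large bitvectors the cost is dominated by moving the operands, which is what executing the AND in memory avoids.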
AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
> We have been teased with AMD’s next generation processor products for over a year. The new chiplet design has been heralded as a significant breakthrough in driving performance and scalability, especially as it becomes increasingly difficult to create large silicon with high frequencies on smaller and smaller process nodes. AMD is expected to deploy its chiplet paradigm across its processor line, through Ryzen and EPYC, with those chiplets each having eight next-generation Zen 2 cores. Today AMD went into more detail about the Zen 2 core, providing justification for the +15% clock-for-clock performance increase over the previous generation that the company presented at Computex last week.
Software-defined far memory in warehouse scale computers
> Font and date adjustments to accommodate the new Reiwa era in Japan
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
> One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance and efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users a reason to upgrade, the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put the 2600K through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
Faster arithmetic by flipping signs
> TL;DR: Flipping the sign of some arithmetic operations can lead to shorter and faster assembly.
2D Graphics on Modern GPU
> I have found that, if you can depend on modern compute capabilities, it seems quite viable to implement 2D rendering directly on GPU, with very promising quality and performance. The prototype I built strongly resembles a software renderer, just running on an outsized multicore GPU with wide SIMD vectors, much more so than rasterization-based pipelines.
Parsing Gigabytes of JSON per Second
> Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.