Go compiler intrinsics
> Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions.
> An intrinsic function is ordinary code written in regular Go. These functions are known to the Go compiler, which contains replacements it can substitute during compilation.
Upgrading from an Intel Core i7-2600K: Testing Sandy Bridge in 2019
> One of the most popular processors of the last decade has been the Intel Core i7-2600K. The design was revolutionary, as it offered a significant jump in single core performance and efficiency, and the top line processor was very overclockable. With the next few generations of processors from Intel being less exciting, or not giving users reasons to upgrade, the phrase ‘I’ll stay with my 2600K’ became ubiquitous on forums, and is even used today. For this review, we dusted off our box of old CPUs and put it in for a run through our 2019 benchmarks, both at stock and overclocked, to see if it is still a mainstream champion.
Who has the fastest website in F1?
> So, I’m going to make my predictions the only way I know how: By comparing the performance of their websites. That’ll work right? If anything, it’ll be interesting to compare 10 sites that have been recently updated, perhaps even rebuilt, and see what the common issues are. I’ll also cover the tools and techniques I use to test web performance.
Which Programming Languages Use the Least Electricity?
> Last year a team of six researchers in Portugal from three different universities decided to investigate this question, ultimately releasing a paper titled “Energy Efficiency Across Programming Languages.” They ran the solutions to 10 programming problems written in 27 different languages, while carefully monitoring how much electricity each one used — as well as its speed and memory usage.
Methodology may have flaws, but interesting topic.
What has your microcode done for you lately?
> Did you ever wonder what is inside those microcode updates that get silently applied to your CPU via Windows update, BIOS upgrades, and various microcode packages on Linux? Well, you are in the wrong place, because this blog post won’t answer that question (you might like this though).
> In fact, the overwhelming majority of this post is about the performance of scattered writes, and not very much at all about the details of CPU microcode. Where the microcode comes in, and what might make this more interesting than usual, is that performance on a purely CPU-bound benchmark can vary dramatically depending on microcode version. In particular, we will show that the most recent Intel microcode version can significantly slow down a store heavy workload when some stores hit in the L1 data cache, and some miss.
Gallery of Processor Cache Effects
> In this blog post, I will use code samples to illustrate various aspects of how caches work, and their impact on the performance of real-world programs.
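The post's opening example translates directly to Go (sketch mine, sizes assumed): touching every element of a large slice and touching every 16th element take roughly the same time, because with 8-byte elements both loops fetch every 64-byte cache line — memory traffic, not arithmetic, dominates:

```go
package main

import (
	"fmt"
	"time"
)

// touch multiplies every step-th element of a. With step=16, only
// 1/16 of the multiplications of step=1 are performed, yet both
// loops still bring every 64-byte cache line into the cache, so
// their running times on a large slice are surprisingly close.
func touch(a []int64, step int) time.Duration {
	start := time.Now()
	for i := 0; i < len(a); i += step {
		a[i] *= 3
	}
	return time.Since(start)
}

func main() {
	a := make([]int64, 16<<20) // 128 MiB, larger than any CPU cache
	fmt.Println("step  1:", touch(a, 1))
	fmt.Println("step 16:", touch(a, 16))
}
```

Raising the step beyond the cache line size (e.g. to 32 or 64) is where the running time finally starts to drop, since the loop begins skipping whole lines.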
The State of Caching in Go
> In particular, Go lacks a concurrent LRU (or LFU) cache which can scale well enough to be a process-global cache. In this blog post, I will take you through the various attempts at workarounds that are typically advocated, including some which we have executed and learnt from within Dgraph. Aman will then present the design, performance and hit ratio comparison for the existing popular cache implementations in the Go ecosystem.
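One of the typical workarounds the post surveys is lock sharding. A minimal sketch of the idea (names and structure are mine, not Dgraph's): split a mutex-guarded LRU into shards keyed by hash, so goroutines usually contend on different locks. This helps, but as the post argues, it still degrades under skewed, hot-key workloads:

```go
package main

import (
	"container/list"
	"fmt"
	"hash/fnv"
	"sync"
)

type entry struct {
	key string
	val interface{}
}

// lruShard is a plain mutex-guarded LRU: a map for O(1) lookup plus
// a doubly linked list ordered from most to least recently used.
type lruShard struct {
	mu    sync.Mutex
	cap   int
	items map[string]*list.Element
	order *list.List // front = most recently used
}

func newShard(cap int) *lruShard {
	return &lruShard{cap: cap, items: make(map[string]*list.Element), order: list.New()}
}

func (s *lruShard) get(key string) (interface{}, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if el, ok := s.items[key]; ok {
		s.order.MoveToFront(el)
		return el.Value.(*entry).val, true
	}
	return nil, false
}

func (s *lruShard) set(key string, val interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if el, ok := s.items[key]; ok {
		el.Value.(*entry).val = val
		s.order.MoveToFront(el)
		return
	}
	if s.order.Len() >= s.cap { // evict the least recently used entry
		oldest := s.order.Back()
		s.order.Remove(oldest)
		delete(s.items, oldest.Value.(*entry).key)
	}
	s.items[key] = s.order.PushFront(&entry{key, val})
}

// Cache spreads keys across shards by hash to reduce lock contention.
type Cache struct{ shards []*lruShard }

func NewCache(shards, perShard int) *Cache {
	c := &Cache{make([]*lruShard, shards)}
	for i := range c.shards {
		c.shards[i] = newShard(perShard)
	}
	return c
}

func (c *Cache) shard(key string) *lruShard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return c.shards[h.Sum32()%uint32(len(c.shards))]
}

func (c *Cache) Get(key string) (interface{}, bool) { return c.shard(key).get(key) }
func (c *Cache) Set(key string, val interface{})    { c.shard(key).set(key, val) }

func main() {
	c := NewCache(16, 128)
	c.Set("a", 1)
	v, ok := c.Get("a")
	fmt.Println(v, ok) // 1 true
}
```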
Achieving 100k connections per second with Elixir
> By analyzing the initial test results, proposing a theory, and confirming it by measuring against modified software, we were able to find two bottlenecks on the way to getting to 100k connections per second with Elixir and Ranch. The combination of multiple connection supervisors in Ranch and multiple listener sockets in the Linux kernel is necessary to achieve full utilization of the 36-core machine under the target workload.
Faster vlan(4) forwarding?
> Two years ago we observed that vlan(4) performance suffered from the locks added to the queueing API. At that time, the use of SRP was also pointed out as possibly responsible for the regression. Since dlg@ recently reworked if_enqueue() to allow pseudo-drivers to bypass the use of queues, and their associated locks, let’s dive into vlan(4) performance again.
The Curious Case of BEAM CPU Usage
> Turns out, busy waiting in BEAM is an optimization that ensures maximum responsiveness. In essence, when waiting for a certain event, the virtual machine first enters a CPU-intensive tight loop, where it continuously checks to see if the event in question has occurred.
> In our test, we found that BEAM’s busy wait settings do have a significant impact on CPU usage. The highest impact was observed on the instance with the most available CPU capacity. At the same time, we did not observe any meaningful difference in performance between VMs with busy waiting enabled and disabled.
"Modern" C++ Lamentations
> This will be a long wall of text, and kinda random! My main points are:
> 1. C++ compile times are important,
> 2. Non-optimized build performance is important,
> 3. Cognitive load is important. I don’t expand much on this here, but if a programming language or a library makes me feel stupid, then I’m less likely to use it or like it. C++ does that a lot :)
Firefox 64 built with GCC and Clang
> Clang-built Firefox is claimed to outperform GCC, but it is hard to get actual numbers. Firefox builds switched from GCC 6 builds (GCC 6 was released in 2016) with profile guided optimization (PGO) to Clang 7 builds (latest release) which in addition enable link time optimization (LTO). Link-time optimization can have important performance and code size impact.
Plus all the difficulties encountered trying to enable all the options.
More consistent LuaJIT performance
> So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we’ve been able to make a lot of programs (particularly large programs) have more consistent performance; and we’ve got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!
SMT Solving on an iPhone
> I’ve been seeing discussion for a while about the incredible progress Apple’s processor design team is making, and how it won’t be too long until Macs use Apple’s own ARM processors. These reports usually cite some cross-platform benchmarks like Geekbench to show that Apple’s mobile processors are at least as fast as Intel’s laptop and desktop chips. But I’ve always been a little skeptical of these cross-platform benchmarks (as are others)—do they really represent the sorts of workloads I use my Macs for?
At least one practical benchmark.
Why Aren’t More Users More Happy With Our VMs?
> In the process of using the Kalibera and Jones methodology, we noticed quite a lot of variation in the warmup time of different VMs and cases where VMs didn’t seem to warmup at all. This was surprising because pretty much every paper we’d read until that point had assumed – and, in many cases, explicitly stated – that warmup was a quick, consistent, thing. On that basis, it seemed interesting to see how the warmup time of different VMs compared. In May 2015, I asked Edd if he’d knock together a quick experiment in this vein, estimating that it would take a couple of weeks. After a couple of weeks we duly had data to look at but, to put it mildly, it wasn’t what we had expected: it showed all sorts of odd effects. My first reaction was that if we showed this data to anyone else without checking it thoroughly, we’d be in danger of becoming a laughing stock. It was tempting to bury my head in the sand again, but this time it seemed like it would be worth digging deeper to see where we’d gone wrong.
Be careful what you measure. You may not like the result...
Part 2: https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_2.html
GPU & FPGA cracking speeds for bcrypt, sha512crypt, sha256crypt, bsdicrypt scaled for same running time on CPU
Other comments in the discussion are also interesting.
PostgreSQL 11 and Just In Time Compilation of Queries
> One of the big changes in the next PostgreSQL release is the result of Andres Freund’s work on the query executor engine. Andres has been working on this part of the system for a while now, and in the next release we are going to see a new component in the execution engine: a JIT expression compiler!
> In our benchmarking, PostgreSQL 11 JIT is an awesome piece of technology and provides up to 29.31% speed improvements, executing TPC-H Q1 at scale factor 10 in 20.5s instead of 29s when using PostgreSQL 10.
Is Prefix Of String In Table?
> Wrote some C and assembly code that uses SIMD instructions to perform prefix matching of strings. The C code was between 4-7x faster than the baseline implementation for prefix matching. The assembly code was 9-12x faster than the baseline specifically for the negative match case (determining that an incoming string definitely does not prefix match any of our known strings). The fastest negative match could be done in around 6 CPU cycles, which is pretty quick. (Integer division, for example, takes about 90 cycles.)
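For reference, the operation being optimized looks like this as a scalar baseline (my own sketch in Go, not the post's C): test whether the input starts with any string in a fixed table. The naive loop compares prefixes one table entry at a time, which is the kind of baseline the 4-12x SIMD speedups are measured against:

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixInTable reports whether input begins with any string in
// table. The SIMD versions in the post answer the same question by
// comparing many table bytes per instruction; this loop is the
// straightforward one-entry-at-a-time equivalent.
func hasPrefixInTable(input string, table []string) bool {
	for _, p := range table {
		if strings.HasPrefix(input, p) {
			return true
		}
	}
	return false
}

func main() {
	table := []string{"GET ", "POST ", "PUT "}
	fmt.Println(hasPrefixInTable("GET /index.html", table)) // true
	fmt.Println(hasPrefixInTable("PATCH /x", table))        // false
}
```

The fast negative path matters because in workloads like protocol dispatch, most inputs match none of the known prefixes, so rejecting in ~6 cycles dominates overall throughput.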
A Timely Discovery: Examining Our AMD 2nd Gen Ryzen Results
> Instead of being a benefit to testing, what our investigation found is that when HPET is forced as the sole system timer, it can sometimes be a hindrance to system performance, particularly gaming performance. Worse, because HPET is implemented differently on different platforms, the actual impact of enabling it isn’t even consistent across vendors.
This is a kinda interesting read, and also a good cautionary tale from the “turn all the knobs” crypt. Foolish consistency is the hobgoblin of many benchmarks.
OpenBSD also suffered a bit from HPET slowdown, resolved by switching to TSC.
The Mainstream Phoenix Rises: Samsung's 970 EVO (500GB & 1TB) SSDs Reviewed
> The Intel SSD 750 was the first to bring the large performance benefits of NVMe to the consumer market. It was soon eclipsed by the Samsung 950 PRO, which offered much better real-world performance thanks to better optimization for consumer workloads - the Intel SSD 750’s enterprise roots were still quite apparent. When Samsung introduced the 960 PRO and 960 EVO generation, performance jumped again thanks in large part to their much improved second-generation NVMe controller. The 970 EVO brings another generation of new controllers and NAND flash, but huge performance jumps aren’t as easy to come by. We’re closing in on the limits of PCIe 3 x4 for sequential read speeds, and there’s not much low-hanging fruit for optimization left in the NVMe controllers and how they manage flash memory. Samsung’s 3D NAND is still increasing in density, but we’re not seeing much improvement in performance or power efficiency from it.
> That leaves Samsung having to make tradeoffs with the 970 EVO, sacrificing power efficiency in many places for slight performance gains. Since almost any consumer would find the 960 PRO and 960 EVO to already be plenty fast enough, this means the 970 EVO is not at all a compelling upgrade over its predecessors.
Have we reached peak SSD?