Announcing composable multi-threaded parallelism in Julia
> Software performance depends more and more on exploiting multiple processor cores. The free lunch from Moore’s Law is still over. Well, we here in the Julia developer community have something of a reputation for caring about performance. In pursuit of it, we have already built a lot of functionality for multi-process, distributed programming and GPUs, but we’ve known for years that we would also need a good story for composable multi-threading. Today we are happy to announce a major new chapter in that story. We are releasing a preview of an entirely new threading interface for Julia programs: general task parallelism, inspired by parallel programming systems like Cilk, Intel Threading Building Blocks (TBB) and Go. Task parallelism is now available in the v1.3.0-alpha release, an early preview of Julia version 1.3.0 likely to be released in a couple months. You can find binaries with this feature on the downloads page, or build the master branch from source.
Ten Years of Erlang
> I wanted to take a bit of time to reflect over most of that decade. In this post, I’ll cover a few things such as hype phases and how this related to Erlang, the ladder of ideas within the language and how that can impact adoption, what changed in my ten years here, and I’ll finish up with what I think Erlang still has to bring to the programming community at large.
Provide protection against starvation of the ll/sc loops when accessing userspace.
> Casueword(9) on ll/sc architectures must be prepared for userspace constantly modifying the same cache line as the one containing the CAS word, and must not loop infinitely. Otherwise, rogue userspace can livelock the kernel.
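The shape of the required fix is easier to picture outside the kernel: bound the retry loop and let the caller back off instead of spinning forever. A minimal userspace sketch with C11 atomics follows; it is illustrative only, not the actual casueword(9) code, and the function name and retry bound are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: a CAS loop with a retry bound, so another party that
 * keeps dirtying the cache line cannot pin us in this loop forever.
 * Returns true on success, false if the caller should back off
 * (yield, sleep, or restart the whole operation). */
static bool
cas_bounded(_Atomic uint32_t *word, uint32_t expected, uint32_t desired,
            int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        uint32_t old = expected;
        if (atomic_compare_exchange_weak(word, &old, desired))
            return true;
        if (old != expected)   /* the value really changed: a normal CAS failure */
            return false;
        /* old == expected: a spurious ll/sc-style failure, retry within the bound */
    }
    return false;              /* bound exhausted: let the caller yield and retry */
}
```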
The convoy phenomenon
> The duration of a lock is the average number of instructions executed while the lock is held. The execution interval of a lock is the average number of instructions executed between successive requests for that lock by a process. The collision cross section of the lock is the fraction of time it is granted, i.e., the lock grant probability.
> Most of us are stuck with a pre-emptive scheduler (i.e., general purpose operating system with virtual memory). Hence convoys will occur. The problem is to make them evaporate quickly when they do occur rather than have them persist forever.
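Reading those definitions back-of-the-envelope (this is my arithmetic, not the paper's exact model): with duration D, execution interval E, and N processes, each process holds the lock a fraction D/E of the time, so

```latex
P(\text{lock busy}) \approx 1 - \left(1 - \tfrac{D}{E}\right)^{N} \approx \tfrac{ND}{E}
\quad\text{when } \tfrac{D}{E} \ll 1
```

which is tiny in the happy case. The convoy problem is that one untimely preemption while the lock is held can push the system into a regime where the lock is effectively always busy, hence the point above about making convoys evaporate quickly rather than persist.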
Trash talk: the Orinoco garbage collector
> Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.
In-DRAM Bulk Bitwise Execution Engine
> Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.
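For a sense of what "bulk bitwise operations" means on a conventional machine, and why the memory channel is the bottleneck Ambit removes, here is the processor-side baseline sketched in C; this is the workload being compared against, not Ambit itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Conventional bulk bitwise AND over large bitvectors: every word of a, b,
 * and dst crosses the memory channel, which is exactly the traffic Ambit
 * aims to eliminate by computing inside DRAM. */
void bulk_and(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = a[i] & b[i];
}
```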
If each thread’s TEB is referenced by the fs selector, does that mean that the 80386 is limited to 1024 threads?
> No, it doesn’t, because nobody said that the distinct values had to be different simultaneously.
Understanding real-world concurrency bugs in Go
> We perform the first systematic study on concurrency bugs in real Go programs. We studied six popular Go software [projects] including Docker, Kubernetes, and gRPC. We analyzed 171 concurrency bugs in total, with more than half of them caused by non-traditional, Go-specific problems. Apart from root causes of these bugs, we also studied their fixes, performed experiments to reproduce them, and evaluated them with two publicly-available Go bug detectors.
Improvements in forking, threading, and signal code
> I am improving signaling code in the NetBSD kernel, covering corner cases with regression tests, and improving the documentation. I’ve been working at the level of system calls (syscalls): forking, threading, handling these with GDB, and tracing syscalls. Some work happens behind the scenes as I support the work of Michal Gorny on LLDB/ptrace features.
Bad utmp implementations in Glibc and FreeBSD
> I wondered: If the files consist of fixed-sized records, and are readable by regular users, how is consistency maintained? That is – how can a process ensure that, when it updates the database, it doesn’t conflict with another process also attempting to update the database at the same time? Similarly, how can a process reading an entry from the database be sure that it receives a consistent, full record and not a record which has been partially updated? (after all, POSIX allows that a write(2) call can return without having written all the requested bytes, and I’m not aware of Linux or any of the *BSDs documenting that this cannot happen for regular files). Clearly, some kind of locking is needed; a process that wants to write to or read from the database locks it first, performs its operation, and then unlocks the database. Once again, this happens under the hood, in the implementation of the getutent/pututline functions or their equivalents.
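The kind of locking the author describes can be sketched with fcntl(2) advisory record locks around one fixed-size slot. This is illustrative only, not the glibc or FreeBSD pututline implementation, and the helper name is made up.

```c
#include <fcntl.h>
#include <unistd.h>
#include <utmpx.h>   /* struct utmpx: one fixed-size record per slot */

/* Illustrative only: overwrite slot `idx` while holding an advisory write
 * lock on that record, so concurrent readers and writers using the same
 * protocol see either the old record or the new one, never a torn mix. */
int write_slot(int fd, int idx, const struct utmpx *rec)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = (off_t)idx * sizeof *rec,
        .l_len    = sizeof *rec,
    };
    if (fcntl(fd, F_SETLKW, &fl) == -1)   /* block until we own the record */
        return -1;

    ssize_t n = pwrite(fd, rec, sizeof *rec, fl.l_start);

    fl.l_type = F_UNLCK;                  /* release regardless of outcome */
    fcntl(fd, F_SETLK, &fl);
    return (n == (ssize_t)sizeof *rec) ? 0 : -1;
}
```

Note that, as the quote points out, pwrite(2) is still allowed to write fewer bytes than requested; the sketch simply reports that as a failure.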
Go Concurrency from the Ground Up
> Sometimes the best way to learn something is to build it. This guide will walk you through how to reproduce Go’s concurrency features in another programming language.
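In that spirit, the core of a buffered channel can be approximated in C with a mutex, two condition variables, and a ring buffer. A toy sketch only, nowhere near Go's actual runtime implementation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy bounded channel: mutex + two condition variables + ring buffer. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  not_full, not_empty;
    void          **buf;
    size_t          cap, len, head;
} chan_t;

chan_t *chan_make(size_t cap) {
    chan_t *c = calloc(1, sizeof *c);
    c->buf = calloc(cap, sizeof *c->buf);
    c->cap = cap;
    pthread_mutex_init(&c->mu, NULL);
    pthread_cond_init(&c->not_full, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    return c;
}

void chan_send(chan_t *c, void *v) {            /* blocks while the buffer is full */
    pthread_mutex_lock(&c->mu);
    while (c->len == c->cap)
        pthread_cond_wait(&c->not_full, &c->mu);
    c->buf[(c->head + c->len) % c->cap] = v;
    c->len++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->mu);
}

void *chan_recv(chan_t *c) {                    /* blocks while the buffer is empty */
    pthread_mutex_lock(&c->mu);
    while (c->len == 0)
        pthread_cond_wait(&c->not_empty, &c->mu);
    void *v = c->buf[c->head];
    c->head = (c->head + 1) % c->cap;
    c->len--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->mu);
    return v;
}
```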
Splitting atoms in XNU
> A locking bug in the XNU virtual memory subsystem allowed violation of the preconditions required for the correctness of an optimized virtual memory operation. This was abused to create shared memory where it wasn’t expected, allowing the creation of a time-of-check-time-of-use bug where one wouldn’t usually exist. This was exploited to cause a heap overflow in XPC, which was used to trigger the execution of a jump-oriented payload which chained together arbitrary function calls in an unsandboxed root process, even in the presence of Apple’s implementation of ARM’s latest Pointer Authentication Codes (PAC) hardware mitigation. The payload opened a privileged socket and sent the file descriptor back to the sandboxed process, where it was used to trigger a kernel heap overflow only reachable from outside the sandbox.
This is excellent work.
> Windows fibers are really just stackful, symmetric coroutines. From a different point of view, they’re cooperatively scheduled threads, which is the source of the analogous name, fibers. They’re symmetric because all fibers are equal, and no fiber is the “main” fiber. If any fiber returns from its start routine, the program exits. (Older versions of Wine will crash when this happens, but it was recently fixed.) It’s equivalent to the process’ main thread returning from main(). The initial fiber is free to create a second fiber, yield to it, then the second fiber destroys the first.
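The behaviour described is easy to reproduce in a few lines of Win32 (a minimal sketch, error handling omitted). Note that the second fiber switches back rather than returning: removing that SwitchToFiber call is exactly the "fiber returns from its start routine, the program exits" case mentioned above.

```c
#include <windows.h>
#include <stdio.h>

static LPVOID main_fiber, second_fiber;

static VOID CALLBACK second_proc(LPVOID param)
{
    (void)param;
    puts("hello from the second fiber");
    SwitchToFiber(main_fiber);          /* yield back instead of returning */
}

int main(void)
{
    main_fiber   = ConvertThreadToFiber(NULL);   /* the initial fiber */
    second_fiber = CreateFiber(0, second_proc, NULL);

    SwitchToFiber(second_fiber);        /* cooperative hand-off: only one fiber
                                           runs at a time */
    puts("back in the initial fiber");
    DeleteFiber(second_fiber);
    return 0;
}
```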
Parallelism and concurrency need different tools
> In this piece, I disagree with Joe Armstrong and Rob Pike, basing my argument on the differences between vending machines and gift boxes (illustrated with drawings carefully prepared in Microsoft Paint).
> Parallelism and concurrency are both very fashionable notions. Lots of languages and tools are advertised as good at these things – often at both things.
> I believe that concurrency and parallelism call for very different tools, and each tool can be really good at either one or the other.
The State of Caching in Go
> In particular, Go lacks a concurrent LRU (or LFU) cache which can scale well enough to be a process-global cache. In this blog post, I will take you through the various attempts at workarounds that are typically advocated, including some which we have executed and learnt from within Dgraph. Aman will then present the design, performance and hit ratio comparison for the existing popular cache implementations in the Go ecosystem.
Keeping CALM: when distributed consistency is easy
> When it comes to high performing scalable distributed systems, coordination is a killer. It’s the dominant term in the Universal Scalability Law. When we can avoid or reduce the need for coordination things tend to get simpler and faster. See for example Coordination avoidance in database systems, and more recently the amazing performance of Anna which gives a two-orders-of-magnitude speed-up through coordination elimination. So we should avoid coordination whenever we can.
> So far so good, but when exactly can we avoid coordination? Becoming precise in the answer to that question is what the CALM theorem is all about. You’re probably familiar with Brooks’ distinction between essential complexity and accidental complexity in his ‘No silver bullet’ essay. Here we get to tease apart the distinction between essential coordination, a guarantee that cannot be provided without coordinating, and accidental coordination, coordination that could have been avoided with a more careful design.
Understanding Real-World Concurrency Bugs in Go
> In this paper, we perform the first systematic study on concurrency bugs in real Go programs. We studied six popular Go software including Docker, Kubernetes, and gRPC. We analyzed 171 concurrency bugs in total, with more than half of them caused by non-traditional, Go-specific problems. Apart from root causes of these bugs, we also studied their fixes, performed experiments to reproduce them, and evaluated them with two publicly-available Go bug detectors. Overall, our study provides a better understanding on Go’s concurrency models and can guide future researchers and practitioners in writing better, more reliable Go software and in developing debugging and diagnosis tools for Go.
The Unscalable, Deadlock-prone, Thread Pool
> Thread pools tend to only offer a sparse interface: pass a closure or a function and its arguments to the pool, and that function will be called, eventually. Functions can do anything, so this interface should offer all the expressive power one could need. Experience tells me otherwise.
> This post comes in two parts. First, the story of a simple program that’s parallelised with a thread pool, then hits a wall as a wider set of resources becomes scarce. Second, a solution I like for that kind of program: an explicit state machine, where each state gets a dedicated queue that is aware of the state’s resource requirements.
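As a rough picture of that second part, here is the shape I read into "each state gets a dedicated queue that is aware of the state's resource requirements": a single-threaded C sketch with locking and real resource accounting elided, and with invented stage names and limits.

```c
#include <stddef.h>

/* Toy sketch of "one queue per state, each aware of its resource needs".
 * Names, stages, and limits are illustrative, not the post's actual code. */
enum stage { DOWNLOAD, DECOMPRESS, UPLOAD, N_STAGES };

struct task {
    enum stage   stage;
    struct task *next;
};

struct stage_queue {
    struct task *head, *tail;
    int          in_flight;   /* tasks from this stage currently running */
    int          limit;       /* cap derived from the scarce resource */
};

static struct stage_queue queues[N_STAGES] = {
    [DOWNLOAD]   = { .limit = 16 },   /* bounded by open connections */
    [DECOMPRESS] = { .limit = 4  },   /* bounded by memory */
    [UPLOAD]     = { .limit = 16 },
};

/* A worker asks each stage in turn for runnable work; a stage whose resource
 * budget is exhausted simply declines, so later stages can drain and free
 * resources instead of the pool deadlocking behind earlier ones. */
static struct task *pick_work(void)
{
    for (int s = N_STAGES - 1; s >= 0; s--) {   /* prefer draining later stages */
        struct stage_queue *q = &queues[s];
        if (q->head != NULL && q->in_flight < q->limit) {
            struct task *t = q->head;
            q->head = t->next;
            if (q->head == NULL)
                q->tail = NULL;
            q->in_flight++;
            return t;
        }
    }
    return NULL;   /* nothing runnable right now */
}
```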
The Curious Case of BEAM CPU Usage
> Turns out, busy waiting in BEAM is an optimization that ensures maximum responsiveness. In essence, when waiting for a certain event, the virtual machine first enters a CPU-intensive tight loop, where it continuously checks to see if the event in question has occurred.
> In our test, we found that BEAM’s busy wait settings do have a significant impact on CPU usage. The highest impact was observed on the instance with the most available CPU capacity. At the same time, we did not observe any meaningful difference in performance between VMs with busy waiting enabled and disabled.
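The general pattern, stripped of anything BEAM-specific, is spin-then-block: poll cheaply for a while in the hope the event arrives immediately, then fall back to a real blocking wait. The spin phase is what shows up as "busy" CPU on an otherwise idle system. A sketch, not the actual scheduler code:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

struct event {
    atomic_bool     signaled;   /* the signaler sets this and broadcasts cv under mu */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
};

void event_wait(struct event *e, int spin_limit)
{
    for (int i = 0; i < spin_limit; i++) {   /* busy-wait phase */
        if (atomic_load(&e->signaled))
            return;
        sched_yield();
    }
    pthread_mutex_lock(&e->mu);              /* blocking phase */
    while (!atomic_load(&e->signaled))
        pthread_cond_wait(&e->cv, &e->mu);
    pthread_mutex_unlock(&e->mu);
}
```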
Why is My Perfectly Good Shellcode Not Working?: Cache Coherency on MIPS and ARM
> To set the scene: You found a stack buffer overflow, wrote your shellcode to an executable heap or stack, and used your overflow to direct the instruction pointer to the address of your shellcode. Yet your shellcode is inconsistent, crashes frequently, and core dumps show the processor jumped to an address halfway through your shellcode, seemingly without executing the first half. The symptoms haven’t helped diagnose the problem; they’ve left you more confused.
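The benign version of the same problem, and its fix, shows up whenever you generate code at runtime: make the instruction cache see what the data cache wrote before jumping. A sketch using the GCC/Clang builtin; the helper name is made up, and the buffer is assumed to already be mapped executable.

```c
#include <string.h>

/* Copy freshly generated machine code into an executable buffer and make the
 * instruction cache see what the data cache wrote before jumping to it.
 * On x86 this is largely a no-op; on MIPS and ARM it is the difference
 * between running your code and running stale garbage. */
void *install_code(void *exec_buf, const void *code, size_t len)
{
    memcpy(exec_buf, code, len);
    __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
    return exec_buf;
}
```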