What are the most important statistical ideas of the past 50 years?
We argue that the most important statistical ideas of the past half century are: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss common features of these ideas, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science.
Micro-Optimizing .tar.gz Archives by Changing File Order
A few weeks ago, I was doing something with a sizeable .tar.gz file, and wondered how the order of files affected the process. I’m not that knowledgeable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well; if you give it random data, it will probably give you a bigger file than when you started. So reordering files seems like it could matter.
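A minimal sketch of the effect, not taken from the post: it builds two in-memory archives holding the same three files in different orders and compares their compressed sizes. The file names, the sizes, and the `targz_size` helper are all hypothetical; the only fact relied on is that DEFLATE (used by gzip) looks back at most 32 KiB, so two identical files only help each other when they sit close together in the tar stream.

```python
import io
import random
import tarfile


def targz_size(files):
    """Return the size in bytes of a .tar.gz built from (name, data) pairs, in order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


rng = random.Random(0)


def noise(n):
    """n bytes of seeded pseudo-random (effectively incompressible) data."""
    return bytes(rng.getrandbits(8) for _ in range(n))


block = noise(20_000)   # appears twice in the archive
filler = noise(40_000)  # larger than gzip's 32 KiB window

# Duplicates adjacent: the second copy falls inside the sliding window.
grouped = [("a.bin", block), ("a_copy.bin", block), ("filler.bin", filler)]
# Duplicates separated by the filler: the first copy has left the window.
split = [("a.bin", block), ("filler.bin", filler), ("a_copy.bin", block)]

if __name__ == "__main__":
    print("grouped:", targz_size(grouped))
    print("split:  ", targz_size(split))
```

With real files the gain depends on how similar neighbouring files are and how large they are relative to the window, so this toy case overstates what reordering typically achieves.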
Hunting a Linux kernel bug
Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug!
95%-ile isn't that good
Reaching 95%-ile isn’t very impressive because it’s not that hard to do. I think this is one of my most ridiculable ideas. It doesn’t help that, when stated nakedly, that sounds elitist. But I think it’s just the opposite: most people can become (relatively) good at most things.
There are several sections here. Every time I thought I was nearing the end, more content showed up.
Let’s talk about files! Most developers seem to think that files are easy.
In this talk, we’re going to look at how file systems differ from each other and other issues we might encounter when writing to files. We’re going to look at the file “stack”, starting at the top with the file API, moving down to the filesystem, and then moving down to disk.
Modifying reassociate for improved CSE: fairly large perf gains
Wed Oct 25 11:36:54 PDT 2017
When playing around with reassociate I noticed a seemingly obvious optimization that was not getting done anywhere in llvm… nor in gcc or ICC.
Some bounds checks are elided by Apple's compiler and possibly others
Although triggered by a compiler optimization, this is a bug in Cap’n Proto, not the compiler.
To most observers, this code would appear to be correct. However, as it turns out, pointer arithmetic that overflows is undefined behavior under the C standard. As a result, the compiler is allowed to assume that the addition on the first line never overflows.
C with ABC!
In this paper, I describe a new compiler for the C89 programming language.
A paper and a compiler!
An Adaptive Packed-Memory Array
The packed-memory array (PMA) is a data structure that maintains a dynamic set of N elements in sorted order in a Θ(N)-sized array. The idea is to intersperse Θ(N) empty spaces or gaps among the elements so that only a small number of elements need to be shifted around on an insert or delete. Because the elements are stored physically in sorted order in memory or on disk, the PMA can be used to support extremely efficient range queries.
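To make the gap idea concrete, here is a deliberately simplified sketch. It is not the adaptive PMA from the paper: a real PMA rebalances windows of the array according to density thresholds, while this version just doubles the array and respreads every element once it is more than half full, and it finds positions with a linear scan. The class name `GappedArray` and its methods are hypothetical.

```python
class GappedArray:
    """Sorted keys kept in a list with None gaps (a toy, PMA-like layout)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def keys(self):
        """All stored keys, in sorted order."""
        return [k for k in self.slots if k is not None]

    def _respread(self, capacity):
        # Rebuild the array, spacing the keys evenly across the new capacity.
        keys = self.keys()
        self.slots = [None] * capacity
        for i, key in enumerate(keys):
            self.slots[i * capacity // len(keys)] = key

    def insert(self, key):
        # Keep the array at most half full so a gap is always nearby.
        if 2 * (self.count + 1) > len(self.slots):
            self._respread(2 * len(self.slots))
        slots, n = self.slots, len(self.slots)
        # Index of the first stored key >= key (n if there is none).
        pos = next((i for i, k in enumerate(slots)
                    if k is not None and k >= key), n)
        if pos > 0 and slots[pos - 1] is None:
            slots[pos - 1] = key  # a gap sits exactly where we need it
        else:
            right = next((j for j in range(pos, n) if slots[j] is None), None)
            if right is not None:
                # Shift the run [pos, right) one slot right into the gap.
                slots[pos + 1:right + 1] = slots[pos:right]
                slots[pos] = key
            else:
                # No gap to the right: shift a run one slot left instead.
                left = next(j for j in range(pos - 1, -1, -1)
                            if slots[j] is None)
                slots[left:pos - 1] = slots[left + 1:pos]
                slots[pos - 1] = key
        self.count += 1

    def range_query(self, lo, hi):
        """Keys in [lo, hi], read off in physical (sorted) order."""
        return [k for k in self.slots if k is not None and lo <= k <= hi]
```

Only the elements between the insertion point and the nearest gap move on an insert, which is the property the excerpt describes; the amortized bounds of a real PMA come from its windowed rebalancing, which this sketch omits.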
A book-in-progress about the Linux kernel and its insides. The goal is simple - to share my modest knowledge about the insides of the Linux kernel and help people who are interested in Linux kernel insides, and other low-level subject matter.
Timers in Google Home!
Ok Google, set a timer for ninety-nine years
timer for *minus* one-thousand-nine-hundred-thirty-nine weeks, two days, six hours, twenty-eight minutes and sixteen seconds starting now
Is there data on the quality of management decisions?
Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.
Are we right or wrong? Tune in next decade to see what’s changed.
Musings on Kotlin Ranges
Here are a few interesting aspects of Kotlin ranges, some of which I’ve found to be less-than-intuitive.
Filesystem error handling
Prabhakaran et al. injected errors at the block device level (just underneath the filesystem) and found that ext3, reiserfs, ntfs, and jfs mostly handled read errors reasonably but ext3, ntfs, and jfs mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use ext4 than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like ext4 and btrfs, some legacy filesystems like exfat, ext3, and jfs, as well as on overlayfs.
Strange Hash Instances in Ruby
Everything can be patched, except the things that can’t.
A history of branch prediction from 1500000 BC to 1995
We’ll start with the most naive things someone might do and work our way up to something better.
Why does Sattolo's algorithm produce a permutation with exactly one cycle?
I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo’s algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we’re traversing it in random order.
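For reference, a short sketch of Sattolo's algorithm as it is usually stated: it looks like a Fisher-Yates shuffle, except that each element is swapped with a strictly earlier position and never with itself. The function names and the `cycle_length` checker are hypothetical helpers, not code from the post.

```python
import random


def sattolo(n, rng=random):
    """Return a random permutation of range(n) that forms a single cycle."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i: position i is never left in place
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def cycle_length(perm, start=0):
    """Follow perm from start and count the steps needed to return to start."""
    length, cur = 1, perm[start]
    while cur != start:
        cur = perm[cur]
        length += 1
    return length


if __name__ == "__main__":
    p = sattolo(10)
    print(p, cycle_length(p))
```

Following `perm` from any starting index visits every element before returning, so `cycle_length` equals the list length, which is the property needed for the pointer-chasing use described above.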
Book review: "Working Effectively with Legacy Code" by Michael C. Feathers
The hacks are a good match to the foe - they’re about as awful as the code itself, so young and innocent developers may find themselves (rightfully) horrified.
Terminal and shell performance
Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, benchmark the speed of sinking stdout or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink stdout many orders of magnitude faster than humans can keep up with, so further optimizing stdout sink speed has a relatively small impact on actual user experience for most users.
Writing a SAT Solver
In this post, we’ll look at how to teach computers to solve puzzles. Specifically, we’ll look at a simple puzzle that can be expressed as a boolean constraint satisfaction problem, and we’ll write a simple constraint solver (a SAT solver) and mention how our algorithm, when augmented with a few optimizations, is used in modern SAT solvers.
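The post's own solver is not reproduced in this excerpt, so the following is a generic backtracking sketch with unit propagation, in the spirit of the DPLL procedure that modern solvers build on. The encoding is an assumption: a formula is a list of clauses, each clause a list of non-zero integers, where `3` means variable 3 is true and `-3` means it is false. The function names are hypothetical.

```python
def simplify(clauses, literal):
    """Assume literal is true: drop satisfied clauses, remove its negation elsewhere."""
    reduced = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        reduced.append([lit for lit in clause if lit != -literal])
    return reduced


def solve(clauses, assignment=None):
    """Return a dict {variable: bool} satisfying all clauses, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    if not clauses:
        return assignment  # every clause has been satisfied
    if any(len(clause) == 0 for clause in clauses):
        return None  # an empty clause can never be satisfied
    # Unit propagation: a one-literal clause forces that literal's value.
    unit = next((c[0] for c in clauses if len(c) == 1), None)
    if unit is not None:
        choices = [unit]
    else:
        first = clauses[0][0]
        choices = [first, -first]  # try one value, then backtrack to the other
    for lit in choices:
        result = solve(simplify(clauses, lit),
                       {**assignment, abs(lit): lit > 0})
        if result is not None:
            return result
    return None


def satisfies(clauses, assignment):
    """Check an assignment; variables missing from it are treated as False."""
    return all(any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
               for clause in clauses)


if __name__ == "__main__":
    formula = [[1, 2], [-1, 3], [-2, -3]]
    print(solve(formula))
```

The returned assignment may leave out variables whose value turned out not to matter. Production solvers add conflict-driven clause learning, better branching heuristics, and efficient data structures on top of this basic search.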