Introducing Glush: a robust, human readable, top-down parser compiler
> It’s been 45 years since Stephen Johnson wrote Yacc (Yet another compiler-compiler), a parser generator that made it possible for anyone to write fast, efficient parsers. Yacc, and its many derivatives, quickly became popular and were included in many Unix distributions. You would imagine that in 45 years we would have further perfected the art of creating parsers and would have standardized on a single tool. A lot of progress has been made, but there are still annoyances and problems affecting every tool out there.
This is great, even just for the overview of parsing.
> The CYK algorithm (named after Cocke–Younger–Kasami) is in my opinion of great theoretical importance when it comes to parsing context-free grammars. CYK will parse all context-free parsers in O(n3), including the “simple” grammars that LL/LR can parse in linear time. It accomplishes this by converting parsing into a different problem: CYK shows that parsing context-free languages is equivalent to doing a boolean matrix multiplication. Matrix multiplication can be done naively in cubic time, and as such parsing context-free languages can be done in cubic time. It’s a very satisfying theoretical result, and the actual algorithm is small and easy to understand.
Hacking GitHub with Unicode's dotless 'i'.
> GitHub’s forgot password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the reset password link to the email address provided by the attacker- which was technically speaking, not the same email address.
This is beautiful.
Teletext’s creative legacy lives on
> Like Walkmans and VHS recorders, teletext now seems impossibly quaint. But designer and writer Craig Oldham explains that not only was Teletext a revolutionary technology in its prime, its creative legacy lives on with a new generation of artists who love its creative limits.
Announcing the Allsorts Font Shaping Engine
> Today YesLogic is open-sourcing the Allsorts font parser, shaping engine, and subsetter for OpenType, WOFF, and WOFF2 under the Apache 2.0 license. Allsorts was extracted from the Prince HTML to PDF typesetting and layout tool and is implemented in Rust.
> Font shaping is the process of laying out the glyphs of a font in order to represent some input text. Rasterisation of the glyphs is a separate process. Font shaping for Latin text is quite simple. For some scripts, like those used by Indic languages, it is quite complex and requires reordering and substituting the glyphs in each syllable to produce the final output. There are only three main font shaping engines in use today: DirectWrite on Windows, CoreText on macOS and iOS, and HarfBuzz on open-source operating systems and some web-browsers. Of these, only HarfBuzz is open source.
Text Editing Hates You Too
> Alexis Beingessner’s Text Rendering Hates You, published exactly a month ago today, hits very close to my heart.
> Back in 2017, I was building a rich text editor in the browser. Unsatisfied with existing libraries that used ContentEditable, I thought to myself “hey, I’ll just reimplement text selection myself! How difficult could it possibly be?” I was young. Naive. I estimated it would take two weeks. In reality, attempting to solve this problem would consume several years of my life, and even landed me a full time job for a year implementing text editing for a new operating system.
An unexpected character replacement
> A few weeks ago I found a replacement in GBIF that I’d never seen before: M<fc>ller. It was a hexadecimal value for the character “ü” enclosed in angle brackets. That particular hex value for “ü” appears in Windows-1252 and other encodings, but what program did this replacement? And why?
Text Rendering Hates You
> Rendering text, how hard could it be? As it turns out, incredibly hard! To my knowledge, literally no system renders text “perfectly”. It’s all best-effort, although some efforts are more important than others.
I lost it at multicolored ligatures.
The (Mostly) True Story of Helvetica and the New York City Subway
> There is a commonly held belief that Helvetica is the signage typeface of the New York City subway system, a belief reinforced by Helvetica, Gary Hustwit’s popular 2007 documentary about the typeface. But it is not true—or rather, it is only somewhat true. Helvetica is the official typeface of the MTA today, but it was not the typeface specified by Unimark International when it created a new signage system at the end of the 1960s. Why was Helvetica not chosen originally? What was chosen in its place? Why is Helvetica used now, and when did the changeover occur? To answer those questions this essay explores several important histories: of the New York City subway system, transportation signage in the 1960s, Unimark International and, of course, Helvetica. These four strands are woven together, over nine pages, to tell a story that ultimately transcends the simple issue of Helvetica and the subway.
> It’s Not Wrong that “🤦🏼♂️”.length == 7 But It’s Better that “🤦🏼♂️”.len() == 17 and Rather Useless that len(“🤦🏼♂️“) == 5
> The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skintone modifier the changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.
And then we move on from there, in quite some depth.
ASCII table and history
> To understand why Control+i inserts a Tab in your terminal you need to understand ASCII, and to understand ASCII you need know a bit about its history and the world it was developed in. Please bear with me (or just go the table).
> Most teleprinters communicated using the ITA2 protocol. For the most part this would just encode the alphabet, but there are a few control codes: WRU (“Who R U”) would cause the receiving teleprinter to send back its identification, BEL would ring a bell, and it had the familiar CR (Carriage Return) and LF (Line Feed).
Interview with Bill Joy
> The following interview is taken from the August 1984 issue of Unix Review magazine.
A lot of text editor history here, featuring of course, vi.
> I think it killed the performance on a lot of the systems in the Labs for years because everyone had their own copy of it, but it wasn’t being shared, and so they wasted huge amounts of memory back when memory was expensive. With 92 people in the Labs maintaining vi independently, I think they ultimately wasted incredible amounts of money. I was surprised about vi going in, though, I didn’t know it was in System V. I learned about it being in System V quite a while after it had come out.
Plus some commentary on other topics.
> The point is that you want to have a system that is responsive. You don’t want a car that talks to you. I’ll never buy a car that says, “Good morning.” The neat thing about UNIX is that it is very responsive. You just say, “A pipe to B” - it doesn’t blather at you that “execution begins,” or “execution terminated, IEFBR14.”
> The trouble is that UNIX is not accessible, not transparent in the way that Interleaf is, where you sit down and start poking around in the menu and explore the whole system. Someone I know sat down with a Macintosh and a Lisa and was disappointed because, in a half hour, he explored the whole system and there wasn’t as much as he thought. That’s true, but the point is in half an hour, almost without a manual you can know which button to push and you can find nearly everything. Things don’t get lost. I think that’s the key.
Modern text rendering with Linux
> Welcome to part 1 of Modern text rendering in Linux. In each part of this series we will build a self-contained C program to render a character or sequence of characters. Each of these programs will implement a feature which I consider essential to achieve state of the art text rendering.
> In this first part I will show how to setup FreeType and we will build a console character renderer.
Announcing code annotations for SourceHut
> A lot of design thought went into this feature, but I knew one thing from the outset: I wanted to make a generic system that users could use to annotate their source code in any manner they chose. My friend Andrew Kelley (of Zig fame) once expressed to me his frustration with GitHub’s refusal to implement syntax highlighting for “small” languages, citing a shortage of manpower. It’s for this reason that it’s important to me that SourceHut’s open-source platform allows users large and small to volunteer to build the perfect integration for their needs - I don’t scale alone.
A program to detect mojibake that results from a UTF-8-encoded file being misinterpreted as code page 1252
> It is not uncommon to see UTF-8-encoded data misinterpreted as code page 1252, due to file content lacking an explicit encoding declaration, such as you might encounter with the Resource Compiler. So let’s write a program to detect that specific kind of mojibake.
Why you should learn just a little Awk: An Awk tutorial by Example
> In grad school, I once saw a prof I was working with grab a text file and in seconds manipulate it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.
Lyrics Site Accuses Google of Lifting Its Content
> Starting around 2016, Genius said, the company made a subtle change to some of the songs on its website, alternating the lyrics’ apostrophes between straight and curly single-quote marks in exactly the same sequence for every song.
> When the two types of apostrophes were converted to the dots and dashes used in Morse code, they spelled out the words “Red Handed.”
Unicode programming, with examples
> Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels South Asian languages, and determine when characters may look visually confusable.
The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8
> Text editors nowadays will happily “help” you out by silently converting to UTF-8, but I don’t know of any that silently convert to Windows-1252.
UTF-8 String Indexing Strategies
> One issue to consider is that strings typically feature random access indexing of code points with a time complexity resembling constant time (O(1)). However, not all string representations actually support this well. Strings using variable length encoding, such as UTF-8 or UTF-16, have O(n) time complexity indexing, ignoring special cases (discussed below). The most obvious choice to achieve O(1) time complexity — an array of 32-bit values, as in UCS-4 — makes very inefficient use of memory, especially with typical strings.
> Despite this, UTF-8 is still chosen in a number of programming languages, or at least in their implementations. In this article I’ll discuss four examples — Emacs Lisp, Julia, and Go — and how each takes a slightly different approach.
Parsing Gigabytes of JSON per Second
> Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.