Text Rendering Hates You
> Rendering text, how hard could it be? As it turns out, incredibly hard! To my knowledge, literally no system renders text “perfectly”. It’s all best-effort, although some efforts are more important than others.
I lost it at multicolored ligatures.
The (Mostly) True Story of Helvetica and the New York City Subway
> There is a commonly held belief that Helvetica is the signage typeface of the New York City subway system, a belief reinforced by Helvetica, Gary Hustwit’s popular 2007 documentary about the typeface. But it is not true—or rather, it is only somewhat true. Helvetica is the official typeface of the MTA today, but it was not the typeface specified by Unimark International when it created a new signage system at the end of the 1960s. Why was Helvetica not chosen originally? What was chosen in its place? Why is Helvetica used now, and when did the changeover occur? To answer those questions this essay explores several important histories: of the New York City subway system, transportation signage in the 1960s, Unimark International and, of course, Helvetica. These four strands are woven together, over nine pages, to tell a story that ultimately transcends the simple issue of Helvetica and the subway.
It’s Not Wrong that “🤦🏼‍♂️”.length == 7 But It’s Better that “🤦🏼‍♂️”.len() == 17 and Rather Useless that len(“🤦🏼‍♂️”) == 5
> The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skin tone modifier that changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.
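All three lengths from the title can be reproduced directly. A small sketch in Python, which counts scalar values with `len` and can produce the other two counts by encoding:

```python
# Build the emoji from its five scalar values, exactly as described above.
s = (
    "\U0001F926"  # FACE PALM base character
    "\U0001F3FC"  # EMOJI MODIFIER FITZPATRICK TYPE-3 (skin tone)
    "\u200D"      # ZERO WIDTH JOINER
    "\u2642"      # MALE SIGN
    "\uFE0F"      # VARIATION SELECTOR-16 (request multicolor emoji rendering)
)

print(len(s))                           # 5  scalar values (Python's len)
print(len(s.encode("utf-8")))           # 17 UTF-8 code units (Rust's str::len)
print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units (JavaScript's .length)
```

The two astral-plane characters each take two UTF-16 code units (surrogate pairs) and four UTF-8 bytes, which is where 7 and 17 come from.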
And then we move on from there, in quite some depth.
ASCII table and history
> To understand why Control+i inserts a Tab in your terminal you need to understand ASCII, and to understand ASCII you need to know a bit about its history and the world it was developed in. Please bear with me (or just go to the table).
> Most teleprinters communicated using the ITA2 protocol. For the most part this would just encode the alphabet, but there are a few control codes: WRU (“Who R U”) would cause the receiving teleprinter to send back its identification, BEL would ring a bell, and it had the familiar CR (Carriage Return) and LF (Line Feed).
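The Control+i trick falls straight out of the ASCII table: the Ctrl key historically just cleared the top bits of the character code. A minimal sketch (the `ctrl` helper is mine, not from the article):

```python
def ctrl(ch: str) -> str:
    # Ctrl clears bits 5 and 6: 'I' is 0x49, and 0x49 & 0x1F == 0x09, i.e. TAB.
    return chr(ord(ch.upper()) & 0x1F)

print(repr(ctrl("i")))  # '\t'   (Tab)
print(repr(ctrl("m")))  # '\r'   (Carriage Return)
print(repr(ctrl("g")))  # '\x07' (BEL, the teleprinter bell)
```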
Interview with Bill Joy
> The following interview is taken from the August 1984 issue of Unix Review magazine.
A lot of text editor history here, featuring, of course, vi.
> I think it killed the performance on a lot of the systems in the Labs for years because everyone had their own copy of it, but it wasn’t being shared, and so they wasted huge amounts of memory back when memory was expensive. With 92 people in the Labs maintaining vi independently, I think they ultimately wasted incredible amounts of money. I was surprised about vi going in, though, I didn’t know it was in System V. I learned about it being in System V quite a while after it had come out.
Plus some commentary on other topics.
> The point is that you want to have a system that is responsive. You don’t want a car that talks to you. I’ll never buy a car that says, “Good morning.” The neat thing about UNIX is that it is very responsive. You just say, “A pipe to B” - it doesn’t blather at you that “execution begins,” or “execution terminated, IEFBR14.”
> The trouble is that UNIX is not accessible, not transparent in the way that Interleaf is, where you sit down and start poking around in the menu and explore the whole system. Someone I know sat down with a Macintosh and a Lisa and was disappointed because, in a half hour, he explored the whole system and there wasn’t as much as he thought. That’s true, but the point is in half an hour, almost without a manual you can know which button to push and you can find nearly everything. Things don’t get lost. I think that’s the key.
Modern text rendering with Linux
> Welcome to part 1 of Modern text rendering in Linux. In each part of this series we will build a self-contained C program to render a character or sequence of characters. Each of these programs will implement a feature which I consider essential to achieve state of the art text rendering.
> In this first part I will show how to set up FreeType and we will build a console character renderer.
Announcing code annotations for SourceHut
> A lot of design thought went into this feature, but I knew one thing from the outset: I wanted to make a generic system that users could use to annotate their source code in any manner they chose. My friend Andrew Kelley (of Zig fame) once expressed to me his frustration with GitHub’s refusal to implement syntax highlighting for “small” languages, citing a shortage of manpower. It’s for this reason that it’s important to me that SourceHut’s open-source platform allows users large and small to volunteer to build the perfect integration for their needs - I don’t scale alone.
A program to detect mojibake that results from a UTF-8-encoded file being misinterpreted as code page 1252
> It is not uncommon to see UTF-8-encoded data misinterpreted as code page 1252, due to file content lacking an explicit encoding declaration, such as you might encounter with the Resource Compiler. So let’s write a program to detect that specific kind of mojibake.
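The article builds its own detector; the core idea can be sketched in a few lines of Python. Assuming strict round-tripping, text that was UTF-8 misread as code page 1252 will re-encode to 1252 and then decode cleanly as UTF-8:

```python
def looks_like_utf8_mojibake(text: str) -> bool:
    """Heuristic: if the text re-encodes as cp1252 and the resulting bytes
    decode as UTF-8 to something different, the text was probably UTF-8
    misinterpreted as cp1252 (e.g. "Ã©" is really "é")."""
    try:
        raw = text.encode("cp1252")
        decoded = raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return decoded != text

print(looks_like_utf8_mojibake("caf\u00c3\u00a9"))  # True: "café" misread as cp1252
print(looks_like_utf8_mojibake("café"))             # False: already correct
```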
Why you should learn just a little Awk: An Awk tutorial by Example
> In grad school, I once saw a prof I was working with grab a text file and in seconds manipulate it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.
Lyrics Site Accuses Google of Lifting Its Content
> Starting around 2016, Genius said, the company made a subtle change to some of the songs on its website, alternating the lyrics’ apostrophes between straight and curly single-quote marks in exactly the same sequence for every song.
> When the two types of apostrophes were converted to the dots and dashes used in Morse code, they spelled out the words “Red Handed.”
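Decoding the watermark is a tiny exercise. A sketch in Python; which apostrophe maps to dot versus dash is my assumption, since the article only says the sequence spells out “Red Handed”:

```python
# Partial Morse table, just enough for the demo.
MORSE = {".-.": "R", ".": "E", "-..": "D"}

def decode_watermark(per_letter: list) -> str:
    """Each element holds the apostrophes for one Morse letter:
    straight ' = dot, curly \u2019 = dash (assumed mapping)."""
    dots = {"'": ".", "\u2019": "-"}
    return "".join(
        MORSE["".join(dots[a] for a in seq)] for seq in per_letter
    )

print(decode_watermark(["'\u2019'", "'", "\u2019''"]))  # RED
```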
Unicode programming, with examples
> Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.
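A few of those per-character properties are directly queryable from Python’s standard `unicodedata` module:

```python
import unicodedata

print(unicodedata.category("A"))            # 'Lu' -- uppercase letter
print(unicodedata.category("\u0660"))       # 'Nd' -- ARABIC-INDIC DIGIT ZERO is a decimal digit
print(unicodedata.digit("\u0660"))          # 0   -- and it carries its numeric value
print(unicodedata.bidirectional("\u05d0"))  # 'R' -- HEBREW LETTER ALEF is right-to-left
print(unicodedata.name("\u00e9"))           # 'LATIN SMALL LETTER E WITH ACUTE'
```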
The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8
> Text editors nowadays will happily “help” you out by silently converting to UTF-8, but I don’t know of any that silently convert to Windows-1252.
UTF-8 String Indexing Strategies
> One issue to consider is that strings typically feature random access indexing of code points with a time complexity resembling constant time (O(1)). However, not all string representations actually support this well. Strings using variable length encoding, such as UTF-8 or UTF-16, have O(n) time complexity indexing, ignoring special cases (discussed below). The most obvious choice to achieve O(1) time complexity — an array of 32-bit values, as in UCS-4 — makes very inefficient use of memory, especially with typical strings.
> Despite this, UTF-8 is still chosen in a number of programming languages, or at least in their implementations. In this article I’ll discuss three examples — Emacs Lisp, Julia, and Go — and how each takes a slightly different approach.
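The O(n) cost is easy to see: the byte offset of the n-th code point depends on the width of every code point before it. A minimal scan over (assumed-valid) UTF-8, sketched in Python:

```python
def codepoint_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point in valid UTF-8.
    Must walk every preceding lead byte, hence O(n)."""
    offset = 0
    for _ in range(n):
        lead = data[offset]
        if lead < 0x80:
            offset += 1  # 1-byte sequence (ASCII)
        elif lead < 0xE0:
            offset += 2  # 2-byte sequence
        elif lead < 0xF0:
            offset += 3  # 3-byte sequence
        else:
            offset += 4  # 4-byte sequence
    return offset

b = "naïve".encode("utf-8")    # 6 bytes for 5 code points
print(codepoint_offset(b, 3))  # 4 -- the 'ï' before it occupies two bytes
print(b[codepoint_offset(b, 3):codepoint_offset(b, 4)].decode())  # 'v'
```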
Parsing Gigabytes of JSON per Second
> Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.
> last week i got to witness an engineering department lose a full day’s work because if you put an emoji in a git commit message, Atlassian Bamboo chokes on it forever and you’re forced to rebase master, like you should NEVER DO.
docbook2mdoc — DocBook to mdoc converter
> The docbook2mdoc utility is a converter from DocBook version 4.x and 5.x XML to mdoc. Unlike most DocBook utilities, it’s a standalone ISC-licensed ISO C utility with no external dependencies that should compile on any modern UNIX system.
> docbook2mdoc is no longer experimental, but many features are likely to still be missing, and the rendering of many elements is not ideal yet. However, it can already be used for production purposes and was tested on parts of the OpenGL and X.Org DocBook corpuses.
Parsing: The Solved Problem That Isn't
> One of the things that’s become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. Composition is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL…
I used the Right-to-Left UTF-8 character
> in my info on GitHub. Makes it easy to spot automated emails (and it’s also fun to see UIs importing the data breaking)
All you need to know about hyphenation in CSS
> There are two steps required to turn on automatic hyphenation.
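For the record, the two steps are declaring the content language (e.g. `<html lang="en">`, so the browser can pick the right hyphenation dictionary) and then enabling hyphenation in CSS. A minimal sketch; browser support varies:

```css
/* Step 2: ask the browser to hyphenate, using the dictionary
   for the language declared on the element (step 1). */
p {
  -webkit-hyphens: auto; /* Safari still needs the prefix */
  hyphens: auto;
}
```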
Plus an additional 101 prefixed properties that aren’t fully supported.
> Swift 5 switches the preferred encoding of strings from UTF-16 to UTF-8 while preserving efficient Objective-C-interoperability. Because the String type abstracts away these low-level concerns, no source-code changes from developers should be necessary*, but it’s worth highlighting some of the benefits this move gives us now and in the future.