How can CharUpper and CharLower guarantee that the uppercase version of a string is the same length as the lowercase version?
Let's build a Full-Text Search engine
> Today we are going to build our own FTS engine. By the end of this post, we’ll be able to search across millions of documents in less than a millisecond. We’ll start with simple search queries like “give me all documents that contain the word cat” and we’ll extend the engine to support more sophisticated boolean queries.
SAT solver on top of regex matcher
> A SAT problem is an NP-problem, while regex matching is not. However, a quite popular regex ‘backreferences’ extension extends regex matching to a (hard) NP-problem.
> I still believe it would be possible to build a high quality editor based on the original design. But I also believe that this would be quite a complex system, and require significantly more work than necessary.
A few good ideas and observations could be mined out of this post.
Unicode Security Considerations
> Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This is especially important as more and more products are internationalized. This document describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account, and provides specific recommendations to reduce the risk of problems.
A large number of problems as well.
> Yesterday Apple released iOS 13.5 beta 3 (seemingly renaming iOS 13.4.5 to 13.5 there), and that killed one of my bugs. It wasn’t just any bug though, it was the first 0day I had ever found. And it was probably also the best one. Not necessarily for how much it gives you, but certainly for how much I’ve used it for, and also for how ridiculously simple it is. So simple, in fact, that the PoC I tweeted out looks like an absolute joke. But it’s 100% real.
> I dubbed it “psychic paper” because, just like the item by that name that Doctor Who likes to carry, it allows you get past security checks and make others believe you have a wide range of credentials that you shouldn’t have.
Notes on Parsing in Rust
> I’ve recently been writing a bit of parsing code in Rust, and I’ve been jumping back and forth between a few different parsing libraries - they all have different advantages and disadvantages, so I wanted to write up some notes here to help folks who are undecided choose what libraries and techniques to consider, and also to offer some suggestions for the future of the Rust parsing ecosystem.
Hashtag of note
> You will probably notice immediately that it contains a full-width dash, in other words a Unicode (probably Chinese-origin?) character. For some reason, this is all over Twitter in posts from Anglophone people I am almost completely sure have no input method installed that can actually produce it.
> It’s not a real dash at all but a “Katakana-Hiragana prolonged sound mark“:
The unexpected Google wide domain check bypass
> Let me tell you this “funny” story of me trying to bypass a domain check in a little webapp, and acidentally bypassing a URL parser that is used in (almost) every Google product.
Spoiler: it’s a regex bug.
Another developer font. With a fancy web site to explain the design.
Introducing Glush: a robust, human readable, top-down parser compiler
> It’s been 45 years since Stephen Johnson wrote Yacc (Yet another compiler-compiler), a parser generator that made it possible for anyone to write fast, efficient parsers. Yacc, and its many derivatives, quickly became popular and were included in many Unix distributions. You would imagine that in 45 years we would have further perfected the art of creating parsers and would have standardized on a single tool. A lot of progress has been made, but there are still annoyances and problems affecting every tool out there.
This is great, even just for the overview of parsing.
> The CYK algorithm (named after Cocke–Younger–Kasami) is in my opinion of great theoretical importance when it comes to parsing context-free grammars. CYK will parse all context-free parsers in O(n3), including the “simple” grammars that LL/LR can parse in linear time. It accomplishes this by converting parsing into a different problem: CYK shows that parsing context-free languages is equivalent to doing a boolean matrix multiplication. Matrix multiplication can be done naively in cubic time, and as such parsing context-free languages can be done in cubic time. It’s a very satisfying theoretical result, and the actual algorithm is small and easy to understand.
Hacking GitHub with Unicode's dotless 'i'.
> GitHub’s forgot password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the reset password link to the email address provided by the attacker- which was technically speaking, not the same email address.
This is beautiful.
Teletext’s creative legacy lives on
> Like Walkmans and VHS recorders, teletext now seems impossibly quaint. But designer and writer Craig Oldham explains that not only was Teletext a revolutionary technology in its prime, its creative legacy lives on with a new generation of artists who love its creative limits.
Announcing the Allsorts Font Shaping Engine
> Today YesLogic is open-sourcing the Allsorts font parser, shaping engine, and subsetter for OpenType, WOFF, and WOFF2 under the Apache 2.0 license. Allsorts was extracted from the Prince HTML to PDF typesetting and layout tool and is implemented in Rust.
> Font shaping is the process of laying out the glyphs of a font in order to represent some input text. Rasterisation of the glyphs is a separate process. Font shaping for Latin text is quite simple. For some scripts, like those used by Indic languages, it is quite complex and requires reordering and substituting the glyphs in each syllable to produce the final output. There are only three main font shaping engines in use today: DirectWrite on Windows, CoreText on macOS and iOS, and HarfBuzz on open-source operating systems and some web-browsers. Of these, only HarfBuzz is open source.
Text Editing Hates You Too
> Alexis Beingessner’s Text Rendering Hates You, published exactly a month ago today, hits very close to my heart.
> Back in 2017, I was building a rich text editor in the browser. Unsatisfied with existing libraries that used ContentEditable, I thought to myself “hey, I’ll just reimplement text selection myself! How difficult could it possibly be?” I was young. Naive. I estimated it would take two weeks. In reality, attempting to solve this problem would consume several years of my life, and even landed me a full time job for a year implementing text editing for a new operating system.
An unexpected character replacement
> A few weeks ago I found a replacement in GBIF that I’d never seen before: M<fc>ller. It was a hexadecimal value for the character “ü” enclosed in angle brackets. That particular hex value for “ü” appears in Windows-1252 and other encodings, but what program did this replacement? And why?
Text Rendering Hates You
> Rendering text, how hard could it be? As it turns out, incredibly hard! To my knowledge, literally no system renders text “perfectly”. It’s all best-effort, although some efforts are more important than others.
I lost it at multicolored ligatures.
The (Mostly) True Story of Helvetica and the New York City Subway
> There is a commonly held belief that Helvetica is the signage typeface of the New York City subway system, a belief reinforced by Helvetica, Gary Hustwit’s popular 2007 documentary about the typeface. But it is not true—or rather, it is only somewhat true. Helvetica is the official typeface of the MTA today, but it was not the typeface specified by Unimark International when it created a new signage system at the end of the 1960s. Why was Helvetica not chosen originally? What was chosen in its place? Why is Helvetica used now, and when did the changeover occur? To answer those questions this essay explores several important histories: of the New York City subway system, transportation signage in the 1960s, Unimark International and, of course, Helvetica. These four strands are woven together, over nine pages, to tell a story that ultimately transcends the simple issue of Helvetica and the subway.
> It’s Not Wrong that “🤦🏼♂️”.length == 7 But It’s Better that “🤦🏼♂️”.len() == 17 and Rather Useless that len(“🤦🏼♂️“) == 5
> The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skintone modifier the changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.
And then we move on from there, in quite some depth.
ASCII table and history
> To understand why Control+i inserts a Tab in your terminal you need to understand ASCII, and to understand ASCII you need know a bit about its history and the world it was developed in. Please bear with me (or just go the table).
> Most teleprinters communicated using the ITA2 protocol. For the most part this would just encode the alphabet, but there are a few control codes: WRU (“Who R U”) would cause the receiving teleprinter to send back its identification, BEL would ring a bell, and it had the familiar CR (Carriage Return) and LF (Line Feed).