UTF-8 String Indexing Strategies

https://nullprogram.com/blog/2019/05/29/ [nullprogram.com]

2019-06-04 05:16

One issue to consider is that strings typically feature random access indexing of code points with a time complexity resembling constant time (O(1)). However, not all string representations actually support this well. Strings using a variable-length encoding, such as UTF-8 or UTF-16, have O(n) time complexity indexing, ignoring special cases (discussed below). The most obvious choice to achieve O(1) time complexity, an array of 32-bit values as in UCS-4, makes very inefficient use of memory, especially with typical strings.
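To make the O(n) cost concrete, here is a minimal Python sketch of code point indexing over raw UTF-8 bytes. The function names `codepoint_offset` and `codepoint_at` are hypothetical, and the input is assumed to be valid UTF-8. It is only an illustration of the general scanning approach, not how any of the languages discussed actually implement indexing.

```python
def codepoint_offset(buf: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` (0-based).

    Assumes `buf` holds valid UTF-8. The scan starts at byte 0, so the
    cost grows linearly with the position being looked up.
    """
    count = 0
    for offset, byte in enumerate(buf):
        # Continuation bytes have the form 0b10xxxxxx; every other byte
        # (ASCII or a lead byte) begins a new code point.
        if byte & 0xC0 != 0x80:
            if count == index:
                return offset
            count += 1
    raise IndexError("code point index out of range")


def codepoint_at(buf: bytes, index: int) -> str:
    """Return the code point at `index` as a one-character string."""
    start = codepoint_offset(buf, index)
    end = start + 1
    # Extend the slice over any continuation bytes that follow.
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1
    return buf[start:end].decode("utf-8")


text = "héllo"
utf8 = text.encode("utf-8")       # variable width: 1 to 4 bytes per code point
ucs4 = text.encode("utf-32-le")   # fixed width: 4 bytes per code point
third = codepoint_at(utf8, 2)     # requires scanning past "h" and "é"
```

With a fixed-width encoding such as `ucs4` above, the same lookup is just a slice at byte offset `4 * index`, which is the O(1) behavior paid for with extra memory.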

Despite this, UTF-8 is still chosen in a number of programming languages, or at least in their implementations. In this article I’ll discuss three examples (Emacs Lisp, Julia, and Go) and how each takes a slightly different approach.

source: L