Cutting a String is much harder than you think!

Mythical 5th, March 2024

Wikipedia makes the following claim about UTF-8 encoding:

Reading from a stream[, a processor] can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication.

Wikipedia, UTF-8: Comparison with other encodings

During testing of a recent release of the Scratchpad browser extension, this claim unravelled, at least in terms of how UTF-8 is used.

Because of the limitations of Chrome's Synced Storage area for extensions (at the time of Scratchpad's version 1.0), the maximum size of the saved input in Scratchpad is set at 7,000 bytes. After the first 7,000 bytes, everything is discarded (with a suitable warning to the user). But what happens if the 7,000th byte falls in the middle of a character? This is not normally a problem because a JavaScript engine will substitute a replacement character for the incomplete character at the end of a cut string. The replacement character can be removed and the result is a less-than-7,000-byte string of properly valid text.

If you search the web for javascript cut string at byte, you will find that this is a common task. Suggested methods involve encoding the string into a form where each byte is represented individually, and then calculating where the nth byte falls. The cleanest method, and very fast for strings of around 7,000 bytes, uses a TextEncoder, which transforms the String into a typed array of bytes. Here's a TextEncoder being used to cut a string, resulting in a replacement character:

> function cutString(str, byteCount)
  {
    const te = new TextEncoder(),
      td = new TextDecoder(),
      enc = te.encode(str),
      cut = enc.slice(0, byteCount),
      res = td.decode(cut);
    return res;
  }
> cutString("abçdef", 3)
  'ab�'
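
Since the replacement character is U+FFFD, the trailing run mentioned earlier can be removed with a regular expression. As a minimal sketch:

> cutString("abçdef", 3).replace(/\uFFFD+$/, "")
  'ab'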

But when testing Scratchpad, it just so happened that a raised fist symbol (✊🏼) began at the 6,998th byte. The result was surprising!

> cutString("x".repeat(6997) + "✊🏼", 7000)
  '...xxx✊'

We need to look at what's going on behind the scenes. TextEncoder.encode() returns a Uint8Array, with each element containing one byte of the encoded string. We can convert that into an Array and then map each element to its binary representation.

> function explainCharacter(char)
  {
    const enc = new TextEncoder().encode(char),
      mapBits = [...enc].map(i => i.toString(2).padStart(8, "0")),
      codePoint = [...Array(char.length)]
        .map((cur, idx) => char.codePointAt(idx).toString(16));
    console.group("Explain Character");
    console.log("Character:", char);
    console.log("String.length:", char.length);
    console.group("Bytes:", enc.length);
    console.log(...mapBits);
    console.groupEnd();
    console.log("Code Points:", ...codePoint);
    console.groupEnd();
  }
> explainCharacter("✊🏼")
▼ Explain Character
    Character: ✊🏼
    String.length: 3
  ▼ Bytes: 7
      11100010 10011100 10001010 11110000 10011111 10001111 10111100
    Code Points: 270a 1f3fc dffc

The UTF-8 conversion table in the aforementioned Wikipedia article shows that the "✊🏼" emoji's binary representation is a 3-byte sequence followed by a 4-byte sequence. The leading bits of the first byte in a sequence (here 1110) indicate how many bytes the sequence contains, and each of the next two bytes begins with 10 to mark it as a continuation of the sequence rather than the start of a new one.
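
As a minimal sketch of that rule (sequenceLength() is just an illustrative helper, not part of any API), the length of a sequence can be read straight from the high bits of its lead byte:

> function sequenceLength(byte)
  {
    if (byte >> 7 === 0b0) return 1;     // 0------- single-byte (ASCII)
    if (byte >> 5 === 0b110) return 2;   // 110-----
    if (byte >> 4 === 0b1110) return 3;  // 1110----
    if (byte >> 3 === 0b11110) return 4; // 11110---
    return 0;                            // 10------ continuation byte
  }
> sequenceLength(0b11100010)
  3
> sequenceLength(0b10011100)
  0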

This explains the article's claim about decoding each byte sequence without having to see the first byte of the next one. When a byte starting with 1110 is received, the processor knows it needs to receive another two bytes and process the three of them as a single sequence. But where a user-perceived character consists of one character followed by a modifying byte sequence, both sequences need to be received before the first can be displayed correctly.
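
This behaviour is easy to observe with TextDecoder's streaming mode, in which decode() is passed { stream: true } so that incomplete sequences are held back between calls. In this sketch (streamDecode() is just an illustrative name), each byte of the emoji is fed in individually; the raised fist is emitted as soon as its third byte arrives, before any byte of the skin-tone modifier has been seen:

> function streamDecode(str)
  {
    const bytes = new TextEncoder().encode(str),
      decoder = new TextDecoder();
    [...bytes].forEach((b, idx) => {
      const out = decoder.decode(new Uint8Array([b]), { stream: true });
      if (out) console.log(`byte ${idx + 1}:`, out);
    });
  }
> streamDecode("✊🏼")
  byte 3: ✊
  byte 7: 🏼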

The explainCharacter() function above outputs three Code Points for the "✊🏼" emoji because JavaScript says its length is 3 and the function is concise rather than perfect: the third value, dffc, is just the trailing half of the surrogate pair which encodes 1f3fc. The first two Code Points are sufficient to represent the character. A Code Point is the number Unicode assigns to a character; in UTF-8 it is carried in the non-structural bits of the byte sequence.

UTF-8 structure: 1110---- 10------ 10------
Output above:    11100010 10011100 10001010
Code Point:          0010011100001010

So the first Code Point we have is 0010011100001010 in binary, which is 270a in hexadecimal. That can be referenced within JavaScript as \u{270a}, and in the Console it gives us the output we saw above.

> "\u{270a}"
  '✊'
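
To check the arithmetic, here's a minimal sketch (decodeThreeByteSequence() is a hypothetical helper which handles 3-byte sequences only) that masks off the structural bits and reassembles the Code Point:

> function decodeThreeByteSequence(b1, b2, b3)
  {
    return ((b1 & 0b1111) << 12) | // 4 payload bits from the lead byte
      ((b2 & 0b111111) << 6) |     // 6 payload bits from each
      (b3 & 0b111111);             // continuation byte
  }
> decodeThreeByteSequence(0b11100010, 0b10011100, 0b10001010).toString(16)
  '270a'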

The second Code Point happens to be a Fitzpatrick Type-3 emoji modifier. The Fitzpatrick scale is a classification of human skin tones, which makes a lot of sense here.

Here's the Type-6 modifier in action.

> "\u{270a}\u{1f3ff}"
  '✊🏿'
> "✊\u{1f3ff}"
  '✊🏿'

Whereas a computer sees a string as a sequence of bytes, a user sees their text as a sequence of characters. To satisfy the user when we cut a string to be within a maximum byte size, we need to terminate at the last character to fall within the imposed limit. This means that if we find all or part of a Fitzpatrick modifier at the start of the bytes we need to discard, then we should also discard the character before it. Problem solved? Well, let's look at some other emojis to check.

> explainCharacter("🇦🇨")
▼ Explain Character
    Character: 🇦🇨
    String.length: 4
  ▼ Bytes: 8
      11110000 10011111 10000111 10100110 11110000 10011111 10000111 10101000
    Code Points: 1f1e6 dde6 1f1e8 dde8

Thus the two Code Points which represent the flag of Ascension Island are 1f1e6 and 1f1e8.

> "\u{1f1e6}\u{1f1e8}"
  '🇦🇨'

Both of these are from a block of 26 Unicode Regional Indicator Symbols. This time, the first Code Point on its own is not a character anyone would use. Similarly, combining three indicator symbols doesn't produce a usable character. However, reversing the two Code Points results in new valid output—the flag of Canada. That's clever! The 2-letter country code for Ascension Island is AC, and the one for Canada is the reverse of that—CA; the flags are just the country codes being interpreted by the operating system or platform they are displayed on. That provides a tidy fall-back on systems which do not want to show the flags, but unfortunately a string which is cut between a pair of symbols will terminate messily.

> "\u{1f1e6}"
  '🇦'
> "\u{1f1e6}\u{1f1e8}\u{1f1ef}"
  '🇦🇨🇯'
> "\u{1f1e8}\u{1f1e6}"
  '🇨🇦'
> "\u{1f1e8} \u{1f1e6}"
  '🇨 🇦'

So, to cut a string properly at a specific byte length, we have to look for a Fitzpatrick modifier at the cut point and discard the character which precedes it. We also have to look for regional indicators and make sure there is not an odd number of them at the end of the cut string. But is that sufficient? What if a regional indicator sometimes acts as a modifier, somewhere within the full Unicode specification, either now or in the future? Following from the raised fist example, the "woman's hat" symbol (👒, \u{1f452}) might become a "Peruvian woman's hat" symbol when followed by a regional indicator or two...or three. And there are almost certainly other modifiers and indicators which need to be considered.
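
For what it's worth, here's a sketch of what those two checks might look like. The cutStringAtCharacter() function is hypothetical and covers only the Fitzpatrick and regional indicator cases discussed above, so it is not a general solution:

> function cutStringAtCharacter(str, byteCount)
  {
    const enc = new TextEncoder().encode(str);
    let cut = new TextDecoder().decode(enc.slice(0, byteCount))
      .replace(/\uFFFD+$/, ""); // drop any partially received sequence
    // If a Fitzpatrick modifier starts the discarded bytes,
    // drop the base character as well.
    if (/^[\u{1f3fb}-\u{1f3ff}]/u.test(str.slice(cut.length)))
      cut = [...cut].slice(0, -1).join("");
    // Drop an unpaired trailing regional indicator
    // (each indicator is 2 code units long).
    const trailing = cut.match(/[\u{1f1e6}-\u{1f1ff}]+$/u)?.[0] ?? "";
    if ((trailing.length / 2) % 2 === 1)
      cut = cut.slice(0, -2);
    return cut;
  }
> cutStringAtCharacter("x".repeat(6997) + "✊🏼", 7000).includes("✊")
  false
> cutStringAtCharacter("x".repeat(6996) + "🇦🇨", 7000).includes("🇦")
  false

It handles those two cases, but only those.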

This bug is not trivial to fix then, and fixing it may not be possible without a lookup table, which might become out-of-date and cause an error worse than the original bug. Hmmm...perhaps nobody will ever notice the bug in the first place...