This is the third and final post in a series, building on the previous two, which introduce pre-tokenization and explore edge cases in earlier models. You probably want to read them in order, as they get progressively more complex.
Unreachable tokens
In our recent paper we talked about “unreachable tokens”, and defined them as those tokens for which decoding the token to a string, and re-tokenizing this string, does not result in the token id.
Typically this means there is no input text at all which results in the token being used, and it is an effective way to catch certain configuration errors.
For example, a model may re-use a tokenizer which includes the token ‘123’ in its vocabulary, but then turn digit splitting on, resulting in the pre-tokenization ['1', '2', '3']. As a result, the token ‘123’ is never used, and could have been removed.
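To make this concrete, here is a minimal sketch of that failure mode, using a made-up pre-tokenization pattern that splits every digit on its own (the pattern is purely illustrative, not any particular model's actual one):

import regex

# Hypothetical pattern: words, single digits, punctuation, whitespace
pat = r" ?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+|\s+"
regex.findall(pat, "order 123")  # ['order', ' ', '1', '2', '3']
# No input can ever produce the pre-token '123', so a vocabulary entry '123' would be dead weight.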
However, there are certain cases in which a token is detected as unreachable, but there is still some input that results in it. GPT-4o has a surprising number of these, all related to its pre-tokenization strategy.
Listing unreachable tokens
Listing all unreachable tokens in GPT-4o is straightforward with tiktoken:
import tiktoken

enco = tiktoken.get_encoding("o200k_base")
for i in enco._mergeable_ranks.values():
    s = enco.decode([i])
    tokens = enco.encode(s)
    if '�' not in s and tokens != [i]:
        # print the columns shown in the table below
        print(i, repr(s), tokens, [enco.decode([t]) for t in tokens])
This gives us the following list:

| Token id | Decoded token | Re-encoded token ids | Re-encoded strings |
|---|---|---|---|
| 3413 | " I'" | [357, 6] | [' I', "'"] |
| 3914 | '\n//' | [198, 393] | ['\n', '//'] |
| 24091 | '\r\n//' | [370, 393] | ['\r\n', '//'] |
| 48235 | '\n///' | [198, 5991] | ['\n', '///'] |
| 63100 | '\n\n//' | [279, 393] | ['\n\n', '//'] |
| 65447 | '\n//\n//' | [198, 5754] | ['\n', '//\n//'] |
| 125141 | '\r\n\r\n//' | [1414, 393] | ['\r\n\r\n', '//'] |
| 175653 | '\n\n\n//' | [2499, 393] | ['\n\n\n', '//'] |
| 182292 | ' 天天中彩票APP' | [2783, 13444] | [' 天天中彩票', 'APP'] |
| 147008 | '无码AV' | [9070, 5345] | ['无码', 'AV'] |
| 99494 | ' 亚洲AV' | [12555, 5345] | [' 亚洲', 'AV'] |
| 193819 | '亚洲AV' | [6199, 5345] | ['亚洲', 'AV'] |
We can see three different types:

- The single token " I'"
- Seven tokens related to line breaks and slashes
- Four tokens which mix Chinese and upper-case Latin letters. These particular ones are also among the many tokens related to ads and spam.
We can explain all three categories by looking at pre-tokenization. However, some are easier to explain than others!
Pre-tokenization in GPT-4o
GPT-4o comes with a new tokenizer (“o200k_base”) which doubles the vocabulary size to around 200k. It also comes with a new pre-tokenization pattern, which is even more complicated than the ones used in previous models.
[^\r\n\p{L}\p{N}]?[letter]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?  # Normal words
|\p{N}{1,3}                                               # same: 1-3 digits
| ?[^\s\p{L}\p{N}]+[\r\n/]*                               # same: punctuation, but the trailing \r\n part now also allows '/'
|\s*[\r\n]+                                               # same: any whitespace ending in \r or \n
|\s+(?!\S)                                                # same: whitespace not followed by a non-whitespace character
|\s+                                                      # same: any whitespace
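To see a couple of these alternatives in action (the outputs in the comments are what I'd expect from the pattern above):

import regex
import tiktoken

enco = tiktoken.get_encoding("o200k_base")
regex.findall(enco._pat_str, "1234567")     # ['123', '456', '7']  -- digits in groups of at most three
regex.findall(enco._pat_str, "!\n// code")  # ['!\n//', ' code']   -- punctuation absorbs trailing newlines and '/'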
I’m glossing over how letters are handled for now, which relates to the more complicated case we’ll tackle later.
First, we can now see where the " I'" token comes from, and why it is a false positive for unreachable tokens: English suffixes like 'm are only kept together when preceded by a letter. This is a change from GPT-2 and GPT-4, where suffixes are pre-tokenized by themselves, and " I'" could never be a token. In GPT-4o, by contrast, a string like " I'm" stays together as a single pre-token, and the BPE merges inside it can produce " I'" as one of the resulting tokens.
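If you want to check which inputs actually reach it, a quick (and admittedly crude) way is to encode " I'" followed by each recognised suffix and look for token id 3413; I won't list the outputs here, since they depend on o200k's merge order:

import tiktoken

enco = tiktoken.get_encoding("o200k_base")
for suffix in ["s", "t", "re", "ve", "m", "ll", "d"]:
    text = " I'" + suffix
    ids = enco.encode(text)
    # 3413 is the " I'" token from the table above
    print(repr(text), ids, [enco.decode([t]) for t in ids], 3413 in ids)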
We can also explain the second category with slashes and newlines; these are only used when preceded by punctuation. Let’s have a quick look at another example:
import regex

enco = tiktoken.get_encoding("o200k_base")
regex.findall(enco._pat_str, "// hello, world\n//~bye")   # [..., ' world', '\n', '//~', 'bye']
regex.findall(enco._pat_str, "// hello, world!\n//~bye")  # [..., ' world', '!\n//', '~bye']
Once again we see that pre-tokenization with complex patterns is extremely sensitive. A single exclamation mark affects multiple tokens after it, making the splits completely different.
The source code of the tiktoken library contains some wonderful comments about this phenomenon where split points can disappear. They refer to it as “unstable” splits:
Unfortunately, the locations where our regex splits can be unstable.
cl100k_base makes our life hard by including the \s*[\r\n]+ pattern. This can e.g. cause "\n" + " " to become "\n \n".
Here is a quick and dirty fix
Technically, whether or not this arm is correct depends on whether there would be a regex split before the UTF-8 truncation point.
Probably niche enough that no one will ever notice (after all, people didn't notice all the big holes in the previous unstable token implementation)
This is also not straightforward. While we generally assume that regex splits are stable, unfortunately, they are not. That is, if adding bytes were to make a split appear in unstable_bytes, this could make tokens possible which our logic would otherwise think would be merged.
For example, with gpt2, the use of \s+(?!\S) means that "\n\n" could develop a split, e.g. "\n\n0" splits into "\n"+"\n"+"0", making "\n" a possible token.
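That last example is easy to reproduce with the GPT-2 pattern (expected outputs in the comments):

import regex
import tiktoken

gpt2_pat = tiktoken.get_encoding("gpt2")._pat_str
regex.findall(gpt2_pat, "\n\n")   # ['\n\n']           -- one pre-token
regex.findall(gpt2_pat, "\n\n0")  # ['\n', '\n', '0']  -- adding '0' makes a split appear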
CamelCase pre-tokenization splitting
Now that we’ve covered the relatively easy cases, it’s time for the mixed Chinese/Latin spam tokens. Why are they split up, and how did they become single tokens in the first place?
The [letter]+ pattern actually consists of two rather complicated alternatives. Simplified, they read:
[upper case letter]*[lower case letter]+
[upper case letter]+[lower case letter]*
Let’s start with a quick look at what this does in practice.
regex.findall(enco._pat_str, "camelCase CamelCase CAMELcase camelCASE")
# ['camel', 'Case', ' Camel', 'Case', ' CAMELcase', ' camel', 'CASE']
By not allowing multiple switches between upper- and lower-case letters, it prevents long tokens from code, such as GPT-4’s latesAutoresizingMaskIntoConstraints.
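As a quick comparison between the two tokenizers (the pre-tokenizations in the comments are what I'd expect; this only looks at splitting, not at the final token ids):

import regex
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.get_encoding("o200k_base")
regex.findall(cl100k._pat_str, "latesAutoresizingMaskIntoConstraints")
# ['latesAutoresizingMaskIntoConstraints']  -- one pre-token in GPT-4
regex.findall(o200k._pat_str, "latesAutoresizingMaskIntoConstraints")
# ['lates', 'Autoresizing', 'Mask', 'Into', 'Constraints']  -- split at every case switch in GPT-4o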
However, there are characters from languages that don’t have an upper and lower case, like Chinese. These characters are included as matches in both the upper- and lower-case patterns (see \p{Lo} in the full regex here). This explains the final category of mixed Chinese/Latin tokens.
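If you want to see the exact character classes rather than my simplified version, you can print the full pattern directly from tiktoken:

import tiktoken

print(tiktoken.get_encoding("o200k_base")._pat_str)  # the full o200k pre-tokenization regex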
Let’s look at the token ' 天天中彩票APP' and the first sub-pattern [upper case letter]*[lower case letter]+. The regex parser will match the entire string with the upper-case class, fail to find a lower-case letter, and backtrack until it can match one (P? no! P? no! A? no! 票? yes!), leading it to match only the Chinese part.
Recall that regex alternatives don’t look for the longest match, just the first alternative that matches at all. So, even though the second pattern [upper case letter]+[lower case letter]*
would match the entire string, it still gets split.
Now we can also tell how to match the entire token - just add any lower case or Chinese character after it! We can also visualize the matches with regex101.
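For example (expected pre-tokenizations in the comments):

import regex
import tiktoken

enco = tiktoken.get_encoding("o200k_base")
regex.findall(enco._pat_str, " 天天中彩票APP")   # [' 天天中彩票', 'APP']  -- split, as in the table above
regex.findall(enco._pat_str, " 天天中彩票APPx")  # [' 天天中彩票APPx']  -- a single pre-token, within which BPE can reach the token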
And there we are, all the not-quite-unreachable tokens explained!
Wrap-up
If you’ve made it this far, thanks for sticking with me!
Tokenizers are among the most frequently copied components between models. We've observed a trend from replicating GPT-2's pre-tokenization and tokenization methods to adopting GPT-4's approach. While I haven’t yet seen any models re-using GPT-4o’s approach, I wouldn’t be surprised if it happens soon, bringing with it all the complexities and edge cases we've discussed.
It seems to me that the edge cases described in these posts reflect a mix of deliberate trade-offs by OpenAI (which we can only speculate about) and unintended side effects. However, copying an approach without a good understanding of it means you risk replicating all the accidents without capturing the right trade-offs for your specific goals.
I hope these posts contribute, even in a small way, to a better understanding of pre-tokenization in large language models. And maybe they’ll inspire a bit more diversity in tokenization approaches.