The never-ending edge cases in LLM tokenization
Language models rely on separately trained tokenizers, as well as manually created pre-tokenizers. Not much has been written about pre-tokenizers specifically, even though they are among the most frequently copied components across large language models. This space is dedicated to sharing my findings on tokenization, focusing especially on the curious edge cases.
Pre-tokenization
Before tokenizing an input string, it is normalized and pre-tokenized, the latter step splitting the string up into a number of separate chunks.
The main purpose of this pre-tokenization is to prevent tokens from crossing word boundaries, since nothing else in the tokenizer training process prevents it. Without it, we would end up with multi-word tokens such as “and the”, or with dedicated tokens for words with punctuation attached, such as “world!”.
Let’s have a look at an example from GPT-2:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, world!")
# Results in 4 chunks: 'Hello' ',' ' world' '!' (the space is displayed as 'Ġ' due to the byte-level mapping)
Note that this step relies on a great deal of intuition about the right balance between splitting text into semantically meaningful tokens and keeping the total number of tokens low for efficiency. There is nothing inherently bad about multi-word tokens, and they are quite common in languages such as Chinese.
Regular expression based pre-tokenization
Pre-tokenizers can be arbitrarily complex and vary by implementation, but for now we’ll focus on the common case of a single regular expression (‘regex’) doing most of the heavy lifting, as is the case for the various models by OpenAI from GPT-2 up to the most recent ones.
Let’s have a closer look at the GPT-2 pre-tokenization regex:
'(?:[sdmt]|ll|ve|re) # English contractions such as 'm and 've
| ?\p{L}+ # Optional space + one or more letters
| ?\p{N}+ # Optional space + one or more numbers
| ?[^\s\p{L}\p{N}]+ # Optional space + one or more punctuation(-ish)
| \s+(?!\S) # Whitespace not followed by non-whitespace
| \s+ # Whitespace
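To see the pattern in action outside of a tokenizer, you can apply it directly with the third-party regex module, which, unlike the built-in re, supports the \p{L} and \p{N} property classes. A minimal sketch that reproduces the chunks from the earlier example:
import regex

# The GPT-2 pattern from above, collapsed back into a single line
gpt2_pattern = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

regex.findall(gpt2_pattern, "Hello, world!")
# ['Hello', ',', ' world', '!']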
These expressions are notoriously hard to understand, and the matching process can be rather unintuitive. When the pre-tokenizer runs the typical regex.findall(pattern, string), it tries to find a match at the current position using the first alternative, then the second, and so on. It picks the first alternative that matches at all, rather than the alternative with the longest match.
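As a tiny, tokenizer-free illustration of this first-match-wins behavior (using the same regex module as above):
regex.findall(r"a|ab", "ab")  # ['a']: the first alternative wins, even though 'ab' would be longer
regex.findall(r"ab|a", "ab")  # ['ab']: reordering the alternatives changes the result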
As the first curious example:
tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
text = """a = [
'really', # Tokenizes as: <space>' really '
'really', # Tokenizes as: 're eally '
'strange']""" # Tokenizes as: 's trange ']
tok.tokenize(text)
Despite there being a token for “really”, the pre-tokenization forces the word apart via the ‘re alternative that is meant to match contractions such as we’re. For a model of this size I’ve not found this to cause problems in instruction following, but the internal representation needed to deal with it must be pretty strange.
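The same effect can be reproduced with the GPT-2 pattern from the sketch above, which shares this kind of contraction alternative; a preceding space is enough to change the outcome:
regex.findall(gpt2_pattern, "'really'")
# ["'re", 'ally', "'"]: the quote plus "re" is grabbed by the contraction alternative
regex.findall(gpt2_pattern, " 'really'")
# [" '", 'really', "'"]: with a space in front, the quote binds to the space instead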
Now, let’s test your understanding: what does the string "\n\n\ndef f():" pre-tokenize as when using the GPT-2 pattern?
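If you want to check your answer, you can run the string through the pattern from the sketch above (spoiler in the comment below):
regex.findall(gpt2_pattern, "\n\n\ndef f():")
# Spoiler: ['\n\n', '\n', 'def', ' f', '():']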