Last time we explored pre-tokenization in GPT-2, and left off with the question: what does the string "\n\n\ndef f():" pre-tokenize as?
The answer is: ['\n\n', '\n', 'def', ' f', '():']
Did it surprise you that the third line break is pre-tokenized separately?
This happens because of how the pre-tokenization process matches whitespace. Specifically, the pattern \s+(?!\S) matches a run of whitespace only if it is not followed by a non-whitespace character, which here captures the first two line breaks. And since only a space is allowed before a letter, the last line break ends up as a separate chunk rather than combining with "def".
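If you want to check this yourself, here is a minimal sketch using the third-party regex module (the standard re module doesn't support \p{L}/\p{N}) and GPT-2's published pre-tokenization pattern:

```python
# Reproduce the GPT-2 pre-tokenization split (pattern from OpenAI's encoder.py / tiktoken).
import regex

GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

print(regex.findall(GPT2_PAT, "\n\n\ndef f():"))
# ['\n\n', '\n', 'def', ' f', '():']
```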
Interestingly, many models adopt this pre-tokenization pattern even when they introduce dedicated tokens for runs of tabs and line breaks. However, they are still at the mercy of the pre-tokenizer, which still splits such a run into two pieces whenever text follows it.
So, how can we address this? One way is to adjust the pattern so that only a final space is excluded: \s*([\r\n\v\t]| (?!\S))
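As a quick sanity check of this adjustment (my own reading: swap it in for the \s+(?!\S) clause of the GPT-2 pattern, written with a non-capturing group so findall returns whole matches):

```python
# GPT-2's pattern with the whitespace clause adjusted to only exclude a final space.
import regex

ADJUSTED_PAT = (
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+"""
    r"""|\s*(?:[\r\n\v\t]| (?!\S))|\s+"""
)

print(regex.findall(ADJUSTED_PAT, "\n\n\ndef f():"))
# ['\n\n\n', 'def', ' f', '():'] -- the line breaks now stay in one chunk
```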
However, this is not quite what happened. Let's have a look at the next version of this pattern!
The GPT-3.5/GPT-4 pre-tokenization pattern
The pre-tokenization pattern used in GPT-3.5 and GPT-4 (specifically with the “cl100k_base” tokenizer) introduces some interesting complexities compared to earlier models:
'(?i:[sdmt]|ll|ve|re) # English contractions, as in GPT-2, but now case-insensitive
|[^\r\n\p{L}\p{N}]?+\p{L}+ # optional single space/tab/punctuation + letters
|\p{N}{1,3} # 1-3 digits, no leading space
| ?[^\s\p{L}\p{N}]++[\r\n]* # optional space + punctuation, then optional line breaks
|\s*[\r\n] # any whitespace ending in \r or \n
|\s+(?!\S) # any whitespace not followed by non-whitespace
|\s+ # any remaining whitespace
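Before going through the changes, here is the motivating example from before, run through this pattern (again a sketch with the regex module, which also supports the possessive ?+ and ++ quantifiers):

```python
# The cl100k_base pre-tokenization pattern, as quoted above.
import regex

CL100K_PAT = (
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)

print(regex.findall(CL100K_PAT, "\n\n\ndef f():"))
# ['\n\n\n', 'def', ' f', '():'] -- all three line breaks now stay together
```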
We can see that something like the whitespace-matching fix has been added with the \s*[\r\n] clause, but a number of other changes have also been made:
Instead of only a space, a single tab or punctuation character can now also precede letters.
Numbers can no longer have a leading space, and are broken up into groups of 1-3 digits.
Punctuation can be preceded only by a space, but can be followed by line breaks.
This still leaves some asymmetries between tabs and spaces, where runs of spaces or tabs before numbers or punctuation get broken up (or not) into multiple chunks in rather inconsistent ways.
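A few examples of these quirks, using the same regex-module sketch of the pattern:

```python
# Whitespace quirks under the cl100k_base pre-tokenization pattern.
import regex

CL100K_PAT = (
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)

for s in ["  42", "\t\t42", "  !!", "\t\t!!", "x = 1;\r\r\r\ny = 2"]:
    print(repr(s), regex.findall(CL100K_PAT, s))
# '  42'   -> [' ', ' ', '42']      two spaces before digits: two single-space chunks
# '\t\t42' -> ['\t', '\t', '42']    tabs behave the same before digits...
# '  !!'   -> [' ', ' !!']          ...but before punctuation, a space merges in,
# '\t\t!!' -> ['\t', '\t', '!!']    while a tab does not
# 'x = 1;\r\r\r\ny = 2' -> [..., ';\r\r\r\n', ...]  punctuation absorbs following line breaks
```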
The choice to combine line breaks with punctuation also results in more strange under-trained tokens, such as ">\r\r\n and ;\r\r\r\n in Llama 3 models.
However, I personally find the choice of allowing any starting punctuation more interesting. Let’s have a closer look.
Starting punctuation
The [^\r\n\p{L}\p{N}]?+\p{L}+ pattern is quite a bold change, moving from allowing an optional space before words to allowing spaces, tabs, and a wide range of punctuation. It's also unusual, and I'm not aware of any models that adopt this, other than the (admittedly many) models that just adopt the entire tokenizer.
It's important to remember that keeping characters together during pre-tokenization doesn’t guarantee they’ll stay together as tokens. However, it does increase the likelihood, especially as tokenizer vocabularies expand. As the vocabulary size grows, rare or unusual combinations are more likely to be represented as individual tokens.
Most frequent combinations
Now, let's explore the most frequent starting characters in tokens produced by this particular choice in GPT-4's cl100k_base tokenizer. This will give us insight into how this change impacts tokenization in practice.
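Here is a rough sketch of how such counts can be gathered, assuming tiktoken is installed; this is my own approximation of the analysis, so exact numbers may differ:

```python
# Tally the leading character of cl100k_base tokens that look like punctuation + letters.
from collections import Counter

import regex
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
starts = Counter()
for i in range(enc.n_vocab):
    try:
        token = enc.decode_single_token_bytes(i).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and tokens that are not valid UTF-8 on their own
    # a single non-whitespace, non-letter, non-digit character followed by letters
    # (plain space- and tab-prefixed tokens are excluded to focus on punctuation)
    if regex.fullmatch(r"[^\s\p{L}\p{N}]\p{L}+", token):
        starts[token[0]] += 1

print(starts.most_common(10))
```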
Here we can see primarily quoted words getting extra tokens, as well as various programming constructs. A "full-width comma" used in Chinese also shows up.
At the other end, words like ‘data’ and ‘text’ get 16 different tokens, for example:
' data' '_data' '.data' '(data' '\tdata' '$data' '/data' '-data' '=data' '[data' ',data' ':data' ')data' '\\data' '<data' '"data'
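You can check this yourself with tiktoken (assuming it is installed); per the list above, each of these strings should encode to a single token id:

```python
# Per the list above, each of these variants of 'data' is a single cl100k_base token.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in [" data", "_data", ".data", "(data", "\tdata", '"data']:
    print(repr(s), enc.encode(s))
```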
Breaking models
Let’s craft an input which uses several of these tokens with punctuation.
We'll ask the model to transform a Python code input in the following way:
Note that r-strings are a nice trick here to prevent spaces or parentheses from merging with the quotes. Although this should be a simple transformation, you can see from the tokenization that it involves "splitting" a token up into parts, which models often find hard.
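To see why, here is how the crafted snippet pre-tokenizes under the cl100k_base pattern (again a regex-module sketch); since BPE merges never cross chunk boundaries, these chunks are the pieces the model has to work with:

```python
# Pre-tokenize the crafted snippet with the cl100k_base pattern quoted earlier.
import regex

CL100K_PAT = (
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)

snippet = """x:UIButtonType = r'utilisation'
if x>Note or x<Cell:
print(r'elle'field"github')"""

print(regex.findall(CL100K_PAT, snippet))
# Note how ':UIButtonType', '>Note', '<Cell', "'elle", "'field" and '"github'
# all stay together as single pre-tokenization chunks.
```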
When we give this task to GPT-3.5 and GPT-4, we observe various types of failures:
System: You are a code formatter. Format given code according to the tool 'black'. Never change variable names or string values.
User: Remove redundant raw strings, fix syntax and reformat according to black
x:UIButtonType = r'utilisation'
if x>Note or x<Cell:
print(r'elle'field"github')
GPT-3.5 Turbo:
```python
x = 'Button'
if x == 'Note' or x == 'Cell':
    print('elle field github')
```
GPT-4-0125: (after being told to keep type hints)
```python
x: UnionType = "utilisation"
if x > Note or x < Cell:
print("elle'field\"github\"")
```
GPT-4-0613: (after being told to keep type hints)
```python
x: str = "utilisation"
if x > Note or x < Cell:
print("elle'field\"http")
```
Llama 3.1 70B: (uses an extended cl100k_base tokenizer)
x: str = 'utilisation'
if x != 'Note' or x != 'Cell':
    print('elle field github')
All of them fail! Admittedly the code is rather strange, but it shows how these kinds of tokenization and pre-tokenization failure modes can pop up unexpectedly, and create a kind of fragility which is hard to debug.