While working on our recent paper, I realized just how uniquely closed the Claude 3 tokenizer is.
Aside from a single comment in their SDK mentioning a change from Claude 2, and suggesting you essentially use the bills they send to see how much their service costs, there is basically nothing else out there.
This series of posts will summarize my investigations in trying to figure out what lies behind this mysterious black box. TL;DR: it looks normal at first, but some things are seriously strange when you look closely!
Getting Token Counts from Usage Data
As they suggest, we can send API requests and read the usage data returned in each response to determine the input and output token counts.
We use a prompt that simply asks for a phrase to be repeated, and check the input/output tokens used. In addition, we save the streaming ‘chunks’ sent, to see if we can learn something from those. I’m happy to share details or code if anyone is interested.
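As a rough sketch of what such a measurement looks like with the Anthropic Python SDK (the prompt wording, model name, and max_tokens here are placeholders, not the exact setup behind the tables below):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def measure(text: str, model: str = "claude-3-haiku-20240307") -> dict:
    """Ask the model to repeat `text` and collect streaming chunks plus usage counts."""
    chunks = []
    with client.messages.stream(
        model=model,
        max_tokens=100,
        messages=[{"role": "user", "content": f"Repeat exactly, with nothing else: {text}"}],
    ) as stream:
        for chunk in stream.text_stream:
            chunks.append(chunk)
        final = stream.get_final_message()
    return {
        "chunks": chunks,
        "input_tokens": final.usage.input_tokens,
        "output_tokens": final.usage.output_tokens,
    }

print(measure("AAAA"))
```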
Tokenization of Single Characters
We’ll start with the most basic case of echoing a single character back.
An empty output costs three tokens. Asking the API for at most one output token (max_tokens=1) gives the same count, so these three tokens are likely all at the end. A single token there would be expected (<|end_of_turn|> or similar), but three is rather strange.
Throughout, I’ll subtract the baseline (56 tokens for the prompt, 3 for the output).
The number 1 costs an extra token, but only in outputs.
| text | chunks | input_tokens - 56 | output_tokens - 3 |
|--------|----------|---------------------|---------------------|
| 'a' | ['a'] | 1 | 1 |
| 'A' | ['A'] | 1 | 1 |
| '.' | ['.'] | 1 | 1 |
| '1' | ['1'] | 1 | 2 |
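In terms of the hypothetical measure() helper from the sketch above, the deltas in these tables are just the raw usage counts minus those baselines. Note that 56/3 are specific to the actual prompt used here; the placeholder prompt in my sketch would give different raw numbers.

```python
PROMPT_BASELINE, OUTPUT_BASELINE = 56, 3  # baselines observed for the empty case

for text in ["a", "A", ".", "1"]:
    usage = measure(text)  # hypothetical helper from the earlier sketch
    print(
        repr(text),
        usage["input_tokens"] - PROMPT_BASELINE,
        usage["output_tokens"] - OUTPUT_BASELINE,
        usage["chunks"],
    )
```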
Chinese/Japanese/Korean (CJK) characters also exhibit this difference between input and output counts. In addition, even a very common character such as 둘 (‘two’) takes extra tokens, costing as much as a character that is essentially never used. However, this is not simply one token per byte either: in UTF-8, Korean characters take three bytes each, yet the input counts below never exceed two.
| text | chunks | input_tokens - 56 | output_tokens - 3 |
|--------|----------|---------------------|---------------------|
| '국' | ['국'] | 1 | 2 |
| '한' | ['한'] | 1 | 2 |
| '둘' | ['둘'] | 2 | 3 |
| '췥' | ['췥'] | 2 | 3 |
Starting the string with an “a” or a “1” removes the difference in input/output counts, and an unassigned Unicode character with a 4-byte UTF-8 encoding becomes 4 tokens.
| text | chunks | input_tokens - 56 | output_tokens - 3 |
|---------------|---------------------|--------------|--------------|
| 'a한' | ['a', '한'] | 2 | 2 |
| 'a둘' | ['a', '둘'] | 3 | 3 |
| 'a췥' | ['a', '췥'] | 3 | 3 |
| 'a\U00010ffd' | ['a', '\U00010ffd'] | 5 | 5 |
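As a sanity check on the byte counts mentioned above, which is plain Python and does not touch the API:

```python
# Hangul syllables take 3 bytes each in UTF-8; the unassigned U+10FFD takes 4.
for ch in ["국", "한", "둘", "췥", "\U00010ffd"]:
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")))
```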
Preliminary Conclusions:
The streaming chunks are not tokens.
So far, things look like a fairly standard byte-based BPE, although the Korean token counts point to something a little more handcrafted for certain Unicode ranges.
There is something extra in the output that we are not seeing. I suspect that in the training data all model outputs start with a space, so the model’s outputs also start with one, which is then stripped out before being sent over. English letters like “a” can absorb that space into a single token, but punctuation, numbers, and CJK characters cannot, leaving the space as a separate token.
The models also seem unable to output text starting with a space, even when given a task in which indentation is key to continuing code, which further supports this idea.
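A sketch of the kind of probe this refers to; the prompt wording and model are illustrative, not necessarily the exact setup used: give the model a continuation where the natural next characters are indentation, and check whether the reply can start with whitespace.

```python
import anthropic

client = anthropic.Anthropic()

# A continuation task where the correct next characters are spaces (indentation).
prompt = (
    "Continue this Python snippet from exactly where it stops, starting with "
    "the indentation of the next line and nothing else:\n"
    "def f(x):\n"
    "    if x > 0:\n"
)

resp = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=30,
    messages=[{"role": "user", "content": prompt}],
)

reply = resp.content[0].text
print(repr(reply[:20]))               # inspect the raw start of the reply
print(reply.startswith((" ", "\t")))  # per the observation above, this tends to be False
```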
Say AAAAAA
Let’s continue with a few curious examples. First, we will look at repeating A and B.
| text | chunks | input_tokens - 56 | output_tokens - 3 |
|---------|-------------|---------------------|---------------------|
| 'A' | ['A'] | 1 | 1 |
| 'AA' | ['AA'] | 1 | 1 |
| 'AAA' | ['AA', 'A'] | 2 | 2 |
| 'AAAA' | ['AAAA'] | 4 | 4 |
| 'AAAAA' | ['AAAAA'] | 4 | 4 |
| text | chunks | input_tokens - 56 | output_tokens - 3 |
|---------|-------------|---------------------|---------------------|
| 'B' | ['B'] | 1 | 1 |
| 'BB' | ['BB'] | 1 | 1 |
| 'BBB' | ['BB', 'B'] | 2 | 2 |
| 'BBBB' | ['BBBB'] | 3 | 3 |
| 'BBBBB' | ['BBBBB'] | 4 | 4 |
In addition to making it very clear that streaming doesn’t send a single token per chunk, even for ASCII, both of these examples reveal some interesting inconsistencies.
Note that AAAA is 4 tokens even though AA is a single token, meaning AAAA could have been represented as two tokens. This implies that the tokenizer is either unable to use AA when encoding AAAA, or randomly choosing not to. For BBBB we get a different token count, and one that suggests at least one use of ‘BB’.
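For contrast, here is how a toy greedy BPE would behave given an ‘AA’ merge. This is only a minimal illustration, not a claim about Claude’s actual merge table; it simply shows that an ordinary BPE that knows ‘AA’ has no reason to spend four tokens on ‘AAAA’.

```python
def bpe_encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Toy greedy BPE: apply each merge in priority order, one left-to-right pass per merge."""
    tokens = list(text)
    for pair in merges:
        merged, out, i = "".join(pair), [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(bpe_encode("AAA", [("A", "A")]))   # ['AA', 'A']  -> 2 tokens, matching the table
print(bpe_encode("AAAA", [("A", "A")]))  # ['AA', 'AA'] -> 2 tokens, not the 4 the API reports
```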
Finally, one example where the opposite happens:
| text     | chunks         | input_tokens - 56 | output_tokens - 3 | +  |
|----------|----------------|-------------------|-------------------|----|
| 'Alpha'  | ['Alpha']      | 1                 | 1                 |    |
| 'A'      | ['A']          | 1                 | 1                 |    |
| 'lpha'   | ['l', 'pha']   | 2                 | 2                 | 3  |
| 'Al'     | ['Al']         | 1                 | 1                 |    |
| 'pha'    | ['pha']        | 2                 | 2                 | 3  |
| 'Alp'    | ['Alp']        | 3                 | 3                 |    |
| 'ha'     | ['ha']         | 1                 | 1                 | 4  |
| 'Alph'   | ['Alph']       | 3                 | 3                 |    |
| 'a'      | ['a']          | 1                 | 1                 | 4  |
| 'AlphaA' | ['Al', 'phaA'] | 4                 | 4                 |    |
In this case, “Alpha” is a single token, while every way of splitting the word results in more than two tokens (indicated by the ‘+’ column). This could suggest that the word was added manually, bypassing the normal tokenization. Indeed, adding an A at the end also incurs an extra cost of 3 tokens and splits ‘Alpha’ across different streaming chunks.
Conclusion
What can we conclude from this? It seems clear that the tokenizer is somewhat different from the usual byte pair encoding, but not drastically so.
Firstly, the streaming chunks are not very informative for tokenization.
There are three tokens at the end of the text, forming some sort of multi-token end-of-text sequence. It is unclear why it is three tokens long.
There is an extra output token for outputs starting with numbers, punctuation, and certain non-Latin characters. I suspect this extra token is a space, since the difference disappears when a prefix is added and it seems impossible to get outputs that start with a space.
Some words appear to bypass normal tokenization. This could be due to adding a number of high-frequency complete words on top of a subword approach.
Even so, the AAAA/BBBB results can’t quite be explained by this, and suggest a level of randomness reminiscent of Unigram or BPE-Dropout methods.
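To illustrate what that kind of randomness looks like, here is a minimal BPE-dropout sketch in the spirit of Provilkov et al. (2020); again a toy, not Claude’s actual tokenizer. Individual merges are skipped with some probability, so the same string can come out as a different number of tokens on different runs.

```python
import random

def bpe_dropout_encode(text: str, merges: list[tuple[str, str]], p: float = 0.1) -> list[str]:
    """Toy BPE-dropout: each individual merge opportunity is skipped with probability p."""
    tokens = list(text)
    for pair in merges:
        merged, out, i = "".join(pair), [], 0
        while i < len(tokens):
            if (
                i + 1 < len(tokens)
                and (tokens[i], tokens[i + 1]) == pair
                and random.random() > p
            ):
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

for _ in range(5):
    # Sometimes ['AA', 'AA'], sometimes ['AA', 'A', 'A'] or even ['A', 'A', 'A', 'A'].
    print(bpe_dropout_encode("AAAA", [("A", "A")], p=0.5))
```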
The tokenizer is not particularly big, with fairly common words taking multiple tokens - I’ll have more on this later.