Whole words and Claude tokenization
Using the new counting endpoint reveals... a preference for whole words?
In the previous post I noted that the only way to get token counts in Claude 3 was to pay for requests and see the result. I also showed that certain phrases such as “Alpha” and “AA” take fewer tokens than expected. In this post we’ll go into another aspect of the tokenizer which might explain this.
With the release of the token counting endpoint, there is at least a way to know what you’ll pay in advance for a prompt. However, the minimal info it gives just increases the feeling that something significant is hidden.
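For reference, here is roughly what a call to it looks like from the Python SDK. This is only a minimal sketch: the model name is an example, and the exact method may differ between SDK versions.

```python
# Minimal sketch: counting tokens for a prompt with the counting endpoint,
# using the official `anthropic` Python SDK. The model name is only an example.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Alpha"}],
)
print(count.input_tokens)  # total input tokens, message overhead included
```

Note that the count covers the whole request, not just your text, so measuring the cost of a single string generally means diffing against a baseline.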
Full Word Tokens
Previously I suggested that some words or phrases bypass normal tokenization.
This is relatively easy to test. Taking the 10,000 most common English words, 8,311 of them are a single token in the Claude 3 tokenizer.
By contrast, searching across the Llama3, Mistral, Cohere, and Gemma vocabularies combined finds only 7,682 of them as single tokens.
At the same time, a common suffix like ‘ent’ (which is a subword token in all four of the aforementioned BPE tokenizers) takes two tokens to encode with Claude 3, which suggests Claude’s tokenizer is relatively small.
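Here is a rough sketch of how these checks can be run, assuming the `anthropic` and `transformers` libraries. The helper functions, baseline prompt, word list, and model/checkpoint names are illustrative placeholders, not the exact setup behind the numbers above.

```python
# Hedged sketch of the whole-word test. For Claude, a word is treated as a
# single token if appending it to a fixed baseline prompt raises the reported
# count by exactly one; for open tokenizers, the word is encoded directly.
import anthropic
from transformers import AutoTokenizer

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # example model name

def count(content: str) -> int:
    return client.messages.count_tokens(
        model=MODEL, messages=[{"role": "user", "content": content}]
    ).input_tokens

def claude_token_cost(text: str) -> int:
    # Diff against a fixed baseline so the constant message overhead cancels out.
    return count("x " + text) - count("x")

def is_single_token_hf(tokenizer, word: str) -> bool:
    # Most BPE vocabularies store common words with a leading space.
    return len(tokenizer.encode(" " + word, add_special_tokens=False)) == 1

# Example: check a few words (a real run would loop over a 10,000-word list).
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
for word in ["the", "telecommunications", "ent"]:
    print(word, claude_token_cost(word), is_single_token_hf(llama_tok, word))
```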
Overlap with other tokenizers
Mistral Small has a relatively small tokenizer, with around 32k entries. Tokenizing its vocabulary with Claude shows that only 15,937 of them are a single token in Claude as well, which again suggests that the Claude tokenizer is relatively small.
For comparison, if we instead check against a tokenizer with a large vocabulary like Cohere’s, the vast majority of Mistral’s tokens (29,782) are included as single tokens.
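A sketch of how such an overlap count could be computed for two open tokenizers is below; the checkpoint names are illustrative only, and the Claude side of the comparison can reuse the API-based diff from the earlier sketch instead of an offline tokenizer.

```python
# Rough sketch: how many entries of one tokenizer's vocabulary re-encode to a
# single token in another tokenizer. Leading-space handling and special tokens
# are glossed over; checkpoint names are illustrative examples only.
from transformers import AutoTokenizer

def single_token_overlap(source_tok, target_tok) -> int:
    hits = 0
    for token_id in range(source_tok.vocab_size):
        text = source_tok.decode([token_id])
        if len(target_tok.encode(text, add_special_tokens=False)) == 1:
            hits += 1
    return hits

mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
cohere_tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
print(single_token_overlap(mistral_tok, cohere_tok))
```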
A more exhaustive search
A more exhaustive search finds only around 22,000 tokens, not counting variants with leading spaces separately.
Latin script tokens are the overwhelming majority, with the longest being ‘githubusercontent’ and ‘telecommunications’.
Many languages are present, but mostly in small amounts. Chinese characters lead with around 1,100 tokens, and Korean and Cyrillic have a few hundred each, but there are only 4 Thai tokens and a single Georgian one (და, ‘and’).
Other than very short fragments, the vast majority appear to be full words; perhaps every token of 3 or more characters is one.
How does it tokenize non-words?
It seems clear that this is not a standard BPE tokenizer, but beyond that, how it works is an open question. The ‘whole word’ tokens don’t always get used when the word is part of a larger string, which explains why ‘AAAA’ takes more tokens than the minimum, but exactly what happens instead is unclear.
Another curious case
As a final strange case, I’ll leave you with this:
J: 1 token
JJ: 2 tokens
JJJ: 3 tokens
JJJJ: 5 tokens(!)
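If you want to reproduce this, the counting endpoint is enough; a quick sketch, reusing the `client` and `MODEL` from the earlier sketches:

```python
# Quick check of the J / JJ / JJJ / JJJJ pattern via the counting endpoint.
# Absolute values include constant message overhead; the interesting part is
# how much the count grows each time another "J" is appended.
for s in ["J", "JJ", "JJJ", "JJJJ"]:
    n = client.messages.count_tokens(
        model=MODEL, messages=[{"role": "user", "content": s}]
    ).input_tokens
    print(s, n)
```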
Of all the labs, I'd be most interested in an open-weights Claude, so we could know for sure.
This article was noted by Interconnects by Nathan Lambert.
I am interested in datasets having limited vocabulary: https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py
I have also built some simple whole-word tokenizers for research and educational purposes.
https://huggingface.co/MartialTerran/coherent_text_from_1_megabyte_GPT2_model
Generally:
https://huggingface.co/MartialTerran/Toy_GPTs_LLMs_for_CPU_Educational/blob/main/Gettysburg_GPT2_v1.4.2.py
[uses whole-word tokenization to miniaturize ToyGPT models]
In your article it seems that you determined a list of 8,311 common words that are a single token in the Claude vocab. I was hoping that you would publish the LIST of those 8,311 common words. Please publish that LIST. If there is a different list of most common words that are tokenized as whole words (e.g., the 7,682 single tokens in the other vocabularies), please let me know that list also.
P.S. Concerning your question:
Another curious case
J: 1 token
JJ: 2 tokens
JJJ: 3 tokens
JJJJ: 5 tokens(!)
The "riddle" might be "solved" with further investigation of invisible characters (like \n or \t) as follows:
"J:" is a distinct token.
":" is a distinct token.
"J" is also a distinct token.
"J-line-return" (or J-tab) is also a distinct token.
or
":-line-return" (e.g., ":" plus invisible \n) is a distinct token.
"line-return" (e.g., invisible \n) is a distinct token.
Thus:
[J:] or [J:\n]: 1 token
[J][J:] or [J][J:\n]: 2 tokens
[J][J][J:] or [J][J][J:\n]: 3 tokens
[J][J][J][J:][\n] or [J][J][J][J][:] or [J][J][J][J][:\n]: 5 tokens(!)
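One way to probe this hypothesis would be to count the variants with and without the trailing colon/newline separately; a sketch, reusing the `client` and `MODEL` from the article's examples:

```python
# Sketch: probe the invisible-character hypothesis by counting each variant
# separately (client/MODEL set up as in the sketches above).
for s in ["J", "J:", "J\n", "J:\n", "JJJJ", "JJJJ:", "JJJJ:\n"]:
    n = client.messages.count_tokens(
        model=MODEL, messages=[{"role": "user", "content": s}]
    ).input_tokens
    print(repr(s), n)
```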