In your article, it seems you determined a list of 8,311 common words that are each a single token in the Claude vocabulary. I was hoping you would publish that LIST of 8,311 common words. If there is a different list of most-common words that are tokenized as whole words (e.g., 7,682 of them as single tokens), please share that list as well.
P.S. Concerning your question:
Another curious case
J: 1 token
JJ: 2 tokens
JJJ: 3 tokens
JJJJ: 5 tokens(!)
The "riddle" might be "solved" with further investigation of invisible characters (like \n or \t) as follows:
"J:" is a distinct token.
":" is a distinct token.
"J" is also a distinct token.
"J-line-return" (or J-tab) is also a distinct token.
or
":-line-return"(e.g. : plus invisible \n) is a distinct token.
"line-return" (e.g., invisible \n) is a distinct token
Thus:
[J:] or [J:\n]: 1 token
[J][J:] or [J][J:\n]: 2 tokens
[J][J][J:] or [J][J][J:\n]: 3 tokens
[J][J][J][J:][\n] or [J][J][J][J][:] or [J][J][J][J][:\n]: 5 tokens(!)
Of all the labs, I'd be most interested in an open-weights Claude, so we could know for certain.
This article was noted in Interconnects by Nathan Lambert.
I am interested in datasets with limited vocabulary (https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py), and I have built some simple whole-word tokenizers for research and educational purposes:
https://huggingface.co/MartialTerran/coherent_text_from_1_megabyte_GPT2_model
More generally:
https://huggingface.co/MartialTerran/Toy_GPTs_LLMs_for_CPU_Educational/blob/main/Gettysburg_GPT2_v1.4.2.py
[uses whole-word tokenization to miniaturize ToyGPT models]
For the whole words, I just used lists like:
https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/google-10000-english.txt
https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/20k.txt
I will add the list.
Regarding JJJJ: the ":" was not part of the input, just a formatting leftover.