3 Comments

Of all the labs, I'd be most interested in an open weights Claude so we could know.



I am interested in datasets with limited vocabulary: https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py. I have also built some simple whole-word tokenizers for research and educational purposes:

https://huggingface.co/MartialTerran/coherent_text_from_1_megabyte_GPT2_model

Generally:

https://huggingface.co/MartialTerran/Toy_GPTs_LLMs_for_CPU_Educational/blob/main/Gettysburg_GPT2_v1.4.2.py

[uses whole-word tokenization to miniaturize ToyGPT models]
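For anyone curious what whole-word tokenization means concretely, here is a minimal sketch of the idea (illustrative only; the real scripts are at the links above, and the class name and regex here are my own assumptions, not the repo's code):

```python
# Minimal whole-word tokenizer: every distinct word (or punctuation mark)
# in the training text becomes exactly one token, so a toy GPT needs no
# subword merges at all. Illustrative sketch, not the linked repo's code.
import re

WORD_RE = re.compile(r"\w+|[^\w\s]")  # words, plus punctuation as tokens

class WholeWordTokenizer:
    def __init__(self, corpus: str):
        words = WORD_RE.findall(corpus)
        self.vocab = {w: i for i, w in enumerate(sorted(set(words)))}
        self.inverse = {i: w for w, i in self.vocab.items()}
        self.unk = len(self.vocab)  # one id reserved for unknown words

    def encode(self, text: str) -> list[int]:
        return [self.vocab.get(w, self.unk) for w in WORD_RE.findall(text)]

    def decode(self, ids: list[int]) -> str:
        # Joins with spaces, so original spacing is only approximate.
        return " ".join(self.inverse.get(i, "<unk>") for i in ids)

tok = WholeWordTokenizer("Four score and seven years ago our fathers")
print(tok.encode("seven years ago"))  # three ids, one per whole word
```

On a text like the Gettysburg Address this yields a vocabulary in the low hundreds, which is why whole-word tokenization shrinks toy models so much.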

In your article you seem to have determined a list of 8,311 common words that are each a single token in the Claude vocabulary. I was hoping you would publish that list of 8,311 common words; please do. If there is a different list of most-common words that are tokenized as whole words (e.g., 7,682 of them as single tokens), please let me know that list as well.

P.S. Concerning your question:

Another curious case

J: 1 token

JJ: 2 tokens

JJJ: 3 tokens

JJJJ: 5 tokens(!)

The "riddle" might be "solved" with further investigation of invisible characters (like \n or \t) as follows:

"J:" is a distinct token.

":" is a distinct token.

"J" is also a distinct token.

"J-line-return" (or J-tab) is also a distinct token.

or

":-line-return"(e.g. : plus invisible \n) is a distinct token.

"line-return" (e.g., invisible \n) is a distinct token

Thus:

[J:] or [J:\n] → 1 token

[J][J:] or [J][J:\n] → 2 tokens

[J][J][J:] or [J][J][J:\n] → 3 tokens

[J][J][J][J:][\n] or [J][J][J][J][:] or [J][J][J][J][:\n] → 5 tokens(!)
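This hypothesis is easy to probe empirically by appending the suspect invisible characters and counting tokens. A minimal sketch, using GPT-2's public tokenizer as a stand-in (Claude's tokenizer is not available as a library, so the exact counts will differ; the probing method is what matters):

```python
# Probe the invisible-character hypothesis: tokenize each J-run bare and
# with a trailing ":", "\n", or ":\n", and compare the token counts.
# GPT-2's tokenizer is used here only as an open stand-in for Claude's.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

for s in ["J", "JJ", "JJJ", "JJJJ"]:
    for suffix in ["", ":", "\n", ":\n"]:
        ids = tok.encode(s + suffix)
        pieces = [tok.decode([i]) for i in ids]
        print(f"{s + suffix!r}: {len(ids)} tokens {pieces}")
```

If a merged token like "J:" or ":\n" exists in the vocabulary, the suffixed runs will come out shorter than the bare runs, which is exactly the pattern the segmentations above predict.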


For the whole words, I just used lists like:

https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/google-10000-english.txt

https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/20k.txt

I will add the list.
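For anyone who wants to reproduce the idea, a sketch of that single-token check against such a list (with GPT-2's public tokenizer as a stand-in, since Claude's is not packaged as a library; the count will therefore not match the article's 8,311):

```python
# Count how many words from the 10k list encode as exactly one token.
# GPT-2's tokenizer stands in for Claude's, so the total will differ.
import urllib.request
from transformers import GPT2TokenizerFast

URL = ("https://raw.githubusercontent.com/first20hours/"
       "google-10000-english/refs/heads/master/google-10000-english.txt")
words = urllib.request.urlopen(URL).read().decode().split()

tok = GPT2TokenizerFast.from_pretrained("gpt2")
# Leading space matters: BPE vocabs typically store word-initial tokens
# with the space attached (" the"), not bare ("the").
single = [w for w in words if len(tok.encode(" " + w)) == 1]
print(f"{len(single)} of {len(words)} words are single tokens")
```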

Regarding JJJJ: the ":" was not part of the input, just a formatting leftover.
