In your article, it seems you determined a list of 8,311 common words that are each a single token in the Claude vocabulary. I was hoping you would publish that LIST of 8,311 common words. If there is a different list of most-common words that are tokenized as whole words (e.g., 7,682 of them as single tokens), please share that list as well.
P.S. Concerning your question:
Another curious case
J: 1 token
JJ: 2 tokens
JJJ: 3 tokens
JJJJ: 5 tokens(!)
The "riddle" might be "solved" with further investigation of invisible characters (like \n or \t) as follows:
"J:" is a distinct token.
":" is a distinct token.
"J" is also a distinct token.
"J-line-return" (or J-tab) is also a distinct token.
or
":-line-return"(e.g. : plus invisible \n) is a distinct token.
"line-return" (e.g., invisible \n) is a distinct token
Thus:
[J:] or [J:\n]: 1 token
[J][J:] or [J][J:\n]: 2 tokens
[J][J][J:] or [J][J][J:\n]: 3 tokens
[J][J][J][J:][\n] or [J][J][J][J][:] or [J][J][J][J][:\n]: 5 tokens(!)
Of all the labs, I'd be most interested in an open-weights Claude, so we could know for certain.
This article was noted in Interconnects by Nathan Lambert.
I am interested in datasets with limited vocabulary (https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py), and I have built some simple whole-word tokenizers for research and educational purposes:
https://huggingface.co/MartialTerran/coherent_text_from_1_megabyte_GPT2_model
More generally:
https://huggingface.co/MartialTerran/Toy_GPTs_LLMs_for_CPU_Educational/blob/main/Gettysburg_GPT2_v1.4.2.py
[uses whole-word tokenization to miniaturize ToyGPT models]
For the whole words, I just used lists like:
https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/google-10000-english.txt
https://raw.githubusercontent.com/first20hours/google-10000-english/refs/heads/master/20k.txt
I will add the list.
Regarding JJJJ: the ":" was not part of the input, just a formatting leftover.