@aredridel @maco @futurebird @EricLawton @david_chisnall Tokens tend to be common short words or common series of letters. They're usually derived from byte-pair encoding (BPE) over a large corpus.
Basically: take a text, count all adjacent pairs of characters, replace the most common pair with a new symbol, and repeat until you reach the desired vocab size.
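Roughly what that merge loop looks like in code (a toy sketch of the idea, not any real tokenizer's implementation; the example string and vocab size are made up):

```python
# Toy sketch of the byte-pair-encoding merge loop described above.
from collections import Counter

def bpe_train(text, vocab_size):
    seq = list(text)          # start with the text as single characters
    vocab = set(seq)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair
        merged = a + b                       # the "new character" is the glued pair
        vocab.add(merged)
        # replace every occurrence of the pair with the merged token
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return vocab, seq

vocab, tokens = bpe_train("low lower lowest " * 50, vocab_size=40)
print(tokens[:6])   # whole chunks like 'low' or ' lower', not individual letters
```

The point is that the model only ever sees those merged chunks as opaque IDs; the individual letters inside them are gone by the time the model gets the input.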
As a result, LLMs don't really know how to *spell* and are largely blind to how long words are. Reversing the letters of a word, for instance, is a surprisingly hard task for them.