@aredridel @maco @futurebird @EricLawton @david_chisnall Tokens tend to be common short words or common series of letters. They're usually derived from byte-pair encoding (BPE) over a large corpus.
Basically: take a text, count all adjacent pairs of characters, replace the most common pair with a new symbol, and repeat until you reach the desired vocab size.
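Roughly what that merge loop looks like in code (a toy sketch of the idea, not any real tokenizer's implementation; the example string and vocab size are made up):

```python
# Toy sketch of the byte-pair-encoding merge loop described above.
from collections import Counter

def bpe_train(text, vocab_size):
    seq = list(text)          # start with the text as single characters
    vocab = set(seq)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair
        merged = a + b                       # the "new character" is the glued pair
        vocab.add(merged)
        # replace every occurrence of the pair with the merged token
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return vocab, seq

vocab, tokens = bpe_train("low lower lowest " * 50, vocab_size=40)
print(tokens[:6])   # whole chunks like 'low' or ' lower', not individual letters
```

The point is that the model only ever sees those merged chunks as opaque IDs; the individual letters inside them are gone by the time the model gets the input.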
As a result, LLMs don't really know how to *spell* and are largely blind to how long words are. Reversing the letters of a word, for instance, is a surprisingly hard task for them.