This is a tokenizer. It was originally designed as a compression algorithm but was later repurposed as an encoder that turns plaintext into tokens.
Algorithm
- Start with a target vocabulary size and a starting vocabulary size
- Initially, treat each character in the stream as its own token
- Repeatedly merge the most frequent pair of adjacent tokens into a single new token
- Continue until the vocabulary reaches the target size (see the sketch after this list)
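A minimal sketch of this merge loop in Python. The names `train_bpe`, `merge_pair`, and `encode`, the `target_vocab_size` parameter, and the example string are illustrative assumptions, not part of any particular library:

```python
from collections import Counter

def train_bpe(text, target_vocab_size):
    # Start by treating every character in the stream as its own token.
    tokens = list(text)
    vocab = set(tokens)
    merges = []  # ordered list of (left, right) pairs that get merged

    while len(vocab) < target_vocab_size:
        # Count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; merging further would not help
        merged = left + right
        merges.append((left, right))
        vocab.add(merged)
        tokens = merge_pair(tokens, left, right, merged)

    return merges, vocab

def merge_pair(tokens, left, right, merged):
    # Replace each occurrence of the adjacent pair (left, right) with `merged`.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text, merges):
    # Apply the learned merges, in training order, to turn plaintext into tokens.
    tokens = list(text)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right, left + right)
    return tokens

merges, vocab = train_bpe("low lower lowest", target_vocab_size=20)
print(encode("lowest", merges))
```

Keeping the merges in training order matters: encoding new text replays them in the same order they were learned, so frequent pairs from training collapse into single tokens again.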