My tokenizer is faster than yours (maybe)
An O(m*log n) optimization I found to speed up tokenizer throughput
In this article I’ll run through how I used Rust and pyo3 to implement a fast BPE tokenizer (4x faster than Hugging Face’s tokenizers and as fast as tiktoken) which you can install from PyPI today!
All the code mentioned in this post can be found on GitHub.

This article is about an efficient way I discovered to quickly encode and decode sequences using a pretrained tokenizer, rather than an in-depth review of byte-pair encoding. For that, feel free to check out any of these links: Wikipedia, Hugging Face, tiktoken.
A naive approach - O(mn)
Suppose you’ve already trained a tokenizer (doing this is trivial), i.e. you have some hashmap-like object that lets you map a sequence of bytes to unique tokens. What’s the fastest way to use this lookup table to encode and decode your data?
The easiest approach is to loop over all the possible token merges in the order we learned them and replace any matches we find along the way. For example, if your tokenizer has the following token mapping rules:
{(97, 97): 128, (128, 97): 129, (129, 98): 130}
Then the encoding for “aaabcaaab” (or [97,97,97,98,99,97,97,97,98] as a byte array) would look like this (notice how everything gets compressed):
1. [97,97,97,98,99,97,97,97,98] -> [128,97,98,99,128,97,98]
2. [128,97,98,99,128,97,98] -> [129,98,99,129,98]
3. [129,98,99,129,98] -> [130,99,130]
In Rust you can write that loop as:
```rust
use std::collections::HashMap;

type Rank = u32;
```
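Concretely, the loop body can be sketched like this. It builds on the declarations above and treats each merged token’s id as its rank (lower id = learned earlier, as in the example mapping); the name encode_naive and its exact signature are illustrative rather than the repo’s code:

```rust
// Naive encoding: for every learned merge, in the order it was learned,
// do a full pass over the sequence and replace every occurrence of the pair.
fn encode_naive(bytes: &[u8], merges: &HashMap<(Rank, Rank), Rank>) -> Vec<Rank> {
    // Promote raw bytes to token ids.
    let mut tokens: Vec<Rank> = bytes.iter().map(|&b| b as Rank).collect();

    // Visit merges in the order they were learned (lower merged id = earlier).
    let mut ordered: Vec<(&(Rank, Rank), &Rank)> = merges.iter().collect();
    ordered.sort_by_key(|&(_, &id)| id);

    for (&(left, right), &new_token) in ordered {
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == left && tokens[i + 1] == right {
                merged.push(new_token); // replace the matching pair
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
    }
    tokens
}
```

Running encode_naive on the byte array for “aaabcaaab” with the three-rule mapping above reproduces the [130, 99, 130] result from the worked example.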
For a vocabulary size of 50257, the token throughput with this approach for a 2.5MB subset of the wikitext dataset is somewhere in the neighborhood of 0.09MB/s.
This approach definitely gets the job done, but it’s incredibly inefficient! Indeed, this solution has a time complexity of $O(mn)$, where $m$ is the vocab size and $n$ is the length of the text you want to encode. As the vocabulary size and/or the length of the text increases, we get significant slowdowns.
A better solution - O(m log(n))
The approach I ended up stumbling across after around 6 hours of refactoring is closer to $O(m\log{n})$. It relies on the fact that we don’t need a separate pass over the sequence for every entry in the hashmap: as long as merges are applied in a way that respects the order the tokenizer learned them in, we can apply several different merges in a single pass instead of searching for one pattern at a time. We can also detect early that no more merging is possible and break out of the function.
If you’ve been grinding your fair share of Leetcode, this sort of two-pointers approach should feel familiar… The full implementation lives in lib.rs in the repo.
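To make the idea concrete, here is a simplified sketch of that single-pass merging, written from scratch rather than copied from lib.rs (the names and exact structure differ from the actual implementation). Each pass pushes tokens onto an output stack and merges a pair whenever no earlier-learned merge applies to the pair immediately to its left or right, which is exactly the condition that keeps the result consistent with applying merges one at a time in learned order; passes repeat until one of them makes no merges.

```rust
use std::collections::HashMap;

type Rank = u32;

// One left-to-right pass. Several different merges can fire in a single sweep,
// as long as each one is "locally best": no neighbouring pair was learned earlier.
fn merge_pass(tokens: &[Rank], merges: &HashMap<(Rank, Rank), Rank>) -> (Vec<Rank>, bool) {
    // Rank of a candidate pair, or Rank::MAX if it can't be merged.
    let rank_of = |a: Option<&Rank>, b: Option<&Rank>| -> Rank {
        match (a, b) {
            (Some(&a), Some(&b)) => merges.get(&(a, b)).copied().unwrap_or(Rank::MAX),
            _ => Rank::MAX,
        }
    };

    let mut out: Vec<Rank> = Vec::with_capacity(tokens.len());
    let mut merged_any = false;
    let mut i = 0;
    while i < tokens.len() {
        let cur = rank_of(tokens.get(i), tokens.get(i + 1));
        let left = rank_of(out.last(), tokens.get(i)); // pair with the top of the stack
        let right = rank_of(tokens.get(i + 1), tokens.get(i + 2)); // next pair in the input
        if cur != Rank::MAX && cur < left && cur <= right {
            // Safe to merge now: neither neighbouring pair would be merged first
            // by the learned-order procedure.
            out.push(merges[&(tokens[i], tokens[i + 1])]);
            merged_any = true;
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    (out, merged_any)
}

// Repeat passes until a sweep makes no merges -- the early exit mentioned above.
fn encode(bytes: &[u8], merges: &HashMap<(Rank, Rank), Rank>) -> Vec<Rank> {
    let mut tokens: Vec<Rank> = bytes.iter().map(|&b| b as Rank).collect();
    loop {
        let (next, merged_any) = merge_pass(&tokens, merges);
        tokens = next;
        if !merged_any {
            return tokens;
        }
    }
}
```

On the “aaabcaaab” example this takes three merging passes plus one final pass that confirms nothing is left to merge, and each pass applies every merge it can rather than hunting for a single pattern.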
On the same wikitext split our throughput with this encoding algorithm jumps to 24.35MB/s. That’s over a 100x improvement over where we started. I took a lot of inspiration from OpenAI’s official implementation in their tiktoken repo, but handled the merging quite differently by storing the prospective merges in a stack instead of finding the single best merge at each iteration.
Python bindings
To expose the Rust code in Python I made use of pyo3 and maturin. Getting started with these libraries is incredibly easy and just requires adding a few pyo3 attributes to your existing Rust code. What’s also nice is that maturin automatically adds a CI workflow to your project, which makes distributing your Python package infinitely easier. By default the workflow listens for new tag pushes to the main branch and builds wheels for all the major platforms.
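As a rough picture of what those attributes look like, here is a stripped-down, hypothetical binding (not the actual toktokenizer source; the exact #[pymodule] function signature also varies between pyo3 versions):

```rust
use pyo3::prelude::*;
use std::collections::HashMap;

// A Rust struct exposed to Python as a class.
#[pyclass]
struct BPETokenizer {
    merges: HashMap<(u32, u32), u32>,
}

// Methods inside a #[pymethods] block become Python methods.
#[pymethods]
impl BPETokenizer {
    #[new]
    fn new() -> Self {
        BPETokenizer { merges: HashMap::new() }
    }

    fn encode(&self, text: &str) -> Vec<u32> {
        // Placeholder: the real method would look up self.merges and
        // run the merge loop shown earlier.
        text.bytes().map(|b| b as u32).collect()
    }
}

// The module initializer that maturin builds into an importable wheel.
#[pymodule]
fn toktokenizer(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_class::<BPETokenizer>()?;
    Ok(())
}
```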
I encourage you to check out a few of the official examples and the pyo3 docs; overall, though, it’s a pretty frictionless experience.
Using maturin I published toktokenizer - a lightweight Python package for BPE tokenizers - to PyPI. The only class the library exposes is BPETokenizer. The class itself is pretty minimal; the snippet below sketches the major methods (exact names may differ slightly):
```python
# demo.py (illustrative usage; method names here are approximate)
from toktokenizer import BPETokenizer

bpe = BPETokenizer()

# train a byte-pair tokenizer on some corpus
train_data = "aaabcaaab"
vocab_size = 8
bpe.train(train_data, vocab_size)

# save tokenizer state
bpe.save("tokenizer.json")
# load tokenizer from dumped file
bpe.load("tokenizer.json")

# encode and decode
ids = bpe.encode("aaabcaaab")
text = bpe.decode(ids)
```
To get started with toktokenizer today, you can install it with pip:

```
pip install toktokenizer
```
I’m looking forward to using this library as I build up various components of modern NLP models from scratch!