Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Negar Foroutan; Clara Meister; Debjit Paul; Joel Niklaus; Sina Ahmadi; Antoine Bosselut; Rico Sennrich

ACL 2026

Abstract

Tokenization is the first—and often least scrutinized—step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE applies a fair-max rule that maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE reduces tokenization inequality—operationalized by the Gini coefficient of per-language token costs—by up to 89% relative to Classical BPE. This comes with negligible impact on global compression rate and no evidence of systematic degradation in downstream LM performance.
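The fair-max rule and the Gini-based inequality measure described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes character-level initial symbols, per-language word lists as the training data, and tokens per character as a stand-in for the per-language token cost. All function names (`parity_aware_merges`, `tokens_per_char`, `gini`) are hypothetical.

```python
from collections import Counter


def pair_counts(corpus):
    """Count adjacent symbol pairs in a corpus given as a list of symbol lists."""
    counts = Counter()
    for word in corpus:
        counts.update(zip(word, word[1:]))
    return counts


def apply_merge(corpus, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged


def tokens_per_char(corpus):
    """Assumed cost proxy: number of tokens divided by number of characters."""
    n_chars = sum(len(sym) for word in corpus for sym in word)
    n_tokens = sum(len(word) for word in corpus)
    return n_tokens / n_chars


def parity_aware_merges(corpora, num_merges):
    """Sketch of a fair-max merge loop.

    At each step, pick the currently worst-compressed language and apply
    the merge that is most frequent in that language alone, then apply the
    same merge to every language (the vocabulary is shared).
    """
    corpora = {lang: [list(w) for w in words] for lang, words in corpora.items()}
    merges = []
    for _ in range(num_merges):
        worst = max(corpora, key=lambda lang: tokens_per_char(corpora[lang]))
        counts = pair_counts(corpora[worst])
        if not counts:
            break  # simplification: stop once the worst language has no pairs left
        best_pair = max(counts, key=counts.get)
        merges.append(best_pair)
        corpora = {lang: apply_merge(c, best_pair) for lang, c in corpora.items()}
    return merges, corpora


def gini(values):
    """Gini coefficient of non-negative values (0 means perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    toy = {"a": ["abab"], "b": ["cd"]}  # toy data, purely illustrative
    merges, final = parity_aware_merges(toy, num_merges=2)
    costs = [tokens_per_char(c) for c in final.values()]
    print(merges, costs, gini(costs))
```

In this toy run the second merge goes to language `b`, because after the first merge it is the worst-compressed language. A purely frequency-driven rule would instead rank candidate merges by their pooled counts across all languages.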

Authors

Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Topics

Natural Language Processing > Resources & Methods > Multilingual NLP
Artificial Intelligence > Core AI > Fairness

Keywords

low-resource language, byte-pair encoding, cross-lingual fairness
