TokCollate: A Comprehensive Tool for Tokenizer Evaluation and Visualization across Languages
Abstract
Tokenization quality varies significantly across languages, contributing to disparities in LLM performance and cost for speakers of less-resourced languages – a phenomenon known as the "token premium" problem. Despite growing research interest, no existing tool provides a comprehensive intrinsic evaluation of tokenizers paired with interactive visualization. We present TokCollate (pronounced similarly to "chocolate"), a Python-based evaluation framework combined with a JavaScript visualization interface that addresses this gap. TokCollate implements a wide range of intrinsic metrics, including monolingual measures such as average token length and Rényi/Shannon efficiency, and cross-lingual measures such as vocabulary overlap, Jensen-Shannon divergence, alignment-based Eflomal scores, and length ratios. It further enables analysis across language groups defined by genealogical families, scripts, geographic regions, speaker populations, and estimated data availability. TokCollate is open-source under the MIT license and available on GitHub.
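To make the metrics named above concrete, the following is a minimal sketch of how a few of them could be computed from already-tokenized text. It is not TokCollate's API: every function name here is hypothetical, and the definitions are the usual textbook ones (Rényi entropy of the unigram token distribution normalized by the log of the vocabulary size, Jaccard overlap of observed token types, base-2 Jensen-Shannon divergence), which may differ in detail from what the tool implements. Inputs are assumed to be non-empty lists of token strings.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Map each token type to its relative frequency in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def average_token_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)


def renyi_efficiency(tokens, alpha=2.0, vocab_size=None):
    """Rényi entropy of the unigram distribution divided by log(vocab_size).

    alpha == 1 is treated as the Shannon limit, giving Shannon efficiency.
    Assumptions: alpha=2.0 is an arbitrary illustrative default, and
    vocab_size falls back to the number of observed token types.
    """
    p = unigram_distribution(tokens)
    size = vocab_size if vocab_size is not None else len(p)
    if size < 2:
        return 0.0  # efficiency is undefined for a one-symbol vocabulary
    if math.isclose(alpha, 1.0):
        entropy = -sum(q * math.log(q) for q in p.values())
    else:
        entropy = math.log(sum(q ** alpha for q in p.values())) / (1.0 - alpha)
    return entropy / math.log(size)


def vocabulary_overlap(tokens_a, tokens_b):
    """Jaccard overlap between the token types observed in two corpora."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


def jensen_shannon_divergence(tokens_a, tokens_b):
    """Base-2 Jensen-Shannon divergence between two unigram distributions."""
    p = unigram_distribution(tokens_a)
    q = unigram_distribution(tokens_b)

    def kl_to_mixture(dist):
        # KL(dist || M), where M is the equal-weight mixture of p and q.
        return sum(
            x * math.log2(x / (0.5 * (p.get(t, 0.0) + q.get(t, 0.0))))
            for t, x in dist.items()
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

Under these definitions, a corpus whose tokens are uniformly distributed has efficiency 1 for any alpha, and two corpora with no shared token types have a Jensen-Shannon divergence of 1, the maximum in base 2.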