View publication

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

Diagram of Hindi-to-English translation pipeline: left shows source encoding and generation process, right shows fallback network segmentation, patch rendering, and word embedding output.
Illustration of fallback network: text segmented, rendered into bigram patches, and embedded for input into decoder-only LLM.

Related readings and updates.

This paper was accepted at the workshop I Can’t Believe It’s Not Better! (ICBINB) at NeurIPS 2023. Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap, and find that pre-trained language…
Read more
Federated learning with differential privacy, i.e. private federated learning (PFL), makes it possible to train models on private data distributed across users’ devices without harming privacy. PFL is efficient for models, such as neural networks, that have a fixed number of parameters, and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not tokenizers, the topic of this work. Training a tokenizer…
Read more