Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Authors: Ming Gui†‡*, Johannes Schusterbauer†‡*, Timy Phan†‡, Felix Krause†‡, Josh Susskind, Miguel Angel Bautista, Björn Ommer†‡
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
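To make the training objective concrete, below is a minimal PyTorch sketch combining the flow matching loss with the cosine-similarity regularizer described in the abstract. This is an illustration under stated assumptions, not the paper's implementation: the linear adapter, the decoder signature, and the `lambda_cos` weight are all hypothetical.

```python
# Minimal sketch of a RepTok-style objective: flow matching on images,
# conditioned on a single adapted SSL token, plus a cosine regularizer.
# All module shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepTokSketch(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, decoder: nn.Module, dim: int = 768):
        super().__init__()
        self.ssl_encoder = ssl_encoder          # pre-trained SSL ViT, kept frozen
        for p in self.ssl_encoder.parameters():
            p.requires_grad = False
        self.adapter = nn.Linear(dim, dim)      # fine-tuned token embedding (assumed form)
        self.decoder = decoder                  # generative decoder, e.g. a flow-matching net

    def loss(self, img: torch.Tensor, lambda_cos: float = 0.1) -> torch.Tensor:
        with torch.no_grad():
            z_ssl = self.ssl_encoder(img)       # original semantic token, (B, dim)
        z = self.adapter(z_ssl)                 # adapted single-token latent

        # Standard flow matching: regress the straight-line velocity
        # along the interpolation from noise x0 to the image x1.
        x1 = img
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
        x_t = (1 - t) * x0 + t * x1
        v_target = x1 - x0
        v_pred = self.decoder(x_t, t.flatten(), z)  # decoder conditioned on the token

        fm_loss = F.mse_loss(v_pred, v_target)
        # Cosine term keeps the adapted token aligned with the SSL geometry.
        cos_loss = 1.0 - F.cosine_similarity(z, z_ssl, dim=-1).mean()
        return fm_loss + lambda_cos * cos_loss
```

Freezing the encoder and regularizing only the adapted token is what lets the latent space stay smooth while absorbing the low-level detail needed for reconstruction.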
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
September 22, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conference: NeurIPS
Autoregressive models have driven remarkable progress in language modeling. While their reliance on discrete tokens, unidirectional context, and single-pass decoding is central to this success, it also motivates exploring a design space with new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose…
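The abstract is truncated above, so the following is only a generic sketch of the core idea it names: autoregressive density modeling over continuous token embeddings. Here a causal transformer parameterizes a single affine flow step (equivalently a conditional Gaussian) per position; the paper's actual flow architecture is not reproduced, and all names and shapes are assumptions.

```python
# Sketch: autoregressive likelihood over continuous embeddings via a
# per-step affine flow head. Purely illustrative, PyTorch assumed.
import torch
import torch.nn as nn

class ContinuousARSketch(nn.Module):
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.proj_in = nn.Linear(dim, hidden)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(hidden, dim)
        self.to_log_sigma = nn.Linear(hidden, dim)

    def nll(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, T, dim) continuous token embeddings; returns mean NLL."""
        B, T, _ = x.shape
        h = self.proj_in(x[:, :-1])             # condition on the prefix
        mask = nn.Transformer.generate_square_subsequent_mask(T - 1).to(x.device)
        h = self.backbone(h, mask=mask)
        mu, log_sigma = self.to_mu(h), self.to_log_sigma(h)
        target = x[:, 1:]
        # Affine flow step z = (x - mu) / sigma with log|det| = -sum(log_sigma);
        # Gaussian NLL up to the constant 0.5 * dim * log(2*pi), omitted here.
        z = (target - mu) * torch.exp(-log_sigma)
        return (0.5 * (z ** 2).sum(-1) + log_sigma.sum(-1)).mean()
```

Replacing the softmax over a discrete vocabulary with a density head of this kind is what moves the model from token space into a continuous latent space.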
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
February 19, 2025 · Research area: Computer Vision
This work was done in collaboration with the Swiss Federal Institute of Technology Lausanne (EPFL).
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid…
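Since the abstract above is truncated, here is only a hedged sketch of the general mechanism it alludes to: resampling a 2D patch grid into a 1D token sequence of flexible length via cross-attention from learned queries, in the spirit of a Perceiver-style resampler. This is not FlexTok's actual architecture; every name and shape is an assumption.

```python
# Sketch: learned 1D queries cross-attend onto flattened 2D patch
# embeddings, producing k tokens independent of the grid layout.
import torch
import torch.nn as nn

class Resampler1DSketch(nn.Module):
    def __init__(self, dim: int = 256, max_tokens: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patches: torch.Tensor, k: int) -> torch.Tensor:
        """patches: (B, N, dim) flattened patch embeddings; k <= max_tokens.
        Returns a 1D sequence of k tokens."""
        q = self.queries[:k].unsqueeze(0).expand(patches.size(0), -1, -1)
        tokens, _ = self.attn(q, patches, patches)   # queries attend to patches
        return tokens                                # (B, k, dim)
```

Because the number of queries k is chosen at call time, the same module can emit shorter or longer token sequences per image, which is the "flexible length" property the title refers to.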