Paper, July 2025

Point-3D LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models

Authors: Hugues Thomas, Chen Chen, Jian Zhang

Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly rely only on 2D image features and use varied tokenization approaches. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while maintaining consistent model backbones and parameters. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata-pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We emphasize our analysis of token structures as a key contribution, alongside transparent reporting of results averaged over multiple seeds, a practice we believe is vital for robust progress in the field.

Figure 1: Overview of our approach. (a) We solve 3D scene understanding tasks with a multimodal LLM that merges image semantics, shape patterns, and location information in unified spatial tokens. (b) These tokens are built from SigLIP (Zhai et al., 2023) (image encoder) features, Sonata (Chen et al., 2023) (point cloud encoder) features, and position encodings. The point cloud features are pooled to the image feature locations (obtained from corresponding depths) via nearest neighbor interpolation. (c) The tokens' structure and permutation directly impact performance. In addition to the standard video-based structure, we propose a view-sensitive subsampling method (FPS6D) to obtain efficient point-based token structures. (d) Our approach achieves state-of-the-art performance across various 3D scene understanding benchmarks.
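To make two steps of the caption concrete, the sketch below illustrates (i) nearest neighbor pooling of point cloud features to the 3D locations of image features, and (ii) plain farthest point sampling (FPS), the generic technique that the name FPS6D suggests. It is a minimal illustration, not the authors' implementation: the function names `nearest_neighbor_pool` and `farthest_point_sampling` are hypothetical, the brute-force search stands in for whatever spatial index a real pipeline would use, and the exact 6D descriptor used by FPS6D is not given in this text, so the sampler simply accepts vectors of any dimension (for example, a position concatenated with a view-dependent component would be one assumed choice).

```python
from math import dist


def nearest_neighbor_pool(patch_xyz, point_xyz, point_feats):
    """For each image-feature location, copy the feature of the closest 3D point.

    patch_xyz:   list of 3D locations of image features (from depth).
    point_xyz:   list of 3D point cloud coordinates.
    point_feats: list of per-point feature vectors (same order as point_xyz).
    """
    pooled = []
    for p in patch_xyz:
        # Brute-force nearest neighbor; fine for a sketch, slow for real scenes.
        nearest = min(range(len(point_xyz)), key=lambda i: dist(p, point_xyz[i]))
        pooled.append(point_feats[nearest])
    return pooled


def farthest_point_sampling(vectors, k, start=0):
    """Generic FPS: greedily pick the vector farthest from those already picked.

    Works in any dimension. Treating the input as 6D descriptors is an
    assumption about FPS6D, not something stated in the text above.
    Returns the selected indices in selection order.
    """
    selected = [start]
    # Distance from every vector to its closest selected vector so far.
    min_d = [dist(v, vectors[start]) for v in vectors]
    while len(selected) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=lambda i: min_d[i])
        selected.append(nxt)
        min_d = [min(d, dist(v, vectors[nxt])) for d, v in zip(min_d, vectors)]
    return selected


if __name__ == "__main__":
    patches = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    points = [(0.9, 1.0, 1.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
    feats = [[1, 2], [3, 4], [5, 6]]
    print(nearest_neighbor_pool(patches, points, feats))
    print(farthest_point_sampling([(0.0,), (1.0,), (10.0,), (4.0,)], k=3))
```

Because FPS returns indices in the order they were selected, the same call yields both a subsample and one possible ordering of the tokens; whether this matches the ordering studied in the paper is not stated here.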

