Multimodal large language models (MLLMs) exhibit impressive vision-language capabilities but often struggle with fine-grained spatial understanding. We introduce FERRET, a novel MLLM capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. We propose a hybrid region representation that marries discrete coordinates with continuous visual features, endowing the model with versatile referring ability. To strengthen this capability, we construct a comprehensive refer-and-ground instruction-tuning dataset that contains hierarchical spatial knowledge and flexible location-aware instructions, and that is designed to promote model robustness. Our evaluations show that FERRET achieves superior performance on conventional referring and grounding tasks as well as on region-based, localization-demanding multimodal chat, and exhibits a notable reduction in object hallucination.
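
To make the hybrid region representation concrete, the sketch below shows one way the discrete and continuous parts could be combined: coordinates are quantized into coordinate text, while visual features inside the referred region are pooled into a single vector. The function name, tensor shapes, and binning scheme are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a hybrid region representation: a referred region
# is encoded as discrete coordinate text plus a continuous feature vector
# pooled from the image feature map. Names and shapes are assumptions.
import torch

def hybrid_region_representation(feat_map: torch.Tensor,
                                 region_mask: torch.Tensor,
                                 num_bins: int = 1000):
    """feat_map: (C, H, W) visual features; region_mask: (H, W) binary mask
    of the referred region (point, box, or free-form shape)."""
    c, h, w = feat_map.shape
    mask = region_mask.bool()

    # Continuous part: average the visual features that fall inside the region.
    region_feature = feat_map[:, mask].mean(dim=-1)            # (C,)

    # Discrete part: quantize the region's bounding box into coordinate bins
    # and render it as text the LLM can read.
    ys, xs = mask.nonzero(as_tuple=True)
    box = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()
    box = box / torch.tensor([w, h, w, h]) * (num_bins - 1)
    coord_text = "[{}, {}, {}, {}]".format(*[int(v) for v in box.round()])

    return coord_text, region_feature
```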

Figure 1: Ferret enables referring and grounding capabilities for multimodal large language models (LLMs). For referring, a user can refer to a region or an object with a point, a box, or any free-form shape. The region_N (green) placeholder in the input is replaced by the proposed hybrid representation before being fed into the LLM. For grounding, Ferret can accurately ground any open-vocabulary description; box_N (red) in the output denotes the predicted bounding-box coordinates.
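
As an illustration of the referring flow in Figure 1, the following sketch shows how a region placeholder in the user prompt might be expanded into coordinate text plus a special feature token before the sequence reaches the LLM. The token name and helper function are hypothetical, not the released interface.

```python
# Illustrative sketch (not the released implementation): the region
# placeholder is replaced by discrete coordinates plus a special token whose
# embedding is later overwritten with the continuous region feature.
REGION_FEATURE_TOKEN = "<region_fea>"   # assumed special-token name

def expand_referring_prompt(prompt: str, coord_text: str) -> str:
    """E.g. 'What is this object <region>?' ->
    'What is this object [120, 340, 510, 720] <region_fea>?'"""
    return prompt.replace("<region>", f"{coord_text} {REGION_FEATURE_TOKEN}")
```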
Figure 2: Overview of the Ferret model architecture. (Left) The proposed hybrid region representation and spatial-aware visual sampler. (Right) The overall model architecture; all parameters except the image encoder are trainable.
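
The condensed PyTorch sketch below mirrors the flow in Figure 2 under stated assumptions: a frozen image encoder, trainable projections for image and region features, and an LLM that consumes the combined embedding sequence. Module names, dimensions, and the `inputs_embeds` interface are assumptions rather than the released code.

```python
# Hypothetical sketch of the overall forward pass: frozen image encoder,
# trainable projections, and an LLM over image, region, and text embeddings.
import torch
import torch.nn as nn

class FerretSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder.eval()       # kept frozen
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        self.visual_proj = nn.Linear(vis_dim, llm_dim)  # trainable
        self.region_proj = nn.Linear(vis_dim, llm_dim)  # trainable
        self.llm = llm                                  # trainable

    def forward(self, image, region_feature, text_embeds):
        with torch.no_grad():
            img_tokens = self.image_encoder(image)       # (B, N, vis_dim)
        img_tokens = self.visual_proj(img_tokens)        # (B, N, llm_dim)
        region_embed = self.region_proj(region_feature)  # (B, llm_dim)
        # Concatenate image tokens, the region embedding, and text embeddings
        # into one sequence for the language model.
        seq = torch.cat([img_tokens, region_embed[:, None, :], text_embeds],
                        dim=1)
        return self.llm(inputs_embeds=seq)
```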

Related readings and updates.

4M: Massively Multimodal Masked Modeling

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer…

Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

We present Spatial LibriSpeech, a spatial audio dataset with over 570 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with >220k simulated acoustic conditions across >8k synthetic rooms…