Ferret-v2: An Improved Baseline for Referring and Grounding

Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang


While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by its fixed, pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that handles higher image resolutions, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
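To make the any-resolution idea in design (1) more concrete, the following is a minimal sketch of one common way to handle high-resolution inputs: pick a tile grid that roughly matches the image's aspect ratio, then pair a low-resolution global view with a set of local tiles. This is an illustration under stated assumptions, not the authors' implementation. The tile size of 336 pixels, the candidate grids, and the names `select_grid`, `tile_boxes`, and `plan_views` are all hypothetical; the abstract does not specify them, nor which encoder processes which view.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical candidate grids as (cols, rows); not taken from the paper.
DEFAULT_GRIDS: Tuple[Tuple[int, int], ...] = ((1, 2), (2, 1), (2, 2), (1, 3), (3, 1))


def select_grid(width: int, height: int,
                candidates: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's."""
    image_ratio = width / height
    return min(candidates, key=lambda g: abs(g[0] / g[1] - image_ratio))


def tile_boxes(cols: int, rows: int, tile: int) -> List[Box]:
    """Crop boxes covering an image resized to (cols * tile, rows * tile)."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)      # row-major order: top to bottom
        for c in range(cols)      # then left to right
    ]


def plan_views(width: int, height: int, tile: int = 336,
               candidates: Sequence[Tuple[int, int]] = DEFAULT_GRIDS) -> Dict[str, object]:
    """Describe the global view and the local tiles for one input image."""
    cols, rows = select_grid(width, height, candidates)
    return {
        "global_size": (tile, tile),                 # whole image, downscaled
        "resized_size": (cols * tile, rows * tile),  # image resized to fit the grid
        "grid": (cols, rows),
        "local_boxes": tile_boxes(cols, rows, tile), # one crop per local tile
    }
```

Under these assumptions, a wide 1344x672 image would be resized to fit a 2x1 grid and produce two local tiles alongside the global view. In a multi-granularity setup such as design (2), the global view and the local tiles could then be passed to separate encoders and their features merged before reaching the LLM.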

Figure 1: (a) The comparison showcases Ferret-v2's superior referring and grounding abilities over Ferret, particularly in identifying objects and texts within small regions. (b) Ferret-v2 notably exceeds Ferret's performance in tasks requiring detailed regional and global reasoning and understanding (all w/ 7B models).

Figure 2: Overview of the proposed Ferret-v2 model architecture.

