FERRET: Refer and Ground Anything Anywhere at Any Granularity
Authors: Haoxuan You, Haotian (AIML) Zhang, Liangliang Cao, Zhe Gan, Bowen Zhang, Zirui Wang, Xianzhi Du, Shih-Fu Chang, Yinfei Yang
Multimodal Large Language Models (MLLMs) exhibit impressive vision-language capabilities but often struggle with fine-grained spatial understanding. We introduce FERRET, a novel MLLM capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. A hybrid region representation is proposed that marries discrete coordinates with continuous visual features, endowing the model with versatile referring ability. To fortify this capability, we construct a comprehensive refer-and-ground dataset containing hierarchical spatial knowledge and flexible location-aware instruction tuning data, designed to promote model robustness. Our evaluations show that FERRET achieves superior performance on conventional referring and grounding tasks as well as on region-based, localization-demanding multimodal chatting, while exhibiting a notable reduction in object hallucination.
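To make the abstract's "hybrid region representation" concrete, the sketch below illustrates one plausible way to combine discrete coordinates with continuous visual features: a referred region is encoded as quantized box-coordinate tokens plus a feature pooled from the image feature map inside a free-form region mask. This is a minimal illustration, not the authors' implementation; the class name `HybridRegionEncoder`, the masked average pooling (FERRET itself uses a learned spatial-aware visual sampler), and all dimensions and bin counts are assumptions.

```python
# Illustrative sketch of a hybrid region representation (assumed design, not FERRET's code).
import torch
import torch.nn as nn


class HybridRegionEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256, num_bins: int = 1000):
        super().__init__()
        self.num_bins = num_bins
        # Embedding table for discretized coordinate tokens (discrete part).
        self.coord_embed = nn.Embedding(num_bins, feat_dim)
        # Projects the pooled continuous region feature into the same token space.
        self.region_proj = nn.Linear(feat_dim, feat_dim)

    def discretize(self, box_xyxy: torch.Tensor) -> torch.Tensor:
        # Map normalized [0, 1] box coordinates to integer bins.
        return (box_xyxy.clamp(0, 1) * (self.num_bins - 1)).long()

    def forward(self, feat_map: torch.Tensor, box_xyxy: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        """
        feat_map: (C, H, W) image features from a visual encoder.
        box_xyxy: (4,) normalized bounding box of the referred region.
        mask:     (H, W) binary mask of the region (any free-form shape).
        Returns a sequence of region tokens: 4 coordinate tokens + 1 feature token.
        """
        # Discrete part: one embedding per quantized coordinate.
        coord_tokens = self.coord_embed(self.discretize(box_xyxy))      # (4, C)
        # Continuous part: masked average pooling over the region's features
        # (a simplification of a learned visual sampler).
        weights = mask.float().flatten()                                 # (H*W,)
        weights = weights / weights.sum().clamp(min=1e-6)
        region_feat = (feat_map.flatten(1) * weights).sum(dim=1)         # (C,)
        feat_token = self.region_proj(region_feat).unsqueeze(0)          # (1, C)
        # Hybrid representation: coordinate tokens + pooled visual feature token.
        return torch.cat([coord_tokens, feat_token], dim=0)              # (5, C)


if __name__ == "__main__":
    enc = HybridRegionEncoder()
    feats = torch.randn(256, 24, 24)           # dummy feature map
    box = torch.tensor([0.1, 0.2, 0.6, 0.8])   # normalized xyxy box
    mask = torch.zeros(24, 24)
    mask[5:19, 2:15] = 1                       # free-form region given as a mask
    print(enc(feats, box, mask).shape)         # torch.Size([5, 256])
```

In such a scheme, the resulting region tokens could be interleaved with text tokens in the LLM input, letting the model accept points, boxes, or free-form regions through the same interface.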