Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Authors: Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan

Building a generalist model for user interface (UI) understanding is challenging due to several foundational issues, such as platform diversity, resolution variation, and data limitations. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it versatile and adaptable to the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks × 5 platforms), the GUIDE next-action prediction dataset, and the GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI and also shows strong cross-platform transfer capabilities.
