Apple researchers have published a new paper on artificial intelligence (AI) models that focuses on understanding and navigating smartphone user interfaces (UI). The paper introduces a new multimodal large language model (LLM) called Ferret-UI, which can make sense of complex smartphone screens.
The research paper, titled "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs", has been published on arXiv, an open-access online repository of scholarly papers. It describes Ferret-UI, a vision-language model designed to understand and interact with the complex, dynamic interfaces of a smartphone. The authors point out that most language models with multimodal capabilities are largely limited to natural images and struggle to interpret anything beyond them, such as UI screens.
The Ferret-UI model can execute precise referring and grounding tasks specific to UI screens, that is, describing the widget in a given region of the screen and locating an element that matches a description, and it can interpret and act upon open-ended language instructions. The model can process a smartphone screen packed with elements conveying different information and answer users' queries about it.
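To make the idea concrete, here is a minimal illustrative sketch in Python of what referring and grounding queries could look like. The query_model helper, the prompt wording, and the coordinates are assumptions for illustration, not Apple's actual interface.

```python
# Minimal, illustrative sketch (not Apple's actual interface) of the two task
# types described in the paper: "referring" (describe the widget inside a given
# region) and "grounding" (locate the widget that matches a description).
# query_model is a hypothetical stand-in for a call to a UI-focused multimodal model.
import re
from typing import Tuple

def query_model(screenshot: str, prompt: str) -> str:
    """Stand-in for sending a screenshot plus a text prompt to the model."""
    # A real system would return the model's free-form answer here.
    return "The 'Buy' button is at box=[812, 1430, 1012, 1510]."

def ground(screenshot: str, description: str) -> Tuple[int, int, int, int]:
    """Grounding: ask where a described element is and parse the coordinates."""
    answer = query_model(
        screenshot, f"Where is {description}? Reply with box=[x1, y1, x2, y2]."
    )
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", answer.split("box=")[1])[:4])
    return x1, y1, x2, y2

def refer(screenshot: str, box: Tuple[int, int, int, int]) -> str:
    """Referring: ask what UI element sits inside a given screen region."""
    return query_model(screenshot, f"What is the UI element inside {list(box)}?")

if __name__ == "__main__":
    print(ground("home_screen.png", "the button to confirm the purchase"))
```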
The paper shares an image showing the model's ability to understand and classify widgets and recognize icons. The model can also navigate to different parts of an iPhone based on a user's prompt. This demonstrates that the AI can not only explain the screen it sees but also interact with it in a meaningful way.
Apple researchers created training data of varying complexity for Ferret-UI. The simpler data helped the model learn basic tasks and single-step processes. To teach the model advanced tasks, they used GPT-4 to generate data covering detailed description, conversation perception, conversation interaction, and function inference. These advanced tasks prepare the model to engage in more nuanced discussions about visual components, formulate action plans with specific goals in mind, and interpret the general purpose of a screen.
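As a rough sketch of what such a data-generation step could look like, the snippet below hands GPT-4 a textual description of a screen's elements and asks it to write one advanced-task example. The prompt wording, the element list, and the use of the OpenAI client are illustrative assumptions, not the paper's actual pipeline.

```python
# Rough sketch only: ask GPT-4 to turn a screen's element list into an
# advanced-task training example (here, "conversation interaction").
# The prompt wording and element list are made up for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

screen_elements = [
    {"type": "text", "text": "Get unlimited access for $4.99/month", "box": [40, 800, 680, 860]},
    {"type": "button", "text": "Subscribe", "box": [40, 880, 680, 940]},
]

prompt = (
    "You are generating training data for a mobile UI assistant.\n"
    f"Screen elements: {screen_elements}\n"
    "Write one 'conversation interaction' example: a user goal, followed by the "
    "single tap action (element and coordinates) that accomplishes it."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```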
If the research passes peer review, Apple could eventually add powerful tools to the iPhone that perform complex UI navigation tasks from simple text or verbal prompts, a capability that appears ideal for Siri.