
Recent years have witnessed the rapid development of open-world image segmentation, including open-vocabulary segmentation and in-context segmentation. Nonetheless, existing methods are limited to a single prompt modality, which lacks the flexibility and accuracy needed for complex object-aware prompting. In this work, we present COSINE, a unified open-world segmentation model that Consolidates Open-vocabulary Segmentation and IN-context sEgmentation. By framing both open-vocabulary segmentation and in-context segmentation as promptable segmentation tasks, COSINE supports diverse prompt modalities, such as images and text. Built from a model pool and a segdecoder, COSINE makes full use of the representation capability of foundation models and can accurately segment the specific concepts indicated by these prompts, offering powerful open-world perception capabilities. Experiments on various segmentation tasks show the effectiveness of the proposed method.
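To make the promptable-interface idea concrete, the following is a minimal sketch of how one segmentation decoder could be conditioned on either a text prompt or an in-context example (reference image plus mask). The modules below are toy stand-ins, not the actual COSINE model pool or segdecoder, and all class and function names are illustrative assumptions.

```python
# Toy sketch of a unified promptable segmentation interface (not COSINE's code).
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    """Maps a prompt of either modality into a shared query embedding."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.text_embed = nn.EmbeddingBag(vocab, dim)            # stand-in text encoder
        self.image_embed = nn.Sequential(                        # stand-in visual encoder
            nn.Conv2d(4, dim, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, text_ids=None, ref_image=None, ref_mask=None):
        if text_ids is not None:                                  # open-vocabulary prompt
            return self.text_embed(text_ids)
        # In-context prompt: concatenate reference image and mask channel-wise.
        return self.image_embed(torch.cat([ref_image, ref_mask], dim=1))

class ToySegDecoder(nn.Module):
    """Predicts a mask for the query image conditioned on the prompt embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 3, padding=1)           # stand-in image features
        self.head = nn.Conv2d(dim, 1, 1)

    def forward(self, image, prompt_embed):
        feats = self.backbone(image)
        # Condition features on the prompt via a simple per-channel modulation.
        feats = feats * prompt_embed[:, :, None, None]
        return torch.sigmoid(self.head(feats))                    # per-pixel mask scores

# Usage: the same decoder serves both prompting modes.
encoder, decoder = ToyPromptEncoder(), ToySegDecoder()
query = torch.rand(1, 3, 64, 64)
mask_from_text = decoder(query, encoder(text_ids=torch.tensor([[5, 17, 42]])))
ref_img, ref_msk = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
mask_from_example = decoder(query, encoder(ref_image=ref_img, ref_mask=ref_msk))
```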

Related readings and updates.

In 2022, we launched a new systemwide capability that allows users to automatically and instantly lift the subject from an image or isolate the subject by removing the background. This feature is integrated across iOS, macOS, and iPadOS, and is accessible in several apps like Photos, Preview, Safari, Keynote, and more. Underlying this feature is an on-device deep neural network that performs real-time salient object segmentation, categorizing each pixel of an image as part of either the foreground or the background. Each pixel is assigned a score, denoting how likely it is to be part of the foreground. While prior methods often restrict this process to a fixed set of semantic categories (like people and pets), we designed our model to be unrestricted and generalize to arbitrary classes of subjects (for example, furniture, apparel, collectibles), including ones it hasn't encountered during training. While this is an active area of research in Computer Vision, there are many unique challenges that arise when considering this problem within the constraints of a product ready to be used by consumers. This year, we are launching Live Stickers in iOS and iPadOS, as seen in Figure 1, where static and animated sticker creation are built on the technology discussed in this article. In the following sections, we'll explore some of these challenges and how we approached them.
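As a concrete illustration of how a per-pixel foreground score map can drive subject lifting, here is a minimal post-processing sketch. It assumes the score map has already been produced by such a segmentation model (the model itself is not shown), and the function name, array shapes, and 0.5 threshold are assumptions for the example, not the shipped implementation.

```python
# Illustrative post-processing: use foreground scores as an alpha matte.
import numpy as np

def lift_subject(image: np.ndarray, fg_scores: np.ndarray, threshold: float = 0.5):
    """image: HxWx3 uint8; fg_scores: HxW floats in [0, 1]."""
    alpha = np.where(fg_scores >= threshold, fg_scores, 0.0)     # suppress background pixels
    rgba = np.dstack([image.astype(np.float32), alpha * 255.0])  # append alpha channel
    return rgba.astype(np.uint8)                                  # subject on transparent background

# Example with random data standing in for a real photo and real model output.
img = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
scores = np.random.rand(4, 4)
cutout = lift_subject(img, scores)   # HxWx4 RGBA image
```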


Camera (in iOS and iPadOS) relies on a wide range of scene-understanding technologies to develop images. In particular, pixel-level understanding of image content, also known as image segmentation, is behind many of the app's front-and-center features. Person segmentation and depth estimation power Portrait Mode, which simulates effects like shallow depth of field and Stage Light. Person and skin segmentation power semantic rendering in group shots of up to four people, optimizing contrast, lighting, and even skin tones for each subject individually. Person, skin, and sky segmentation power Photographic Styles, which creates a personal look for your photos by selectively applying adjustments to the right areas guided by segmentation masks, while preserving skin tones. Sky segmentation and skin segmentation power denoising and sharpening algorithms for better image quality in low-texture regions. Several other features consume image segmentation as an essential input.
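To illustrate the general idea of mask-guided selective adjustment, the sketch below applies an adjustment globally and then blends it back toward the original wherever a protection mask (for example, a skin mask) is high. This is a generic illustration rather than Apple's rendering pipeline; the function name, mask semantics, and contrast values are assumptions.

```python
# Generic mask-guided blend: adjusted pixels outside the mask, original inside it.
import numpy as np

def apply_masked_adjustment(image, adjusted, protect_mask):
    """image, adjusted: HxWx3 floats in [0, 1]; protect_mask: HxW in [0, 1],
    where 1 marks pixels (e.g. skin) that should keep the original look."""
    m = protect_mask[..., None]                      # broadcast mask over color channels
    return adjusted * (1.0 - m) + image * m          # per-pixel linear blend

# Example: boost contrast everywhere except skin-masked pixels.
img = np.random.rand(4, 4, 3)
contrasty = np.clip((img - 0.5) * 1.3 + 0.5, 0.0, 1.0)
skin_mask = np.zeros((4, 4)); skin_mask[1:3, 1:3] = 1.0
out = apply_masked_adjustment(img, contrasty, skin_mask)
```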
