AI Object Clipping: A Fundamental Advancement from Meta

Meta AI’s research department has developed a new image segmenter based on the Transformer architecture. The Segment Anything Model (SAM) can detect and isolate objects in images in seconds, making it a powerful tool for image processing. Unlike other transformers, SAM was trained exclusively on images, allowing it to develop a general understanding of objects. SAM is also capable of zero-shot transfer, meaning it can correctly isolate unknown objects it has never seen before.
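
As a rough illustration, here is a minimal sketch of how such automatic mask generation can be driven from Python using Meta’s open-source segment-anything package; the image path and checkpoint filename are assumptions:

```python
# Minimal sketch: automatic mask generation with Meta's segment-anything package.
# Assumes `pip install segment-anything opencv-python` and a downloaded ViT-H
# checkpoint; the filenames below are placeholders.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # RGB, HxWx3

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Each entry is a dict containing a boolean "segmentation" mask plus metadata
# such as "area" and "bbox".
masks = mask_generator.generate(image)
print(f"found {len(masks)} object masks")
```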

SAM is particularly useful for separating objects from their background, a crucial task in image and video editing as well as in VR and AR applications. In combination with a language model, SAM could even mask or cut out objects based on spoken instructions; however, the released model does not yet support text commands.
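
To illustrate the cut-out use case, here is a small, generic sketch (independent of SAM’s API) of how a boolean object mask can be used to separate an object from its background as a transparent image; the file names are placeholders:

```python
# Generic sketch: turn a boolean object mask into a transparent cut-out.
# `mask` would typically come from SAM; here it is simply assumed to be a
# boolean HxW array saved to disk, matching the image size.
import numpy as np
from PIL import Image

image = np.array(Image.open("photo.jpg").convert("RGB"))  # HxWx3 uint8
mask = np.load("object_mask.npy")                         # HxW bool (placeholder)

alpha = np.where(mask, 255, 0).astype(np.uint8)           # opaque object, transparent background
rgba = np.dstack([image, alpha])                          # add alpha channel -> HxWx4
Image.fromarray(rgba, mode="RGBA").save("cutout.png")
```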

SAM is easy to control: a single prompt, such as a few clicked pixels, is enough to generate a valid mask for any input. In addition, Meta’s researchers have released SAM’s source code as open source on GitHub, along with a training dataset of 11 million images and 1.1 billion masks.
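
Prompting with a few pixels can look roughly like this, again using the open-source segment-anything package as a sketch; the click coordinates and checkpoint filename are assumptions:

```python
# Sketch: prompting SAM with a single foreground point ("a few pixels").
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)                 # compute the image embedding once

# One clicked pixel (x, y); label 1 marks it as foreground.
point = np.array([[500, 375]])
label = np.array([1])

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,                 # return several candidate masks
)
best_mask = masks[np.argmax(scores)]       # pick the highest-scoring mask
```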

The researchers at Meta AI also tried to avoid statistical bias caused by skewed training data, since certain groups of people and regions are often underrepresented in common training datasets. SAM was therefore trained on high-quality data covering a wide variety of subjects and objects. Meta’s researchers used a multi-stage strategy to build this data: a first phase with freely drawn, manually produced masks, a semi-automatic phase in which annotators refined images that had already been partially masked, and a final stage in which SAM generated masks on its own.

SAM is just one of Meta’s recent developments in AI. The company has also developed DINOv2, a vision model that learns from images without annotated training data and can be used for segmentation. Both SAM and DINOv2 build on ViT, a vision transformer that cuts images into small square patches and adds positional information to each patch recording where in the image it was originally located.
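
As a rough illustration of this patching idea (not the exact SAM or DINOv2 code), an image tensor can be cut into fixed-size squares and given learned positional embeddings like this; the dimensions are arbitrary example values:

```python
# Sketch of ViT-style patching: cut an image into square patches, embed each
# patch, and add a positional embedding recording where it came from.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
image = torch.randn(1, 3, 224, 224)                    # dummy batch of one RGB image

# A strided convolution slices the image into 14x14 non-overlapping 16x16 patches
# and projects each one to a 768-dimensional token.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = to_patches(image).flatten(2).transpose(1, 2)  # (1, 196, 768)

# Learned positional embeddings tell the transformer where each patch was located.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_embed                            # input sequence for the transformer
```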

These new transformer architectures work directly on pixels, enabling a deeper understanding of objects and scenes than earlier methods. SAM, DINOv2, and similar transformer architectures could complement language models and image generators, leading to even more capable AI systems.
