Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation

1 Seoul National University 2 Twelve Labs 3 Allen Institute for AI 4 Google Research
ECCV 2024
*Indicates Equal Contribution

Negative-mined Mosaic Augmentation (NeMo) generates ambiguous examples where a model is encouraged to concretely understand the scene and the query. In the original image, understanding “a woman” is enough to find the target, but in the augmented image, the model needs to understand what “woman in front of the wall” means.

Abstract

Referring Image Segmentation (RIS) is a comprehensive task to segment an object referred to by a textual query from an image. By nature, the difficulty of this task is affected by the presence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We posit that the bottleneck lies in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities, and to understand the whole expression concretely to better locate the right target. Our approach shows consistent improvements across various datasets and models, verified by extensive experiments.

What makes Referring Image Segmentation difficult?

Referring Image Segmentation predicts a segmentation mask of the object referred to by a textual query, given an image and the text. The key to the RIS task is to discern the referent among visually similar objects via textual cues. The difficulty of each RIS scenario depends on the degree of visual ambiguity in the scene combined with the linguistic complexity of the referring expression: the more negative objects the scene contains, the harder the RIS problem becomes, and the model must fully understand the words to discern similar objects. For example, in (a), query (1) demands linguistic understanding of "directing the left side" as well as visual discernment among three road signs, while query (2) involves identifying a "woman", which is relatively easier because there is only a single instance.

Do current RIS models actually perform well in hard scenarios?

We manually pick 100 easy and hard samples depending on the number of negative objects. A huge performance gap exists between the easy and hard examples for current models. We also find that grounding difficulty varies both across and within training datasets; many easy samples are found even in datasets thought to be harder. We ask whether these samples are challenging enough for the model to learn to discern subtle visual and textual nuances for RIS.
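The easy/hard split described above can be approximated programmatically. A minimal sketch is below; the annotation schema (dicts with `id` and `category` keys) and the distractor threshold are assumptions for illustration, not the paper's actual protocol.

```python
def count_negatives(target, annotations):
    """Count annotated instances that share the target's category
    but are not the target itself; these act as distractors."""
    return sum(1 for a in annotations
               if a["category"] == target["category"] and a["id"] != target["id"])

def split_difficulty(target, annotations, hard_threshold=2):
    """Label a sample 'hard' when enough same-category distractors
    coexist in the scene, else 'easy' (threshold is an assumption)."""
    return "hard" if count_negatives(target, annotations) >= hard_threshold else "easy"
```

For instance, a road sign referred to in a scene with two other road signs would be labeled hard, while the only woman in a scene would be labeled easy.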

Negative-mined Mosaic Augmentation

Exposure to hard examples can improve a model in complex scenarios. Complexity arises when multiple similar objects coexist, requiring a deeper understanding of the scene and the expression. We propose Negative-mined Mosaic Augmentation, which mimics such challenging scenarios without any data-collection cost. Given an image and a query, it selects negative images at a proper level of difficulty: using text-to-image retrieval, it filters out images visually or semantically similar to the query (to avoid false negatives) as well as irrelevant (easy) images, then randomly selects three of the remaining images to construct a mosaic.
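The selection step above can be sketched as follows. This is a minimal illustration assuming precomputed query/image embeddings from an alignment model such as CLIP (loading an actual model is omitted); the percentile band boundaries are hypothetical parameters, not the paper's tuned values.

```python
import numpy as np

def mine_negatives(query_emb, image_embs, lo_pct=0.50, hi_pct=0.95, k=3, rng=None):
    """Pick k negatives at a 'proper' difficulty: drop the most similar
    candidates (potential false negatives) and the least similar ones
    (trivially easy), then sample from the middle band."""
    if rng is None:
        rng = np.random.default_rng(0)
    q = query_emb / np.linalg.norm(query_emb)
    X = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = X @ q                          # cosine similarity of each candidate to the query
    order = np.argsort(sims)              # ascending similarity
    n = len(order)
    band = order[int(lo_pct * n): int(hi_pct * n)]   # keep mid-similarity candidates
    return rng.choice(band, size=k, replace=False)

def mosaic(anchor, negatives):
    """Compose a 2x2 mosaic: the original image plus three mined negatives."""
    a, (b, c, d) = anchor, negatives
    top = np.concatenate([a, b], axis=1)
    bot = np.concatenate([c, d], axis=1)
    return np.concatenate([top, bot], axis=0)
```

In training, the ground-truth mask of the augmented sample would occupy only the original image's quadrant, with the three negative quadrants contributing no positive pixels.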

Overall Comparison

We observe a larger performance boost on more complex datasets. Harder datasets benefit more because of their intricate referring expressions and visually dense scenes.

Comparison to other augmentation methods

Our approach performs best, as it preserves the original context needed for referring image segmentation, while the other methods often disrupt key visual elements.

Detailed Analysis of NeMo

Qualitative Results

Qualitative results demonstrate that (a) our method successfully detects a blurred and small "person" in the background, positioned behind the most prominent person in the image. (b) Our method also segments the entire dish, while the baseline only detects the left half without fully understanding the part of the query describing the right half. (c) It is also apparent that our method yields an output with a more distinct shape for "the second" horse. (d) Our method captures objects more accurately in scenarios involving directional expressions, indicating improved understanding of both absolute and relative positions.

Video Presentation

BibTeX
