Referring Image Segmentation (RIS) is the task of segmenting an object referred to by a textual query from an image. By nature, the difficulty of this task is affected by the presence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We posit that the bottleneck lies in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). NeMo augments a training image into a mosaic with three other negative images, carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to set the difficulty level properly, neither too ambiguous nor too trivial. The augmented training data encourage the RIS model to recognize subtle differences and relationships between similar visual entities, and to understand the whole expression in order to better locate the right target. Our approach shows consistent improvements across various datasets and models, verified by extensive experiments.
Referring Image Segmentation predicts a segmentation mask of the object referred to by a given text in a given image. The key to the RIS task is discerning the referent among visually similar objects via textual cues. The difficulty of each RIS scenario depends on the degree of visual ambiguity in the scene and the linguistic complexity of the referring expression. That is, the more negative objects the scene contains, the more difficult the RIS problem becomes, and the model must fully understand the words to distinguish similar objects. For example, in (a), query (1) demands linguistic understanding of "directing the left side" as well as visual discernment among three road signs, while query (2) involves identifying a "woman", which is relatively easy since there is only a single instance.
We manually pick 100 easy and hard samples based on the number of negative objects. A large performance gap exists between easy and hard examples in current models. We also find that grounding difficulty varies both across and within training datasets: many easy samples appear even in datasets considered harder. We thus ask whether these samples are challenging enough for a model to learn the subtle visual and textual nuances required for RIS.
Exposure to hard examples can improve a model in complex scenarios. Complexity arises when multiple similar objects coexist, requiring a deeper understanding of the scene and the expression. We propose Negative-mined Mosaic Augmentation (NeMo), which mimics such challenging scenarios without any data-collection cost. Given an image and a query, it selects negative images at a proper level of difficulty via text-to-image retrieval: images too visually or semantically similar to the query are filtered out to avoid false negatives, and irrelevant (easy) images are discarded as well. It then randomly selects three of the remaining images to construct a mosaic, as sketched below.
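The following is a minimal sketch of the negative-mining and mosaic steps, assuming pre-normalized CLIP image/text embeddings. The percentile band (`lower_pct`, `upper_pct`) and all function names are illustrative assumptions, not the paper's released implementation.

```python
import random
import numpy as np
from PIL import Image

def mine_negatives(query_emb, image_embs, anchor_idx, lower_pct=60, upper_pct=95):
    """Return indices of candidate negatives whose similarity to the query
    falls in a middle band: not so high that they risk being false negatives,
    not so low that the resulting mosaic is trivially easy."""
    sims = image_embs @ query_emb          # cosine similarity (inputs assumed L2-normalized)
    sims[anchor_idx] = -np.inf             # never pick the anchor image itself
    finite = sims[np.isfinite(sims)]
    lo, hi = np.percentile(finite, [lower_pct, upper_pct])
    return [i for i, s in enumerate(sims) if lo <= s <= hi]

def build_mosaic(anchor_img, negative_imgs, size=512):
    """Paste the anchor and three sampled negatives into a random 2x2 grid;
    every tile is downscaled to one quadrant of the canvas."""
    tiles = [anchor_img] + random.sample(negative_imgs, 3)
    random.shuffle(tiles)
    half = size // 2
    canvas = Image.new("RGB", (size, size))
    for k, tile in enumerate(tiles):
        row, col = divmod(k, 2)
        canvas.paste(tile.resize((half, half)), (col * half, row * half))
    return canvas, tiles.index(anchor_img)  # quadrant where the target ended up
```

The returned quadrant index lets the data loader relocate the ground-truth mask accordingly, and the percentile band is the knob that trades off ambiguity (too-similar negatives) against triviality (irrelevant negatives).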
We observe a larger performance boost on more complex datasets. Harder datasets benefit more due to their intricate referring expressions and visually dense scenes.
Our approach performs best, as it preserves the original context needed for referring image segmentation, while other methods often disrupt key visual elements.
Performance on Challenging Scenarios. The performance gap between models trained with and without NeMo widens as the number of negative objects in the image increases; NeMo performs better on challenging samples, as expected.
Robustness to Object Scale. We evaluate the impact of our method across various object sizes. Improvement is observed in most cases, especially for smaller objects. This can be attributed to the wider range of object scales seen during training, since the mosaic contains objects both at the original and 1/4 size.
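As a companion to the sketch above, here is one hedged way the ground-truth mask could be remapped to the anchor's quadrant so that the target appears at 1/4 of its original area; `mosaic_mask` and its layout convention are assumptions for illustration.

```python
import numpy as np
from PIL import Image

def mosaic_mask(mask, quadrant, size=512):
    """Downscale a binary mask to one quadrant of the mosaic canvas and
    paste it at the anchor's position (same row/column convention as above)."""
    half = size // 2
    row, col = divmod(quadrant, 2)
    small = Image.fromarray(mask.astype(np.uint8) * 255).resize(
        (half, half), resample=Image.NEAREST)  # nearest-neighbor keeps the mask binary
    canvas = np.zeros((size, size), dtype=np.uint8)
    canvas[row * half:(row + 1) * half, col * half:(col + 1) * half] = np.array(small) // 255
    return canvas
```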
Complexity of the Referring Expression. We measure how our method behaves across sentence lengths. NeMo is effective for all sentence lengths, even longer, more complex ones; it also helps capture important linguistic cues for grounding.
Positional Understanding. NeMo exhibits stronger improvements for queries with positional keywords, even when the queries are long and complex.
Qualitative results demonstrate that (a) our method successfully detects a blurred, small "person" in the background, positioned behind the most prominent person in the image; (b) it segments the entire dish, while the baseline detects only the left half, failing to fully understand the part of the query describing the right half; (c) it yields an output with a more distinct shape for "the second" horse; and (d) it captures objects more accurately in scenarios involving directional expressions, indicating improved understanding of both absolute and relative positions.