I'm interested in computer vision, multimodal representation learning, and foundation models. Most of my research focuses on analyzing existing multimodal models and improving their application capabilities.
We propose a simple but powerful data augmentation method that turns a training image into a mosaic with three negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, making the sample more challenging.
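A minimal sketch of this idea in Python, assuming a CLIP-style image encoder exposed as a callable `clip_embed` and a 2x2 mosaic layout; the function names, the candidate pool, and the "most similar = hardest negative" curation rule are illustrative assumptions, not the paper's exact procedure.

from typing import Callable, Sequence
from PIL import Image
import torch

def pick_hard_negatives(anchor: Image.Image,
                        candidates: Sequence[Image.Image],
                        clip_embed: Callable[[Sequence[Image.Image]], torch.Tensor],
                        k: int = 3) -> list[Image.Image]:
    # Embed the anchor and all candidate negatives with the alignment model.
    embs = clip_embed([anchor] + list(candidates))            # shape (1 + N, D)
    embs = torch.nn.functional.normalize(embs, dim=-1)
    # Cosine similarity of each candidate to the anchor; higher = harder negative (assumption).
    sims = embs[1:] @ embs[0]
    top = sims.topk(k).indices.tolist()
    return [candidates[i] for i in top]

def make_mosaic(anchor: Image.Image,
                negatives: Sequence[Image.Image],
                tile: int = 224) -> Image.Image:
    # Paste the anchor and the three curated negatives into a 2x2 grid.
    canvas = Image.new("RGB", (2 * tile, 2 * tile))
    for idx, img in enumerate([anchor] + list(negatives)):
        x, y = (idx % 2) * tile, (idx // 2) * tile
        canvas.paste(img.resize((tile, tile)), (x, y))
    return canvas

In use, one would call pick_hard_negatives on a pool of candidate images, then feed the resulting mosaic to the model in place of the original training image.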
We conduct an extensive benchmark study measuring the performance of representative methods on 7 widely used datasets, while posing additional research questions and empirically verifying them.
This page is a fork of Jon Barron's. Thank you for sharing :)