We assess the object localisation capabilities of caption-based VLMs by analysing their textual responses to various prompts. We examine models such as GPT-4V, BLIP-2, Flamingo (using OpenFlamingo), and FROMAGe. To this end, we use prompts designed to elicit a bounding box or a textual localisation response from these VLMs. An extended analysis can be found in the supplemental.
Figure: Examples from our analysis of the localisation abilities of existing caption-based VLMs.
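For illustration, the sketch below shows how a model's free-form text answer can be probed for a bounding box. The prompt wording and the regex-based parser are assumptions chosen for clarity, not the exact protocol used in our experiments (see the supplemental for details).

```python
import re

# Illustrative prompt asking a caption-based VLM for a bounding box; the exact
# wording used in our experiments may differ (details in the supplemental).
PROMPT = (
    "Where is the {object} in the image? "
    "Answer with a bounding box in the form [x1, y1, x2, y2] "
    "using pixel coordinates."
)

def parse_bbox(response: str):
    """Extract the first [x1, y1, x2, y2] box from a free-form text response.

    Returns None when the model is "chatty" or gives no usable location,
    which is what we observe for most caption-based VLMs.
    """
    match = re.search(
        r"\[?\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*,"
        r"\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]?",
        response,
    )
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    # Reject degenerate boxes (e.g. coordinates listed in the wrong order).
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

# A GPT-4V-style answer parses to a box; a chatty caption-style answer does not.
print(parse_bbox("The dog is at [120, 45, 310, 290]."))    # (120.0, 45.0, 310.0, 290.0)
print(parse_bbox("I see a lovely dog playing in a park."))  # None
```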
We find that, among the evaluated VLMs, only GPT-4V successfully returns bounding boxes that roughly localise the intended object. The other VLMs provide no location information even in textual form: they either respond verbosely without addressing the question ("chatty"; FROMAGe, OpenFlamingo) or echo the input or produce no output at all (BLIP-2). One recent stream of research (VisionLLM, UniTab, CogVLM, OFA, MiniGPTv2, mPLUG-Owl, GLIPv2, UnifiedIO, Shikra) focuses on developing unified expert VLMs capable of performing a variety of tasks, including localisation, with a universal architecture. Although these models show impressive results across different tasks, their success largely depends on the availability of extensive task-specific supervised data. Furthermore, they typically require a large amount of compute for training. The setting we tackle in this paper is different: our goal is to efficiently enable the localisation capabilities of VLMs while keeping their parameters untouched and without the need for supervised localisation datasets.