We assess the object localisation capabilities of caption-based VLMs by analysing their textual responses to various prompts. We examine models such as GPT-4V, BLIP-2, Flamingo (using OpenFlamingo), and FROMAGe. To this end, we use prompts designed to elicit a bounding box or a textual localisation response from these VLMs. An extended analysis can be found in the supplemental.
Figure: Examples from our analysis of the localisation abilities of existing caption-based VLMs.
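For illustration, the sketch below shows how a model's free-form text answer can be probed for a bounding box. The prompt wording and the regex-based parser are assumptions chosen for clarity, not the exact protocol used in our experiments (see the supplemental for details).

```python
import re

# Illustrative prompt asking a caption-based VLM for a bounding box; the exact
# wording used in our experiments may differ (details in the supplemental).
PROMPT = (
    "Where is the {object} in the image? "
    "Answer with a bounding box in the form [x1, y1, x2, y2] "
    "using pixel coordinates."
)

def parse_bbox(response: str):
    """Extract the first [x1, y1, x2, y2] box from a free-form text response.

    Returns None when the model is "chatty" or gives no usable location,
    which is what we observe for most caption-based VLMs.
    """
    match = re.search(
        r"\[?\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*,"
        r"\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]?",
        response,
    )
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    # Reject degenerate boxes (e.g. coordinates listed in the wrong order).
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

# A GPT-4V-style answer parses to a box; a chatty caption-style answer does not.
print(parse_bbox("The dog is at [120, 45, 310, 290]."))    # (120.0, 45.0, 310.0, 290.0)
print(parse_bbox("I see a lovely dog playing in a park."))  # None
```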
We find that, among the evaluated VLMs, only GPT-4V successfully returns bounding boxes that roughly localise the intended object. The other VLMs provide no location information even in textual form: they either respond verbosely without addressing the question ("chatty"; FROMAGe, OpenFlamingo) or echo the input or produce no output at all (BLIP-2). One recent stream of research (VisionLLM, UniTab, CogVLM, OFA, MiniGPTv2, mPLUG-Owl, GLIPv2, UnifiedIO, Shikra) focuses on developing unified expert VLMs capable of performing a variety of tasks, including localisation, with a universal architecture. Although these models show impressive results across different tasks, their success largely depends on the availability of extensive task-specific supervised data. Furthermore, they typically require a large amount of compute for training. The setting we tackle in this paper is different: our goal is to efficiently enable the localisation capabilities of VLMs while keeping their parameters untouched and without the need for supervised localisation datasets.