PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

  • ¹Seoul National University
  • ²Pohang University of Science and Technology (POSTECH)
* Equal contribution

Overview

Abstract

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but generalize poorly, owing to the scarcity of 3D vision-language datasets and to reasoning capabilities that lag behind modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses the per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
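The three-stage pipeline can be summarized with the minimal Python sketch below. It is illustrative only: every name in it (ViewPrediction, place_viewpoints, fuse_predictions, ground, ground_in_view) is a hypothetical placeholder rather than released code, the viewpoint-placement heuristic is a stand-in for the paper's layout- and geometry-aware placement, and the fusion step simply keeps the highest-confidence lifted box, which is one plausible reading of "fuses per-view predictions into a single 3D bounding box."

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np


@dataclass
class ViewPrediction:
    """A grounding result lifted from one panoramic viewpoint."""
    box_3d: np.ndarray  # (2, 3): min and max corners of an axis-aligned 3D box
    score: float        # VLM confidence for this view


def place_viewpoints(scene_points: np.ndarray, num_views: int = 3) -> np.ndarray:
    """Place a compact set of panoramic viewpoints inside the scene.

    Illustrative heuristic only: spread viewpoints along the scene's x-extent,
    centered in y, at an assumed camera height of 1.5 m above the floor.
    The paper selects viewpoints from scene layout and geometry; that
    criterion is not reproduced here.
    """
    mins, maxs = scene_points.min(axis=0), scene_points.max(axis=0)
    xs = np.linspace(mins[0], maxs[0], num_views + 2)[1:-1]  # skip the walls
    y = 0.5 * (mins[1] + maxs[1])
    z = mins[2] + 1.5
    return np.array([[x, y, z] for x in xs])


def fuse_predictions(preds: Sequence[ViewPrediction]) -> np.ndarray:
    """Fuse per-view predictions into one 3D box (keep the best-scoring one)."""
    return max(preds, key=lambda p: p.score).box_3d


def ground(scene_points: np.ndarray,
           query: str,
           ground_in_view: Callable[[np.ndarray, str], ViewPrediction]) -> np.ndarray:
    """Three-stage pipeline: place viewpoints, ground the query per view, fuse."""
    viewpoints = place_viewpoints(scene_points)                 # stage 1: viewpoint placement
    per_view = [ground_in_view(v, query) for v in viewpoints]   # stage 2: per-view VLM grounding
    return fuse_predictions(per_view)                           # stage 3: lifting and fusion
```

The `ground_in_view` callable is where a feature-augmented panorama would be rendered at the given viewpoint and passed to the VLM together with the query; it is left abstract here because that step depends on the renderer and the VLM in use.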

PanoGrounder Framework

Results

3D Visual Grounding Demonstration

Green boxes denote ground-truth objects, and red boxes denote predictions from PanoGrounder. Across diverse scenes and query types, PanoGrounder produces accurate and spatially consistent grounding results.

Evaluation on Nr3D, Sr3D, and ScanRefer

To ensure a fair comparison, we group methods by training regime: models trained on a single benchmark versus those trained on mixed datasets (e.g., ScanRefer + ReferIt3D). S+R denotes our model trained jointly on ScanRefer and ReferIt3D (Nr3D/Sr3D). Easy/Hard and VD/VID (view-dependent/view-independent) are the standard Nr3D and Sr3D evaluation splits; ScanRefer results are reported on its Unique/Multiple splits.

Single Dataset

| Method | Nr3D Easy | Nr3D Hard | Nr3D VD | Nr3D VID | Nr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D VD | Sr3D VID | Sr3D Overall | ScanRefer Unique | ScanRefer Multiple | ScanRefer Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BUTD-DETR | 60.7 | 48.4 | 46.0 | 58.0 | 54.6 | 68.6 | 63.2 | 53.0 | 67.6 | 67.0 | 84.2 | 46.6 | 52.2 |
| ViL3DRel | 70.2 | 57.4 | 62.0 | 64.5 | 64.4 | 74.9 | 67.9 | 63.8 | 73.2 | 72.8 | 81.6 | 40.3 | 47.9 |
| 3D-VisTA | 65.9 | 49.4 | 53.7 | 59.4 | 57.5 | 72.1 | 63.6 | 57.9 | 70.1 | 69.6 | 77.4 | 38.7 | 45.9 |
| MIKASA | 69.7 | 59.4 | 65.4 | 64.0 | 64.4 | 78.6 | 67.3 | 70.4 | 75.4 | 75.2 | - | - | - |
| GPS | 67.0 | 50.9 | 55.8 | 59.8 | 58.7 | 70.5 | 63.4 | 53.1 | 69.0 | 68.4 | - | - | - |
| MCLN | - | - | - | - | 59.8 | - | - | - | - | 68.4 | 86.9 | 52.0 | 57.2 |
| PQ3D | 73.3 | 56.7 | 60.7 | 67.0 | 64.9 | 78.8 | 68.2 | 51.5 | 76.7 | 75.6 | 85.2 | 46.8 | 52.8 |
| LIBA | - | 57.2 | 60.3 | - | 64.5 | - | 70.2 | 61.7 | - | 75.8 | 88.8 | 54.4 | 59.6 |
| TSP3D | - | - | - | - | 48.7 | - | - | - | - | 57.1 | 87.3 | 51.0 | 56.5 |
| VGMamba | - | 61.4 | - | - | 68.3 | - | 74.4 | - | - | 81.3 | 91.9 | 54.8 | 60.0 |
| ViewSRD | 75.3 | 64.8 | 68.6 | 70.6 | 69.9 | 78.3 | 70.6 | 69.0 | 76.2 | 76.0 | 82.1 | 37.4 | 45.4 |
| Ours | 82.2 | 67.2 | 70.5 | 76.3 | 74.6 | 81.3 | 74.2 | 60.5 | 80.0 | 79.1 | 84.3 | 55.3 | 61.0 |

Multi Dataset

| Method | Nr3D Easy | Nr3D Hard | Nr3D VD | Nr3D VID | Nr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D VD | Sr3D VID | Sr3D Overall | ScanRefer Unique | ScanRefer Multiple | ScanRefer Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-VisTA | 72.1 | 56.7 | 61.5 | 65.1 | 64.2 | 78.8 | 71.3 | 58.9 | 77.3 | 76.4 | 81.6 | 43.7 | 50.6 |
| GPS | 72.5 | 57.8 | 56.9 | 67.9 | 64.9 | 80.1 | 71.6 | 62.8 | 78.2 | 77.5 | - | - | - |
| PQ3D | 75.0 | 58.7 | 62.8 | 68.6 | 66.7 | 82.7 | 72.8 | 62.9 | 80.5 | 79.7 | 86.7 | 51.5 | 57.0 |
| Chat-Scene | - | - | - | - | - | - | - | - | - | - | 89.6 | 47.8 | 55.5 |
| LLaVA-3D | - | - | - | - | - | - | - | - | - | - | - | - | 50.1 |
| UniVLG | 73.3 | 57.0 | 55.1 | 69.9 | 65.2 | 84.4 | 75.2 | 66.2 | 82.4 | 81.7 | - | - | 60.7 |
| Ours (S+R) | 84.1 | 68.4 | 72.9 | 77.5 | 76.1 | 82.3 | 74.5 | 66.6 | 80.6 | 79.9 | 85.0 | 56.4 | 62.0 |

Citation

Acknowledgements

We thank the research community for providing the benchmarks and datasets used in this research. We thank Chunghyun Park for his helpful comments and technical advice during the development of the model.
The website template was borrowed from Michaël Gharbi.