PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations
for VLM-based 3D Visual Grounding
- ¹Seoul National University
- ²Pohang University of Science and Technology (POSTECH)
Overview
Abstract
3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities relative to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses the per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
PanoGrounder Framework
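As a rough illustration of how the three stages described in the abstract fit together, the sketch below wires them into a single routine. All names here (`pano_ground`, `place_viewpoints`, `render_panorama`, `vlm_ground_2d`, `lift_and_fuse`, `Box3D`) are hypothetical placeholders rather than the released code or API; each stage is passed in as a callable because the concrete viewpoint-placement, rendering, and VLM components are not reproduced here.

```python
# Minimal sketch of the three-stage pipeline, assuming hypothetical stage
# implementations are supplied by the caller. Not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Box3D:
    """A simple axis-aligned 3D bounding box (center and size)."""
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


def pano_ground(
    scene_points,                   # 3D scene input, e.g. a point cloud
    query: str,                     # natural-language referring expression
    place_viewpoints: Callable,     # stage 1: choose panoramic viewpoints from the scene layout
    render_panorama: Callable,      # render a 360-degree view with 3D semantic/geometric features
    vlm_ground_2d: Callable,        # stage 2: VLM grounds the query on each panoramic rendering
    lift_and_fuse: Callable,        # stage 3: lift per-view predictions and fuse them into one 3D box
) -> Box3D:
    """Orchestrates the three stages and returns a single fused 3D bounding box."""
    viewpoints = place_viewpoints(scene_points)        # compact, layout-aware viewpoint set
    per_view = []
    for vp in viewpoints:
        pano = render_panorama(scene_points, vp)       # panorama augmented with 3D features
        pred_2d = vlm_ground_2d(pano, query)           # 2D grounding on the panorama
        per_view.append((vp, pred_2d))
    return lift_and_fuse(per_view, scene_points)       # single 3D bounding box via lifting
```

The stages are injected as callables purely to keep the sketch self-contained; in practice each would be backed by the framework's own viewpoint placement, panoramic renderer, and pretrained VLM.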
Results
3D Visual Grounding Demonstration
Green boxes denote ground-truth objects, and red boxes denote predictions from PanoGrounder. Across diverse scenes and query types, PanoGrounder produces accurate and spatially consistent grounding results.
Evaluation on Nr3D, Sr3D, and ScanRefer
To ensure a fair comparison, we group methods by training regime: models trained on a single benchmark versus those trained on mixed datasets (e.g., ScanRefer + ReferIt3D). S+R denotes our model trained jointly on ScanRefer and ReferIt3D (Nr3D/Sr3D).
| Training | Method | Nr3D Easy | Nr3D Hard | Nr3D VD | Nr3D VID | Nr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D VD | Sr3D VID | Sr3D Overall | ScanRefer Unique | ScanRefer Multiple | ScanRefer Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single Dataset | BUTD-DETR | 60.7 | 48.4 | 46.0 | 58.0 | 54.6 | 68.6 | 63.2 | 53.0 | 67.6 | 67.0 | 84.2 | 46.6 | 52.2 |
| Single Dataset | ViL3DRel | 70.2 | 57.4 | 62.0 | 64.5 | 64.4 | 74.9 | 67.9 | 63.8 | 73.2 | 72.8 | 81.6 | 40.3 | 47.9 |
| Single Dataset | 3D-VisTA | 65.9 | 49.4 | 53.7 | 59.4 | 57.5 | 72.1 | 63.6 | 57.9 | 70.1 | 69.6 | 77.4 | 38.7 | 45.9 |
| Single Dataset | MIKASA | 69.7 | 59.4 | 65.4 | 64.0 | 64.4 | 78.6 | 67.3 | 70.4 | 75.4 | 75.2 | - | - | - |
| Single Dataset | GPS | 67.0 | 50.9 | 55.8 | 59.8 | 58.7 | 70.5 | 63.4 | 53.1 | 69.0 | 68.4 | - | - | - |
| Single Dataset | MCLN | - | - | - | - | 59.8 | - | - | - | - | 68.4 | 86.9 | 52.0 | 57.2 |
| Single Dataset | PQ3D | 73.3 | 56.7 | 60.7 | 67.0 | 64.9 | 78.8 | 68.2 | 51.5 | 76.7 | 75.6 | 85.2 | 46.8 | 52.8 |
| Single Dataset | LIBA | - | 57.2 | 60.3 | - | 64.5 | - | 70.2 | 61.7 | - | 75.8 | 88.8 | 54.4 | 59.6 |
| Single Dataset | TSP3D | - | - | - | - | 48.7 | - | - | - | - | 57.1 | 87.3 | 51.0 | 56.5 |
| Single Dataset | VGMamba | - | 61.4 | - | - | 68.3 | - | 74.4 | - | - | 81.3 | 91.9 | 54.8 | 60.0 |
| Single Dataset | ViewSRD | 75.3 | 64.8 | 68.6 | 70.6 | 69.9 | 78.3 | 70.6 | 69.0 | 76.2 | 76.0 | 82.1 | 37.4 | 45.4 |
| Single Dataset | Ours | 82.2 | 67.2 | 70.5 | 76.3 | 74.6 | 81.3 | 74.2 | 60.5 | 80.0 | 79.1 | 84.3 | 55.3 | 61.0 |
| Multi Dataset | 3D-VisTA | 72.1 | 56.7 | 61.5 | 65.1 | 64.2 | 78.8 | 71.3 | 58.9 | 77.3 | 76.4 | 81.6 | 43.7 | 50.6 |
| Multi Dataset | GPS | 72.5 | 57.8 | 56.9 | 67.9 | 64.9 | 80.1 | 71.6 | 62.8 | 78.2 | 77.5 | - | - | - |
| Multi Dataset | PQ3D | 75.0 | 58.7 | 62.8 | 68.6 | 66.7 | 82.7 | 72.8 | 62.9 | 80.5 | 79.7 | 86.7 | 51.5 | 57.0 |
| Multi Dataset | Chat-Scene | - | - | - | - | - | - | - | - | - | - | 89.6 | 47.8 | 55.5 |
| Multi Dataset | LLaVA-3D | - | - | - | - | - | - | - | - | - | - | - | - | 50.1 |
| Multi Dataset | UniVLG | 73.3 | 57.0 | 55.1 | 69.9 | 65.2 | 84.4 | 75.2 | 66.2 | 82.4 | 81.7 | - | - | 60.7 |
| Multi Dataset | Ours (S+R) | 84.1 | 68.4 | 72.9 | 77.5 | 76.1 | 82.3 | 74.5 | 66.6 | 80.6 | 79.9 | 85.0 | 56.4 | 62.0 |
Citation
Acknowledgements
We thank the research community for providing the benchmarks and datasets used in this research. We thank Chunghyun Park for his helpful comments and technical advice during the development of the model.
The website template was borrowed from Michaël Gharbi.