Commonsense-Guided VLMs for 2D Fully Unobserved Object Detection
Keywords
unobserved object detection, vision-language models, commonsense reasoning, in-context learning, spatial-semantic distribution
Abstract
This paper addresses the task of 2D fully unobserved object detection: inferring the presence and spatial distribution of target objects that lie entirely outside the visible frame. Vision-Language Models (VLMs) show promise for this task, but they often lack the spatial precision required for fine-grained 2D reasoning, while alternative diffusion-based methods incur prohibitive computational costs. To address these challenges, we propose CIUL, a novel reasoning framework that synergizes VLMs with structured prior knowledge through two core innovations: (1) Object-Oriented Commonsense Integration, which automatically constructs a knowledge base of typical spatial arrangements to provide robust semantic constraints; and (2) Lightweight In-Context Learning, a paradigm that lets the model adaptively refine its reasoning about unobserved regions using local visual cues from a single image, without extensive retraining. Experimental results on the NYU Depth V2 benchmark demonstrate that our approach significantly outperforms existing baselines across key metrics, including 2D Region-wise Accuracy and Normalized Cross-Entropy, effectively bridging the gap between low-level perception and high-level commonsense reasoning.
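The evaluation setup described above can be sketched in a minimal, hypothetical form. The snippet below is our own illustration, not the paper's published method: the `combine_prior` helper, the multiplicative fusion rule, and all array values are assumptions. It shows one plausible reading of the pipeline, in which a commonsense prior over a flattened 2D region grid reweights a VLM's per-region probabilities, and the fused estimate is then scored with a region-wise hit and a normalized cross-entropy.

```python
import numpy as np

def combine_prior(vlm_probs, commonsense_prior, eps=1e-9):
    """Fuse a VLM's per-region probabilities with a commonsense
    spatial prior by elementwise product, then renormalize so the
    result is again a distribution over the region grid."""
    fused = vlm_probs * commonsense_prior + eps
    return fused / fused.sum()

def normalized_cross_entropy(p_true, p_pred, eps=1e-9):
    """Cross-entropy H(p_true, p_pred) divided by log(K), so scores
    are comparable across grids with different region counts K."""
    p_pred = np.clip(p_pred, eps, 1.0)
    ce = -np.sum(p_true * np.log(p_pred))
    return ce / np.log(len(p_true))

# Toy 2x2 grid flattened to 4 regions (all numbers hypothetical).
vlm   = np.array([0.40, 0.30, 0.20, 0.10])  # VLM spatial estimate
prior = np.array([0.10, 0.60, 0.20, 0.10])  # commonsense prior
truth = np.array([0.05, 0.80, 0.10, 0.05])  # ground-truth distribution

fused = combine_prior(vlm, prior)
print(fused.argmax() == truth.argmax())              # region-wise "hit"
print(normalized_cross_entropy(truth, fused))        # lower is better
```

Here the prior shifts the VLM's mass toward the region that commonsense favors, which also happens to be the ground-truth mode; the normalized cross-entropy then measures how sharply the fused distribution concentrates on the true regions.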
References
- [1] Ling, H., Acuna, D., Kreis, K., Kim, S. W. and Fidler, S. Variational amodal object completion. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), San Diego, CA, USA, 2020, https://proceedings.neurips.cc/paper/2020/file/bacadc62d6e67d7897cef027fa2d416c-Paper.pdf.
- [2] Bhattacharjee, S. S., Campbell, D. and Shome, R. Believing is Seeing: Unobserved Object Detection using Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025; pp. 19366-19377. https://doi.org/10.1109/CVPR52734.2025.01804.
- [3] Ren, S., He, K., Girshick, R. and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), 2015; pp. 91-99.
- [4] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, Nevada, USA, 2016; pp. 779-788. https://doi.org/10.1109/CVPR.2016.91.
- [5] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. and Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision - ECCV 2020. ECCV 2020, Cham, 2020; pp. 213-229. https://doi.org/10.1007/978-3-030-58452-8_13.
- [6] Li, K. and Malik, J. Amodal Instance Segmentation. In Computer Vision – ECCV 2016. ECCV 2016, Cham, 2016; pp. 677-693. https://doi.org/10.1007/978-3-319-46475-6_42.
- [7] Back, S., Lee, J., Kim, T., Noh, S., Kang, R., Bak, S. and Lee, K. Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling. In 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 2022; pp. 5085-5092. https://doi.org/10.1109/ICRA46639.2022.9811646.
- [8] Sun, Y., Kortylewski, A. and Yuille, A. Amodal segmentation through out-of-task and out-of-distribution generalization with a Bayesian model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, NJ, USA, 2022; pp. 1215-1224. https://doi.org/10.1109/CVPR52688.2022.00128.
- [9] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P. and Clark, J. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021; pp. 8748-8763, http://proceedings.mlr.press/v139/radford21a.html.
- [10] Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z. and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (ICML), 2021; pp. 4904-4916, http://proceedings.mlr.press/v139/jia21b.html.
- [11] Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L. and Xia, F. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2024; pp. 14455-14465, https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_SpatialVLM_Endowing_Vision-Language_Models_with_Spatial_Reasoning_Capabilities_CVPR_2024_paper.pdf.
- [12] Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D. and Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, NJ, 2022; pp. 5238-5248. https://doi.org/10.1109/CVPR52688.2022.00517.
- [13] Silberman, N., Hoiem, D., Kohli, P. and Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, Berlin, Heidelberg, 2012; pp. 746-760. https://doi.org/10.1007/978-3-642-33715-4_54.
- [14] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C. L. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, Cham, 2014; pp. 740-755.
- [15] Shannon, C. E. A mathematical theory of communication. SIGMOBILE Mobile Computing and Communications Review. 2001, 5(1), pp. 3–55. https://doi.org/10.1145/584091.584093.
