Citation: Hui Yuan, Yan Huang, Naigong Yu, Dongbo Zhang, Zetao Du, Ziqi Liu, Kun Zhang. Multimodal Pretrained Knowledge for Real-world Object Navigation[J]. Machine Intelligence Research. DOI: 10.1007/s11633-024-1537-x

Multimodal Pretrained Knowledge for Real-world Object Navigation

  • Most vision-and-language navigation (VLN) research focuses on simulated environments, but applying these methods to real-world scenarios is challenging because of misalignments between vision and language in complex environments, which lead to path deviations. To address this, we propose a novel vision-and-language object navigation strategy that uses multimodal pretrained knowledge as a cross-modal bridge to link semantic concepts in images and text. This improves navigation supervision at key-points and enhances robustness. Specifically, we 1) randomly generate key-points within a specific density range and optimize them based on challenging locations; 2) use pretrained multimodal knowledge to efficiently retrieve target objects (a sketch of this retrieval step is given below); 3) combine depth information with SLAM map data to predict optimal positions and orientations for accurate navigation; and 4) implement the method on a physical robot, successfully conducting navigation tests. Our approach achieves a maximum success rate of 66.7%, outperforming existing VLN methods in real-world environments.
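
The abstract does not name the pretrained model or its interface. As a minimal sketch, assuming a CLIP-style image-text model loaded through the Hugging Face transformers library, the target-object retrieval in step 2 could score candidate camera views against the instruction's target phrase as follows; the model name, file names, and target phrase are hypothetical placeholders, not details taken from the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed CLIP-style backbone; the paper's actual pretrained model may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_target(candidate_views, target_phrase):
    """Return the index of the candidate view that best matches the target phrase."""
    inputs = processor(text=[target_phrase], images=candidate_views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_views, num_phrases); take the single-phrase column.
    scores = outputs.logits_per_image[:, 0]
    return int(scores.argmax())

# Hypothetical usage: pick the view most likely to contain the instructed object.
views = [Image.open(p) for p in ("view_0.jpg", "view_1.jpg", "view_2.jpg")]
best_view = retrieve_target(views, "a red fire extinguisher near the door")

In the full system, the selected view would then be grounded to a goal pose by fusing its depth data with the SLAM map, as described in step 3.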
