Yanze Min, Yawei Sun, Yin Zhu, Jun Zhu, Bo Zhang. Bootstrapping Large Language Models with Outside-knowledge for Knowledge-based Visual Question Answering[J]. Machine Intelligence Research, 2026, 23(1): 115-132. DOI: 10.1007/s11633-025-1591-z

Bootstrapping Large Language Models with Outside-knowledge for Knowledge-based Visual Question Answering

  • Knowledge-based visual question answering (KB-VQA), which requires external world knowledge beyond the image for reasoning, is more challenging than traditional visual question answering. Recent works have demonstrated the effectiveness of using a large (vision) language model as an implicit knowledge source to acquire the necessary information. However, the knowledge stored in large models (LMs) is often coarse-grained and inaccurate, so questions that require finer-grained information are answered incorrectly. In this work, we propose a variational expectation-maximization (EM) framework that bootstraps the VQA performance of LMs with their own answers. In contrast to previous VQA pipelines, we treat the outside knowledge as a latent variable. In the E-step, we approximate the posterior with two components: first, a rough answer (e.g., a general description of the image), which LMs are usually good at producing; and second, a multi-modal neural retriever that retrieves question-specific knowledge from an external knowledge base. In the M-step, the training objective optimizes the ability of the original LMs to generate rough answers as well as refined answers based on the retrieved information. Extensive experiments show that our proposed framework, BootLM, has a strong retrieval ability and achieves state-of-the-art performance on knowledge-based VQA tasks.
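
For orientation, the generic evidence lower bound that underlies any variational EM scheme with a latent knowledge variable can be written as below. This is a sketch of the standard bound only, not the paper's exact objective. The symbols are assumptions introduced here for illustration: v is the image, q the question, a the answer, z the latent outside knowledge, \theta the LM parameters, and r the variational posterior.

```latex
% Standard ELBO with outside knowledge z as a latent variable.
% All symbols (v, q, a, z, \theta, r) are illustrative assumptions.
\begin{align}
\log p_\theta(a \mid v, q)
  &= \log \sum_{z} p_\theta(a \mid z, v, q)\, p(z \mid v, q) \\
  &\ge \mathbb{E}_{r(z \mid a, v, q)}\big[\log p_\theta(a \mid z, v, q)\big]
     - \mathrm{KL}\big(r(z \mid a, v, q)\,\|\,p(z \mid v, q)\big)
\end{align}
% E-step: choose r to tighten the bound; the bound is tight when
%         r(z | a, v, q) = p_\theta(z | a, v, q).
% M-step: maximize the bound over \theta with r held fixed.
```

Under this reading, the rough answer and the neural retriever described in the abstract would together play the role of r, and the M-step would update the LM with r held fixed.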