Instruction-ViT: Multi-modal Prompts For Instruction Learning In Vision Transformer
Abstract
Prompts play a crucial role in enhancing the control, adaptability, and scalable application of large language models. In recent years, prompt-based strategies have also been applied to visual models. However, the extent to which fusing multi-modal prompts (e.g., text or image prompts) can improve downstream task performance in visual models has not been systematically investigated. To address this gap, this paper adapts instruction-tuning-style prompt design to a vision transformer for visual tasks, a model we name Instruction-ViT. The key idea is to implement and fuse multi-modal prompts (either text or image prompts) that carry category information and use them to guide the fine-tuning of the model. In experiments on several image understanding tasks, including classification, segmentation, image captioning, and object detection, we observe consistently improved performance and domain adaptability. Our work presents an innovative strategy for fusing multi-modal prompts, enhancing performance and adaptability in visual models.
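To make the core idea of the abstract concrete, the snippet below is a minimal, illustrative sketch (not the authors' implementation): category-related text or image prompt embeddings are fused with the patch tokens and a class token, and the combined sequence is processed by a standard transformer encoder. The class name, dimensions, the prepend-style fusion, and the assumption that prompts come from a frozen CLIP-style encoder are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class MultiModalPromptViT(nn.Module):
    """Hypothetical sketch of prompt fusion in a ViT; names and fusion strategy are assumptions."""
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=10):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images, prompt_tokens):
        # images: (B, 3, H, W); prompt_tokens: (B, P, dim) -- category-related text or image
        # prompt embeddings, assumed to come from a frozen encoder (e.g., CLIP-style).
        x = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed           # add positional embeddings
        x = torch.cat([prompt_tokens, x], dim=1)                  # fuse prompts by prepending
        x = self.encoder(x)
        return self.head(x[:, prompt_tokens.size(1)])             # classify from the class token

# Usage with dummy prompts standing in for category embeddings.
model = MultiModalPromptViT()
images = torch.randn(2, 3, 224, 224)
prompts = torch.randn(2, 5, 768)     # e.g., 5 category prompts
logits = model(images, prompts)      # (2, 10)
```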
Recommended Citation
Z. Xiao, Y. Chen, J. Yao, L. Zhang, Z. Liu, Z. Wu, X. Yu, Y. Pan, L. Zhao, C. Ma, X. Liu, W. Liu, X. Li, Y. Yuan, D. Shen, D. Zhu, and D. Yao, "Instruction-ViT: Multi-modal Prompts For Instruction Learning In Vision Transformer," Information Fusion, vol. 104, article no. 102204, Elsevier, Apr. 2024.
The definitive version is available at https://doi.org/10.1016/j.inffus.2023.102204
Department(s)
Computer Science
Keywords and Phrases
Instruction learning; Multi-modal information fusion; Multi-modal prompt; Vision transformer
International Standard Serial Number (ISSN)
1566-2535
Document Type
Article - Journal
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2025 Elsevier, All rights reserved.
Publication Date
01 Apr 2024

Comments
National Natural Science Foundation of China, Grant 61933003