
Pixels to Phrases: Evolution of Vision Language Models
Jay Oza, Gitesh Kambli, Abhijit Patil
Corresponding author: Jay Oza ([email protected])

Abstract

Vision language models (VLMs) are transforming how we perceive and interact with visual data by bridging the gap between natural language understanding and visual perception. This paper provides a comprehensive overview of VLMs and their applications in text-based video retrieval and manipulation. It examines how these models leverage transformer architectures and self-attention mechanisms to learn joint representations of text and visual inputs. The paper traces the evolution of VLM pretraining techniques such as ViLBERT, LXMERT, and VisualBERT, as well as approaches to key tasks such as cross-modal retrieval, text-driven video manipulation, and zero-shot generalization to unseen domains. It explores how VLMs can be integrated with other technologies, such as GANs and reinforcement learning, to enhance their capabilities. The survey also covers emerging areas, including multimodal architectures that combine multiple modalities, video-language pretraining on paired video and text data, and generative VLMs for applications such as text-to-image synthesis. Additionally, it discusses challenges and future directions related to multimodal fusion, interpretability, reasoning, and the ethical implications of developing and deploying VLMs. Overall, this paper offers a timely and comprehensive perspective on the rapidly evolving field of VLMs and their role in enabling multimodal AI systems.
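As a brief, concrete illustration of the joint text-image representations and zero-shot cross-modal matching surveyed in this paper, the sketch below scores an image against candidate captions with a CLIP-style model through the Hugging Face transformers library. The checkpoint name, the image file, and the captions are illustrative assumptions, not part of this work.

```python
# Minimal sketch: CLIP-style joint text-image embedding and zero-shot matching.
# The checkpoint, image path, and captions below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("video_frame.jpg")  # e.g., a frame sampled from a video
captions = [
    "a dog catching a frisbee in a park",
    "a city skyline at night",
    "a person cooking in a kitchen",
]

# Encode both modalities into a shared embedding space and score similarity.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = better text-image match, without task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same similarity scores can be used to rank frames or clips against a free-form text query, which is the core operation behind the text-based video retrieval discussed in this survey.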
Submitted to TechRxiv: 22 May 2024
Published in TechRxiv: 30 May 2024