Visual Captions: The Google tool that will transform your video conferences with relevant images

Google has introduced Visual Captions, a system that enriches live video calls by suggesting relevant images in real time. The tool is part of the ARChat project, which aims to augment communication using real-time transcription. By analyzing what participants say, Visual Captions suggests suitable images during the conversation, helping speakers express their ideas and listeners follow them. Let’s look at this technology in detail.

Visual Captions: improving communication with images in real time

Google’s ARChat project has given rise to Visual Captions, a system that seeks to enhance video conferences by adding images in real time. To design it, studies were carried out with ten participants from a variety of fields, which made it possible to identify eight fundamental dimensions of visually augmenting conversations. These dimensions cover the timing of the images, their role in expressing and understanding verbal content, the types and sources of visual content, meeting scale and setup, privacy, how an interaction is initiated, and the methods of interaction.

Generating relevant images synchronously

Based on this input, Visual Captions was designed to generate images synchronously that are semantically relevant to the ongoing conversation. The system was tested in several scenarios, including remote one-on-one conversations, presentations, and group discussions. To train it, a dedicated dataset called VC1.5K was created, in which each entry pairs a piece of language with the matching visual content, visual type, and source across different contexts. A large language model was trained on this dataset, and the resulting model outperformed keyword-based approaches and achieved high accuracy.
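To make the dataset description more concrete, here is a minimal Python sketch of what one entry with the four fields (language, visual content, type, and source) might look like, together with a simple accuracy check. The class name, field names, label values, sample rows, and the naive baseline are all hypothetical and invented for illustration; none of them are taken from VC1.5K or from Google's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    """One hypothetical dataset entry; field names are assumptions."""
    language: str        # sentence spoken in the conversation
    visual_content: str  # what the image should depict
    visual_type: str     # made-up label, e.g. "photo" or "map"
    source: str          # made-up label, e.g. "online image search"


def type_source_accuracy(
    examples: List[Example],
    predict: Callable[[str], Tuple[str, str]],
) -> float:
    """Fraction of examples whose predicted (type, source) matches the label."""
    if not examples:
        return 0.0
    hits = sum(
        1 for ex in examples
        if predict(ex.language) == (ex.visual_type, ex.source)
    )
    return hits / len(examples)


# Invented rows for illustration only; not real VC1.5K data.
SAMPLE = [
    Example("I went to Tokyo last spring.", "Tokyo", "photo", "online image search"),
    Example("Here is my dog at the beach.", "my dog at the beach", "photo", "personal album"),
]


def naive_baseline(sentence: str) -> Tuple[str, str]:
    """Stand-in predictor that always proposes a photo from online search."""
    return ("photo", "online image search")
```

Calling `type_source_accuracy(SAMPLE, naive_baseline)` scores the stand-in predictor against the two invented rows. A trained model would be plugged in through the same `predict` argument.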

User studies: evaluating the effectiveness of Visual Captions

User studies were conducted to assess the effectiveness of Visual Captions. Participants found the suggested images to be informative, high-quality, and relevant, and judged that the type and source of the images matched the context of the conversation. The system was also evaluated in controlled laboratory studies and in real-world deployments. Real-time visuals were found to improve conversations by explaining concepts, resolving language ambiguities, and increasing engagement. Participants preferred different levels of proactivity in image suggestion depending on the social scenario.

Visual Captions on the ARChat platform

Visual Captions was developed on the ARChat platform, which integrates interactive widgets into video conferencing platforms like Google Meet. The system captures what users say, predicts their visual intent in real time, retrieves relevant images, and suggests them to users. It offers three levels of proactivity for image suggestions: auto-display, auto-suggest, and suggest-on-demand.
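The flow described above (capture speech, predict a visual intent, retrieve images, then show or suggest them) can be sketched in a few lines of Python. This is only an illustrative sketch, not the ARChat implementation: every name here (`Proactivity`, `VisualIntent`, `Outcome`, `handle_utterance`, `toy_predict`, `toy_retrieve`) is hypothetical. It also assumes that auto-display shows images to everyone without confirmation, that auto-suggest offers them privately to the speaker, and that suggest-on-demand does nothing until the user asks.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Proactivity(Enum):
    AUTO_DISPLAY = "auto-display"
    AUTO_SUGGEST = "auto-suggest"
    ON_DEMAND = "suggest-on-demand"


@dataclass
class VisualIntent:
    content: str  # what the image should depict
    type: str     # hypothetical label, e.g. "photo"
    source: str   # hypothetical label, e.g. "image search"


@dataclass
class Outcome:
    shown: List[str] = field(default_factory=list)      # displayed to all participants
    suggested: List[str] = field(default_factory=list)  # offered privately to the speaker


def handle_utterance(
    text: str,
    level: Proactivity,
    predict: Callable[[str], Optional[VisualIntent]],
    retrieve: Callable[[VisualIntent], List[str]],
    user_requested: bool = False,
) -> Outcome:
    """Route one transcribed utterance through the assumed pipeline."""
    # Assumption: on-demand mode stays idle until the user explicitly asks.
    if level is Proactivity.ON_DEMAND and not user_requested:
        return Outcome()
    intent = predict(text)
    if intent is None:  # nothing worth visualizing in this utterance
        return Outcome()
    images = retrieve(intent)
    if level is Proactivity.AUTO_DISPLAY:
        return Outcome(shown=images)
    return Outcome(suggested=images)


def toy_predict(text: str) -> Optional[VisualIntent]:
    """Placeholder for the trained model; a hard-coded rule for the demo only."""
    if "golden gate" in text.lower():
        return VisualIntent("Golden Gate Bridge", "photo", "image search")
    return None


def toy_retrieve(intent: VisualIntent) -> List[str]:
    """Placeholder for image retrieval; returns a fake identifier."""
    return [f"{intent.source}:{intent.type}:{intent.content}"]
```

Keeping prediction and retrieval as injected callables means the proactivity logic can be exercised on its own, with the toy functions swapped out for a real model and a real image search.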

You have more information at

Brian Adam
Professional blogger, vlogger, traveler and explorer of new horizons.