A picture is worth a thousand words: no more getting lost in video conferences!
In recent years, videoconferencing has taken up a growing share of work, and vendors have developed technologies such as real-time captions to help participants who speak different languages communicate in meetings.
But a pain point remains: some things mentioned in conversation are hard to convey in words alone, such as the dish sukiyaki, the scenery from "a trip to a park last weekend", or a statement like "Tokyo is in Japan's Kanto region" that really calls for a map; relying on language alone can leave the other party more and more confused.
Recently, Google presented a system called Visual Captions at ACM CHI (Conference on Human Factors in Computing Systems), the top human-computer interaction conference. It introduces a new visual channel into video calls, generating or retrieving pictures based on the context of the conversation to improve understanding of complex or unfamiliar concepts.
Paper link: https://research.google/pubs/pub52074/
Code link: https://github.com/google/archat
Based on a fine-tuned large language model, the Visual Captions system can actively recommend relevant visual elements in open vocabulary conversations and has been integrated into the open source project ARChat.
In a user survey, the researchers invited 26 participants in the laboratory and 10 participants outside the laboratory to evaluate the system; more than 80% of users agreed that Visual Captions provides useful and meaningful visual recommendations in a variety of scenarios and improves the communication experience.
Before developing the design, the researchers invited 10 internal participants, including software engineers, researchers, UX designers, visual artists, students, and other practitioners from both technical and non-technical backgrounds, to discuss their needs and expectations for a real-time visual augmentation service.
After two rounds of discussion, and building on existing text-to-image systems, they established the basic design of the intended prototype, covering eight dimensions (D1 to D8).
D1: timing, visuals can be displayed synchronously or asynchronously with the conversation
D2: subject, visuals can be used both to express and to understand speech content
D3: visual, a wide range of visual content, visual types, and visual sources can be used
D4: scale, the visual augmentation may differ depending on meeting size
D5: space, whether the meeting is co-located or remote
D6: privacy, which also affects whether visuals should be displayed privately, shared among participants, or made public to all
D7: initiative, participants want different levels of system proactivity during a conversation, i.e. users can decide for themselves when the system joins the chat
D8: interaction, participants envision different interaction methods, such as voice or gesture input
Based on this preliminary feedback on the design space for augmenting verbal communication with dynamic visuals, the researchers designed the Visual Captions system, which focuses on generating synchronous visuals of semantically relevant visual content, type, and source.
While most ideas from the exploratory meetings focused on one-to-one remote conversations, Visual Captions can also be used in one-to-many scenarios (for example, presentations to an audience) and many-to-many scenarios (discussions in multi-person meetings).
In addition, the visual effect that can best complement the conversation depends to a large extent on the context of the discussion, so a specially made training set is needed.
The researchers collected 1,595 quadruples of language, visual content, visual type, and visual source, covering a variety of contexts, including daily conversations, lectures, and travel guides.
For example, the user utterance "I would love to see it!" corresponds to the visual content "face smiling", the visual type "emoji", and the visual source "public search".
"Did she tell you about our trip to Mexico?" corresponds to the visual content of photos from the trip to Mexico, the visual type "photo", and the visual source "personal album".
The dataset VC 1.5K is currently open source.
Data link: https://github.com/google/archat/tree/main/dataset
Visual intent prediction model
To predict which visuals can complement a conversation, the researchers used the VC1.5K dataset to train a visual intent prediction model based on a large language model.
During training, each visual intent is parsed into the format "<visual type> of <visual content> from <visual source>".
Based on this format, the system can handle open-vocabulary conversations and contextually predict visual content, visual source, and visual type.
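As a rough sketch, a prediction in that template could be parsed back into its three fields roughly as follows; the exact serialization and parsing used by the fine-tuned model are assumptions here.

```python
import re

# Sketch: split a prediction of the form
# "<visual type> of <visual content> from <visual source>"
# into its three fields. The exact output format of the fine-tuned
# model is an assumption for illustration.
PATTERN = re.compile(r"^(?P<vtype>.+?) of (?P<content>.+) from (?P<source>.+)$")

def parse_visual_intent(prediction: str):
    match = PATTERN.match(prediction.strip())
    if match is None:
        return None  # no parseable intent -> no visual suggestion
    return {
        "visual_type": match.group("vtype"),
        "visual_content": match.group("content"),
        "visual_source": match.group("source"),
    }

print(parse_visual_intent("emoji of face smiling from public search"))
# {'visual_type': 'emoji', 'visual_content': 'face smiling', 'visual_source': 'public search'}
```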
In practice this approach also beats keyword-based methods, which cannot handle open-vocabulary examples: a user might say "your Aunt Amy will visit this Saturday", where no keyword matches, so no relevant visual type or source can be recommended.
The researchers used 1,276 samples (80%) from the VC1.5K dataset to fine-tune the large language model, kept the remaining 319 samples (20%) as test data, and measured the fine-tuned model with token accuracy, i.e. the percentage of correct tokens in the tokenized samples.
The final model achieves 97% training token accuracy and 87% validation token accuracy.
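As a minimal sketch of that metric (whitespace tokenization is an assumption here; the actual evaluation would use the language model's own tokenizer):

```python
# Minimal sketch of a token-accuracy metric: the fraction of predicted
# tokens that match the reference tokens position by position.
# Whitespace tokenization is an assumption for illustration; the real
# evaluation would use the language model's own tokenizer.
def token_accuracy(predictions, references):
    correct, total = 0, 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        total += len(ref_tokens)
        correct += sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / total if total else 0.0

print(token_accuracy(
    ["emoji of face smiling from public search"],
    ["emoji of face smiling from public search"],
))  # 1.0
```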
To evaluate the practicality of the trained visual captioning model, the research team invited 89 participants to perform 846 tasks and asked them to rate the results on a scale from 1 (strongly disagree) to 7 (strongly agree).
The experimental results showed that most participants prefer to see visuals in a conversation (Q1), with 83% giving a rating of 5 (somewhat agree) or higher.
In addition, participants considered the displayed visuals useful and informative (Q2, 82% rated 5 or higher), of high quality (Q3, 82%), and relevant to the original speech (Q4, 84%).
Participants also found the predicted visual type (Q5, 87%) and visual source (Q6, 86%) accurate in the context of the corresponding conversation.
Based on the fine-tuned visual intention prediction model, the researchers developed Visual Captions on the ARChat platform, which can directly add new interactive widgets to the camera stream of video conferencing platforms such as Google Meet.
In the system workflow, Visual Captions automatically captures the user's speech, retrieves the last sentence, feeds it into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then presents the recommended visuals.
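A heavily simplified sketch of that loop is shown below; the function names (transcribe_latest_sentence, predict_visual_intent, retrieve_visuals, show_recommendations) are placeholders, not the actual ARChat API.

```python
import time

# Heavily simplified sketch of the described workflow: every ~100 ms, take
# the most recently transcribed sentence, run the visual intent prediction
# model on it, and fetch candidate visuals to recommend.
# transcribe_latest_sentence, predict_visual_intent, retrieve_visuals and
# show_recommendations are placeholder callables, not the actual ARChat API.
def caption_loop(transcribe_latest_sentence, predict_visual_intent,
                 retrieve_visuals, show_recommendations, interval_s=0.1):
    last_sentence = None
    while True:
        sentence = transcribe_latest_sentence()
        if sentence and sentence != last_sentence:
            intent = predict_visual_intent(sentence)   # e.g. "photo of ... from ..."
            if intent is not None:
                show_recommendations(retrieve_visuals(intent))
            last_sentence = sentence
        time.sleep(interval_s)  # ~100 ms polling, as described above
```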
Visual Captions offers three levels of initiative when recommending visuals (sketched in code after the list below):
Automatic display (high initiative): the system autonomously searches for visuals and displays them publicly to all meeting participants, with no user interaction.
Automatic recommendation (medium initiative): recommended visuals are shown in a private scrolling view, and the user clicks a visual to display it publicly; the system proactively recommends visuals, but the user decides when and what to display.
On-demand suggestion (low initiative): the system recommends visuals only after the user presses the spacebar.
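One way to picture these modes is as a simple setting that gates which retrieved visuals are surfaced; this is an illustrative sketch, not how ARChat actually implements it.

```python
from enum import Enum

# Illustrative sketch of the three proactivity levels as a setting that
# gates which retrieved visuals are surfaced; not ARChat's actual
# implementation.
class Initiative(Enum):
    AUTO_DISPLAY = "auto-display"  # high: show visuals to everyone automatically
    AUTO_SUGGEST = "auto-suggest"  # medium: suggest privately, user clicks to share
    ON_DEMAND = "on-demand"        # low: suggest only after the user presses space

def surfaced_visuals(candidates, mode, spacebar_pressed=False, user_selection=None):
    """Return the visuals surfaced under each proactivity mode (sketch)."""
    if mode is Initiative.AUTO_DISPLAY:
        return candidates                                   # displayed publicly, no interaction
    if mode is Initiative.AUTO_SUGGEST:
        # suggestions appear privately; only a clicked visual goes public
        return [user_selection] if user_selection else []
    if mode is Initiative.ON_DEMAND:
        return candidates if spacebar_pressed else []       # suggest only on request
    return []

print(surfaced_visuals(["map_of_kanto.png"], Initiative.AUTO_DISPLAY))
```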
The researchers evaluated the Visual Captions system in a controlled laboratory study (n = 26) and a deployment study in the wild (n = 10). Participants found that real-time visuals help explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging, thereby facilitating live conversation.
Participants' task load index and Likert scale ratings were compared across the conditions of no Visual Captions and the three initiative levels; participants also reported different system preferences in the field, i.e. they wanted different degrees of initiative in different meeting scenarios.
Reference:
https://ai.googleblog.com/2023/06/visual-captions-using-large-language.html
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).