Meta launches AI audio model Audiobox: supports simultaneous input of voice and text, and can generate multi-level sound

2026-02-14 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

CTOnews.com, December 4: Meta recently launched Audiobox, an AI audio generation model that accepts voice and text input at the same time. Users can combine a voice sample with a text description to generate the audio they need.

The model is reportedly built on Voicebox, the AI model Meta launched in June this year. Audiobox is said to generate a variety of ambient sounds as well as natural conversational speech, and it combines audio generation and editing so that users can freely produce the audio they need.

According to Meta, generating high-quality audio requires a large audio library and deep domain knowledge, resources that are hard for the general public to obtain. The company says it launched the model to lower the barrier to sound generation and make it easier for anyone to create sound effects for video, games and other applications.

CTOnews.com found that the Audiobox model builds on Voicebox's "bootstrap sound" mechanism to generate the target audio, and pairs it with the flow-matching generation method used for diffusion-style models to provide an audio infilling function, which lets it generate multi-level audio.
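
The idea behind flow matching and audio infilling can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not Meta's implementation: audio frames are plain floats, all function names are hypothetical, and `oracle_velocity` stands in for the trained network by reading the answer from a known target. A flow-matching model is trained to predict the velocity that carries noise to data along a straight path; at generation time that velocity is integrated step by step, and for infilling only the masked frames are updated while the known frames stay fixed as context.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of that straight path; a flow-matching model is trained to regress it."""
    return [b - a for a, b in zip(x0, x1)]

def infill(context, mask, velocity_fn, steps=10, seed=0):
    """Fill the masked frames by Euler-integrating a velocity field from t=0 to t=1.

    context     -- list of floats; values at masked positions are ignored
    mask        -- list of bools, True where a frame must be generated
    velocity_fn -- callable (x, t) -> list of floats (hypothetical model interface)
    """
    rng = random.Random(seed)
    # Masked frames start as Gaussian noise; known frames start at their true value.
    x = [rng.gauss(0.0, 1.0) if m else c for c, m in zip(context, mask)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        # Only masked frames move; known frames stay pinned to the context.
        x = [xi + dt * vi if m else c
             for xi, vi, c, m in zip(x, v, context, mask)]
    return x

def make_oracle_velocity(target):
    """Assumption: a stand-in for a trained network that already knows the target.

    For a straight path ending at `target`, the velocity at (x, t) is
    (target - x) / (1 - t).
    """
    def oracle_velocity(x, t):
        return [(tg - xi) / (1.0 - t) for tg, xi in zip(target, x)]
    return oracle_velocity

# Toy clip: the three middle frames are missing and must be infilled.
target = [0.0, 0.5, 1.0, 0.5, 0.0]
mask = [False, True, True, True, False]
context = [0.0 if m else tg for tg, m in zip(target, mask)]
filled = infill(context, mask, make_oracle_velocity(target))
print(filled)
```

In a real system the oracle would be replaced by a neural network conditioned on the surrounding audio and, in Audiobox's case, on the voice and text prompts; the sketch only shows how the masked region is transported from noise to audio while the context is left untouched.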

In its tests, Meta generated audio of rain with thunder and demonstrated a series of prompts, such as "running water accompanied by birdsong" and "a young woman speaking in a high-pitched, fast-paced voice". It also tested combining a voice sample with text cues to generate speech with a given emotion ("mournful and slow") against a background setting (in a church).

Meta claims that Audiobox outperforms AudioLDM2, VoiceLDM and TANGO in both sound quality and "accuracy of generated content", surpassing the best existing audio generation models.

Audiobox is currently open to selected researchers and academics for testing the model's quality and safety, and Meta says it plans to "make the model fully public in a few weeks' time."
