What happens when you interact with the machine voice?


2026-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)11/24 Report--

Intelligent voice interaction is a new generation of interaction based on voice input: the user speaks and receives a result in response. It can be understood as technology for exchanging information between humans and machines through natural language.

The complete flow of voice interaction is shown in the following figure.

Generally speaking, voice interaction scenarios fall into two kinds according to distance:

Near-field voice scenario: usually activated by pressing a key, as on portable devices such as smartphones.

Far-field voice scenario: usually activated by a wake-up word, as on fixed devices such as smart speakers.

In far-field voice scenarios, product strategy usually adopts two measures to improve wake-up accuracy:

Increase the length of the wake-up word to 4 syllables, because the more syllables a wake-up word has, the higher the wake-up accuracy. For example, the wake-up accuracy of "Xiaoxing Xiaoxing" is much higher than that of "Xiaoxing".

Check wake-up words only locally during the day, and add a second check in the cloud at night. This strategy balances wake-up speed against accuracy.

During the day, users care more about response speed, and they can understand or accept an occasional false wake-up. At this time, only the local wake-up detection module is used, for rapid detection that ensures a response within 700 ms.

When users go to bed at night, they have zero tolerance for false wake-ups, so the focus shifts to wake-up accuracy. The voice detected locally is uploaded to the cloud for a second confirmation, which then decides whether the device responds.
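The day/night strategy above can be sketched as a small decision function. This is only an illustration under stated assumptions: the night window (22:00 to 07:00) and the `cloud_confirm` callback are hypothetical, since the article does not say when "night" begins or how the cloud check is invoked.

```python
from datetime import time
from typing import Callable

# Hypothetical night window; the article only says "during the day" and "at night".
NIGHT_START = time(22, 0)
NIGHT_END = time(7, 0)


def is_night(now: time) -> bool:
    """Return True if `now` falls in the night window, which wraps past midnight."""
    return now >= NIGHT_START or now < NIGHT_END


def should_wake(local_hit: bool, now: time, cloud_confirm: Callable[[], bool]) -> bool:
    """Decide whether the device should respond to a possible wake-up word.

    local_hit:     result of the on-device wake-up detection module.
    cloud_confirm: hypothetical callback that uploads the detected audio and
                   returns the cloud's verdict; it is only called at night.
    """
    if not local_hit:
        return False        # nothing detected locally, so the cloud is never asked
    if not is_night(now):
        return True         # daytime: favour speed and accept the local result
    return cloud_confirm()  # night: favour accuracy and require cloud agreement
```

For example, `should_wake(True, time(23, 30), lambda: False)` rejects a local detection that the cloud does not confirm, while the same detection at `time(14, 0)` is accepted without contacting the cloud.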

The main function of the speech recognition (ASR) phase is to collect speech and convert it into text. It mainly does two things:

1. Direction finding and noise reduction.

Direction finding determines the direction of the user, so that the microphone facing the user collects the voice data and the data is as clear as possible. Noise reduction eliminates ambient sound and improves recognition accuracy.

2. Recognize speech and convert it to text.

To improve the recognition rate for specific content, a hot word service is generally provided. Configured hot words take effect in real time and increase their recognition weight in ASR results, which improves ASR accuracy to a certain extent.
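One way to picture what "increasing the recognition weight" means is to rescore a list of candidate transcripts. The sketch below is an assumption made for illustration, not how any particular ASR engine implements hot words (real engines typically bias the decoder itself); the function name, the `boost` value, and the sample scores are all hypothetical.

```python
from typing import List, Tuple


def rescore_with_hot_words(
    hypotheses: List[Tuple[str, float]],
    hot_words: List[str],
    boost: float = 0.2,
) -> List[Tuple[str, float]]:
    """Add `boost` to a hypothesis score for each hot word it contains,
    then return the hypotheses sorted from best to worst."""
    rescored = []
    for text, score in hypotheses:
        hits = sum(1 for word in hot_words if word.lower() in text.lower())
        rescored.append((text, score + boost * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list: the acoustically preferred transcript is wrong.
n_best = [("call shout sing", 0.52), ("call Xiaoxing", 0.48)]
best_text, _ = rescore_with_hot_words(n_best, hot_words=["Xiaoxing"])[0]
print(best_text)
```

With "Xiaoxing" configured as a hot word, the second candidate overtakes the first; with an empty hot word list the original order is kept.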

Semantic understanding (NLU) attempts to understand human language, that is, to transform the results of speech recognition into a structured form that machines can understand.

NLU works by splitting the user's instruction into three levels: Domain → Intent → Slot.

For example, take "set an alarm clock at 8 o'clock tomorrow morning". After NLU processing, the user's instruction is divided into the following three levels:

Domain: "alarm clock"

Intent: "set alarm clock"

Slot: "8 o'clock tomorrow morning"
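The three levels in this example can be represented as a small data structure. The following is a minimal sketch that uses a single hand-written rule; the `NluResult` class, the `ALARM_RULE` pattern, and the label names are hypothetical, and a production NLU component would normally rely on trained models rather than one regular expression.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NluResult:
    domain: str                                           # e.g. "alarm_clock"
    intent: str                                           # e.g. "set_alarm"
    slots: Dict[str, str] = field(default_factory=dict)   # e.g. {"time": ...}


# Hypothetical rule covering only the alarm example from the text.
ALARM_RULE = re.compile(
    r"^set an alarm(?: clock)? (?:at|for) (?P<time>.+)$", re.IGNORECASE
)


def parse(utterance: str) -> Optional[NluResult]:
    """Return the structured result, or None if no rule matches."""
    match = ALARM_RULE.match(utterance.strip())
    if match is None:
        return None
    return NluResult(
        domain="alarm_clock",
        intent="set_alarm",
        slots={"time": match.group("time")},
    )


result = parse("set an alarm clock at 8 o'clock tomorrow morning")
print(result)
```

The slot value is kept as raw text here; turning "8 o'clock tomorrow morning" into an actual timestamp would be a separate normalization step.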

Dialog management (DM) makes a decision first. During the conversation, the machine continually decides the best action to take next according to the current state.

Then it executes that action, such as providing results, asking about specific restrictions, clarifying or confirming requirements, or invoking various skills (the apps of the AI era), so as to assist users as effectively as possible in obtaining information or services.
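The decide-then-execute loop can be illustrated with a very small rule-based policy. This is a sketch under assumptions: the `REQUIRED_SLOTS` table, the action names, and the intent labels are hypothetical and chosen only to show how the next action depends on the current state.

```python
from typing import Dict, List, Tuple

# Hypothetical table of the slots each intent needs before its skill can run.
REQUIRED_SLOTS: Dict[str, List[str]] = {
    "set_alarm": ["time"],
    "play_music": ["song"],
}


def next_action(intent: str, slots: Dict[str, str]) -> Tuple[str, str]:
    """Pick the next step from the current dialog state.

    Returns an (action, argument) pair:
      ("clarify", "intent")    the intent is not recognised, ask the user again
      ("ask_slot", name)       a required slot is missing, ask for it
      ("invoke_skill", intent) everything is known, run the matching skill
    """
    if intent not in REQUIRED_SLOTS:
        return ("clarify", "intent")
    for name in REQUIRED_SLOTS[intent]:
        if not slots.get(name):
            return ("ask_slot", name)
    return ("invoke_skill", intent)
```

For instance, `next_action("set_alarm", {})` asks for the missing time, and once the slot is filled the same call returns `("invoke_skill", "set_alarm")`.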

The main purpose of NLG is to narrow the communication gap between humans and machines by converting non-linguistic data into language that humans can understand. A simple NLG system can merge data into text, while an advanced one can understand what the data is trying to express, consider the context, and present something that is easy to read.

At present, in areas that follow obvious rules, such as sports news, news can be published automatically with the help of NLG. The article you are reading may even have been generated by a machine.
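The "simple NLG" case, merging data into text, is essentially template filling. The sketch below shows the idea for a sports result; the function name, the wording of the templates, and the team names are hypothetical and not taken from any real news system.

```python
def render_match_report(home: str, away: str, home_goals: int, away_goals: int) -> str:
    """Turn a structured match result into one readable sentence."""
    if home_goals == away_goals:
        return f"{home} and {away} drew {home_goals}-{away_goals}."
    if home_goals > away_goals:
        winner, loser, high, low = home, away, home_goals, away_goals
    else:
        winner, loser, high, low = away, home, away_goals, home_goals
    return f"{winner} beat {loser} {high}-{low}."


print(render_match_report("Team A", "Team B", 2, 1))
```

An advanced system would go further than this, for example by choosing which facts matter and varying the phrasing with context, which is what the text means by understanding what the data is trying to express.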

Text-to-speech (TTS) converts text into voice output so that the machine can talk to us. This involves two processes:

Text processing: convert the text content into a phoneme sequence and mark information such as the start and end time and frequency change of each phoneme.

Speech synthesis: in a narrow sense, this means generating speech from the phoneme sequence (together with the marked start and end times, frequency changes, and other information). In a broad sense, it can also include the text processing step.

The main application scenarios of voice interaction in the home include voice information queries, voice-controlled playback, hands-free voice dialing, and voice control of home appliances.

Acronyms:

ASR: Automatic Speech Recognition

NLU: Natural Language Understanding

DM: Dialog Management

NLG: Natural Language Generation

TTS: Text To Speech

NLP: Natural Language Processing

IPTV: Internet Protocol Television

OTT: Over The Top, the provision of various application services to users through the Internet

IMS: Interactive Multimedia Service

IoT: Internet of Things

This article comes from the WeChat official account ZTE Documents (ID: ztedoc).
