How to analyze the Application of SSML in DuerOS

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how SSML is applied in DuerOS.

In a conversational AI system, voice is the main input and output channel. There are two main ways to produce speech output: record the audio in advance and play it back in response to a user request, or convert text to speech at run time with text-to-speech (TTS) synthesis. Pre-recorded audio often gives a better user experience than synthesized speech, because a human voice carries more "color" and more emotion in its intonation.

However, pre-recording is labor-intensive, and because the recordings are fixed in advance, they adapt poorly to dynamic content. On-demand, dynamic output is where TTS is strong. So how can TTS output be made more expressive? In DuerOS, a conversational AI system, dynamic and expressive output is achieved with SSML.

What is SSML?

SSML (Speech Synthesis Markup Language) is a standard, XML-based markup language whose tags instruct a speech synthesizer or synthesis service how to convert text input into spoken output. Put simply, text annotated in a defined markup format is converted into speech.

SSML was designed to help developers improve the quality of synthesized speech by using standardized markup to control aspects of the output such as pronunciation and volume. Its key design goals are as follows:

Consistency: provides predictable control of voice output across different speech synthesis services

Compatibility: works alongside other W3C standards, including but not limited to VoiceXML, ACSS (aural Cascading Style Sheets) and SMIL

Versatility: supports speech output for a wide range of content

Internationalization: supports speech output in many languages

Automation and readability: supports both automatically generated and hand-written documents, and remains human-readable

Deployability: can be implemented with existing technology, with a minimal number of optional features

How SSML works

A TTS system that supports SSML (a speech synthesis processor) renders a document as voice output, using the information in the markup to produce the audio the author intended. The main steps are as follows:

1) XML parsing: an XML parser extracts the document tree and content from the incoming text document. The structure, tags, and attributes obtained in this step affect each of the following steps.

2) Structural analysis: the structure of a document affects the way it is read. For example, there are common speaking patterns associated with paragraphs and sentences.

3) Text normalization: every written language has special constructs that must be converted from written form to spoken form. Text normalization is the automatic process by which the synthesis processor performs this conversion. For example, when "$200" appears in a document, it can be spoken as "two hundred dollars". By the end of this step, the text to be spoken has been fully converted into tokens. What counts as a token is language-specific; tokens are usually separated by whitespace and are usually words. In general, tokens in SSML cannot span markup tags.

4) Text-to-phoneme conversion: once the speech synthesis processor has determined the set of tokens to speak, it must derive a pronunciation for each token. A pronunciation can be described as a sequence of phonemes, the phonetic units that distinguish one word from another in a language. Each language has its own set of phonemes.

5) Prosodic analysis: prosody is the set of features of speech output that includes pitch (also known as intonation or melody), timing (or rhythm), pauses, speaking rate, emphasis on words, and many other features. Prosodic analysis is essential for making speech sound natural and for conveying the intended meaning correctly.

6) Waveform generation: the speech synthesis processor uses the phoneme and prosodic information to generate the audio waveform. There are many ways to handle this step, so there may be considerable processor-specific variation.
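
The first three steps can be illustrated with a toy Python sketch. It is not how any real synthesis processor (or DuerOS) is implemented: the function names, the tiny number speller (0 to 999 only) and the handling of the sub element are assumptions made purely for illustration, and steps 4 to 6 are not covered.

```python
import re
import xml.etree.ElementTree as ET

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def spell(n):
    """Spell out an integer in the range 0..999 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")


def extract_text(ssml):
    """Step 1 (XML parsing): collect the text to speak from the document tree."""
    parts = []

    def walk(node):
        if node.tag == "sub" and "alias" in node.attrib:
            parts.append(node.attrib["alias"])  # speak the alias, not the content
            return
        parts.append(node.text or "")
        for child in node:
            walk(child)
            parts.append(child.tail or "")  # text that follows the child element

    walk(ET.fromstring(ssml))
    return "".join(parts)


def normalize(text):
    """Step 3 (text normalization): rewrite amounts like "$200" in spoken form."""
    def dollars(match):
        n = int(match.group(1))
        return spell(n) + (" dollar" if n == 1 else " dollars")

    return re.sub(r"\$(\d{1,3})\b", dollars, text)


def tokenize(text):
    """Split normalized English text into word tokens (punctuation dropped)."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


if __name__ == "__main__":
    doc = ('<speak>The ticket costs $200. '
           '<sub alias="World Wide Web Consortium">W3C</sub> defines SSML.</speak>')
    print(tokenize(normalize(extract_text(doc))))
```

A real processor would go on to map each token to phonemes, assign prosody, and generate a waveform.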

Examples of elements and attributes in SSML

SSML is a markup language, so documents must follow a defined structure. Every SSML document has a speak element as its root. For the full SSML syntax, refer to the official W3C documentation. The main SSML tags are described below.

SSML is feature-rich. One typical feature is playback of recorded audio: the audio element plays the audio file at the URL it specifies.

The following is an example given in the W3C specification:

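<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">

  <!-- Empty element -->
  Please say your name after the tone. <audio src="beep.wav"/>

  <!-- Container element with alternative text -->
  <audio src="prompt.au">What city do you want to fly from?</audio>
  <audio src="welcome.wav">
    <emphasis>Welcome</emphasis> to the Voice Portal.
  </audio>

</speak>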
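
When SSML is generated by a program rather than written by hand, audio elements can be assembled with a few small helpers. The following Python sketch is one possible approach, not an official SDK API: the helper names text, audio and speak are hypothetical, and for brevity the speak root omits the version and namespace attributes that the W3C specification requires.

```python
from xml.sax.saxutils import escape, quoteattr


def text(content):
    """Escape plain text so that &, < and > are safe inside SSML."""
    return escape(content)


def audio(src, fallback=""):
    """Build an audio element; the fallback text is spoken if the file cannot be played."""
    if fallback:
        return f"<audio src={quoteattr(src)}>{escape(fallback)}</audio>"
    return f"<audio src={quoteattr(src)}/>"


def speak(*parts):
    """Wrap already-built fragments in a speak root element."""
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(speak(
        text("Please say your name after the tone. "),
        audio("beep.wav"),
        audio("prompt.au", "What city do you want to fly from?"),
    ))
```

Building the markup through helpers keeps escaping in one place, which matters because unescaped special characters make the document ill-formed.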

SSML in DuerOS

In DuerOS skill development, DuerOS converts the text in the response message returned by a skill into speech for playback according to certain rules (for background, see the related articles "Interface-oriented or protocol-oriented? A look at DuerOS skill development", "Building AI applications with JavaScript: DuerOS skill development with the Node.js SDK", and "DuerOS skill development from the Java SDK"). With SSML, the synthesized speech has the intended characteristics, such as intonation, speed and pauses.

DuerOS supports both basic tags and extension tags. The basic tags are all standard SSML tags and form a subset of SSML. Extension tags are custom tags that DuerOS defines on top of the standard SSML language.

Basic tags

At present, there are 6 basic tags:

speak: root tag

audio: inserts existing audio from a URL

say-as: sets how numbers, symbols, etc. are read

sub: replaces the enclosed word with substitute text when spoken

silence: inserts silence at the beginning or end of the spoken text, up to 10 s

phoneme: specifies the pronunciation of polyphonic characters

For the audio tag, the audio must be at an address that the server can access. The currently supported format is WAV with 16 kHz or 24 kHz sampling, 16-bit, mono, and a 44-byte header. Due to performance constraints, audio files must be uploaded to the Baidu Cloud BOS platform and referenced by the address that BOS provides. A single request may contain at most 3 audio resources, and each audio resource is limited to 3 MB.
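
Before uploading, a file can be checked locally against these limits. The Python sketch below uses only the standard library; the function name check_wav is hypothetical, the size limit is assumed to mean 3 MiB, and the 44-byte-header test is a heuristic (file size minus PCM data size) that would also flag files with extra trailing chunks.

```python
import os
import wave

ALLOWED_RATES = {16000, 24000}   # 16 kHz or 24 kHz sampling
MAX_BYTES = 3 * 1024 * 1024      # assumption: the 3 MB limit is read as 3 MiB
HEADER_BYTES = 44                # canonical PCM WAV header size


def check_wav(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append("file is larger than 3 MB")

    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()      # bytes per sample
        channels = wav.getnchannels()
        frames = wav.getnframes()

    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate {rate} Hz is not 16000 or 24000")
    if width != 2:
        problems.append(f"sample width is {8 * width}-bit, expected 16-bit")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    # Heuristic: everything that is not PCM data should be the 44-byte header.
    if size - frames * width * channels != HEADER_BYTES:
        problems.append("header does not appear to be exactly 44 bytes")
    return problems
```

Calling check_wav("prompt.wav") on a hypothetical local file returns the list of violations, so the file can be converted before it is uploaded.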

Convert the audio to a supported format before using it. ffmpeg is recommended. The command is along the following lines, where input_file and output_file.wav stand for your own paths:

ffmpeg -i input_file -acodec pcm_s16le -b:a 16k -ar 16000 -ac 1 -flags bitexact output_file.wav

The audio tag can be written either as a self-closing tag or as an opening and closing pair. In the paired form, the nested text is synthesized when the audio cannot be accessed.

Extension tags

At present, there are 4 kinds of extension tags:

background: sets a background sound

say-as: adds two new values for the interpret-as attribute, valid for English only

poem: marks poetry; the attribute value "wuyan" denotes a five-character poem, "qiyan" a seven-character poem, and "songci" a Song ci

space: inserts pauses at the spaces in the enclosed text

The background tag is similar to the audio tag: its audio files must also be uploaded to the Baidu Cloud BOS platform and referenced by the resource URL that BOS provides.

Usage constraints

The SSML implementation in DuerOS is a subset of the W3C specification and has the following constraints in the application process:

The audio tag does not support nested audio or background tags (inner tags do not take effect)

The background tag does not support nesting itself (inner tags do not take effect)

The sub and say-as tags do not support nesting any other tags; doing so causes a parsing error, and the tags are read aloud letter by letter

Tags placed inside non-Chinese text affect semantic conversion; it is recommended to write such text in Chinese characters in the request

The characters &, " and ' should also be escaped (as &amp;, &quot; and &apos;) before use

The total length of request text (excluding SSML tags) should be less than 1024 bytes

Note that the text length is calculated in GBK encoding, while the text content itself is encoded in UTF-8, and the text can be up to 4k long without SSML.
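
Several of these constraints can be checked before a response is sent. The Python sketch below is an illustration only, not part of any DuerOS SDK: the function names are hypothetical, and it assumes the 1024-byte limit applies to the GBK-encoded text with all tags removed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MAX_TEXT_BYTES = 1024
LEAF_ONLY = {"sub", "say-as"}                 # must not contain other tags
FORBIDDEN_INSIDE = {
    "audio": {"audio", "background"},         # no audio/background inside audio
    "background": {"background"},             # background cannot nest itself
}


def escape_text(text):
    """Escape &, <, > and both quote characters before embedding text in SSML."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def text_length_gbk(ssml):
    """Length in bytes of the text content (tags excluded) when encoded as GBK."""
    root = ET.fromstring(ssml)
    return len("".join(root.itertext()).encode("gbk", errors="replace"))


def check(ssml):
    """Return a list of constraint violations found in an SSML string."""
    problems = []
    root = ET.fromstring(ssml)
    for element in root.iter():
        if element.tag in LEAF_ONLY and len(element) > 0:
            problems.append(f"<{element.tag}> must not contain other tags")
        forbidden = FORBIDDEN_INSIDE.get(element.tag, set())
        for inner in element.iter():
            if inner is not element and inner.tag in forbidden:
                problems.append(f"<{inner.tag}> nested inside <{element.tag}>")
    if text_length_gbk(ssml) >= MAX_TEXT_BYTES:
        problems.append("text is not shorter than 1024 bytes in GBK")
    return problems
```

Because the length is measured in GBK, a Chinese character counts as 2 bytes here even though it occupies 3 bytes in the UTF-8 request.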

