mVoice SSML Language Reference

SSML elements and attributes supported by MAMA AI mVoice TTS Engine API.

Last update: May 12, 2025. Valid for mVoice version 2.31+.

Quick Start

Copy & paste this into the mVoice demo page and click “Speak!”
(If you don’t have access to mVoice, contact MAMA)


<speak>
  <voice name="en-UK_AdamU16">
    And now, listen!
    <lang xml:lang="cs-CZ">Adam mluví česky!</lang>
    <mvoice:background_audio
            src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3"
            volume="-12dB" duck="4" loop="on" fade_in="1s" fade_out="1s"
            clip_begin="40s" clip_end="999s">
      Here we go.
      <break time="2000ms"/>
      <prosody rate="90%">All of this</prosody>,
      including the following sound,
      <audio src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/ding.ogg" loud_norm="I=-23">ding ding</audio>,
      should be backed by cosy music.
    </mvoice:background_audio>
    How did you like it?
    <prosody pitch="110%">A bit higher voice.</prosody>
    <prosody volume="-6dB">Lower volume.</prosody>
    <p>
      <s>And by the way, most of the above can be freely nested.</s>
    </p>

  </voice>
</speak>

The following sections provide a full reference for supported SSML and SMIL elements and the Lexicon.

Note that mVoice does not strictly require all boilerplate SSML content. Unrecognized SSML elements are ignored.

Supported SSML Elements

<speak>

Root element. SSML content must be enclosed within this.

Example:

<speak>
  Text to be spoken.
</speak>
Attribute Description
mvoice:file_name Optional output file name.
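The mvoice:file_name attribute is not used in the example above; a minimal sketch (the file name "greeting.wav" is an illustrative value, not from the original reference):

```xml
<speak mvoice:file_name="greeting.wav">
  Text to be spoken.
</speak>
```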

<voice>

Changes voice.

Example:

<voice name="en-UK_AdamU16">
  English text.
</voice>
Attribute Description
name Name of the voice

Note that the list of voices is available at the API endpoint /v1/voices

<break>

Inserts silence of the given length. When <break> follows a sentence, its value replaces the sentence’s default trailing silence, so this element can be used to override model-generated pauses. Consecutive <break> elements are concatenated, i.e. their values are summed.

Example:

<break time="2000ms"/>
Attribute Description
time Value in milliseconds "100ms" or seconds "2.4s".

The attribute time is optional. The default value is 750ms.
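A sketch illustrating the summation rule above; assuming the documented behaviour, the two consecutive breaks should add up to about three seconds of silence:

```xml
<speak>
  First sentence.
  <break time="2s"/>
  <break time="1000ms"/>
  Second sentence.
</speak>
```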

<p>

A paragraph. Adds a logical structure to a document. After a paragraph, a suitable break is inserted automatically.

Example:

<p>
    A sentence. Another one.
</p>

<s>

A sentence. Adds a logical structure to a document. The text enclosed in <s> is treated as a single sentence (up to an internal limit).

Example:

<s>A sentence.</s>

<mvoice:make_continuous>

Attempts to make the enclosed text sound continuous, as a “single utterance” (up to an internal limit).

Example:

<mvoice:make_continuous>A sentence. Another one.</mvoice:make_continuous>

<emphasis>

Causes the enclosed text to be read with emphasis. Suitable for whole sentences.

Example:

<emphasis>
    This is a title.
</emphasis>
And this is a body.

<prosody>

Provides means to control prosody: tempo, pitch, range, pitch contour and volume. More in the W3C recommendation.

In addition, mvoice:modulate_pitch allows modulating the speech pitch by digital signal processing (DSP) means. This does not attempt to preserve the identity of the original voice - it aims at generating an unreal-sounding voice. The overall speech tempo does not change with this effect. Other prosody parameters can be used alongside it.

Example:

<prosody rate="90%" pitch="120%" volume="-6dB">
    Slower, higher pitch and less volume.
</prosody>
Attribute Description
rate Speaking rate as a percentage of the default speaking rate, "N%" (range 50% to 200%),
     or a value relative to the default speaking rate, "[+-]N%" (range -50% to +100%),
     or any of: "x-slow", "slow", "medium", "fast", "x-fast", "default".
pitch Pitch as a percentage of the default pitch, "N%" (range 50% to 200%),
     or a value relative to the default pitch, "[+-]N%" (range -50% to +100%),
     or a shift of N semitones w.r.t. the default pitch, "[+-]Nst" (the sign is mandatory),
     or a relative shift of N Hertz, e.g. "+20Hz" (range -50Hz to +50Hz).
range Pitch range as a percentage of the default pitch range, "N%" (range 0% to 200%),
     or a value relative to the default pitch range, "[+-]N%" (range -100% to +100%).
contour A list of control points, see below.
volume Absolute value: "silent", "x-soft", "soft", "medium", "loud", "x-loud", "default",
     or a value relative to the current setting (not to the default): "+6.0dB", "-3dB".
mvoice:modulate_pitch DSP modulation factor: >1.0 = higher pitch (“helium voice”), <1.0 = lower pitch (“monster”). Defaults to 1.0 (range 0.5 to 2.0).
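The table documents mvoice:modulate_pitch, but the example above does not use it; a minimal sketch with arbitrarily chosen factors inside the documented 0.5 to 2.0 range:

```xml
<speak>
  <prosody mvoice:modulate_pitch="1.5">An unreal, helium-like voice.</prosody>
  <prosody mvoice:modulate_pitch="0.7" rate="90%">A slower, monster-like voice.</prosody>
</speak>
```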

Note that relative volume changes in nested elements compound, i.e.

<prosody volume="+6dB">hello<prosody volume="-2dB">world</prosody></prosody>

will synthesise the word “world” +4dB above the default level.

The prosody contour is a list of control points; each point is a pair of a time and a pitch value. The time is a percentage of the duration of each period of the enclosed text (a period typically being a sentence or its part). The pitch value is a percentage of the common pitch (absolute or relative). The list is a space-separated sequence of pairs, e.g. "(0%,100%) (50%,-30%) (100%,100%)". The pitch is interpolated between the points. Boundary points are implicitly set to 100% when not specified. The pitch values must be in the range 50% to 200% absolute.

The prosody contour combines with other pitch modifiers, so for example

<prosody rate="70%" range="10%" pitch="120%" contour="(20%,+60%) (80%,-30%)">This is a sentence with a customized prosody.</prosody>

will first set the rate to 70%, the range to 10% (to suppress the native pitch contour) and the pitch up to 120%, and then apply the contour on top of that. It means that a contour point (50%,100%) would set the pitch at 100% of the current pitch, that is, 120% of the model default, in the middle of the sentence. The contour point (80%,-30%) (or, equally, (80%,70%)) sets the pitch at 70% of the current pitch, that is, 120% * 70% = 84% of the model default pitch, at 80% of the sentence length.

<audio>

Insert external audio at the current position.

Example:

<audio src="https://my.site/audio.mp3">Audio not found.</audio>
Attribute Description
src Audio URL. The file must be accessible by mVoice.
    Many common formats are supported: MS wave, FLAC, Ogg Vorbis, mp3, aac, ac4, raw, …
mvoice:effects Optional audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Optional loudness normalization specification string, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

<phonemes>

Insert a word defined by a sequence of phonemes.

Example:

<phonemes ph="a ɦ o j" alphabet="universal">Ahoj</phonemes>
Attribute Description
ph A sequence of phoneme symbols separated by spaces.
alphabet Alphabet of the symbols: "native", "universal" or "globalphone" (Czech only).
mvoice:merge_left When set to "true", merge phonemes with a previous word. Defaults to false.
mvoice:merge_right When set to "true", merge phonemes with a subsequent word. Defaults to false.
Tip: the following intentionally invalid SSML can be used to obtain the list of supported phonemes:

<!-- Wrong phonemes - this SSML serves to get the list of phonemes only! -->
<speak><phonemes ph="unknown" alphabet="native">word</phonemes></speak>
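A sketch of the merge attributes; the split word and its phoneme sequence are made up for illustration:

```xml
<speak>
  <!-- "cyber" and the phoneme sequence should be read as a single word. -->
  cyber<phonemes ph="p a n k" alphabet="universal" mvoice:merge_left="true">punk</phonemes>
</speak>
```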

<lang>

Defines the language in which the enclosed text is spoken.

Every voice has a native language (defined by the languageCodes property of each voice in the voice list available at the /v1/voices endpoint). However, using <lang> we can force the voice to pronounce words from foreign languages (the list of supported foreign languages may vary). In this case, phonemes of the foreign language are mapped to the current voice’s phonemes, so the pronunciation is not perfect but is typically intelligible. The feature is useful for pronouncing a foreign sentence when we don’t want to spell the words out manually using the <phonemes> element.

Example:

<!-- Current voice set for example to Czech cs-CZ_Jana -->
<speak>
  Nazdar Karle.
  <lang xml:lang="en-US">
    How are you feeling today?
  </lang>
</speak>
Attribute Description
xml:lang Target language locale in the form ab-CD, for example "de-DE".

<lexicon>

Imports external Pronunciation Lexicon Specification (PLS) file.

mVoice accepts PLS files complying with the standard version 1.0, with some further restrictions, see section on Pronunciation Lexicon below.

Example:

<speak>
    <lexicon uri="https://my.site.eu/lexicon.pls"/>
    Words in this text will be processed by the lexicon before they are synthesized.
</speak>
Attribute Description
uri Lexicon file URI.

<mark> (or <bookmark>)

An empty element that places a named marker at a specific location in the SSML. When rendering the audio, mVoice computes the time offset of the marker in the output audio stream and reports the marker position to the user. It is guaranteed that the information about the location of the marker arrives before the actual audio data. Note that this information is only available via a WebSockets connection, which allows sending metadata along with the audio stream.

The marker may be placed almost anywhere in the SSML. Markers work with all SSML features in streaming, as well as with SMIL extensions such as parallel media.

Example:

<speak>
    Nazdar Karle,
    <mark name="1" />
    teď vyslovím slovo jedna. A 
    <break time="1s"/>
    <bookmark mark="2"/>
    teď slovo dva.
    <mark name="EOS"/>
</speak>
Attribute Description
name Name of the marker.

An equivalent to <mark name="beg-of-John"> is <bookmark mark="beg-of-John">; both elements are interchangeable. Note that in <bookmark> the attribute is named mark instead of name.

When processing the WebSocket messages, the marker information arrives as a JSON message like {"marks": [{"mark": "EOS", "offset": 3.32}]}, meaning that the marker “EOS” appeared at 3.32 seconds into the audio stream.

<mvoice:background_audio>

Adds a background audio track to the synthesized audio at a specific location and for a specific length. Several attributes allow fine-tuning the result.

The element cannot be nested, but there are no restrictions on the foreground audio content, i.e. it can span any SSML content except another <mvoice:background_audio>.

Example:

<mvoice:background_audio 
        src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3" 
        volume="-12dB" duck="4" loop="on" fade_in="1s" fade_out="1.5s" 
        clip_begin="40s" clip_end="999s">
This text will be spoken with the above audio track playing in a loop in the background
  through the length of this speech. 
</mvoice:background_audio>
Attribute Description
src Background audio URL.
volume Audio volume of the background track in relative dB: "0dB" means no change.
duck Strength of the “ducker” effect (level of sidechain compression):
"1" means no compression, higher values (max. 20) compress more.
loop Whether the background audio should be played in a loop if it is shorter than the foreground. "on" or "off".
clip_begin Start playing N seconds into the audio, "Ns".
clip_end Stop playing when reaching N seconds into the audio, "Ns".
fade_in Apply fade-in to the background audio during N first seconds, "Ns".
fade_out Apply fade-out to the background audio during N last seconds, "Ns".
mvoice:effects Optional audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Optional loudness normalization specification string, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

SMIL extensions:

<par>

Adds a parallel media container element, which allows playing multiple media simultaneously.

<par> can only contain a set of <media> elements. Each <media> element defines the position of its audio relative to the beginning of the <par> element or relative to another <media> element within the same <par>. The default position of all <media> elements is zero, i.e. the beginning of the <par>. When a <par> element is rendered, we first render the audio of the individual <media> elements, then apply audio effects on them, then figure out the absolute locations of the <media> elements, and finally down-mix (add) the audio tracks together. The length of the resulting audio is determined by the longest track.

Note that negative media positions are not considered errors: the relevant audio is simply trimmed (when it begins before 0) or removed altogether (when it ends before or at 0).

Example (contents of <media> omitted for brevity):

<par>
    <media xml:id="speech" begin="400ms"></media>
    <media xml:id="ding" begin="speech.end-2s"></media>
    <media xml:id="speech2" end="ding.begin+3.5s"></media>
    <media xml:id="background_sound"></media>
</par>

The above will play a “background_sound”, then after 400ms it will add “speech” media, then 2 seconds before the “speech” ends it will add a “ding” media, and finally it will add a “speech2” media at a position computed so that it ends exactly 3.5 seconds after “ding” started.

There are no attributes in <par>.

<media>

Defines a media element within a <par> element. Its attributes define the position within the <par> element and allow applying audio effects similar to those of the <mvoice:background_audio> element. The content of the <media> element is not constrained: it can be any other element such as <speak>, <par>, <audio>, plain text etc., i.e. full recursion is supported. All audio effects are applied to the completely rendered audio content of the <media> element.

Example:

<media xml:id="media1" begin="0.03s">
  <speak>Example text.</speak>
</media>
<media xml:id="media2" begin="0.03s" trim_begin="0.35s" trim_end="0.25s" fade_in="0.5s" fade_out="1s" volume="3dB" repeat_count="9" duration_limit="5s">
This is a longer example text.
</media>

In the above, “media2” is rendered by the following sequence of operations: 1. Synthesize audio from text. 2. Trim the synthesized audio by initial 0.35 s and trailing 0.25 s. 3. Adjust volume by +3 dB. 4. Apply fade-in and fade-out effects. 5. Repeat the resulting clip 9 times. 6. If the audio is longer than 5 seconds, trim it to 5 seconds.

Attribute Description
xml:id A unique identifier within <par>.
begin Begin time position of the media within <par>, see below.
end End time position of the media within <par>, see below.
trim_begin Trim off initial S seconds of audio, "Ss".
trim_end Trim off trailing S seconds of audio, "Ss".
fade_in Apply fade-in during S first seconds, "Ss".
fade_out Apply fade-out during S last seconds, "Ss".
volume Adjust sound volume by S dB relative, "SdB", 0 means no change.
repeat_count Repeat the audio N times, "N". The result can be limited by duration_limit.
duration_limit Limit the length of the audio (after applying previous effects) to at most S seconds, "Ss".
mvoice:effects Apply an audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Loudness normalization specification, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

All attributes are optional. S is a float number, N is an integer.

Audio effects are applied in the order shown in the table above: trimming first, then volume & fading, then repeat, then the duration limit, then mvoice:effects, and finally loudness normalization.

Only one of the begin and end attributes may be specified. Its value defines either an absolute position within the <par> or a position relative to another media in the same <par>. When a relative position is used, the referenced media must have a valid xml:id attribute.

Note that <par> and <media> elements inherit the TTS state from the enclosing elements. For example, when you select a voice and set the prosody rate to 110% and then insert a <par> element, any text within the <par> is by default spoken at the 110% rate.

Example:

<speak>
<prosody pitch="+5st" rate="110%">
    <par>
        <!-- Nazdar Karle speaks at +5st, 110%.-->
        <media>Nazdar Karle.</media>
    </par>
</prosody>
</speak>

Loudness Normalization

Loudness normalization follows EBU R 128 standard.
Loudness normalization specification string consists of up to three parameters separated by colons: I (integrated loudness target, LUFS), TP (maximum true peak, dBTP) and LRA (loudness range target, LU). At least one parameter must be provided.

Allowed values follow the ffmpeg loudnorm filter.

Example: I=-20:TP=-2 normalizes to -20 LUFS with a maximum true peak level of -2 dBTP.
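Assuming the loud_norm attribute of the <audio> element described earlier, the specification string can be attached directly to an inserted clip (the URL is a placeholder):

```xml
<audio src="https://my.site/audio.mp3" loud_norm="I=-20:TP=-2:LRA=7">Audio not found.</audio>
```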

Audio effects

Audio effects are defined via the mvoice:effects attribute. Its value follows the SoX effects syntax (see also the “soxeffect” man page).

mVoice implements a subset of SoX effects: highpass, lowpass, equalizer, treble, bass, gain, compand, pan (this may change in the future).

Effects chaining is allowed.

Note that the pan effect, which takes a float argument between -1 (all to the left) and +1 (all to the right), only works with stereo output audio.
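A sketch combining chained effects with pan on an external clip (the URL is a placeholder; as noted above, pan only takes effect with stereo output):

```xml
<audio src="https://my.site/ding.ogg" mvoice:effects="gain -3 pan -0.8">ding</audio>
```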

Example of a complete SSML document applying effects on a background audio, on an external audio, and on the contents of a <media> element.

<speak>
    <par>
        <media xml:id="all">
            First sentence.
            <mvoice:background_audio src="https://example.com/background.mp3"
                                     mvoice:effects="highpass -2 200">
                Let's play a jingle.
                <audio src="https://example.com/jingle.mp3"
                       mvoice:effects="compand 0.002,0.002 -60:-5,0,0 10 -60 0.002">
                    ding ding
                </audio>
            </mvoice:background_audio>
        </media>
        <media xml:id="compressed" begin="all.end+0s" 
               mvoice:effects="lowpass -2 3000 compand 0.002,0.002 -60,-10,0,-5 -5 -60 0.002">
               This sentence was heavily processed.
        </media>
    </par>
    Final words.
</speak>

Pronunciation Lexicon Specification

mVoice supports Pronunciation Lexicon Specification v 1.0 which is enabled using <lexicon> SSML element as described in section <lexicon>.

File format

mVoice accepts all standard PLS files. An example file follows:

<lexicon version="1.0"
         xml:lang="cs-CZ"
         alphabet="native">
<!-- xml:lang must match the language of the voice currently in use. -->
<!-- The alphabet must be supported by mVoice; for example, "universal" is close to IPA. -->
    <lexeme>
        <grapheme>ICQ</grapheme>
        <alias>aj sík jů</alias>
<!--Replace "ICQ" with "aj sík jů".    -->
    </lexeme>

    <lexeme>
        <grapheme>Eliska</grapheme>
        <grapheme>Karel</grapheme>
        <phoneme alphabet="universal">ɛː l i ʃ k a</phoneme>
        <phoneme prefer="true">j aa g u sh k a</phoneme>
<!--Both "Eliska" and "Karel" are replaced with a phoneme sequence "j aa g u sh k a".-->
    </lexeme>
</lexicon>

XML attributes:
- <lexicon>: the version, xml:lang and alphabet attributes are required
- <lexeme>: no attributes
- <grapheme>: no attributes
- <alias>: optional prefer attribute
- <phoneme>: optional prefer and alphabet attributes

How is Lexicon Applied

Upon loading the lexicon, we convert a possibly many-to-many mapping between graphemes and aliases or phonemes into a one-to-one mapping: every grapheme is transformed either to an alias text or to a phoneme sequence. The selection algorithm follows the W3C standard.

This produces a map from a string to an object (an alias or a phoneme). None of grapheme, phoneme or alias may be empty. Any boundary whitespace is stripped off.

In addition, note that the Lexicon is applied to standardized text. That is, all whitespace is replaced with a single space ' ' character and some other special characters are standardized; for example, the quotes „“« all become ". Graphemes from the lexicon are standardized too, so there is no need for the user to standardize lexicon entries. However, in rare cases this can result in colliding entries that become indistinguishable after standardization.

Lexicon is case-sensitive.

Pattern matching the lexicon against the standardized text cannot be done by simply searching for graphemes, because e.g. the grapheme but would match a part of the word butter. It works as follows:

Lexicon Limitations

Reference: