mVoice SSML Language Reference

SSML elements and attributes supported by MAMA AI mVoice TTS Engine API.

Last update: May 12, 2025. Valid for mVoice version 2.31+.

Quick Start

Copy & paste this into the mVoice demo page and click “Speak!”
(If you don’t have access to mVoice, contact MAMA)


<speak>
  <voice name="en-UK_AdamU16">
    And now, listen!
    <lang xml:lang="cs-CZ">Adam mluví česky!</lang>
    <mvoice:background_audio
            src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3"
            volume="-12dB" duck="4" loop="on" fade_in="1s" fade_out="1s"
            clip_begin="40s" clip_end="999s">
      Here we go.
      <break time="2000ms"/>
      <prosody rate="90%">All of this</prosody>,
      including the following sound,
      <audio src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/ding.ogg" loud_norm="I=-23">ding ding</audio>,
      should be backed by cosy music.
    </mvoice:background_audio>
    How did you like it?
    <prosody pitch="110%">A bit higher voice.</prosody>
    <prosody volume="-6dB">Lower volume.</prosody>
    <p>
      <s>And by the way, most of the above can be freely nested.</s>
    </p>

  </voice>
</speak>

The following sections provide a full reference for supported SSML and SMIL elements and the Lexicon.

Note that mVoice does not strictly require all boilerplate SSML content. Unrecognized SSML elements are ignored.

Supported SSML Elements

<speak>

Root element. SSML content must be enclosed within this.

Example:

<speak>
  Text to be spoken.
</speak>
Attribute Description
mvoice:file_name Optional output file name.
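The mvoice:file_name attribute is not used in the example above; a minimal sketch (the file name "greeting.wav" is an illustrative value, not from the original reference):

```xml
<speak mvoice:file_name="greeting.wav">
  Text to be spoken.
</speak>
```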

<voice>

Changes voice.

Example:

<voice name="en-UK_AdamU16">
  English text.
</voice>
Attribute Description
name Name of the voice

Note that the list of voices is available at the API endpoint /v1/voices

<break>

Inserts silence of the given length. When <break> follows a sentence, its value replaces the sentence’s default trailing silence, so this element can be used to override model-generated pauses. Consecutive <break> elements are concatenated, i.e. their values are summed.

Example:

<break time="2000ms"/>
Attribute Description
time Value in milliseconds "100ms" or seconds "2.4s".

The attribute time is optional. The default value is 750ms.
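A sketch illustrating the summation rule above; assuming the documented behaviour, the two consecutive breaks should add up to about three seconds of silence:

```xml
<speak>
  First sentence.
  <break time="2s"/>
  <break time="1000ms"/>
  Second sentence.
</speak>
```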

<p>

A paragraph. Adds a logical structure to a document. After a paragraph, a suitable break is inserted automatically.

Example:

<p>
    A sentence. Another one.
</p>

<s>

A sentence. Adds a logical structure to a document. The text enclosed in <s> is treated as a single sentence (up to an internal limit).

Example:

<s>A sentence.</s>

<mvoice:make_continuous>

Attempts to make the enclosed text sound continuous, as a “single utterance” (up to an internal limit).

Example:

<mvoice:make_continuous>A sentence. Another one.</mvoice:make_continuous>

<emphasis>

Causes the enclosed text to be read with emphasis. Suitable for whole sentences.

Example:

<emphasis>
    This is a title.
</emphasis>
And this is a body.

<prosody>

Provides means to control prosody: tempo, pitch, range, pitch contour and volume. More in the W3C recommendation.

In addition, mvoice:modulate_pitch allows modulating the speech pitch by digital signal processing (DSP) means. This does not attempt to preserve the identity of the original voice - it aims at generating an unreal-sounding voice. The overall speech tempo does not change with this effect. Other prosody parameters can be used alongside it.

Example:

<prosody rate="90%" pitch="120%" volume="-6dB">
    Slower, higher pitch and less volume.
</prosody>
Attribute Description
rate Speaking rate as a percentage of the default speaking rate, "N%" (range 50% to 200%),
     or a value relative to the default speaking rate, "[+-]N%" (range -50% to +100%),
     or any of: "x-slow", "slow", "medium", "fast", "x-fast", "default".
pitch Pitch as a percentage of the default pitch, "N%" (range 50% to 200%),
     or a value relative to the default pitch, "[+-]N%" (range -50% to +100%),
     or a shift of N semitones w.r.t. the default pitch, "[+-]Nst" (the sign is mandatory),
     or a relative shift of N Hertz, e.g. "+20Hz" (range -50Hz to +50Hz).
range Pitch range as a percentage of the default pitch range, "N%" (range 0% to 200%),
     or a value relative to the default pitch range, "[+-]N%" (range -100% to +100%).
contour A list of control points, see below.
volume Absolute value: "silent", "x-soft", "soft", "medium", "loud", "x-loud", "default",
     or a value relative to the current setting (not to the default): "+6.0dB", "-3dB".
mvoice:modulate_pitch DSP modulation factor: >1.0 = higher pitch (“helium voice”), <1.0 = lower pitch (“monster”). Defaults to 1.0 (range 0.5 to 2.0).
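The table documents mvoice:modulate_pitch, but the example above does not use it; a minimal sketch with arbitrarily chosen factors inside the documented 0.5 to 2.0 range:

```xml
<speak>
  <prosody mvoice:modulate_pitch="1.5">An unreal, helium-like voice.</prosody>
  <prosody mvoice:modulate_pitch="0.7" rate="90%">A slower, monster-like voice.</prosody>
</speak>
```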

Note that relative volume changes in nested elements compound, i.e.

<prosody volume="+6dB">hello<prosody volume="-2dB">world</prosody></prosody>

will synthesise the word “world” +4dB above the default level.

The prosody contour is a list of control points; each point is a pair of a time and a pitch value. The time is a percentage of the duration of each period of the enclosed text (a period typically being a sentence or its part). The pitch value is a percentage of the common pitch (absolute or relative). The list is a space-separated sequence of pairs, e.g. "(0%,100%) (50%,-30%) (100%,100%)". The pitch is interpolated between the points. Boundary points are implicitly set to 100% when not specified. The pitch values must be in the range 50% to 200% absolute.

The prosody contour combines with other pitch modifiers, so for example

<prosody rate="70%" range="10%" pitch="120%" contour="(20%,+60%) (80%,-30%)">This is a sentence with a customized prosody.</prosody>

will first set the rate to 70%, the range to 10% (to suppress the native pitch contour) and the pitch up to 120%, and then apply the contour on top of that. It means that a contour point (50%,100%) would set the pitch at 100% of the current pitch, that is, 120% of the model default, in the middle of the sentence. The contour point (80%,-30%) (or, equally, (80%,70%)) sets the pitch at 70% of the current pitch, that is, 120% * 70% = 84% of the model default pitch, at 80% of the sentence length.

<audio>

Insert external audio at the current position.

Example:

<audio src="https://my.site/audio.mp3">Audio not found.</audio>
Attribute Description
src Audio URL. The file must be accessible by mVoice.
    Many common formats are supported: MS wave, FLAC, Ogg Vorbis, mp3, aac, ac4, raw, …
mvoice:effects Optional audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Optional loudness normalization specification string, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

<phonemes>

Insert a word defined by a sequence of phonemes.

Example:

<phonemes ph="a ɦ o j" alphabet="universal">Ahoj</phonemes>
Attribute Description
ph A sequence of phoneme symbols separated by spaces.
alphabet Alphabet of the symbols: "native", "universal" or "globalphone" (Czech only).
mvoice:merge_left When set to "true", merge phonemes with a previous word. Defaults to false.
mvoice:merge_right When set to "true", merge phonemes with a subsequent word. Defaults to false.
Tip: the following intentionally invalid SSML can be used to obtain the list of supported phonemes:

<!-- Wrong phonemes - this SSML serves to get the list of phonemes only! -->
<speak><phonemes ph="unknown" alphabet="native">word</phonemes></speak>
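A sketch of the merge attributes; the split word and its phoneme sequence are made up for illustration:

```xml
<speak>
  <!-- "cyber" and the phoneme sequence should be read as a single word. -->
  cyber<phonemes ph="p a n k" alphabet="universal" mvoice:merge_left="true">punk</phonemes>
</speak>
```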

<lang>

Defines the language in which the enclosed text is spoken.

Every voice has a native language (defined by the languageCodes property of each voice in the voice list available at the /v1/voices endpoint). However, using <lang> we can force the voice to pronounce words from foreign languages (the list of supported foreign languages may vary). In this case, phonemes of the foreign language are mapped to the current voice’s phonemes, so the pronunciation is not perfect but is typically intelligible. The feature is useful for pronouncing a foreign sentence when we don’t want to spell the words out manually using the <phonemes> element.

Example:

<!-- Current voice set for example to Czech cs-CZ_Jana -->
<speak>
  Nazdar Karle.
  <lang xml:lang="en-US">
    How are you feeling today?
  </lang>
</speak>
Attribute Description
xml:lang Target language locale in the form ab-CD, for example "de-DE".

<lexicon>

Imports external Pronunciation Lexicon Specification (PLS) file.

mVoice accepts PLS files complying with the standard version 1.0, with some further restrictions, see section on Pronunciation Lexicon below.

Example:

<speak>
    <lexicon uri="https://my.site.eu/lexicon.pls"/>
    Words in this text will be processed by the lexicon before they are synthesized.
</speak>
Attribute Description
uri Lexicon file URI.

<mark> (or <bookmark>)

An empty element that places a named marker at a specific location in the SSML. When rendering the audio, mVoice computes the time offset of the marker in the output audio stream and reports the marker position to the user. It is guaranteed that the information about the location of the marker arrives before the actual audio data. Note that this information is only available via a WebSockets connection, which allows sending metadata along with the audio stream.

The marker may be placed almost anywhere in the SSML. Markers work with all SSML features in streaming, as well as with SMIL extensions such as parallel media.

Example:

<speak>
    Nazdar Karle,
    <mark name="1" />
    teď vyslovím slovo jedna. A 
    <break time="1s"/>
    <bookmark mark="2"/>
    teď slovo dva.
    <mark name="EOS"/>
</speak>
Attribute Description
name Name of the marker.

An equivalent to <mark name="beg-of-John"> is <bookmark mark="beg-of-John">; both elements are interchangeable. Note that in <bookmark> the attribute is named mark instead of name.

When processing the WebSocket messages, the marker information arrives as a JSON message like {"marks": [{"mark": "EOS", "offset": 3.32}]}, meaning that the marker “EOS” appeared at 3.32 seconds into the audio stream.

<mvoice:background_audio>

Adds a background audio track to the synthesized audio at a specific location and for a specific length. Several attributes allow fine-tuning the result.

The element cannot be nested, but there are no restrictions on the foreground audio content, i.e. it can span any SSML content except another <mvoice:background_audio>.

Example:

<mvoice:background_audio 
        src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3" 
        volume="-12dB" duck="4" loop="on" fade_in="1s" fade_out="1.5s" 
        clip_begin="40s" clip_end="999s">
This text will be spoken with the above audio track playing in a loop in the background
  through the length of this speech. 
</mvoice:background_audio>
Attribute Description
src Background audio URL.
volume Audio volume of the background track in relative dB: "0dB" means no change.
duck Strength of the “ducker” effect (level of sidechain compression):
"1" means no compression, higher values (max. 20) compress more.
loop Whether the background audio should be played in a loop if it is shorter than the foreground. "on" or "off".
clip_begin Start playing N seconds into the audio, "Ns".
clip_end Stop playing when reaching N seconds into the audio, "Ns".
fade_in Apply fade-in to the background audio during N first seconds, "Ns".
fade_out Apply fade-out to the background audio during N last seconds, "Ns".
mvoice:effects Optional audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Optional loudness normalization specification string, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

SMIL extensions:

<par>

Adds a parallel media container element, which allows playing multiple media simultaneously.

<par> can only contain a set of <media> elements. Each <media> element defines the position of its audio relative to the beginning of the <par> element or relative to another <media> element within the same <par>. The default position of all <media> elements is zero, i.e. the beginning of the <par>. When a <par> element is rendered, we first render the audio of the individual <media> elements, then apply audio effects on them, then figure out the absolute locations of the <media> elements, and finally down-mix (add) the audio tracks together. The length of the resulting audio is determined by the longest track.

Note that negative media positions are not considered errors: the relevant audio is simply trimmed (when it begins before 0) or removed altogether (when it ends before or at 0).

Example (contents of <media> omitted for brevity):

<par>
    <media xml:id="speech" begin="400ms"></media>
    <media xml:id="ding" begin="speech.end-2s"></media>
    <media xml:id="speech2" end="ding.begin+3.5s"></media>
    <media xml:id="background_sound"></media>
</par>

The above will play a “background_sound”, then after 400ms it will add “speech” media, then 2 seconds before the “speech” ends it will add a “ding” media, and finally it will add a “speech2” media at a position computed so that it ends exactly 3.5 seconds after “ding” started.

There are no attributes in <par>.

<media>

Defines a media element within a <par> element. Its attributes define the position within the <par> element and allow applying audio effects similar to those of the <mvoice:background_audio> element. The content of the <media> element is not constrained: it can be any other element such as <speak>, <par>, <audio>, plain text etc., i.e. full recursion is supported. All audio effects are applied to the completely rendered audio content of the <media> element.

Example:

<media xml:id="media1" begin="0.03s">
  <speak>Example text.</speak>
</media>
<media xml:id="media2" begin="0.03s" trim_begin="0.35s" trim_end="0.25s" fade_in="0.5s" fade_out="1s" volume="3dB" repeat_count="9" duration_limit="5s">
This is a longer example text.
</media>

In the above, “media2” is rendered by the following sequence of operations: 1. Synthesize audio from text. 2. Trim the synthesized audio by initial 0.35 s and trailing 0.25 s. 3. Adjust volume by +3 dB. 4. Apply fade-in and fade-out effects. 5. Repeat the resulting clip 9 times. 6. If the audio is longer than 5 seconds, trim it to 5 seconds.

Attribute Description
xml:id A unique identifier within <par>.
begin Begin time position of the media within <par>, see below.
end End time position of the media within <par>, see below.
trim_begin Trim off initial S seconds of audio, "Ss".
trim_end Trim off trailing S seconds of audio, "Ss".
fade_in Apply fade-in during S first seconds, "Ss".
fade_out Apply fade-out during S last seconds, "Ss".
volume Adjust sound volume by S dB relative, "SdB", 0 means no change.
repeat_count Repeat the audio N times, "N". The result can be limited by duration_limit.
duration_limit Limit the length of the audio (after applying previous effects) to at most S seconds, "Ss".
mvoice:effects Apply an audio effects chain, e.g. "gain -2 highpass -1 120", see Audio effects below.
loud_norm Loudness normalization specification, e.g. "I=-23:TP=-1:LRA=7", see Loudness Normalization below.

All attributes are optional. S is a float number, N is an integer.

Audio effects are applied in the order shown in the table above: trimming first, then volume & fading, then repeat, then the duration limit, then mvoice:effects, and finally loudness normalization.

Only one of the begin and end attributes may be specified. Its value defines either an absolute position within the <par> or a position relative to another media in the same <par>. When a relative position is used, the referenced media must have a valid xml:id attribute.

Note that <par> and <media> elements inherit the TTS state from the enclosing elements. For example, when you select a voice and set the prosody rate to 110% and then insert a <par> element, any text within the <par> is by default spoken at the 110% rate.

Example:

<speak>
<prosody pitch="+5st" rate="110%">
    <par>
        <!-- Nazdar Karle speaks at +5st, 110%.-->
        <media>Nazdar Karle.</media>
    </par>
</prosody>
</speak>

Loudness Normalization

Loudness normalization follows EBU R 128 standard.
Loudness normalization specification string consists of up to three parameters separated by colons: I (integrated loudness target, LUFS), TP (maximum true peak, dBTP) and LRA (loudness range target, LU). At least one parameter must be provided.

Allowed values follow the ffmpeg loudnorm filter.

Example: I=-20:TP=-2 normalizes to -20 LUFS with a maximum true peak level of -2 dBTP.
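Assuming the loud_norm attribute of the <audio> element described earlier, the specification string can be attached directly to an inserted clip (the URL is a placeholder):

```xml
<audio src="https://my.site/audio.mp3" loud_norm="I=-20:TP=-2:LRA=7">Audio not found.</audio>
```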

Audio effects

Audio effects are defined via the mvoice:effects attribute. Its value follows the SoX effects syntax (see also the “soxeffect” man page).

mVoice implements a subset of SoX effects: highpass, lowpass, equalizer, treble, bass, gain, compand, pan (this may change in the future).

Effects chaining is allowed.

Note that the pan effect, which takes a float argument between -1 (all to the left) and +1 (all to the right), only works with stereo output audio.
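A sketch combining chained effects with pan on an external clip (the URL is a placeholder; as noted above, pan only takes effect with stereo output):

```xml
<audio src="https://my.site/ding.ogg" mvoice:effects="gain -3 pan -0.8">ding</audio>
```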

Example of a complete SSML document applying effects on a background audio, on an external audio, and on the contents of a <media> element.

<speak>
    <par>
        <media xml:id="all">
            First sentence.
            <mvoice:background_audio src="https://example.com/background.mp3"
                                     mvoice:effects="highpass -2 200">
                Let's play a jingle.
                <audio src="https://example.com/jingle.mp3"
                       mvoice:effects="compand 0.002,0.002 -60:-5,0,0 10 -60 0.002">
                    ding ding
                </audio>
            </mvoice:background_audio>
        </media>
        <media xml:id="compressed" begin="all.end+0s" 
               mvoice:effects="lowpass -2 3000 compand 0.002,0.002 -60,-10,0,-5 -5 -60 0.002">
               This sentence was heavily processed.
        </media>
    </par>
    Final words.
</speak>

Pronunciation Lexicon Specification

mVoice supports Pronunciation Lexicon Specification v 1.0 which is enabled using <lexicon> SSML element as described in section <lexicon>.

File format

mVoice accepts all standard PLS files. An example file follows:

<lexicon version="1.0"
         xml:lang="cs-CZ"
         alphabet="native">
<!-- xml:lang must match the language of the voice currently in use. -->
<!-- The alphabet must be supported by mVoice; for example, "universal" is close to IPA. -->
    <lexeme>
        <grapheme>ICQ</grapheme>
        <alias>aj sík jů</alias>
<!--Replace "ICQ" with "aj sík jů".    -->
    </lexeme>

    <lexeme>
        <grapheme>Eliska</grapheme>
        <grapheme>Karel</grapheme>
        <phoneme alphabet="universal">ɛː l i ʃ k a</phoneme>
        <phoneme prefer="true">j aa g u sh k a</phoneme>
<!--Both "Eliska" and "Karel" are replaced with a phoneme sequence "j aa g u sh k a".-->
    </lexeme>
</lexicon>

XML attributes:
- <lexicon>: the version, xml:lang and alphabet attributes are required
- <lexeme>: no attributes
- <grapheme>: no attributes
- <alias>: optional prefer attribute
- <phoneme>: optional prefer and alphabet attributes

How is Lexicon Applied

Upon loading the lexicon, we convert a possibly many-to-many mapping between graphemes and aliases or phonemes into a one-to-one mapping: every grapheme is transformed either to an alias text or to a phoneme sequence. The selection algorithm follows the W3C standard.

This produces a map from a string to an object (an alias or a phoneme). None of grapheme, phoneme or alias may be empty. Any boundary whitespace is stripped off.

In addition, note that the Lexicon is applied to standardized text. That is, all whitespace is replaced with a single space ' ' character and some other special characters are standardized; for example, the quotes „“« all become ". Graphemes from the lexicon are standardized too, so there is no need for the user to standardize lexicon entries. However, in rare cases this can result in colliding entries that become indistinguishable after standardization.

Lexicon is case-sensitive.

Pattern matching the lexicon against the standardized text cannot be done by simply searching for graphemes, because e.g. the grapheme but would match a part of the word butter. It works as follows:

Lexicon Limitations

Reference: