SSML elements and attributes supported by MAMA AI mVoice TTS Engine API.
Last update: May 12, 2025. Valid for mVoice version 2.31+.
Copy & paste this into the mVoice demo page and click “Speak!”
(If you don’t have access to mVoice, contact MAMA.)
```xml
<speak>
  <voice name="en-UK_AdamU16">
    And now, listen!
    <lang xml:lang="cs-CZ">Adam mluví česky!</lang>
    <mvoice:background_audio
        src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3"
        volume="-12dB" duck="4" loop="on" fade_in="1" fade_out="1"
        clip_begin="40" clip_end="999">
      Here we go.
      <break time="2000ms"/>
      <prosody rate="90%">All of this</prosody>,
      including the following sound,
      <audio src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/ding.ogg" loud_norm="I=-23">ding ding</audio>,
      should be backlit by a cosy music.
    </mvoice:background_audio>
    How did you like it?
    <prosody pitch="110%">A bit higher voice.</prosody>
    <prosody volume="-6dB">Lower volume.</prosody>
    <p>
      <s>And by the way, most of the above can be freely nested.</s>
    </p>
  </voice>
</speak>
```
The following sections provide a full reference for supported SSML and SMIL elements and the Lexicon.
Note that mVoice does not strictly require all of the usual SSML boilerplate. Unrecognized SSML elements are ignored.
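For illustration, a minimal sketch of a request; the `<something/>` element below is made up and, being unrecognized, is simply skipped:

```xml
<!-- No xmlns boilerplate is needed; the made-up <something/> element is ignored. -->
<speak>
  Hello world.<something/>
</speak>
```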
<speak>
Root element. SSML content must be enclosed within this.
Example:
```xml
<speak>
  Text to be spoken.
</speak>
```
Attribute | Description
---|---
`mvoice:file_name` | Optional output file name.

When `mvoice:file_name` is defined, its value is shared with the client as part of the response headers (for the REST API) or within a JSON message under the “headers” key (for the WS API).
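For example, a request that names its output file (the file name value is illustrative):

```xml
<speak mvoice:file_name="greeting.wav">
  Text to be spoken.
</speak>
```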
<voice>
Changes voice.
Example:
```xml
<voice name="en-UK_AdamU16">
  English text.
</voice>
```
Attribute | Description
---|---
`name` | Name of the voice.
Note that the list of available voices can be obtained from the API endpoint `/v1/voices`.
<break>
Inserts silence of the given length. When `<break>` follows a sentence, its value replaces the sentence’s default trailing silence, so this element can be used to override model-generated pauses. Consecutive `<break>` elements are concatenated, i.e. their values are summed.
Example:
```xml
<break time="2000ms"/>
```
Attribute | Description
---|---
`time` | Value in milliseconds (`"100ms"`) or seconds (`"2.4s"`).
The `time` attribute is optional. The default value is 750ms.
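Since consecutive breaks are summed, the following sketch produces a single 3-second pause:

```xml
<speak>
  First sentence.
  <break time="2000ms"/><break time="1s"/>
  Second sentence, after a 3-second pause.
</speak>
```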
<p>
A paragraph. Adds a logical structure to a document. After a paragraph, a suitable break is inserted automatically.
Example:
```xml
<p>
  A sentence. Another one.
</p>
```
<s>
A sentence. Adds a logical structure to a document. The text enclosed in `<s>` is treated as a single sentence (up to an internal limit).
Example:
```xml
<s>A sentence.</s>
```
<mvoice:make_continuous>
Attempts to make the enclosed text sound continuous, as a “single utterance” (up to an internal limit).
Example:
```xml
<mvoice:make_continuous>A sentence. Another one.</mvoice:make_continuous>
```
<emphasis>
Causes the enclosed text to be read with an emphasis. Suitable for whole sentences.
Example:
```xml
<emphasis>
  This is a title.
</emphasis>
And this is a body.
```
<prosody>
A means to control prosody: tempo, pitch, range, pitch contour, and volume. More in the W3C recommendation.
In addition, `mvoice:modulate_pitch` allows modulating the speech pitch by digital signal processing (DSP) means. This does not attempt to preserve the identity of the original voice; it aims at generating an unreal-sounding voice. The overall speech tempo does not change with this effect. Other prosody parameters can be used alongside it.
Example:
```xml
<prosody rate="90%" pitch="120%" volume="-6dB">
  Slower, higher pitch and less volume.
</prosody>
```
Attribute | Description
---|---
`rate` | Speaking rate as a percentage of the default speaking rate, `"N%"` (range 50% to 200%); or a value relative to the default speaking rate, `"[+-]N%"` (range -50% to +100%); or any of `"x-slow"`, `"slow"`, `"medium"`, `"fast"`, `"x-fast"`, `"default"`.
`pitch` | Percentage: pitch as a percentage of the default pitch, `"N%"` (range 50% to 200%), or a value relative to the default pitch, `"[+-]N%"` (range -50% to +100%); or Semitones: increase or decrease the pitch by N semitones w.r.t. the default pitch using `"+-Nst"` (sign is mandatory); or a relative shift by N Hertz, e.g. `"+20Hz"` (range -50 to +50 Hz).
`range` | Pitch range as a percentage of the default pitch range, `"N%"` (range 0% to 200%), or a value relative to the default pitch range, `"[+-]N%"` (range -100% to +100%).
`contour` | A list of control points, see below.
`volume` | Absolute value: `"silent"`, `"x-soft"`, `"soft"`, `"medium"`, `"loud"`, `"x-loud"`, `"default"`; or relative values w.r.t. the current setting (not w.r.t. the default): `"+6.0dB"`, `"-3dB"`.
`mvoice:modulate_pitch` | DSP modulation factor: >1.0 = higher pitch (“helium voice”), <1.0 = lower pitch (“monster”). Defaults to 1.0 (range 0.5 to 2.0).
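For instance, a sketch combining DSP pitch modulation with another prosody attribute (the values are illustrative):

```xml
<prosody mvoice:modulate_pitch="1.5" rate="90%">
  A slower, helium-like voice.
</prosody>
```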
Note that relative volume changes are nested, i.e.

```xml
<prosody volume="+6dB">hello<prosody volume="-2dB">world</prosody></prosody>
```

will synthesise the word “world” at +4dB above the default level.
A prosody contour is a list of control points; each point is a pair of a time and a pitch value. The time is a percentage of the duration of each period of the enclosed text (a period being typically a sentence or a part of one). The pitch value is a percentage of the common pitch (absolute or relative). The list is a space-separated sequence of pairs, e.g. `"(0%,100%) (50%,-30%) (100%,100%)"`. The pitch is interpolated between the points. Boundary points are implicitly set to 100% when not specified. The pitch values must be in the range 50% to 200% absolute.
The prosody `contour` combines with other pitch modifiers, so for example

```xml
<prosody rate="70%" range="10%" pitch="120%" contour="(20%,+60%) (80%,-30%)">This is a sentence with a customized prosody.</prosody>
```

will first set the rate to 70%, the range to 10% (to suppress the native pitch contour), the pitch up to 120%, and then apply the contour on top of that. This means that a contour point `(50%,100%)` would set the pitch at 100% of the current pitch, that is 120% of the model default, in the middle of the sentence. The contour point `(80%,-30%)` (or equally `(80%,70%)`) sets the pitch at 70% of the current pitch, that is 120% * 70% = 84% of the model default pitch, at 80% of the sentence length.
<audio>
Insert external audio at the current position.
Example:
```xml
<audio src="https://my.site/audio.mp3">Audio not found.</audio>
```
Attribute | Description
---|---
`src` | Audio URL. The file must be accessible by mVoice. Many common formats are supported: MS wave, FLAC, Ogg Vorbis, mp3, aac, ac4, raw, …
`mvoice:effects` | Optional audio effects chain, e.g. `"gain -2 highpass -1 120"`; see Audio effects below.
`loud_norm` | Optional loudness normalization specification string, e.g. `"I=-23:TP=-1:LRA=7"`; see Loudness normalization below.
Note that `mvoice:effects` are applied before `loud_norm`.

<phonemes>
Insert a word defined by a sequence of phonemes.
Example:
```xml
<phonemes ph="a ɦ o j" alphabet="universal">Ahoj</phonemes>
```
Attribute | Description
---|---
`ph` | A sequence of phoneme symbols separated by spaces.
`alphabet` | Alphabet of the symbols: `"native"`, `"universal"` or `"globalphone"` (Czech only).
`mvoice:merge_left` | When set to `"true"`, merge the phonemes with the previous word. Defaults to false.
`mvoice:merge_right` | When set to `"true"`, merge the phonemes with the subsequent word. Defaults to false.
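A hypothetical sketch of the merge attributes, reusing the Czech phonemes from the example above:

```xml
<!-- The phoneme sequence is merged with the following word "lidi" into one utterance. -->
<phonemes ph="a ɦ o j" alphabet="universal" mvoice:merge_right="true">Ahoj</phonemes> lidi!
```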
"universal"
alphabet is a subset of IPA for the
particular voice.
"a ɦ o j"
for Czech"native"
alphabet is specific to the voice and can be
easier to work with for some languages, rather than IPA.
a h o j
for Czech (applies to older voices
built on GlobalPhone phone set)"globalphone"
alphabet is special for Czech. It allows
to conveniently write phonemes in GlobalPhone no matter if the voice
internally uses GlobalPhone or the more modern IPA alphabet. So when the
user has collected a custom dictionary of special words and their
phoneme sequences in GlobalPhone, this allows to reuse the same
sequences with IPA voices. Note that all new Czech voices use IPA, which
is more expressive than GlobalPhone.
a h o j
for Czech<!-- Wrong phonemes - this SSML serves to get list of phonemes only! -->
speak><phonemes ph="unknown" alphabet="native">word</phonemes></speak> <
<lang>
Defines the language in which the enclosed text is spoken.
Every voice has a native language (defined by the `languageCodes` property of every voice in the voice list available at the `/v1/voices` endpoint). However, using `<lang>` we can force the voice to pronounce words from foreign languages (the list of supported foreign languages may vary). In this case, phonemes from the foreign language are mapped to the current voice’s phonemes, so the pronunciation is not perfect but typically intelligible. The feature can be useful for pronouncing a foreign sentence when we don’t want to spell the words manually using the `<phonemes>` element.
Example:
```xml
<!-- Current voice set for example to Czech cs-CZ_Jana -->
<speak>
  Nazdar Karle.
  <lang xml:lang="en-US">
    How are you feeling today?
  </lang>
</speak>
```
Attribute | Description
---|---
`xml:lang` | Target language locale in the form `ab-CD`, for example `"de-DE"`.
<lexicon>
Imports an external Pronunciation Lexicon Specification (PLS) file.
mVoice accepts PLS files complying with the standard version 1.0, with some further restrictions; see the Pronunciation Lexicon section below.
Example:
```xml
<speak>
  <lexicon uri="https://my.site.eu/lexicon.pls"/>
  Words in this text will be processed by the lexicon before they are synthesized.
</speak>
```
Attribute | Description
---|---
`uri` | Lexicon file URI.
- `<lexicon>` must be a direct child of the root `<speak>` element, and there may be at most one lexicon in the request.
- `<lexicon>` should come before any other elements and text in a `<speak>` element but, in mVoice, the position of the `<lexicon>` element does not matter, i.e. `<speak>Text to be processed.<lexicon uri="https://example.com/lex.pls"/></speak>` works as well.
- Each `<media>` in a `<par>` element can have its own independent lexicon (see the sketch below).
- The lexicon applies to the whole request (or to the enclosing `<media>` element, if applicable).
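A sketch of independent per-media lexicons (the URLs and ids are illustrative):

```xml
<par>
  <media xml:id="first">
    <speak>
      <lexicon uri="https://example.com/first.pls"/>
      Text processed by the first lexicon.
    </speak>
  </media>
  <media xml:id="second" begin="first.end+0.5s">
    <speak>
      <lexicon uri="https://example.com/second.pls"/>
      Text processed by the second lexicon.
    </speak>
  </media>
</par>
```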
<mark> (or <bookmark>)
An empty element that places a named marker into the SSML at a specific location. When rendering the audio, mVoice computes the time offset of the marker in the output audio stream and informs the user about the marker position. It is guaranteed that the information about the location of the marker arrives before the actual audio data. Note that the information is only available via a WebSockets connection, which allows sending metadata along with the audio stream.
The marker may be placed almost anywhere in the SSML. Markers are supported with all SSML features in streaming, as well as with SMIL extensions like parallel media.
Example:
```xml
<speak>
  Nazdar Karle,
  <mark name="1"/>
  teď vyslovím slovo jedna. A
  <break time="1s"/>
  <bookmark mark="2"/>
  teď slovo dva.
  <mark name="EOS"/>
</speak>
```
Attribute | Description
---|---
`name` | Name of the marker.
An equivalent to `<mark name="beg-of-John">` is `<bookmark mark="beg-of-John">`; both elements are interchangeable. Note that in `<bookmark>` the attribute is `mark` instead of `name`.
When processing the WebSocket messages, the marker information comes as a JSON message like `{"marks": [{"mark": "EOS", "offset": 3.32}]}`, meaning that the marker “EOS” appeared at 3.32 seconds in the audio stream.
<mvoice:background_audio>
Adds a background audio track to the synthesized audio at a specific location and for a specific length. Several attributes allow fine-tuning the result.
The element cannot be nested, but there are no restrictions on the foreground audio content, i.e. it can span any SSML content except for another `<mvoice:background_audio>`.
Example:
```xml
<mvoice:background_audio
    src="https://mvoice-tests.s3.eu-central-1.amazonaws.com/test-data/song.mp3"
    volume="-12dB" duck="4" loop="on" fade_in="1s" fade_out="1.5s"
    clip_begin="40s" clip_end="999s">
  This text will be spoken with the above audio track playing in a loop in the background
  through the length of this speech.
</mvoice:background_audio>
```
Attribute | Description
---|---
`src` | Background audio URL.
`volume` | Audio volume of the background track in relative dB: `"0dB"` means no change.
`duck` | Strength of the “ducker” effect (level of sidechain compression): `"1"` means no compression, higher values (max. 20) compress more.
`loop` | Whether the background audio should be played in a loop if it is shorter than the foreground: `"on"` or `"off"`.
`clip_begin` | Start playing N seconds into the audio, `"Ns"`.
`clip_end` | Stop playing when reaching N seconds into the audio, `"Ns"`.
`fade_in` | Apply fade-in to the background audio during the first N seconds, `"Ns"`.
`fade_out` | Apply fade-out to the background audio during the last N seconds, `"Ns"`.
`mvoice:effects` | Optional audio effects chain, e.g. `"gain -2 highpass -1 120"`; see Audio effects below.
`loud_norm` | Optional loudness normalization specification string, e.g. `"I=-23:TP=-1:LRA=7"`; see Loudness normalization below.
- As with the `<audio>` element, the audio file must be accessible by mVoice via a public URL.
- `mvoice:effects` are applied before loudness normalization.

<par>
Adds a parallel media container element, which allows playing multiple media simultaneously.
`<par>` can only contain a set of `<media>` elements. Each `<media>` element defines the position of its audio relative to the beginning of the `<par>` element or relative to another `<media>` element within this `<par>`. The default position of all `<media>` elements is zero, i.e. the beginning of the `<par>`. When a `<par>` element is rendered, we first render the audio of the individual `<media>` elements, then apply audio effects on them, then figure out the absolute locations of the `<media>` elements, and finally down-mix (add) the audio tracks together. The length of the resulting audio is determined by the longest track.
Note that negative media positions are not considered errors: the relevant audio will simply be trimmed (when it begins before 0) or removed altogether (when it ends before or at 0).
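For instance, in the following sketch (the URL is illustrative) the first second of the “early” media is trimmed off:

```xml
<par>
  <!-- Begins 1 s before the <par> start, so its first second is cut off. -->
  <media xml:id="early" begin="-1s">
    <audio src="https://example.com/jingle.mp3">ding</audio>
  </media>
  <media xml:id="speech">Foreground speech.</media>
</par>
```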
Example (contents of `<media>` omitted for brevity):

```xml
<par>
  <media xml:id="speech" begin="400ms"></media>
  <media xml:id="ding" begin="speech.end-2s"></media>
  <media xml:id="speech2" end="ding.begin+3.5s"></media>
  <media xml:id="background_sound"></media>
</par>
```
The above will play a “background_sound”, then after 400ms it will add “speech” media, then 2 seconds before the “speech” ends it will add a “ding” media, and finally it will add a “speech2” media at a position computed so that it ends exactly 3.5 seconds after “ding” started.
There are no attributes in `<par>`.
<media>
Defines a media element within a `<par>` element. Its attributes define the position within the `<par>` element and allow applying audio effects similar to those in the `<mvoice:background_audio>` element. The content of the `<media>` element is not constrained; it can be any other element like `<speak>`, `<par>`, `<audio>`, a text etc., i.e. full recursion is supported. All audio effects are applied on the completely rendered audio content of the `<media>` element.
Example:
```xml
<media xml:id="media1" begin="0.03s">
  <speak>Example text.</speak>
</media>
<media xml:id="media2" begin="0.03s" trim_begin="0.35s" trim_end="0.25s" fade_in="0.5s" fade_out="1s" volume="3dB" repeat_count="9" duration_limit="5s">
  This is a longer example text.
</media>
```
In the above, “media2” is rendered by the following sequence of operations:

1. Synthesize audio from the text.
2. Trim the synthesized audio by the initial 0.35 s and the trailing 0.25 s.
3. Adjust the volume by +3 dB.
4. Apply the fade-in and fade-out effects.
5. Repeat the resulting clip 9 times.
6. If the audio is longer than 5 seconds, trim it to 5 seconds.
Attribute | Description
---|---
`xml:id` | A unique identifier within `<par>`.
`begin` | Begin time position of the media within `<par>`, see below.
`end` | End time position of the media within `<par>`, see below.
`trim_begin` | Trim off the initial S seconds of audio, `"Ss"`.
`trim_end` | Trim off the trailing S seconds of audio, `"Ss"`.
`fade_in` | Apply fade-in during the first S seconds, `"Ss"`.
`fade_out` | Apply fade-out during the last S seconds, `"Ss"`.
`volume` | Adjust the sound volume by S dB relative, `"SdB"`; 0 means no change.
`repeat_count` | Repeat the audio N times, `"N"`. The result can be limited by `duration_limit`.
`duration_limit` | Limit the length of the audio (after applying the previous effects) to at most S seconds, `"Ss"`.
`mvoice:effects` | Apply an audio effects chain, e.g. `"gain -2 highpass -1 120"`; see Audio effects below.
`loud_norm` | Loudness normalization specification, e.g. `"I=-23:TP=-1:LRA=7"`; see Loudness normalization below.
All attributes are optional. `S` is a float number, `N` is an integer.
Audio effects are applied in the order shown in the table above, i.e. trimming first, then volume & fading, then repeat, then duration limit, then `mvoice:effects`, and finally loudness normalization.
Only one of the `begin` and `end` attributes may be specified. Its value defines either an absolute position within the `<par>` or a relative position w.r.t. another media in the same `<par>`. When a relative position is used, the referenced media must have a valid `xml:id` attribute.

- Absolute position: `+-(float)(h|min|s|ms)`, for example `300ms` or `-2.4s`, measured from the beginning of the `<par>` element.
- Relative position: `(xml:id).(begin|end)(+-float)(h|min|s|ms)`, for example `speech1.end+2.0s` or `xx.end-400ms`, where `xml:id` identifies another media in the same `<par>`.
Note that `<par>` and `<media>` elements inherit the TTS state from the upstream element. For example, when you select a voice & set the prosody rate to 110% and then insert a `<par>` element, then any text within the par will speak by default at 110% rate.
Example:
```xml
<speak>
  <prosody pitch="+5st" rate="110%">
    <par>
      <!-- Nazdar Karle speaks at +5st, 110%. -->
      <media>Nazdar Karle.</media>
    </par>
  </prosody>
</speak>
```
Loudness normalization

Loudness normalization follows the EBU R 128 standard.
The loudness normalization specification string consists of three parameters separated by a colon. At least one parameter must be provided:

- `I` - target normalization level in LUFS, defaults to -23 (float)
- `TP` - True Peak level in LU, defaults to -1 (float)
- `LRA` - Loudness Range in LU, defaults to 7 (float)

Allowed values follow the ffmpeg standard.
Example: `I=-20:TP=-2` normalizes to -20 LUFS with a maximum true peak level of -2 LU.
Audio effects

Audio effects are defined via the `mvoice:effects` attribute. Its value follows the sox effects syntax (see also the “soxeffect” man page).
mVoice implements a subset of sox effects: highpass, lowpass, equalizer, treble, bass, gain, compand, pan (this may change in the future).
Effects chaining is allowed.
Note that the `pan` effect, with a float argument between -1 (all to the left) and +1 (all to the right), works only with a stereo output audio.
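For instance, a sketch panning an external audio fully to the left channel (the URL is illustrative; requires stereo output):

```xml
<audio src="https://example.com/jingle.mp3" mvoice:effects="pan -1">ding</audio>
```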
Example of a complete SSML applying effects on a background audio, on an external audio, and on the contents of a `<media>` element:

```xml
<speak>
  <par>
    <media xml:id="all">
      First sentence.
      <mvoice:background_audio src="https://example.com/background.mp3"
          mvoice:effects="highpass -2 200">
        Let's play a jingle.
        <audio src="https://example.com/jingle.mp3"
            mvoice:effects="compand 0.002,0.002 -60:-5,0,0 10 -60 0.002">
          ding ding
        </audio>
      </mvoice:background_audio>
    </media>
    <media xml:id="compressed" begin="all.end+0s"
        mvoice:effects="lowpass -2 3000 compand 0.002,0.002 -60,-10,0,-5 -5 -60 0.002">
      This sentence was heavily processed.
    </media>
  </par>
  Final words.
</speak>
```
Pronunciation Lexicon

mVoice supports the Pronunciation Lexicon Specification v 1.0, which is enabled using the `<lexicon>` SSML element as described in the section `<lexicon>` above.
mVoice accepts all standard PLS files. An example file follows:
```xml
<lexicon version="1.0"
    xml:lang="cs-CZ"
    alphabet="native">
  <!-- xml:lang must match the language of the voice currently in use. -->
  <!-- Alphabet must be supported by mVoice, for example "universal" is close to IPA. -->
  <lexeme>
    <grapheme>ICQ</grapheme>
    <!-- Replace "ICQ" with "aj sík jů". -->
    <alias>aj sík jů</alias>
  </lexeme>
  <lexeme>
    <grapheme>Eliska</grapheme>
    <grapheme>Karel</grapheme>
    <phoneme alphabet="universal">ɛː l i ʃ k a</phoneme>
    <!-- Both "Eliska" and "Karel" are replaced with the phoneme sequence "j aa g u sh k a". -->
    <phoneme prefer="true">j aa g u sh k a</phoneme>
  </lexeme>
</lexicon>
```
XML attributes:

- `<lexicon>` - the `version`, `xml:lang` and `alphabet` attributes are required
- `<lexeme>` - no attributes
- `<grapheme>` - no attributes
- `<alias>` - optional `prefer` attribute
- `<phoneme>` - optional `prefer`, `alphabet` attributes
Upon loading the lexicon, we convert a possibly many-to-many mapping between graphemes and either an alias or a phoneme into a one-to-one mapping. By this, every grapheme is transformed either to an alias text or to a phoneme sequence. The selection algorithm follows the W3C standard:

- Use the first projection marked with `prefer="true"` as the replacement.
- When there is no `prefer` flag, use the first projection in the document order.

This produces a map from a string to an object (an alias or a phoneme sequence).
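For illustration, a hypothetical lexeme showing how the selection rules apply:

```xml
<lexeme>
  <grapheme>TTS</grapheme>
  <!-- Without any prefer flag, the first alias would win; here the preferred one does. -->
  <alias>tee tee es</alias>
  <alias prefer="true">text to speech</alias>
</lexeme>
```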
None of (grapheme, phoneme, alias) may be empty. Any boundary whitespace is stripped off. In addition:

- `<grapheme>` must not begin or end with a separator character: `<grapheme>~ahoj</grapheme>` is an error. An exception is a trailing period, as in abbreviations. Separators inside a grapheme are allowed: `<grapheme>ABC- DEF</grapheme>` is ok.
- `<alias>` must not contain separator characters: `<alias>do-re-mi</alias>` is an error.
- `<phoneme>` must contain a valid phoneme sequence (in the same format as in the `<phonemes>` SSML element).
SSML element).Note that Lexicon is applied on standardized text. That is,
all whitespace is replaced with a single space ’ ’ character and some
other special characters are standardized, for example quotes
„“«
all become "
. Graphemes from the lexicon
are standardized, too, so there is no need to standardize lexicon
entries by the user. However, in rare cases, this can result in the
occurrence of collapsing entries that, upon standardization, become
indistinguishable.
Lexicon is case-sensitive.
Pattern matching of the lexicon against the standardized text cannot be done by simply finding graphemes, because a grapheme `but` would match a part of the word `butter`. It works as follows:

- Graphemes match only whole words, delimited by whitespace or the separator characters `!?.,:;()+-/" @`. For example, the grapheme `ahoj` will match 5 times in the text `Blabla ahoj ahoj@me.com (ahoj); "ahoj", ahoj!` and it will not match at all in `Ahoj ~ahoj, nahoj, #ahoj.`
- Longer graphemes take precedence. For example, a lexicon with graphemes [`1`, `1 2`, `2 3 4`] applied to the text `1 2 3 4 5` will match `2 3 4` and then `1`.
- Aliases are inserted as-is and are not further expanded by text normalization, i.e. `U^2` can be aliased to `you up two` but not to `you up 2`.