Web Speech API integration

This page describes platform-specific JSSCxml extensions related to speech interaction in supported Web browsers. These extensions are not intended to be interoperable with other SCXML implementations. JSSCxml currently supports only speech synthesis out of the box.

The <speak> element

The <speak> element can be used wherever executable content is allowed. It connects SCXML to the Web Speech Synthesis API newly implemented in several Web browsers, and exposes most of its functionality.

Datamodel fields

In the datamodel, _x.voices is the list of voices supported by the platform, as returned by the SpeechSynthesis.getVoices() method. Items of that list are not voiceURIs, but the full SpeechSynthesisVoice objects with a voiceURI property.
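
For example, an ECMAScript expression can pick a particular voice out of that list and store it for later use (a minimal sketch; the frVoice location is illustrative):

<data id="frVoice"/>
…
<!-- keep the first French voice, if the platform offers one -->
<assign location="frVoice"
        expr="_x.voices.filter(function(v){ return v.lang === 'fr-FR'; })[0]"/>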

Namespace

The namespace for <speak> must be "http://www.jsscxml.org", for which I suggest the shorthand "jssc". Thus:

<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:jssc="http://www.jsscxml.org">
…
<jssc:speak text="Hello world!" xml:lang="en-US"… />
…

JSSCxml does not use the SSML namespace because the element defines several new attributes, and because inline SSML content, while good-looking, would be completely inflexible. Instead, SSML documents can be created (or parsed from actual SSML content in a <data> element) and manipulated by ECMAScript code, then finally passed to the <speak> element.
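
For example, SSML content stored in a <data> element can be handed to <speak> through its expr attribute (a sketch, assuming the <data> content is exposed to the datamodel as an XML Document):

<data id="greeting">
	<!-- note: SSML's root element is also named speak -->
	<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
	       xml:lang="en-US">
		Hello, <emphasis>world</emphasis>!
	</speak>
</data>
…
<jssc:speak expr="greeting"/>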

Attribute detail

text
	Required: yes* (no more than one of text and expr may appear)
	Default value: none
	Valid values: text with optional SSML tags
	Description: the text or SSML that will be spoken

expr
	Required: yes* (no more than one of text and expr may appear)
	Default value: none
	Valid values: an expression evaluating to a string or an SSML Document
	Description: evaluated when the <speak> element is executed, and used as if there had been a text attribute with the resulting (linearized) value

xml:lang
	Required: no (at most one of xml:lang and langexpr may appear)
	Default value: platform-specific
	Valid values: any RFC 3066 language code supported by the platform
	Description: the language of the text to be read

langexpr
	Required: no (at most one of xml:lang and langexpr may appear)
	Default value: none
	Valid values: an expression evaluating to a language code
	Description: evaluated when the <speak> element is executed, and used as if there had been an xml:lang attribute with the resulting value

voice
	Required: no
	Default value: platform-specific
	Valid values: an expression that evaluates to a member of _x.voices
	Description: the voice used to read the text

volume
	Required: no
	Default value: 1
	Valid values: 0 – 1
	Description: how loud the text will be spoken

rate
	Required: no
	Default value: 1
	Valid values: 0.1 – 10
	Description: how fast the text will be spoken

pitch
	Required: no
	Default value: 1
	Valid values: 0 – 2
	Description: pitch modifier for the synthesized voice

interrupt / nomore
	Required: no*
	Default value: false
	Valid values: boolean
	Description: stops speaking and cancels queued utterances

* The boolean nomore (or interrupt) may appear alone in the <speak> tag.

Note that the DOMParser API used by JSSCxml to parse SCXML documents may reject documents in which boolean attributes are written without a value. It is therefore advised to give them the value "true" (although the interpreter will accept any value at all, as long as the XML is well-formed).
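
For example, in the recommended explicit form (interrupt being an alias for nomore):

<!-- stop speaking and discard any queued utterances -->
<jssc:speak nomore="true"/>

<!-- cancel current speech, then say this instead -->
<jssc:speak interrupt="true" text="Never mind."/>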

Children

None.

Behavior

When executed, the <speak> element causes its text or SSML content to be read by the platform's SpeechSynthesis implementation, using the supplied parameters. When the synthesizer has reached the end of the utterance, a speak.end event will be placed in the external queue, with its origin field holding a reference to the underlying SpeechSynthesisUtterance object (see speak events below).
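
For example, a state can start speaking on entry and move on once the utterance has been read out (state names here are illustrative):

<state id="greeting">
	<onentry>
		<jssc:speak text="Welcome!"/>
	</onentry>
	<!-- speak.end arrives when the synthesizer finishes the utterance -->
	<transition event="speak.end" target="main"/>
</state>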

If the nomore or interrupt attribute is present, current and queued utterances will be cancelled first, so the new utterance (if supplied) will be spoken immediately, no matter what. The interpreter will place a speak.error event in the external queue for each cancelled utterance.

At the time of writing, the Chrome and Safari implementations disagree on how a voice should be selected. Chrome's utterance objects have a voiceURI property, which can be set to the voiceURI value of a voice, whereas Safari's utterance objects have a voice property, which accepts only references to whole SpeechSynthesisVoice objects. To hide this discrepancy from authors, the voice attribute defined here always takes a reference, and JSSCxml ensures that each browser gets what it expects.

If no voice is specified, the xml:lang attribute will cause the platform to choose the default voice for that language if one is available, or failing that, for another regional variant of that language. An xml:lang attribute defined higher in the document hierarchy (typically on the root element) is inherited by <speak> elements, so there is no need to repeat it on every element.
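
For example, relying on that inheritance, a document-wide language can be set once on the root element (a sketch):

<scxml xmlns="http://www.w3.org/2005/07/scxml"
       xmlns:jssc="http://www.jsscxml.org" xml:lang="fr-FR">
…
	<!-- inherits fr-FR; the platform picks a matching default voice -->
	<jssc:speak text="Bonjour tout le monde !"/>
…
</scxml>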

speak events

speak.* events queued when using speech synthesis have the DOM Event origintype, but their origin is the corresponding SpeechSynthesisUtterance object rather than a DOM node. There is no reason to <send> any event back to those objects (and the interpreter will not accept them as a valid target anyway), but their text property lets you track which utterance has started, ended, or been cancelled.

The event's data will contain the elapsedTime, charIndex, and name properties of the original DOM event instead of a copy of the event itself, as would be the case for DOM events converted in the usual way by JSSCxml.
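
For example, a transition can use those fields to report on a finished utterance (a minimal sketch):

<transition event="speak.end">
	<!-- origin is the utterance; data carries the DOM event's properties -->
	<log label="speech"
	     expr="'done: ' + _event.origin.text
	           + ' (elapsedTime: ' + _event.data.elapsedTime + ')'"/>
</transition>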