Not only can Rhapsode read pages aloud to you via eSpeak NG and its own CSS engine, but now you can speak aloud to it via Voice2JSON! All without trusting or relying upon any internet services, except of course for bog-standard webservers to download your requested information from. Thereby completing my vision for Rhapsode’s reading experience!
This speech recognition can be triggered either using the space key or by calling Rhapsode’s name (okay, by saying “Hey Mycroft”, because I haven’t bothered to train it).
Voice2JSON is exactly what I want from a speech-to-text engine!
Across its 4 backends (CMU PocketSphinx, Dan Povey’s Kaldi, Mozilla DeepSpeech, & Kyoto University’s Julius) it supports 18 human languages! I always like to see more language support, but this is impressive.
I can feed it (lightly-preprocessed) whatever random phrases I find in link elements, etc to use as voice commands. Even feeding it different commands for every webpage, including unusual words.
It operates entirely on your device, only using the internet initially to download an appropriate profile for your language.
And when I implement webforms its slots feature will be invaluable.
The only gotcha is that I needed to also add a JSON parser to Rhapsode’s dependencies.
To operate Voice2JSON you rerun voice2json train-profile every time you edit sentences.ini or any of its referenced files to update the list of supported voice commands. This prepares a language model to guide the output of voice2json transcribe-stream or transcribe-wav, whose output you’ll probably pipe into voice2json recognize-intent to determine which intent from sentences.ini it matches.
If you want this voice recognition to be triggered by some wake word, run voice2json wait-wake to determine when that keyphrase has been said.
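The chain of commands described above could be driven from a script roughly like this. It is only a sketch: it uses just the subcommand names mentioned in this post with no extra flags, it assumes transcribe-wav reads a WAV from standard input, and the helper names (chain, recognize) are made up for illustration.

```python
import subprocess

# Subcommands named in the text above; no extra flags are assumed.
TRAIN_PROFILE = ["voice2json", "train-profile"]
TRANSCRIBE_WAV = ["voice2json", "transcribe-wav"]
RECOGNIZE_INTENT = ["voice2json", "recognize-intent"]


def chain(commands, data=b""):
    """Run each command in turn, feeding one's stdout to the next's stdin."""
    for command in commands:
        data = subprocess.run(
            command, input=data, stdout=subprocess.PIPE, check=True
        ).stdout
    return data


def recognize(wav_bytes):
    """Hypothetical helper: WAV bytes in, JSON Lines bytes out.

    Assumes transcribe-wav accepts the WAV on standard input and that
    train-profile has already been run for the current sentences.ini.
    """
    return chain([TRANSCRIBE_WAV, RECOGNIZE_INTENT], wav_bytes)
```

Running chain([TRAIN_PROFILE]) after each edit to sentences.ini and before calling recognize would mirror the order of steps given above.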
voice2json train-profile
For every page Rhapsode outputs a sentences.ini file & runs voice2json train-profile to compile this mix of INI & Java Speech Grammar Format syntax into an appropriate n-gram-based language model for the backend chosen by the downloaded profile.
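To give a feel for what generating a per-page sentences.ini might involve, here is a minimal sketch. It is not Rhapsode’s actual code: the LinkN intent names, the (label, url) input shape, and the exact normalization are all assumptions. It relies only on the basic INI layout of an intent name in square brackets followed by one sentence per line, and strips punctuation so that link text cannot be mistaken for grammar syntax.

```python
import re


def normalize(label):
    """Lowercase a link label and keep only letters, digits, and spaces."""
    return " ".join(re.sub(r"[^\w\s]|_", " ", label.lower()).split())


def sentences_ini(links):
    """Build sentences.ini text from (label, url) pairs.

    Each link gets its own intent section, named Link0, Link1, ... (a
    hypothetical naming scheme), holding the spoken form of its label.
    """
    sections = []
    for index, (label, _url) in enumerate(links):
        phrase = normalize(label)
        if phrase:  # skip links with no speakable text
            sections.append(f"[Link{index}]\n{phrase}\n")
    return "\n".join(sections)


print(sentences_ini([("Read More...", "/more"), ("About Us", "/about")]))
```

Keeping the link’s index in the intent name means the recognized intent can be mapped straight back to the URL to follow.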
Once it’s parsed sentences.ini, Voice2JSON optionally normalizes the sentence casing and lowers any numeric ranges, slot references from external files or programs, & numeric digits via num2words before reformatting it into a NetworkX graph with weighted edges. This resulting Nondeterministic Finite Automaton (NFA) is saved & gzip’d to the profile before lowering it further to an OpenFST graph which, with a handful of opengrm commands, is converted into an appropriate language model.
Whilst lowering the NFA to a language model Voice2JSON looks up how to pronounce every unique word in that NFA, consulting Phonetisaurus for any words the profile doesn’t know about. Phonetisaurus in turn evaluates the word over a Hidden Markov n-gram model.
voice2json transcribe-stream
voice2json transcribe-stream pipes 16-bit 16kHz mono WAVs from a specified file or profile-configured record command (defaults to ALSA) to the backend & formats its output sentences with metadata inside JSON Lines objects. To determine when a voice command ends it uses some sophisticated code extracted from the WebRTC implementation (from Google).
That 16kHz audio sampling rate is interesting: it’s far below the 44.1kHz sampling rate typical for digital audio. Presumably this reduces the computational load whilst preserving the frequencies (max 8kHz per Nyquist-Shannon) typical of human speech.
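The arithmetic behind that guess is short enough to spell out. This sketch compares the 16-bit 16kHz mono format against CD-style audio (44.1kHz, 16-bit, stereo), which is my choice of comparison rather than anything Voice2JSON mentions.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM data rate for uncompressed audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


speech = bytes_per_second(16_000, 16, 1)  # the format Voice2JSON consumes
cd = bytes_per_second(44_100, 16, 2)      # CD-style stereo, for comparison

print(nyquist_hz(16_000))  # 8000.0 Hz ceiling
print(speech, cd)          # 32000 vs 176400 bytes per second
```

So the speech format carries under a fifth of the raw data of CD-style audio.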
voice2json recognize-intent
To match this output to the grammar defined in sentences.ini Voice2JSON provides the voice2json recognize-intent command. This reads back in the compressed NetworkX NFA to find the best path, fuzzily or not, via depth-first-search which matches each input sentence. Once it has that path it iterates over it to resolve & capture:
The resulting information from each of these passes is gathered & output as JSON Lines.
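To illustrate the path-finding step, here is a much simplified sketch of a depth-first search over a word graph. It is not Voice2JSON’s implementation: the graph is a plain dict rather than a NetworkX graph, the edge layout of (word, next state, cost) is an assumption, only strict matching is shown, and it assumes the graph has no cycles of empty-word edges.

```python
def best_path(graph, start, finals, tokens):
    """Depth-first search for the cheapest path spelling out `tokens`.

    `graph` maps a state to a list of (word, next_state, cost) edges; an
    edge whose word is "" consumes no input. Returns (cost, states) for
    the cheapest full match, or None when nothing matches.
    """
    best = None
    stack = [(start, 0, 0.0, [start])]  # state, tokens used, cost, path
    while stack:
        state, used, cost, path = stack.pop()
        if used == len(tokens) and state in finals:
            if best is None or cost < best[0]:
                best = (cost, path)
            continue
        for word, nxt, edge_cost in graph.get(state, []):
            if word == "":
                stack.append((nxt, used, cost + edge_cost, path + [nxt]))
            elif used < len(tokens) and tokens[used] == word:
                stack.append((nxt, used + 1, cost + edge_cost, path + [nxt]))
    return best


# Toy grammar: ("open" | "follow") ["the"] ("news" | "about")
graph = {
    0: [("open", 1, 0.0), ("follow", 1, 0.5)],
    1: [("the", 2, 0.0), ("", 2, 0.0)],
    2: [("news", 3, 0.0), ("about", 3, 0.0)],
}
print(best_path(graph, 0, {3}, ["open", "news"]))
```

A fuzzy variant would additionally let the search skip or substitute tokens at some extra cost, which is why the edges carry weights at all.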
In Rhapsode I apply a further fuzzy match, the same I’ve always used for keyboard input, via Levenshtein Distance.
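For anyone unfamiliar with it, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch of the textbook algorithm follows; the closest helper and the example commands are made up for illustration and are not Rhapsode’s code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def closest(heard, commands):
    """Pick the command with the smallest edit distance to what was heard."""
    return min(commands, key=lambda command: levenshtein(heard, command))


print(closest("abuot us", ["about us", "read more"]))
```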
voice2json wait-wake
To trigger Rhapsode to recognize a voice command you can either press a key (spacebar) or, to stick to pure voice control, say a wakeword (currently “Hey Mycroft”). For this there’s the voice2json wait-wake command.
voice2json wait-wake pipes the same 16-bit 16kHz mono WAV audio as voice2json transcribe-stream into (currently) Mycroft Precise & applies some edge detection to the output probabilities. Mycroft Precise, from the Mycroft open source voice assistant project, is a TensorFlow neural net converting spectrograms (computed via sonopy or legacy speechpy) into probabilities.
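Edge detection here means reacting to the moment the probability rises rather than to every chunk where it happens to be high. One way that could look is sketched below; the threshold and hold values are hypothetical and this is not the logic Voice2JSON or Mycroft Precise actually ships.

```python
def rising_edges(probabilities, threshold=0.5, hold=3):
    """Yield the index at which each activation is confirmed.

    An activation fires once the probability has stayed at or above
    `threshold` for `hold` consecutive chunks, and cannot fire again
    until the probability has dropped back below the threshold.
    """
    run = 0
    armed = True
    for index, probability in enumerate(probabilities):
        if probability >= threshold:
            run += 1
            if armed and run >= hold:
                armed = False
                yield index
        else:
            run = 0
            armed = True


stream = [0.1, 0.9, 0.2, 0.8, 0.9, 0.95, 0.9, 0.1, 0.7, 0.8, 0.9]
print(list(rising_edges(stream)))
```

Requiring a few consecutive high chunks filters out one-off spikes, and re-arming only after a drop stops a single utterance from triggering repeatedly.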
Interpreting audio input into voice commands is a non-trivial task, combining the efforts of many projects. Last I checked Voice2JSON used the following projects to tackle various components of this challenge:
And for the raw text-to-speech logic you can choose between:
Rhapsode’s use of Voice2JSON shows two things.
First the web could be a fantastic auditory experience if only we weren’t so reliant on JavaScript.
Second there is zero reason for Siri, Alexa, Cortana, etc. to offload their computation to the cloud. Voice recognition may not be a trivial task, but even modest consumer hardware is more than capable enough to do a good job at it.