I want to add good Text-To-Speech (TTS) support to a snap. The best path forward seems to be to use the Speech Dispatcher server of the host system. However, this is not supported in snapd.
What is it?
Speech Dispatcher is a device-independent layer for speech synthesis that provides a common, easy-to-use interface for both client applications (programs that want to speak) and software synthesizers (programs actually able to convert text to speech).
Speech-dispatcher has a client-server architecture.
- libspeechd is the client.
- spd-say is a CLI client tool using
- speech-dispatcher is the server.
Clients communicate with the server using the SSIP protocol over /run/user/$UID/speech-dispatcher/speechd.sock. If the server is not running, the client autospawns it using the binary
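To make the socket location concrete, here is a minimal shell sketch that builds the per-user socket path mentioned above and checks whether a socket exists there. The function names `speechd_socket_path` and `speechd_status` are made up for this illustration; they are not part of Speech Dispatcher or snapd. A present socket only suggests that a server is listening, it does not prove it.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Speech Dispatcher or snapd.

# Print the per-user Speech Dispatcher socket path.
# The optional first argument is a numeric UID; it defaults to the current user.
speechd_socket_path() {
    uid="${1:-$(id -u)}"
    printf '/run/user/%s/speech-dispatcher/speechd.sock\n' "$uid"
}

# Print "socket present" if the given path is a Unix socket,
# "socket missing" otherwise.
speechd_status() {
    if [ -S "$1" ]; then
        echo "socket present"
    else
        echo "socket missing"
    fi
}
```

On the host, `speechd_status "$(speechd_socket_path)"` gives a quick hint whether a client would have to autospawn the server first.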
Why not put it in the snap?
Technically, every snap using TTS could ship speech-dispatcher, run it inside of the snap and use the audio-playback interface. This approach has a number of downsides:
- When multiple snaps use TTS at the same time, their output is not coordinated, so several voices can talk over each other. speech-dispatcher has internal queues and syncs messages to make sure everything stays understandable.
- speech-dispatcher supports an incredible number of synthesisers, languages, voices etc. It’s hard to know beforehand which combination of language and synthesiser the user wants, and snaps would not respect the user’s TTS preferences.
What needs to be fixed
- Snaps currently search for the socket in
However, the actual socket is
You can work around this using export SPEECHD_ADDRESS=unix_socket:/run/user/1000/speech-dispatcher/speechd.sock. However, I’m not sure if all applications respect this environment variable.
- Access to /run/user/$UID/speech-dispatcher/speechd.sock. This currently only works in devmode. I think the best path forward is a new interface. These permissions could be added to audio-playback, but that would increase its attack surface.
- The /usr/bin/speech-dispatcher command needs to start the server on the host. Could snapd provide a shim in that location?
- Is something like this on the roadmap? I saw that @oSoMoN looked at this a while ago.
- I’m not entirely sure why libspeechd inside of a snap prefixes the socket name. Does anyone know why Unix socket paths are prefixed inside of the snap?
- Is the speech-dispatcher shim a good approach? Is there already an interface that provides a shim binary which instructs snapd to do something outside of the sandbox?
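For anyone who wants to try the SPEECHD_ADDRESS workaround without hard-coding UID 1000, here is a rough wrapper sketch. The function names `speechd_address` and `with_host_speechd` are invented for this example. It builds the path from `id -u` rather than from XDG_RUNTIME_DIR, on the assumption that the latter may point somewhere else inside the sandbox. It still depends on the client honouring SPEECHD_ADDRESS and on the snap being allowed to reach the socket, which currently means devmode.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print a SPEECHD_ADDRESS value for the given UID (default: current user).
speechd_address() {
    uid="${1:-$(id -u)}"
    printf 'unix_socket:/run/user/%s/speech-dispatcher/speechd.sock\n' "$uid"
}

# Run a command with SPEECHD_ADDRESS pointing at the host socket.
# A value that is already set (and non-empty) is left untouched.
with_host_speechd() {
    : "${SPEECHD_ADDRESS:=$(speechd_address)}"
    export SPEECHD_ADDRESS
    "$@"
}
```

Example use: `with_host_speechd spd-say "hello from the snap"`. In a real snap the same logic could live in a small launcher script that wraps the application command.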