I want to add good Text-To-Speech (TTS) support to a snap. The best path forward seems to be to use the Speech Dispatcher server of the host system. However, this is not supported in snapd.
What is it?
Speech Dispatcher is a device-independent layer for speech synthesis that provides a common, easy-to-use interface for both client applications (programs that want to speak) and software synthesizers (programs actually able to convert text to speech).
Speech Dispatcher has a client-server architecture:
libspeechd is the client library. spd-say is a CLI client tool that uses libspeechd.
speech-dispatcher is the server.
Clients communicate with the server using the SSIP protocol over /run/user/$UID/speech-dispatcher/speechd.sock. If the server is not running, the client autospawns it by executing the binary /usr/bin/speech-dispatcher.
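For a feel of what travels over that socket: SSIP is a simple line-based text protocol. The sketch below builds the default socket path and a minimal "speak this" request; the client name and framing are illustrative, not a libspeechd replacement.

```python
def speechd_socket_path(uid):
    """Default per-user Speech Dispatcher socket path."""
    return "/run/user/%d/speech-dispatcher/speechd.sock" % uid

def ssip_speak_request(text):
    """Build a minimal SSIP request asking the server to speak `text`.

    SSIP commands are CRLF-terminated lines; the body of a SPEAK
    command is terminated by a line containing a single dot.
    """
    lines = [
        b"SET self CLIENT_NAME user:demo:main",  # identify the client (name is made up)
        b"SPEAK",                                # begin a message
        text.encode("utf-8"),                    # the text to be spoken
        b".",                                    # end-of-message marker
    ]
    return b"\r\n".join(lines) + b"\r\n"
```

Sending these bytes over a connected AF_UNIX socket at that path (and reading the numeric status replies) is essentially what spd-say does through libspeechd.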
When multiple snaps use TTS at the same time without a shared server, the output will not be properly demultiplexed. speech-dispatcher has internal queues and synchronises messages to make sure everything stays understandable.
speech-dispatcher supports a large number of synthesisers, languages, voices, etc. It's hard to know beforehand which combination of language and synthesiser the user wants, and snaps running their own server would not respect the user's TTS preferences.
What needs to be fixed
Snaps currently search for the socket in /run/user/$UID/snap.$SNAP_NAME/speech-dispatcher/speechd.sock.
However, the actual socket is /run/user/$UID/speech-dispatcher/speechd.sock.
You can work around this using export SPEECHD_ADDRESS=unix_socket:/run/user/1000/speech-dispatcher/speechd.sock. However, I'm not sure whether all applications respect this environment variable.
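For context, the SPEECHD_ADDRESS value is a connection method followed by a colon and its parameter. A rough sketch of how a client could resolve it, with the fallback path taken from the defaults above (the exact precedence in libspeechd may differ):

```python
import os

def resolve_speechd_address(uid, env=None):
    """Return (method, target) for the Speech Dispatcher connection.

    Honours SPEECHD_ADDRESS if set (e.g. "unix_socket:/run/user/1000/..."),
    otherwise falls back to the default per-user socket path.
    """
    env = os.environ if env is None else env
    address = env.get("SPEECHD_ADDRESS")
    if address:
        method, _, target = address.partition(":")
        return method, target
    return "unix_socket", "/run/user/%d/speech-dispatcher/speechd.sock" % uid
```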
Access to /run/user/$UID/speech-dispatcher/speechd.sock currently only works in devmode. I think the best path forward is a new interface. These permissions could be added to audio-playback, but that would increase its attack surface.
The /usr/bin/speech-dispatcher command needs to start the server on the host. Could snapd provide a shim in that location?
My questions
Is something like this on the roadmap? I saw that @oSoMoN looked at this a while ago.
I'm not entirely sure why libspeechd inside of a snap prefixes the socket name. Does anyone know why unix socket paths are prefixed inside of the snap?
Is the speech-dispatcher shim a good approach? Is there already an interface that provides a shim binary which instructs snapd to do something outside of the sandbox?
Sorry for crossposting, but since this is a great summary of the problem of combining snaps with TTS/speechd, here's how I was able to solve/work around the problem:
What if you just built a single speech-dispatcher snap that other snaps could connect to, like network-manager, bluez, etc.? It would also need work to run strictly confined, but from @lotharsm's work on scummvm, it seems that speech-dispatcher can be made to work from inside strict confinement.
I also think that's possible, but I think it's difficult to do well because speech-dispatcher is so customisable. As an example: it supports the following synthesisers: Festival, Espeak, Flite, Pico, and a "generic" module, which you can use to connect speech-dispatcher to any CLI-based synthesiser. Each synthesiser supports many different languages, and many different voices.
The first issue is space. If we only look at the Espeak Mbrola languages and Festival:
$ apt-cache show mbrola-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
661010
$ apt-cache show festival-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
40193
$ apt-cache show festvox-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
394100
$ apt-cache show festlex-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
50701
That's more than 1 GB of uncompressed data. Putting all of this in one snap isn't a great user experience, and I'm assuming it would be installed on almost every system, given that we want TTS to work out of the box for accessibility reasons. We could, however, combine a greedy plug declaration with a snap for every language. Users can then install the languages and synthesisers they want to use (similar to the theme snap[s] proposal).
The second issue is config/setup duplication. I don't know how to solve this one. Ideally, the user should only have to set up their TTS preferences once. I'm not sure how to do that with a Speech Dispatcher snap. It seems that, unless we use the Speech Dispatcher from the host system, the user will have to configure it both on their host system and in the Speech Dispatcher snap. Any languages and synthesisers used will also have to be installed by both the package manager of the host system and snap.
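To illustrate the per-language-snap idea, the consuming snap could declare a content-interface plug like the following sketch. All names here are invented, and the "greedy" auto-connection to any matching voice snap would additionally need a store-side plug declaration:

```yaml
# snapcraft.yaml fragment (hypothetical names)
plugs:
  tts-voices:
    interface: content
    content: tts-voices
    target: $SNAP/voices                 # voice data from provider snaps appears here
    default-provider: espeak-voices-en   # a hypothetical per-language voice snap
```

Users would then install only the voice snaps for the languages they actually use.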
Yes I think the interface system would be a great way to solve this problem.
The main reason I ask is that if we can get this to work well confined as a snap, Ubuntu Desktop as a whole would benefit: it's less maintenance for the team to fix bugs in older Ubuntu releases if those releases consume system components like this as snaps. I don't know of any effort underway by the Ubuntu Desktop team to confine Speech Dispatcher specifically, but one area where they could perhaps help is figuring out how to integrate Speech Dispatcher into pre-existing configuration GUIs such as Settings / GNOME Control Center. I think it's a solvable problem, but I'm not on the Ubuntu Desktop team and am barely aware of how this works, so this is just my humble opinion as an observer. Perhaps @jamesh or @kenvandine could speak better on this configuration integration.
We donât have anything specifically tied to text to speech.
The usual way to make a desktop application usable to blind people is via the accessibility framework (libatk, at-spi, and tools that plug in to the framework):
You will need to plug the desktop-legacy interface to allow your app to talk to at-spi. It's not part of desktop because the protocol is not particularly confinement-friendly: it doesn't really distinguish between applications that provide information about themselves and tools that consume that information and control the applications. This should allow a text-to-speech a11y tool running on the host to read the confined app's GUI as the user navigates it.
If your application is built out of standard widgets, that might be all that's necessary. You might be able to improve things by adding descriptions to some widgets, but most things should just work. If you've written custom widgets, things will be a bit more difficult. The above guide has some pointers on where to start, though.
Thanks for the info! I didn't know about at-spi.
So if I understand it correctly, at-spi is specifically tailored towards a11y with a screen reader. On Ubuntu, the Orca screen reader is at the other end of the bus; Orca decides what to read and then uses Speech Dispatcher to actually read it.
The app I'm working on is Foliate, an e-book reader which uses Speech Dispatcher directly to read books aloud, not only for a11y reasons but mainly for people who want to listen to a book. This requires direct access to Speech Dispatcher.
@jamesh what do you think is the best way forward to allow snaps to talk to Speech Dispatcher directly? Should I look into creating an interface which connects to Speech Dispatcher on the host system, or is there interest in the desktop team to maintain a Speech Dispatcher snap?
From my perspective, creating an interface which allows access to Speech Dispatcher on the host system is preferred, because then you don't run into synchronization issues between the server in the snap and the server on the host system.
As is, I don't think you'll be able to reliably communicate with the host's speech-dispatcher process from confinement, even with an interface granting access. Looking at the client library, we have the following code:
In short, it's doing:
1. try to connect to speech-dispatcher's unix domain socket; if that succeeds, we're done.
2. if that fails, fork and exec a copy of speech-dispatcher.
3. try to connect to speech-dispatcher's unix domain socket a second time.
It is likely that the first connect will fail, since users who don't rely on screen readers won't be running anything that tries to start it.
This autospawn strategy would result in a copy of speech-dispatcher running in your snap's sandbox, which is what you were trying to avoid. We also wouldn't want host system apps or other snaps to talk to such an instance, due to privacy/security concerns.
The modern way to handle this kind of "start on demand" behaviour is systemd socket activation, so the daemon's socket is always available but the daemon only starts on the first attempt to connect. That doesn't appear to be supported by any version of speech-dispatcher though, so it isn't something we can rely on.
For now, the best option is probably to pursue the "private copy of speech-dispatcher" strategy rather than trying to talk to the host system's version.
And following up, it doesn't look like speech-dispatcher offers much isolation between clients. It looks like it allows any client to retrieve the text any other client has asked it to speak since the daemon started:
On a blind person's desktop this could include every keystroke the user has typed for however long the history log is, assuming key echo is enabled.
This is similar to the design issue with at-spi: mixing the control/management protocol in with the regular client protocol. This was fine when all of a user's apps were considered to run within a single security domain, but it breaks down when you have untrusted confined applications.
These aren't insurmountable problems, but it would probably be better to work on them from the speech-dispatcher side. Socket activation would probably be fairly easy to implement, and would be transparent to existing clients.
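To make the socket-activation idea concrete, here is a hypothetical pair of systemd user units. speech-dispatcher does not ship anything like this today; the daemon would need to learn to accept an inherited listening socket (e.g. via sd_listen_fds()) and to run in the foreground:

```ini
# speech-dispatcher.socket -- hypothetical user unit
[Unit]
Description=Speech Dispatcher activation socket

[Socket]
# %t expands to $XDG_RUNTIME_DIR (/run/user/$UID) for user units
ListenStream=%t/speech-dispatcher/speechd.sock

[Install]
WantedBy=sockets.target

# speech-dispatcher.service -- started on the first client connection
[Unit]
Description=Speech Dispatcher
Requires=speech-dispatcher.socket

[Service]
# assumes a foreground mode, which would need to be added upstream
ExecStart=/usr/bin/speech-dispatcher
```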
Adding a security policy suitable for untrusted clients (don't leak the existence of other clients, block history access or limit it to the client's own history, etc.) is probably a bigger job.
Thanks for the detailed reply @jamesh! I created an issue in the Speech Dispatcher repo about the security policy: https://github.com/brailcom/speechd/issues/335. I'm interested to see what the maintainer thinks of this.
One thing that's not yet clear to me: once socket activation is implemented, how will snapd handle distributions that have an older version of Speech Dispatcher?