I want to add good Text-To-Speech (TTS) support to a snap. The best path forward seems to be to use the Speech Dispatcher server of the host system. However, this is not supported in snapd.
What is it?
Speech Dispatcher is a device-independent layer for speech synthesis that provides a common, easy-to-use interface for both client applications (programs that want to speak) and software synthesizers (programs actually able to convert text to speech).
Speech Dispatcher has a client-server architecture:
libspeechd is the client library. spd-say is a CLI client tool that uses libspeechd.
speech-dispatcher is the server.
Clients communicate with the server using the SSIP protocol over /run/user/$UID/speech-dispatcher/speechd.sock. If the server is not running, the client autospawns it by executing the binary /usr/bin/speech-dispatcher.
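For a feel of what travels over that socket: SSIP is a simple line-based text protocol. The sketch below builds the default socket path and a minimal "speak this" request; the client name and framing are illustrative, not a libspeechd replacement.

```python
def speechd_socket_path(uid):
    """Default per-user Speech Dispatcher socket path."""
    return "/run/user/%d/speech-dispatcher/speechd.sock" % uid

def ssip_speak_request(text):
    """Build a minimal SSIP request asking the server to speak `text`.

    SSIP commands are CRLF-terminated lines; the body of a SPEAK
    command is terminated by a line containing a single dot.
    """
    lines = [
        b"SET self CLIENT_NAME user:demo:main",  # identify the client (name is made up)
        b"SPEAK",                                # begin a message
        text.encode("utf-8"),                    # the text to be spoken
        b".",                                    # end-of-message marker
    ]
    return b"\r\n".join(lines) + b"\r\n"
```

Sending these bytes over a connected AF_UNIX socket at that path (and reading the numeric status replies) is essentially what spd-say does through libspeechd.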
When multiple snaps use TTS at the same time without a shared server, the output will not be properly demultiplexed. speech-dispatcher has internal queues and synchronises messages to make sure everything stays understandable.
speech-dispatcher supports a large number of synthesisers, languages, voices, etc. It's hard to know beforehand which combination of language and synthesiser the user wants, and snaps running their own server would not respect the user's TTS preferences.
What needs to be fixed
Snaps currently search for the socket in /run/user/$UID/snap.$SNAP_NAME/speech-dispatcher/speechd.sock.
However, the actual socket is /run/user/$UID/speech-dispatcher/speechd.sock.
You can work around this using export SPEECHD_ADDRESS=unix_socket:/run/user/1000/speech-dispatcher/speechd.sock. However, I'm not sure whether all applications respect this environment variable.
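For context, the SPEECHD_ADDRESS value is a connection method followed by a colon and its parameter. A rough sketch of how a client could resolve it, with the fallback path taken from the defaults above (the exact precedence in libspeechd may differ):

```python
import os

def resolve_speechd_address(uid, env=None):
    """Return (method, target) for the Speech Dispatcher connection.

    Honours SPEECHD_ADDRESS if set (e.g. "unix_socket:/run/user/1000/..."),
    otherwise falls back to the default per-user socket path.
    """
    env = os.environ if env is None else env
    address = env.get("SPEECHD_ADDRESS")
    if address:
        method, _, target = address.partition(":")
        return method, target
    return "unix_socket", "/run/user/%d/speech-dispatcher/speechd.sock" % uid
```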
Access to /run/user/$UID/speech-dispatcher/speechd.sock currently only works in devmode. I think the best path forward is a new interface. These permissions could be added to audio-playback, but that would increase its attack surface.
The /usr/bin/speech-dispatcher command needs to start the server on the host. Could snapd provide a shim in that location?
My questions
Is something like this on the roadmap? I saw that @oSoMoN looked at this a while ago.
I'm not entirely sure why libspeechd inside of a snap prefixes the socket name. Does anyone know why unix socket paths are prefixed inside of the snap?
Is the speech-dispatcher shim a good approach? Is there already an interface that provides a shim binary which instructs snapd to do something outside of the sandbox?
Sorry for crossposting, but since this is a great summary of the problem of combining snaps with TTS/speechd, here's how I was able to solve/work around the problem:
What if you just built a single speech-dispatcher snap that other snaps could connect to, like network-manager, bluez, etc.? It would also need work to run strictly confined, but from @lotharsm's work on scummvm, it seems that speech-dispatcher can be made to work from inside strict confinement.
I also think that's possible, but I think it's difficult to do well because speech-dispatcher is so customisable. As an example: it supports the following synthesisers: Festival, Espeak, Flite, Pico, and a "generic" module, which you can use to connect speech-dispatcher to any CLI-based synthesiser. Each synthesiser supports many different languages, and many different voices.
The first issue is space. If we only look at the Espeak Mbrola languages and Festival:
$ apt-cache show mbrola-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
661010
$ apt-cache show festival-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
40193
$ apt-cache show festvox-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
394100
$ apt-cache show festlex-* | grep Installed-Size | cut -d ' ' -f2 | paste -sd+ - | bc
50701
That's more than 1 GB of uncompressed data. Putting all of this in one snap isn't a great user experience, and I'm assuming it would be installed on almost every system, given that we want TTS to work out of the box for accessibility reasons. We could, however, combine a greedy plug declaration with a snap for every language. Users can then install the languages and synthesisers they want to use (similar to the theme snap[s] proposal).
The second issue is config/setup duplication. I don't know how to solve this one. Ideally, the user should only have to set up their TTS preferences once. I'm not sure how to do that with a Speech Dispatcher snap. It seems that, unless we use the Speech Dispatcher from the host system, the user will have to configure it both on their host system and in the Speech Dispatcher snap. Any languages and synthesisers used will also have to be installed by both the package manager of the host system and snap.
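To illustrate the per-language-snap idea, the consuming snap could declare a content-interface plug like the following sketch. All names here are invented, and the "greedy" auto-connection to any matching voice snap would additionally need a store-side plug declaration:

```yaml
# snapcraft.yaml fragment (hypothetical names)
plugs:
  tts-voices:
    interface: content
    content: tts-voices
    target: $SNAP/voices                 # voice data from provider snaps appears here
    default-provider: espeak-voices-en   # a hypothetical per-language voice snap
```

Users would then install only the voice snaps for the languages they actually use.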
Yes I think the interface system would be a great way to solve this problem.
The main reason I ask is that if we can get this to work well confined as a snap, Ubuntu Desktop as a whole would benefit: it's less maintenance for the team to fix bugs in older Ubuntu releases if those releases consume system components like this as snaps. I don't know of any effort underway by the Ubuntu Desktop team to confine Speech Dispatcher specifically, but one area where they could perhaps help is figuring out how to integrate Speech Dispatcher into pre-existing configuration GUIs such as Settings / GNOME Control Center. I think it's a solvable problem, but I'm not on the Ubuntu Desktop team and am barely aware of how this works, so this is just my humble opinion as an observer. Perhaps @jamesh or @kenvandine could speak better on this configuration integration.
We donât have anything specifically tied to text to speech.
The usual way to make a desktop application usable to blind people is via the accessibility framework (libatk, at-spi, and tools that plug in to the framework):
You will need to plug the desktop-legacy interface to allow your app to talk to at-spi. It's not part of desktop because the protocol is not particularly confinement-friendly: it doesn't really distinguish between applications that provide information about themselves and tools that consume that information and control the applications. This should allow a text-to-speech a11y tool running on the host to read the confined app's GUI as the user navigates it.
If your application is built out of standard widgets, that might be all that's necessary. You might be able to improve things by adding descriptions to some widgets, but most things should just work. If you've written custom widgets, things will be a bit more difficult. The above guide has some pointers on where to start, though.
Thanks for the info! I didn't know about at-spi.
So if I understand it correctly, at-spi is specifically tailored towards a11y with a screen reader. On Ubuntu, the Orca screen reader is at the other end of the bus; Orca decides what to read and then uses Speech Dispatcher to actually read it.
The app I'm working on is Foliate, an e-book reader which uses Speech Dispatcher directly to read books aloud, not only for a11y reasons but mainly for people who want to listen to a book. This requires direct access to Speech Dispatcher.
@jamesh what do you think is the best way forward to allow snaps to talk to Speech Dispatcher directly? Should I look into creating an interface which connects to Speech Dispatcher on the host system, or is there interest in the desktop team to maintain a Speech Dispatcher snap?
From my perspective, creating an interface which allows access to Speech Dispatcher on the host system is preferred, because then you don't run into synchronization issues between the server in the snap and the server on the host system.
As is, I don't think you'll be able to reliably communicate with the host's speech-dispatcher process from confinement, even with an interface granting access. Looking at the client library, we have the following code:
In short, it's doing:
1. try to connect to speech-dispatcher's unix domain socket; if that succeeds, we're done.
2. if that fails, fork and exec a copy of speech-dispatcher.
3. try to connect to speech-dispatcher's unix domain socket a second time.
It is likely that the first connect will fail, since users who don't rely on screen readers won't be running anything that tries to start it.
This autospawn strategy would result in a copy of speech-dispatcher running in your snap's sandbox, which is what you were trying to avoid. We also wouldn't want host system apps or other snaps to talk to such an instance, due to privacy/security concerns.
The modern way to handle this kind of "start on demand" behaviour is systemd socket activation, so the daemon's socket is always available but the daemon only starts on the first attempt to connect. That doesn't appear to be supported by any version of speech-dispatcher though, so it isn't something we can rely on.
For now, the best option is probably to pursue the "private copy of speech-dispatcher" strategy rather than trying to talk to the host system's version.
And following up, it doesn't look like speech-dispatcher offers much isolation between clients. It looks like it allows any client to retrieve the text any other client has asked it to speak since the daemon started:
On a blind person's desktop this could include every keystroke the user has typed for however long the history log is, assuming key echo is enabled.
This is similar to the design issue with at-spi: mixing the control/management protocol in with the regular client protocol. This was fine when all of a user's apps were considered to run within a single security domain, but it breaks down when you have untrusted confined applications.
These aren't insurmountable problems, but it would probably be better to work on them from the speech-dispatcher side. Socket activation would probably be fairly easy to implement, and would be transparent to existing clients.
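To make the socket-activation idea concrete, here is a hypothetical pair of systemd user units. speech-dispatcher does not ship anything like this today; the daemon would need to learn to accept an inherited listening socket (e.g. via sd_listen_fds()) and to run in the foreground:

```ini
# speech-dispatcher.socket -- hypothetical user unit
[Unit]
Description=Speech Dispatcher activation socket

[Socket]
# %t expands to $XDG_RUNTIME_DIR (/run/user/$UID) for user units
ListenStream=%t/speech-dispatcher/speechd.sock

[Install]
WantedBy=sockets.target

# speech-dispatcher.service -- started on the first client connection
[Unit]
Description=Speech Dispatcher
Requires=speech-dispatcher.socket

[Service]
# assumes a foreground mode, which would need to be added upstream
ExecStart=/usr/bin/speech-dispatcher
```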
Adding a security policy suitable for untrusted clients (don't leak the existence of other clients, block history access or limit it to the client's own history, etc.) is probably a bigger job.
Thanks for the detailed reply @jamesh! I created an issue in the Speech Dispatcher repo about the security policy: https://github.com/brailcom/speechd/issues/335. I'm interested to see what the maintainer thinks of this.
One thing that's not yet clear to me: once socket activation is implemented, how will snapd handle distributions that have an older version of Speech Dispatcher?