Files missing in installed snaps

Hi, we’re encountering a weird issue on some of our Ubuntu Core IoT devices: a file is missing in a few cases. The file in question (pyproject.toml) is part of 6 Python snaps that are each installed on ~2k devices, and so far it has gone missing for 3 different snaps on 5 different devices.

The file is bundled in the .snap (app/pyproject.toml) and I verified that it’s indeed in the files. We use the following layout to expose it to the snap (I assume this was set up like that to follow best practices/due to some suggestion but I don’t know):

layout:
  /opt/app/pyproject.toml:
    symlink: $SNAP/app/pyproject.toml

We then access the file under /opt/app/pyproject.toml. In these 5 cases, the snap persistently fails to read the file (and not just because it tries to read it “too soon”).

I can resolve the issue with disable/enable of the snap but I don’t know what causes it - presumably either the file is missing on the file system or the layout somehow failed. Unfortunately I can’t tell whether the file is actually on the file system or not.

Any ideas how to investigate/fix this?

Additional info that might be relevant:

  • We recently changed the snaps from core20/python plugin to core22/uv plugin
  • Before, we didn’t bundle pyproject.toml but another file with similar metadata, and we never encountered/noticed this behavior with it
  • Snapd revision is 24509

I think the layout may be the culprit?

You can check whether the file is on disk; given the description, I would expect it to be at /snap/<snap name>/current/app/pyproject.toml. I assume that the file appears correctly on the majority of working devices. You could confirm whether the layout is applied correctly by inspecting it in the snap’s context on a working device and on a non-working device:

sudo snap run --shell <snap name>.<app name>
ls /opt/app/pyproject.toml
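
To make the comparison between a working and a non-working device less ambiguous, it can help to tell a missing symlink apart from a dangling one. The following is only a sketch: layout_state is a made-up helper name, and it assumes you can paste the function into the shell opened by snap run --shell.

```shell
# Hypothetical helper: classify the state of a layout symlink.
# Prints one of: ok, dangling, not-a-symlink, missing.
layout_state() {
    if [ -L "$1" ]; then
        # -e follows the symlink, so it fails when the target is absent.
        if [ -e "$1" ]; then
            echo ok
        else
            echo dangling
        fi
    elif [ -e "$1" ]; then
        echo not-a-symlink
    else
        echo missing
    fi
}
```

For example, layout_state /opt/app/pyproject.toml prints missing if the layout entry was never created, and dangling if the symlink exists but the file under $SNAP/app does not.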

@zyga may have more knowledge about troubleshooting this.

Hi, thanks for your reply. Unfortunately I can’t check either, I don’t have root access to the devices and also can’t run commands in the broken snaps’ contexts (as they’re not receiving commands).

I can run commands in the context of other snaps, though I couldn’t think of something useful to do that way.

Why/how would the layout not be applied correctly? Could this be related to core22 (as we didn’t observe this with core20)?

I can’t quite guess what may be wrong without access to snapcraft.yaml or the snap. However, if you’re mixing content and layouts, it is entirely possible for this to go wrong.

Note, snapd 24509 is 2.68.4. The latest release in stable is 2.73 and the candidate is 2.74.1. Specifically, 2.74 introduces a change around snap mount namespaces which makes them more robust across snap refreshes. I would strongly suggest updating in the near future.

What do you mean by “mixing content and layouts”? Which parts of snapcraft.yaml would be relevant/helpful, beside the layout in question?

Thanks for the heads-up! We’re currently holding 3rd-party snap updates until an ongoing deployment is finished (so that devices come online quicker and aren’t busy refreshing initially) but I’ll keep an eye out for that. Is there a way to verify (or at least get reasonable confidence) that the changes in 2.74 would prevent this issue going forward?

I mean mixing plugs using interface: content with layouts, where parts of the tree may overlap. This works most of the time, but it’s not impossible to hit an edge case where the state observed by an application is not what you would expect. There are more edge cases; e.g. changing a layout entry from symlink to bind is likely an issue as well.

As I mentioned, snapd 2.74 brings in a change where we discard the mount namespace during refresh if possible. This means that after the refresh completes, the application starts from a clean state, so some possibly buggy scenarios are now avoided entirely.

Ah, I see. No, there’s no overlapping plug for the missing file. I do use the combination of plug and layout, but not for the file that’s missing. Here’s a sample of what I’m using:

plugs:
  device-identity:
    interface: content
    content: device-identity
    target: $SNAP_COMMON/device

layout:
  /opt/app/pyproject.toml:
    symlink: $SNAP/app/pyproject.toml
  /opt/device:
    symlink: $SNAP_COMMON/device

(There are two more plugs/layouts equivalent to the device directory.)

Do I understand you correctly that the snapd change would help prevent issues in this scenario, i.e. for files in that directory? Do you think this could interfere with the layout for pyproject.toml?

With the snippet you pasted, the changes under /opt seem to be triggered only by layouts. Since /opt/device is a symlink, I suspect it should be fine, as it’s mostly transparent.

With an older snapd, should the problem occur, it would happen during a refresh of your snap. You can try to run this scenario, even locally, and see if you can reproduce it. At this point, if you notice that the file may be missing, you can try this to join the snap’s mount namespace and dump all the mounts:

sudo nsenter --mount=/run/snapd/ns/<your-snap-name>.mnt /usr/bin/findmnt

or even

sudo nsenter --mount=/run/snapd/ns/<your-snap-name>.mnt /bin/bash

and have a look around. You can inspect the symlink’s presence and/or see if there are any mounts hiding it.
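
The full mount table can be long, so a filter on the mount point may make it easier to spot something stacked on /opt. This is a minimal sketch, assuming the mount table is fed in as /proc/self/mountinfo text on stdin; mounts_under is a hypothetical helper, not part of snapd or util-linux.

```shell
# Hypothetical helper: print mountinfo entries whose mount point (field 5)
# is the given directory or lies below it. Reads mountinfo text on stdin.
# Output columns: mount point, filesystem type, mount source.
mounts_under() {
    awk -v p="$1" '
        $5 == p || index($5, p "/") == 1 {
            # Optional fields end at a lone "-"; fs type and source follow it.
            for (i = 7; i <= NF; i++) if ($i == "-") break
            print $5, $(i + 1), $(i + 2)
        }'
}
```

One way to feed it, assuming root access: run cat /proc/self/mountinfo inside the snap's mount namespace with nsenter and pipe the result into mounts_under /opt. More than one line for the same mount point would suggest that something was mounted over an earlier entry.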

There isn’t much magic behind layouts. The basic idea is: if your layout target is e.g. /opt/app/pyproject.toml, but /opt/app/pyproject.toml does not exist at runtime and the location where it would have to be created is found to be read-only (which is true for, say, /opt, as it’s part of the coreXX snap), snap-update-ns creates something which we refer to as a mimic. It’s kind of a scratch tmpfs location (which is read-write) that will be mounted in place of the target directory. We recreate all of the original location’s content inside it, then add the new entry you specified in the layout, and finally mount it on top of the original location.

In this case, e.g. core22 /opt is empty, so in order to create /opt/app/pyproject.toml I would expect snap-update-ns to create a tmpfs on the side, create the app/pyproject.toml symlink inside it, and bind mount that scratch location on top of /opt. This is quite simple in theory, but once you start considering bind mounts and propagation it quickly stops being so obvious.
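
To make that sequence more concrete, here is a toy imitation that uses plain directories and symlinks instead of a tmpfs and bind mounts. It is not snapd code: build_mimic and its arguments are made up for illustration, and it assumes the existing entries of the original directory can be stood in for by symlinks.

```shell
# Illustration only: imitate the "mimic" steps with plain directories.
# The real snap-update-ns works with a tmpfs and bind mounts inside the
# snap's mount namespace; nothing here touches mounts or needs root.
build_mimic() {
    original=$1   # stands in for the read-only directory, e.g. /opt
    scratch=$2    # stands in for the writable tmpfs
    entry=$3      # new relative path from the layout, e.g. app/pyproject.toml
    target=$4     # what the new symlink should point at

    mkdir -p "$scratch"
    # Step 1: recreate every existing entry of the original in the scratch.
    for item in "$original"/*; do
        [ -e "$item" ] || [ -L "$item" ] || continue
        ln -s "$item" "$scratch/$(basename "$item")"
    done
    # Step 2: add the entry requested by the layout.
    mkdir -p "$scratch/$(dirname "$entry")"
    ln -s "$target" "$scratch/$entry"
    # Step 3 (not done here): mount the scratch on top of the original.
}
```

With an empty original directory, as with /opt on core22, step 1 does nothing and the scratch ends up containing only app/pyproject.toml.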

Thanks, good point about trying to reproduce. I’ve set up 4 devices to constantly re-refresh the same revision.

In this case, e.g. core22 /opt is empty, so in order to create /opt/app/pyproject.toml I would expect snap-update-ns to create a tmpfs on the side, create the app/pyproject.toml symlink inside it, and bind mount that scratch location on top of /opt. This is quite simple in theory, but once you start considering bind mounts and propagation it quickly stops being so obvious.

I don’t quite get this part - but, do you really mean /opt/app/pyproject.toml, without being in the context of the single snap? I’m asking because all those 6 snaps map their own version of this file and if this is in a shared context I expect concurrent refreshes could potentially interfere with each other. Could this be the cause of my issue?

Something else that just occurred to me: the snaps are based on core22 but the boot base is still core20. I don’t think that should matter but maybe it does.

If the revision and snap content do not change, then most operations will be no-ops, so it’s unlikely you would be able to reproduce it that way. I would suggest refreshing between 2 known revisions instead.
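
A loop along these lines could alternate between the two revisions and stop at the first failure. This is a sketch under assumptions: cycle_revisions is a hypothetical helper, it has to run as root, both revisions must be obtainable on the device, and the snap should start out on the second revision so that the first refresh actually changes something. The check command is whatever test you use to detect the missing file.

```shell
# Hypothetical repro loop: alternate one snap between two revisions and stop
# at the first cycle where the check command fails.
# Arguments: snap name, revision A, revision B, number of cycles; all
# remaining arguments form the check command run after each refresh.
cycle_revisions() {
    name=$1 rev_a=$2 rev_b=$3 cycles=$4
    shift 4
    n=1
    while [ "$n" -le "$cycles" ]; do
        for rev in "$rev_a" "$rev_b"; do
            # Return 2 if the refresh itself fails, so that this is not
            # confused with a failed check.
            snap refresh "$name" --revision="$rev" || return 2
            if ! "$@"; then
                echo "check failed in cycle $n at revision $rev"
                return 1
            fi
        done
        n=$((n + 1))
    done
    echo "no failure in $cycles cycles"
}
```

Stopping at the first failure leaves the device in the broken state, which is the moment to inspect the snap’s mount namespace.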

Each snap gets its own mount namespace, so all mappings are private to the snap.

It doesn’t matter. Any differences in how a snap is handled are decided by looking at the base: declared in its snap.yaml.

Thanks for the insights. I’m just writing to let you know I’m still investigating as this is still a serious issue for us. Maybe you have additional ideas for possible causes or tests to run.

I’ve had the devices refresh between two revisions of a single snap for around 10k cycles now, without being able to reproduce it.

To see if it may after all be related to concurrent updates, I changed the cycle to first downgrade 4 snaps separately and then refresh them together (sudo snap refresh) - this is also what we do with the update that led to the error. Cycle time is much longer now (I’m still below 2k cycles) but I haven’t encountered the issue yet unfortunately.

Any thoughts?

This sounds ok. If you weren’t able to reproduce it by refreshing the snap itself after that many attempts then I don’t think there’s any point in carrying on with simple refresh. Likely something else could be at play. As I mentioned before, snapd 2.74.1 has hit stable, which should ‘resolve’ one class of bugs.

One more thing that comes to mind: is the connection to the content snap made automatically, or do you need to establish it yourself on first installation? Is there any correlation between the problem appearing and refreshes of the content snaps connected to the affected snap?

Hm, I’m not sure what you mean regarding the content snap. The source of the layout symlink is a file within the snap ($SNAP/app/pyproject.toml) and is not shared via the content interface. However, the issue may correlate with refreshes of the snap that provides other directories via the content interface. Those don’t seem to be affected though.

I had the concurrent refresh test running some more but only got up to ~1.2k cycles so far. That’s not conclusive, but it is still quite a few cycles given the prevalence we observed. Unfortunately, I don’t really see what else to do but update snapd to 2.74.1 and hope we don’t encounter the same issue during our next deployments.