Issue with repackaged core and testing

During spread testing we build a native package of snapd (for the current OS and architecture) and copy several files over from the native package to the core snap (the core snap is uncompressed and recompressed for this operation). This introduces several kinds of issues.

Impact

  • Native C code makes assumptions about the directory structure of the host that may be incompatible with the fixed structure inside the core snap.
  • Native C code was compiled and linked with the host toolchain, which may use an incompatible set of libraries, including a more recent libc and additional libraries (e.g. selinux) that are not present in the core snap. This happened in Fedora 29 with libc 2.27, which is not present in the Ubuntu 16.04-based core snap.
  • Native package may have performed interpreter path mangling that is not compatible with the core snap. This specifically includes rewriting /bin/sh to /usr/bin/sh which affects the snap-device-helper shell script. While the script itself uses absolute path names compatible with the core snap for the commands it invokes (and resorts to shell builtins for everything else) the interpreter path itself is broken.
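The interpreter-path problem in the last bullet can be checked mechanically. The following is a minimal sketch, not part of snapd or its test suite: the function names, the `/core` root used in the usage note, and the injectable `exists` predicate are all assumptions made for illustration. It reads the shebang line of a script and asks whether that interpreter path exists under a given root, such as an unpacked core snap.

```python
import os


def shebang_interpreter(script_text):
    """Return the interpreter path from the first line, or None if there is no shebang."""
    lines = script_text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        return None
    parts = lines[0][2:].split()
    return parts[0] if parts else None


def interpreter_available(script_text, root, exists=os.path.lexists):
    """Check that the script's interpreter path is present under `root`.

    `root` is a hypothetical directory holding an unpacked core snap.
    lexists is the default so that absolute symlinks inside the unpacked
    tree are not resolved against the host filesystem.
    """
    interp = shebang_interpreter(script_text)
    if interp is None:
        return True  # nothing to resolve
    return exists(os.path.join(root, interp.lstrip("/")))
```

For example, with a root that only provides `bin/sh`, `interpreter_available("#!/bin/sh\n", "/core", exists=lambda p: p == "/core/bin/sh")` holds, while the mangled `#!/usr/bin/sh` form of the same script does not.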

Possible solutions

  • A simple but costly solution is to restore the correct build environment. Snapd must be built with the Ubuntu 16.04 toolchain and build configuration if it ends up in a repackaged core snap.
  • We could adjust the core snap to have paths compatible with what is used in Fedora. This would also require updating the C library as the world moves on to ever-more-recent versions of libc.

Actual solutions:

To fix the issue we need to make several changes that prevent us from using incompatible executables from the host distribution inside the guest namespace.

  1. Stop repackaging core the way we do now unless we are testing against an Ubuntu 16.04 system, which is ABI compatible with the core snap. Unless on Ubuntu 16.04 we must only replace snap-exec. Please see the list of executables below for detailed analysis.
  2. When snap-confine is invoked with --base it bind mounts /usr/lib/snapd from the host. This must be changed to a bind mount from /snap/core/current/usr/lib/snapd or /snap/snapd/current/usr/lib/snapd instead. This prevents us from using incompatible binaries from the host distribution.
  3. This also means that to run a snap with an arbitrary base we need to have either the snapd or the core snap installed. This is a new prerequisite that needs to be changed in the snap manager.
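Steps 2 and 3 amount to a small selection rule, sketched below. This is only an illustration and not the snap-confine implementation (which is written in C): the function name, the order in which the two candidates are tried, and the injectable `is_dir` predicate are assumptions. Returning `None` corresponds to the new prerequisite in step 3, where neither the core nor the snapd snap is installed.

```python
import os

# Candidate sources for the /usr/lib/snapd bind mount, as named in step 2.
# The ordering (core first, then snapd) is an assumption of this sketch.
CANDIDATE_TOOL_DIRS = (
    "/snap/core/current/usr/lib/snapd",
    "/snap/snapd/current/usr/lib/snapd",
)


def pick_tools_dir(is_dir=os.path.isdir):
    """Return the first candidate directory that exists, or None.

    None means neither snap is installed, so a snap with an arbitrary
    base cannot be started without falling back to the host's tools.
    """
    for candidate in CANDIDATE_TOOL_DIRS:
        if is_dir(candidate):
            return candidate
    return None
```

The `is_dir` parameter exists only so the rule can be exercised without a real /snap tree; a caller would normally use the default.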

snapctl

Dynamically linked Go, but that’s fine. No need to repackage until it radically changes. (Document this in snapctl/main.go)

snap-gdb-shim

Dynamically linked C. Can stay as-is with caveat.

snap-confine

Dynamically linked C. Can stay as-is with caveat.

snap-device-helper

Shell script. Can stay as-is with caveat.

INFO

We don’t need to repackage it as the rest of the core snap is old.

The caveat above is that if any of those files changes significantly we won’t see the new versions inside the core and inside the execution environment until said changes are released to stable. To fix this we need the full solution that involves building snapd twice, once for the host and once for the core snap repackaging (or snapd snap repackaging).

In the first part it’s not very clear which executables are actually invoked/affected, what about snap-exec?

The set of executables invoked in the initial mount namespace (aka with future libc)

  • snapd (because we re-execute into it)
  • snap (because we re-execute into it)
  • snap-confine (because we run the version from core explicitly)
  • snap-exec (because it is executed for classic snaps without using pivot_root)
  • snap-device-helper (because it is executed by udev rules)
  • snapctl (because it is executed by hooks in classic snaps)

The set of executables invoked in the per-app snap mount namespace (aka, with old libc)

  • snapctl (because it is executed by hooks in non-classic snaps)
  • snap-confine (because it may be executed from snaps in devmode, CE relies on this for testing)
  • snap-exec (because it is executed for non-classic snaps after pivot_root)
  • snap-device-helper (because it is executed by snap-confine and snap-confine may execute)

Out of those, the executables that use static linking are not affected. Currently this includes:

  • snap-exec
  • snap-update-ns

Those are always safe to use, even if they were built with future toolchains and executed in past environments. They don’t perform any essential IPC or rely on any wire protocol.
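Whether a given binary falls into the static group can be checked from its ELF program headers. The sketch below is an assumption-laden illustration rather than anything snapd does: it treats a binary as depending on the environment's dynamic loader when it carries a PT_INTERP program header, and the helper `minimal_elf64` only builds a synthetic image to exercise the check. Both function names are hypothetical.

```python
import struct

PT_LOAD = 1
PT_INTERP = 3  # program header type that names the dynamic loader


def requests_dynamic_loader(image: bytes) -> bool:
    """Return True if the ELF image has a PT_INTERP program header."""
    if len(image) < 16 or image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    is_64 = image[4] == 2                   # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if image[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    if is_64:
        (phoff,) = struct.unpack_from(endian + "Q", image, 0x20)
        phentsize, phnum = struct.unpack_from(endian + "HH", image, 0x36)
    else:
        (phoff,) = struct.unpack_from(endian + "I", image, 0x1C)
        phentsize, phnum = struct.unpack_from(endian + "HH", image, 0x2A)
    for index in range(phnum):
        # p_type is the first 32-bit field of a program header in both classes.
        (p_type,) = struct.unpack_from(endian + "I", image, phoff + index * phentsize)
        if p_type == PT_INTERP:
            return True
    return False


def minimal_elf64(p_type: int) -> bytes:
    """Build a synthetic little-endian ELF64 image with one program header."""
    header = bytearray(64)
    header[:4] = b"\x7fELF"
    header[4] = 2                                 # ELFCLASS64
    header[5] = 1                                 # little-endian
    struct.pack_into("<Q", header, 0x20, 64)      # e_phoff: right after the header
    struct.pack_into("<HH", header, 0x36, 56, 1)  # e_phentsize, e_phnum
    phdr = bytearray(56)
    struct.pack_into("<I", phdr, 0, p_type)
    return bytes(header + phdr)
```

On a real system the check would be applied to the file contents, e.g. `requests_dynamic_loader(open(path, "rb").read())`. A binary without PT_INTERP can still make assumptions about its environment, so this only covers the libc and loader dependency discussed here.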

The problem as we are experiencing it now is strictly limited to snap-device-helper. To a lesser extent snap-confine may stop working in the future as it links to libc and libudev from the future.

This is a bit confusing to me, also because we have the case where there is no re-execution. What is the initial namespace? The one with the host filesystem?

not sure I understand this bit

AFAIU we cannot do this generally because of the non-reexec case. I’m also not sure this is what we do currently either; there is this comment in the code at the moment:

            // bind mount the current $ROOT/usr/lib/snapd path,
            // where $ROOT is either "/" or the "/snap/{core,snapd}/current"
            // that we are re-execing from

Whatever we do we would like snapctl and snapd to match.

It’s also unclear whether you are saying we have a non-test issue related to this or not.

@zyga-snapd I’m still confused about whether we also do something incorrectly in the non-test case, or whether the issues are, as per the topic title, only present when we use a repackaged core in tests.

In theory when we don’t reexec we could remove or put a broken binary into core for any non-used binary.

Do we have distros where the issue is present and reexecing is the default?

We don’t have any issues that I know of in the places where we reexecute. IMO this is strictly confined to the testing environment.

Can’t you just make a UsrMerged core snap for 16.04 and 18.04?

No one told me this was a problem. This is the first time I’ve heard of it. brp-mangle-shebangs supports being told which files to exclude from mangling.

I don’t understand how that fixes the situation. Perhaps the topic is confusing and my explanation insufficient. If you want I’m happy to have a call to brainstorm this.

EDIT: To be clear, this is not a packaging bug. It is a bug in the test infrastructure.

Sure, I’m open for a bit today, and most of the afternoon tomorrow.