Snap refresh core hangs until reboot timeout

When refreshing the core snap, I noticed that the command never finishes; it hangs until the 10 minute timer expires and forces the reboot. As a result, control is never returned to the user or, in my case, to the calling script. I'm using this in automated tests and want more control over when the system reboots, so that I can refresh multiple snaps, then force the reboot and wait for the system to come back. Instead, the script gets part way through the list of commands it runs over ssh, blocks on the core refresh until the reboot, and then the subsequent ssh commands fail because the system is unexpectedly down.

It also seems that once it hits this point, any other attempt to ssh to the system fails.

This appears to be a known issue, and is documented in https://bugs.launchpad.net/snapd/+bug/1687608

Is there any reason why the core snap behaves this way? Is something blocking it from completion, or is it just really taking longer than 10 minutes to finish the refresh?

FWIW we are using the same infrastructure to run part of our tests and the test cases that involve a core refresh or revert work consistently well if we take the following approach:

  • Run the refresh/revert command that will result in a reboot in a separate job step: this step connects to the testbed, runs the command that eventually hangs and keeps the connection open for about 10 minutes, until the testbed reboots and the connection is broken.
  • Run the rest of your commands as spread tasks in the next job step: spread takes care of waiting for the connection to be up.
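
For anyone not using spread, the same two-step pattern can be sketched in plain Python. This is only an illustration under assumptions: the helper names (`build_ssh_command`, `run_until_disconnect`, `wait_for_port`, `refresh_core_then_resume`) and the timeout values are made up for this example, and it assumes key-based ssh access to the testbed. It simply treats the broken connection as expected, then polls the ssh port until the system is back.

```python
import socket
import subprocess
import time


def build_ssh_command(host, remote_command, connect_timeout=10):
    """Return the argv list for running remote_command on host over ssh."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        remote_command,
    ]


def run_until_disconnect(host, remote_command, max_wait=900):
    """Run a command that is expected to hang until the testbed reboots.

    Returns the ssh exit status, or None if max_wait seconds pass first.
    A non-zero status is expected here, since the reboot breaks the connection.
    """
    try:
        result = subprocess.run(
            build_ssh_command(host, remote_command), timeout=max_wait
        )
        return result.returncode
    except subprocess.TimeoutExpired:
        return None


def wait_for_port(host, port=22, timeout=600, interval=5):
    """Poll until a TCP connection to host:port succeeds; True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Port closed or host unreachable: the testbed is still rebooting.
            time.sleep(interval)
    return False


def refresh_core_then_resume(host):
    """Step 1: refresh core and ride out the reboot. Step 2: wait for ssh.

    host is assumed to be a bare hostname or IP that ssh can log in to
    without extra options.
    """
    run_until_disconnect(host, "sudo snap refresh core")
    if not wait_for_port(host):
        raise RuntimeError(f"{host} did not come back after the reboot")
    # From here on it should be safe to run the remaining test commands.
```

Calling `refresh_core_then_resume("testbed.example")` (a placeholder hostname) would block for roughly as long as the refresh hangs, then return once port 22 accepts connections again. An open port does not guarantee that snapd has finished its post-reboot work, so a real script may need an extra readiness check.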

As an example, during the core validation at beta we run a test scenario in which we begin from a stable image, refresh core to beta and then run the full test suite. This is the job definition: http://paste.ubuntu.com/24645545/ (we call refresh on line 9 and spread right after, on line 10) and these are the results: https://paste.ubuntu.com/24633121/ (the reboot takes place around line 5320 and the job continues with a refreshed image). This job has been working consistently for the last three releases, with four executions in each release.

This is by no means a solution to the underlying problem, but if you can lay out your tests in this way, they will keep working after the final solution is in place. Please let me know if you need any help wrapping your test actions in spread tasks.

It would also be possible to run the whole job as a single spread task and orchestrate from there, as we do in this more complex refresh+revert scenario: https://github.com/fgimenez/snappy/blob/e670e4359a18f16f05a53a52a01781e00cf73fb7/tests/nested/core-revert/task.yaml. This seems to me like a longer term solution; let me know if it would fit your use case better.

The current behavior was the result of fixing some more serious bugs but we do plan to improve on this:

  • the command will terminate saying we are about to reboot, and reboot will happen more quickly (seconds, not minutes)

  • because we need that for our own tests there will be some way to control the reboot (haven’t thought through which way exactly yet)


Hi, is there a way to control the reboot now? Has that capability been implemented?