Conjure-up and lxd, present issues and future resolutions

adam.stokes · May 9, 2017, 9:18pm

In the conjure-up world we span multiple substrates to install big software. One of the most popular substrates is our Localhost (LXD) provider. It is also one of the main sources of the bug reports we get, due to LXD configuration and installation issues. This is further complicated because on Ubuntu Xenial 16.04 you can have multiple LXDs running at the same time, installed via apt install or snap install.

A lot of our issues come in the form of the following:

A user has previously set up LXD and then attempts to run conjure-up against it. One problem we ran into here is with custom ZFS storage options.
A user previously installed the initial Debian package of conjure-up, which set up some custom LXD bridges for use with our single-machine OpenStack deployments. This leads to the next point.
A user has upgraded from LXD 2.0.9 to the PPA version of 2.13 (at the time of this writing), has several bridges, and has to deal with the migration from using /etc/default/lxd-bridge to managing them within LXD itself.
A user installs conjure-up and can do all things LXD (lxc list; lxc launch ubuntu:16.04 u1), but when it comes to Juju it fails to connect to the default 8443 port. In one such case, someone had used lxc-gui to configure their environment.
A user enables ipv6 on LXD.
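
For the 8443 case in the list above, a quick reachability probe can help tell a networking problem apart from a Juju one. This is only a minimal sketch and not code from conjure-up: the function name lxd_port_open is made up for illustration, and it only checks that something accepts TCP connections on the port, not that the LXD HTTPS API behind it is healthy.

```python
import socket


def lxd_port_open(host: str = "127.0.0.1", port: int = 8443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; 8443 is the default port mentioned above.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling lxd_port_open() with no arguments probes 127.0.0.1:8443. If it returns False, one thing worth looking at is whether core.https_address is set in the LXD server config, since LXD does not listen on the network without it.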

Now some of these items we can address in the LXD profiles themselves. This is our current approach where conjure-up spells have the ability to perform LXD profile edits prior to deployment of big software. For example, mapping a custom default storage pool for use with conjure-up.
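
To make the storage pool example a bit more concrete, here is a rough sketch of what such a profile edit amounts to. It assumes the usual LXD profile layout, where the root disk is a device of type disk mounted at / that names its pool. The helper with_root_pool, the pool name conjure-up-pool and the sample profile are all hypothetical, and this is not how the spells actually implement it.

```python
import copy
import json


def with_root_pool(profile: dict, pool: str) -> dict:
    """Return a copy of an LXD profile whose root disk device uses `pool`.

    Hypothetical helper for illustration; the input profile is left untouched.
    """
    updated = copy.deepcopy(profile)
    devices = updated.setdefault("devices", {})
    # Reuse an existing root device if there is one, otherwise create it.
    root = devices.setdefault("root", {"type": "disk", "path": "/"})
    root["pool"] = pool
    return updated


# Illustrative profile only; a real one would come from the LXD being edited.
default_profile = {
    "name": "default",
    "config": {},
    "devices": {"eth0": {"type": "nic", "nictype": "bridged", "parent": "lxdbr0"}},
}
print(json.dumps(with_root_pool(default_profile, "conjure-up-pool"), indent=2))
```

Returning a copy rather than editing in place keeps the original profile around, which matters if the change ever needs to be compared against or rolled back.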

And in conjure-up we do some checking to validate an LXD environment:

Install the latest LXD from the snap store or PPA during conjure-up installation to make sure we have the latest LXD.
Use lxc network to create a custom network bridge for use with OpenStack single machine.
Verify that a lxdbr0 default bridge exists.
Verify that ipv6 is not enabled.
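
The last two checks could look roughly like the sketch below. It is not the actual conjure-up code: the function name check_lxd_networks is invented here, and it works on an already collected mapping of network name to LXD config keys (gathering that, for example by parsing lxc network show output, is left out). It also assumes that a missing ipv6.address key or the value none means IPv6 is off.

```python
def check_lxd_networks(networks: dict, bridge: str = "lxdbr0") -> list:
    """Return a list of human-readable problems; an empty list means the checks pass.

    `networks` maps a network name to its LXD config keys, e.g.
    {"lxdbr0": {"ipv4.address": "10.0.8.1/24", "ipv6.address": "none"}}.
    Hypothetical helper for illustration.
    """
    problems = []
    config = networks.get(bridge)
    if config is None:
        # Without the bridge there is nothing further to inspect.
        problems.append(f"bridge {bridge} not found")
        return problems
    # Assumption: a missing key or the literal "none" means IPv6 is disabled.
    ipv6 = config.get("ipv6.address", "none")
    if ipv6 != "none":
        problems.append(f"IPv6 is enabled on {bridge} ({ipv6})")
    return problems
```

Returning a list of problems instead of raising on the first one lets the installer report everything that needs fixing in one pass.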

Even with the above checks we still continuously hit issues revolving around LXD. Additionally, conjure-up makes modifications to your existing LXD environment that may or may not be acceptable to users who rely on LXD for things outside of conjure-up. Even if these changes were reversible, we would be unable to undo them because snapd has no concept of uninstall hooks.

One solution we keep coming back to is to bundle LXD within conjure-up and isolate this environment from the rest of the system. Some of the pros of this are:

We manage LXD and can control the known versions that will work with Juju. At times there have been API changes on either side that affected the other.
Any additions required for LXD to work with conjure-up can be done outside of the user’s normal LXD installation. We realize we can’t get away with not making some modifications on the host system in order to provide an out-of-the-box experience, but this would at least isolate those changes.
On any conjure-up snap upgrade we can verify that LXD is configured properly and make better assumptions as to what conjure-up can expect when doing Localhost (LXD) deployments.
Repeatable and known good releases with our testing harness.

Cons being:

Security
More maintenance burden
Upgrading existing deployments would mean we’d have to attempt to migrate those over to the bundled LXD or socialize the fact that you would have to redeploy. This would be a one-time thing.
Containers would only be visible via conjure-up.lxc
Probably a lot of other things…

Also, I think this problem would still surface regardless of whether we had interfaces to LXD or snap dependency resolution.

I wanted to get this out there to get people’s feedback and any alternative suggestions for solving this problem. The idea of packaging everything we need inside a single snap is appealing; however, we are definitely open to other avenues.

kwmonroe · May 10, 2017, 2:52pm

+1 to bundling lxd with conjure-up. I imagine something similar drove the decision to include juju alongside c-u. Even with your cons list, it makes sense to me to have tighter control over all the moving parts.

Fwiw, the biggest con to me will be solved with docs. I’m envisioning lots of people doing lxc list and wondering where all their stuff is. Gonna need big red letters so people know how to interact with conjured containers.

Edit: one question @adam.stokes - what does this do to macos users? I’m assuming you can adjust the brew formula so it doesn’t include any lxd.

adam.stokes · May 10, 2017, 3:08pm

All the building and packaging of conjure-up on Linux is done through snapcraft. The macOS builds will continue to build without support for LXD.

lazypower · May 10, 2017, 4:28pm

You’ve clearly put some thought into this.

We have run into similar situations with complementary container technology in the Kubernetes stack and decided to go with apt-packaged Docker for now while we sort out the issues moving forward with externally maintained Docker packages (snap, or upstream PPA).

Having run into similar situations, I myself wish we could more tightly control that pipeline, at the cost of a maintenance burden to ourselves, but provide a better experience to users.

If you’re prepared to encapsulate that level of testing and make it clear on the spell summary overview that it’s in a contained LXD I think this is a fantastic idea!

At a minimum, we should explore this avenue and accrue experience in why this is good/bad. I’m not present in most of the support conversations regarding this; however, I can see from the few interactions I’ve had with respect to CDK on LXD that this sounds like an easier coupling in your test/release cycle.

+1 from me.

petevg · May 10, 2017, 8:27pm

This makes a lot of sense. My one counterargument is that, right now, juju is not a tool that generates any lock in. If you want to kill your controller and stop using juju, you still have all the machines that it deployed, accessible in a normal way.

With this approach, conversely, you must keep the conjure-up snap installed and interact with lxd via that snap.

(You can still ignore the rest of conjure-up/juju, of course. But you can’t make quite as clean a break as you would had conjure-up simply put things on your system’s lxd.)

Would it make sense to give users an option? Or offer them the option to use the bundled lxd if we detect an issue using the system lxd?

adam.stokes · May 10, 2017, 8:40pm

That’s pretty much what we do today: we attempt to detect which LXDs are installed and make use of them. However, the pain points still stand. Also, we try to make the deployment as seamless as possible, so giving us the ability to make assumptions is a big plus.

mjfs · May 11, 2017, 3:10am

Given that the localhost/LXD provider is really only going to be used for developer and “let’s try this out” scenarios, I think it’s totally appropriate for conjure-up to bundle its own lxd.

petevg · May 11, 2017, 2:29pm

@mjfs: That is a very good point. And telling people about conjure-up.lxd gives them the tools to poke at auxiliary services in other snaps, which might help the snap ecosystem as a whole by giving people the mental toolkit to successfully troubleshoot snap issues.

I am +1 on the change.

drpaneas · June 17, 2017, 10:05am

I ran into all sorts of childish problems last night, exactly with what is discussed here. I’ve installed conjure-up from snaps, but when I tried to run conjure-up kubernetes it threw me some Python traceback complaining about lxd. The lxd package was not even installed on my system. I would have expected a bundle, like what snaps claim to be, to take care of its dependencies. Then I ran into other problems also, but I suppose you all are already familiar with those.

Given that Canonical is actively boosting Kubernetes, you have to either quickly change the documentation (by default it doesn’t work) for installing Kubernetes localhost (see conjure-up with snaps, and LXD container issues) or somehow take care of the lxd infrastructure as a part of the snaps installation.

+1 on the change

adam.stokes · June 19, 2017, 2:37pm

As of 2.2 we do bundle LXD. So hopefully the childish problems you’ve run into are gone. If not, feel free to open a bug report and help us get those remaining issues fixed. Documentation has been updated; we’re just waiting for the auto sync to happen so it goes live.