
Canonical Autopilot crashes and burns

I recently had some free research time and decided to check out Canonical Autopilot. Autopilot is what Canonical (the Ubuntu Linux people) calls its quick and easy managed OpenStack install: software Canonical wants us to try, like, and then purchase support for.

I started out with the Autopilot How To page. It was very inviting: just set up hardware, type in these commands, and you have an OpenStack cluster. I configured four servers as compute nodes, each with dual quad-core CPUs, 48 to 96GB of RAM, and 2 disks. In this case each “disk” is a RAID partition: 100GB for boot and the rest (usually about 1.4TB) as a second disk. Yes, I have access to Hardware, with a capital “H”.

The how-to page links directly to the Download Ubuntu Server page, which has a big download button for 16.04LTS amd64, so that’s what I downloaded. I installed it on a dual-socket, quad-core, 32GB RAM server with 1.5TB of RAID, again in the 100GB + the rest configuration. That went well. Next was the MAAS (Metal as a Service) install.

Error: Failed to fetch http://ppa.launchpad.net/maas/stable/ubuntu/dists/xenial/main/binary-amd64/Packages  404  Not Found

The recommended repo in Step 2, ppa:maas/stable, cannot work: it doesn’t support Xenial / 16.04LTS. First I tried replacing “stable” with “devel”. That almost worked. After an hour of groping around, I found ppa:maas-maintainers/experimental3 and was able to move on. In between I did this:

sudo apt-get purge maas maas-dhcp maas-dns
sudo apt-get clean
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install maas maas-dhcp maas-dns
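In hindsight, the 404 above could have been caught before touching apt at all. Here is a minimal sketch, assuming the PPA publishes its index at the same path apt complained about; the function names are my own, and some archives only publish a compressed Packages.gz, so treat a failure as a hint rather than proof:

```shell
#!/bin/sh
# Build the URL of a PPA's package index for one Ubuntu release.
# $1 = PPA as owner/name (e.g. maas/stable)
# $2 = release codename (e.g. xenial)
# $3 = architecture (optional, defaults to amd64)
ppa_packages_url() {
    printf 'http://ppa.launchpad.net/%s/ubuntu/dists/%s/main/binary-%s/Packages\n' \
        "$1" "$2" "${3:-amd64}"
}

# Return 0 if the index can be fetched, non-zero otherwise.
# curl -f turns an HTTP error such as 404 into a failing exit status.
ppa_supports_release() {
    curl -fsIL -o /dev/null "$(ppa_packages_url "$@")"
}
```

Something like ppa_supports_release maas/stable xenial || echo "no xenial packages here" would then flag the problem in a second or two instead of an hour.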

Moving on to Step 3, I created a MAAS region admin user, got the MAAS web UI up, and added my SSH key, a nice touch for future compute node administration. I imported a 16.04LTS image for use as a PXE boot image, which seemed to confirm that I was on the right path with 16.04LTS, despite the repo issues.

I went to Step 4 to set up the compute-side subnet, and… no Clusters Tab on the web UI. These instructions seem to be superseded by a more scattered approach of going back and forth between the Networks tab and the Nodes->Controller tab/link. Missing from the new approach in MAAS 2.0 is a direct way to invoke DNS and DHCP. More on that later.

PXE Boot and Commissioning with 16.04LTS and maas-maintainers/experimental3 went very smoothly, with 2 of the 4 systems doing IPMI powerups commanded over Ethernet, very slick. For the other two I edited the Power node properties to Manual in the UI, and just switched the servers on myself. Once commissioned, none of the systems had any storage configured, or IP addresses. Hmm.

I struggled a bit getting a compute-side subnet up, then realized I had to set a fixed IP on the second Controller Ethernet port. (This isn’t mentioned in the setup text, and I had left it unconfigured on purpose because I just didn’t know how far auto-configuration went.) So I did that and rebooted, and with an IP associated with the second interface, a menu appeared to set up a subnet on it. So I tried that… and couldn’t. I filed a bug, and found I was using the wrong repo. Again.

Trying again, I removed and purged MAAS, and reinstalled from the maas/next PPA. Getting the MAAS web UI back went OK. I went through the basic configs (again), it found the image download, and I added SSH keys (again). This time I could edit subnets and add the necessary DHCP information.
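Since I ended up swapping PPAs more than once, the purge-and-reinstall dance is worth wrapping up. This is only a sketch of the steps I ran by hand, not anything Canonical documents; run and switch_maas_ppa are names I made up, and DRY_RUN=1 just prints the commands so you can read them before committing:

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Purge MAAS, swap one PPA for another, and reinstall.
# $1 = PPA to remove, $2 = PPA to add (both as owner/name).
# The && chain stops at the first step that fails.
switch_maas_ppa() {
    old_ppa=$1
    new_ppa=$2
    run sudo apt-get -y purge maas maas-dhcp maas-dns &&
    run sudo add-apt-repository -y --remove "ppa:$old_ppa" &&
    run sudo add-apt-repository -y "ppa:$new_ppa" &&
    run sudo apt-get update &&
    run sudo apt-get -y install maas maas-dhcp maas-dns
}
```

For the move described here that would be DRY_RUN=1 switch_maas_ppa maas-maintainers/experimental3 maas/next to preview, then the same call without DRY_RUN to do it for real.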

What didn’t go so well was DHCP on 16.04 + maas/next. While I had been able to commission the nodes on the experimental3 PPA (but not add subnets), DHCP firmly decided to break when I reinstalled MAAS 2.0 from the maas/next PPA. AppArmor broke it. (Nothing new there, but really?) Once I fixed AppArmor, the dhcpd config files in /var/lib/maas were gone. Once I replaced them, something in MAAS kept finding the dhcpd service up, and within 30 seconds would turn it back off again via systemd.
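For anyone chasing the same thing: AppArmor denials land in the kernel log as records containing apparmor="DENIED", so a small filter makes them easy to spot. A minimal sketch follows; apparmor_denials is my own helper name, and matching on the string dhcpd assumes the denied profile or path mentions it:

```shell
#!/bin/sh
# Read kernel log text on stdin and keep only AppArmor denial
# records that mention the given program or path.
# $1 = fixed string to match, e.g. dhcpd
apparmor_denials() {
    grep 'apparmor="DENIED"' | grep -F -- "$1"
}
```

Feed it with dmesg | apparmor_denials dhcpd, or redirect a saved log file into it.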

The MAAS 2.0 web UI has no direct control for turning subnet DHCP and DNS on and off, unlike the MAAS 1.9 UI dropdown under the now-missing Clusters Tab. Turning to the CLI to fix DHCP, I found the MAAS 2.0 documentation asked me to use MAAS commands that didn’t exist, so I filed another bug.

PXE boot worked fine and I Commissioned all my compute nodes, but none of them had their storage recognized.

Figuring that might come later, I moved on to sudo openstack-install, which doesn’t exist, so (after Googling) I found the command conjure-up openstack, which failed quickly and miserably.

Undaunted, I used the 30 seconds DHCP would stay up after service maas-dhcp start to PXE boot the compute nodes. That worked, but still no storage configured. Moving on, conjure-up openstack failed again, so I burned it all down.


Starting all over again

I blew away 16.04LTS, installed Ubuntu 14.04LTS on the controller, and started the how-to all over again from the start.

MAAS 1.9 started out well on 14.04LTS, until I went to PXE boot the compute nodes. Unlike the first two tries, it took many attempts to get each node to load an O/S, and then half of them failed on reboot. In the MAAS web UI I couldn’t edit any of the nodes, so I couldn’t set manual power mode on two of the nodes. It was very slow, but the other two finished commissioning. For every compute node, whatever its power state or status (New, Commissioning, or Ready), clicking on the node’s FQDN in the Nodes list yielded a blank page.

Pushing on, I decided two working compute nodes were enough, so I tried installing OpenStack. While 16.04 fails with conjure-up, 14.04 fails with sudo openstack-install.


In the logs, there was this badly formatted mess. It seems to be a trend nowadays: stuffing /var/log/syslog or messages with XML (very inappropriate), formatting codes (even more inappropriate), and/or just plain ugly dumps like this, backslash-n’s and all:

 Problem during bootstrap: '{'output': '', 'err': 'Bootstrapping environment "maas"\nStarting new instance for initial state server\nLaunching instance\nWARNING no architecture was specified, acquiring an arbitrary node\nBootstrap failed, destroying environment\nERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 400 BAD REQUEST ({"network": ["Node must be configured to use a network"]})\n', 'status': 1}'
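If you have to read one of these dumps, turning the literal backslash-n pairs back into real line breaks helps a lot. A minimal sketch using awk; unescape_newlines is just a name I picked, and it only handles the \n sequences, nothing else:

```shell
#!/bin/sh
# Replace each literal backslash-n pair on stdin with a real newline.
# In the awk regex, \\ matches one literal backslash.
unescape_newlines() {
    awk '{ gsub(/\\n/, "\n"); print }'
}
```

Assuming the message landed in /var/log/syslog, grep 'Problem during bootstrap' /var/log/syslog | unescape_newlines puts each step of the failed bootstrap on its own line, which makes the final 400 BAD REQUEST much easier to find.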

Another bad trend is the “just try these few commands” page, exemplified by the conjure-up.io web page. No context, no prerequisites, just try it and watch it fail.

The current state of the Canonical Autopilot web page is a weird amalgam of 14.04 and 16.04 instructions and repos that just doesn’t work. Canonical’s MAAS 2.0 documentation reflects almost none of the changes made since 2.0, and there are a lot. Autopilot is in a state of transition at the moment, a broken state. I wish them well, but can’t recommend it. I’m done wasting my time.

 

 


One thought on “Canonical Autopilot crashes and burns”

  1. I totally agree with you; I spent the last two days with this mess. I was able to get it to work, but it’s not suitable for an advanced network architecture. Moving to Mirantis Fuel; it looks like the most mature product of all those orchestration tools.
