Marek Skrobacki

Deploying CoreOS cluster with etcd secured by TLS/SSL

With Docker maturing into a stable release, I decided it was time to finally migrate some of my projects to a Docker-based production environment.

Having tested different deployment systems, some time ago I settled on CoreOS, mainly because it's lightweight, secure, and seems to be the most mature option at the moment. One of the biggest benefits for me is the "batteries included" approach - with the default installation you get:

  • Docker - the container engine
  • etcd - for storing arbitrary information that is available to the whole cluster
  • fleet - for container management

During the last few months I have been working on an application designed with Docker support in mind from the very beginning. For the purposes of this article, let's just assume it's a typical web app that requires multiple settings, which are usually exposed to the containers through environment variables. Given that the app is still in alpha, it was a perfect candidate to be tested on a CoreOS cluster.

While I confirmed that it works well on a local Docker host, deploying it securely into production on a remote cluster is a completely different story. The development version was launched through fleet, and all settings (including credentials) were hard-coded into either the container filesystem or the fleet/systemd unit files. This is obviously a big no-no for any sort of production app.

Solution: storing credentials in etcd, the non-secure way

After evaluating my options, I decided the best way to store all those settings would be to simply put them into the etcd cluster during the deployment phase. Specially crafted fleet unit files would later retrieve them using etcdctl and inject the relevant values into the containers.
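
During deployment, seeding a setting is a single command (the key matches the unit file below; the value here is just a placeholder):

# store the database password under a well-known key
etcdctl set /services/myApp/dbpass 's3cr3t-placeholder'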

An example unit file would look something like this:

[Unit]
Description=myApp
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/etc/environment
TimeoutStartSec=20m
ExecStartPre=/bin/sh -c "IMAGE=docker.example.com:6000/myApp; docker history $IMAGE >/dev/null || docker pull $IMAGE"
ExecStartPre=/bin/sh -c "docker inspect myApp-%i >/dev/null && docker rm -f myApp-%i || true"
# systemd does not run a shell, so wrap the command in /bin/sh; $$ escapes
# to a literal $ so the substitution happens in the shell, not in systemd
ExecStart=/bin/sh -c "DBPASS=$$(etcdctl get /services/myApp/dbpass); exec /usr/bin/docker run --name myApp-%i --rm=true -p %i:3000 -e HOST=$COREOS_PRIVATE_IPV4 -e DB_PASSWORD=$$DBPASS docker.example.com:6000/myApp"

[X-Fleet]
X-Conflicts=myApp@%i.service

In the excerpt above, systemd executes etcdctl to retrieve the value stored in etcd under the /services/myApp/dbpass key. This value is then passed into the Docker container as an environment variable (-e DB_PASSWORD).

While this approach may seem overly complex, it eliminates the need to expose the whole etcd cluster to the container. You may think it would be easier to just access etcd directly from the container, but that would be a clear violation of the principle of least privilege. Also, authorization in etcd is practically non-existent.

Securing etcd with TLS

The method described above works well, but has two major security problems:

  • anyone with network access to the etcd cluster can retrieve your precious secrets simply by sending a query to etcd. In some cases this can be partially mitigated by strict firewall rules, but there is no easy way to prevent unauthorized access if you don't control the network.
  • if communication between etcd nodes happens over a shared or unsecured network, the traffic can be sniffed and decoded.

There is a mechanism that addresses both of those problems. You guessed it. It's TLS. The authors of etcd decided to use it for authenticating cluster nodes with each other, as well as for encrypting the communication.

The app I talked about in previous paragraphs uses credentials for multiple different systems and requires maximum security in production, so I decided to try deploying a CoreOS cluster with TLS-based security.

After spending several hours reading through the minimal (and often incorrect) etcd security model documentation, I managed to get the cluster up and working.

For my lab purposes I used a couple of OpenStack instances. To achieve additional security I opted for an in-house discovery cluster rather than using discovery.etcd.io. The idea was to keep this deployment completely isolated from the Internet.

Unfortunately I was not able to fully automate the provisioning, but I got pretty close. At a high level, it consists of two phases:

  • provisioning cloud instances with the newest CoreOS
  • generating and uploading TLS certificates to each of the nodes

Provisioning cloud instances with CoreOS

In order to create a new CoreOS instance, you will need the following information:

  • ID of the CoreOS image. In my case, it was afb5ee19-4e6e-42c3-841c-9663e99b83ba. You can get a list of images by running nova image-list.
  • Flavor number that you are going to use. Use nova flavor-list to get a list.
  • Name of your public SSH key that will get injected into new hosts. If you uploaded it before, use the name reported by nova keypair-list. If not, use nova keypair-add to upload it.
  • Discovery address that the nodes will use to discover each other. Generate a new one at https://discovery.etcd.io/new or against your internal etcd cluster, as shown below.
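
A minimal sketch of both options (the internal address below is a placeholder; the UUID just needs to be unique per cluster):

# public discovery service - returns a fresh discovery URL
curl https://discovery.etcd.io/new
# isolated deployment - point the nodes at a unique key on your
# internal discovery cluster instead (substitute your own address)
echo "http://10.0.0.38:4001/v2/keys/_etcd/registry/$(uuidgen)"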

This information will be required in your cloud-init user-data file and in the script that boots the instances.

Mass boot script

I ended up with the following script to provision the instances:

#!/usr/bin/env bash
#
# Usage: ./provision-rs-cluster.sh <key pair name> [flavor]
#

set -e
if [ -z "$1" ]; then
  echo 'Usage: ./provision-rs-cluster.sh <key pair name> [flavor]'
  exit 1
fi
if [ -z "$2" ]; then
  FLAVOR="3" #1024MB instance
else
  FLAVOR=$2
fi

if ! which nova > /dev/null; then
  echo 'Please install nova client and ensure that it is in $PATH.'
  exit 1
fi
if [ -z "$NUM_INSTANCES" ]; then
    NUM_INSTANCES=5
fi
i=1
while [[ $i -le $NUM_INSTANCES ]]; do
    echo "Provisioning coreos-$i..."
    nova boot --image afb5ee19-4e6e-42c3-841c-9663e99b83ba --flavor $FLAVOR --key-name $1 --user-data user-data --config-drive true coreos-$i
    sleep 3
    ((i = i + 1))
done
echo "Your cluster has been successfully deployed."

cloud-init

Obviously, this script wouldn't be of any use without a user-data file that takes care of the post-installation setup. You can get detailed info about cloud-config in the CoreOS documentation chapter about customization, but in summary some of its responsibilities are:

  • setting up etcd's discovery address and other parameters
  • deploying additional systemd units

Please note that CoreOS does not implement the full cloud-config syntax.

My user-data file looks like this:

#cloud-config
---
coreos:
  etcd:
    # use unique, per-cluster value here
    discovery: http://10.XXX.XXX.38:4001/v2/keys/_etcd/registry/280F0E1C-B9FE-4D71-B9DF-CBAD07DC42B9
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
  - name: docker.service
    content: |
      [Unit]
      Description=Docker Application Container Engine
      Documentation=http://docs.docker.io

      [Service]
      ExecStartPre=/bin/mount --make-rprivate /
      # Run docker but don't have docker automatically restart
      # containers. This is a job for systemd and unit files.
      ExecStart=/usr/bin/docker -d -s=btrfs -r=false -H fd://

      [Install]
      WantedBy=multi-user.target
  - name: docker-tcp.socket
    command: start
    content: |
      [Unit]
      Description=Docker Socket for Remote API

      [Socket]
      ListenStream=4243
      Service=docker.service
      BindIPv6Only=both

      [Install]
      WantedBy=sockets.target
  - name: etcd.service
    command: start
  - name: fleet.service
    command: start
    content: |
      [Unit]
      Description=fleet

      [Service]
      Environment=FLEET_PUBLIC_IP=$private_ipv4
      ExecStart=/usr/bin/fleet

This is a pretty vanilla user-data file that will give you a standard, unsecured cluster deployment.
In order to enable TLS functionality on the newly deployed nodes, we have to:

  • reconfigure the etcd unit with the following environment variables:
    • ETCD_CA_FILE - location of the CA certificate used to sign client certificates
    • ETCD_CERT_FILE - location of the certificate used for communication with clients
    • ETCD_KEY_FILE - location of the private key used for communication with clients
    • ETCD_PEER_CA_FILE - location of the CA certificate used to sign peer certificates. In my case it was the same CA.
    • ETCD_PEER_CERT_FILE - location of the certificate used for communication with other etcd nodes
    • ETCD_PEER_KEY_FILE - location of the private key used for communication with other etcd nodes
  • deliver the files referenced by the above variables to those nodes (see below)

Unfortunately CoreOS does not support SaltStack (and probably other CM systems either), so I had to modify the user-data file so that etcd boots immediately with the above environment variables. This can be achieved with systemd drop-ins, which are just extensions to existing unit files. I added the following at the end of my user-data, in the write_files section:

write_files:
  - path: /run/systemd/system/etcd.service.d/30-certificates.conf
    permissions: 0644
    content: |
      [Service]
      # Client Env Vars
      Environment=ETCD_CA_FILE=/etc/ssl/etcd/certs/ca.pem
      Environment=ETCD_CERT_FILE=/etc/ssl/etcd/certs/etcd-client.pem
      Environment=ETCD_KEY_FILE=/etc/ssl/etcd/private/etcd-client.pem
      # Peer Env Vars
      Environment=ETCD_PEER_CA_FILE=/etc/ssl/etcd/certs/ca.pem
      Environment=ETCD_PEER_CERT_FILE=/etc/ssl/etcd/certs/etcd-peer.pem
      Environment=ETCD_PEER_KEY_FILE=/etc/ssl/etcd/private/etcd-peer.pem
  - path: /etc/ssl/etcd/certs/ca.pem
    permissions: 0644
    content: |
      -----BEGIN CERTIFICATE-----
      MIIDXjCCAsegAwIBAgIJAJXzVr07dOwSMA0GCSqGSIb3DQEBBQUAMIHGMQswCQYD
      VQQGEwJVUzELMAkGA1UECAwCVFgxFDASBgNVBAcMC1NhbiBBbnRvbmlvMRIwEAYD
      ........... TRUNCATED ............
      Fswf5tfAmQviftvXd/wA8/DcsRWe/75xVF6UA3IpntHux0vVU1RUPvg+At/1urUJ
      d5A=
      -----END CERTIFICATE-----
  - path: /etc/ssl/etcd/certs/etcd-client.pem
    permissions: 0644
    content: |
      Please generate new certificate with cluster/ca/new_node_cert.sh
  - path: /etc/ssl/etcd/private/etcd-client.pem
    permissions: 0600
    content: |
      Please generate new key with cluster/ca/new_node_cert.sh
  - path: /etc/ssl/etcd/certs/etcd-peer.pem
    permissions: 0644
    content: |
      Please generate new certificate with cluster/ca/new_node_cert.sh
  - path: /etc/ssl/etcd/private/etcd-peer.pem
    permissions: 0600
    content: |
      Please generate new key with cluster/ca/new_node_cert.sh

Having completed that, I went ahead and provisioned 5 instances in the lab. CoreOS booted without a problem, but etcd obviously failed, as it did not have the required certificate files.

Generating certificates

One of the solutions suggested by the authors of CoreOS for generating the appropriate certificates is a project called etcd-ca. To be honest, I did not try it because I don't have a Go compiler on my machine and I thought it would be really easy to do with plain OpenSSL. Oh boy, was I wrong...

The cert generation process turned out to be the most complex element of the whole project, mainly due to the very poor documentation on the CoreOS project page. I'll tell you why it's poor in just a second.

Client certificate requirements

  • The IP address of the client has to be included as a subjectAltName on the certificate. Some of the docs mention that it can be just in the CN as well, but that didn't work for me. In order to get a subjectAltName you need to enable the relevant X.509 v3 extension.
  • The certificate has to have the Extended Key Usage extension enabled and allow TLS Web Client Authentication.

Peer certificate requirements

  • Similarly to the client certificate, the IP address has to be included in the SAN. See above for details.
  • The certificate has to have the Extended Key Usage extension enabled and allow TLS Web Server Authentication. There is no mention of this in the CoreOS docs.

To simplify things, I enabled both TLS Web Client Authentication and TLS Web Server Authentication for both client and peer certs.
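
To double-check that a generated certificate meets both requirements, you can inspect it with OpenSSL (the filename is just an example matching the generation script shown later):

# print the SAN and EKU sections of the certificate
openssl x509 -in certs/node-coreos-1-client.crt -noout -text \
    | grep -A1 -E 'Subject Alternative Name|Extended Key Usage'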


Now that you know the requirements for the certificates, you should realise that embedding them directly into the cloud-init user-data file is not going to be an option. Despite what the example on the CoreOS documentation page says, you cannot have the certificates pre-generated, as you don't know what the IP address of a future instance is going to be. This creates a bit of a chicken-and-egg problem, because the IP address is embedded in the certificate.

I solved it by putting placeholders into the relevant PEM files and, once the nodes are provisioned, replacing them with the actual certificates.

Generating and uploading TLS certificates

There are plenty of good tutorials on how to create your own CA with OpenSSL, but I wasn't able to find any that were specific to etcd and the strict requirements listed in the previous section.

After some experimentation I ended up with the following config: openssl.cnf
Compared with the default template, it contains two major modifications:

  • one related to Extended Key Usage - in line 90
  • one related to Subject Alternative Name - in lines 85 and 101
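
The relevant fragments look roughly like this (a sketch; section names and exact line numbers depend on your template). The SAN is pulled from an environment variable, which is how the generation script below injects each node's IP:

# openssl.cnf fragment (sketch)
[ ssl_client ]
basicConstraints = CA:FALSE
extendedKeyUsage = clientAuth, serverAuth
subjectAltName   = ${ENV::SAN}

[ v3_req ]
subjectAltName   = ${ENV::SAN}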

To generate new certificates for each node, I wrote the following script:

#!/bin/bash
NAME=$1
if [ -z "$NAME" ]; then
    echo "Please provide NAME of the node as an argument."
    echo "For example: ./new_node_cert.sh coreos-5 10.1.200.30"
    exit 1
fi
IP=$2

if [ -z "$IP" ]; then
    echo "Please provide NAME of the node as an argument."
    echo "For example: ./new_node_cert.sh coreos-5 10.1.200.30"
    exit 1
fi

# Generate CSR for CLIENT version of certificate
export SAN="IP:${IP}"
openssl req -config openssl.cnf -new -nodes -keyout private/node-$NAME-client.key \
    -subj "/C=US/ST=TX/L=San Antonio/O=NetOps/OU=Strategic/CN=$NAME Client/emailAddress=marek.skrobacki@example.com" \
    -batch \
    -out csr/node-$NAME-client.csr -days 3650

# Sign CLIENT version of certificate
openssl ca -config openssl.cnf -policy policy_anything \
    -extensions ssl_client -batch \
    -out certs/node-$NAME-client.crt -infiles csr/node-$NAME-client.csr

# Generate CSR for PEER version of certificate
export SAN="IP:${IP}"
openssl req -config openssl.cnf -new -nodes -keyout private/node-$NAME-peer.key \
    -subj "/C=US/ST=TX/L=San Antonio/O=NetOps/OU=Strategic/CN=$NAME Peer/emailAddress=marek.skrobacki@example.com" \
    -batch \
    -out csr/node-$NAME-peer.csr -days 3650

# Sign PEER version of certificate; the ssl_client extension enables both
# clientAuth and serverAuth (see previous section), so it is reused here
openssl ca -config openssl.cnf -policy policy_anything \
    -extensions ssl_client -batch \
    -out certs/node-$NAME-peer.crt -infiles csr/node-$NAME-peer.csr

Please note that the IP address used has to be the one assigned to the interface etcd listens on. If you use one interface for peer communication and a different one for client communication, you will likely need to modify the script accordingly.
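
For example, generating the certs for the first node (the IP is environment-specific):

./new_node_cert.sh coreos-1 10.1.200.26
# produces:
#   certs/node-coreos-1-client.crt   private/node-coreos-1-client.key
#   certs/node-coreos-1-peer.crt     private/node-coreos-1-peer.key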

I also wrote a little helper script that automatically SCPs the generated certificates to the relevant nodes, but it is environment-specific and out of scope for this article, so I'm not going to include it here.

The last step is to restart etcd. The etcd unit is most likely in a failed state, as it wasn't able to start due to the missing certificates, so something like systemctl start etcd.service on each node should do the trick.
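
If you have SSH access to all nodes, a quick loop does it (the hostnames match the provisioning script above; adjust the addresses to your environment):

for i in 1 2 3 4 5; do
    ssh core@coreos-$i sudo systemctl restart etcd.service
done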

Verification

After you have performed all the steps above, ensure that etcd has actually started by checking the logs with journalctl -u etcd or systemctl status etcd. If all went well, the daemon should have started and leader election should have taken place.

If you want to verify that the cluster works by simply running etcdctl ls, you may be in for a very unpleasant surprise. It turns out that the etcdctl binary included with the current alpha release of CoreOS does not support TLS... This is a bummer, but for now you can simply use curl as a client.

I used the following example command on one of the nodes:

curl --cert /etc/ssl/etcd/certs/etcd-client.pem \
        --cacert /etc/ssl/etcd/certs/ca.pem  \
        --key /etc/ssl/etcd/private/etcd-client.pem -v \
         https://127.0.0.1:4001/v2/keys/something

If you don't get TLS errors and receive any sort of HTTP response, you are all set.
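
You can also store the myApp credentials from the beginning of this article over the secured channel in the same way (the value is a placeholder):

curl --cert /etc/ssl/etcd/certs/etcd-client.pem \
        --cacert /etc/ssl/etcd/certs/ca.pem \
        --key /etc/ssl/etcd/private/etcd-client.pem \
        -XPUT -d value='s3cr3t-placeholder' \
        https://127.0.0.1:4001/v2/keys/services/myApp/dbpass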

Bonus: getting a new etcdctl with TLS support onto your cluster

As mentioned above, I need etcdctl to be invoked in fleet unit files, and it has to work with TLS encryption. It looks like TLS support was added in version 2.0, which is not included with CoreOS yet, but that does not stop us from downloading and using it. In order to automatically download the new version of etcdctl when nodes are provisioned, I extended my user-data file with the following purpose-built unit:

  - name: update-etcdctl.service
    command: start
    content: |
        [Unit]
        Description=updates etcdctl to v2.0.0-rc.1

        [Service]
        Type=oneshot
        # systemd requires absolute paths in Exec* directives
        ExecStart=/usr/bin/curl -L https://github.com/coreos/etcd/releases/download/v2.0.0-rc.1/etcd-v2.0.0-rc.1-linux-amd64.tar.gz -o /tmp/etcd-v2.0.0-rc.1-linux-amd64.tar.gz
        ExecStart=/usr/bin/tar zxf /tmp/etcd-v2.0.0-rc.1-linux-amd64.tar.gz -C /tmp
        ExecStart=/usr/bin/cp /usr/bin/etcdctl /usr/bin/etcdctl.backup
        ExecStart=/usr/bin/cp /tmp/etcd-v2.0.0-rc.1-linux-amd64/etcdctl /usr/bin
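
Once the unit has run, you can verify that the new binary speaks TLS (flag names as of the 2.0 release candidates):

etcdctl --ca-file /etc/ssl/etcd/certs/ca.pem \
        --cert-file /etc/ssl/etcd/certs/etcd-client.pem \
        --key-file /etc/ssl/etcd/private/etcd-client.pem \
        --peers https://127.0.0.1:4001 ls /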

Unfortunately, I don't think this will survive CoreOS updates. I still need to figure out what to do about that.