

As part of our continuous effort to bring economical computing with high availability to containerized workloads such as Kubernetes, Amazon Elastic Container Service, and Docker Swarm, we are excited to share our Elastigroup and etcd2 integration guide.
CoreOS-etcd
etcd is an open-source distributed key-value store that provides shared configuration and service discovery for Container Linux clusters. etcd runs on each machine in a cluster and gracefully handles leader election during network partitions and the loss of the current leader.
Application containers running on your cluster can read and write data into etcd. Common examples are storing database connection details, cache settings, feature flags, and more.
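For example, a container can read and write keys through the etcd2 HTTP keys API on any cluster member. This is a minimal sketch; the endpoint address and key name below are illustrative, not part of the setup described in this guide:

```shell
# Hedged sketch: a feature flag stored in etcd2 via the v2 keys HTTP API.
# Assumes a cluster member is reachable on the standard client port 2379.
ETCD_ENDPOINT="http://127.0.0.1:2379"      # illustrative local member
KEY="config/feature-flags/new-ui"          # illustrative key name

# Write a value (etcd2 v2 keys API):
#   curl -s -X PUT "${ETCD_ENDPOINT}/v2/keys/${KEY}" -d value="enabled"
# Read it back:
#   curl -s "${ETCD_ENDPOINT}/v2/keys/${KEY}"

# The full key URL a container on the cluster would use:
KEY_URL="${ETCD_ENDPOINT}/v2/keys/${KEY}"
echo "$KEY_URL"
```

The same operations are available through `etcdctl set` / `etcdctl get` for interactive use.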
Clustering etcd on spot instances
As you may already know, the etcd2 discovery service does not handle post-bootstrap members joining and leaving the cluster well. In this tutorial we use a different method that reduces dependencies on external systems.
Automate
Our script starts by querying the Spotinst API to fetch the Elastigroup ID (for member discovery) and using the AWS instance metadata service to get the instance ID and instance IP. This information is written to a file that etcd2 is instructed to load on startup.
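The discovery flow can be sketched as follows. The metadata calls are shown commented out with placeholder values standing in, and the peer list and file path are assumptions modeled on the user data further below, so the rendering logic is self-contained:

```shell
# Sketch of the discovery step: gather this machine's identity and render
# the environment file that etcd2 loads on startup.

# In the live script these come from the AWS instance metadata service:
#   instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   instance_ip=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
instance_id="i-0123456789abcdef0"   # placeholder
instance_ip="10.0.1.15"             # placeholder

# The peer list would come from the Spotinst API (the Elastigroup's
# running members); hardcoded here for illustration.
peers="${instance_id}=http://${instance_ip}:2380,i-peer2=http://10.0.1.16:2380"

peers_file="./peers"   # live path: /home/core/spotinst_etcd/peers
cat > "$peers_file" <<EOF
ETCD_NAME=${instance_id}
ETCD_INITIAL_CLUSTER=${peers}
ETCD_INITIAL_CLUSTER_STATE=new
EOF
```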
Cluster Membership
etcd expects instances to remove themselves from the cluster prior to termination. Since we are working with Spot instances, we added some additional logic to make the solution more robust, so that Spot instances can be replaced without problems.
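To remove itself, an instance first needs its own member ID, which our scripts extract from `etcdctl member list` output. The parsing can be isolated and checked against a sample line (the sample member ID and IPs below are illustrative):

```shell
# Parse this machine's member ID out of `etcdctl member list` output
# (etcd2 format: "<id>: name=<name> peerURLs=... clientURLs=...").
# A captured sample line stands in for the live command.
sample_output='2428c6ed59408341: name=node1 peerURLs=http://10.0.1.15:2380 clientURLs=http://10.0.1.15:2379'
COREOS_PRIVATE_IPV4="10.0.1.15"

etcd_member_id=$(echo "$sample_output" | grep "$COREOS_PRIVATE_IPV4" \
  | awk '{print $1}' | awk -F':' '{print $1}')
echo "$etcd_member_id"
# The live script then runs: etcdctl member remove "$etcd_member_id"
```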
Cleanup of “bad” members
Once added to our user data, our script handles cleanup in two ways to make sure no instance is left behind:
- The instances ask the Spotinst API for their status every 30 seconds and deregister themselves from the etcd cluster when the "TERMINATING" status appears.
- When a new instance comes up, this process compares the list of members reported by etcd against the list of running machines in the Elastigroup. Once the bad host(s) are found, we send a REST call to a healthy cluster member to remove them from the cluster. This removes instances that have been replaced.
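The second comparison can be sketched as set subtraction over the two IP lists. Hardcoded lists stand in for the live `etcdctl member list` and Spotinst API responses, and the endpoint in the comment is illustrative:

```shell
# Sketch of the "bad member" comparison: IPs reported by etcd vs. IPs of
# instances currently running in the Elastigroup.
etcd_member_ips="10.0.1.15
10.0.1.16
10.0.1.99"            # 10.0.1.99 was replaced and is stale
elastigroup_ips="10.0.1.15
10.0.1.16
10.0.1.17"            # 10.0.1.17 is the replacement

# Members known to etcd but absent from the group are stale.
stale=$(comm -23 <(echo "$etcd_member_ips" | sort) \
                 <(echo "$elastigroup_ips" | sort))
echo "stale members: $stale"
# For each stale IP the live process would look up its member ID and call
# a healthy member's etcd2 members API, e.g.:
#   curl -X DELETE "http://<healthy-member>:2379/v2/members/<member-id>"
```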
```yaml
#cloud-config
coreos:
  etcd2:
    advertise-client-urls: "http://$private_ipv4:2379"
    initial-advertise-peer-urls: "http://$private_ipv4:2380"
    listen-client-urls: "http://0.0.0.0:2379,http://0.0.0.0:4001"
    listen-peer-urls: "http://$private_ipv4:2380"
  units:
    - name: etcd2.service
      command: stop
    - name: spotinst-etcd-discovery.service
      command: start
      content: |
        [Unit]
        Description=Spotinst Elastigroup discovery
        [Service]
        ExecStartPre=/bin/bash -c '/home/core/spotinst_etcd/discovery.sh'
        ExecStart=/usr/bin/systemctl start etcd2
    - name: fleet.service
      command: start
    - name: spotinst-etcd-termination.service
      content: |
        [Unit]
        Description=Validate spot server status
        [Service]
        EnvironmentFile=/etc/environment
        Type=oneshot
        ExecStart=/bin/bash -c '/home/core/spotinst_etcd/termination.sh'
    - name: spotinst-etcd-termination.timer
      command: start
      content: |
        [Unit]
        Description=Check spot instance status
        [Timer]
        OnCalendar=*:*:0/30
        Persistent=true
write_files:
  - content: |
      [Service]
      EnvironmentFile=/home/core/spotinst_etcd/peers
    path: /run/systemd/system/etcd2.service.d/30-etcd_peers.conf
    permissions: "0644"
  - content: |
      #!/bin/bash
      pkg="spotinst_etcd_termination"
      version="0.0.1"
      spotinst_token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      # Create directory to write output to
      mkdir -p /home/core/spotinst_etcd/
      ec2_instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
      if [[ ! $ec2_instance_id ]]; then
        echo "$pkg: failed to get instance id from instance metadata"
        exit 2
      fi
      if ! etcdctl member list | grep -q "$COREOS_PRIVATE_IPV4"; then
        echo "machine not found in etcd, restarting etcd2"
        rm -rf /var/lib/etcd2/*
        systemctl restart etcd2
      fi
      ec2_instance_status=$(curl -s -X GET -H "Content-Type: application/json" \
        -H "Authorization: Bearer ${spotinst_token}" \
        "https://help.spotinst.io/aws/ec2/instance/${ec2_instance_id}" \
        | jq '.response | .items[0] | .lifeCycleState')
      echo "ec2_instance_status=$ec2_instance_status"
      if [[ $ec2_instance_status = *"TERMINATING"* ]]; then
        etcd_member_id=$(etcdctl member list | grep "$COREOS_PRIVATE_IPV4" \
          | awk '{print $1}' | awk -F':' '{print $1}')
        echo "removing etcd member from cluster: $etcd_member_id"
        etcdctl member remove "$etcd_member_id"
      fi
    path: /home/core/spotinst_etcd/termination.sh
    permissions: "0777"
  - content: |
      #!/usr/bin/env bash
      curl -fsSL https://s3.amazonaws.com/spotinst-labs/etcd-cluster/elastigroup-discovery.sh | \
        SPOTINST_TOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
        bash
    path: /home/core/spotinst_etcd/discovery.sh
    permissions: "0777"
```
Note: This script cannot handle scale-up at this time and will only maintain the initial cluster size.
Conclusion
In this write-up we created a functional etcd cluster running on Spot instances with automatic recovery. If you are interested in testing your own Spot-based etcd cluster, please give it a try and share your results in the comments below!
-The Spotinst Team