A significant piece of our Alpine Platform is the deployment service we call Sherpa. It can deploy both on-premises and to AWS using a blue-green deployment strategy. The cloud is particularly well suited to this strategy because you don’t need to keep the inactive servers around. On ECS (Amazon Elastic Container Service), however, there’s no documented way of performing deployments with that strategy, so we had to figure it out through trial and error. We hope our experience is helpful to others.
The Alpine ECS Architecture
We have a fairly standard setup for deploying ECS services. Each app has two ALBs, one for production and one for all non-production environments, and we use host-based routing to direct traffic to the right target group. Each ALB has its own security group, and so does each ECS cluster, which lets us minimize the number of ports open to the ECS cluster.
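As a rough illustration of host-based routing on a shared ALB, the sketch below builds the arguments for the `create_rule` call of the boto3 `elbv2` client. The helper name `host_rule_request`, the hostnames, the ARNs, and the priorities are all hypothetical placeholders, not values from our setup.

```python
def host_rule_request(listener_arn, host, target_group_arn, priority):
    """Build keyword arguments for elbv2.create_rule that forward one
    hostname to one target group."""
    return {
        "ListenerArn": listener_arn,
        "Priority": priority,  # must be unique per listener
        "Conditions": [
            {"Field": "host-header", "HostHeaderConfig": {"Values": [host]}}
        ],
        "Actions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


# Hypothetical: two non-production environments sharing one ALB listener.
rules = [
    host_rule_request("listener-arn", "orders.dev.example.com", "tg-dev-arn", 10),
    host_rule_request("listener-arn", "orders.qa.example.com", "tg-qa-arn", 20),
]
```

Each dictionary could then be passed along as `boto3.client("elbv2").create_rule(**rules[0])`, assuming real ARNs are substituted for the placeholders.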
The Sherpa Deployment Lifecycle
A Sherpa deployment (whether it’s targeting AWS or on-premises) starts with what we call the STAGE lifecycle phase.
Every deployment is given a unique ID which is used in many places.
For AWS targeted deployments, at a high level, we:
- Create an IAM role for this application / environment if necessary
- Create the ALB if necessary
- Create the Target Group with a name specific to this deployment using the Deployment ID
- Create the ALB listener configured to listen to this application’s load balanced port
- Add a Host routing rule for the ALB and create a DNS entry if necessary
- Create a new Task Definition with this deployment’s configuration (each application / environment gets its own Task Definition Family)
- Create a new ECS Service, uniquely named with the Deployment ID, targeting the newly created Target Group
- Wait for the ECS Service to reach a steady state
The key thing that’s unique about our blue-green deployment strategy is that we create a brand new ECS service and ALB Target Group for every single deployment.
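The core of the STAGE steps above might look roughly like the following sketch. It is not Sherpa’s code: the function names, the naming scheme, and the parameters are assumptions, and the IAM, ALB, listener, and DNS steps are left out. The `ecs` and `elbv2` arguments are expected to behave like the boto3 clients of the same names.

```python
import re


def deployment_names(app, env, deployment_id):
    """Derive resource names for one deployment (hypothetical naming scheme)."""
    base = re.sub(r"[^A-Za-z0-9-]", "-", f"{app}-{env}-{deployment_id}")
    return {
        # Target group names allow at most 32 characters and cannot start or
        # end with a hyphen, so keep the tail, which carries the deployment ID.
        "target_group": base[-32:].strip("-"),
        "service": base,
        "family": f"{app}-{env}",  # one Task Definition Family per app / env
    }


def stage(ecs, elbv2, *, cluster, vpc_id, listener_arn, stage_host,
          rule_priority, app, env, deployment_id, container_definitions,
          container_port, desired_count=2):
    """Create the per-deployment target group, stage rule, task definition
    and ECS service, then block until the service is stable."""
    names = deployment_names(app, env, deployment_id)

    # Assumption: instance targets over HTTP; awsvpc tasks need TargetType="ip".
    tg_arn = elbv2.create_target_group(
        Name=names["target_group"], Protocol="HTTP",
        Port=container_port, VpcId=vpc_id,
    )["TargetGroups"][0]["TargetGroupArn"]

    # The target group must be attached to a listener before ECS accepts it,
    # so the stage host rule is created ahead of the service.
    rule_arn = elbv2.create_rule(
        ListenerArn=listener_arn, Priority=rule_priority,
        Conditions=[{"Field": "host-header",
                     "HostHeaderConfig": {"Values": [stage_host]}}],
        Actions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
    )["Rules"][0]["RuleArn"]

    task_def_arn = ecs.register_task_definition(
        family=names["family"], containerDefinitions=container_definitions,
    )["taskDefinition"]["taskDefinitionArn"]

    ecs.create_service(
        cluster=cluster, serviceName=names["service"],
        taskDefinition=task_def_arn, desiredCount=desired_count,
        loadBalancers=[{
            "targetGroupArn": tg_arn,
            "containerName": container_definitions[0]["name"],
            "containerPort": container_port,
        }],
    )

    # Blocks until the service reports a steady state (or the waiter times out).
    ecs.get_waiter("services_stable").wait(
        cluster=cluster, services=[names["service"]])

    return {"service": names["service"], "target_group_arn": tg_arn,
            "stage_rule_arn": rule_arn, "task_definition_arn": task_def_arn}
```

Passing the clients in as arguments, for example `stage(boto3.client("ecs"), boto3.client("elbv2"), ...)`, keeps the sketch easy to exercise against stubs before pointing it at a real account.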
Our overall deployment pipeline is orchestrated by Jenkins, and the next step after staging a deployment is to run a set of tests on the newly created stage cluster. If those pass, we move on to the activation phase. Activating a deployment is the easy part and happens very fast. We:
- Update the Host routing rules to remove the stage route and add or update the rule for the active cluster
- Create the DNS entry if necessary
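A minimal sketch of the rule swap, assuming an `elbv2` object that behaves like the boto3 client: it repoints (or creates) the host rule for the active hostname and then removes the stage rule. The function names and parameters are hypothetical, the ordering of the two changes is our choice for the example, and the DNS step and pagination of `describe_rules` are omitted.

```python
def rule_hosts(rule):
    """Return the hostnames matched by a listener rule's host-header conditions."""
    hosts = []
    for cond in rule.get("Conditions", []):
        if cond.get("Field") == "host-header":
            config = cond.get("HostHeaderConfig", {})
            hosts.extend(config.get("Values") or cond.get("Values", []))
    return hosts


def activate(elbv2, *, listener_arn, active_host, stage_rule_arn,
             target_group_arn, priority):
    """Send traffic for active_host to the new target group, then remove
    the stage route."""
    forward = [{"Type": "forward", "TargetGroupArn": target_group_arn}]
    rules = elbv2.describe_rules(ListenerArn=listener_arn)["Rules"]
    existing = next((r for r in rules if active_host in rule_hosts(r)), None)

    if existing:
        # Update: the rule for the active host already exists, so only its
        # forward action changes.
        elbv2.modify_rule(RuleArn=existing["RuleArn"], Actions=forward)
    else:
        # Add: first activation for this host.
        elbv2.create_rule(
            ListenerArn=listener_arn, Priority=priority,
            Conditions=[{"Field": "host-header",
                         "HostHeaderConfig": {"Values": [active_host]}}],
            Actions=forward,
        )

    # The stage route is no longer needed once the active rule is in place.
    elbv2.delete_rule(RuleArn=stage_rule_arn)
```

Because the switch is a single rule change on the load balancer, no containers are started or stopped during activation, which fits with how quickly this phase completes.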
Once a deployment has been activated, we can tear down the previously active deployment. We’ve learned the hard way that connections to the previously active cluster can take a couple of minutes to drain completely, so we wait a few minutes before we:
- Delete the ECS Service for the deployment
- Deactivate the Task Definition, which essentially just hides it in the UI
- Delete the Target Group
We’ve been successfully deploying micro-services to production with this strategy for about a year, and it has worked very well for us. While ECS is more limited than something like Kubernetes, we’ve been very happy with how reliable it’s been, even if we’ve had to do some creative things to make our deployment strategy work.