How We Built A Microservices Platform
At REI, like many companies these days, we’ve been on a journey to refactor our monolithic application into microservices. A huge accelerator in this endeavor has been our homegrown microservices platform, which we’ve named Alpine. Eventually we hope to open source significant pieces of it, but in the meantime we’d like to share some of the lessons we’ve learned along the way.
A Little History
Once upon a time there was only one established path to production for web applications: our monolithic e-commerce site. This led to many unrelated admin tools and services being bolted on, because teams didn’t want to blaze a new path to production. Several years ago we realized this was unsustainable and set out to make it easier to create new, separate applications instead of tacking them onto our existing monolith.
Our initial goal was to make internal administration tools easier to create, because they were the most significant source of code in the monolith unrelated to its main purpose.
Stepping into Alpine
Our initial platform solved the “path to production” problem for internal administrative tools. These were the keys:
Pre-provisioning shared application servers
Provisioning servers at the time came with an extended lead time and couldn’t be automated. To work around this, we pre-provisioned a pool of servers that all of these applications shared. We used a configuration management tool to make it easy to add application server configs for each new application. It wasn’t completely automated, but it had a fairly quick turnaround.
Standardizing the Continuous Deployment Pipeline
At the time we’d been practicing continuous delivery (not deployment) of our monolith, doing daily deployments. We wanted to take the lessons learned from that pipeline and step up to full continuous deployment for the lower-risk applications on this platform. We kept most of the same steps as our monolith’s pipeline, but instead of a daily release window, applications deploy to production automatically once the integration tests pass in the lower environments.
Automating the creation of the Pipeline
Automate everything. We quickly realized that creating the 10+ Jenkins jobs every application needed had to be automated, and we discovered the excellent Jenkins Job DSL plugin to do the job. This allowed us to generate all the necessary jobs from a central configuration.
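To give a flavor of the approach, here’s a minimal, hypothetical Job DSL seed script (Job DSL scripts are written in Groovy). The registry loading, job names, and build steps are illustrative stand-ins, not our actual configuration:

```groovy
// Hypothetical seed script: each entry in our central configuration
// becomes a standard set of Jenkins jobs.
def apps = readApplicationRegistry() // placeholder for loading the central config

apps.each { app ->
    job("${app.name}-build") {
        scm {
            git(app.repoUrl)      // each app's repository comes from the config
        }
        triggers {
            scm('H/5 * * * *')    // poll for new commits
        }
        steps {
            gradle('clean build') // standardized build step
        }
        publishers {
            // successful builds flow straight into the deployment jobs
            downstream("${app.name}-deploy-test", 'SUCCESS')
        }
    }
}
```

The appeal is that a pipeline change is made once in the seed script and regenerated for every application.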
Creating a base application framework
By building inside the existing monolith, teams got a bunch of things for free: they didn’t have to configure Spring, Hibernate, RESTEasy, or any of our internal integrations. Standalone applications needed that same head start. Enter Spring Boot. We built a base application framework on top of Spring Boot called Crampon (get it? :D), which configures all of this base plumbing automatically and consistently for all of our services.
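Crampon leans on Spring Boot’s auto-configuration pattern to do this. Here’s a minimal sketch of that pattern, with hypothetical names rather than Crampon’s actual internals: a sensible default bean is registered for every service unless a team supplies its own.

```java
import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical example of the Spring Boot auto-configuration pattern.
@Configuration
public class MetricsAutoConfiguration {

    // Stand-in type so the example is self-contained.
    public static class MetricsClient {
        public MetricsClient(String host, int port) { }
    }

    // Every service gets a metrics client by default; a team can override it
    // simply by declaring its own MetricsClient bean.
    @Bean
    @ConditionalOnMissingBean(MetricsClient.class)
    public MetricsClient metricsClient() {
        return new MetricsClient("statsd.internal", 8125); // hypothetical defaults
    }
}
```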
Application Registry
One of our biggest keys to success was creating a central registry of all of our applications, along with metadata about them: which environments they have, who maintains them, database connection information, and so on. This let us drive lots of automation from that data. Our generated pipelines, deployment tooling, and many scripts all leverage it, and the registry gives us a single place to change things.
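As a rough illustration, a registry entry might look something like this; the field names are hypothetical, not our actual schema:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the metadata a registry entry carries.
public class ApplicationEntry {
    private String name;                      // e.g. "item-service"
    private List<String> maintainers;         // who owns and gets paged for it
    private List<String> environments;        // e.g. ["dev", "test", "prod"]
    private Map<String, String> databaseUrls; // connection info per environment
    private Map<String, String> baseUrls;     // service URL per environment
}
```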
Client Jars and Easy Inter-Service Communication
Every microservice produces a client jar to make consuming the services it exposes simple. Crampon uses RESTEasy and its client proxy capability, along with annotated interfaces shared between client and server, to make publishing a client jar almost automatic. Crampon can then automatically create Spring beans for all annotated interfaces, using information in the Application Registry to target them at the correct URL.
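Here’s a sketch of that pattern in plain RESTEasy; the `ItemService` interface and URL are hypothetical, and Crampon does the proxy creation and URL lookup for you:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import org.jboss.resteasy.client.jaxrs.ResteasyClient;
import org.jboss.resteasy.client.jaxrs.ResteasyClientBuilder;

// The annotated interface ships in the client jar and is shared with the server.
@Path("/items")
interface ItemService {
    @GET
    @Path("/{id}")
    @Produces("application/json")
    String getItem(@PathParam("id") long id); // a real service would return a DTO
}

public class ItemClientExample {
    public static void main(String[] args) {
        // Crampon does the equivalent of this automatically, resolving the
        // base URL for the current environment from the Application Registry.
        ResteasyClient client = new ResteasyClientBuilder().build();
        ItemService items = client
                .target("https://item-service.example.com") // hypothetical URL
                .proxy(ItemService.class);
        System.out.println(items.getItem(42L));
        client.close();
    }
}
```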
Evolution of Alpine
One of the key goals we had from the get-go was the idea of a button someone could push to create an entire application and everything it needed to run and operate in a real production environment, out of the box. There were too many manual steps involved in setting up new applications at this point, so we set out to change that.
Docker
When Docker came on the scene we decided to rebase our entire platform around it. We switched our core application framework to use an embedded container with Spring Boot instead of deploying to an application server. Instead of trying to manage a bunch of application server clusters that each required manual setup per application, we were able to create a large common cluster of Docker hosts where we deployed containers.
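The embedded container is what makes this uniform: every service boils down to a runnable jar, so a single container recipe fits all of them. A minimal Spring Boot entry point (illustrative names):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// With an embedded container there is no application server to install:
// the service is just `java -jar app.jar`, which containerizes trivially.
@SpringBootApplication
public class ExampleService {
    public static void main(String[] args) {
        SpringApplication.run(ExampleService.class, args);
    }
}
```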
Pave As You Go
As we grew we realized there was no need to provision environments until they were actually needed. As we made creating applications easier, more of them got created, many of them experimental, and we knew we couldn’t sustain creating production environments for all of them before they were really necessary. In our new world, databases, load balancers, and DNS records are created on the fly when an application is first deployed to an environment. Given our infrastructure at the time, this sometimes required creative solutions to get access to some systems.
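The idea itself is simple. In sketch form, with the `Provisioner` interface and in-memory bookkeeping standing in for real database, load balancer, and DNS automation:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of pave-as-you-go provisioning; none of this is our
// actual tooling.
public class PaveAsYouGo {
    interface Provisioner {
        void create(String app, String env);
    }

    private final Set<String> paved = new HashSet<>();
    private final Provisioner[] provisioners; // database, load balancer, DNS, ...

    PaveAsYouGo(Provisioner... provisioners) {
        this.provisioners = provisioners;
    }

    // Called on every deploy; only does work the first time an application
    // lands in a given environment.
    void ensureEnvironment(String app, String env) {
        if (!paved.add(app + ":" + env)) {
            return; // already paved
        }
        for (Provisioner p : provisioners) {
            p.create(app, env);
        }
    }
}
```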
Build in Security at the Start
We realized that to get security right, it’s best to build it in early. From the start we only ever exposed Alpine applications over HTTPS. To encourage least-privilege access, we built a permissions management tool, service-to-service authentication, and encryption tooling as out-of-the-box components. Our goal was to make it easier to do the right thing than the wrong thing.
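As one small illustration of a baked-in default, here’s what forcing HTTPS everywhere can look like; this is generic Spring Security, not our actual implementation, which lives in Crampon so individual apps never write it:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

// Illustrative HTTPS-only default; a platform can ship this so every
// service inherits it without writing any security configuration.
@Configuration
@EnableWebSecurity
public class HttpsOnlyConfig extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.requiresChannel().anyRequest().requiresSecure() // redirect HTTP to HTTPS
            .and()
            .authorizeRequests().anyRequest().authenticated(); // no anonymous access
    }
}
```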
Self-Service Application Creation
To accelerate the growth of our platform we created “the button”: a self-service web page where engineers can create a new application from scratch. The page asks a few questions about the application and then provisions a new source repository, a base project template using EZ-Up, a build pipeline, an app model, and base permissions. Having this capability really unlocked experimentation and adoption of our platform. Teams were able to stand up new applications entirely on their own, and often the only way our team knew about it was the notifications we get for new production deployments.
Make Resiliency Automatic
Services consuming Crampon client jars automatically get set up with Hystrix circuit breakers for the target application. We set up each REST service with sane default settings and give service owners the ability to customize as needed. In conjunction with the circuit breakers we have two built-in fallback mechanisms: a fallback Spring bean that implements the same interface as the client, and a fallback cache on S3. With the S3 fallback cache (only for services where it’s specifically enabled), previously successful responses get written to S3 by the server, and the client can fall back to reading them if the service is down.
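A simplified sketch of the circuit breaker plus fallback-bean pattern, reusing the hypothetical `ItemService` interface from earlier; this is plain Hystrix, and Crampon wires the equivalent up automatically:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical command guarding one remote call.
public class GetItemCommand extends HystrixCommand<String> {
    private final ItemService items;    // the RESTEasy client proxy
    private final ItemService fallback; // a fallback bean with the same interface
    private final long id;

    public GetItemCommand(ItemService items, ItemService fallback, long id) {
        super(HystrixCommandGroupKey.Factory.asKey("item-service"));
        this.items = items;
        this.fallback = fallback;
        this.id = id;
    }

    @Override
    protected String run() {
        return items.getItem(id); // the normal remote call, guarded by the breaker
    }

    @Override
    protected String getFallback() {
        // Invoked when the call fails or the circuit is open. The S3 fallback
        // cache works the same way: read the last good response instead.
        return fallback.getItem(id);
    }
}
```

Calling `new GetItemCommand(items, fallback, 42L).execute()` runs the request through the breaker and switches to the fallback transparently when the target service is struggling.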
Monitor Everything
From the very beginning of our platform we baked in monitoring. We use a combination of StatsD and New Relic for application monitoring. Out of the box, Crampon-based services record a whole host of basic information, such as response times and codes for every REST service they expose. We are then able to automatically generate a Grafana dashboard with alerts for the most basic error conditions, such as 500 responses.
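To make that concrete, here’s roughly what recording per-endpoint metrics looks like, using the open-source java-statsd-client as a stand-in; the prefix and metric names are hypothetical:

```java
import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

// Illustrative of the per-endpoint metrics recorded automatically.
public class MetricsExample {
    private static final StatsDClient statsd =
            new NonBlockingStatsDClient("crampon.item-service", "localhost", 8125);

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // ... handle the request ...
        statsd.recordExecutionTime("items.get.time", System.currentTimeMillis() - start);
        statsd.incrementCounter("items.get.status.200");
        statsd.stop();
    }
}
```

Because every service emits the same metric shapes, dashboards and alerts can be generated rather than hand-built.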
Key Takeaways
To summarize what we’ve learned over the course of several years building this platform:
- Automate Everything
- Having a central place to configure common things enables lots of automation
- Make doing the right thing the easy thing
- Build as much “out of the box” as possible
We’ve spent a lot of effort over the last several years getting to where we are, and hopefully some of our lessons learned can be helpful in your own journey.