A story about moving big things in a safe way.
History (Very abridged)
Way back around 2015, REI was well into its microservice journey. A newly formed team was tasked with redeveloping the REI Outlet web experience, and the page designs called for a completely different UI from existing REI.com pages. The team wanted new development frameworks, but adding them to the existing monolith would have been daunting and risky. We decided to try creating a front-end microservice by bolting Thymeleaf, React, and a front-end build onto our existing microservice framework. Getting everything up and running was remarkably easy, and the site was created in short order.
Several interesting problems emerged, the most interesting of which was getting selected Outlet traffic routed to the new front-end microservice. At the time, all traffic was routed to our existing monolith, and that needed to continue for several reasons: the monolith was the only approved ingress point, contained numerous security filters, and processed URL rewrites and redirects using our home-grown (and ancient) “shortcut” tool. Adding a routing mechanism in front of the monolith would bypass the coveted shortcut tool, so we chose to put a reverse proxy inside the monolith instead, extending the open-source Smiley’s Proxy Servlet.
The reverse proxy+front-end microservice combination was experimental but very successful. It wasn’t intended to be a generalized pattern for others to follow, but within months several other front-end microservices had been created and were using the proxy. And that was only the beginning. Usage grew to dozens of front-end microservices over the next couple of years. This solution had clearly solved a problem that other teams were experiencing.
Problems (Tell me about it)
Over time, we discovered two main issues:
- A move to a microservice architecture should decrease reliance on the monolith, but this pattern actually increased it.
- The monolith used HTTP pools, and some calls ended up in loops. For example, the monolith often proxied to a front-end microservice that, in turn, called a REST service on the monolith. As these loops began to be called more frequently, we had to increasingly monitor and adjust HTTP threads to avoid pool starvation.
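To make the loop problem concrete, here is a minimal sketch in plain Java. It is not the monolith's code: `PoolLoopDemo`, `completedLoops`, and the pool sizes are hypothetical, and a fixed thread pool stands in for the monolith's HTTP pool. Each simulated proxied request holds a pool thread while it waits for a callback that needs another thread from the same pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo: one pool serves both the proxied request and the
// REST callback that the front-end microservice makes to the monolith.
class PoolLoopDemo {

    /** Returns how many looping requests got their callback served in time. */
    static int completedLoops(int poolSize, int requests, long timeoutMs) throws Exception {
        if (requests > poolSize) {
            throw new IllegalArgumentException("sketch assumes requests <= poolSize");
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allHoldingThread = new CountDownLatch(requests);
        CountDownLatch allAttempted = new CountDownLatch(requests);

        Callable<Boolean> proxiedRequest = () -> {
            // Wait until every request occupies a pool thread (peak traffic).
            allHoldingThread.countDown();
            allHoldingThread.await();
            // The "loop": the callback needs another thread from the same pool.
            Future<String> callback = pool.submit(() -> "rest-response");
            boolean served;
            try {
                callback.get(timeoutMs, TimeUnit.MILLISECONDS);
                served = true;
            } catch (TimeoutException starved) {
                served = false; // no free thread ever picked up the callback
            }
            // Keep holding the thread until every request has made its attempt,
            // so the outcome does not depend on timing.
            allAttempted.countDown();
            allAttempted.await();
            return served;
        };

        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (int i = 0; i < requests; i++) {
                results.add(pool.submit(proxiedRequest));
            }
            int completed = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    completed++;
                }
            }
            return completed;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(completedLoops(2, 2, 300)); // prints 0: every thread is waiting on the pool
        System.out.println(completedLoops(4, 2, 300)); // prints 2: spare threads serve the callbacks
    }
}
```

In this toy model the only remedies are a bigger pool or fewer loops, which mirrors the monitoring and tuning burden described above.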
Solution (A new tool!)
We clearly needed to extract routing out of the monolith into a new tool or service. Because the shortcut tool’s rewrites and redirects are so critical to business functionality, the shortcut tool needed to be extracted as well. Additionally, existing filters (for security, etc.) would need to be rehomed. The new tool would need to fulfill some basic criteria:
- Path-based routing
  - Better if it can update routes while running
- Allow URL rewriting and redirects
  - Rules must be changeable while running
- Extendable by code or provides plugin architecture
  - Hooks for custom filters for security, logging, metrics, etc.
- Performant and memory efficient
- Scalable and capable of handling all traffic to REI.com
- Preferably stateless
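The first criterion, path-based routing that can be updated while running, can be illustrated with a small sketch in plain Java. This is not Spring Cloud Gateway's API or Switchback's code: `RouteTable`, `Route`, and the `/outlet` and `outlet-frontend` names are hypothetical. It assumes longest-prefix matching, with anything unmatched falling through to a default target such as the monolith.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical routing table: longest matching path prefix wins.
class RouteTable {

    /** One path-prefix rule; names and fields are illustrative only. */
    static final class Route {
        final String pathPrefix;
        final String target;

        Route(String pathPrefix, String target) {
            this.pathPrefix = pathPrefix;
            this.target = target;
        }
    }

    private final String defaultTarget;
    // The whole list is swapped atomically, so lookups never see a half-updated table.
    private final AtomicReference<List<Route>> routes =
            new AtomicReference<>(Collections.<Route>emptyList());

    RouteTable(String defaultTarget) {
        this.defaultTarget = defaultTarget;
    }

    /** Replaces all routes while the service keeps running. */
    void replaceRoutes(List<Route> newRoutes) {
        List<Route> sorted = new ArrayList<>(newRoutes);
        // Longest prefixes first, so the most specific route is checked first.
        sorted.sort(Comparator.comparingInt((Route r) -> r.pathPrefix.length()).reversed());
        routes.set(Collections.unmodifiableList(sorted));
    }

    /** Returns the target for a request path, or the default when nothing matches. */
    String resolve(String path) {
        for (Route route : routes.get()) {
            if (matches(path, route.pathPrefix)) {
                return route.target;
            }
        }
        return defaultTarget;
    }

    // Matches whole path segments only: "/outlet" matches "/outlet/deals" but not "/outletx".
    private static boolean matches(String path, String prefix) {
        String withSlash = prefix.endsWith("/") ? prefix : prefix + "/";
        return path.equals(prefix) || path.startsWith(withSlash);
    }

    public static void main(String[] args) {
        RouteTable table = new RouteTable("monolith");
        System.out.println(table.resolve("/outlet/deals")); // prints monolith
        table.replaceRoutes(java.util.Arrays.asList(new Route("/outlet", "outlet-frontend")));
        System.out.println(table.resolve("/outlet/deals")); // prints outlet-frontend
    }
}
```

Keeping the table in memory and replacing it wholesale also fits the "preferably stateless" criterion: any instance can rebuild the same table from a shared source of routes.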
Commercial off-the-shelf products, frameworks, and custom in-house development were all considered. Because all our microservice development tooling is based on Spring Boot, one solution presented itself as an obvious candidate to solve our routing needs: the relatively new (at the time) Spring Cloud Gateway project. Our rewriting and redirecting requirements are very complex, and we were not able to find an existing tool that could replace the shortcut tool. However, because Spring Cloud Gateway extends Spring Boot, it was fairly easy to combine its routing functionality with custom code to facilitate the shortcut tool.
Every project needs a good name, and most REI projects are named after the outdoors. “Switchback” was chosen for this project, because switchbacks change direction and also make it easier to climb up a steep slope.
Implementation (Do the smallest possible thing)
We knew introducing a new service that would carry all REI.com traffic was potentially risky, so we took several steps to mitigate risk. In the DevOps spirit of “do the smallest possible thing,” we created the service as a simple, straightforward Spring Cloud Gateway implementation, with additions for logging and metrics, but with no routing, URL rewrites, or security filters activated. It continued to route everything to the monolith. Initial deployment of the service was uneventful (but nail-biting), and we spent quite a while monitoring and ensuring we weren’t degrading the delivery of the site in any way. Once this was in place, we began to iterate on implementing additional functionality.
Security filters were slowly and carefully migrated to Switchback. Development on the shortcut tool replacement took a while, and it required a complex data migration. Even though we tested it thoroughly in pre-production environments, we were still very concerned that some of the rewrites might not function correctly when serving actual customer traffic. To reduce the risk, we used a technique we learned from the REI Search Team, called “Shadow Mode.”
Shadow mode (Hero of the day!)
In shadow mode, instead of directly replacing an existing service, we run the new service quietly alongside the existing service and compare the results. For Switchback, this meant calculating all shortcuts, storing them in the request context, then passing traffic downstream. The monolith then computed the shortcut, performed the rewrite or redirect, and added the result to a response header. As the response flowed back through Switchback, it compared its calculated shortcut with the one returned on the response header. We wrote metrics to track our hit-miss ratio and logs to find rewrites that needed to be fixed. Admittedly, there were quite a few that would have broken if we had gone live without shadow mode. Once we reached a 100% hit rate, we were ready to go live.
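The comparison flow can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Switchback's implementation: `ShadowMode`, the `shadow.shortcut` context key, and the `X-Shortcut-Result` header name are all hypothetical, and the monolith is modeled as a function from request path to response headers. The sketch is single-threaded for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical shadow-mode wrapper: compute the new result, act on the old one, compare.
class ShadowMode {
    static final String SHADOW_ATTR = "shadow.shortcut";   // assumed request-context key
    static final String LIVE_HEADER = "X-Shortcut-Result"; // assumed response header name

    private final Function<String, String> candidate;               // new shortcut logic
    private final Function<String, Map<String, String>> downstream; // monolith: path -> response headers
    private final List<String> mismatches = new ArrayList<>();
    private long hits;
    private long misses;

    ShadowMode(Function<String, String> candidate,
               Function<String, Map<String, String>> downstream) {
        this.candidate = candidate;
        this.downstream = downstream;
    }

    /** Handles one request; the caller only ever sees the live (downstream) response. */
    Map<String, String> handle(String path) {
        // 1. Calculate the shortcut and park it in the request context without acting on it.
        Map<String, String> requestContext = new HashMap<>();
        requestContext.put(SHADOW_ATTR, candidate.apply(path));

        // 2. Pass traffic downstream; the existing system still does the real work.
        Map<String, String> responseHeaders = downstream.apply(path);

        // 3. On the way back, compare the shadow result with the live one.
        String shadow = requestContext.get(SHADOW_ATTR);
        String live = responseHeaders.get(LIVE_HEADER);
        if (Objects.equals(shadow, live)) {
            hits++;
        } else {
            misses++;
            mismatches.add(path + ": shadow=" + shadow + " live=" + live);
        }
        return responseHeaders;
    }

    long hits() { return hits; }

    long misses() { return misses; }

    List<String> mismatches() { return mismatches; }

    /** Fraction of compared requests where both systems agreed (0.0 if none compared). */
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        Function<String, String> candidate = p -> p.equals("/sale") ? "/c/deals" : p;
        Function<String, Map<String, String>> monolith = p -> {
            Map<String, String> headers = new HashMap<>();
            headers.put(LIVE_HEADER, p.equals("/sale") ? "/c/deals"
                    : p.equals("/tents") ? "/c/camping" : p);
            return headers;
        };
        ShadowMode shadow = new ShadowMode(candidate, monolith);
        shadow.handle("/sale");
        shadow.handle("/tents");
        System.out.println(shadow.hitRate());    // prints 0.5
        System.out.println(shadow.mismatches()); // prints [/tents: shadow=/tents live=/c/camping]
    }
}
```

The mismatch log plays the role of the logs described above: each entry points at a rewrite the new logic still gets wrong, while customers keep receiving the live result.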
Activation (Wait, we’re done already?)
Activating Switchback’s shortcut tool was one of the most anti-climactic product deliveries of my life. We flipped the switch, and it was just on. There were no alarms and no last-second scrambles. No superheroes were required to save the day. To be honest, no one even noticed. It was a perfectly humble ending to a long story.
Despite its quiet beginnings, Switchback has become a pivotal part of our ecosystem: it routes to around forty microservices, provides a much-needed update with its new URL rewriting/redirecting tool, writes sitewide logs and metrics, and provides important security filters.