Our Path to Kubernetes

We recently flipped the switch on migrating 100% of our traffic to Kubernetes. This was a long process for us for a number of reasons, not least our supreme intolerance for faults. Some of our customers have to make all of their revenue for the year in quite a small number of hours, and their load profile makes configuration of load balancers and failovers more than a little problematic. As a result, we had several false starts in this conversion. Today, I am going to discuss how we got from Here to There.

Starting Points

When I joined Bypass just 15 months ago, we served a fraction of the traffic we do today, had about a third of the server inventory we have today, and had none of the sophistication we have today.

The typical application looked like this:

  • a pet EC2 instance, hand crafted from artisanal bits
  • a Ruby on Rails project
  • a bunch of completely custom-for-that-application Capistrano recipes
  • Samson to glue it all together

There were some bits of Capistrano that could be reused between projects. We did not do this consistently, though, so there were maybe three ways to migrate schemas across a dozen applications. Some apps had code to ensure zero-downtime deployments, while others did not.

Worst of all, horizontal scaling required one to manually provision and configure a new, artisanal EC2 instance, then shepherd a PR through git to add it to the Capistrano inventory, and finally deploy once all of that was done.

We had two nascent services running on Amazon EC2 Container Service (ECS), but our developers found it very challenging to troubleshoot these services due to ECS’s arcane (some might say profane) user experience.

Our Rails Monolith

The biggest challenge to date has been migrating our Ruby on Rails monolith. This application is responsible for a very, very large fraction of our business logic and is integral to our business operations. I hesitate to say that it is “in” our critical path, as in many ways it is the critical path.

If you have ever run a Rails application in production (or, truthfully, any app of the Rails/Django web-dev generation), ours follows a pattern you will recognize: a big monolithic codebase with an incredibly messy many-to-many object model that will survive as technical debt for another decade. While we are working to “break up the monolith” into something approaching microservices, this monolith is our reality today.

We have traditionally run it on Phusion Passenger, backed by MySQL, with async jobs handled by Sidekiq and Redis. It’s a stack you have probably seen a million times by now.

The Goal State

As of this weekend, something very close to 100% of our traffic is served by projects running in Kubernetes. We have a few low-traffic endpoints left in the long tail before we can officially retire our pet server fleet, and we hope to complete that process this quarter. The most important straggler, which we cut over this weekend, was our main Ruby monolith.

We aren’t doing anything particularly interesting in Kubernetes. No custom schedulers, for instance. We use the base APIs, and we use them effectively. In fact, our production stack is still running Kubernetes 1.4 at the moment.

We use the Deployment API to define each service and, because we’re in AWS, we use LoadBalancer Services to hook up to the outside world. The HorizontalPodAutoscaler API allows us to react to traffic spikes almost instantly. This turned out to be a big deal, and a huge win, because our 100% Rollout Weekend was not as quiet as it should have been.
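
For illustration, the whole pattern fits in one short manifest. The sketch below is not our actual configuration (the names, image, ports, resource numbers, and replica bounds are all placeholders), but it shows the three pieces working together, using the API versions of the Kubernetes 1.4 era:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: api                                      # hypothetical service name
    spec:
      replicas: 4
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
          - name: api
            image: registry.example.com/api:release  # illustrative image reference
            ports:
            - containerPort: 8080
            resources:
              requests:
                cpu: 500m
                memory: 512Mi
              limits:
                cpu: "1"
                memory: 1Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: api
    spec:
      type: LoadBalancer                             # provisions an AWS ELB in front of the pods
      selector:
        app: api
      ports:
      - port: 443
        targetPort: 8080
    ---
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: api
    spec:
      scaleTargetRef:
        apiVersion: extensions/v1beta1
        kind: Deployment
        name: api
      minReplicas: 4
      maxReplicas: 40                                # upper bound on pods the HPA may create
      targetCPUUtilizationPercentage: 70

When CPU utilization across the pods rises past the target, the HPA adds pods up to maxReplicas; that value is the “upper bound” I refer to later in this post.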

Pre-Gaming It

Our initial Kubernetes-based services were a small handful of non-critical ones: a timeclock microservice used by a subset of our customers, an integration with a specific, major partner, and a couple of internal tools.

As we were wrapping up our evaluation of Kubernetes in production with these services, we had an outage in an unrelated system. The practical “upshot”, as it were, was that our background Sidekiq queues grew to about two orders of magnitude beyond “completely out of control.”

Kif Kroker: Captain, may I have a word with you?

Captain Zapp Brannigan: No.

Kif Kroker: It’s an emergency, sir.

Captain Zapp Brannigan: Come back when it’s a catastrophe.

[a huge rumbling is heard]

Captain Zapp Brannigan: Oh, very well.

As it happened, we had already begun testing the monolith in our integration and dev Kubernetes environments. We had it Dockerized already, though quite inelegantly. As the outage progressed, I started putting together the checklist I would need to run through to add inventory to the fleet and bring on more Sidekiq instances. If you look back up there, earlier in this very post, you’ll see why I was unenthused by this prospect.

But I had a container image, and I had a podspec. I could literally launch as many Sidekiq workers as needed at a moment’s notice. So I made the executive decision, cross-checked all the relevant environment variables, and spun up the deployment.
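
That emergency deployment was, in essence, a worker-only variant of the same Deployment pattern. Here is a minimal sketch of the idea; the names, image, environment variables, and replica count are invented for illustration and are not our actual spec:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: monolith-sidekiq                              # hypothetical name
    spec:
      replicas: 20                                        # far more workers than the pet fleet could field
      template:
        metadata:
          labels:
            app: monolith-sidekiq
        spec:
          containers:
          - name: sidekiq
            image: registry.example.com/monolith:current  # the (inelegantly) Dockerized monolith
            command: ["bundle", "exec", "sidekiq"]
            env:
            - name: RAILS_ENV
              value: production
            - name: REDIS_URL                             # illustrative; the real variable list was longer
              value: redis://redis.internal.example.com:6379/0

Once something like this exists, scaling further to chew through a backlog is a one-liner: kubectl scale deployment monolith-sidekiq --replicas=40.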

This was our first big – and completely unplanned – win with Kubernetes. Before this, Kubernetes was still seen as a bit of a tech experiment here. Since then, we haven’t just drunk the Kool-Aid; we’re practically swimming in it.

Making the Move

It took us about 6 months to fully make the move to Kubernetes. We had a lot of trouble predicting load and usage levels, tweaking memory and CPU limits on our deployments, generating test loads to verify our assumptions, and even validating the results of the tests we could run.

Nighthawks

Our first attempt to get a handle on load profiles was to have a cross-functional team work night shifts for about 3 weeks and load test the production infrastructure during off-peak hours. This coincided with our need to perform a handful of breaking infrastructure changes unrelated to the Kubernetes initiative.

AWS Route53 was a real godsend here. We generated three Route53 weighted-DNS configurations (a sketch of the mixed configuration follows this list):

  • an A alias record pointing only to the pet servers
  • an A alias record pointing to a mix of pets and k8s
  • an A alias record pointing only to k8s
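
The mixed configuration looked something like the sketch below. This is written CloudFormation-style purely to illustrate the idea (I am not claiming that is how we manage DNS), and the zone, record names, ELB targets, and weights are placeholders:

    Resources:
      ApiPets:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.               # placeholder zone
          Name: api.example.com.
          Type: A
          SetIdentifier: pets
          Weight: 90                                 # share of DNS answers pointing at the pet fleet
          AliasTarget:
            HostedZoneId: Z35SXDOTRQ7X7K             # hosted zone ID of the ELB (illustrative)
            DNSName: pets-elb.example.com.           # ELB in front of the pet servers
      ApiK8s:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.
          Name: api.example.com.
          Type: A
          SetIdentifier: k8s
          Weight: 10                                 # share pointing at the k8s LoadBalancer ELB
          AliasTarget:
            HostedZoneId: Z35SXDOTRQ7X7K
            DNSName: k8s-elb.example.com.

The all-pets and all-k8s configurations are effectively the same pair of records with the weights pushed entirely to one side.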

In the end, however, this specific work mainly served to build the knowledge base we would need in the long run; the immediate results were not terribly useful. We were unable to generate a truly organic load test from which we could make infrastructure decisions.

Testing in Production

Weighted DNS really made a difference here as well. In fact, this whole blog post could be summed up as “Route53 Is Amazing.” We had a couple of interesting pieces of information to work with.

  • We had 6 pet monolith servers
  • We had 2 hostnames
    • api, used by integrations and other services
    • ingest, used by critical flow traffic
  • Both hostnames pointed to the same 6 servers

This gave me a self-selecting split along feature lines with which to phase in Kubernetes.

api switchover

We started with the api hostname, as it would be more tolerant of failure, beginning with weighted DNS that directed 5% of traffic to k8s and 95% to the pets.[fn1]

After observing this configuration for a few weeks and making additional tweaks, we raised the mix to 75/25 and got stuck there for several months. Many of our problems at this stage were matters of instrumentation rather than issues with our software or with Kubernetes.

We solved this problem by changing the NEW_RELIC_APP_NAME environment variable for this deployment, allowing us to more readily separate the performance profiles of the pets and the k8s deployments. In fact, we wound up with the same software reporting as several different applications in New Relic. Once we did this, we could plainly see that performance in k8s was actually much better than we’d thought, and we pulled the lever to move to 100% k8s.
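
Concretely, that was a one-line change in the k8s deployment, along these lines; the application and image names here are invented for the example:

    # Excerpt from the k8s deployment's container spec (names are illustrative):
    containers:
    - name: monolith-api
      image: registry.example.com/monolith:current
      env:
      - name: NEW_RELIC_APP_NAME
        value: monolith-api-k8s          # the pet servers kept reporting under the original app name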

ingest switchover

Moving the ingest endpoints over was a much more arduous affair. This is our most critical traffic, the kind of business process where our customers start blowing up our phones when things go wrong. As a result, we were much more conservative in migrating it. The work here ran in parallel with the work on api, just at a slower pace.

Then something unexpected happened.

We had been talking about retiring the ingest hostname for quite some time, and I had lost track of the status of that initiative. Shortly after we reached a 50/50 mix on ingest, our tablet engineers shipped a version that implemented this change, and they did so in the run-up to a product launch expected to deliver significantly more traffic than we were accustomed to receiving.

In fact, in anticipation of this launch, I had placed my entire team on call for the day. As a result, we observed the delta between expectation and reality almost instantly. The k8s HorizontalPodAutoscaler kicked in and drove our api deployment up to its upper bound nearly as quickly.

We did some quick back-of-the-napkin math and wondered whether we should drop api back to a mix with the pets, but ultimately decided to stay the course. Once the launch was over, we evaluated all the telemetry we had on the event and realized that we were safe to flip ingest over completely.

We entered the weekend with the original pet servers held in reserve, with Route53 health checks and failover configured to redirect traffic if anything hiccupped in k8s. Because we are in Sports and Entertainment, our highest-traffic hours are typically mid-evening on a Saturday, and we’ve had a string of record-breaking Saturdays recently. Data Science reported that this weekend was projected to be a little lighter than usual, so we elected to stay the course yet again.
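
For the curious, the failover arrangement had roughly the shape sketched below, again CloudFormation-style for illustration only. The names and zone IDs are placeholders, and this sketch uses alias-target health evaluation to stand in for the health checking; I am not claiming it mirrors our actual records.

    Resources:
      ApiPrimary:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.               # placeholder zone
          Name: api.example.com.
          Type: A
          SetIdentifier: k8s-primary
          Failover: PRIMARY
          AliasTarget:
            HostedZoneId: Z35SXDOTRQ7X7K             # ELB hosted zone ID (illustrative)
            DNSName: k8s-elb.example.com.            # the k8s LoadBalancer ELB
            EvaluateTargetHealth: true               # fail over when the primary target is unhealthy
      ApiSecondary:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.
          Name: api.example.com.
          Type: A
          SetIdentifier: pets-secondary
          Failover: SECONDARY
          AliasTarget:
            HostedZoneId: Z35SXDOTRQ7X7K
            DNSName: pets-elb.example.com.           # ELB in front of the reserve pet fleet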

The 7 o’clock hour on Saturday, April 8, 2017, was not a little lighter than usual. It was our highest-traffic hour ever.

I posted this to Twitter the following morning before I got the full traffic results:

That is, indeed, a very pretty viz. The HPA once again kicked in and took care of business. Each block in that visualization represents a pod started in response to traffic.


fn1: Please note that this doesn’t actually load balance the incoming traffic, but rather the DNS lookups. Because we have enterprise customers, there was some risk here that one very large customer would get k8s in the lookup and the traffic pattern would have been skewed beyond expectations. To whatever degree this happened, it did not affect our operations.