Moving the Rest of the Monolith to PaaSTA
Kyle Anderson, Site Reliability Engineer
- Jun 19, 2017
This past April (2017) we finally migrated our monolith to PaaSTA (our open source PaaS based on Apache Mesos). Yes, Yelp subscribes to the service-oriented architecture philosophy, and we constantly try to reduce the scope of the monolith, but realistically it still looms over us as a large, towering codebase that pays the bills. That doesn't mean we can't keep improving it, though. This blog post is about our latest improvement to the monolith: treating it just like any other service at Yelp and running it on PaaSTA.
Background: What is Yelp’s Monolith Made Of?
Yelp’s monolith is composed of perfectly proportioned parts of Puppet, Apache, mod_wsgi, and Python + virtualenv. Before PaaSTA, it was deployed directly to servers (a hybrid of on-premise datacenter and Amazon EC2) using a bespoke rsync-based deployment system. For the purposes of this blog post I’m going to call this “Classic” infrastructure. (“Legacy” has such a negative connotation, I rarely use it. I like “Classic” because it comes with a sense of respect for the past. Think “Classic Cars.”)
This in itself isn't very exciting, but it comes with some challenges that are baked into the system:
- The application is tightly coupled to the operating system it runs on (hard to upgrade)
- The bespoke deployment system means the monolith is deployed differently from all our other applications; improvements to that system apply only to the monolith
- Our Puppet configuration for the host is tightly coupled to the application; some changes require Ops help and are hard to coordinate (a developer can’t just try a new version of mod_wsgi on stage)
- Servers for the monolith are different from servers that run services (in many ways…)
Yeah Well, How Do You Make the Monolith “Just Like Any Other Service”?
The direction is obvious: the monolith should be run on the PaaS like any other service. The roadmap is unclear: how do we go from Puppet to PaaSTA without breaking the website?
Step 1: Dark launch
The same best practices around launching large experimental features apply equally to infrastructure and applications at Yelp. To “dark launch” our monolith on PaaSTA, we first deployed it as a new, separate SmartStack endpoint. SmartStack, the service-discovery tool we use (created by Airbnb), lets us decouple the deployment of a service from the discovery of that service. By launching under a new endpoint (HAProxy frontend) in SmartStack, normal traffic to the monolith can’t discover the new deployment, but we can still test it through a special HAProxy access control list (ACL) that sends you to the PaaSTA deployment if you have a special cookie.
Doing it this way gives us some good benefits:
- We exercise the PaaSTA components every time we deploy, allowing us to find breakages before they’re live
- In these very early days there is no risk of normal users hitting PaaSTA-powered webservers
- Core team members can opt in with the cookie to eat our own dogfood :)
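In HAProxy terms, the cookie-based opt-in described above can be sketched roughly like this. (The frontend/backend names, ports, and the cookie name are illustrative; in practice SmartStack generates this configuration for us.)

```
frontend monolith_in
    bind *:20001
    # Send requests carrying the opt-in cookie to the PaaSTA deployment
    acl wants_paasta hdr_sub(cookie) paasta_dark_launch=1
    use_backend monolith_paasta if wants_paasta
    default_backend monolith_classic

backend monolith_classic
    server classic1 10.0.0.10:31337 check

backend monolith_paasta
    server paasta1 10.0.1.10:31337 check
```

Without the cookie, requests fall through to the Classic backend and nothing changes for normal users.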
Once the glaring issues were found and fixed, we could increase the scope.
Step 2: Canary
We already used canary deployments with our Classic rsync-based system. For this step we ran two canaries: one on PaaSTA and one on Classic. There is a critical difference between this canary step and the “dark launch” step: the canary gets live traffic!
We need the live traffic to hit this thing so we can find more bugs and evaluate the performance. At this stage we made critical decisions about container sizes, hardware / instance classes, etc.
This is also a good time to start fixing things like monitoring dashboards, alerting tools, orchestration scripts, etc. All of these little odds and ends need to handle the hybrid mode. For example, this is where we made sure the tooling we have for rolling back code was solid and fast on both platforms. It’s “ok” to have these broken during the canary, but such small breakages are blockers for the next step.
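In PaaSTA, a canary can be expressed as a second, much smaller instance of the same service that receives a proportional slice of live traffic. A rough sketch of what such a per-cluster config might look like (file name, cluster, and numbers here are illustrative, not our actual configuration):

```yaml
# marathon-prod.yaml (hypothetical cluster config)
main:
  instances: 30
  cpus: 2
  mem: 4096
canary:
  instances: 1    # roughly 1/31 of live traffic lands on the canary
  cpus: 2
  mem: 4096
```

Because the canary registers in the same SmartStack endpoint as the main instances, it gets a small share of real traffic without any special routing rules.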
Step 3: Migrate! (Rampup)
Once the canary has proven itself, it’s time to crank the traffic up onto more and more servers. You might consider a blue/green deployment from here, but for a change this large and fundamental we decided against it for a business reason: it would cost too much. Remember from the introduction that we run a hybrid infrastructure, with servers both in our own datacenters and in AWS. We can’t just rack twice as many servers and flip everything over. No, for a change this large we took it slow and re-imaged our physical and virtual servers over the course of a month.
During this phase you’ll want to exercise your load-balancing tier, making sure it can handle things like traffic shifting, dynamic backend discovery, sane timeouts, etc.
A concrete example of an issue we found at this stage was the classic “running out of ephemeral ports” problem. We knew we would encounter issues like this as we exercised the new stack. Luckily we have the classic infrastructure still in place to hold us over while we fix these types of bugs.
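To see why ephemeral ports run out, some back-of-the-envelope arithmetic helps (the numbers below are illustrative Linux defaults, not our production tuning): each connection from one source IP to one backend IP:port consumes an ephemeral port, and closed connections hold their port in TIME_WAIT for about a minute.

```python
# Back-of-the-envelope math for ephemeral port exhaustion.
# Values are illustrative Linux defaults, not Yelp's production tuning.

# Default net.ipv4.ip_local_port_range
low, high = 32768, 60999
available_ports = high - low + 1  # ports per (src_ip, dst_ip, dst_port) tuple

time_wait_seconds = 60  # typical TIME_WAIT duration holding a port hostage

# Sustained rate of new connections to a single backend before exhaustion:
max_new_conns_per_sec = available_ports / time_wait_seconds
print(f"{available_ports} ports / {time_wait_seconds}s TIME_WAIT "
      f"~ {max_new_conns_per_sec:.0f} new connections/sec per backend")
```

Past that rate, connection attempts start failing. Common mitigations include connection reuse (keepalive), widening the port range, or spreading traffic across more backend IPs.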
Step 4: Cleanup
Although the cleanup step is not that interesting, it is kinda fun. I hear that in some shops engineering time is not always allocated to this step, but for operations and infrastructure teams that would be crazy; you have to clean up or you will drown.
For Yelp, we were able to clean up a custom AMI baking pipeline, tons of Puppet code, and, of course, the Classic rsync-based deployment mechanism.
Step 5: … Profit?
Once on a new platform, some things that were very difficult to do become easy to do! Here are some examples:
- Using PyPy instead of CPython. With a Docker-based deployment system, this is “just” a change to the Dockerfile (after blacklisting some packages that have PyPy-incompatible C extensions). Some teams at Yelp can now easily use this alternative interpreter and get massive speedups.
- Upgrading the base Linux distro with a code push. Again, a container-based approach gives good isolation between the host OS and the application’s base image. This is no longer a large multi-team effort spanning multiple months.
- Taking advantage of the built-in goodies of a PaaS, like automatic monitoring, error reporting, autoscaling, etc.
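For example, the PyPy swap in the first bullet might look roughly like a one-line base-image change. (The image tags, paths, and entrypoint below are illustrative, not our actual Dockerfile.)

```
# Before: FROM python:2.7
FROM pypy:2            # swap interpreters by changing the base image

WORKDIR /app
# requirements.txt should already exclude packages with
# PyPy-incompatible C extensions
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["pypy", "serve.py"]
```

Because the rest of the deployment pipeline only sees a container image, nothing downstream has to know the interpreter changed.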
Perhaps an underrated gain with this migration is the massive reduction in cognitive load between the two systems. Now Yelp developers and operations engineers have a unified experience when deploying services, even if the monolith is a very big service.
An unanticipated bonus side-effect for Yelp is that our deploys are faster! The speed of the new system is really a function of how quickly we can launch new Docker containers and how much spare capacity there is on the cluster, and we can tune these knobs to hit our desired speed/cost balance.
And of course, literally profit. Running on PaaSTA means that we can declare how many resources a service actually needs (CPU/RAM/instance count) based on real data, and Mesos can pack the cluster as best it can. Spare resources on a machine no longer need to be wasted; new, smaller tasks can be scheduled in. On top of that, we can autoscale the entire cluster to make our compute spend match our actual compute demand, on an hour-by-hour basis!
And then there is our true “secret” weapon for saving money by running on PaaSTA: Using Amazon Spot Fleet. The nitty gritty details on how we do this sanely without sacrificing availability of the website are reserved for another blog post.
Current State of The Art
The monolith and almost all other services at Yelp run on PaaSTA. We don’t run our stateful systems (Kafka, Cassandra, Memcache) on it, yet. While the migration was rough and slow, the payoff makes it worth it. There is still plenty of work to do! There are still many use cases for running code at Yelp that PaaSTA can’t handle, like large analytic (EMR) jobs, realtime streaming workloads (Apache Flink), and even just random one-off tasks (xargs!). Now that the biggest use case (web serving) is migrated, I look forward to extending PaaSTA to do even more new and exciting things!
Become an Engineer at Yelp
Backend application teams at Yelp work on a lot of incredible infrastructure projects like this. If you’re interested, apply below!