One primary goal of the infrastructure teams here at Yelp is to get as close to zero downtime as possible. This means that when users make requests for www.yelp.com we want to ensure that they get a response, and that they get a response as fast as possible. One way we do that at Yelp is by using the excellent HAProxy load balancer. We use it everywhere: for our external load balancing, internal load balance, and with our move to a Service Oriented Architecture, we find ourselves running HAProxy on every machine at Yelp as part of SmartStack.
We love the flexibility that SmartStack gives us in developing our SOA, but that flexibility comes at a cost. When services or service backends are added or permanently removed, HAProxy has to reload across our entire infrastructure. These reloads can cause reliability problems because while HAProxy is top notch at not dropping traffic while it is running, it can (and does) drop traffic during reloads.
HAProxy Reloads Drop Traffic
As of version 1.5.11, HAProxy does not support zero downtime restarts or reloads of configuration. Instead, it supports fast reloads where a new HAProxy instance starts up, attempts to use SO_REUSEPORT to bind to the same ports that the old HAProxy is listening to and sends a signal to the old HAProxy instance to shut down. This technique is very close to zero downtime on modern Linux kernels, but there is a brief period of time during which both processes are bound to the port. During this critical time, it is possible for traffic to get dropped due to the way that the Linux kernel (mis)handles multiple accepting processes. In particular, the issue lies with the potential for new connections to result in a RST from HAProxy. The issue is that SYN packets can get put into the old HAProxy’s socket queue right before it calls close, which results in a RST of those connections.
There are various workarounds to this issue. For example, Willy Tarreau, the primary maintainer of HAProxy, has suggested that users can drop SYN packets for the duration of the HAProxy restart so that TCP automatically recovers. Unfortunately, RFC 6298 dictates that the initial SYN timeout should be 1s, and the Linux kernel faithfully hardcodes this. As such, dropping SYNs mean that any connections that attempt to establish during the 20-50ms of an HAProxy reload will encounter an extra second of latency or more. The exact latency depends on the TCP implementation of the client, and while some mobile devices retry as fast as 200ms, many devices only retry after 3s. Given the number of HAProxy reloads and the level of traffic Yelp has, this becomes a barrier to the reliability of our services.
Making HAProxy Reloads Not Drop Traffic
To avoid this latency, we built on the solution proposed by Willy. His solution actually works very well at not dropping traffic, but the extra second of latency is a problem. A better solution for us would be to delay the SYN packets until the reload was done, as that would only impose the latency of the HAProxy reload on new connections. To do this, we turned to Linux queueing disciplines (qdiscs). Queueing disciplines manipulate how network packets are handled within the Linux kernel. Specifically you can control how packets are enqueued and dequeued, which provides the ability to rate limit, prioritize, or otherwise order outgoing network traffic. For more information on qdiscs, I highly recommend reading the lartc howto as well as the relevant man pages.
After some light bedtime reading of the Linux kernel source code, one of our SREs, Josh Snyder, discovered a relatively undocumented qdisc that has been available since Linux 3.4: the plug queueing discipline. Using the plug qdisc, we were able to implement zero downtime HAProxy reloads with the following standard Linux technologies:
tc: Linux traffic control. This allows us to set up queueing disciplines that route traffic based on filters. On newer Linux distributions there is also libnl-utils which provide interfaces to some of the newer qdiscs (such as the plug qdisc).
iptables: Linux tool for packet filtering and NAT. This allows us to mark incoming SYN packets.
Smartstack clients connect to the loopback interface to make a request to HAProxy, which fortunately turns incoming traffic into outgoing traffic. This means that we can set up a queuing discipline on the loopback interface that looks something like Figure 1.
Figure 1: Queueing Discipline
This sets up a classful implementation of the standard pfifo_fast queueing discipline using the prio qdisc, but with a fourth “plug” lane . A plug qdisc has the capability to queue packets without dequeuing them, and then on command flush those packets. This capability in combination with an iptables rule allows us to redirect SYN packets to the plug during a reload of HAProxy and then unplug after the reload. The handles (e.g ‘1:1’, ‘30:’) are labels that allow us to connect qdiscs together and send packets to particular qdiscs using filters; for more information consult the lartc howto referenced above.
We then programmed this functionality into a script we call qdisc_tool. This tool allows our infrastructure to “protect” a HAProxy reload where we plug traffic, restart haproxy, and then release the plug, delivering all the delayed SYN packets. This invocation looks something like:
We can easily reproduce this technique with standard userspace utilities on modern Linux distributions such as Ubuntu Trusty. If your setup does not have nl-qdisc-add but does have a 3.4+ Linux kernel, you can manipulate the plug via netlink manually.
Set up the Queuing Disciplines
Before we can do graceful HAProxy reloads, we must first set up the queueing discipline described above using tc and nl-qdisc-add. Note that every command must be run as root.
Mark SYN Packets
We want all SYN packets to be routed to the plug lane, which we can accomplish with iptables. We use a link local address so that we redirect only the traffic we want to during the reload, and clients always have the option of making a request to 127.0.0.1 if they wish to avoid the plug. Note that this assumes you have set up a link local connection at 169.254.255.254.
Toggle the Plug While Reloading
Once everything is set up, all we need to do to gracefully reload HAProxy is to buffer SYNs before the reload, do the reload, and then release all SYNs after the reload. This will cause any connections that attempt to establish during the restart to observe latency equal to the amount of time it takes HAProxy to restart.
In production we observe that this technique adds about 20ms of latency to incoming connections during the restart, but drops no requests.
This design has some benefits and some drawbacks. The largest drawback is that this works only for outgoing links and not for incoming traffic. This is because of the way that queueing disciplines work in Linux, namely that you can only shape outgoing traffic. For incoming traffic, one must redirect to an intermediary interface and then shape the outgoing traffic from that intermediary. We are working on integrating a solution similar to this for our external load balancers, but it is not yet in production.
Furthermore, the qdiscs could also probably be tuned more efficiently. For example, we could insert the plug qdisc at the first prio lane and adjust the priomap accordingly to ensure that SYNs always get processed before other packets or we could tune buffer sizes on the pfifo/plug qdiscs. I believe that for this to work with an interface that is not loopback, the plug lane would have to be moved to the first lane to ensure SYN deliverability.
The reason that we decided to go with this solution over something like huptime, hacking file descriptor passing into HAProxy, or dancing between multiple local instances of HAProxy is because we deemed our qdisc solution the lowest risk. Huptime was ruled out quickly as we were unable to get it to function on our machines due to an old libc version, and we were uncertain if the LD_PRELOAD mechanism would even work for something as complicated as HAProxy. One engineer did implement a proof of concept file descriptor patch during a hackathon but the complexity of the patch and the potential for a large fork caused us to abandon that approach; it turns out that doing file descriptor passing properly is really hard. Of the three options, we most seriously considered running multiple HAProxy instances on the same machine and using either NAT, nginx, or another HAProxy instance to switch traffic between them. Ultimately we decided against it because of the number of unknowns in implementation, and the level of maintenance that would be required for the infrastructure.
With our solution, we maintain basically zero infrastructure and trust the Linux kernel and HAProxy to handle the heavy lifting. This trust appears to be well placed as in the months this has been running in production we have observed no issues.
To demonstrate that this solution really works, we can fire up an nginx HTTP backend with HAProxy sitting in front, generate some traffic with Apache Benchmark, and see what happens when we restart HAProxy. We can then evaluate a few different solutions this way.
All tests were carried out on a freshly provisioned c3.large AWS machine running Ubuntu Trusty and a 3.13 Linux kernel. HAProxy 1.5.11 was compiled locally with TARGET=linux2628. Nginx was started locally with the default configuration except that it listens on port 8001 and serves a simple “pong” reply instead of the default html. Our compiled HAProxy was started locally with a basic configuration that had a single backend at port 8001 and a corresponding frontend at port 16000.
Just Reload HAProxy
In this experiment, we only restart HAProxy with the ‘-sf’ option, which initiates the fast reload process. This is a pretty unrealistic test because we are restarting HAProxy every 100ms, but it illustrates the point.
Socket reset! Restarting HAProxy has caused us to fail a request even though our backend was healthy. If we tell apache benchmark to continue on receive errors and do more requests:
Only 0.25% of requests failed. This is not too bad, but well above our goal of zero.
Drop SYNs and Let TCP Do the Rest
Now we try the method where we drop SYNs. This method seems to completely break with high restart rate as you end up with exponentially backing off connections, so to get reliable results I could only restart HAProxy every second.
Figure 2: Iptables Experiment Results
As expected, we drop no requests but incur an additional one second of latency. When request timings are plotted in Figure 2 we see a clear bimodal distribution where any requests that hit the restart take a full second to complete. Less than one percent of the test requests observe the high latency, but that is still enough to be a problem.
Use Our Graceful Restart Method
In this experiment, we restart HAProxy with the ‘-sf’ option and use our queueing strategy to delay incoming SYNs. Just to be sure we are not getting lucky, we do one million requests. In the process of this test we restarted HAProxy over 1500 times.
Figure 3: TC Experiment Results
Success! Restarting HAProxy has basically no effect on our traffic, causing only minor delays as can be seen in Figure 3. Note that this method is heavily dependent on how long HAProxy takes to load its configuration, and because we are running such a reduced configuration, these results are deceivingly fast. In our production environment we do observe about a 20ms penalty during HAProxy restarts.
This technique appears to work quite well to achieve our goal of providing a rock-solid service infrastructure for our developers to build on. By delaying SYN packets coming into our HAProxy load balancers that run on each machine, we are able to minimally impact traffic during HAProxy reloads, which allows us to add, remove, and change service backends within our SOA without fear of significantly impacting user traffic.
Thanks to Josh Snyder, John Billings and Evan Krall for excellent design and implementation discussions.
Yelp has had an iOS app for as long as third-party iOS apps have existed. Maintaining a codebase with that much history is always interesting and sometimes challenging, and one of the biggest challenges is dependency management. For a long time, git submodules met most of our needs and caused relatively few headaches. However, the submodule approach made it difficult to understand what unanticipated or even breaking changes will be introduced when bumping a submodule by a commit – or several. A Git SHA has no concept of versioning.
Additionally, adding a new library often required changes to the build settings of the Yelp app. Sometimes new header search paths were required, other times special build flags had to be added. As the project became larger and more complicated, the process of adding a new dependency became more and more difficult. It was not uncommon for the integration of a new library to require around a day of developer time. There had to be a better way.
Enter CocoaPods: a dependency manager for Objective-C projects.Ruby has gem+bundler, and Node.js has npm. Each of these tools operate in a similar way. Your application specifies the libraries and which versions to use. The concept of a dependency manager is nothing new. Python has pip and easy_install, depends on, and the dependency manager recursively resolves these dependencies until a final list is compiled. The manager then downloads the required libraries and makes them available to your project.
Since Objective-C is a compiled language, CocoaPods provides the additional convenience of configuring header search paths, setting build flags, etc., on a per-library basis. These options are specified by the author of each library as part of its podspec, neatly shifting the cognitive load off of you, the developer.
Libraries in CocoaPods – called pods – are versioned according to the Semantic Versioning standard. This scheme makes it infinitely easier to understand when changing a library version is safe and when it will introduce backwards-incompatible changes.
CocoaPods even has built-in support for all of our internal libraries – converting our existing submodules was a breeze. We first created a new Git repository, YelpSpecs, as an internal counterpart to the main CocoaPods Specs repository. Then, after writing a podspec for each internal library and pushing those specs to YelpSpecs, our internal libraries could be referenced alongside the external libraries in our Podfile. At last count, we had eleven internal libraries referenced in YelpSpecs.
Extracting our code into pods enabled us to do a lot of cleanup. Submodules no longer needed an xcodeproj file, as CocoaPods handled all of the configuration for us. In the end, we were able to delete thousands of lines from configuration files.
Having an easy dependency management system has encouraged us to better organize our code. Before CocoaPods, we only had 1 internal library which covered everything from networking and caching to utility views. Making a new library was so painful that we opted to throw everything together.
CocoaPods has completely changed that. We’ve begun to dismantle the mega-library, and have created several smaller and more focused libraries. This has made sharing code with our new Business Owner App incredibly easy. Our current set of libraries would have been nearly impossible to manage before CocoaPods. For example, our small utility library, YLUtils, is included in every single one of our internal libraries. Getting Xcode to link this properly would have been a nightmare, but CocoaPods has made it a breeze.
We’ve been using CocoaPods for nearly a year, and the change has allowed for some great improvements in how we work and develop. However, the transition wasn’t always easy – we had been using submodules for so long that we weren’t used to the CocoaPods workflow, where you have to checkout a local version of the pod before making changes. If you don’t checkout a local version first, changes won’t be picked up by git and might be lost the next time you run ‘pod install.’
In order to help our developers, we wrote a small script that goes in the podfile. When running pod install, it removes the write permission from all pod-related files that aren’t installed via a local pod. Xcode will then give you a helpful message, noting that the file is locked. This small change really helped with the transition, preventing future frustration.
After we had happily been using CocoaPods for several months, a new version was released. However, this new release was not backwards-compatible – if developers installed the new release on their system, they would be unable to build older versions of the app or branches that hadn’t recently been updated with master.
In order to solve this problem, we realized that we would need a dependency manager to manage our dependency manager. Since CocoaPods is itself a Ruby gem, the obvious choice was bundler. We added a shim that installs the correct version of CocoaPods locally via bundler. With this setup, developers always have access to the correct version of CocoaPods regardless of what branch or version they’re working on. If you have a large team or frequently switch between newer and older branches, we strongly recommend a configuration like this. We’ve set up a sample project to demonstrate our system – check out the readme for more details.
The work we’ve put into tuning our CocoaPods setup has already paid off. It’s made it extremely simple to create and try new libraries, which has in turn helped us make our code base more modular. Switching to CocoaPods has allowed us to spend less time fighting configurations, and more time developing great new features.
A couple of weeks ago, the Yelp Product & Engineering teams put their creative minds together to work on our coolest, funniest and hardcore-est of ideas at the 16th edition of our internal Hackathon. As always, the food was plentiful- our kitchens were stacked with delicious catered food, fresh fruits and snacks, gourmet coffee, and, wait for it, ice-cream sandwiches! While all that deliciousness fueled our bodies, our minds were fueled by do-it-yourself Metal Earth 3D model kits and metal-etching workshops. If that wasn’t enough, we even got a chance to race these amazing Anki cars.
Our VP of Engineering kicking off the proceedings with a transatlantic mimosa toast
Over 60 fantastic projects came out of our engineering offices in San Francisco, Hamburg and London – ideas that ranged from visualizations of our rich dataset to hardcore machine learning applications to nifty internal tools designed to skyrocket developer productivity and of course, the incredible out-of-the-box projects.
The unwavering constant of Yelp Hackathons – a medley of work, play and food
Speaking of out-of-the-box projects, the dynamic duo of Yoni D. and Matthew K. decided to reinvigorate a certain 19th century invention. What started off as an idea to self-develop photographic film quickly evolved into a project to create their very own camera using the swanky 3D printer that we have here in our San Francisco office. Using the design for an existing pinhole camera, they 3D printed parts of the camera, painted and assembled them to create a product as hardcore as its name – the Kühlkdebeulefotoapparat. Don’t know what means? Rumor has it that it has something to do with the duo’s last names. And it takes pretty good photographs too.
Another fancy piece of gadgetry that came out of our top-secret 3D printing lab
While some folks were busy fashioning things out of thin air, others joined forces to solve an interesting problem that we face here at Yelp. We build a ton of amazing features every single day that leave us with a lot of data with varying tools to visualize them. We’ve gotten really comfortable using Kibana to expose our data in Elasticsearch, but could we somehow use the same UI that we know and love to visualize our great data on Redshift? A team comprised of Garrett J., Jimming C. and Tobi O. decided to answer exactly that. Poking into the Kibana code (yay open source!), they realized that while the project was still under active development, it had a well-defined interface to query ElasticSearch. They built a translation server that modeled a similar interface for Redshift that Kibana could understand. The result – the ability to create new dashboards and visualizations that will help us gain more insight into our data!
Our hackers showing off their projects at a “science-fair” style exhibition
While some folks spent their time coming up with elegant solutions to hard problems, a team comprised of Duncan C. and Matthew M. did something, um, quite the opposite. In the true spirit of being “unboring,” they created “DumbStack” – an insanely complicated web of machinery comprised of Webfaction, Slack, AppEngine, SQS, EC2, Heroku, Tumblr, Netcat, Github and Google Translate to solve a simple problem – posting a tweet.
Definitely the easiest way to post to Twitter, right?
Have some creative ideas brewing in your head? Why don’t you come join forces as we embark on our next awesome Hackathon this summer. We are constantly on the lookout for amazing engineers, product managers and data scientists to bolster our team. Check out our exciting product and engineering job openings at www.yelp.com/careers and apply today.
I geek out about the Common Crawl. It’s an open source crawl of huge parts of the Internet, accessible for anyone to use. You have full access to the HTML and text of billions of web pages. What’s more, you can scan the entire thing, tens of terabytes, for just a few bucks on Amazon EC2. These days they’re releasing a new dataset every month. It’s awesome.
People frequently use mrjob to scan the Common Crawl, so it seems like a fitting tool for us to use. mrjob, if you’re not familiar, is a Python framework written by Yelp to help run Hadoop jobs locally or on Amazon’s EMR service. Since the Common Crawl is stored in Amazon’s S3, it makes a lot of sense to use EMR to access it.
I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.
Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.
Right away I realized that this is a huge problem space. Sophisticated solutions would use NLP, fuzzy matching, cluster analysis, a whole slew of signals and methods. I wanted to come up with something simple, more of a proof-of-concept. If some basic approaches yielded decent results, then I could easily justify developing more sophisticated methods to explore this dataset.
I started thinking about phone numbers. They’re easy to parse, identify, and don’t require any fuzzy matching. A phone number either matches, or it doesn’t. If I could match phone numbers from the Common Crawl with phone numbers of businesses in the Yelp database, it may lead to actually finding the web pages for those businesses.
I started planning my MapReduce job:
I start with a mapper that takes both Common Crawl WET pages (WET pages are just web page text) and Yelp business data. For each page from the Common Crawl, I parse the page looking for phone numbers and yield the URLs keyed off of each phone number that I find. For the Yelp business data, I yield the ID of the business, keyed off of the phone number that we have on record, which lets me start combining any businesses and URLs that match the same phone number in my reduce step.
I ran this against a small set of input data to make sure it was working correctly but already knew that I was going to have a problem. There are some websites that just like to list every possible phone number and then there are sites like Yelp which have pages dedicated to individual businesses. A website that simply lists every possible phone number is definitely not going to be a legitimate business page. Likewise, a site like Yelp that has a lot of phone numbers on it is unlikely to be the true home page of any particular business. I need a way to filter these results.
So let’s set up a step where we organize all the entries keyed by domain. That way if it looks like there are too many businesses hosted on that single domain, we can filter them out. Granted this isn’t perfect, it may exclude large national chains, but it’s a place to start.
The Common Crawl data files are in WARC format. The warc python package can help us parse these files, but WARC itself isn’t a line-by-line format that’d be suitable as direct input to our job. Instead, the input to our mapper will be the paths to WARC files stored in S3. Conveniently, the Common Crawl provides a file that is exactly that.
According to the Common Crawl blog, we know that the WET paths for the December 2014 crawl live at:
We’ll supply that path and a path to a similar file of our Yelp dataset to our mrjob:
Here’s an excerpt from the full job that illustrates how we load the WARC files and map them into phone numbers and URLs:
Performance Problem: Distribute Our Input To More Mappers
The default Hadoop configuration assumes that there are many lines of input and each individual line is relatively fast to process. It designates many lines to a single map task.
In this case, given that our input is a list of WET paths, there’s a relatively small amount of line input, but each line is relatively expensive to process. Each WET file is about 150MB so each line actually represents a whole lot more input than just that single line. For best performance, we actually want to dole out each input to a separate mapper.
Fortunately this is pretty easy by simply specifying the input format to NLineInputFormat:
Note that NLineInputFormat will reformat our input a bit. It will change our input lines S3 paths into '\t' so we need to treat our input as tab-delimited key, value pairs and simply ignore the key. MRJob’s RawProtocol can do this:
Identifying Phone Numbers
This part isn’t too complicated. To find possible phone numbers on a given page, I use a simple regex. (I also tried the excellent phonenumbers Python package, which is far more robust but also much slower, it wasn’t worth it.) It can be expanded to work on international phone numbers, but for now we’re only looking at US-based businesses.
Running the Job (On the Cheap!)
The WET files for the December 2014 Common Crawl are about 4TB compressed. That’s a fair bit of data, but we can process it pretty quickly with a few high-powered machines. Given that AWS charges you by the instance-hour, it’s most cost effective to just use as many instances required to process your job in a little less than 60 minutes. I found that I could process this job in about an hour with 20 c3.8xlarge instances.
The normal rate for a c3.8xlarge is currently $1.68/hour. Plus $.270/hr for EMR. Thus under standard pricing, our job would cost ($1.68 + $0.27) * 20 = $39. But with spot pricing, we can do much better!
Spot pricing lets you get EC2 instances at much lower rates when there is unused capacity. The current spot price for a c3.8xlarge in us-east-1 is around $0.26. We still have to pay the additional EMR charge, but this gets us to a much more reasonable ($0.26 + $0.27) * 20 = $10.60.
The mrjob found approximately 748 million US phone numbers in the Common Crawl December 2014 dataset. Of the ones we were able to match against businesses in the Yelp database, 48%, already had URLs associated with them. If we assume the Yelp businesses that have URLs are correct URLs, we can get a rough estimate of accuracy by comparing the URLs in the Yelp database against the URLs that our MRJob identified.
So of the businesses that matched and already had URLs, 48% of them were the same URLs that we already had. 61% had matching domains. If we wanted to use this for real, we’d probably want to combine this with other signals to get higher accuracy. But this isn’t a bad first step.
The Common Crawl is a great public resource. You can scan over huge portions of the web with some simple tools and a price of a sandwich. MRJob makes this super easy with Python. Give it a try!
At Yelp we value our ability to quickly ship code. We’re constantly pushing changes out to production, and we even encourage our interns to ship code on their first day. We’ve managed to maintain this pace even as the engineering team has grown to over 300 people and our codebase has reached several million lines of Python code (with a helping of Java on the side).
One key factor in maintaining our iteration speed has been our move to a service
oriented architecture. Initially, most of our development work occurred in a single, monolithic web application, creatively named ‘yelp-main.’ As the company grew, our developers were spending increasing amounts of time trying to ship code in yelp-main. After recognizing this pain point, we started experimenting with a service oriented architecture to scale the development process, and so far it’s been a resounding success. Over the course of the last three years, we’ve gone from writing our first service to having over seventy production services.
What kinds of services are we writing at Yelp? As an example, let’s suppose you want to order a burrito for delivery. First of all, you’ll probably make use of our autocompletion service as you type in what you’re looking for. Then, when you click ‘search,’ your query will hit half a dozen backend services as we try to understand your query and select the best businesses for you. Once you’ve found your ideal taqueria and made your selection from the menu, details of your order will hit a number of services to handle payment, communication with partners, etc.
What about yelp-main? There’s a huge amount of code here, and it’s definitely not going to disappear anytime soon (if ever). However we do see yelp-main increasingly becoming more of a frontend application, responsible for aggregating and rendering data from our growing number of backend services. We are also actively experimenting with creating frontend services so that we can further modularize the yelp-main development process.
The rest of this blog post roughly falls into two categories: general development practices and infrastructure. For the former, we discuss how we help developers write robust services, and then follow up with details on how we’ve tackled the problems of designing good interfaces and testing services. We then switch focus to our infrastructure, and discuss the particular technology choices we’ve made for our service stack. This is followed by discussions on datastores and service discovery. We conclude with details on future directions.
Service oriented architecture forces developers to confront the realities of distributed systems such as partial failure (oops, that service you were talking to just crashed) and distributed ownership (it’s going to take a few weeks before all the other teams update their code to start using v2 of your fancy REST interface). Of course, all of these challenges are still present when developing a monolithic web application, but to a lesser extent.
We strive to mitigate many of these problems with infrastructure (more on that later!), but we’ve found that there’s no substitute for helping developers really understand the realities of the systems that they’re building. One approach that we’ve taken is to hold a weekly ‘services office hours,’ where any developer is free to drop in and ask questions about services. These questions cover the full gamut, from performance (“What’s the best way to cache this data?”) to architecture (“We’ve realized that we should have really implemented these two separate services as a single service – what’s the best way to combine them?”).
We’ve also created a set of basic principles for writing and maintaining services. These principles include architecture guidelines, such as how to determine whether a feature is better suited to a library or whether the overhead of a service is justified. They also contain a set of operational guidelines so that service authors are aware of what they can expect once their service is up and running in production. We’ve found that it can often be useful to refer to these principles during code reviews or design discussions, as well as during developer onboarding.
Finally, we recognize that we’re sometimes going to make mistakes that negatively impact either our customers or other developers. In such cases we write postmortems to help the engineering team learn from these mistakes. We work hard to eliminate any blame from the postmortem process so that engineers do not feel discouraged from participating.
Most services expose HTTP interfaces and pass around any structured data using JSON. Many service authors also provide a Python client library that wraps the raw HTTP interface so that it’s easier for other teams to use their service.
There are definite tradeoffs in our choice to use HTTP and JSON. A huge benefit of standardizing on HTTP is that there is great tooling to help with debugging, caching and load balancing. One of the more significant downsides is that there’s no standard solution for defining service interfaces independently of their implementation (in contrast to technologies such as Thrift). This makes it hard to precisely specify and check interfaces, which can lead to nasty bugs (“I thought your service returned a ‘username’ field?”).
We’ve addressed this issue by using Swagger. Swagger builds on the JSON Schema standard to provide a language for documenting the interface of HTTP/JSON services. We’ve found that it’s possible to retrofit Swagger specifications onto most our our services without having to change any of their interfaces. Given a Swagger specification for a service, we use swagger-py to automatically create a Python client for that service. We also use Swagger UI to provide a centralized directory of all service interfaces. This allows developers from across the engineering team to easily discover what services are available, and helps prevent duplicated effort.
Testing within a service is fairly standard. We replace any external service calls with mocks and make assertions about the call. The complexity arises when we wish to perform tests that span services. This is one of the areas where a service oriented architecture brings additional costs. Our first attempt at this involved maintaining canonical ‘testing’ instances of services. This approach lead to test pollution for stateful services, general flakiness due to remote service calls and an increased maintenance burden for developers.
In response to this issue we’re now using Docker containers to spin up private test instances of services (including databases). The key idea here is that service authors are responsible for publishing Docker images of their services. These images can then be pulled in as dependencies by other service authors for acceptance testing their services.
The majority of our services are built on a Python service stack that we’ve assembled from several different open-source components. We use Pyramid as our web framework and SQLAlchemy for our database access layer, with everything running on top of uWSGI. We find that these components are stable, well-documented and have almost all of the features that we need. We’ve also had a lot of success with using gevent to eke additional performance out of Python services where needed. Our Elasticsearch proxy uses this approach to scale to thousands of requests per second for each worker process.
We use Pyramid tweens to hook into request lifecycles so that we can log query and response data for monitoring and historical analysis. Each service instance also runs uwsgi_metrics to capture and locally aggregate performance metrics. These metrics are then published on a standard HTTP endpoint for ingestion into our Graphite-based metrics system.
Another topic is how services access third-party libraries. Initially we used system-installed packages that were shared across all service instances. However, we quickly discovered that rolling out upgrades to core packages was an almost impossible task due to the large number of services that could potentially be affected. In response to this, all Python services now run in virtualenvs and source their dependencies from a private PyPI server. Our PyPI server hosts both the open-source libraries that we depend upon, and our internally-released libraries (built using a Jenkins continuous deployment pipeline).
Finally, it’s important to note that our underlying infrastructure is agnostic with respect to the language that a service is written in. The Search Team has used this flexibility to write several of their more performance-critical services in Java.
A significant proportion of our services need to persist data. In such cases, we try to give service authors the flexibility to choose the datastore that is the best match for their needs. For example, some services use MySQL databases, whereas others make use of Cassandra or ElasticSearch. We also have some services that make use of precomputed, read-only data files. Irrespective of the choice of datastore, we try to keep implementation details private to the owning service. This gives service authors the long-term flexibility to change the underlying data representation or even the datastore.
The canonical version of much of our core data, such as businesses, users and reviews, still resides in the yelp-main database. We’ve found that extracting this type of highly relational data out into services is difficult, and we’re still in the process of finding the right way to do it. So how do services access this data? In addition to exposing the familiar web frontend, yelp-main also provides an internal API for use by backend services in our datacenters. This API is defined using a Swagger specification, and for most purposes can be viewed as just another service interface.
A core problem in a service oriented architecture is discovering the locations of other service instances. Our initial approach was to manually configure a centralized pair of HAProxy load balancers in each of our datacenters, and embed the virtual IP address of these load balancers in configuration files. We quickly found that this approach did not scale. It was labor intensive and error prone both to deploy new services and also to move existing services between machines. We also started seeing performance issues due to the load balancers becoming overloaded.
We addressed these issues by switching to a service discovery system built around SmartStack. Briefly, each client host now runs an HAProxy instance that is bound to localhost. The load balancer is dynamically configured from service registration information stored in ZooKeeper. A client can then contact a service by connecting to its localhost load balancer, which will then proxy the request to a healthy service instance. This facility has proved itself highly reliable, and the majority of our production services are now using it.
We are currently working on a next-generation service platform called Paasta. This uses the Marathon framework (running on top of Apache Mesos) to allocate containerized service instances across clusters of machines, and integrates with SmartStack for service discovery. This platform will allow us to treat our servers as a much more fungible resource, freeing us from the issues associated with statically assigning services to hosts. We will be publishing more details about this project later this year.
The work described in this blog post has been carried out and supported by numerous members of the Engineering Team here at Yelp. Particular credit goes to Prateek Agarwal, Ben Chess, Sam Eaton, Julian Krause, Reed Lipman, Joey Lynch, Daniel Nephin and Semir Patel.