Engineering Blog

May 18th, 2015

HTTPS Client Testing Made Easy

You’ve probably read about the recent AFNetworking vulnerability. Nowadays, it’s not sufficient to just test your SSL certificates. You must also test how your clients use these certificates to have confidence your users aren’t getting pwned.

There are plenty of tools to test broken cipher suites and cryptographic vulnerabilities but, until now, there wasn’t a readily-available, free, and simple tool for testing how your client handles certificate requests and X.509 verification over the wire. Enter tlspretense-service.

This tool provides a simple Docker container built around iSEC Partner’s tlspretense certificate testing suite that acts as a MitM to test your clients. It works very similarly to tlspretense-docker, except instead of routing through the container by making (potentially problematic) networking changes, you instead connect to the container directly. This means that if your client accepts a service URL to connect to, you can just point it at tlspretense-service to test the robustness of your certificates. The tlspretense’s configuration file contains more information on which tests are run.

tlspretense-service is fairly easy to get up and running, since the Docker container does the bulk of the work for you. When it’s done running, you’ll get a handy report like this:
image01

Each test tells you what it expected, what the actual result was, and the complete duration of each connection.

The above example demonstrates the default behaviour of curl (curl https://localhost:8443). Note that curl rejected every connection in this naive test. This is because tlspretense provides its own CA for use in testing. Here’s what happens if we trust it (curl https://localhost:8443 --cacert tlspretense/ca/goodcacert.pem):
image02

This time, our client connected in the majority of cases. The output then proceeds to show whether curl continued the connection, what expected behavior would be, and whether the test passed or failed.

Here are a few interesting things you can do with this container:

  1. Rigorously test certs for your public HTTPS clients, including web browsers and web proxies that connect to the outside world.
  2. Intercept and test certs for your internal HTTPS and HSTS service traffic safely, without disclosing information to a third party.
  3. Create regression tests and test harnesses around your clients, to ensure they’re always X.509 compliant, without having those tests perform invasive networking changes.

This tool is still a work in progress, so feel free to report any bugs or issues you find with it. Test on!

May 8th, 2015

Yelp Tech Talks: Mobile Testing 1, 2, 3 Wrap Up

Last week we held our second tech talk, focusing on mobile, in a new internal series we launched this year. The presenters covered two very important topics in mobile: wearable apps and testing.

Building Our Apple Watch App

The evening started with a talk by Bill M. who lead the efforts to build our Apple Watch app. We knew we had to build the app in order to provide our users with the best possible access and experience. Since the platform was brand new and kept changing, it came with its own challenges. Development needed to be planned carefully to make sure we could deliver an app with a set of features that added up to a useful experience but still hit the launch date.

The Yelp Apple Watch storyboard

The Yelp Apple Watch storyboard

With a few key features in mind, our iOS developers dove in and started coding away. The storyboard (shown above), defines all of the functionality for the app. The team faced a lot of challenges with how to handle network requests, images, location, and phone-watch communication.

Testing

After learning how to build an Apple Watch app, Mason G. and Tim M. followed up on how we test our iOS and Android mobile apps (respectively) to ensure the best possible end product.

Both apps share a common API. Whenever we want to change this API, we start by writing documentation and examples. This allows us to define a contract between the API and the clients, and work out major problems before we write any code. Additionally, we can then take the examples and use them as mock data to test the clients.

Mason then led us into the world of iOS testing. Testing is critical in order to prevent regressions and give developers confidence that their changes work. The team relies on unit, integration, and acceptance tests to make sure that different components all work correctly. Our test suite leverages several tools, including KIF, to provide great test coverage.

Tim then spoke on the challenging world of Android testing. The major concerns with Android testing include test scalability, reliability, and speed. Despite the many obstacles to overcome, we’ve created a solid setup here at Yelp, largely due to the great open source libraries available for the platform, including Spoon and Espresso.

Screenshot from a tech talk showing all the different testing frameworks.

Screenshot from a tech talk showing all the different testing frameworks.

We have the full video and slides online:

On to the next talks!

If you weren’t able to attend our tech talk this time around, don’t worry! You’ll still get a chance to see our offices at our 3rd annual WWDC party. RSVP here.

If you’re interested in future events at Yelp or in engineering opportunities, let us know!

April 30th, 2015

Mycroft – Load Data into Redshift Automatically

Yelp generates terabytes of logs every day. Starting in 2010 with the release of mrjob, Yelp has relied heavily on Amazon Elastic MapReduce (EMR) and MapReduce jobs to analyze this data. While MapReduce works well to repeatedly answer the same question, it’s not a great tool to answer questions that are not well defined or that need to be answered only once. Consequently, we started using Redshift, Amazon’s Postgres-compatible column-oriented data warehouse, to explore our data.

Yelp’s log data already lands on S3 every day making it a convenient location to stage data for loading into Redshift. Unfortunately, most of our logs aren’t in a format that can be directly loaded but instead need to be lightly transformed, then converted into JSON or CSV for loading. mrjob is the perfect tool to perform these light transformations – so much so that we started building infrastructure to make this extremely common pattern as easy as possible.

Mycroft is an orchestrator that coordinates mrjob, S3, and Redshift to automatically perform light transformations on daily log data. Just specify a cluster, schema version, S3 path, and start date, and Mycroft will watch S3 for new data, transforming and loading data without user action. Mycroft’s web interface can be used to monitor the progress of in-flight data loading jobs, and can pause, resume, cancel or delete existing jobs. Mycroft will notify users via email when new data is successfully loaded or if any issues arise. It also provides tools to automatically generate schemas from log data, and even manages the expiration of old data as well as vacuuming and analyzing data.

Mycroft provides a web interface that makes it easy to create new data loading jobs.

Mycroft provides a web interface that makes it easy to create new data loading jobs.

Mycroft ships as a set of Docker containers which use several AWS services, so we’ve provided a small configuration script to ease the initial customization. Once configured, the service itself can be started using docker-compose, making getting it up and running relatively painless.

A comprehensive Quickstart is available for getting Mycroft up and running. The guide steps through getting a copy of Mycroft, configuring Mycroft and launching the required AWS services, and culminates in generating a schema for some example data and loading that data into Redshift.

Mycroft is available on GitHub. Please let us know if you encounter any issues with Mycroft, and don’t hesitate to submit pull requests with any great features you decide to develop.

Acknowledgements
Thanks to the team and everyone that helped build Mycroft: John Roy, Boris Senderzon, Anusha Rajan, and Justin Cunningham.

April 28th, 2015

Data Science Contest “Keeping it Fresh”: Predict Restaurant Health Scores

Yelp connects people with local businesses and along the way we’ve gathered rich data about customers’ experiences at those businesses via reviews, tips, check-ins and business attributes. We are constantly asking ourselves how the collective wisdom of Yelpers can be used to better inform cities in their efforts around protecting the health of our communities. In particular, could we use Yelp’s reviews and business information to make the process of sending Health Inspectors to restaurants more efficient?

According to the Centers for Disease Control, more than 48 million Americans per year become sick from food, and an estimated 75% of the outbreaks came from food prepared by caterers, delis, and restaurants. Currently, inspectors are sent to restaurants in a mostly random fashion. Since cities only have a limited number of health inspectors, quite often their time is wasted on spot checks at clean, rule-abiding restaurants. This also means that sometimes restaurants with poor health and safety records are discovered too late.

It turns out that with Yelp’s data, cities can improve the process of assigning Health Inspectors drastically. A research study by Prof. Michael Luca from Harvard Business School and Prof. Yejin Choi from Stony Brook University and their graduate students found that a machine learnt model built using Yelp’s reviews data and past health inspection records is able to successfully predict future inspection scores for restaurants 82% of the time. Can we do better?

We want to challenge data scientists worldwide to design a better health inspection prediction algorithm. Yelp is co-sponsoring a new Data Science contest “Keeping it Fresh” in collaboration with the City of Boston, DrivenData.org and Harvard University economists (Ed, Andrew, Scott, and Mike). Using Yelp’s data for restaurants, food and nightlife businesses in Boston as well as past history of health inspections, we are asking contestants to predict the future health score that will be assigned to a business at their next health inspection.

Winning algorithms will be awarded financial prizes — but the real prize is the opportunity to help the City of Boston, which is committed to examining ways to integrate the winning algorithm into its day-to-day inspection operations.

The goal for this competition is to use data from social media to narrow the search for health code violations in Boston. Competitors will have access to historical hygiene violation records from the City of Boston — a leader in open government data — and Yelp’s consumer reviews. The challenge: Figure out the words, phrases, ratings, and patterns that predict violations, to help public health inspectors do their job better. The first-place winner will receive $3,000, and two runners-up will receive $1,000 each. All prizes are provided by Yelp.

Figure 1: Health inspection history for a popular San Francisco restaurant. This restaurant’s health score is predicted in Figure 2 below.

Figure 1: Health inspection history for a popular San Francisco restaurant. This restaurant’s health score is predicted in Figure 2 below.

Yelp engineers have been fascinated by this problem. In a recent internal hackathon, a team of Yelp engineers comprising of Wing Y., Srivathsan R., Florian H., Jon C. and Srivatsan S., decided to dive deep into our rich user-generated content to find out correlations between reviews and actual health scores for businesses. They modelled Yelp reviews as a bag-of-words and used machine learning techniques like logistic regression to predict health scores. They then went on to overlay their algorithmically predicted scores with the actual city-issued health scores for those businesses over a period of time. Consider this example of a popular San Francisco restaurant:

Figure 2: Health score of a popular San Francisco restaurant in black and the predicted health score in green.

Figure 2: Health score of a popular San Francisco restaurant in black and the predicted health score in green.

It turns out that the actual health scores issued by the city of San Francisco (in black) follow very closely with the algorithmically predicted health scores (in green). That doesn’t come as a surprise because Yelpers are quite often talking about the same things that health inspectors look for – cleanliness, ambience, methods of preparation. This restaurant also exemplifies our observation from a randomized, controlled trial in one large city that restaurants whose low hygiene ratings are posted on Yelp tend to respond by cleaning up and performing better on their next inspection.

In the end such public/private partnerships with cities will enable us to enrich the experiences of all parties: consumers who avoid getting sick, businesses who are able to use actionable information to improve their health standards and cities who are able to optimize their finite resources of Health Inspectors to match them with restaurants more efficiently.

The competition opened yesterday (Monday, April 27th) and will accept submissions for eight weeks. Submissions will be evaluated on fresh hygiene inspection results during the six weeks following the competition; after that, the prizes will be awarded. Your submission will not only put you in the running for the prize – it has the chance to transform how city governments ensure public health.

We are excited to see what machine learning algorithms our contestants will build. So wait no more. Go sign up for the “Keeping it Fresh” Contest here.

Acknowledgements
Thanks to our partners from: City of Boston, Harvard Business School and DrivenData for making this contest a reality. Special thanks to Luther L. for leading the charge and bringing together all the parties involved. Thanks to Srivatsan S. for helping author this blog post.

April 13th, 2015

True Zero Downtime HAProxy Reloads

HAProxy: Cornerstone of Reliable Websites

One primary goal of the infrastructure teams here at Yelp is to get as close to zero downtime as possible. This means that when users make requests for www.yelp.com we want to ensure that they get a response, and that they get a response as fast as possible. One way we do that at Yelp is by using the excellent HAProxy load balancer. We use it everywhere: for our external load balancing, internal load balance, and with our move to a Service Oriented Architecture, we find ourselves running HAProxy on every machine at Yelp as part of SmartStack.

We love the flexibility that SmartStack gives us in developing our SOA, but that flexibility comes at a cost. When services or service backends are added or permanently removed, HAProxy has to reload across our entire infrastructure. These reloads can cause reliability problems because while HAProxy is top notch at not dropping traffic while it is running, it can (and does) drop traffic during reloads.

HAProxy Reloads Drop Traffic

As of version 1.5.11, HAProxy does not support zero downtime restarts or reloads of configuration. Instead, it supports fast reloads where a new HAProxy instance starts up, attempts to use SO_REUSEPORT to bind to the same ports that the old HAProxy is listening to and sends a signal to the old HAProxy instance to shut down. This technique is very close to zero downtime on modern Linux kernels, but there is a brief period of time during which both processes are bound to the port. During this critical time, it is possible for traffic to get dropped due to the way that the Linux kernel (mis)handles multiple accepting processes. In particular, the issue lies with the potential for new connections to result in a RST from HAProxy. The issue is that SYN packets can get put into the old HAProxy’s socket queue right before it calls close, which results in a RST of those connections.

There are various workarounds to this issue. For example, Willy Tarreau, the primary maintainer of HAProxy, has suggested that users can drop SYN packets for the duration of the HAProxy restart so that TCP automatically recovers. Unfortunately, RFC 6298 dictates that the initial SYN timeout should be 1s, and the Linux kernel faithfully hardcodes this. As such, dropping SYNs mean that any connections that attempt to establish during the 20-50ms of an HAProxy reload will encounter an extra second of latency or more. The exact latency depends on the TCP implementation of the client, and while some mobile devices retry as fast as 200ms, many devices only retry after 3s. Given the number of HAProxy reloads and the level of traffic Yelp has, this becomes a barrier to the reliability of our services.

Making HAProxy Reloads Not Drop Traffic

To avoid this latency, we built on the solution proposed by Willy. His solution actually works very well at not dropping traffic, but the extra second of latency is a problem. A better solution for us would be to delay the SYN packets until the reload was done, as that would only impose the latency of the HAProxy reload on new connections. To do this, we turned to Linux queueing disciplines (qdiscs). Queueing disciplines manipulate how network packets are handled within the Linux kernel. Specifically you can control how packets are enqueued and dequeued, which provides the ability to rate limit, prioritize, or otherwise order outgoing network traffic. For more information on qdiscs, I highly recommend reading the lartc howto as well as the relevant man pages.

After some light bedtime reading of the Linux kernel source code, one of our SREs, Josh Snyder, discovered a relatively undocumented qdisc that has been available since Linux 3.4: the plug queueing discipline. Using the plug qdisc, we were able to implement zero downtime HAProxy reloads with the following standard Linux technologies:

  • tc: Linux traffic control. This allows us to set up queueing disciplines that route traffic based on filters. On newer Linux distributions there is also libnl-utils which provide interfaces to some of the newer qdiscs (such as the plug qdisc).
  • iptables: Linux tool for packet filtering and NAT. This allows us to mark incoming SYN packets.

Smartstack clients connect to the loopback interface to make a request to HAProxy, which fortunately turns incoming traffic into outgoing traffic. This means that we can set up a queuing discipline on the loopback interface that looks something like Figure 1.

HAProxy_qdisc

Figure 1: Queueing Discipline

This sets up a classful implementation of the standard pfifo_fast queueing discipline using the prio qdisc, but with a fourth “plug” lane . A plug qdisc has the capability to queue packets without dequeuing them, and then on command flush those packets. This capability in combination with an iptables rule allows us to redirect SYN packets to the plug during a reload of HAProxy and then unplug after the reload. The handles (e.g ‘1:1’, ‘30:’) are labels that allow us to connect qdiscs together and send packets to particular qdiscs using filters; for more information consult the lartc howto referenced above.

We then programmed this functionality into a script we call qdisc_tool. This tool allows our infrastructure to “protect” a HAProxy reload where we plug traffic, restart haproxy, and then release the plug, delivering all the delayed SYN packets. This invocation looks something like:

We can easily reproduce this technique with standard userspace utilities on modern Linux distributions such as Ubuntu Trusty. If your setup does not have nl-qdisc-add but does have a 3.4+ Linux kernel, you can manipulate the plug via netlink manually.

Set up the Queuing Disciplines

Before we can do graceful HAProxy reloads, we must first set up the queueing discipline described above using tc and nl-qdisc-add. Note that every command must be run as root.

Mark SYN Packets

We want all SYN packets to be routed to the plug lane, which we can accomplish with iptables. We use a link local address so that we redirect only the traffic we want to during the reload, and clients always have the option of making a request to 127.0.0.1 if they wish to avoid the plug. Note that this assumes you have set up a link local connection at 169.254.255.254.

Toggle the Plug While Reloading

Once everything is set up, all we need to do to gracefully reload HAProxy is to buffer SYNs before the reload, do the reload, and then release all SYNs after the reload. This will cause any connections that attempt to establish during the restart to observe latency equal to the amount of time it takes HAProxy to restart.

In production we observe that this technique adds about 20ms of latency to incoming connections during the restart, but drops no requests.

Design Tradeoffs

This design has some benefits and some drawbacks. The largest drawback is that this works only for outgoing links and not for incoming traffic. This is because of the way that queueing disciplines work in Linux, namely that you can only shape outgoing traffic. For incoming traffic, one must redirect to an intermediary interface and then shape the outgoing traffic from that intermediary. We are working on integrating a solution similar to this for our external load balancers, but it is not yet in production.

Furthermore, the qdiscs could also probably be tuned more efficiently. For example, we could insert the plug qdisc at the first prio lane and adjust the priomap accordingly to ensure that SYNs always get processed before other packets or we could tune buffer sizes on the pfifo/plug qdiscs. I believe that for this to work with an interface that is not loopback, the plug lane would have to be moved to the first lane to ensure SYN deliverability.

The reason that we decided to go with this solution over something like huptime, hacking file descriptor passing into HAProxy, or dancing between multiple local instances of HAProxy is because we deemed our qdisc solution the lowest risk. Huptime was ruled out quickly as we were unable to get it to function on our machines due to an old libc version, and we were uncertain if the LD_PRELOAD mechanism would even work for something as complicated as HAProxy. One engineer did implement a proof of concept file descriptor patch during a hackathon but the complexity of the patch and the potential for a large fork caused us to abandon that approach; it turns out that doing file descriptor passing properly is really hard. Of the three options, we most seriously considered running multiple HAProxy instances on the same machine and using either NAT, nginx, or another HAProxy instance to switch traffic between them. Ultimately we decided against it because of the number of unknowns in implementation, and the level of maintenance that would be required for the infrastructure.

With our solution, we maintain basically zero infrastructure and trust the Linux kernel and HAProxy to handle the heavy lifting. This trust appears to be well placed as in the months this has been running in production we have observed no issues.

Experimental Setup

To demonstrate that this solution really works, we can fire up an nginx HTTP backend with HAProxy sitting in front, generate some traffic with Apache Benchmark, and see what happens when we restart HAProxy. We can then evaluate a few different solutions this way.

All tests were carried out on a freshly provisioned c3.large AWS machine running Ubuntu Trusty and a 3.13 Linux kernel. HAProxy 1.5.11 was compiled locally with TARGET=linux2628. Nginx was started locally with the default configuration except that it listens on port 8001 and serves a simple “pong” reply instead of the default html. Our compiled HAProxy was started locally with a basic configuration that had a single backend at port 8001 and a corresponding frontend at port 16000.

Just Reload HAProxy

In this experiment, we only restart HAProxy with the ‘-sf’ option, which initiates the fast reload process. This is a pretty unrealistic test because we are restarting HAProxy every 100ms, but it illustrates the point.

Experiment

Results

Socket reset! Restarting HAProxy has caused us to fail a request even though our backend was healthy. If we tell apache benchmark to continue on receive errors and do more requests:
Only 0.25% of requests failed. This is not too bad, but well above our goal of zero.

Drop SYNs and Let TCP Do the Rest

Now we try the method where we drop SYNs. This method seems to completely break with high restart rate as you end up with exponentially backing off connections, so to get reliable results I could only restart HAProxy every second.

Experiment

Results

iptables_graph

Figure 2: Iptables Experiment Results

As expected, we drop no requests but incur an additional one second of latency. When request timings are plotted in Figure 2 we see a clear bimodal distribution where any requests that hit the restart take a full second to complete. Less than one percent of the test requests observe the high latency, but that is still enough to be a problem.

Use Our Graceful Restart Method

In this experiment, we restart HAProxy with the ‘-sf’ option and use our queueing strategy to delay incoming SYNs. Just to be sure we are not getting lucky, we do one million requests. In the process of this test we restarted HAProxy over 1500 times.

Experiment

Results

tc_graph

Figure 3: TC Experiment Results

Success! Restarting HAProxy has basically no effect on our traffic, causing only minor delays as can be seen in Figure 3. Note that this method is heavily dependent on how long HAProxy takes to load its configuration, and because we are running such a reduced configuration, these results are deceivingly fast. In our production environment we do observe about a 20ms penalty during HAProxy restarts.

Conclusion

This technique appears to work quite well to achieve our goal of providing a rock-solid service infrastructure for our developers to build on. By delaying SYN packets coming into our HAProxy load balancers that run on each machine, we are able to minimally impact traffic during HAProxy reloads, which allows us to add, remove, and change service backends within our SOA without fear of significantly impacting user traffic.

Acknowledgements

Thanks to Josh Snyder, John Billings and Evan Krall for excellent design and implementation discussions.