Engineering Blog

November 4th, 2014

Reflections from Grace Hopper (Part 1)

As you probably heard, Yelp attended Grace Hopper this year. Nine software engineers from different teams attended and, for many of us, it was our first time.

It was a unique experience to see so many talented women in one place. In addition to the talks and panel discussions, we also had the opportunity and pleasure to represent Yelp at the career fair. It was amazing to see a consistent flow of students and industry talent, all happy customers, stop by our booth to speak with us and tell us their stories of using Yelp.

We had such a great time that we wanted to share some highlights with you. This is the first of two posts where each of us have added an excerpt from our experience in our own words.

Our group at Grace Hopper

Our first day at the career fair. From left: Jen F., Wei W., Susanne L., Tasneem M., Carmen J., Emily F., Virginia T.

Tasneem M.

Engineering Manager, Ads

I joined Yelp about a month ago and was super excited to be part of this journey with fun and interesting women from the company. I have worked in the software industry since the early 2000s and have grown as an engineer, a manager and a leader partly because I have had inspiring role models throughout my career. My mentors have challenged me to take on opportunities that I felt I wasn’t quite ready for.

My experience at Grace Hopper was a reminder of how far I have come and yet how far I still can go. It has re-inspired me to continue mentoring women and help them take charge of their careers in a male dominated industry. I am also motivated to help tackle the diversity issues within my local community.

The highlights for me were being at the career fair, attending the keynotes and technology lightning talks. Having spent a lot of time at the career fair, I was impressed by the talent and deep interest in data mining and machine learning. I look forward to seeing some of you join our growing community of awesome engineers at Yelp. I also loved Pat Kirkland’s talk on “Executive Presence.” I appreciated her role-playing and practical guidance on the different personas (prey, predator and partner) and have been able to successfully experiment with some of her tips at work. The lightning talks were a great way to hear several unique stories and perspectives within an hour. I hope that we can be on stage next year sharing some of our experiences.

Susanne L.

Database Engineer, Operations

I am a database engineer at Yelp, working on a team where we all make sure that our persistent stores are reliable, scalable, fast and give our Yelp consumers a pleasant experience with our product. I thought it was awesome to see so many great women in computing coming to Grace Hopper. Some of them were looking for an internship or a full time position and they stopped by our booth genuinely interested in Yelp. It was exciting to hear how our product makes consumers happy.

A lot of these young women had remarkable experiences but didn’t really know what to highlight and how to sell themselves. My suggestion to some of them was to:

  • Consider having an “elevator pitch” prepared. Sell yourself in 60 seconds: try a pitch that includes your name, year, major, minor (if you have one), and why you want this job.
  • Prepare to talk about a distinguishing skill/project. For example, “I did a hackathon, where I developed an android app..” Be prepared to answer any questions about the project and what you learned from it.
  • Know what you are looking for. “What kind of internship at Yelp would you be interested in?” Backend/Frontend/Mobile/Web? Which programming language do you like? Why? Avoid “I don’t know, anything is fine” unless you are a freshman!

Jen F.

Systems Administrator, Corporate Infrastructure

I have been at Yelp since August 2012. For me,  the talk I will remember most is “Beyond the Buzzwords: Test-Driven Development” where I got a quick overview of how one can practice test driven development (TDD). While TDD isn’t the most exciting technical topic, the speaker, Sabrina B. Williams from Google, did an awesome job making it relatable. She was able to share a live demo of a project from concept to the tests to fully-implemented program.

The keynotes were full of rock stars of the technology world that were also surprisingly engaging (and occasionally controversial!).  Who hasn’t gone to a technology conference keynote and expected a glorified sales pitch that was easily forgettable?  None of that was here at Grace Hopper.

I also appreciated how diverse the talks were – from recommendations on how to get your first job to dealing with office politics to in-depth technical talks. However, one thing I’d love to see more at Grace Hopper next year is talks about advancing your career in non-managerial roles. All in all, this is a great conference and I only wish I had been able to participate when I was just getting into the industry!

October 31st, 2014

Scaling Traffic from 0 to 139 Million Unique Visitors

At LAUNCH Scale last week, I gave a talk to over 75 co-founders (CEOs and CTOs) on how we’ve scaled traffic here at Yelp. It brought back memories of Darwin biting through our ethernet cable and reminded me of the run up to our IPO, making sure we had enough capacity to handle the expected surge in traffic from the world’s press (and more recently, the launch of Yelp in Hong Kong!). For close to 8 years, I’ve had the privilege to work alongside some of the best engineers in the world and have seen the meticulous work and thought it takes to scale a site to serve over a hundred million unique visitors.

image00

I joined Yelp in early 2007 as a software engineer coming from Google, where I had spent the previous 4 years. On my very first day I was handed the search engine and asked to “improve it.” At the time we were handling approximately 200,000 searches per day and of those searches 86,400 of them were from load balancer doing health checks! We had one primary database running MySQL with a couple of replication slaves running a mix of InnoDB and MyISAM tables (side note: MyISAM isn’t great when your databases hard fail). We were using Apache without gzip enabled (pro tip: enable it!) and our data science toolkit was: cat, grep, wc, gnuplot, and awk. Because Yelp was born pre-AWS, we had to operate our own data centers.

As our traffic grew, our infrastructure had to scale with it. We started using a CDN to host our static content. The one MySQL database with slaves was scaled by vertically sharding it, moving tasks that were write heavy and non-user-interactive to different databases (e.g. clicks/impression/phone calls for advertisers). Another high-impact win our team had for improving database performance was moving our hosts to FusionIO PCIe cards as our primary storage. On the data center side, our operations team moved from having one data center to having many. Given our traffic make-up, we decided that our new data centers would be “read-only” and that we would have a separate primary data center where all writes happen. This made scaling our read only traffic much more straightforward. We now had data centers closer to users, allowing us to use DNS to geographically load balance our traffic, making the experience faster for users. We’ve also been able to leverage Amazon EC2 using AWS Direct Connect, which allows our engineering teams to bring up hardware whenever they need. It’s been awesome removing the hardware barrier for getting to production.

As our traffic scaled, our logging infrastructure needed to keep up as well. We started off using syslog-ng and rsync to handle logs stored on a NFS server and lots of disks. In October 2008 we moved to using scribe (now a custom branch), which has served us very well over the past 5+ years that we’ve been using it. We take the logs scribe aggregates and move them into Amazon S3 for storage, which makes using EMR on AWS seamless. This is why in 2009 we open sourced mrjob, which allows any engineer to write a MapReduce job without contending for resources. We’re only limited by the amount of machines in an Amazon data center (which is an issue we’ve rarely encountered). Real-time analytics are much better than periodically run batch jobs, so recently we open sourced Pyleus which allows anyone to write Storm topologies using Python. Another powerful tool we created, called MOE, allows anyone to optimize the parameters of any function (e.g. ad or search ranking functions). We’re extremely proud of our open source contributions and hope to have many more in the future.

For the past 10 years, the Yelp Engineering team has worked really hard to build a scalable architecture with the ability to develop and push code to the site multiple times a day. For those of you who enjoy working on these types of problems, make sure to check out our careers page. We’re always hiring!

If you’d like to know more on how we scaled the site, check out the slides from my presentation at LAUNCH Scale and feel free to reach out to me on Twitter at @stopman.

October 15th, 2014

Introducing Pyleus: An Open-source Framework for Building Storm Topologies in Pure Python

Yelp loves Python, and we use it at scale to power our websites and process the huge amount of data we produce.

Pyleus is a new open-source framework that aims to do for Storm what mrjob, another open-source Yelp project, does for Hadoop: let developers process large amounts of data in pure Python and iterate quickly, spending more time solving business-related problems and less time concerned with the underlying platform.

First, a brief introduction to Storm. From the project’s website, “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”

A Pyleus topology consists of, at minimum, a YAML file describing the structure of the topology, declaring each component and how tuples flow between them. The pyleus command-line tool builds a self-contained Storm JAR which can be submitted to any Storm cluster.

When it comes to massive data processing demos, “word count” is a classic. Since Storm operates on “unbounded streams of data”, we can’t calculate a total count for each unique word, since the input stream could continue indefinitely. Instead, our topology will maintain a monotonically increasing counter for each unique word, and log the value each time we see it.

So how can you build a word count Storm topology using Pyleus? All you need is a pyleus_topology.yaml and a few Python components.

For a simple demonstration, you need to know about the three core concepts in Storm:

  • A tuple is the unit of data within a Storm topology, flowing into and out of processing components.
  • Spouts are components that feed tuples into a topology. Usually, a spout consumes data from an external source, like Kafka or Kinesis, then emits records as tuples.
  • Bolts subscribe to the output streams of one or more other spouts and bolts, do some processing, then emit tuples of their own.

This topology has three components: A spout that emits a random line of “lorem ipsum” text, a bolt that splits lines into words, and bolt that counts and logs occurrences of the same word.

Here are the contents of pyleus_topology.yaml:

The spout configuration is self-explanatory, but the bolts must indicate the tuple streams to which they are to subscribe. The split-words bolt subscribes to line-spout with a shuffle_grouping—this means that tuples emitted from line-spout should be evenly and randomly distributed amongst all instances of split-words, be there one, five, or fifty.

count-words, however uses a fields_grouping on the ‘word’ field. This forces all tuples emitted from split-words with the same value of ‘word’ to go to the same instance of count-words. This allows the code in word_count.count_words to make the assumption that it will “see” all occurrences of the same word within the same process.

word_count/line_spout.py:

word_count/count_words.py:

word_count/split_words.py is left as an exercise for the reader. (Or, you could just check out the full example on GitHub)

Now, running pyleus build in this directory will produce a file word_count.jar which you can submit to a Storm cluster with pyleus submit, or run locally for testing with pyleus local.

The code for a Pyleus topology can be very simple, but one feature in particular we are excited about is built-in virtualenv integration. Simply include a requirements.txt file alongside your pyleus_topology.yaml, and pyleus build will build a virtualenv that your code can use and embed it in the JAR. You can even re-use packaged Pyleus components, and refer to them directly in pyleus_topology.yaml!

Another team at Yelp has already developed a Pyleus spout for consuming data from an internal source and built a Python package for it. Now, others within the company can add one line to their requirements.txt and use the spout in their pyleus_topology.yaml without writing a single line of code.

Pyleus is beta software, but feedback and pull requests are happily accepted.

Get started using Pyleus by installing it with pip install pyleus, then check out the source on GitHub for more examples.

October 9th, 2014

Using MOE, the Metric Optimization Engine, to Optimize an A/B Testing Experiment Framework

A/B Testing Experiment Frameworks and MOE

We recently open sourced MOE, the Metric Optimization Engine, a machine learning tool for solving global, black box optimization problems. An example application for such a system is optimally running online A/B experiments.

A/B testing segments the users that come to a site into buckets, or cohorts, and show different versions of the site to different cohorts of users. One can show 50% of users one version of a site (version A) and 50% of users another version of a site (version B). After some amount of time we can see which version of the site performs better on various metrics (like Click Through Rate (CTR), conversions, or total revenue) and shift all of your traffic over to the better version. This can be repeated as we converge towards the best version of the site under the metric(s).

It can take a long time to attain statistically significant results for experiments. We want to know with high confidence that one version of the site is better than another. Depending on user traffic it can take days, or even weeks, to run a single iteration. These experiments are also expensive; by showing a suboptimal version of your site, you are sacrificing potential revenue. There is also an opportunity cost associated with creating, developing, and running any particular experiment. So the fewer experiment iterations to get to the optimum, the better.

Furthermore, many A/B tests are merely parameter selection problems (an A/A’ test, where a feature stays the same, and only the underlying parameters change). Given a feature, we want to find the optimal configuration values for its parameters as quickly as possible.

MOE was designed to optimally solve problems like this, when we want to find an optimal set of parameters (inputs) of a function (a metric, like CTR) when sampling this function is time consuming or expensive (like running an A/B test experiment on live traffic). By leveraging MOE we can build a general experiments platform that automatically promotes good cohorts and pushes traffic away from under-performing cohorts, then replacing them with new, winning configurations.

MOE in A-B testing (1)

Figure 1: MOE can be used to optimally assign traffic allocations in an A/B testing framework and to suggest new parameters to sample given the historical values already sampled.

Using Multi-Armed Bandits for Optimal Traffic Allocation

Instead of naively allocating traffic in some uniform way (like a 50/50 split) we can use the information about how an individual cohort is performing to dynamically and optimally allocate the best percentage of traffic to it. This is fundamentally a tradeoff between exploration and exploitation. Exploration is equivalent to gaining more knowledge about the system by continuing to allocate traffic to a wide variety of points. Exploitation on the other hand, is using the knowledge we have already gained to get the most expected return out of the system as we currently understand it. Others have shown that using Multi-Armed Bandits can achieve better results, faster, in online A/B tests.

MOE has many different bandit policies implemented and allows the user to select a policy that best fits their desired trade-off between the exploration and exploitation of the data. By using a Multi-Armed Bandit approach we can limit the amount of suboptimal traffic allocation in an A/B test, leaving as few gains on the table as possible.

Using Bayesian Global Optimization to Suggest Parameters

As the Multi-Armed Bandit system starts lowering the traffic allocation for a certain experiment we have two choices. We can turn off that cohort, and allow the remaining cohorts to battle it out until we have one clear winner (within whatever confidence we desire). Alternatively, we can have MOE suggest a new cohort using Bayesian Global Optimization (with a new set of underlying parameters) for the system to try. This allows the system to play an internal game of king-of-the-hill, where better and better parameters are being sampled as the objective continues to rise higher and higher.

MOE uses the historical data gained so far to decide which points to sample next. It suggests new parameters to try that have the highest Expected Improvement under a Gaussian Process model. These are the parameters that are expected beat the current best set of parameters by the most, using whatever metric we provide as the objective function. This method converges much faster to global optima than other heuristic methods like grid search, random search or basic hill climbing.

By constantly spinning down underperforming cohorts in our experiment and replacing them with optimal new cohorts we allow our experiment framework to continually search the underlying parameter space for the best possible parameters.

Putting it all Together

Let’s assume that we want to optimize some underlying parameter that affects our ad Click Through Rate (CTR) in some way. This parameter could represent anything from a threshold in our system to some hyperparameter to a feature in our ad targeting system. Let’s say we currently are running with this parameter set to the value 0.2 in production, our status quo.

We would like to run an experiment where we sample other values of parameter in an attempt to maximize CTR. We will need to run time consuming and expensive experiments every time we want to measure CTR, so we will use MOE to help us find the optimal value in as few samples as possible. We simulate the observed CTR of the system for 1 million users by sampling from a Bernoulli distribution with the given true CTR.

In a real world system we do not know exactly how the underlying parameter affects CTR, which is why we need to sample various values with real traffic to find the optima. For the purposes of this example, we will define the exact function of CTR with respect to the underlying parameter for simulation purposes, keeping it hidden from MOE. Note that the function need not be convex or even continuous.

true_val

Figure 2: The graph of the true CTR vs the underlying parameter. There is a local CTR maximum at 0.143 and a global CTR maximum at 0.714.

We begin the experiment by asking MOE for 2 new parameter values to sample, in addition to our current status quo value of 0.2. We will set our objective function for MOE to optimize to be the relative gain over the status quo CTR. We note that this gives our status quo parameter a value of 0, and any parameter that has a higher CTR than the status quo will have a positive value, while any parameter that has a lower CTR will have a negative value.

Our initial representation of the underlying system, and the Gaussian Process, looks like Figure 3.

init_gauss

Figure 3: The blue (mean) and green (variance) plot is the initial Gaussian Process representation of the space that MOE optimizes. The dashed gray line is the true, unknown objective function, the relative CTR gain vs status quo.

MOE suggests two points as the initial parameters to test. These will be the corresponding parameters for cohort 1 and cohort 2 respectively. Initially we will conservatively allocate new cohorts to have 5% of traffic each, using the epsilon greedy bandit policy with epsilon set to 0.15. The remaining 90% of traffic will go towards the best parameter value observed so far.

Once we have the simulated data we can determine which cohorts have performed well and which ones should be turned off. We turn off any cohort that has a CTR more than two standard deviations worse than the current best observed CTR. We then query MOE for new, optimal parameters to sample given how all historical parameters have performed. This is repeated multiple times as poorly performing parameter values are culled and new parameter values are suggested by MOE. See Figure 4 for the evolution of the underlying Gaussian Process and watch MOE converge to the globally optimal parameter value.

ab_test

Figure 4: The new points be sampled each day (red x) and previous points (blue x) influence the Gaussian Process representation of the relative CTR gain that MOE uses to optimize the underlying parameter. We can see the Gaussian Process getting more accurate with respect to the true function (dashed gray) as more points are sampled.

By iterating to the global optimal parameter value in as few samples as possible, we can increase the system CTR far beyond the initial status quo value, or even the local maxima. In this example, by the second set of sampled parameters MOE is beating both the status quo CTR value and the CTR the best local parameter value, achieving the best global parameter value within the third set of sampled points.

ctr_plot

Figure 5: The simulated CTR during for each round of sampled parameter values. Initially the system CTR is equal to the status quo CTR. MOE quickly finds better and better values of the underlying parameter, converging to the global optima in 3 short sets of samples.

Using MOE in your experiment framework

October 6th, 2014

October Events at Yelp

October will be a busy month for all of us, with a lot of great conferences and events happening back-to-back. We start the month off with an SF Python meetup in collaboration with PyLadies. Zach Musgrave will be presenting on performance profiling in production. Later this month, Scott Clark will be presenting on MOE for SF Machine Learning here at Yelp HQ too! If you’re not familiar with MOE, it’s our black box optimization engine to help you with real world metric optimization.

We’ll also be at Grace Hopper on October 8-10, StrataConf on October 15-17, and at HTML5DevConf on October 20-24. Michael Stoppelman will be presenting at StrataConf on Yelp’s Data Pipeline and Ken Struys will be presenting at HTML5DevConf on Scaling Selenium Testing at Yelp so make sure to catch both of those!

Events happening at Yelp HQ

  • Wednesday, October 8, 2014 – 6:15PM – Python in Handling E-Crime and Massive Web Requests in Production (SF Python)
  • Thursday, October 9, 2014 – 6:00PM – Android Tour plus Hands-On Session (Learning Android Development)
  • Wednesday, October 15, 2014 – 6:00PM – Introducing the Metric Optimization Engine (MOE) (SF Machine Learning)
  • Thursday, October 16, 2014 – 6:30PM – Designing for Longevity (Designers + Geeks)
  • Wednesday, October 22, 2014 – 6:45PM – How Product Managers Make Your Business Scale (Products That Count)
  • Thursday, October 30, 2014 – 6:00PM – Large-scale Linear Classification: Status and Challenges by Prof. C.J. Lin (SF Machine Learning)

Conferences Yelp is attending