Engineering Blog

December 10th, 2014

Learning to Rank for Business Matching

At Yelp, we solve a lot of different information retrieval problems ranging from highlighting reviews for businesses to recommending nearby restaurants. We recently improved one system called business matching, moving from a naive ranking function to a machine learned approach using Learn to Rank.

The Business Matching Problem

Business matching is a system that accepts a description of a business (e.g. name, location, phone number) and returns the Yelp businesses that description best fits. We can use this in several different contexts. For example, in 2012 Yelp announced a partnership spearheaded by the City of San Francisco to allow municipalities to publish restaurant inspection information to Yelp. The data feed shared by the City of San Francisco contained restaurants with business pages on Yelp but we didn’t know the exact business IDs needed to correctly update their information.

December 3rd, 2014

December Events At Yelp

While we’re all still recovering from our Thanksgiving-induced food comas, we wanted to share some of our upcoming events for December! This Thursday, some of our engineers will be giving lightning talks at the Women Who Code meetup so make sure to come!

We’re also very excited to be sponsoring the Women Who Code Holiday Party in Waterloo, Ontario on December 11th. If you’re in the area, RSVP and stop by! For the rest of the week (through December 5th), Yelp will also be matching donations to Women Who Code (up to $5,000 total). Please consider donating and helping them connect women in tech.

  • Wednesday, December 3, 2014 – 6:15PM – Lightning talks and High Performance Python by Raymond Hettinger (Python)
  • Thursday, December 4, 2014 – 6:30PM – Lightning Talks (Women Who Code)
  • Thursday, December 11, 2014 – 7:00PM – Experiences Everywhere: Designing for a Multi-Device World // Holiday Party (Designers + Geeks)
  • Wednesday, December 17, 2014 – 6:15PM – Advanced JS: D3.js Workshop (Girl Develop It)

November 26th, 2014

Yelp Hackathon 15: Sharks, Sock Puppets and Spectacular Projects

We recently held the 15th edition of Hackathon, a two-day event of pure, unadulterated hacking, innovation, creativity and fun. Folks clustered together, hammering on their keyboards and scribbling on whiteboards, while some electrical engineers plugged away on their breadboards. The kitchens were filled with delicious catered food, and fresh fruit and snacks lined the common spaces. We even had a school of remote-controlled sharks lurking around the office and taking over the meeting rooms.

Those ominous-looking RC sharks do not seem to deter our intrepid hackers!

Close to 60 great projects came out of the event from our offices in San Francisco, Palo Alto, Hamburg, and London – projects that ranged from cool data mining and visualizations to useful product features, funny utilities, and some serious hardcore hacking.

Are they building a quadcopter to fight those sharks?

Speaking of hardcore hacking, a team of engineers made up of Cameron P., Ben B. and Kevin L. designed an 8-bit ISA (Instruction Set Architecture) with a 16-bit stack-based math co-processor. They implemented an emulator for the architecture and also designed and built an assembler to target it. With their assembler they built an interpreter for Brainfuck, a popular esoteric and Turing-complete language. To top it all off, they wrote their emulator and assembler in Go. Their project idea echoes a collective sentiment that runs through the fabric of our engineering team: taking on a hard challenge! Could they design a useful ISA in the constrained space of 8-bit fixed-width instructions? If they designed it, could they actually build it? And if they built it, could they make a non-trivial program to prove it was useful? For them, it was fun just trying to accomplish all of that.

The Creative team at Yelp came up with a fun Yelp branding campaign for the Hackathon, complete with a set of mascots and a selfie station. They seem to have had some high-profile guests that day.

Like most of our Hackathons in the past, this one resulted in a whole array of projects focused on Yelp data. Why? Well, we have a fantastic dataset generated by a passionate and dedicated community of Yelpers around the world. A team of data scientists and product managers made up of Farid H., Natarajan S. and Daniel F. built a pretty cool tool that allows users to explore a city by its neighborhoods. Say you’re a tourist visiting San Francisco and want to dine in a touristy neighborhood with fancy restaurants. Or maybe you are a budget traveller and want to check out the local nightlife scene. All you need to do is tweak the filters on this nifty tool, and voila!

Touristy neighborhoods with fancy restaurants!

Local favorites for the budget traveler!

To build this tool, the team wrote an algorithm to compute different attribute scores. For example, this algorithm would compute a tourist_score by analyzing the percentage of reviews for businesses in a given neighborhood written by someone who is not from that city.
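
As a minimal sketch of how such a score could be computed, assume each review carries the city of the reviewed business and the reviewer's home city. The `Review` record and its field names are hypothetical, and the score is returned as a fraction between 0 and 1 rather than a percentage; the team's actual algorithm is not shown in this post.

```python
from collections import namedtuple

# Hypothetical review record; the field names are assumptions for illustration.
Review = namedtuple("Review", ["business_city", "reviewer_home_city"])


def tourist_score(reviews):
    """Return the fraction of reviews written by out-of-town reviewers."""
    if not reviews:
        return 0.0
    visitors = sum(
        1 for r in reviews if r.reviewer_home_city != r.business_city
    )
    return visitors / len(reviews)


# Made-up reviews for businesses in a single neighborhood.
neighborhood_reviews = [
    Review("San Francisco", "San Francisco"),
    Review("San Francisco", "Hamburg"),
    Review("San Francisco", "London"),
    Review("San Francisco", "San Francisco"),
]
print(tourist_score(neighborhood_reviews))  # 0.5
```

Two of the four made-up reviews come from out-of-town reviewers, so the score is 0.5. A neighborhood with a high score would rank well under a "touristy" filter.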

As engineers, the words reliability and performance are near and dear to our hearts, so much so that some of us use Hackathon as a time to dive into our codebase and explore ways to speed things up. A team of mobile developers consisting of Alex H., Mason G. and Ben A. did exactly that with some serious iOS hacking.

One of the most complex views on the Yelp iOS app is the business view, the page that lets you learn all about the business you searched for. A lot of the complexity comes from all of the features it has, which include rendering maps, reviews, review highlights, and a swipe-able photo view, just to name a few. How could they make this faster?

Well, they figured out that the bottleneck was due to large amounts of both layout and CPU-bound drawing. The team rebuilt the entire view with collection cell automatic sizing and caching (which is available on iOS 8 and has a fallback available for iOS 7 users). This gave them incremental layout and rendering instead of full view layout and rendering, which sped things up!

They modernized the view to be fully compatible with future screen size changes by using Auto Layout instead of frame based layouts. Even though an analogous Auto Layout view is slower than a similar frame based layout, they still saw large performance improvements from the incremental rendering and rewritten views. These changes dropped rendering times and cut memory usage by roughly 50%.

Our hackers showing off their projects in a science-fair style exhibition

Good job, everyone! Or as Darwin would say with staunch approval, “Woof! Woof!”

Thinking about that next killer Hackathon idea, aren’t you? Check out our exciting product and engineering job openings at www.yelp.com/careers and apply today. Hackathon 16 isn’t too far off!

November 12th, 2014

Scaling Elasticsearch to Hundreds of Developers

Yelp uses Elasticsearch to rapidly prototype and launch new search applications, and moving quickly at our scale raises challenges. In particular, we often encounter difficulty making changes to query logic without impacting users, as well as finding client library bugs, problems with multi-tenancy, and general reliability issues. As the number of engineers at Yelp writing new Elasticsearch queries grew, our Search Infrastructure team was having difficulty supporting the multitude of ways engineers were finding to send queries to our Elasticsearch clusters. The infrastructure we designed for a single team to communicate with a single cluster did not scale to tens of teams and tens of clusters.

Problems we Faced with Elasticsearch at Yelp

Elasticsearch is a fantastic distributed search engine, but it is also a relatively young datastore with an immature ecosystem. Until September 2013, there was no official Python client. Elasticsearch 1.0 only came out in February of 2014. Meanwhile, Yelp has been scaling search using Elasticsearch since September of 2012 and as early adopters, we have hit bumps along the way.

We have hundreds of developers working on tens of services that talk to tens of clusters. Different services use different client libraries and different clusters run different versions of Elasticsearch. Historically, this looked something like:

Figure 1: Yelp Elasticsearch Infrastructure

What are the Problems?

  • Developers use many different client libraries, and we have to support all of them.
  • We run multiple versions of Elasticsearch, mostly 0.90.1, 1.0.1 and 1.2.1 clusters.
  • Multi-tenancy is often not acceptable for business critical clients because Elasticsearch cannot offer machine level resource controls and its JVM level isolation is still in development.
  • Having client code spread over a multitude of services and applications makes auditing and changing client code hard.

These problems all derive from Elasticsearch’s inevitably wide interface. Elasticsearch developers have explicitly chosen a wide interface that is hard to defend due to lack of access controls, which makes sense given the complexity they are trying to express with an HTTP interface. However, it means that treating Elasticsearch as just another service in a Service Oriented Architecture rapidly becomes difficult to maintain. The Elasticsearch API is continually evolving, sometimes in backwards incompatible ways, and the client libraries built on top of that API are continually changing as well, which ultimately means that iteration speeds suffer.

Change is Hard

As we scaled usage of Elasticsearch here at Yelp, it became harder and harder to change existing code. To illustrate these concerns let us consider two examples of developer requests on the infrastructure mentioned in Figure 1:

Convert Main Web App to use the RequestsES Client Library

This involves finding all the query code in our main web app and then, for each one:

1. Create secondary paths that use RequestsES
2. Set up RequestBucketer groups.
3. Write duplicate tests.
4. Deploy the change.
5. Remove duplicate tests.
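
To give a feel for what one of these parallel paths looks like in application code, here is a rough sketch. RequestBucketer's real API is not shown in this post, so `in_new_path` is a hypothetical stand-in that deterministically assigns a percentage of requests to the new path, and the two search functions are stubs.

```python
import hashlib


def in_new_path(request_id, percent):
    """Deterministically place roughly `percent` percent of requests in the new path."""
    digest = hashlib.md5(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def search_with_old_client(query):
    # Stub for the existing query code.
    return {"client": "old", "query": query}


def search_with_requests_es(query):
    # Stub for the secondary path built on RequestsES.
    return {"client": "RequestsES", "query": query}


def search(request_id, query, percent_on_new_path=10):
    # Both branches have to be kept, tested, and deployed until the
    # rollout finishes, which is where the duplicate tests come from.
    if in_new_path(request_id, percent_on_new_path):
        return search_with_requests_es(query)
    return search_with_old_client(query)
```

Multiply this branch by every query in the main web app and the cost of the migration becomes clear.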

We can make the code changes fairly easily but deploying our main web app takes a few hours and we have a lot of query code that needs to be ported. This would take significant developer time due to the amount of complexity involved in deploying our main web application. The high developer cost of changing this code outweighs the infrastructure benefits, which means this change is not pursued.

Convert Service 4 to elasticsearch-py and move them to Cluster 4

Service 4’s SLA has become stricter and they can no longer tolerate the downtime caused by Service 1’s occasionally expensive facet queries. Service 4’s developers also want the awesome reliability features that Elasticsearch 1.0 brought such as snapshot and restore. Unfortunately, our version of the YelpES client library does not support 1.X clusters, but the official Python client does, which is ok because engineers in Search Infrastructure are experts in porting YelpES code to the official Python client. Alas, we do not know anything about Service 4. This means we have to work with the team that owns Service 4, have them build parallel paths, and tell them how to communicate with the new cluster. This takes significant developer time because of coordination overhead between our teams.

It is easy to see that as the number of developers grows, these development patterns just do not scale. Developers are continually adding new query code in various services, using various client libraries, in various programming languages. Furthermore, developers are afraid to change existing code because of long deployment times and business risk. Infrastructure and operations engineers must maintain multi-tenant clusters housing clients with completely different uptime requirements and usage patterns.

Everything about this is bad. It is bad for developers, infrastructure engineers, and operations engineers, and it leads to the following lesson learned:

Systems that use Elasticsearch are more maintainable when query code is separated from business logic

Our Solution

Search Infrastructure at Yelp has been employing a proxy service we call Apollo to separate the concerns of the developers from the implementation details so that now our infrastructure looks like this:

Figure 2: Apollo

Key Design Decisions

Isolate infrastructure complexity

The first and foremost purpose of Apollo is to isolate the complexity of our search infrastructure from developers. If a developer wants to search reviews from their service, they post a json blob:

{"query_text": "chicken tikka masala", "business_ids": [1, 2, 3] }

to an Apollo url:

apollo-host:1234/review/v3/search

The developer need never know that this is doing an Elasticsearch query using the elasticsearch-py client library, against an Elasticsearch cluster running in our datacenter that happens to run Elasticsearch version 1.0.1.

Validating all incoming and outgoing JSON objects with json-schema ensures that interfaces are respected. Because these schemas ship with our client libraries, we are able to check interfaces in calling code, even when that calling code is written in Python.
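
As an illustration, the review search request above could be checked against a schema like the following. The schema itself is an assumption based on the example blob, and the small hand-rolled `validate` function stands in for a real JSON Schema validator: it only understands the `type`, `required`, `properties` and `items` keywords.

```python
import json

# Hypothetical input schema for review search, written in JSON Schema style.
REVIEW_SEARCH_INPUT = {
    "type": "object",
    "required": ["query_text", "business_ids"],
    "properties": {
        "query_text": {"type": "string"},
        "business_ids": {"type": "array", "items": {"type": "integer"}},
    },
}

_TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def validate(instance, schema):
    """Check `instance` against the schema subset used above.

    Raises ValueError on the first violation found.
    """
    expected = _TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(instance, expected) or isinstance(instance, bool):
        raise ValueError("expected %s, got %r" % (schema["type"], instance))
    for key in schema.get("required", []):
        if key not in instance:
            raise ValueError("missing required key: %s" % key)
    for key, subschema in schema.get("properties", {}).items():
        if key in instance:
            validate(instance[key], subschema)
    if "items" in schema:
        for item in instance:
            validate(item, schema["items"])


request = json.loads(
    '{"query_text": "chicken tikka masala", "business_ids": [1, 2, 3]}'
)
validate(request, REVIEW_SEARCH_INPUT)  # a valid request passes silently
```

A request that omits `business_ids`, or passes strings where integers are expected, is rejected before any query code runs.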

Make it easy to iterate on query code

Every query client is isolated in its own client module within Apollo, and each client is required to provide an input and output schema that governs what types of objects the client should accept and return. Each such interface is bound to a single implementation of a query client, which means that in order to write a non-backwards compatible interface change, one must write an entirely new client that binds to a new version of the interface. For example, if the interface to review search changes, developers write a separate module and bind it to /review/v4/search, while continuing to have the old module bound to /review/v3/search. No more “if else” experiments, just self-contained modules that focus on doing one thing well.
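
The binding of one interface version to one implementation can be sketched as a small registry. Apollo's actual registration mechanism is not shown in this post; the `bind` decorator, the `ROUTES` table, the stub clients, and the `locale` and `reviews` fields are all hypothetical, and the schemas are reduced to lists of required keys to keep the example short.

```python
# Hypothetical registry: versioned path -> (client, input schema, output schema).
ROUTES = {}


def bind(path, input_schema, output_schema):
    """Bind one query client to exactly one versioned interface."""
    def register(client):
        if path in ROUTES:
            raise ValueError("%s is already bound" % path)
        ROUTES[path] = (client, input_schema, output_schema)
        return client
    return register


@bind("/review/v3/search",
      {"required": ["query_text"]},
      {"required": ["review_ids"]})
def review_search_v3(request):
    return {"review_ids": []}  # stub response


# A backwards-incompatible change gets its own module and its own path.
@bind("/review/v4/search",
      {"required": ["query_text", "locale"]},
      {"required": ["reviews"]})
def review_search_v4(request):
    return {"reviews": []}  # stub response


def dispatch(path, request):
    """Check the request, call the bound client, then check the response."""
    client, input_schema, output_schema = ROUTES[path]
    for key in input_schema["required"]:
        if key not in request:
            raise ValueError("missing input key: %s" % key)
    response = client(request)
    for key in output_schema["required"]:
        if key not in response:
            raise ValueError("missing output key: %s" % key)
    return response
```

Callers of /review/v3/search keep working unchanged while v4 is developed and rolled out next to it.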

A key feature of per-module versioning is that developers can iterate on their query client independently and the Apollo service is continuously delivered, ensuring that new query code hits production in tens of minutes. Each client can also be selectively turned off or redirected to another cluster if it is causing problems in production.

As for language, we chose Python due to Yelp’s mature Python infrastructure and the ease with which consumers could quickly define simple and complicated query clients. For a high throughput service like Apollo, Python (or at least Python 2) is usually the wrong choice due to high resource usage and poor concurrency support, but by using the excellent gevent library for concurrency and the highly optimized JSON parsing library ujson, we were able to scale Apollo to extremely high query loads. In addition, these libraries are all drop-ins, so clients do not have to design concurrency into their query logic; it comes for free. At peak load Apollo with gevent can do thousands if not tens of thousands of concurrent Elasticsearch queries on a single uwsgi worker process, which is pretty good compared to the single concurrent query that normal Python uwsgi workers can achieve.

Make it easy to iterate on infrastructure

Because the only thing that lives in Apollo is code that creates Elasticsearch queries, it is easy to port clients to new libraries or move their client to a different cluster in a matter of minutes. The interface stays the same and end to end tests ensure functionality is not broken.

Another key capability is that from the start we designed these modules to implement a simple interface that is composable. This composable-first architecture has allowed us to provide wrappers like:

  • SlowQueryLogger: A unary wrapper that logs any slow requests to a log for auditing and monitoring.
  • Tee: A binary wrapper that allows us to make requests to two clients but only wait on results from one of them. This is useful for dark launching new clients or load testing new clusters.
  • Mux: An n-ary wrapper that directs traffic between many clients. This is useful for gradual rollouts of new query code or infrastructure.

As an example, let us assume there are two query clients which differ only in the client library they use and Elasticsearch version they expect: ReviewSearchClient and OfficialReviewSearchClient. Furthermore, let us say our operations engineer has just provisioned a new shiny cluster running Elasticsearch 1.2.1 that lives in the cloud and is ready to be load tested. An example composition of these clients within Apollo might be:
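
A minimal sketch of such a composition, with toy implementations of the three wrappers, is given here. Apollo's actual classes and signatures are not shown in this post, so the constructor arguments, the `search` method, the list-based slow query log, the example weights, and `StubClient` are assumptions made for illustration.

```python
import random
import threading
import time


class SlowQueryLogger(object):
    """Unary wrapper: record any request slower than `threshold_ms`."""

    def __init__(self, client, threshold_ms, log):
        self.client, self.threshold_ms, self.log = client, threshold_ms, log

    def search(self, request):
        start = time.time()
        try:
            return self.client.search(request)
        finally:
            elapsed_ms = (time.time() - start) * 1000.0
            if elapsed_ms > self.threshold_ms:
                self.log.append((request, elapsed_ms))


class Tee(object):
    """Binary wrapper: query both clients, wait only on the primary."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def search(self, request):
        # Fire and forget: the secondary's result (or failure) is ignored.
        worker = threading.Thread(target=self._quietly, args=(request,))
        worker.daemon = True
        worker.start()
        return self.primary.search(request)

    def _quietly(self, request):
        try:
            self.secondary.search(request)
        except Exception:
            pass


class Mux(object):
    """N-ary wrapper: pick one client per request according to weights."""

    def __init__(self, clients, weights):
        # `weights` is a plain list, so it can be replaced while running.
        self.clients, self.weights = clients, weights

    def search(self, request):
        client = random.choices(self.clients, weights=self.weights)[0]
        return client.search(request)


class StubClient(object):
    """Stand-in for a real query client; returns a canned response."""

    def __init__(self, name):
        self.name = name

    def search(self, request):
        return {"served_by": self.name}


slow_queries = []
review_search = SlowQueryLogger(
    Tee(
        Mux(
            [StubClient("ReviewSearchClient"),
             StubClient("OfficialReviewSearchClient")],
            weights=[0.9, 0.1],
        ),
        StubClient("OfficialReviewSearchClient (cloud cluster)"),
    ),
    threshold_ms=500,
    log=slow_queries,
)
print(review_search.search({"query_text": "chicken tikka masala"}))
```

Each wrapper exposes the same `search` interface as the clients it wraps, which is what lets them nest in any order.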


This maps to the following request path:

Figure 3: Life of a Request in Apollo

In this short amount of Python code we achieved the following:

1. If any query takes longer than 500ms, log it to our slow query log for inspection.
2. Send all traffic to a Mux that muxes between an old PyES implementation and our new official client implementation. We can change the Mux weights at runtime without a code push.
3. Separately send traffic to a cloud cluster that we want to load test. Do not wait for the result.

Most importantly of all, we never had to worry about which consumers are making review search requests because there is a well defined interface that is well tested. Additionally, because Apollo uses Yelp’s mature Python service stack we have performance and quality metrics that can be monitored for this client, meaning that we do not have to be afraid to make these kinds of changes.

Revisiting developer requests

Now that Apollo exists, making changes goes from weeks to days, which means our organization can continue to be agile in the face of changing developer needs and backwards incompatible Elasticsearch versions. Let us revisit those developer requests now that we have Apollo:

Convert Main Web App to use the RequestsES Client Library

We have to find all the clients in Apollo that the Main Web App queries and implement their interfaces using the RequestsES client library. Then we wire up a Mux for each client that allows us to switch between the two implementations of the interface, deploy our code (~10 minutes) and gradually roll out the new code using configuration changes. From experience, query code like this can get ported in an afternoon. Having minute-long deploys to production makes all the difference because it means that you can get multiple pushes to production in one day instead of one week. Also, because the Elasticsearch query crafting code is separate from all the other business logic, it is easier to reason about changes and feel confident in them.

Convert Service 4 to elasticsearch-py and move them to Cluster 4

We can implement Service 4’s interface using the new client library, re-using existing tests to ensure functional equivalence between the two implementations. Then we set up a Tee to the new cluster to make sure our new code works and the cluster can handle Service 4’s load. Finally, we wait a few days to ensure everything works and then we change the query client to point at the new cluster. If we really want to be safe we can set up a Mux and gradually roll it over. This whole process takes a few days or less of developer time.

Infrastructure Win

Now that Yelp engineers can leverage Apollo, along with our real time indexing system and dynamic Elasticsearch cluster provisioning, they can develop search applications faster than ever. Whereas before Search Infrastructure was accustomed to telling engineers “unfortunately we can’t do that yet”, today we have the flexibility to support even the most ambitious projects.

Since the release of Apollo just a few months ago, we have ported every major Yelp search engine running on Elasticsearch to use Apollo as well as enabled dozens of new features to be developed by other teams. Furthermore, due to the power of Apollo we were able to seamlessly upgrade a number of our clients to Elasticsearch 1.X, something that previously would have been nearly impossible given our uptime requirements.

As for performance, we have found that the slight overhead of running this proxy has proved more than worth it in deployment, cluster reconfiguration, and developer iteration time, and we have made up for the request overhead by deploying big-win refactors that improve performance.

At the end of the day Apollo gives us flexibility, fast deploys, new Elasticsearch versions, performant queries, fault tolerance and isolation of complexity. A small abstraction and the right interface turn out to be a big win.

November 11th, 2014

November Events at Yelp

This month we have a handful of exciting events and a few new ones! We kicked off the month with Hackathon 15.0 where Yelpers created and shared some amazing projects (more on that in a few weeks!).

Now that the hackathon dust has settled, we’re starting off by hosting the Bay Area Girl Geek Dinner to help encourage networking between girl geeks. On top of that, we’ll be at AnDevCon November 18 – 21 so make sure to find us there too.

Events happening at Yelp HQ:

  • Tuesday, November 11, 2014 – 5:30PM – Yelp Girl Geek Dinner (Bay Area Girl Geek Dinner)
  • Thursday, November 13, 2014 – 7:00PM – Hooked: How to Build Habit-Forming Products (Designers + Geeks)
  • Tuesday, November 18, 2014 – 6:30PM – Docker Meetup at Yelp (Docker)
  • Wednesday, November 19, 2014 – 6:45PM – You Are What You Buy (Products That Count)
  • Thursday, November 20, 2014 – 6:30PM – Become a Mad Scientist through Failure and Determination (BAKG)