Engineering Blog

February 11th, 2015

Reading Between the Lines: How We Make Sense of Users’ Searches

The Problem

People expect a lot out of search queries on Yelp. Understanding exact intent from a relatively vague text input is challenging. A few months ago, the Search Quality team felt like we needed to take a step back and reassess how we were thinking about a user’s search so that we could return better results for a richer set of searches.

Our main business search stack takes into account many kinds of features that can each be classified as being related to one of distance, quality and relevance. However, sometimes these signals encode related meaning, making the equivalence between them difficult to understand (e.g., we know a business is marked as “$ – inexpensive” and that people mention “cheap” in their reviews), but often they capture orthogonal information about a business (e.g., reviews are not a reliable source of a business’ hours). We need a stage where we discern which parts of a query tell us what types of information should be searched over.

It’s also useful to embellish the plain text of a search query with context. Are there related words that are very relevant to a substring of text? Is it possible that the user misspelled part of the query? Having this extended information would enhance our ability to recall relevant businesses, especially in low content areas.


Let’s consider a more concrete example, such as a search for “mimosa brunch” in San Francisco. The above search results show some features that our desired (and now current) query understanding system should be able to extract:

  • “mimosa” has the related search term “champagne”
  • “brunch” maps to the attribute “Good for Brunch”

All these enhancements sounded great but were becoming increasingly difficult to implement. Our old search stack combined the understanding of the intent and semantics of a query and its execution into a single step. This resulted in engineers having a hard time adding new quality features: concepts were overloaded (the word “synonym” had very different meanings in different parts of the code), questions of semantics were tied up in execution implementation, and there was no way for other parts of Yelp to understand the semantic knowledge locked up in our Lucene search stack.

Semantics as Abstraction

To help solve these problems, we found it useful to formally split up the process of extracting meaning from the process of executing a search query. To that end, we created a component named Soxhlet (named after a kind of extractor used in chemistry labs) which runs before search execution, and whose sole job is to extract meaning from a raw text query. It turns the input text into a Rich Query that holds structured semantics in the context of business search. This Rich Query then becomes the preferred abstraction for the rest of the search stack to execute upon.


Encoding Meaning

What does this Rich Query data structure look like? When we were first whiteboarding possible representations we tended to underline parts of the query and assign semantics to each part.


As you can see from the above representation, there are complex semantics encoded in a signal that is mostly textual (the query text). Therefore, we think that annotating the query string is a good way to represent a Rich Query: ranges of the original query text are marked as having typed information via various kinds of Annotations.

For those of you familiar with Lucene, this data structure is very similar to a TokenStream but with annotations instead of attributes. This makes consumption of the Rich Query in the recall and relevance steps straightforward while at the same time giving enough abstraction so that non-search components can also easily consume Rich Queries.

As an example, the extracted RichQuery for a query popular with our team is:

When the above Rich Query is analyzed by our core relevance stack (a service called “Lucy”) from the AttributeQueryAnnotation, it knows to also search over the attribute field of businesses for the “$” attribute. Furthermore, the confidence of 1.0 implies that an attribute match should be considered equivalent to a textual match for “cheap.” Other annotations may have lower confidences associated with them if their extraction process is less precise and this is factored into the scoring function of constructed Lucene queries.

By being able to use the way Annotations overlap in the original search query, it’s now possible to more easily cross-reference Annotations with each other and with the original text. The recall and relevance stages of the search stack use this ability to ensure that the right constraints are used.

Extraction Process

What is the architecture within Soxhlet that allows us to turn a plain query into a Rich Query? Our first observation is that we should have specialized components for producing each type of Annotation. Our second observation is that basic Annotations can help identify more complex ones; you could imagine taking advantage of detected synonyms to properly identify cost attributes in the search query string, etc.

From these observations, we decided that a good starting design was a series of transformation steps that iteratively adds new Annotation types. Each one of these steps is called a Transformer. Each transformer takes a Rich Query as input, looks at all existing Annotations and the original query text, and produces a modified Rich Query as output that possibly contains new annotations. Stripped of boilerplate, the code for the top-level Soxhlet class is very simple:


This architecture also provides us with a mechanism for sharing prerequisite work between Transformers, while not forcing coupling between earlier Transformers or the search recall and relevance steps. As an example, although all Transformers could operate on the raw query string, many kinds of text analysis work on the token level, so many Transformers would have to duplicate the work of tokenizing the query string. Instead, we can have a tokenization Transformer create token Annotations from the raw string, and then every subsequent Transformer has access to that tokenization information.

This optimization should be used carefully though with Soxhlet being internationalized into dozens of languages and more being regularly added, normalization (accent removal, stemming etc.) is an important part of extracting rich queries. It’s tempting to store to normalized queries at beginning of the extraction process but we have avoided doing so to prevent couplings among Transformers and between Soxhlet and other services such as the recall and relevance stage of the search stack.

The Transformers themselves can vary in complexity so let’s dive into some details on how they actually work:

Spelling Transformer

We actually use two different types of spelling transformations. “Spelling Suggestion” is where we still perform the Rich Query extraction and search on the original query but offer a “Did You Mean” link in the results for queries that could be good suggestions. These suggestions are computed at query time by generating candidate suggestions within a small edit distance of the original query and then scoring them with a noisy channel model that takes into account query priors and likelihood of edits.

“Spelling Correction” occurs when we are very confident about a correction and perform the Rich Query extraction and search on the corrected query. These corrections are generated by promoting Spelling Suggestions that have performed well in the past based on click rates for their “Did You Mean” link. At query time we simply lookup a map from query to correction.

This two-tiered approach is very effective in that it allows us to achieve high precision with Spelling Correction without sacrificing recall by allowing Spelling Suggestions for lower confidence suggestions. Our future plans for spelling include improving internationalization and incorporating more advanced features such as a phonetic algorithm into our suggestion scorer.

Synonyms Transformer

Context is very important to synonym extraction and this drove a number of our design decisions:

  • The same word can have different synonyms in different query contexts. For example, the word “sitting” in the query “pet sitting” has the synonym “boarding”. However, the same word in the query “house sitting” probably shouldn’t have “boarding” as a synonym.
  • The same word in the same query can have different synonyms in different domain contexts. For example the word “haircut” could have the synonym “barber” in a local business search engine such as Yelp. However, on an online shopping website a better synonym would be “hair clippers”.
  • The same word can be used very differently in queries compared to documents. For example, it might be common for a review to mention “I got my car repaired here” but it would be very unusual to write “I got my auto repaired here”. As a result, an algorithm that uses only document data to extract synonyms would likely not recall “auto repair” having the synonym “car repair” despite this being a very good query synonym. In fact, research has shown that the similarity between queries and documents is typically very low.

To solve the above problems we adopted a two-step approach to synonym extraction. In the first step we extract possible candidate synonyms from user’s query refinements. In the second step we create vectors for each query containing the distribution of clicked businesses and then score synonyms based on the cosine similarity of their query vectors. We store this information in a lookup table that is accessed at query time to find synonyms.

Attributes Transformer

We found a simple lookup table mapping between phrases and attributes to be very effective for extracting attributes. This transformer also illustrates one of the benefits of a unified query understanding framework – by piggy-backing off the results of the Synonyms Transformer we are able to use the knowledge that “cheap” is a synonym of “inexpensive” to extract the attribute of “$” from “inexpensive restaurant” without explicitly having the mapping of the “inexpensive” to “$” in our attribute lookup. Although the Attributes Transformer is very simple but precise, you could imagine ways of improving the recall such as using a language model for each attribute.

For example the Synonyms Transformer simply contains a dictionary to lookup synonyms that have been pre-generated by batch jobs from query logs. On the other hand, the Spelling Correction Transformer uses a more complex noisy channel model at query time to determine likely corrections.


Using Soxhlet in our business search stack, we’ve been able to roll out the ability to use query synonyms to enhance recall and relevance, more intelligently spell-correct query mistakes, better detect attributes in queries and most importantly we now have an elegant framework for extending our query understanding functionality by adding new transformers. We have already seen significant search results CTR gains as a result of these changes and have set a foundation for sharing with other parts of Yelp our concept of meaning for business search queries. Although it is a major undertaking to move many of the existing “meaning extraction,” like parts of our search stack into Soxhlet, we believe the benefits from this architecture are well worth it.

Special thanks to David S. for helping author this blog post and the Query Understanding team who helped build the platform: Chris F., Ray G., Benjamin G., Natarajan S., Maria C., and Denis L.

February 6th, 2015

Yelp Dataset Challenge is Doubling Up!

Two years, four highly competitive rounds, over $35,000 in cash prizes awarded and several hundred peer-reviewed papers later: the Yelp Dataset Challenge is doubling up. We are proud to announce our latest dataset that includes information about local businesses in 10 cities across 4 countries. This dataset contains 1.6M reviews and 500K tips by 366K users for 61K businesses along with rich attributes data (such as hours of operation, ambience, parking availability) for these businesses, social network information about the users, as well as aggregated check-ins over time for all these users. This treasure trove of local business data is waiting to be mined and we can’t wait to see you push the frontiers of data science research with our data.

Figure 1: Cities included in the Yelp Dataset Challenge Round 5

Figure 1: Cities included in the Yelp Dataset Challenge Round 5

At Yelp, one of our missions is to engage with the academic community and help them by providing real-world data to aid their research. Our dataset should appeal to researchers in data mining, machine learning, economics and urban planning alike. So whether you are building a cutting-edge Natural Language Parsing (NLP) algorithm that mines sentiments expressed by our reviewers, figuring out what business attributes (service quality, ambience, etc.) make a local business popular, or designing better cities and communities by mining local business data – our dataset has everything you need to put your research together.

New Competition: Deadline is June 30th 2015

Download the new dataset and remember to submit your entry by June 30, 2015 in order to be eligible for one of our top-project prizes of $5,000. Please note that the contest itself is open only to students. The contest will run from Feb 5, 2015 to June 30, 2015. See the website for the full terms and conditions.

New Dataset: 10 cities, 4 countries

The most recent Yelp Dataset Challenge (our fourth round) ran from August 1 – Dec 31 2014, giving students access to reviews and businesses from five cities worldwide: Phoenix, Las Vegas, and Madison in the U.S., Waterloo in Canada and Edinburgh in U.K..

In Round 5, open now, we are expanding the dataset to include data from five new cities: Pittsburgh, Urbana-Champaign, and Charlotte in the U.S., Montreal in Canada and Karlsruhe in Germany. We have also updated the data from the original five cities with any businesses that were added or new reviews and tips that were written in those cities since August 1, 2014.

To get your creative juices flowing, here are a few things that you could do with this dataset and some interesting projects from the last round:

Cultural Trends

By adding a diverse set of cities, we want participants to compare and contrast what makes a particular city different. For example, are people in international cities less concerned about driving in to a business, indicated by their lack of mention about parking? What cuisines are Yelpers raving about in these different countries? Do Americans tend to eat out late compared to the Germans and English? In which countries are Yelpers sticklers for service quality? And what about that old adage: is success of a business just location, location and location?

Seasonal Trends

What about seasonal effects: Are HVAC contractors being reviewed just at onset of winter, and manicure salons at onset of summer? Do you see any non-intuitive correlations between business categories e.g., how many karaoke bars also offer Korean food, and vice versa? Are there more reviews for sports bars on major game days and if so, could you predict that?

Natural Language Processing (NLP)

What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?

Some creative projects from Round 4:

We are still in the process of reviewing the submissions from Round 4. The response has been overwhelming and we have received over 60 well thought-out and insightful submissions. Yelp’s team of data mining engineers is still reviewing the submissions to decide the winners of the grand prizes. Meanwhile, we thought we’d share a few interesting submissions.

Figure 2: Good Food Bad Service Restaurants

Figure 2: Good Food Bad Service Restaurants

A team of Stanford Data Science majors mined our review dataset to identify the characteristics of restaurants that are consistently ranked for their good food, but bad service. If you live in Phoenix, Las Vegas, Madison, Waterloo or Edinburgh, then you can check out these Good Food Bad Service Restaurants they’ve identified. For instance, the words in blue above are those that are mentioned more often in 5-star reviews than in 1-star reviews, e.g., “duck,” “cod,” “poached,” and “crepes.” In contrast, “airport,” “occupied,” “issues,” “serving,” and “7 pm” are words used much more often in lower rated reviews. Their recommendation to “Good Food Bad Service” restaurants: hire more wait-staff for the 7 pm rush!


Figure 3: Thanksgiving is all about food while Christmas is about shopping!

Figure 3: Thanksgiving is all about food while Christmas is about shopping!

Data Science Society of UCSD wrote up a very interesting blog post describing Naïve Bayes and Random Forests based approaches towards predicting star rating of users solely based on the content in their reviews. They also created some beautiful word clouds to visualize what Yelpers like to do during the holidays. It turns out Thanksgiving is all about food while Christmas is all about shopping!

So go ahead and take our data for a spin. We can’t wait to see what you create!

February 4th, 2015

assert_called_once: Threat or Menace

I remember the first time I laid eyes on the beast: Summer, 2012. The air conditioning at Yelp HQ hummed imperceptibly as I reviewed code for a colleague. This wasn’t my first rodeo, but I was new to Yelp and to working in a large Python codebase. I painstakingly scanned the code for lurking bugs, but couldn’t find any. “Ship it!” I declared electronically, freeing my colleague to deploy his changes.

It is chilling to think that on that day, I looked the beast squarely in the eyes and never realized it. Cunning camouflage allowed it to slip past me and into production.

Hunt the Wumpus

So what is this horrific beast? I will show you, and you still might not spot it! Imagine you are that starry-eyed engineer from the Summer of ’12 and you’re reviewing production code that looks something like this:

This feature even comes with a unit test:

The main idea is that restarting actual servers whenever someone runs the test suite is not a great strategy. So instead of calling the real restart_server(), we use mock.patch() from the Python Mock library to swap in a test double. The fake restart_server(), which we call mock_restart, is a MagicMock which remembers how it has been called and provides convenience methods like assert_called_once_with to validate its experiences.

The code review includes output showing that this test passes. You can paste the code above into a file and run it with py.test if you don’t trust me:

Did you spot the beast?

It is Pitch Black

Here’s a clue: the test will happily continue to pass if you change the assertion to:




Or even:


What’s going on here?

Remember that a mock’s job is to say, “You got it, boss” whenever anyone calls it. It will do real work, like raising an exception, when one of its convenience methods is called, like assert_called_once_with. But it won’t do real work when you call a method that only resembles a convenience method, such as assert_called_once (no _with!).

You Are Likely to be Eaten by a Grue

More troubling, the test passes even if you remove the call to restart_server() from the code-under-test entirely!

This should make your blood run cold. Our test looks reasonable, emits a green “1 test passed,” but doesn’t test anything!

Production code with no test coverage is pretty bad, but this test actually has negative value. Someone spent energy writing it. More engineers will spend energy reading it. Future maintainers, not realizing that they are bedding down with the beast, will add features based on this test and these new features will also be untested. Unwary newcomers will see the useful-looking but ultimately poisonous assert_called_once() and spread it to other modules.

Kill The Beast

Now that we know the beast and how it ticks, how can we defeat it?

Fear of Commitment

Fellow Yelpers Cheng C. and Nicholas N. added a check to our pre-commit hooks that uses the compiler module to walk the AST looking for assert_called_once and a few other disguises commonly used by the beast. If the beast’s signature is detected, the commit is rejected.[1]

This band-aid was cheap to implement and provides a final, automated chance to slay the beast before it can sneak into the master branch. Blacklists are a crude instrument, however, so this hook is not a perfect defense.

You Can Go Your Own Way

After my colleague Keith M.’s first encounter with the beast (pointed out to him by battle-hardened Autumn 2014 me in a code review), he vowed to avoid the Mock convenience methods entirely:

I’ve switched to an idiom of checking mock objects that should be a lot more resilient to the fact that they are pernicious liars.

So instead of:


Keith writes:

assert mock_restart.call_count == 1
assert mock_restart.call_args =="web1-sfo")

His rationale:
The helper method saves a line, but calling things on mocks and expecting them to have side effects (like raising an error) is just asking to get bit by typos and misspellings. Meanwhile, it’s much harder for a mock to accidentally end up with very specific values occupying a random attribute.

Glasses for Your Subaru?

The authors of Mock are aware of the beast and have built a defense into the library — autospec:
If you use a class or instance as the spec for a mock then you can only access attributes on the mock that exist on the real class.

If we change our test to use the autospec defense:

Then the beast is exposed!

autospec is very effective… if you remember to use it. It’s not on by default due to “caveats and limitations“, so the beast relies on the uninitiated and forgetful to avoid being trapped by autospec.

Test-Driven Defense

The best defense against the beast is not a blacklist or a workaround or a clever library feature. It is a change in how you write software.

When I see a test containing a bogus method like assert_called_once, it tells me that the author never verified that the test could fail. A test that cannot fail is no test at all.

The easiest way to insure that your test can fail is to write the test first, watch it fail, then write production code until the test passes. This is Test-Driven Development (TDD) and in addition to being a relatively foolproof means of defeating the beast, it is the best way I know to write well-tested, well-factored, maintainable software.

TDD is what allowed me to outsmart the beast the first time. While adding a feature, I had copied from an existing, beast-infiltrated test. But because I had written the test first, I knew something was amiss when the test suite cheerfully passed. Suspicion and ipdb soon laid bare the beast’s tricks, and now I pass my experiences on to you.

So go, my friends, into the great wilderness of software. May your test suite remain faithful, honest, and beast-free.

[1] You can see this hook at

February 3rd, 2015

February Events At Yelp

The Northeast may be covered in snow, but here in sunny San Francisco, we’re operating at full steam! We got the year off to a great start with a few meetups and the launch of our new tech talk series (more on that later). This month, on top of the meetups we’re hosting, we’re also getting involved with GirlDevWeek. We’ll be hosting a panel here at Yelp HQ, as well as throwing the official after party in conjunction with Pandora. So if you’re going to be at GirlDevWeek, we hope to see you at both events!

Meetups at Yelp HQ:

January 27th, 2015

Animating the Mobile Web

One of the most engaging features of Yelp is our photos and videos gallery. When you visit a Yelp Business Page inside of the mobile app, there is a photo at the top of the page to provide visual context. It also serves as a compelling entry point to our photo viewer if you pull it down. We wanted to have this same effect on our mobile site, so we set out to develop a nice, smooth animation to pull down this photo and delight mobile web users with the same experience they’re used to on our mobile applications.


The Beginning

I was tasked with implementing this animation as part of my internship. Having little prior experience, all I knew was that when the user touches to pull down on the photo, its CSS properties should update over time to generate what is pictured above.

To feel smooth, this animation needs to run at 60 frames per second (fps). This sets our frame budget to 16ms. 16ms to perform all animation-related work necessary to render each frame: sounds like a challenge!

Scoping Out Animation

The first step was to place the background image behind the top of the page so that we can expand and scale it in the future.

After completing this, I started to manually test out how the image would expand and thought about which CSS properties were needed to accomplish the animation effect.

When the photo is being pulled down, several CSS properties should be animated over time:

  • margin-top, to control the photo’s top offset
  • opacity, to fade-in the photo as the user pulls down and to fade-out the rest of the business details page as the user pulls down
  • height/width, to scale the photo up as the user pulls down

Using Chrome Developer Tools I manually fiddled with the CSS properties of the respective DOM Elements to replicate the desired animation and after some experimentation, things looked okay. Still left to do: animate those CSS properties based on a user’s touch movements.

Handling Touch Movements

When researching about touch events, I learned that, on mobile devices, there are
three main events triggered when a user touches a screen: touchstart, touchmove, and touchend.

These three events enable JavaScript to see when and where a user’s finger starts, moves and stops. In our case, we care about the distance between their current touch point and their initial touch point.

Hence the following action plan to handle touch movements:

  • On touchstart: keep track of the initial y coordinate (let’s call it initialY) and store it for future comparisons.
  • On touchmove: get the current vertical coordinate (currentY) and compare:
    • if currentY > initialY, the user is pulling down on the photo. In this case, (currentYinitialY) represents how much a user has pulled down thus far, and can serve as a basis for CSS properties updates (more on that later.)
    • if currentY < initialY, user is trying to scroll normally
  • On touchend: redirect the user to our photo viewer or animate the page back to its original state

After coding this, I started to test how well touch events integrate with the CSS animations.

Initial Test Results

As seen below, we were blowing through our 16ms frame budget, resulting in a significantly laggy and choppy animation.

Screen Shot 2014-12-11 at 2.06.31 PM

This is a screenshot from Chrome’s profiling tool in frames mode while the pull-down animation is running. Each vertical bar represents a frame. Its height indicates the time it took to compute it. Its coloring indicate the type of work done by the browser to compute it. Read more here.

Two things are causing this:

  1. margin-top, height, and width are poor CSS properties to animate. Since updates can’t be offloaded to the GPU, animating on any of these properties takes a heavy toll on the browser, especially on mobile.
  2. Each touchmove event triggers the rendering of a new frame. This is too much rendering work for the renderer, which explain the frames dropped and the choppy animation.

    As Jon Raasch explains in a post on HTML5 Hub: “The renderer tends to choke on the large number of rendering tasks, and often isn’t able to render a certain frame before it has already received instructions to render the next. So, even though the browser renders as many frames of the animation as possible, the dropped frames still make for a choppy-looking animation, which is not to mention the performance implications of overloading the processor with as many tasks as possible.”

Animating Faster

To tackle the first problem of expensive CSS properties, I read Paul Lewis’ and Paul Irish’s post on High Performance Animations to find more efficient replacements.
Their post explained which CSS properties are best for animating on the web and lead us to use:

  • transform: translateY(), to control the photo’s top offset
  • opacity, to fade-in the photo as the user pulls down
  • opacity, to fade-out the rest of the business details page as the user pulls down
  • transform: scale(), to scale the photo up as the user pulls down

In addition to using more efficient CSS properties, I promoted each DOM element taking part in this animation to its own layer by styling them with transform:translateZ(0). This is essential because it offloads rendering work to the GPU and prevents layout thrashing (since the animated elements are on their own layers, the non-animated elements don’t need to be re-laid-out/re-painted).

Animating Smoother

To prevent frames from getting dropped due to too many rendering requests, I used requestAnimationFrame. requestAnimationFrame takes a callback that executes when the browser pushes a new frame to the screen. Essentially, the browser pulls for work at each frame, instead of us pushing work for each new touch event. This allows for concurrent animation to fit into one reflow/repaint cycle. As a result, it makes animations look much smoother because the frame rate is consistent.

Problems solved but could implementation be better?

I had the essentials for a neat photo pull down animation. However, the animation was composed of several independent animations on different DOM elements. Manually computing CSS properties’ values at each frame was unnecessarily complex. I needed a more standard & organized solution to create and run DOM animations.

GitHub and Shifty to the Rescue!

I found on GitHub a tweening engine to abstract most of the difficulties of creating an animation, called Shifty, an open-sourced lightweight tweening engine for JavaScript created by Jeremy Kahn.

Using Shifty would provide us with:

  • the calculation of CSS properties at a certain point in an animation, given a start/end value and a desired duration
  • the ability to easily seek to a certain point in an animation

However, the things Shifty wouldn’t provide us were:

  • the ability to directly apply the calculated CSS properties to specific DOM elements
  • the ability to orchestrate multiple animations simultaneously

How can we build on top of Shifty to help us with our use cases?

As a result, I created two JavaScript classes which extend from the Shifty tweening engine.


The first class is called DomTweenable. It’s the same as a Tweenable object from Shifty except that you can attach additional DOM element to the Tween. Moreover, when you seek to a specific part of the DomTweenable’s tween, the CSS properties are automatically applied to the DOM element.


The second class is called Timeline. Same as a Tweenable object from Shifty except that you can attach multiple DomTweenable objects at specific point in the Timeline’s tween. When you seek to a specific part of the Timeline’s tween, it seeks on each of the DomTweenable objects at (specified position – starting position on timeline.)


Final Result

We now have an easy way to animate on multiple DOM Elements and create smooth animations! And hey, look! Under 60 frames per second:

Screen Shot 2014-12-11 at 1.45.42 PM

Thanks to Simon Boudrias and Arnaud Brousseau for the help!

Resources on animations