Engineering Blog

March 27th, 2015

Yelp Hackathon 16: What the hack is a Kühlkdebeulefotoapparat

A couple of weeks ago, the Yelp Product & Engineering teams put their creative minds together to work on our coolest, funniest and hardcore-est ideas at the 16th edition of our internal Hackathon. As always, the food was plentiful: our kitchens were stocked with delicious catered food, fresh fruit and snacks, gourmet coffee, and, wait for it, ice-cream sandwiches! While all that deliciousness fueled our bodies, our minds were fueled by do-it-yourself Metal Earth 3D model kits and metal-etching workshops. If that wasn’t enough, we even got a chance to race these amazing Anki cars.

Our VP of Engineering kicking off the proceedings with a transatlantic mimosa toast

Over 60 fantastic projects came out of our engineering offices in San Francisco, Hamburg and London – ideas that ranged from visualizations of our rich dataset to hardcore machine learning applications to nifty internal tools designed to skyrocket developer productivity and of course, the incredible out-of-the-box projects.

The unwavering constant of Yelp Hackathons – a medley of work, play and food

Speaking of out-of-the-box projects, the dynamic duo of Yoni D. and Matthew K. decided to reinvigorate a certain 19th century invention. What started off as an idea to self-develop photographic film quickly evolved into a project to create their very own camera using the swanky 3D printer that we have here in our San Francisco office. Using the design for an existing pinhole camera, they 3D printed parts of the camera, then painted and assembled them to create a product as hardcore as its name – the Kühlkdebeulefotoapparat. Don’t know what that means? Rumor has it that it has something to do with the duo’s last names. And it takes pretty good photographs too.

Another fancy piece of gadgetry that came out of our top-secret 3D printing lab

While some folks were busy fashioning things out of thin air, others joined forces to solve an interesting problem that we face here at Yelp. We build a ton of amazing features every single day, which leaves us with a lot of data and a patchwork of tools to visualize it. We’ve gotten really comfortable using Kibana to expose our data in Elasticsearch, but could we somehow use the same UI that we know and love to visualize our great data on Redshift? A team of Garrett J., Jimming C. and Tobi O. decided to answer exactly that. Poking into the Kibana code (yay open source!), they realized that while the project was still under active development, it had a well-defined interface for querying Elasticsearch. They built a translation server that modeled a similar interface for Redshift that Kibana could understand. The result: the ability to create new dashboards and visualizations that will help us gain more insight into our data!
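
To make the idea concrete, here’s a minimal, hypothetical sketch of such a translation endpoint (this is not the team’s actual hack, and the endpoint, table and column names are made up):

```python
# A hypothetical sketch: accept a (much simplified) Elasticsearch-style query from
# Kibana and answer it from Redshift. Table, column and host names are invented.
from flask import Flask, jsonify, request
import psycopg2

app = Flask(__name__)

@app.route('/logs/_search', methods=['POST'])
def search():
    body = request.get_json()
    # Translate a simple "match" query into SQL; real Kibana queries are far richer.
    term = body.get('query', {}).get('match', {}).get('message', '')
    conn = psycopg2.connect(dbname='analytics', host='redshift.example.internal', port=5439)
    with conn.cursor() as cur:
        cur.execute("SELECT ts, message FROM logs WHERE message ILIKE %s LIMIT 100",
                    ('%' + term + '%',))
        hits = [{'_source': {'ts': str(ts), 'message': msg}} for ts, msg in cur]
    conn.close()
    return jsonify({'hits': {'total': len(hits), 'hits': hits}})
```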

Our hackers showing off their projects at a “science-fair” style exhibition

While some folks spent their time coming up with elegant solutions to hard problems, a team made up of Duncan C. and Matthew M. did something, um, quite the opposite. In the true spirit of being “unboring,” they created “DumbStack” – an insanely complicated web of machinery spanning Webfaction, Slack, AppEngine, SQS, EC2, Heroku, Tumblr, Netcat, GitHub and Google Translate to solve a simple problem: posting a tweet.

Definitely the easiest way to post to Twitter, right?

This project certainly would’ve made Rube Goldberg proud!

Have some creative ideas brewing in your head? Why don’t you join us as we embark on our next awesome Hackathon this summer? We are constantly on the lookout for amazing engineers, product managers and data scientists to bolster our team. Check out our exciting product and engineering job openings at www.yelp.com/careers and apply today.

Hack on!

March 20th, 2015

Analyzing the Web For the Price of a Sandwich

I geek out about the Common Crawl. It’s an open source crawl of huge parts of the Internet, accessible for anyone to use. You have full access to the HTML and text of billions of web pages. What’s more, you can scan the entire thing, tens of terabytes, for just a few bucks on Amazon EC2. These days they’re releasing a new dataset every month. It’s awesome.

People frequently use mrjob to scan the Common Crawl, so it seems like a fitting tool for us to use. mrjob, if you’re not familiar, is a Python framework written by Yelp to help run Hadoop jobs locally or on Amazon’s EMR service. Since the Common Crawl is stored in Amazon’s S3, it makes a lot of sense to use EMR to access it.

The Problem

I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.

Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.

The Approach

Right away I realized that this is a huge problem space. Sophisticated solutions would use NLP, fuzzy matching, cluster analysis, a whole slew of signals and methods. I wanted to come up with something simple, more of a proof-of-concept. If some basic approaches yielded decent results, then I could easily justify developing more sophisticated methods to explore this dataset.

I started thinking about phone numbers. They’re easy to parse, identify, and don’t require any fuzzy matching. A phone number either matches, or it doesn’t. If I could match phone numbers from the Common Crawl with phone numbers of businesses in the Yelp database, it may lead to actually finding the web pages for those businesses.

I started planning my MapReduce job:

[Diagram: the planned MapReduce job – map Common Crawl pages and Yelp businesses to phone-number keys, then combine them in the reducer]

I start with a mapper that takes both Common Crawl WET pages (WET pages are just web page text) and Yelp business data. For each page from the Common Crawl, I parse the page looking for phone numbers and yield the URLs keyed off of each phone number that I find. For the Yelp business data, I yield the ID of the business, keyed off of the phone number that we have on record, which lets me start combining any businesses and URLs that match the same phone number in my reduce step.

Spammy Data

I ran this against a small set of input data to make sure it was working correctly but already knew that I was going to have a problem. There are some websites that just like to list every possible phone number and then there are sites like Yelp which have pages dedicated to individual businesses. A website that simply lists every possible phone number is definitely not going to be a legitimate business page. Likewise, a site like Yelp that has a lot of phone numbers on it is unlikely to be the true home page of any particular business. I need a way to filter these results.

So let’s set up a step where we organize all the entries keyed by domain. That way, if it looks like there are too many businesses hosted on a single domain, we can filter them out. Granted, this isn’t perfect (it may exclude large national chains), but it’s a place to start.

[Diagram: a second step that groups candidate matches by domain and filters out spammy domains]
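
Sketched in code, the filtering step might look something like the following; the threshold and function names are illustrative, and in the real job these would be wired up as reducer steps of the mrjob:

```python
# A hedged sketch of the two reduce stages described above, not the actual job.
from urlparse import urlparse  # Python 2, as mrjob jobs were at the time

MAX_BUSINESSES_PER_DOMAIN = 5  # arbitrary cutoff for "spammy" aggregator domains


def rekey_by_domain(phone, values):
    """First reduce: values are ('biz', id) / ('url', url) pairs sharing one phone number."""
    values = list(values)
    biz_ids = [v for kind, v in values if kind == 'biz']
    urls = [v for kind, v in values if kind == 'url']
    # Re-key every candidate (business, url) pair by the url's domain so the next
    # step can spot domains that claim too many different businesses.
    for biz_id in biz_ids:
        for url in urls:
            yield urlparse(url).netloc, (biz_id, url)


def filter_spammy_domains(domain, pairs):
    """Second reduce: drop domains that match too many distinct businesses."""
    pairs = list(pairs)
    if len({biz_id for biz_id, _ in pairs}) <= MAX_BUSINESSES_PER_DOMAIN:
        for biz_id, url in pairs:
            yield biz_id, url
```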

The Input

The Common Crawl data files are in WARC format. The warc python package can help us parse these files, but WARC itself isn’t a line-by-line format that’d be suitable as direct input to our job. Instead, the input to our mapper will be the paths to WARC files stored in S3. Conveniently, the Common Crawl provides a file that is exactly that.

According to the Common Crawl blog, we know that the WET paths for the December 2014 crawl live at:

https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz

We’ll supply that path, along with the path to a similar file describing our Yelp dataset, as input to our mrjob.

Here’s an excerpt from the full job that illustrates how we load the WARC files and map them into phone numbers and URLs:
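
Here is a hedged sketch of what such a mapper might look like; the class name, the phone regex and the tab-separated format of the Yelp input are assumptions rather than the actual code:

```python
# A hedged sketch, not the actual job: fetch each WET segment with boto, parse it
# with the `warc` package, and emit (phone, value) pairs for both kinds of input.
import re
import tempfile

import boto
import warc
from mrjob.job import MRJob

# Rough US phone pattern; see "Identifying Phone Numbers" below.
PHONE_RE = re.compile(r'\(?\b([2-9]\d{2})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b')


class MRBusinessMatcher(MRJob):

    def mapper(self, _, line):
        line = line.strip()
        if line.startswith('common-crawl/'):
            # A WET path: download the ~150MB segment and scan every page in it.
            bucket = boto.connect_s3(anon=True).get_bucket('aws-publicdatasets')
            with tempfile.NamedTemporaryFile(suffix='.warc.wet.gz') as tmp:
                bucket.get_key(line).get_contents_to_filename(tmp.name)
                for record in warc.open(tmp.name):
                    url = record.header.get('WARC-Target-URI')
                    if not url:
                        continue
                    text = record.payload.read()
                    for match in PHONE_RE.finditer(text):
                        yield ''.join(match.groups()), ('url', url)
        else:
            # Otherwise assume a line of Yelp data: "<business_id>\t<phone>".
            business_id, phone = line.split('\t', 1)
            yield phone, ('biz', business_id)
```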

Performance Problem: Distribute Our Input To More Mappers

The default Hadoop configuration assumes that there are many lines of input and each individual line is relatively fast to process. It designates many lines to a single map task.

In this case, given that our input is a list of WET paths, there are relatively few input lines, but each one is expensive to process. Each WET file is about 150MB, so each line actually represents a whole lot more work than a single line usually would. For best performance, we want to dole out each input line to a separate mapper.

Fortunately this is pretty easy to fix by specifying NLineInputFormat as our input format:
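
A minimal sketch, assuming mrjob’s HADOOP_INPUT_FORMAT hook and the classic Hadoop streaming class path (the job name is illustrative):

```python
from mrjob.job import MRJob


class MRBusinessMatcher(MRJob):
    # Hand each input line (one WET path) to its own map task instead of
    # packing many lines into a single mapper.
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
```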

Note that NLineInputFormat will reformat our input a bit: each line becomes a byte offset and the original S3 path, separated by a '\t', so we need to treat our input as tab-delimited key, value pairs and simply ignore the key. mrjob’s RawProtocol can do this:
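
Continuing the sketch from above:

```python
from mrjob.job import MRJob
from mrjob.protocol import RawProtocol


class MRBusinessMatcher(MRJob):
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    # Read each input line as a raw (key, value) pair split on the tab.
    INPUT_PROTOCOL = RawProtocol

    def mapper(self, offset, s3_path):
        # The key (an offset added by NLineInputFormat) is ignored; s3_path is
        # the WET path to fetch and scan, as in the earlier excerpt.
        pass
```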

Identifying Phone Numbers

This part isn’t too complicated. To find possible phone numbers on a given page, I use a simple regex. (I also tried the excellent phonenumbers Python package, which is far more robust but also much slower; it wasn’t worth it here.) The regex could be expanded to handle international phone numbers, but for now we’re only looking at US-based businesses.
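
For illustration, a pattern along these lines would do the job (a rough example, not the exact regex from the post):

```python
import re

# Matches common US formats like "415-555-0198", "(415) 555-0198" or "415.555.0198".
PHONE_RE = re.compile(r'\(?\b([2-9]\d{2})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b')


def find_phone_numbers(text):
    """Return each match normalized to a bare 10-digit string."""
    return {''.join(m.groups()) for m in PHONE_RE.finditer(text)}
```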

Running the Job (On the Cheap!)

The WET files for the December 2014 Common Crawl are about 4TB compressed. That’s a fair bit of data, but we can process it pretty quickly with a few high-powered machines. Given that AWS charges you by the instance-hour, it’s most cost effective to use just as many instances as it takes to process your job in a little less than 60 minutes. I found that I could process this job in about an hour with 20 c3.8xlarge instances.

The normal rate for a c3.8xlarge is currently $1.68/hour, plus $0.27/hour for EMR. Under standard pricing, then, our job would cost ($1.68 + $0.27) * 20 = $39. But with spot pricing, we can do much better!

Spot pricing lets you get EC2 instances at much lower rates when there is unused capacity. The current spot price for a c3.8xlarge in us-east-1 is around $0.26. We still have to pay the additional EMR charge, but this gets us to a much more reasonable ($0.26 + $0.27) * 20 = $10.60.

Results

The mrjob found approximately 748 million US phone numbers in the December 2014 Common Crawl dataset. Of the businesses in the Yelp database that we were able to match by phone number, 48% already had URLs associated with them. If we assume the URLs already stored for those Yelp businesses are correct, we can get a rough estimate of accuracy by comparing them against the URLs that our mrjob identified.

So of the businesses that matched and already had a URL, the URL we found was the same one we already had 48% of the time, and the domains matched 61% of the time. If we wanted to use this for real, we’d probably want to combine this signal with others to get higher accuracy. But this isn’t a bad first step.

Conclusion

The Common Crawl is a great public resource. You can scan over huge portions of the web with some simple tools, for the price of a sandwich. mrjob makes this super easy with Python. Give it a try!

March 11th, 2015

Using Services to Break Down Monoliths

Introduction

At Yelp we value our ability to quickly ship code. We’re constantly pushing changes out to production, and we even encourage our interns to ship code on their first day. We’ve managed to maintain this pace even as the engineering team has grown to over 300 people and our codebase has reached several million lines of Python code (with a helping of Java on the side).

One key factor in maintaining our iteration speed has been our move to a service oriented architecture. Initially, most of our development work occurred in a single, monolithic web application, creatively named ‘yelp-main.’ As the company grew, our developers were spending increasing amounts of time trying to ship code in yelp-main. After recognizing this pain point, we started experimenting with a service oriented architecture to scale the development process, and so far it’s been a resounding success. Over the course of the last three years, we’ve gone from writing our first service to having over seventy production services.

What kinds of services are we writing at Yelp? As an example, let’s suppose you want to order a burrito for delivery. First of all, you’ll probably make use of our autocompletion service as you type in what you’re looking for. Then, when you click ‘search,’ your query will hit half a dozen backend services as we try to understand your query and select the best businesses for you. Once you’ve found your ideal taqueria and made your selection from the menu, details of your order will hit a number of services to handle payment, communication with partners, etc.

What about yelp-main? There’s a huge amount of code here, and it’s definitely not going to disappear anytime soon (if ever). However, we do see yelp-main increasingly becoming a frontend application, responsible for aggregating and rendering data from our growing number of backend services. We are also actively experimenting with creating frontend services so that we can further modularize the yelp-main development process.

The rest of this blog post roughly falls into two categories: general development practices and infrastructure. For the former, we discuss how we help developers write robust services, and then follow up with details on how we’ve tackled the problems of designing good interfaces and testing services. We then switch focus to our infrastructure, and discuss the particular technology choices we’ve made for our service stack. This is followed by discussions on datastores and service discovery. We conclude with details on future directions.

Education

Service oriented architecture forces developers to confront the realities of distributed systems such as partial failure (oops, that service you were talking to just crashed) and distributed ownership (it’s going to take a few weeks before all the other teams update their code to start using v2 of your fancy REST interface). Of course, all of these challenges are still present when developing a monolithic web application, but to a lesser extent.

We strive to mitigate many of these problems with infrastructure (more on that later!), but we’ve found that there’s no substitute for helping developers really understand the realities of the systems that they’re building. One approach that we’ve taken is to hold a weekly ‘services office hours,’ where any developer is free to drop in and ask questions about services. These questions cover the full gamut, from performance (“What’s the best way to cache this data?”) to architecture (“We’ve realized that we should have really implemented these two separate services as a single service – what’s the best way to combine them?”).

We’ve also created a set of basic principles for writing and maintaining services. These principles include architecture guidelines, such as how to determine whether a feature is better suited to a library or whether the overhead of a service is justified. They also contain a set of operational guidelines so that service authors are aware of what they can expect once their service is up and running in production. We’ve found that it can often be useful to refer to these principles during code reviews or design discussions, as well as during developer onboarding.

Finally, we recognize that we’re sometimes going to make mistakes that negatively impact either our customers or other developers. In such cases we write postmortems to help the engineering team learn from these mistakes. We work hard to eliminate any blame from the postmortem process so that engineers do not feel discouraged from participating.

Interfaces

Most services expose HTTP interfaces and pass around any structured data using JSON. Many service authors also provide a Python client library that wraps the raw HTTP interface so that it’s easier for other teams to use their service.

There are definite tradeoffs in our choice to use HTTP and JSON. A huge benefit of standardizing on HTTP is that there is great tooling to help with debugging, caching and load balancing. One of the more significant downsides is that there’s no standard solution for defining service interfaces independently of their implementation (in contrast to technologies such as Thrift). This makes it hard to precisely specify and check interfaces, which can lead to nasty bugs (“I thought your service returned a ‘username’ field?”).

We’ve addressed this issue by using Swagger. Swagger builds on the JSON Schema standard to provide a language for documenting the interface of HTTP/JSON services. We’ve found that it’s possible to retrofit Swagger specifications onto most of our services without having to change any of their interfaces. Given a Swagger specification for a service, we use swagger-py to automatically create a Python client for that service. We also use Swagger UI to provide a centralized directory of all service interfaces. This allows developers from across the engineering team to easily discover what services are available, and helps prevent duplicated effort.
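
As an illustration (not an actual Yelp spec, and the real specs may target a different Swagger version), a fragment of a Swagger definition for a small endpoint might look like this, written here as a Python dict:

```python
# Illustrative only: the path, parameters and response fields are made up.
BUSINESS_SERVICE_SPEC = {
    'swagger': '2.0',
    'info': {'title': 'business_info', 'version': '1.0'},
    'paths': {
        '/business/{business_id}': {
            'get': {
                'parameters': [
                    {'name': 'business_id', 'in': 'path', 'required': True, 'type': 'string'},
                ],
                'responses': {
                    '200': {
                        'description': 'Core fields for one business',
                        'schema': {
                            'type': 'object',
                            'properties': {
                                'business_id': {'type': 'string'},
                                'name': {'type': 'string'},
                                'phone': {'type': 'string'},
                            },
                        },
                    },
                },
            },
        },
    },
}
```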

Testing

Testing within a service is fairly standard. We replace any external service calls with mocks and make assertions about the calls. The complexity arises when we wish to perform tests that span services. This is one of the areas where a service oriented architecture brings additional costs. Our first attempt at this involved maintaining canonical ‘testing’ instances of services. This approach led to test pollution for stateful services, general flakiness due to remote service calls and an increased maintenance burden for developers.

In response to this issue we’re now using Docker containers to spin up private test instances of services (including databases). The key idea here is that service authors are responsible for publishing Docker images of their services. These images can then be pulled in as dependencies by other service authors for acceptance testing their services.
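
A hedged sketch of what an acceptance-test fixture built on this idea might look like (the image name, port and endpoint are hypothetical):

```python
# Spin up a service's published Docker image for the duration of the test session.
import subprocess
import time

import pytest
import requests


@pytest.fixture(scope='session')
def business_service():
    container_id = subprocess.check_output(
        ['docker', 'run', '-d', '-p', '8080:8080',
         'docker-registry.example.internal/business-service:latest']).decode().strip()
    time.sleep(2)  # crude wait; a real fixture would poll a health endpoint
    yield 'http://localhost:8080'
    subprocess.check_call(['docker', 'rm', '-f', container_id])


def test_get_business(business_service):
    response = requests.get(business_service + '/business/some-business-id')
    assert response.status_code == 200
```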

Service stack

The majority of our services are built on a Python service stack that we’ve assembled from several different open-source components. We use Pyramid as our web framework and SQLAlchemy for our database access layer, with everything running on top of uWSGI. We find that these components are stable, well-documented and have almost all of the features that we need. We’ve also had a lot of success with using gevent to eke additional performance out of Python services where needed. Our Elasticsearch proxy uses this approach to scale to thousands of requests per second for each worker process.

We use Pyramid tweens to hook into request lifecycles so that we can log query and response data for monitoring and historical analysis. Each service instance also runs uwsgi_metrics to capture and locally aggregate performance metrics. These metrics are then published on a standard HTTP endpoint for ingestion into our Graphite-based metrics system.
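
A simplified sketch of such a tween (the logger name and logged fields are illustrative; a real tween would also feed timings into uwsgi_metrics):

```python
import logging
import time

log = logging.getLogger('service.requests')


def request_logging_tween_factory(handler, registry):
    def tween(request):
        start = time.time()
        response = handler(request)
        # Log method, path, status and elapsed time for monitoring and analysis.
        log.info('%s %s -> %s in %.1fms', request.method, request.path,
                 response.status_code, (time.time() - start) * 1000)
        return response
    return tween

# Registered in the Pyramid app's configuration, e.g.:
#   config.add_tween('myservice.tweens.request_logging_tween_factory')
```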

Another topic is how services access third-party libraries. Initially we used system-installed packages that were shared across all service instances. However, we quickly discovered that rolling out upgrades to core packages was an almost impossible task due to the large number of services that could potentially be affected. In response to this, all Python services now run in virtualenvs and source their dependencies from a private PyPI server. Our PyPI server hosts both the open-source libraries that we depend upon, and our internally-released libraries (built using a Jenkins continuous deployment pipeline).

Finally, it’s important to note that our underlying infrastructure is agnostic with respect to the language that a service is written in. The Search Team has used this flexibility to write several of their more performance-critical services in Java.

Datastores

A significant proportion of our services need to persist data. In such cases, we try to give service authors the flexibility to choose the datastore that is the best match for their needs. For example, some services use MySQL databases, whereas others make use of Cassandra or Elasticsearch. We also have some services that make use of precomputed, read-only data files. Irrespective of the choice of datastore, we try to keep implementation details private to the owning service. This gives service authors the long-term flexibility to change the underlying data representation or even the datastore.

The canonical version of much of our core data, such as businesses, users and reviews, still resides in the yelp-main database. We’ve found that extracting this type of highly relational data out into services is difficult, and we’re still in the process of finding the right way to do it. So how do services access this data? In addition to exposing the familiar web frontend, yelp-main also provides an internal API for use by backend services in our datacenters. This API is defined using a Swagger specification, and for most purposes can be viewed as just another service interface.

Discovery

A core problem in a service oriented architecture is discovering the locations of other service instances. Our initial approach was to manually configure a centralized pair of HAProxy load balancers in each of our datacenters, and embed the virtual IP address of these load balancers in configuration files. We quickly found that this approach did not scale. It was labor intensive and error prone both to deploy new services and also to move existing services between machines. We also started seeing performance issues due to the load balancers becoming overloaded.

We addressed these issues by switching to a service discovery system built around SmartStack. Briefly, each client host now runs an HAProxy instance that is bound to localhost. The load balancer is dynamically configured from service registration information stored in ZooKeeper. A client can then contact a service by connecting to its localhost load balancer, which will then proxy the request to a healthy service instance. This facility has proved itself highly reliable, and the majority of our production services are now using it.
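
From a client’s point of view the pattern is very simple; the port and endpoint below are made up, but the idea is that each registered service gets a well-known local port on every client host:

```python
import requests

# Talk to the local HAProxy, which proxies to a healthy instance found via ZooKeeper.
GEOCODER_LOCAL_URL = 'http://localhost:20001'

response = requests.get(GEOCODER_LOCAL_URL + '/neighborhoods',
                        params={'lat': 37.78, 'lon': -122.41})
response.raise_for_status()
```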

Future directions

We are currently working on a next-generation service platform called Paasta. This uses the Marathon framework (running on top of Apache Mesos) to allocate containerized service instances across clusters of machines, and integrates with SmartStack for service discovery. This platform will allow us to treat our servers as a much more fungible resource, freeing us from the issues associated with statically assigning services to hosts. We will be publishing more details about this project later this year.

Acknowledgements

The work described in this blog post has been carried out and supported by numerous members of the Engineering Team here at Yelp. Particular credit goes to Prateek Agarwal, Ben Chess, Sam Eaton, Julian Krause, Reed Lipman, Joey Lynch, Daniel Nephin and Semir Patel.

March 11th, 2015

March Events @ Yelp

This month we’re ramping up and preparing for an awesome time at PyCon. We’ll be there in full force next month, so look for us at booth 606! Be sure to catch a presentation by our own Soups R. on Friday, April 10 at 12:10, where he’ll be speaking on Data Science in Advertising: Or a future when we love ads. In the meantime, hopefully you aren’t too sleepy from daylight saving time to attend some great events this month:

  • Wednesday, March 11, 2015 – 6:00PM – Tech talks and PyCon Startup Row Pitches (SF Python)
  • Thursday, March 19, 2015 – 7:00PM – The Best Interface Is No Interface (Designers + Geeks)
  • Tuesday, March 24, 2015 – 6:00PM – Mini PyCon 2015 (PyLadies)
  • Wednesday, March 25, 2015 – 7:00PM – Workplace Evolved (Products That Count)
  • Thursday, March 26, 2015 – 7:00PM – Analytics + Visualization for Large-Scale Neuroscience (SF Big Analytics)

February 11th, 2015

Reading Between the Lines: How We Make Sense of Users’ Searches

The Problem

People expect a lot out of search queries on Yelp. Understanding exact intent from a relatively vague text input is challenging. A few months ago, the Search Quality team felt like we needed to take a step back and reassess how we were thinking about a user’s search so that we could return better results for a richer set of searches.

Our main business search stack takes into account many kinds of features that can each be classified as being related to one of distance, quality and relevance. However, sometimes these signals encode related meaning, making the equivalence between them difficult to understand (e.g., we know a business is marked as “$ – inexpensive” and that people mention “cheap” in their reviews), but often they capture orthogonal information about a business (e.g., reviews are not a reliable source of a business’ hours). We need a stage where we discern which parts of a query tell us what types of information should be searched over.

It’s also useful to embellish the plain text of a search query with context. Are there related words that are very relevant to a substring of text? Is it possible that the user misspelled part of the query? Having this extended information would enhance our ability to recall relevant businesses, especially in low content areas.

[Screenshot: search results for a “mimosa brunch” search in San Francisco]

Let’s consider a more concrete example, such as a search for “mimosa brunch” in San Francisco. The above search results show some features that our desired (and now current) query understanding system should be able to extract:

  • “mimosa” has the related search term “champagne”
  • “brunch” maps to the attribute “Good for Brunch”

All these enhancements sounded great but were becoming increasingly difficult to implement. Our old search stack combined the understanding of the intent and semantics of a query and its execution into a single step. This resulted in engineers having a hard time adding new quality features: concepts were overloaded (the word “synonym” had very different meanings in different parts of the code), questions of semantics were tied up in execution implementation, and there was no way for other parts of Yelp to understand the semantic knowledge locked up in our Lucene search stack.

Semantics as Abstraction

To help solve these problems, we found it useful to formally split up the process of extracting meaning from the process of executing a search query. To that end, we created a component named Soxhlet (named after a kind of extractor used in chemistry labs) which runs before search execution, and whose sole job is to extract meaning from a raw text query. It turns the input text into a Rich Query that holds structured semantics in the context of business search. This Rich Query then becomes the preferred abstraction for the rest of the search stack to execute upon.

[Diagram: Soxhlet turns the raw query text into a Rich Query before search execution]

Encoding Meaning

What does this Rich Query data structure look like? When we were first whiteboarding possible representations we tended to underline parts of the query and assign semantics to each part.

[Illustration: parts of the query underlined, with semantics assigned to each part]

As you can see from the above representation, there are complex semantics encoded in a signal that is mostly textual (the query text). Therefore, we think that annotating the query string is a good way to represent a Rich Query: ranges of the original query text are marked as having typed information via various kinds of Annotations.

For those of you familiar with Lucene, this data structure is very similar to a TokenStream but with annotations instead of attributes. This makes consumption of the Rich Query in the recall and relevance steps straightforward while at the same time giving enough abstraction so that non-search components can also easily consume Rich Queries.
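
A schematic sketch of the idea in Python (the real implementation lives in our search stack, and the class and field names here are illustrative):

```python
from collections import namedtuple

# An annotation covers query_text[start:end] and carries a type-specific payload
# plus a confidence score.
Annotation = namedtuple('Annotation', ['start', 'end', 'kind', 'payload', 'confidence'])


class RichQuery(object):
    def __init__(self, text, annotations=None):
        self.text = text
        self.annotations = list(annotations or [])

    def with_annotation(self, annotation):
        """Return a new RichQuery with one more annotation (Transformers add, never mutate)."""
        return RichQuery(self.text, self.annotations + [annotation])

    def annotations_of(self, kind):
        return [a for a in self.annotations if a.kind == kind]
```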

As an example, consider the Rich Query extracted for a query popular with our team.
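
Using the illustrative classes above, the annotations for a query like “cheap dinner” might look roughly like this (the values are invented for illustration):

```python
query = RichQuery('cheap dinner')
# Token annotations produced by a tokenization Transformer.
query = query.with_annotation(
    Annotation(start=0, end=5, kind='TokenAnnotation',
               payload={'token': 'cheap'}, confidence=1.0))
query = query.with_annotation(
    Annotation(start=6, end=12, kind='TokenAnnotation',
               payload={'token': 'dinner'}, confidence=1.0))
# The attribute annotation covers the range of "cheap" and maps it to the "$" attribute.
query = query.with_annotation(
    Annotation(start=0, end=5, kind='AttributeQueryAnnotation',
               payload={'attribute': '$'}, confidence=1.0))
```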

When our core relevance stack (a service called “Lucy”) analyzes the above Rich Query, the AttributeQueryAnnotation tells it to also search over the attribute field of businesses for the “$” attribute. Furthermore, the confidence of 1.0 implies that an attribute match should be considered equivalent to a textual match for “cheap.” Other annotations may have lower confidences associated with them if their extraction process is less precise, and this is factored into the scoring function of the constructed Lucene queries.

Because Annotations record how they overlap in the original search query, it’s now possible to easily cross-reference Annotations with each other and with the original text. The recall and relevance stages of the search stack use this ability to ensure that the right constraints are applied.

Extraction Process

What is the architecture within Soxhlet that allows us to turn a plain query into a Rich Query? Our first observation is that we should have specialized components for producing each type of Annotation. Our second observation is that basic Annotations can help identify more complex ones; you could imagine taking advantage of detected synonyms to properly identify cost attributes in the search query string, etc.

From these observations, we decided that a good starting design was a series of transformation steps that iteratively adds new Annotation types. Each one of these steps is called a Transformer. Each transformer takes a Rich Query as input, looks at all existing Annotations and the original query text, and produces a modified Rich Query as output that possibly contains new annotations. Stripped of boilerplate, the code for the top-level Soxhlet class is very simple:

[Code screenshot: the top-level Soxhlet class]
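
Paraphrased in Python (a sketch of the structure described above, reusing the RichQuery sketch from earlier, not the actual class):

```python
class Soxhlet(object):
    def __init__(self, transformers):
        self.transformers = transformers

    def extract(self, query_text):
        rich_query = RichQuery(query_text)
        # Each Transformer sees the annotations added so far and may add its own.
        for transformer in self.transformers:
            rich_query = transformer.transform(rich_query)
        return rich_query
```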

This architecture also provides us with a mechanism for sharing prerequisite work between Transformers, while not forcing coupling between earlier Transformers or the search recall and relevance steps. As an example, although all Transformers could operate on the raw query string, many kinds of text analysis work on the token level, so many Transformers would have to duplicate the work of tokenizing the query string. Instead, we can have a tokenization Transformer create token Annotations from the raw string, and then every subsequent Transformer has access to that tokenization information.

This optimization should be used carefully, though. With Soxhlet being internationalized into dozens of languages (and more being regularly added), normalization (accent removal, stemming, etc.) is an important part of extracting rich queries. It’s tempting to store normalized queries at the beginning of the extraction process, but we have avoided doing so to prevent coupling among Transformers, and between Soxhlet and other services such as the recall and relevance stages of the search stack.

The Transformers themselves can vary in complexity so let’s dive into some details on how they actually work:

Spelling Transformer

We actually use two different types of spelling transformations. “Spelling Suggestion” is where we still perform the Rich Query extraction and search on the original query but offer a “Did You Mean” link in the results for queries that could be good suggestions. These suggestions are computed at query time by generating candidate suggestions within a small edit distance of the original query and then scoring them with a noisy channel model that takes into account query priors and likelihood of edits.
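
The scoring idea can be sketched as follows; the helper functions stand in for models built offline from query logs, and all names are illustrative:

```python
def score_suggestion(original, candidate, query_log_prior, edit_log_likelihood):
    # log P(candidate) + log P(original | candidate): how plausible the candidate
    # query is on its own, and how plausible this particular misspelling would be.
    return query_log_prior(candidate) + edit_log_likelihood(original, candidate)


def suggest(original, candidates, query_log_prior, edit_log_likelihood, threshold):
    """Return the best-scoring candidate if it clears the confidence bar, else None."""
    best = max(candidates, key=lambda c: score_suggestion(
        original, c, query_log_prior, edit_log_likelihood))
    if score_suggestion(original, best, query_log_prior, edit_log_likelihood) > threshold:
        return best
    return None
```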

“Spelling Correction” occurs when we are very confident about a correction and perform the Rich Query extraction and search on the corrected query. These corrections are generated by promoting Spelling Suggestions that have performed well in the past, based on click rates for their “Did You Mean” link. At query time we simply look up the correction in a precomputed map from query to correction.

This two-tiered approach is very effective: it allows us to achieve high precision with Spelling Correction without sacrificing recall, since lower-confidence candidates can still surface as Spelling Suggestions. Our future plans for spelling include improving internationalization and incorporating more advanced features, such as a phonetic algorithm, into our suggestion scorer.

Synonyms Transformer

Context is very important to synonym extraction and this drove a number of our design decisions:

  • The same word can have different synonyms in different query contexts. For example, the word “sitting” in the query “pet sitting” has the synonym “boarding”. However, the same word in the query “house sitting” probably shouldn’t have “boarding” as a synonym.
  • The same word in the same query can have different synonyms in different domain contexts. For example the word “haircut” could have the synonym “barber” in a local business search engine such as Yelp. However, on an online shopping website a better synonym would be “hair clippers”.
  • The same word can be used very differently in queries compared to documents. For example, it might be common for a review to mention “I got my car repaired here”, but it would be very unusual to write “I got my auto repaired here”. As a result, an algorithm that uses only document data to extract synonyms would likely miss that “auto repair” has the synonym “car repair”, despite this being a very good query synonym. In fact, research has shown that the similarity between queries and documents is typically very low.

To solve the above problems we adopted a two-step approach to synonym extraction. In the first step we extract possible candidate synonyms from users’ query refinements. In the second step we create vectors for each query containing the distribution of clicked businesses, and then score candidate synonym pairs by the cosine similarity of their query vectors. We store this information in a lookup table that is accessed at query time to find synonyms.
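
The second step can be sketched as a cosine similarity over sparse click vectors (dictionaries mapping business IDs to click counts); the data structures here are assumptions for illustration:

```python
import math


def cosine_similarity(clicks_a, clicks_b):
    """Score two queries by how similar their clicked-business distributions are."""
    dot = sum(count * clicks_b.get(biz_id, 0) for biz_id, count in clicks_a.items())
    norm_a = math.sqrt(sum(c * c for c in clicks_a.values()))
    norm_b = math.sqrt(sum(c * c for c in clicks_b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

# e.g. cosine_similarity(clicks['pet sitting'], clicks['pet boarding']) is high
# when both queries lead users to click on the same businesses.
```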

Attributes Transformer

We found a simple lookup table mapping phrases to attributes to be very effective for extracting attributes. This transformer also illustrates one of the benefits of a unified query understanding framework: by piggy-backing off the results of the Synonyms Transformer, we can use the knowledge that “cheap” is a synonym of “inexpensive” to extract the “$” attribute from “inexpensive restaurant” without explicitly having a mapping from “inexpensive” to “$” in our attribute lookup. The Attributes Transformer is simple but precise; you could imagine ways of improving its recall, such as using a language model for each attribute.
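
A minimal sketch of the lookup approach, with the table contents and helper name invented for illustration:

```python
# Check both the literal phrase and any synonyms already extracted for it.
ATTRIBUTE_TABLE = {
    'cheap': '$',
    'good for brunch': 'GoodForBrunch',
}


def extract_attributes(phrase, synonyms):
    """Return attributes for a query phrase, piggy-backing on extracted synonyms."""
    candidates = [phrase] + list(synonyms)
    return {ATTRIBUTE_TABLE[c] for c in candidates if c in ATTRIBUTE_TABLE}

# extract_attributes('inexpensive', synonyms=['cheap']) -> {'$'}
```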

As noted earlier, Transformers vary in complexity. The Synonyms Transformer simply contains a dictionary for looking up synonyms that have been pre-generated by batch jobs from query logs. The Spelling Correction Transformer, on the other hand, uses a more complex noisy channel model at query time to determine likely corrections.

Conclusion

Using Soxhlet in our business search stack, we’ve been able to roll out the ability to use query synonyms to enhance recall and relevance, more intelligently spell-correct query mistakes, and better detect attributes in queries. Most importantly, we now have an elegant framework for extending our query understanding functionality by adding new Transformers. We have already seen significant gains in search result CTR as a result of these changes, and we have laid a foundation for sharing our concept of meaning for business search queries with other parts of Yelp. Although it is a major undertaking to move the existing meaning-extraction parts of our search stack into Soxhlet, we believe the benefits of this architecture are well worth it.

Special thanks to David S. for helping author this blog post and the Query Understanding team who helped build the platform: Chris F., Ray G., Benjamin G., Natarajan S., Maria C., and Denis L.