This month we’re ramping up and preparing for an awesome time at PyCon. We’ll be there in full force next month, so look for us at booth 606! Be sure to catch a presentation by our own Soups R. on Friday, April 10 at 12:10, where he’ll be speaking on “Data Science in Advertising: Or a future when we love ads.” In the meantime, hopefully you aren’t too sleepy from daylight saving time to attend some great events this month:
Wednesday, March 11, 2015 – 6:00PM – Tech talks and PyCon Startup Row Pitches (SF Python)
Thursday, March 19, 2015 – 7:00PM – The Best Interface Is No Interface (Designers + Geeks)
Tuesday, March 24, 2015 – 6:00PM – Mini PyCon 2015 (PyLadies)
People expect a lot out of search queries on Yelp. Understanding exact intent from a relatively vague text input is challenging. A few months ago, the Search Quality team felt like we needed to take a step back and reassess how we were thinking about a user’s search so that we could return better results for a richer set of searches.
Our main business search stack takes into account many kinds of features, each of which can be classified as relating to distance, quality, or relevance. Sometimes these signals encode related meaning, which makes their overlap difficult to untangle (e.g., we know a business is marked as “$ – inexpensive” and that people mention “cheap” in their reviews), but often they capture orthogonal information about a business (e.g., reviews are not a reliable source of a business’s hours). We need a stage where we discern which parts of a query tell us what types of information should be searched over.
It’s also useful to embellish the plain text of a search query with context. Are there related words that are very relevant to a substring of text? Is it possible that the user misspelled part of the query? Having this extended information would enhance our ability to recall relevant businesses, especially in low content areas.
Let’s consider a more concrete example, such as a search for “mimosa brunch” in San Francisco. The above search results show some features that our desired (and now current) query understanding system should be able to extract:
“mimosa” has the related search term “champagne”
“brunch” maps to the attribute “Good for Brunch”
All these enhancements sounded great but were becoming increasingly difficult to implement. Our old search stack combined the understanding of the intent and semantics of a query and its execution into a single step. This resulted in engineers having a hard time adding new quality features: concepts were overloaded (the word “synonym” had very different meanings in different parts of the code), questions of semantics were tied up in execution implementation, and there was no way for other parts of Yelp to understand the semantic knowledge locked up in our Lucene search stack.
Semantics as Abstraction
To help solve these problems, we found it useful to formally split the process of extracting meaning from the process of executing a search query. To that end, we created a component called Soxhlet (named after a kind of extractor used in chemistry labs) that runs before search execution and whose sole job is to extract meaning from a raw text query. It turns the input text into a Rich Query that holds structured semantics in the context of business search. This Rich Query then becomes the preferred abstraction for the rest of the search stack to execute upon.
What does this Rich Query data structure look like? When we were first whiteboarding possible representations we tended to underline parts of the query and assign semantics to each part.
As you can see from the above representation, there are complex semantics encoded in a signal that is mostly textual (the query text). Therefore, we think that annotating the query string is a good way to represent a Rich Query: ranges of the original query text are marked as having typed information via various kinds of Annotations.
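To make the idea concrete, here is a minimal sketch of such an annotated-query structure. The class and field names are our own illustration, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Typed information attached to a character range of the query text."""
    start: int            # inclusive start offset into the query text
    end: int              # exclusive end offset
    kind: str             # e.g. "Synonym", "Attribute", "Token"
    payload: dict         # type-specific data, e.g. {"attribute": "$"}
    confidence: float = 1.0

@dataclass
class RichQuery:
    """The original query text plus all annotations extracted from it."""
    text: str
    annotations: list = field(default_factory=list)

    def covering(self, start, end):
        """Annotations overlapping the character range [start, end)."""
        return [a for a in self.annotations if a.start < end and a.end > start]

# The "mimosa brunch" example from above, expressed as annotations.
query = RichQuery("mimosa brunch")
query.annotations.append(Annotation(0, 6, "Synonym", {"synonym": "champagne"}))
query.annotations.append(Annotation(7, 13, "Attribute", {"attribute": "GoodForBrunch"}))
```

Because every annotation carries its character range, consumers can cross-reference annotations with each other and with the raw text, which is what makes the structure useful downstream.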
For those of you familiar with Lucene, this data structure is very similar to a TokenStream but with annotations instead of attributes. This makes consumption of the Rich Query in the recall and relevance steps straightforward while at the same time giving enough abstraction so that non-search components can also easily consume Rich Queries.
As an example, the extracted RichQuery for a query popular with our team is:
When our core relevance stack (a service called “Lucy”) analyzes the above Rich Query, the AttributeQueryAnnotation tells it to also search businesses’ attribute field for the “$” attribute. Furthermore, the confidence of 1.0 implies that an attribute match should be considered equivalent to a textual match for “cheap.” Other annotations may carry lower confidences if their extraction process is less precise, and this is factored into the scoring function of the constructed Lucene queries.
By being able to use the way Annotations overlap in the original search query, it’s now possible to more easily cross-reference Annotations with each other and with the original text. The recall and relevance stages of the search stack use this ability to ensure that the right constraints are used.
What is the architecture within Soxhlet that allows us to turn a plain query into a Rich Query? Our first observation is that we should have specialized components for producing each type of Annotation. Our second observation is that basic Annotations can help identify more complex ones; you could imagine taking advantage of detected synonyms to properly identify cost attributes in the search query string, etc.
From these observations, we decided that a good starting design was a series of transformation steps that iteratively adds new Annotation types. Each one of these steps is called a Transformer. Each transformer takes a Rich Query as input, looks at all existing Annotations and the original query text, and produces a modified Rich Query as output that possibly contains new annotations. Stripped of boilerplate, the code for the top-level Soxhlet class is very simple:
This architecture also provides us with a mechanism for sharing prerequisite work between Transformers, while not forcing coupling between earlier Transformers or the search recall and relevance steps. As an example, although all Transformers could operate on the raw query string, many kinds of text analysis work on the token level, so many Transformers would have to duplicate the work of tokenizing the query string. Instead, we can have a tokenization Transformer create token Annotations from the raw string, and then every subsequent Transformer has access to that tokenization information.
This optimization should be used carefully, though. With Soxhlet internationalized into dozens of languages (and more being added regularly), normalization (accent removal, stemming, etc.) is an important part of extracting Rich Queries. It’s tempting to store normalized queries at the beginning of the extraction process, but we have avoided doing so to prevent coupling among Transformers and between Soxhlet and other services, such as the recall and relevance stages of the search stack.
The Transformers themselves can vary in complexity so let’s dive into some details on how they actually work:
We actually use two different types of spelling transformations. “Spelling Suggestion” is where we still perform the Rich Query extraction and search on the original query but offer a “Did You Mean” link in the results for queries that could be good suggestions. These suggestions are computed at query time by generating candidate suggestions within a small edit distance of the original query and then scoring them with a noisy channel model that takes into account query priors and likelihood of edits.
“Spelling Correction” occurs when we are very confident about a correction and perform the Rich Query extraction and search on the corrected query. These corrections are generated by promoting Spelling Suggestions that have performed well in the past based on click rates for their “Did You Mean” link. At query time we simply lookup a map from query to correction.
This two-tiered approach is effective because it lets us achieve high precision with Spelling Correction while still surfacing lower-confidence candidates as Spelling Suggestions, so we don’t sacrifice recall. Our future plans for spelling include improving internationalization and incorporating more advanced features, such as a phonetic algorithm, into our suggestion scorer.
Context is very important to synonym extraction and this drove a number of our design decisions:
The same word can have different synonyms in different query contexts. For example, the word “sitting” in the query “pet sitting” has the synonym “boarding”. However, the same word in the query “house sitting” probably shouldn’t have “boarding” as a synonym.
The same word in the same query can have different synonyms in different domain contexts. For example the word “haircut” could have the synonym “barber” in a local business search engine such as Yelp. However, on an online shopping website a better synonym would be “hair clippers”.
The same word can be used very differently in queries compared to documents. For example, it might be common for a review to mention “I got my car repaired here” but it would be very unusual to write “I got my auto repaired here”. As a result, an algorithm that uses only document data to extract synonyms would likely not recall “auto repair” having the synonym “car repair” despite this being a very good query synonym. In fact, research has shown that the similarity between queries and documents is typically very low.
To solve the above problems, we adopted a two-step approach to synonym extraction. In the first step, we extract possible candidate synonyms from users’ query refinements. In the second step, we create a vector for each query containing the distribution of clicked businesses and then score synonym pairs by the cosine similarity of their query vectors. We store this information in a lookup table that is accessed at query time to find synonyms.
We found a simple lookup table mapping phrases to attributes to be very effective for extracting attributes. This transformer also illustrates one of the benefits of a unified query understanding framework: by piggybacking on the results of the Synonyms Transformer, we can use the knowledge that “cheap” is a synonym of “inexpensive” to extract the “$” attribute from “inexpensive restaurant” without explicitly having a mapping from “inexpensive” to “$” in our attribute lookup. Although the Attributes Transformer is simple, it is precise, and you could imagine ways of improving its recall, such as using a language model for each attribute.
For example, the Synonyms Transformer simply contains a dictionary for looking up synonyms that have been pre-generated by batch jobs from query logs. On the other hand, the Spelling Suggestion Transformer uses a more complex noisy channel model at query time to determine likely corrections.
Using Soxhlet in our business search stack, we’ve rolled out the ability to use query synonyms to enhance recall and relevance, to spell-correct query mistakes more intelligently, and to better detect attributes in queries. Most importantly, we now have an elegant framework for extending our query understanding functionality by adding new Transformers. We have already seen significant gains in search results CTR from these changes and have laid a foundation for sharing our concept of meaning for business search queries with other parts of Yelp. Although it is a major undertaking to move many of the existing “meaning extraction”-like parts of our search stack into Soxhlet, we believe the benefits of this architecture are well worth it.
Special thanks to David S. for helping author this blog post and the Query Understanding team who helped build the platform: Chris F., Ray G., Benjamin G., Natarajan S., Maria C., and Denis L.
Posted by Soups R., Ph.D., Software Engineer (Data Mining)
Two years, four highly competitive rounds, over $35,000 in cash prizes awarded and several hundred peer-reviewed papers later: the Yelp Dataset Challenge is doubling up. We are proud to announce our latest dataset that includes information about local businesses in 10 cities across 4 countries. This dataset contains 1.6M reviews and 500K tips by 366K users for 61K businesses, along with rich attribute data (such as hours of operation, ambience, parking availability) for these businesses, social network information about the users, as well as aggregated check-ins over time for these businesses. This treasure trove of local business data is waiting to be mined and we can’t wait to see you push the frontiers of data science research with our data.
Figure 1: Cities included in the Yelp Dataset Challenge Round 5
At Yelp, one of our missions is to engage with the academic community and help them by providing real-world data to aid their research. Our dataset should appeal to researchers in data mining, machine learning, economics and urban planning alike. So whether you are building a cutting-edge Natural Language Processing (NLP) algorithm that mines sentiments expressed by our reviewers, figuring out what business attributes (service quality, ambience, etc.) make a local business popular, or designing better cities and communities by mining local business data – our dataset has everything you need to put your research together.
New Competition: Deadline is June 30th 2015
Download the new dataset and remember to submit your entry by June 30, 2015 in order to be eligible for one of our top-project prizes of $5,000. Please note that the contest itself is open only to students. The contest will run from Feb 5, 2015 to June 30, 2015. See the website for the full terms and conditions.
New Dataset: 10 cities, 4 countries
The most recent Yelp Dataset Challenge (our fourth round) ran from August 1 to December 31, 2014, giving students access to reviews and businesses from five cities worldwide: Phoenix, Las Vegas, and Madison in the U.S.; Waterloo in Canada; and Edinburgh in the U.K.
In Round 5, open now, we are expanding the dataset to include data from five new cities: Pittsburgh, Urbana-Champaign, and Charlotte in the U.S., Montreal in Canada and Karlsruhe in Germany. We have also updated the data from the original five cities with any businesses that were added or new reviews and tips that were written in those cities since August 1, 2014.
To get your creative juices flowing, here are a few things that you could do with this dataset and some interesting projects from the last round:
By adding a diverse set of cities, we want participants to compare and contrast what makes a particular city different. For example, are people in international cities less concerned about driving to a business, as indicated by fewer mentions of parking? What cuisines are Yelpers raving about in these different countries? Do Americans tend to eat out late compared to the Germans and English? In which countries are Yelpers sticklers for service quality? And what about that old adage: is the success of a business just location, location, location?
What about seasonal effects: are HVAC contractors reviewed mostly at the onset of winter, and manicure salons at the onset of summer? Do you see any non-intuitive correlations between business categories (e.g., how many karaoke bars also offer Korean food, and vice versa)? Are there more reviews for sports bars on major game days, and if so, could you predict that?
Natural Language Processing (NLP)
What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?
Some creative projects from Round 4:
We are still reviewing the submissions from Round 4. The response has been overwhelming: we received over 60 well-thought-out and insightful entries, which our team of data mining engineers is evaluating to decide the winners of the grand prizes. Meanwhile, we thought we’d share a few interesting submissions.
Figure 2: Good Food Bad Service Restaurants
A team of Stanford Data Science majors mined our review dataset to identify the characteristics of restaurants that are consistently ranked for their good food, but bad service. If you live in Phoenix, Las Vegas, Madison, Waterloo or Edinburgh, then you can check out these Good Food Bad Service Restaurants they’ve identified. For instance, the words in blue above are those that are mentioned more often in 5-star reviews than in 1-star reviews, e.g., “duck,” “cod,” “poached,” and “crepes.” In contrast, “airport,” “occupied,” “issues,” “serving,” and “7 pm” are words used much more often in lower rated reviews. Their recommendation to “Good Food Bad Service” restaurants: hire more wait-staff for the 7 pm rush!
Figure 3: Thanksgiving is all about food while Christmas is about shopping!
The Data Science Society of UCSD wrote up a very interesting blog post describing Naïve Bayes and Random Forest approaches to predicting users’ star ratings solely from the content of their reviews. They also created some beautiful word clouds to visualize what Yelpers like to do during the holidays. It turns out Thanksgiving is all about food while Christmas is all about shopping!
So go ahead and take our data for a spin. We can’t wait to see what you create!
I remember the first time I laid eyes on the beast: Summer, 2012. The air conditioning at Yelp HQ hummed imperceptibly as I reviewed code for a colleague. This wasn’t my first rodeo, but I was new to Yelp and to working in a large Python codebase. I painstakingly scanned the code for lurking bugs, but couldn’t find any. “Ship it!” I declared electronically, freeing my colleague to deploy his changes.
It is chilling to think that on that day, I looked the beast squarely in the eyes and never realized it. Cunning camouflage allowed it to slip past me and into production.
Hunt the Wumpus
So what is this horrific beast? I will show you, and you still might not spot it! Imagine you are that starry-eyed engineer from the Summer of ’12 and you’re reviewing production code that looks something like this:
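A reconstruction in the spirit of that code (the function names are assumed; the real details don’t matter for spotting the beast):

```python
def restart_server(server_name):
    """Pretend to restart a production server; slow and disruptive for real."""
    print("Restarting %s..." % server_name)
    return server_name

def deploy(server_name):
    """The code under review: push the new build, then bounce the server."""
    # ...copy the new build over, flip symlinks, etc. (elided)...
    return restart_server(server_name)
```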
This feature even comes with a unit test:
The main idea is that restarting actual servers whenever someone runs the test suite is not a great strategy. So instead of calling the real restart_server(), we use mock.patch() from the Python Mock library to swap in a test double. The fake restart_server(), which we call mock_restart, is a MagicMock which remembers how it has been called and provides convenience methods like assert_called_once_with to validate its experiences.
The code review includes output showing that this test passes. You can paste the code above into a file and run it with py.test if you don’t trust me:
Did you spot the beast?
It is Pitch Black
Here’s a clue: the test will happily continue to pass if you change the assertion to:
Remember that a mock’s job is to say, “You got it, boss” whenever anyone calls it. It will do real work, like raising an exception, when one of its convenience methods is called, like assert_called_once_with. But it won’t do real work when you call a method that only resembles a convenience method, such as assert_called_once (no _with!).
You Are Likely to be Eaten by a Grue
More troubling, the test passes even if you remove the call to restart_server() from the code-under-test entirely!
This should make your blood run cold. Our test looks reasonable, emits a green “1 test passed,” but doesn’t test anything!
Production code with no test coverage is pretty bad, but this test actually has negative value. Someone spent energy writing it. More engineers will spend energy reading it. Future maintainers, not realizing that they are bedding down with the beast, will add features based on this test and these new features will also be untested. Unwary newcomers will see the useful-looking but ultimately poisonous assert_called_once() and spread it to other modules.
Kill The Beast
Now that we know the beast and how it ticks, how can we defeat it?
Fear of Commitment
Fellow Yelpers Cheng C. and Nicholas N. added a check to our pre-commit hooks that uses the compiler module to walk the AST looking for assert_called_once and a few other disguises commonly used by the beast. If the beast’s signature is detected, the commit is rejected.
This band-aid was cheap to implement and provides a final, automated chance to slay the beast before it can sneak into the master branch. Blacklists are a crude instrument, however, so this hook is not a perfect defense.
You Can Go Your Own Way
After my colleague Keith M.’s first encounter with the beast (pointed out to him by battle-hardened Autumn 2014 me in a code review), he vowed to avoid the Mock convenience methods entirely:
I’ve switched to an idiom of checking mock objects that should be a lot more resilient to the fact that they are pernicious liars.
His rationale: The helper method saves a line, but calling things on mocks and expecting them to have side effects (like raising an error) is just asking to get bit by typos and misspellings. Meanwhile, it’s much harder for a mock to accidentally end up with very specific values occupying a random attribute.
Glasses for Your Subaru?
The authors of Mock are aware of the beast and have built a defense into the library — autospec:
If you use a class or instance as the spec for a mock then you can only access attributes on the mock that exist on the real class.
If we change our test to use the autospec defense:
Then the beast is exposed!
autospec is very effective… if you remember to use it. It’s not on by default due to “caveats and limitations”, so the beast relies on the uninitiated and forgetful to avoid being trapped by autospec.
The best defense against the beast is not a blacklist or a workaround or a clever library feature. It is a change in how you write software.
When I see a test containing a bogus method like assert_called_once, it tells me that the author never verified that the test could fail. A test that cannot fail is no test at all.
The easiest way to ensure that your test can fail is to write the test first, watch it fail, then write production code until the test passes. This is Test-Driven Development (TDD) and in addition to being a relatively foolproof means of defeating the beast, it is the best way I know to write well-tested, well-factored, maintainable software.
TDD is what allowed me to outsmart the beast the first time. While adding a feature, I had copied from an existing, beast-infiltrated test. But because I had written the test first, I knew something was amiss when the test suite cheerfully passed. Suspicion and ipdb soon laid bare the beast’s tricks, and now I pass my experiences on to you.
So go, my friends, into the great wilderness of software. May your test suite remain faithful, honest, and beast-free.
The Northeast may be covered in snow, but here in sunny San Francisco, we’re operating at full steam! We got the year off to a great start with a few meetups and the launch of our new tech talk series (more on that later). This month, on top of the meetups we’re hosting, we’re also getting involved with GirlDevWeek. We’ll be hosting a panel here at Yelp HQ, as well as throwing the official after party in conjunction with Pandora. So if you’re going to be at GirlDevWeek, we hope to see you at both events!