Calling All Data Miners!
Matt J., Search and Data-Mining Engineer
Sep 23, 2011
Summers are tough here at Yelp’s HQ in San Francisco. It’s hard to keep up with all the hot new businesses, order the best food at every restaurant you visit, or even just find that fancy dive bar your friends have been chatting up. If you’re anything like me, you aren’t satisfied just using these features on Yelp - you want to tear them apart, see how they tick, and make them better. The trick? All these features are powered by our incredible review data.
The data
Yelp is providing the reviews for nearly seven thousand businesses at 30 universities for students and academics to research and explore, along with some examples to get things started. Check out the examples on our GitHub page, and get the data from our Academic Dataset page (you’ll need to be associated with an academic institution to qualify for access).
Some numbers:
- 30 schools
- 3 data types (reviews, businesses, users)
- Over 150k reviews of nearly 7k businesses
Cool, huh? How about we try building something with the dataset?
Positive shmositive
First off, let’s define the problem we want to solve. Personally, I’ve always been interested in sentiment analysis, so the first thing I thought of was this: what are the most positive and negative words for each category? (The following results are generated by the positive_category_words example, in case you want to follow along.)
The simplest means of accomplishing this is to find the average star rating of all the reviews each word shows up in. We’ll use mrjob for this, since we’re working with a fair amount of data. Here’s a simplified version of the job:
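from mrjob.job import MRJob

# words() and avg() are small helpers, not shown in this simplified version.
class PositiveWords(MRJob):

    def review_mapper(self, _, data):
        # The dataset mixes record types; only reviews have text and stars.
        if data['type'] != 'review':
            return
        # Count each word once per review, however often it repeats.
        unique_words = set(words(data['text']))
        for word in unique_words:
            yield word, data['stars']

    def positivity_reducer(self, word, ratings):
        # Average star rating over every review that contains the word.
        yield word, avg(ratings)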
The full solution is implemented in positive_category_words/simple_global_positivity.py, and produces the following results:
The 10 most positive words in ranked order: gem, treasure, knowledgeable, impeccable, topnotch, incredible, talented, compliments, macadamia, perfection
The 10 most negative words in ranked order: worst, tasteless, flavorless, awful, rude, disgusting, horrible, terrible, poorly, zero
Making it better
So what’s wrong? The first thing that stands out is that all these words are very general - ‘gem,’ in particular, shows up in over 17k reviews, and could be used to describe pretty much anything. Let’s break it down by category, instead. The code for this portion lives in positive_category_words/weighted_category_positivity.py, and closely mirrors simple_global_positivity, with the addition of a category term.
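To give a feel for what adding a category term means, here is a rough sketch that keys the averages on (category, word) instead of word alone. It is not the code from weighted_category_positivity.py: the weighting used there isn't covered in this post, so the sketch takes a plain average. The 'business_id' and 'categories' field names and the sample records are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical records. 'business_id' and 'categories' are assumed field
# names for joining reviews to businesses; the values are invented.
BUSINESSES = [
    {"type": "business", "business_id": "b1", "categories": ["Italian", "Pizza"]},
    {"type": "business", "business_id": "b2", "categories": ["Bars"]},
]
REVIEWS = [
    {"type": "review", "business_id": "b1", "text": "exquisite char", "stars": 5},
    {"type": "review", "business_id": "b1", "text": "rude manager", "stars": 1},
    {"type": "review", "business_id": "b2", "text": "rude but exquisite", "stars": 4},
]


def category_word_scores(businesses, reviews):
    """Return {(category, word): mean stars}, a plain unweighted average."""
    categories_by_id = {b["business_id"]: b["categories"] for b in businesses}
    ratings = defaultdict(list)
    for review in reviews:
        unique_words = set(review["text"].lower().split())
        # A review counts toward every category its business belongs to.
        for category in categories_by_id.get(review["business_id"], []):
            for word in unique_words:
                ratings[(category, word)].append(review["stars"])
    return {key: sum(stars) / len(stars) for key, stars in ratings.items()}


scores = category_word_scores(BUSINESSES, REVIEWS)
```

With this toy data, "rude" averages 1 star in Italian but 4 stars in Bars, which is the kind of per-category difference a global average hides.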
So, for a category near-and-dear to my heart, Italian:
The 10 most positive words: melts, naples, exquisite, vino, promise, hooked, ovens, heaven, char, succulent
The 10 worst words: disgusting, worst, horrible, flavorless, terrible, tasteless, awful, rude, manager, subpar
These are looking a lot better, but there’s still room for improvement. We could use NLTK to distinguish parts of speech, and break down positive adjectives versus positive nouns. Or we could use a stemmer, and just look at root words (dream / dreams / dreaming might be better if they were bucketed together).
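The stemming idea can be sketched without any extra libraries. The crude_stem() function below is a toy suffix stripper standing in for a real stemmer (NLTK ships several, such as PorterStemmer), and the ratings are made up. It only shows how dream / dreams / dreaming would collapse into one bucket before averaging.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # deliberately tiny, illustrative list


def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bucket_by_stem(word_ratings):
    """Merge {word: [stars, ...]} into {stem: mean stars}."""
    buckets = defaultdict(list)
    for word, stars in word_ratings.items():
        buckets[crude_stem(word)].extend(stars)
    return {stem: sum(stars) / len(stars) for stem, stars in buckets.items()}


# Made-up ratings purely for illustration.
ratings = {"dream": [5], "dreams": [4, 5], "dreaming": [4], "rude": [1]}
stemmed = bucket_by_stem(ratings)
```

A real stemmer handles far more cases (and makes its own mistakes), so treat this as a picture of the bucketing step, not of stemming itself.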
Curious about the results for over 400 other categories? Check out our full examples page.
Going further
We’ve provided two more examples in our GitHub repo: a tool that guesses the category of a business given a review, and a simple Markov-chain review generator (a tool that, given a category and some starting text, fills in the rest of the review). But we’re most excited about what you come up with. Head over to http://www.yelp.com/academic_dataset for information on how to get started and a list of the schools we’re providing data for.
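For the curious, the Markov-chain idea mentioned above fits in a few lines. This is a minimal first-order sketch and not the code from the repo: build_chain() and generate() are hypothetical names, the training reviews are invented, and it skips the category handling by assuming the reviews passed in already belong to one category.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words observed immediately after it."""
    chain = defaultdict(list)
    for text in texts:
        tokens = text.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=None):
    """Extend non-empty `start` word by word until a dead end or the cap."""
    rng = rng or random.Random()
    output = start.lower().split()
    # chain.get() avoids creating empty entries for unseen words.
    while len(output) < max_words and chain.get(output[-1]):
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)


# Made-up training reviews for one hypothetical category.
reviews = ["the pasta was great", "the service was slow", "great pasta and wine"]
chain = build_chain(reviews)
sentence = generate(chain, "the", max_words=6, rng=random.Random(0))
```

Every word in the generated sentence follows its predecessor somewhere in the training reviews, which is why the output tends to sound locally plausible even when the whole sentence is nonsense.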
Have any other questions? Shoot us an email at dataset@yelp.com.