Yelp Dataset Challenge Round 3 Winners and Dataset Tools for Round 4
Dr. Scott Clark, Ad Targeting Engineer
Nov 6, 2014
Yelp Dataset Challenge Round 3 Winners
We recently opened the fourth round of the Yelp Dataset Challenge. This announcement included an update to the dataset, adding four new international cities and bringing the total number of reviews in the dataset to over one million. You can download it and participate in the challenge here. Submissions for this round are open until December 31, 2014. See the full terms for more details.
With the opening of our fourth iteration of the challenge, we closed the third round, which ran from February 1, 2014 to July 31, 2014. We are proud to announce two grand prize winners of the $5,000 award from round three:
- Jack Linshi from Yale University with his entry “Personalizing Yelp Star Ratings: A Semantic Topic Modeling Approach.” Jack proposed an approximation of a modified latent Dirichlet allocation (LDA) in which term distributions of topics are conditional on star ratings, allowing topics to have an explicit sentiment associated with them.
- Felix W. from Princeton University with his entry “On the Efficiency of Social Recommender Networks.” Felix constructed metrics for measuring how efficiently a network disseminates information and recommendations, then applied them to the Yelp social graph and found it to be quite efficient.
These entries were selected from many submissions for their technical and academic merit. For a full list of all previous winners of the Yelp Dataset Challenge, head over to the challenge site.
Dataset Example Code
We maintain a repository of example code to help you get started with the dataset. These examples show different ways to interact with the data, including how to process it with mrjob, our open source Python MapReduce tool.
The repository includes scripts for:
- Converting the dataset from JSON to CSV
- Predicting likely categories given review text
- Finishing reviews using Markov chains
- Finding the sentiment of words in the dataset
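As a rough illustration of the first item, here is a minimal sketch of a JSON-to-CSV conversion using only the Python standard library. It is not the script from the repository. It assumes each dataset file stores one JSON object per line, and the function name `json_lines_to_csv`, the field names (`business_id`, `stars`, `text`), and the sample records are all hypothetical, chosen only for the example.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, csv_file, fields):
    """Write one CSV row per JSON object, keeping only the requested fields.

    Hypothetical helper: assumes one JSON object per input line.
    Returns the number of data rows written.
    """
    # extrasaction="ignore" drops keys that are not listed in `fields`.
    writer = csv.DictWriter(csv_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    rows_written = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keys missing from a record become empty cells (DictWriter default).
        writer.writerow(record)
        rows_written += 1
    return rows_written


# Made-up records standing in for a review file; field names are assumptions.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4, "text": "Great tacos"}\n'
    '{"business_id": "b2", "stars": 2, "text": "Slow service", "votes": {"useful": 1}}\n'
)
output = io.StringIO(newline="")
count = json_lines_to_csv(sample, output, ["business_id", "stars", "text"])
print(output.getvalue())
```

To convert a real file, pass open file handles instead of the in-memory buffers, opening the CSV target with `newline=""` as the `csv` module documentation recommends. Nested values, such as the `votes` object in the sample, are dropped here because they are not in the field list; flattening them would need extra handling.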
Other Tools
There are many ways to explore the data in the Yelp Dataset Challenge dataset. Below are a few of the tools that can be used with it:
CartoDB is a cloud-based mapping, analysis, and visualization engine that can turn reviews into insightful visualizations. They recently wrote a blog post demonstrating how to use their tools to draw insights from the Las Vegas part of the dataset.
Statwing is a tool for cleaning data, exploring relationships, and creating charts quickly. They loaded the dataset into their system so that people can play with it and explore it.
Yelp Dataset Challenge Round 4
Submissions for this round are open until December 31, 2014. See the full terms for more details. This dataset contains over one million reviews from five cities around the world, along with all of the associated businesses, tips, check-ins, and users, as well as the social graph. We are excited to see what you come up with!