Yelp Dataset Challenge Round 5 Winner
Sébastien C., Data Scientist
Jan 19, 2016
The fifth round of the Yelp Dataset Challenge ran throughout the first half of 2015, and we were quite impressed with the projects and concepts that came out of it. Today, we are proud to announce the grand prize winner of the $5,000 award: “From Group to Individual Labels Using Deep Features” by Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth (from the University of California, Irvine, the University of Oxford, and the Canadian Institute for Advanced Research). This paper proposes a novel approach to using group-level labels (e.g., the category of an entire review) to learn instance-level classification (e.g., the category of specific sentences inside that review). The authors designed a new objective (cost) function for training a model that uses features from a deep-learning convolutional neural network. The trained model can, in turn, be used as a classifier that predicts which category a specific instance belongs to. Their innovative research has broad implications for a variety of fields beyond text classification.
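To make the idea of learning instance labels from group labels more concrete, here is a minimal sketch of what such a cost function could look like. It is only an illustration under our own assumptions, not the authors' formulation: the Gaussian similarity kernel, the use of a plain average to aggregate sentence predictions, the logistic model, and every name below (`group_instance_cost`, `lam`, and so on) are hypothetical. See the paper for the actual objective.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity(u, v):
    # Assumed Gaussian kernel on squared Euclidean distance between features.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))


def group_instance_cost(weights, groups, group_labels, lam=1.0):
    """Illustrative cost: similar instances should get similar predictions,
    and the average prediction inside each group should match the group label.

    groups: list of groups (e.g. reviews), each a list of feature vectors
            (e.g. one vector per sentence).
    group_labels: one label in [0, 1] per group.
    """
    instances = [x for group in groups for x in group]
    preds = [sigmoid(dot(weights, x)) for x in instances]
    n = len(instances)

    # Smoothness term over all ordered pairs of instances.
    smoothness = sum(
        similarity(instances[i], instances[j]) * (preds[i] - preds[j]) ** 2
        for i in range(n)
        for j in range(n)
    ) / (n * n)

    # Group term: squared error between mean instance prediction and group label.
    group_error = 0.0
    for group, label in zip(groups, group_labels):
        mean_pred = sum(sigmoid(dot(weights, x)) for x in group) / len(group)
        group_error += (mean_pred - label) ** 2
    group_error /= len(groups)

    return smoothness + lam * group_error


# Two toy "reviews" with two "sentences" each; only group labels are given.
toy_groups = [[[1.0, 0.0], [2.0, 0.0]], [[-1.0, 0.0], [-2.0, 0.0]]]
toy_labels = [1, 0]
cost_untrained = group_instance_cost([0.0, 0.0], toy_groups, toy_labels)
cost_separating = group_instance_cost([5.0, 0.0], toy_groups, toy_labels)
```

In this toy setup, weights that separate the two groups give a lower cost than all-zero weights, even though no individual sentence was ever labeled. Minimizing a cost of this kind is what would let a model pick up sentence-level predictions from review-level labels.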
This entry was selected from many submissions for its technical and academic merit. A PDF copy of the winning paper can be found at this location. This paper was published in the Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. For a full list of all previous winners of the Yelp Dataset Challenge, head over to the challenge site. Thanks to all who participated!
Dataset Example Code
We maintain a repository of example code to help you get started playing with the dataset. These examples show different ways to interact with the data and how to use our open source Python MapReduce tool mrjob with the data.
The repository includes scripts for:
- Converting the dataset from JSON to CSV
- Predicting likely categories given review text
- Finishing reviews using Markov chains
- Finding the sentiment of words in the dataset
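As a rough illustration of the first item, the sketch below converts line-delimited JSON (one object per line, which is how the dataset files are laid out) into CSV using only the Python standard library. It is not the script from the repository: the function name `json_lines_to_csv` and the sample records are made up for this example, and the column names are illustrative.

```python
import csv
import io
import json


def json_lines_to_csv(lines, columns):
    """Convert an iterable of JSON-object strings into CSV text.

    Only the requested columns are kept; missing fields become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


# Hypothetical sample records, standing in for lines read from a dataset file.
sample_lines = [
    '{"business_id": "b1", "name": "Cafe A", "stars": 4.5}',
    '{"business_id": "b2", "name": "Diner, B", "stars": 3.0}',
]
csv_text = json_lines_to_csv(sample_lines, ["business_id", "name", "stars"])
print(csv_text)
```

To convert a real file, you could pass an open file handle as `lines`, since iterating over a file yields one line at a time. Nested fields would need to be flattened first, which this sketch does not attempt.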
Other Tools
There are many ways to explore the data in the Yelp Dataset Challenge dataset. Below are a few of the many tools that can be used with it:
CartoDB is a cloud-based mapping, analysis, and visualization engine that can turn reviews into insightful visualizations. They recently wrote a blog post demonstrating how to use their tools to gain insights about the Las Vegas part of the dataset.
Statwing is a tool used to clean data, explore relationships, and create charts quickly. They loaded the dataset into their system for people to play with and explore.
Next Yelp Dataset Challenge Round
Submissions for Round 6 closed on December 31, 2015, but Round 7 started on January 15, 2016, and a new dataset has been released. We are excited to see what you will come up with!