Yelp Dataset Challenge Round 7 Winner and Announcing Round 9
Sébastien C., Data Scientist
Jan 25, 2017
Yelp Dataset Challenge Round 7 Winner
The seventh round of the Yelp Dataset Challenge ran throughout the first half of 2016 and, as usual, we were impressed with the projects and ideas that came out of the challenge.
Today, we are proud to announce the grand prize winner of the $5,000 award: “Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams” by Abhinav Maurya, Kenton Murray, Yandong Liu, Chris Dyer, William W. Cohen, and Daniel B. Neill (from Carnegie Mellon University and the University of Notre Dame in Indiana). The authors created a model to detect and characterize emerging topics in text streams.
Their new model is called Semantic Scan and, unlike traditional methods such as Latent Dirichlet Allocation (LDA), it accounts for drift in concepts over time. Semantic Scan combines contrastive topic modelling with online document assignment and spatial scan statistics to quickly identify emerging topics. Compared to methods like Topics over Time, Online LDA, and Labeled LDA, their algorithm is faster and yields lower Hellinger distances. Needless to say, this work is highly relevant to many technology companies.
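For readers unfamiliar with the Hellinger distance mentioned above, here is a minimal Python sketch of the standard definition for two discrete probability distributions. It is not the authors' evaluation code. The function name and the example word distributions are made up for illustration, and we assume both inputs are already normalized and have the same length.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumes p and q are equal-length sequences of non-negative
    probabilities that each sum to 1. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with no overlap.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Hypothetical word distributions for a reference topic and a learned topic.
reference_topic = [0.5, 0.3, 0.2]
learned_topic = [0.4, 0.4, 0.2]
print(hellinger(reference_topic, learned_topic))
```

A lower value means the two distributions are closer, which is why lower Hellinger distances indicate better-recovered topics.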
This entry was selected from tons of submissions for its technical and academic merit. For a full list of all previous winners of the Yelp Dataset Challenge, head over to the challenge site. Thanks to all who participated!
Dataset Example Code
We maintain a repository of example code to help you get started playing with the dataset. These examples show different ways to interact with the data and how to use our open source Python MapReduce tool mrjob with the data.
The repository includes scripts for:
- Converting the dataset from JSON to CSV
- Predicting likely categories given review text
- Finishing reviews using Markov Chains
- Finding the sentiment of words in the dataset
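To give a flavor of the "finishing reviews" idea, here is a rough sketch of a first-order Markov chain over words in plain Python. It is not the script from the repository and does not use mrjob. The function names, the tiny in-memory review list, and the `max_words` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def finish_review(chain, start, max_words=10, rng=None):
    """Extend `start` by repeatedly sampling a follower of the last word."""
    rng = rng or random.Random()
    words = start.split()
    if not words:
        return start
    for _ in range(max_words):
        candidates = chain.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything
        words.append(rng.choice(candidates))
    return " ".join(words)


# Hypothetical stand-ins for review text from the dataset.
reviews = [
    "the food was great and the service was fast",
    "the service was slow but the food was worth it",
]
chain = build_chain(reviews)
print(finish_review(chain, "the food", rng=random.Random(0)))
```

Because followers are stored with repetition, more frequent word pairs are sampled more often. With real review text you would probably want proper tokenization and a higher-order chain to get more readable output.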
Other Tools
There are many ways to explore the data in the Yelp Dataset Challenge dataset. Below are a couple of the cool tools that can be used with our data:
CartoDB is a cloud-based mapping, analysis, and visualization engine that can turn reviews into insightful visualizations. They wrote a blog post demonstrating how to use their tools to explore the Las Vegas part of the dataset.
Statwing is a tool used to clean data, explore relationships, and create charts quickly. They loaded the dataset into their system so that people can play with it and explore it for themselves.
Next Yelp Dataset Challenge: Round 9
The ninth round of the Yelp Dataset Challenge opened on January 24, 2017 (and will close on June 30, 2017), giving students access to reviews and businesses from 11 metropolitan areas across 4 countries. Compared to the previous round, we added a new metropolitan area: Cleveland, the largest in the Buckeye State! The new dataset, which contains over 4.1 million reviews, and the auxiliary photo file are available for immediate download. Note that the new dataset JSON files have a slightly different format compared to past rounds.
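Since the JSON format changed, a quick way to see what a file actually contains is to count how often each top-level field appears. The sketch below assumes the files hold one JSON object per line. The function name `count_fields` and the sample records are hypothetical and are not taken from the real dataset.

```python
import io
import json
from collections import Counter


def count_fields(lines):
    """Count how many records contain each top-level JSON field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts.update(record.keys())
    return counts


# Made-up sample standing in for a dataset file (assumed: one object per line).
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4, "text": "Great tacos"}\n'
    '{"business_id": "b2", "stars": 2}\n'
)
for field, n in sorted(count_fields(sample).items()):
    print(field, n)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample. Comparing the counts between two rounds' files should highlight fields that were added, removed, or renamed.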