Data Science Contest “Keeping it Fresh”: Predict Restaurant Health Scores
Soups R., Ph.D., Software Engineer (Data Mining)
- Apr 28, 2015
Yelp connects people with local businesses and along the way we’ve gathered rich data about customers’ experiences at those businesses via reviews, tips, check-ins and business attributes. We are constantly asking ourselves how the collective wisdom of Yelpers can be used to better inform cities in their efforts around protecting the health of our communities. In particular, could we use Yelp’s reviews and business information to make the process of sending Health Inspectors to restaurants more efficient?
According to the Centers for Disease Control, more than 48 million Americans per year become sick from food, and an estimated 75% of the outbreaks came from food prepared by caterers, delis, and restaurants. Currently, inspectors are sent to restaurants in a mostly random fashion. Since cities only have a limited number of health inspectors, quite often their time is wasted on spot checks at clean, rule-abiding restaurants. This also means that sometimes restaurants with poor health and safety records are discovered too late.
It turns out that with Yelp’s data, cities can drastically improve the process of assigning Health Inspectors. A research study by Prof. Michael Luca from Harvard Business School and Prof. Yejin Choi from Stony Brook University and their graduate students found that a machine-learned model built using Yelp’s review data and past health inspection records is able to successfully predict future inspection scores for restaurants 82% of the time. Can we do better?
We want to challenge data scientists worldwide to design a better health inspection prediction algorithm. Yelp is co-sponsoring a new Data Science contest, “Keeping it Fresh”, in collaboration with the City of Boston, DrivenData.org and Harvard University economists (Ed, Andrew, Scott, and Mike). Using Yelp’s data for restaurants, food and nightlife businesses in Boston, as well as the history of their health inspections, we are asking contestants to predict the health score that will be assigned to a business at its next health inspection.
Winning algorithms will be awarded financial prizes — but the real prize is the opportunity to help the City of Boston, which is committed to examining ways to integrate the winning algorithm into its day-to-day inspection operations.
The goal for this competition is to use data from social media to narrow the search for health code violations in Boston. Competitors will have access to historical hygiene violation records from the City of Boston — a leader in open government data — and Yelp’s consumer reviews. The challenge: Figure out the words, phrases, ratings, and patterns that predict violations, to help public health inspectors do their job better. The first-place winner will receive $3,000, and two runners-up will receive $1,000 each. All prizes are provided by Yelp.
Yelp engineers have been fascinated by this problem. In a recent internal hackathon, a team of Yelp engineers comprising Wing Y., Srivathsan R., Florian H., Jon C. and Srivatsan S. decided to dive deep into our rich user-generated content to find correlations between reviews and actual health scores for businesses. They modelled Yelp reviews as a bag-of-words and used machine learning techniques like logistic regression to predict health scores. They then went on to overlay their algorithmically predicted scores with the actual city-issued health scores for those businesses over a period of time. Consider this example of a popular San Francisco restaurant:
It turns out that the algorithmically predicted health scores (in green) closely track the actual health scores issued by the city of San Francisco (in black). That doesn’t come as a surprise, because Yelpers are quite often talking about the same things that health inspectors look for: cleanliness, ambience, and methods of preparation. This restaurant also exemplifies our observation from a randomized, controlled trial in one large city that restaurants whose low hygiene ratings are posted on Yelp tend to respond by cleaning up and performing better on their next inspection.
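To make the bag-of-words and logistic regression idea from the hackathon concrete, here is a minimal sketch in plain Python. It is not the team's actual code: the six reviews and their labels are made up for illustration, the target is simplified to a binary "failed inspection" flag rather than a numeric health score, and all function names, the learning rate and the epoch count are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    # Map each distinct word in the training reviews to a column index.
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(text, vocabulary):
    # Count occurrences of each vocabulary word; unseen words are ignored.
    vector = [0.0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:
            vector[vocabulary[word]] = float(count)
    return vector


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, epochs=500, learning_rate=0.1):
    # Plain stochastic gradient descent on the log-loss, no regularization.
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y
            bias -= learning_rate * error
            for i, xi in enumerate(x):
                if xi:
                    weights[i] -= learning_rate * error * xi
    return weights, bias


def predict_probability(text, vocabulary, weights, bias):
    # Estimated probability that a review belongs to the "failed" class.
    x = bag_of_words(text, vocabulary)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical toy data: label 1 = failed inspection, 0 = passed.
reviews = [
    "dirty tables and a roach near the kitchen",
    "got sick after eating here, filthy restroom",
    "sticky floor, dirty plates, smelled bad",
    "clean dining room and friendly staff",
    "fresh food, spotless kitchen, great service",
    "lovely clean place with fresh salads",
]
labels = [1, 1, 1, 0, 0, 0]

vocabulary = build_vocabulary(reviews)
rows = [bag_of_words(r, vocabulary) for r in reviews]
weights, bias = train_logistic_regression(rows, labels)

risky = predict_probability("dirty and filthy kitchen", vocabulary, weights, bias)
safe = predict_probability("clean and fresh", vocabulary, weights, bias)
print(f"risky review: {risky:.2f}, safe review: {safe:.2f}")
```

On real data one would more likely reach for a library implementation with regularization and held-out evaluation; the hand-rolled version is used here only so that the two steps, counting words and fitting weights to them, stay visible.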
In the end, such public/private partnerships with cities will enable us to enrich the experiences of all parties: consumers who avoid getting sick, businesses that can use actionable information to improve their health standards, and cities that can direct their limited number of Health Inspectors to the restaurants that most need a visit.
The competition opened yesterday (Monday, April 27th) and will accept submissions for eight weeks. Submissions will be evaluated on fresh hygiene inspection results during the six weeks following the competition; after that, the prizes will be awarded. Your submission will not only put you in the running for the prize – it has the chance to transform how city governments ensure public health.
We are excited to see what machine learning algorithms our contestants will build. So wait no more. Go sign up for the “Keeping it Fresh” Contest here.
Thanks to our partners from the City of Boston, Harvard Business School and DrivenData for making this contest a reality. Special thanks to Luther L. for leading the charge and bringing together all the parties involved. Thanks to Srivatsan S. for helping author this blog post.