Our post today is by Will L., an engineering intern on one of Yelp’s backend teams this past fall. Will walks us through the challenges of bringing restaurant health inspection scores to Yelp, a feature we announced today at the United States Conference of Mayors in Washington, DC.
you may have seen on our official blog, we are very excited
about our initial release of the Local Inspection Value-Entry Specification
(LIVES). LIVES is an open data standard crafted by Yelp in partnership with the
cities of San Francisco and New York to allow municipalities to publish
restaurant health inspection information in a machine-readable format.
In this post, we want to take you behind the scenes and give you an overview of all the steps and actions that have happened from getting the standard off the ground to having all of the health inspection information showing up on Yelp.
The Local Inspection Value-Entry Specification was first drafted by Yelp engineer John Boiles (also known at Yelp for his Kinect hacking and the legendary KegMate) in mid-June 2012 and was followed by several collaborative revisions with key members of city health departments. Individuals within the cities of San Francisco, New York, and Philadelphia were instrumental to the process of refining the standard with their domain expertise and feedback. On January 9th, 2013, the latest version (1.0) was published.
A LIVES feed contains several comma-separated value (CSV) files to encapsulate the feed data in easy-to-read textual representation: businesses (businesses.csv), inspections (inspections.csv), related violations (violations.csv), score mappings for municipality-specific conventions (legend.csv), and finally data about the feed itself (feed_info.csv). Below is a snapshot of a portion of businesses.csv taken from the LIVES feed provided by the City of San Francisco:
Once the specification began shaping up at the beginning of October, a team of engineers at Yelp started to build a system to process and display LIVES data on our site. The first step was to come up with a scalable and maintainable system based on the requirements and constraints of the standard. While the LIVES standard is currently in use by two cities, Yelp is calling upon municipalities all across the US to share their health inspection data. As such, scalability played a critical role in our design process.
One of the more interesting and challenging aspects of the project centered around matching up a city’s record for a business to its equivalent on Yelp. While this may sound simple at first, it proves technically challenging when you realize cities are more interested in the legal representation of a business whereas Yelp focuses on what you would see if you were standing in front of the business on the street. For example, Starbucks may register itself as “Starbucks Coffee Company” with the City of San Francisco but will show up as just “Starbucks” on Yelp. Similar problems arise with addresses and phone numbers, all of which are attributes we use to help pinpoint the right business on Yelp (e.g., a chain might use a central number for registrations but have its individual numbers on Yelp).
While matching a set of data to a business is something we do routinely here at Yelp (after all, a search on Yelp is a very similar problem), the stakes for this project were much higher, especially in regards to false positives when matching. Just imagine how a 5-star restaurant with a perfect health record would feel if we incorrectly associated a failing inspection with their profile on Yelp.
To fine tune our matching, we ran several sample data sets from San Francisco and New York City through our tools and evaluated our results, paying particular attention to false negatives and false positives. Through a combination of normalization of the raw data from the municipalities and tweaks to how we weigh each piece of data (name, address, and phone number), we were able to dramatically minimize the number of false positives. Matching business records is never a completed project, however, so we’re constantly collecting metrics on how it’s performing with new data sets and tweaking its algorithms and weights appropriately.
Once we had all of the various implementation pieces glued together, the last step was to implement a rollout strategy. At Yelp, we’ve developed several tools to assist in this process to limit the exposure of a new feature. We’re able to release a feature to our internal users only, expose it to only a certain portion of public traffic, or whitelist the feature for certain businesses only. By combining all of these, we’re able to iterate and deploy features quickly all while keeping risks low.
We still have a lot of work to do with LIVES. Besides continuing our gradual rollout of the feature, our priority is to advocate for the adoption of the standard with municipalities so that more health inspection data is available publicly and can be displayed on Yelp. Since LIVES is an open standard, this not only benefits consumers wondering if that food stand on the corner of the street is a good choice; it also allows other organizations, such as research institutions, to use this data to spot trends and perhaps prevent future foodborne illness outbreaks. We’re equally interested in this data and plan on looking at how the average scores evolve across cities as we make this data more readily available to consumers like you. LIVES was one of Yelp’s first forays into developing an open standard. We’re definitely hooked and look forward to working with more local governments in the future to iterate on this standard and help share the wealth of information they have on local businesses.