Bringing Health Inspection Scores to Yelp
Our post today is by Will L., an engineering intern on one of Yelp’s backend teams this past fall. Will walks us through the challenges of bringing restaurant health inspection scores to Yelp, a feature we announced today at the United States Conference of Mayors in Washington, DC.
As
you may have seen on our official blog, we are very excited
about our initial release of the Local Inspection Value-Entry Specification
(LIVES). LIVES is an open data standard crafted by Yelp in partnership with the
cities of San Francisco and New York to allow municipalities to publish
restaurant health inspection information in a machine-readable format.
In this
post, we want to take you behind the scenes and walk through the steps
involved, from getting the standard off the ground to displaying health
inspection information on Yelp.
The Local
Inspection Value-Entry Specification was first drafted by Yelp engineer John
Boiles (also known at Yelp for his Kinect
hacking and the legendary KegMate) in mid-June 2012 and
was followed by several collaborative revisions with key members of city health
departments. Individuals within the cities of San Francisco, New York, and
Philadelphia were instrumental to the process of refining the standard with
their domain expertise and feedback. On January 9th, 2013, the latest version
(1.0) was published.
A LIVES
feed contains several comma-separated value (CSV) files to encapsulate the feed
data in an easy-to-read textual representation: businesses (businesses.csv),
inspections (inspections.csv), related violations (violations.csv), score
mappings for municipality-specific conventions (legend.csv), and finally data
about the feed itself (feed_info.csv). Below is a snapshot of a portion of
businesses.csv taken from the LIVES feed provided by the City of San Francisco:
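For readers who want to experiment with a feed, here is a minimal Python sketch of loading businesses.csv into memory. The column names and the single sample row are illustrative assumptions made up for this example; they are not taken from the San Francisco feed, and the authoritative field list is in the LIVES specification itself.

```python
import csv
import io

# Hypothetical stand-in for a real businesses.csv. The header and the row
# below are invented for illustration and are not real feed data.
SAMPLE_BUSINESSES_CSV = """business_id,name,address,city,state,postal_code,phone_number
10,Example Cafe,123 Main St,San Francisco,CA,94100,+14155550100
"""


def load_businesses(csv_file):
    """Return a dict mapping each business_id to its parsed row."""
    reader = csv.DictReader(csv_file)
    return {row["business_id"]: row for row in reader}


# In practice csv_file would be open("businesses.csv", newline="").
businesses = load_businesses(io.StringIO(SAMPLE_BUSINESSES_CSV))
```

Keying rows by business_id is a natural choice here because the other files in a feed, such as inspections.csv and violations.csv, describe events for a given business.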
Once the
specification began shaping up at the beginning of October, a team of engineers
at Yelp started to build a system to process and display LIVES data on our
site. The first step was to come up with a scalable and maintainable system
based on the requirements and constraints of the standard. While the LIVES standard
is currently in use by two cities, Yelp is calling upon municipalities all across
the US to share their health inspection data. As such, scalability played a
critical role in our design process.
One of the
more interesting and challenging aspects of the project centered on
matching up a city’s record for a business to its equivalent on Yelp. While
this may sound simple at first, it proves technically challenging when you
realize cities are more interested in the legal representation of a business
whereas Yelp focuses on what you would see if you were standing in front of the
business on the street. For example, Starbucks may register itself as
“Starbucks Coffee Company” with the City of San Francisco but will show up as
just “Starbucks” on Yelp. Similar problems arise with addresses and phone
numbers, all of which are attributes we use to help pinpoint the right business
on Yelp (e.g., a chain might use a central number for registrations but have
its individual numbers on Yelp).
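To make the mismatch concrete, here is a rough Python sketch of the kind of cleanup that can be applied to names and phone numbers before comparing them. This is not Yelp's actual code: the suffix list and both helper functions are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical list of legal or corporate suffix words to drop from a name.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "corporation", "company", "co", "ltd"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def normalize_phone(phone):
    """Keep only digits, then the last ten, so country codes are ignored."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]


city_name = normalize_name("Starbucks Coffee Company")
yelp_name = normalize_name("Starbucks")
```

Even after this cleanup, the two names in the example are close but not identical, which is why exact comparison is not enough and some form of fuzzy, weighted matching is needed.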
While
matching a set of data to a business is something we do routinely here at Yelp
(after all, a search on Yelp is a very similar problem), the stakes for this
project were much higher, especially with regard to false positives when
matching. Just imagine how the owner of a 5-star restaurant with a perfect
health record would feel if we incorrectly associated a failing inspection
with their profile on Yelp.
To fine-tune
our matching, we ran several sample data sets from San Francisco and New
York City through our tools and evaluated our results, paying particular
attention to false negatives and false positives. By normalizing the raw
data from the municipalities and tweaking how we weight each piece of data
(name, address, and phone number), we were able to dramatically
reduce the number of false positives. Matching business records is never a
completed project, however, so we’re constantly collecting metrics on how it’s
performing with new data sets and tweaking its algorithms and weights
accordingly.
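The weighting idea can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the weights, the threshold, the use of difflib for string similarity, and the function names are hypothetical and do not describe Yelp's production matcher.

```python
from difflib import SequenceMatcher

# Hypothetical weights for each attribute; they sum to 1.0.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

# Hypothetical cutoff. A high value trades more false negatives (missed
# matches) for fewer false positives (wrong matches).
MATCH_THRESHOLD = 0.8


def similarity(a, b):
    """Return a 0.0 to 1.0 similarity ratio; missing values score 0.0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(city_record, yelp_record):
    """Weighted sum of per-attribute similarities."""
    return sum(
        weight * similarity(city_record.get(field, ""), yelp_record.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def best_match(city_record, candidates):
    """Return the highest-scoring candidate, or None if below the threshold."""
    scored = [(match_score(city_record, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= MATCH_THRESHOLD else None
```

Returning None rather than the best available guess reflects the priority described above: leaving a business without inspection data is less harmful than attaching the wrong inspection to it.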
Once we had
all of the various implementation pieces glued together, the last step was to
implement a rollout strategy. At Yelp, we’ve developed several tools to assist
in this process to limit the exposure of a new feature. We’re able to release a
feature to our internal users only, expose it to only a certain portion of
public traffic, or whitelist the feature for certain businesses only. By
combining all of these, we’re able to iterate and deploy features quickly all
while keeping risks low.
We still
have a lot of work to do with LIVES. Besides continuing our gradual rollout of
the feature, our priority is to advocate for the adoption of the standard with
municipalities so that more health inspection data is available publicly and
can be displayed on Yelp. Since LIVES is an open standard, this not only
benefits consumers wondering if that food stand on the corner of the street is
a good choice; it also allows other organizations, such as research
institutions, to use this data to spot trends and perhaps prevent future
foodborne illness outbreaks. We’re equally interested in this data and plan on
looking at how the average scores evolve across cities as we make this data
more readily available to consumers like you. LIVES was one of Yelp’s first
forays into developing an open standard. We’re definitely hooked and look
forward to working with more local governments in the future to iterate on this
standard and help share the wealth of information they have on local businesses.