Engineering Blog

June 1st, 2015

Things To Do Outside of WWDC


WWDC is coming up soon and it’s going to be a blast! We’ve got our Apple Watch (with our lovely Yelp App installed, of course!) and are ready to party. To celebrate, we’re hosting our third annual WWDC party on June 8 and raffling off an Apple Watch (so you can use our app too)!

While you’re in town for WWDC, make sure to catch some of the other great meetups we’ve got lined up. We’ll also be speaking at a Women in Tech event at Alpine Labs and at a DBA Happy Hour co-hosted with Box!

We’re also really looking forward to hearing presentations by the talented girls of Technovation. We’ll help judge their World Pitch competition, where girls aged 10 to 18 partner up to develop ways of incorporating technology into their everyday lives. The top 10 finalist teams from around the world will be visiting us, competing for the chance to win $10,000 towards their project. Come meet the girls and support their hard work and dedication!

  • Thursday, June 4, 2015 – 6:00PM – Introduction to Functional Reactive Programming on Android (SF Android User Group)
  • Monday, June 8, 2015 – 6:30PM – Yelp WWDC Afterparty (Yelp Engineering)
  • Tuesday, June 9, 2015 – 6:00PM – How to Design Habit-Forming Products Workshop (Nir Eyal)
  • Wednesday, June 10, 2015 – 6:15PM – Learn about the inner workings of the Internet and Twisted (SF Python)
  • Thursday, June 11, 2015 – 6:00PM – Interactive Data Science and Sharing with Jupyter and IPython (SF Big Analytics)
  • Thursday, June 18, 2015 – 6:30PM – Failure and Success (Designers + Geeks)
  • Tuesday, June 23, 2015 – 6:45PM – Bigger Data, Bigger Impact (Products That Count)
  • Wednesday, June 24, 2015 – 5:30PM – World Pitch Competition (Technovation)
May 27th, 2015

Seeing Double On Yelp

Being able to easily find what you want on Yelp is a critical part of ensuring the best user experience. One thing that can negatively affect that experience is displaying duplicate business listings in search results, and if you use Yelp often enough, you might have run into duplicate listings yourself.

We constantly receive new business information from a variety of sources including external partners, business owners, and Yelp users. It isn’t always easy to tie different updates from different sources to the same business listing, so we sometimes mistakenly generate duplicates. Duplicates are especially bad when both listings have user-generated content as they lead to user confusion over which page is the “right” one to add a review or check-in to.


The problem of detecting and merging duplicates isn’t trivial. Merging two businesses involves moving and destroying information from multiple tables, which is difficult for us to undo without significant manual effort. A pair of businesses can have slightly different names, categories, and addresses while still being duplicates, so trying to be safe by only merging exact matches isn’t good enough. On the other hand, using simple text similarity measures generates a lot of false positives by misclassifying cases like:

Business Match

The first step in our deduplication system is our Business Match service. Every time a new business is created or a business’s attribute is changed, a message is published to the new_business or business_changed topic. A batch, using a wrapper over Apache Kafka, consumes these messages and calls Business Match to find any potential duplicates of the affected business above a relatively low confidence threshold. Business Match works by taking partial business information (such as name, location, and phone) as input, querying Elasticsearch, reranking the results with a learned model, and returning any businesses that the model scores above a scoring threshold. If Business Match returns any results, the business pairs are added to a table of potential duplicates.
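The last step of that flow, keeping only the candidates the model scores above a threshold and turning them into rows for the potential-duplicates table, can be sketched as follows. This is a minimal illustration, not Business Match itself: the Candidate type, the potential_duplicates function, and the 0.3 threshold are all hypothetical, and the scores are assumed to have already come back from the reranking model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A search result with the (assumed) model score for being a duplicate."""
    business_id: str
    score: float


def potential_duplicates(business_id, candidates, threshold=0.3):
    """Return (business_id, candidate_id, score) rows for candidates above threshold.

    The threshold value is a placeholder; the post only says it is relatively low.
    """
    rows = []
    # Highest-scoring candidates first, skipping the business itself.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.business_id != business_id and cand.score > threshold:
            rows.append((business_id, cand.business_id, cand.score))
    return rows
```

Each returned row corresponds to one pair that would be queued for review or automatic merging.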

Our User Operations team is responsible for going through pairs of businesses in the table and either merging them or marking them as not duplicates. However, the rate at which duplicates are added to the queue far outpaces the rate at which humans can manually verify them, which motivated us to develop a high-precision duplicate business classifier that would allow us to automatically merge duplicate pairs of businesses.

Getting Labelled Data

In order for our classifier to work, we needed to get thousands of instances of correctly labelled training data. For this, we sampled rows from our business duplicate table and created Crowdflower tasks to get crowdsourced labelings. We’ve launched public tasks as well as internal-only tasks for our User Operations team which let us easily create a gold dataset of thousands of accurately labelled business pairs. In the future, we are planning on trying an active learning approach where only data that our classifier scores with low confidence is sent to Crowdflower, which would minimize the amount of necessary human effort and allow our classifier to reach a high accuracy with a minimal number of training instances.
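The active learning idea mentioned above is often implemented as uncertainty sampling: only pairs whose predicted duplicate probability is far from both 0 and 1 are sent out for labelling. The sketch below assumes the classifier already produced a probability per pair; the function name and the 0.2 to 0.8 band are illustrative choices, not something the post specifies.

```python
def select_for_labeling(scored_pairs, low=0.2, high=0.8):
    """Pick the pairs the classifier is least sure about.

    scored_pairs: iterable of (pair, probability_of_duplicate).
    Returns the pairs inside the uncertain band, most uncertain first.
    """
    uncertain = [(pair, p) for pair, p in scored_pairs if low <= p <= high]
    # A probability close to 0.5 means the classifier is closest to guessing.
    uncertain.sort(key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _ in uncertain]
```

Confident predictions (near 0 or 1) are left out, so human effort is spent only where a label is most likely to change the model.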



Our classifier takes as input a pair of businesses and generates features based on analyzing and comparing the business fields. It uses the same model (scikit-learn’s Random Forests) and many of the same basic features as Business Match, like geographical distance, synonym-aware field matching, and edit distance / Jaccard similarity on text fields. In order to capture the kinds of false positives described earlier, we also added two intermediate classifiers whose output was used as features for the final classifier.
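Three of the basic features named above have standard textbook definitions, sketched here in plain Python. The function names are made up for illustration, tokenization is a simple whitespace split, and the synonym-aware matching is not covered; the actual feature code may differ in all of these respects.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def jaccard(text_a, text_b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty fields are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]
```

For a pair of listings, these values would be computed per field (name, address, coordinates) and passed to the model as a feature vector.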

We created a named entity recognizer to detect and label business names that indicate a person (e.g. lawyers, doctors, real estate agents) in order to detect the differences between a professional and their practice or two professionals working at the same practice.

Another feature we added is a logistic regression classifier that works by running a word aligner on both business names, finding which terms occur in one or both business names, and determining how discriminative the similarities and differences between the two names are. It outputs a confidence score, the number of uncommon words that appeared in one name but not the other, and the number of uncommon words that appeared in both names, which are used as features in the duplicate classifier.
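The two count outputs can be approximated with plain set operations. Note what this sketch swaps in: a simple lowercase token comparison replaces the word aligner, a small hand-written set of common words replaces whatever frequency data the real system uses, and the logistic regression confidence score is omitted entirely. All names here are hypothetical.

```python
# Hypothetical list of words too common to be discriminative in business names.
COMMON_WORDS = frozenset({"the", "and", "of", "restaurant", "cafe", "inc", "llc"})


def uncommon_word_counts(name_a, name_b, common=COMMON_WORDS):
    """Count uncommon words found in exactly one name and in both names."""
    words_a = {w for w in name_a.lower().split() if w not in common}
    words_b = {w for w in name_b.lower().split() if w not in common}
    return {
        "only_one": len(words_a ^ words_b),  # symmetric difference
        "both": len(words_a & words_b),      # intersection
    }
```

Under these assumptions, "Smith Dental" and "Jones Dental" share one uncommon word but differ in two, which hints that they are two professionals rather than one duplicated listing.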



Since merges are hard to undo, false positives are costly, so the focus of our classifier was on precision rather than recall. Our main evaluation metric was the F0.1 score, which treats precision as 10 times more important than recall. With all of our classifier’s features, we achieved an F0.1 score of 0.966 (99.1% precision, 27.7% recall) on a held-out data set, compared to a baseline F0.1 = 0.915 (97.1% precision, 13.4% recall) for the strategy of only merging exact (name/address) matches and F0.1 = 0.9415 (96.6% precision, 26.4% recall) using only the basic Business Match feature set.
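Assuming F0.1 here is the standard F-beta measure with beta = 0.1, the reported score can be reproduced from the precision and recall figures. The helper name f_beta is ours.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Full feature set from the post: 99.1% precision, 27.7% recall.
full_model = f_beta(0.991, 0.277, beta=0.1)
print(round(full_model, 3))  # 0.966
```

With beta this small the score stays close to the precision, so a large gain in recall moves it only slightly while a small loss in precision moves it noticeably.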

Future Work

With the work done on our duplicate classifier and automatic business merging, we’ve been able to merge over 500,000 duplicate listings. However, there’s still room for improvements on deduplication. Some things slated for future work are:

  • language and geographical area-specific features
  • focusing deduplication efforts on high-impact duplicates (based on number of search result impressions)
  • extracting our named entity and discriminative word classifiers into libraries for use in other projects

With the improvements to our classifier, we hope to be able to detect and merge all high-confidence duplicate business listings and minimize the necessary amount of human intervention.

May 18th, 2015

HTTPS Client Testing Made Easy

You’ve probably read about the recent AFNetworking vulnerability. Nowadays, it’s not sufficient to just test your SSL certificates. You must also test how your clients use these certificates to have confidence your users aren’t getting pwned.

There are plenty of tools to test broken cipher suites and cryptographic vulnerabilities but, until now, there wasn’t a readily-available, free, and simple tool for testing how your client handles certificate requests and X.509 verification over the wire. Enter tlspretense-service.

This tool provides a simple Docker container built around iSEC Partners’ tlspretense certificate testing suite that acts as a MitM to test your clients. It works very similarly to tlspretense-docker, except instead of routing through the container by making (potentially problematic) networking changes, you instead connect to the container directly. This means that if your client accepts a service URL to connect to, you can just point it at tlspretense-service to test how robustly it validates certificates. The tlspretense configuration file contains more information on which tests are run.

tlspretense-service is fairly easy to get up and running, since the Docker container does the bulk of the work for you. When it’s done running, you’ll get a handy report like this:

Each test tells you what it expected, what the actual result was, and the complete duration of each connection.

The above example demonstrates the default behaviour of curl (curl https://localhost:8443). Note that curl rejected every connection in this naive test. This is because tlspretense provides its own CA for use in testing. Here’s what happens if we trust it (curl https://localhost:8443 --cacert tlspretense/ca/goodcacert.pem):

This time, our client connected in the majority of cases. The output then proceeds to show whether curl continued the connection, what expected behavior would be, and whether the test passed or failed.
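The same two runs can be reproduced from a Python client. The sketch below is only an illustration: it assumes the container is reachable at localhost:8443 and that the test CA lives at tlspretense/ca/goodcacert.pem, as in the curl commands above, and the helper names are hypothetical. It relies only on the standard library ssl and socket modules.

```python
import socket
import ssl


def make_context(ca_file=None):
    """Verifying client context.

    With ca_file=None the system trust store is used; with a path, only that
    CA file is trusted, which mirrors what curl --cacert does.
    """
    return ssl.create_default_context(cafile=ca_file)


def handshake_succeeds(host, port, context, timeout=5.0):
    """Return True if the TLS handshake completes, False if verification fails."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLError:
        # Certificate verification errors are a subclass of ssl.SSLError.
        return False
```

Calling handshake_succeeds("localhost", 8443, make_context("tlspretense/ca/goodcacert.pem")) once per test connection would play the role of the second curl command; the report generated by the service remains the source of truth for which tests passed.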

Here are a few interesting things you can do with this container:

  1. Rigorously test certs for your public HTTPS clients, including web browsers and web proxies that connect to the outside world.
  2. Intercept and test certs for your internal HTTPS and HSTS service traffic safely, without disclosing information to a third party.
  3. Create regression tests and test harnesses around your clients, to ensure they’re always X.509 compliant, without having those tests perform invasive networking changes.

This tool is still a work in progress, so feel free to report any bugs or issues you find with it. Test on!

May 8th, 2015

Yelp Tech Talks: Mobile Testing 1, 2, 3 Wrap Up

Last week we held our second tech talk, focusing on mobile, in a new internal series we launched this year. The presenters covered two very important topics in mobile: wearable apps and testing.

Building Our Apple Watch App

The evening started with a talk by Bill M., who led the efforts to build our Apple Watch app. We knew we had to build the app in order to provide our users with the best possible access and experience. Since the platform was brand new and kept changing, it came with its own challenges. Development needed to be planned carefully to make sure we could deliver an app with a set of features that added up to a useful experience but still hit the launch date.

The Yelp Apple Watch storyboard

With a few key features in mind, our iOS developers dove in and started coding away. The storyboard (shown above) defines all of the functionality for the app. The team faced a lot of challenges with how to handle network requests, images, location, and phone-watch communication.


After the Apple Watch talk, Mason G. and Tim M. followed up with how we test our iOS and Android mobile apps (respectively) to ensure the best possible end product.

Both apps share a common API. Whenever we want to change this API, we start by writing documentation and examples. This allows us to define a contract between the API and the clients, and work out major problems before we write any code. Additionally, we can then take the examples and use them as mock data to test the clients.

Mason then led us into the world of iOS testing. Testing is critical in order to prevent regressions and give developers confidence that their changes work. The team relies on unit, integration, and acceptance tests to make sure that different components all work correctly. Our test suite leverages several tools, including KIF, to provide great test coverage.

Tim then spoke on the challenging world of Android testing. The major concerns with Android testing include test scalability, reliability, and speed. Despite the many obstacles to overcome, we’ve created a solid setup here at Yelp, largely due to the great open source libraries available for the platform, including Spoon and Espresso.

Screenshot from a tech talk showing all the different testing frameworks.

We have the full video and slides online:

On to the next talks!

If you weren’t able to attend our tech talk this time around, don’t worry! You’ll still get a chance to see our offices at our 3rd annual WWDC party. RSVP here.

If you’re interested in future events at Yelp or in engineering opportunities, let us know!

April 30th, 2015

Mycroft – Load Data into Redshift Automatically

Yelp generates terabytes of logs every day. Starting in 2010 with the release of mrjob, Yelp has relied heavily on Amazon Elastic MapReduce (EMR) and MapReduce jobs to analyze this data. While MapReduce works well to repeatedly answer the same question, it’s not a great tool to answer questions that are not well defined or that need to be answered only once. Consequently, we started using Redshift, Amazon’s Postgres-compatible column-oriented data warehouse, to explore our data.

Yelp’s log data already lands on S3 every day, making it a convenient location to stage data for loading into Redshift. Unfortunately, most of our logs aren’t in a format that can be directly loaded but instead need to be lightly transformed, then converted into JSON or CSV for loading. mrjob is the perfect tool to perform these light transformations – so much so that we started building infrastructure to make this extremely common pattern as easy as possible.
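As a rough idea of what a "light transformation" might look like, the sketch below turns one tab-separated log line into a JSON record. The log layout and field names are invented for the example, and the function is plain Python rather than Mycroft or mrjob code; in practice this kind of logic would typically sit inside the mapper of an mrjob job.

```python
import json


def log_line_to_json(line, fields=("timestamp", "user_id", "url")):
    """Convert one tab-separated log line into a JSON object string.

    Returns None for lines that do not have exactly one value per field,
    so malformed input can be skipped instead of loaded.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        return None
    return json.dumps(dict(zip(fields, values)), sort_keys=True)
```

One JSON object per line is a convenient output shape here because it is one of the formats mentioned above as loadable into Redshift.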

Mycroft is an orchestrator that coordinates mrjob, S3, and Redshift to automatically perform light transformations on daily log data. Just specify a cluster, schema version, S3 path, and start date, and Mycroft will watch S3 for new data, transforming and loading data without user action. Mycroft’s web interface can be used to monitor the progress of in-flight data loading jobs, and can pause, resume, cancel or delete existing jobs. Mycroft will notify users via email when new data is successfully loaded or if any issues arise. It also provides tools to automatically generate schemas from log data, and even manages the expiration of old data as well as vacuuming and analyzing data.

Mycroft provides a web interface that makes it easy to create new data loading jobs.

Mycroft ships as a set of Docker containers which use several AWS services, so we’ve provided a small configuration script to ease the initial customization. Once configured, the service itself can be started using docker-compose, making getting it up and running relatively painless.

A comprehensive Quickstart is available for getting Mycroft up and running. The guide steps through getting a copy of Mycroft, configuring Mycroft and launching the required AWS services, and culminates in generating a schema for some example data and loading that data into Redshift.

Mycroft is available on GitHub. Please let us know if you encounter any issues with Mycroft, and don’t hesitate to submit pull requests with any great features you decide to develop.

Thanks to the team and everyone who helped build Mycroft: John Roy, Boris Senderzon, Anusha Rajan, and Justin Cunningham.