Yelp hosts tens of millions of photos uploaded by Yelpers from all around the world. The wide variety of these photos provides a rich window into local businesses, a window we’re only just peeking through today.

One way we’re trying to open that window is by developing a photo understanding system which allows us to create semantic data about individual photographs. The data generated by the system has been powering our recent launch of tabbed photo browsing as well as our first attempts at content-based photo diversification.

Building a Photo Classifier

One can imagine a variety of ways to tackle the ambitious goal of holistically understanding pictures. To help simplify our problem, we focused initially on only sorting photos into a handful of predefined classes. Further, we focused only on categories of photos directly relevant to restaurants as shown below:


To develop a classifier that can put a photo into one of these groups, we first need to collect many photos with known labels. We collected these labels in a few different ways:

  1. Photo captions: A good number of “menu” photos have the word “menu” in their captions. Similarly, we can find photos captioned “sushi” or “burger” that are likely to be food. For menu photos, we had to worry about false positives because it’s not uncommon to see “food” or “drink” photos with captions like “Best on Their Menu!”, so some cleanup was necessary. To help identify food items, we relied on Yelp’s menu structures, which maintain each business’s list of food items. We found that matching food items from the list to photo captions yielded a high-precision dataset.

  2. Photo attributes: When uploading photos to Yelp, users are allowed to mark a few attributes about the photo, such as “Is it the storefront?” They are not always accurate, but still serve as a good source of candidate photos.

  3. Crowdsourcing: We ran additional tasks through a crowdsourcing partner to correct our guesses for what label should be applied to each photo and to collect more “inside” and “outside” photos. We have found that this led to generally good quality labels at a reasonable cost (both in time and money).
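The caption-based labeling in item 1 can be sketched roughly as follows. This is a minimal illustration, not Yelp's actual pipeline: the function name `label_from_caption`, the `menu_items` argument, and the rule of trusting “menu” only when it is the entire caption are all assumptions made for the example.

```python
import re


def label_from_caption(caption, menu_items):
    """Return a candidate label ("food" or "menu") for a caption, or None.

    Hypothetical helper: menu_items stands in for the business's list of
    food items taken from its menu data.
    """
    text = caption.lower()
    # A caption that names one of the business's menu items suggests food.
    for item in menu_items:
        if re.search(r"\b" + re.escape(item.lower()) + r"\b", text):
            return "food"
    # Assumed cleanup rule: only trust "menu" when it is the whole caption,
    # since phrases like "Best on Their Menu!" usually describe a dish.
    if re.fullmatch(r"\W*(the\s+)?menu\W*", text):
        return "menu"
    return None


print(label_from_caption("Spicy Tuna Roll, so good", ["spicy tuna roll"]))
print(label_from_caption("Menu", []))
print(label_from_caption("Best on Their Menu!", []))
```

Photos that come back as `None` would simply be left out of the training set or passed on to one of the other labeling sources.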

Once we had our labeled data, we employed deep convolutional neural networks (CNNs) in the form of “AlexNet” to recognize those classes. CNNs usually consist of a deep stack of multiple convolutional layers (for extracting spatially local and translation-invariant features), ReLU (Rectified Linear Units) layers (for non-saturating activations), pooling layers (for down-sampling and translation-invariance), local response normalization layers (for better generalization) and fully-connected layers as in conventional feedforward neural networks. Softmax outputs and regularization methods such as dropout are also commonly used. Our CNN was built on AWS EC2 GPU instances based on the Caffe framework. We like Caffe because it’s easy to use, performant, open source (BSD 2-clause), and under active development. To address Caffe’s software dependencies, we wrapped our CNN using Docker so that it could be more easily deployed.
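To make a few of those building blocks concrete, here is a toy sketch of ReLU, max pooling, and softmax in plain Python. It is illustrative only: it works on flat lists rather than image tensors, the function names are our own, and it has nothing to do with how Caffe implements these layers.

```python
import math


def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]


def max_pool_1d(values, size=2):
    """Down-sample by keeping the largest value in each window of `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


activations = relu([-1.5, 0.3, 2.0, -0.2])
pooled = max_pool_1d(activations)
probabilities = softmax(pooled)
```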

We also created abstractions to ensure that our CNN could be easily integrated with other possible forms of classifiers, including different instances of CNNs. As illustrated below, our baseline is a “Caffe Classifier” that runs the CNN by means of Caffe; it’s a special form of an abstract classifier that can take different signals and perform different classification algorithms. Our current “facade” classifier is an ensemble that takes the weighted average of classification results from two independently trained Caffe Classifiers. If we decide to incorporate new classifiers relying on other signals, such as photo captions, adding them would be quite straightforward.
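A rough sketch of that abstraction might look like the following. The class names, the dict-of-scores return type, and the weights are assumptions for illustration; `FixedScoreClassifier` is a stand-in for a Caffe-backed classifier so that the example runs without Caffe.

```python
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Abstract classifier: maps a photo to a score per class."""

    @abstractmethod
    def classify(self, photo):
        """Return a dict mapping class name to score."""


class FixedScoreClassifier(Classifier):
    """Stand-in for a Caffe-backed classifier; returns preset scores."""

    def __init__(self, scores):
        self.scores = scores

    def classify(self, photo):
        return dict(self.scores)


class WeightedEnsemble(Classifier):
    """Facade that averages member scores using per-member weights."""

    def __init__(self, members):
        # members: list of (classifier, weight) pairs
        self.members = members

    def classify(self, photo):
        total_weight = sum(weight for _, weight in self.members)
        combined = {}
        for classifier, weight in self.members:
            for label, score in classifier.classify(photo).items():
                combined[label] = combined.get(label, 0.0) + weight * score / total_weight
        return combined


ensemble = WeightedEnsemble([
    (FixedScoreClassifier({"food": 0.8, "menu": 0.2}), 1.0),
    (FixedScoreClassifier({"food": 0.4, "menu": 0.6}), 3.0),
])
scores = ensemble.classify("photo.jpg")
```

Because the ensemble is itself a `Classifier`, a caption-based classifier could be added as another member without changing any calling code.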

On an evenly split gold test set of 2,500 photos, our current classifier shown above has an overall precision of 94% and a recall of 70%. While these numbers can definitely be improved, we found them reasonably good for the applications described below.
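For readers less familiar with these metrics, the sketch below computes precision and recall for a single class. The post does not say how the overall figures were aggregated, so treat this only as the standard per-class definition; the idea that a classifier may abstain (predict `None`) on low-confidence photos is our assumption, included to show one way precision can end up well above recall.

```python
def precision_recall(gold, predicted, target):
    """Per-class precision and recall for the class `target`."""
    true_positives = sum(
        1 for g, p in zip(gold, predicted) if g == target and p == target
    )
    predicted_positives = sum(1 for p in predicted if p == target)
    actual_positives = sum(1 for g in gold if g == target)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Toy data: None means the classifier declined to label the photo.
gold = ["food", "food", "menu", "food"]
predicted = ["food", None, "menu", "menu"]
food_precision, food_recall = precision_recall(gold, predicted, "food")
```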

Photo Classification Service

Yelp uses SOA (Service-Oriented Architecture) and we made a RESTful photo classification service to support existing or upcoming Yelp applications. The main applications of the service, as detailed below, are based on a business’s classified photos. Since the service is expected to host more than one classifier (e.g., of different versions, or for different types of businesses), the service APIs take a classifier ID, a business ID, and an optional class, and then return all photos belonging to the business that have been classified (into the optional class, if specified) by the classifier, e.g.:

We use a standard MySQL database server to host all classification results, and all service requests can be handled by simple database queries. To avoid more expensive real-time classifications, and because our current applications do not hinge on the latest photos’ classifications, we only perform offline classifications. The architecture is shown below: for every new classifier, we sweep over all photos and store their classification results in a database. The sweep is computationally intensive, but Yelp’s infrastructure allows us to alleviate this by running our classifiers in parallel on arbitrarily many machines. After the sweep, every day we also automatically collect new photos and send them into a batch for both classification and database load:
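The store-then-query pattern can be sketched as below. This is a simplified illustration under several assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name, column names, and function names are hypothetical rather than Yelp's actual schema or API.

```python
import sqlite3


def store_results(conn, classifier_id, results):
    """Load one batch of (photo_id, business_id, photo_class) rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO photo_classification VALUES (?, ?, ?, ?)",
        [(classifier_id, pid, bid, cls) for pid, bid, cls in results],
    )
    conn.commit()


def photos_for_business(conn, classifier_id, business_id, photo_class=None):
    """Return (photo_id, photo_class) rows, optionally limited to one class."""
    sql = (
        "SELECT photo_id, photo_class FROM photo_classification "
        "WHERE classifier_id = ? AND business_id = ?"
    )
    params = [classifier_id, business_id]
    if photo_class is not None:
        sql += " AND photo_class = ?"
        params.append(photo_class)
    sql += " ORDER BY photo_id"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo_classification ("
    "classifier_id TEXT, photo_id TEXT, business_id TEXT, photo_class TEXT, "
    "PRIMARY KEY (classifier_id, photo_id))"
)
# The initial sweep and each daily batch would both go through store_results.
store_results(conn, "v1", [("p1", "b1", "food"), ("p2", "b1", "menu"), ("p3", "b2", "food")])
menu_photos = photos_for_business(conn, "v1", "b1", photo_class="menu")
```

Keying rows on the classifier ID as well as the photo ID lets several classifier versions coexist in the same table, which matches the service taking a classifier ID on every request.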

Application: Cover Photo Diversification

Once we have the photo classification service in place, it can immediately power many key features at Yelp. For one thing, our business detail pages show a set of “cover photos” which are recommended by our photo scoring engine based on user feedback and certain photo attributes. A typical issue with our current cover photos is that the selected photos lack “diversity”: for example, all the cover photos shown below are about food (ramen), and users could not see other aspects of the business unless they click on the “See all” button.

Photo classification now allows us to diversify cover photos by classes – we can readily identify highest-scored non-food photos and incorporate them into cover photos. Through a rigorous A/B test, we confirmed that our restaurant viewers prefer to see a highlighted “food” photo and a highlighted “non-food” photo, as well as two smaller “food” photos and another two “non-food” photos, as illustrated below. Diversification has substantially increased our users’ interactions with photos.
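One simple way to build such a set is sketched below. It assumes each photo already carries a score from the photo scoring engine and a class from the classifier; the tuple layout, the function name, and the interleaving order are our own choices for illustration, not a description of the production logic.

```python
def diversify_cover_photos(photos, per_group=3):
    """Pick the top food and top non-food photos and interleave them.

    photos: list of (photo_id, score, photo_class) tuples.
    With per_group=3 this yields one highlighted plus two smaller photos
    for each of the food and non-food groups.
    """
    ranked = sorted(photos, key=lambda photo: photo[1], reverse=True)
    food = [p for p in ranked if p[2] == "food"][:per_group]
    non_food = [p for p in ranked if p[2] != "food"][:per_group]
    cover = []
    for i in range(per_group):
        if i < len(food):
            cover.append(food[i])
        if i < len(non_food):
            cover.append(non_food[i])
    return [photo_id for photo_id, _, _ in cover]


photos = [
    ("a", 0.9, "food"), ("b", 0.8, "food"), ("c", 0.7, "food"),
    ("d", 0.6, "food"), ("e", 0.5, "inside"), ("f", 0.4, "outside"),
    ("g", 0.3, "menu"),
]
cover_ids = diversify_cover_photos(photos)
```

Ranking purely by score would have returned only the food photos "a" through "d" first; splitting by class is what brings the inside, outside, and menu photos into view.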

Application: Tabbed Photo Browsing

As anyone who’s looked through Yelp’s photos before knows, the vast majority of Yelp’s photos from restaurants are food, but we’ve heard feedback from users that they find Yelp photos useful for more than just finding the most aesthetically pleasing burger.

Some folks use Yelp photos for checking out the atmosphere for a special event or navigating to a venue for the first time, while others use Yelp photos for more serious applications like finding out if a restaurant can accommodate a handicapped patron. All of these tasks are now easier and more efficient with the launch of tabbed photo browsing.

Tabbed photo browsing is our most significant application to date of our photo classification service. Photos are now organized under their respective tabs (classes); as we can see below, it is now way easier to jump to the exact information you’re looking for.

What’s Next

No machine learning system is perfect. If you want to help contribute to improving our photo classification quality, feel free to flag any misclassified photos you see.

We’re only at the beginning of exploring what we can do with Yelp’s vast photo corpus. Stay tuned to see where we’ll go!

Acknowledgements: The photo classification service was designed and implemented by Wei-Hong C., Prasanna S., Joel O., Colin P., and Mohini T. The web front end of tabbed photo browsing was designed by Taron G. and implemented by Lawrence W.
