We’re excited to release our first image dataset with hundreds of thousands of user-submitted photos as part of a challenge to all data scientists, launching this week on Kaggle!

Yelp’s users provide several kinds of “unstructured” data such as reviews, photos, and videos. They can also answer structured questions like, “Is the restaurant romantic?” These structured answers are incredibly useful to users who want a quick summary of important attributes of a business. We want to know: can you extract these attributes from our photos dataset, and what is the right way to approach this problem?

If this type of challenge excites you, we have good news for you: we have dozens of similar problems and want your help solving them! Think of this challenge as a way to get to know Yelp, and your submission as a way for us to get to know you. The best solutions will get invitations to interview directly with the engineers that tackle these problems routinely.

Join us in exploring this data and help us push the envelope for using machine learning to unlock structured information from our user generated content.

Introducing the competition

For this competition, participants are tasked with creating a model which predicts binary business attribute labels for restaurants based on the restaurant’s photos. A dataset has been created which consists of user-submitted photos for a subset of restaurants on Yelp, where each restaurant has a variable number of photos. Nine user-reported business attributes are also attached to each restaurant; these are binary tags which can only take on “yes” or “no” values, such as “good for lunch”, “classy ambience”, or “is expensive”. Since the data is user-submitted, there is a certain amount of noise - for example, some images may be duplicated due to a user submitting the same photo twice.

As per other Kaggle competitions, this data is split into a “training” data set, where the business attributes are supplied, and a “test” data set, where the business attributes are withheld. The challenge is to develop a model or algorithm based on the training data, and apply it to the test data. The test data submissions are used to evaluate the model’s accuracy via a metric known as the mean F1 score, which ranges from 0 (worst) to 1 (best).

We have also prepared several baseline models to facilitate benchmarking and comparisons. Our first model is just a naive random guesser: it makes a random assignment for each attribute with equal probability. Random guessing results in a score of 0.4347. Our second model plays with color: it calculates the color distribution of all the images of a test business, compares that to the average color distribution of businesses with positive attribute values and negative attribute values respectively, and assigns the value with a more similar color distribution to the test business. This is harder to beat - our color feature benchmark results in a score of 0.6459.

While our color model benchmark is simple, it’s still able to perform quite a bit better than random guessing. The rest is up to you - will you train models on single image samples, or should you use a multi-instance cost function? Will the latest and greatest deep neural networks be effective? Let us know through the leaderboard!

Prizes

The prize for this competition is a fast track through Yelp’s recruiting process and an opportunity to show our data mining teams just what you’ve got! Keep in mind though, Yelp is not just looking for your best model; we are looking for data mining engineers that can help us use our data in novel ways while also pushing high-quality code to production. For more information about opportunities to use your machine learning and statistics skills at Yelp, check out our data mining career page.

Check it out!

Head over to the Kaggle competition to get a taste of the challenges our engineers solve every day.

Become a Part of the Data Revolution at Yelp

Interested in using machine learning to unlock information contained in Yelp's data through problems like this? Apply to become a Data-Mining Engineer.

View Job

Back to blog