The Yelp Dataset Challenge Goes International! New Data, New Cities, Open to Students Worldwide!

The Challenge

The Yelp Dataset Challenge provides the academic community with a real-world dataset on which to apply their research. We encourage students to take advantage of this wealth of data to develop and extend their own research in data science and machine learning. Students who submit their research are eligible for cash awards, as well as incentives for publishing and presenting their findings.

The most recent Yelp Dataset Challenge (our third round) opened in February 2014, giving students access to our Phoenix Academic Dataset, with reviews and businesses from the greater Phoenix metro area. In the fourth round, open now, we are expanding the dataset to include data from four new cities around the world. We are also opening the challenge to international students; see the terms and conditions for more information.

New Data

We are proud to announce that we are extending the popular Phoenix Academic Dataset to include four new cities! By adding a diverse set of cities, we hope to encourage students to compare and contrast the different aspects of each city and find new insights about what makes each city unique. The dataset comprises reviews, businesses, and user information from:

  • Phoenix, AZ
  • Las Vegas, NV (new!)
  • Madison, WI (new!)
  • Waterloo, CAN (new!)
  • Edinburgh, UK (new!)

This new dataset expands the previous Phoenix Academic Dataset with the following additions and is available for immediate download:

  • Businesses - 42,153 (+26,568 new businesses!)
  • Business Attributes - 320,002 (+208,441 new attributes!)
  • Check-in Sets - 31,617 (+20,183 new check-in sets!)
  • Tips - 403,210 (+289,217 new tips!)
  • Users - 252,898 (+182,081 new users!)
  • User Connections - 955,999 (+804,482 new edges!)
  • Reviews - 1,125,458 (+790,436 new reviews!)

Round 4 is Now Live

Along with the updated dataset, we’re also happy to announce the next iteration of the Yelp Dataset Challenge. The challenge is open to students around the world and will run from August 1st, 2014 to December 31st, 2014. See the website for the full terms and conditions. This data can be used to train a myriad of models and to extend research in many fields, so download it now and start putting this real-world data to use right away!
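To get started quickly: the academic dataset ships as newline-delimited JSON, one record per line. Here is a minimal sketch of a loader; the sample record and its fields are illustrative, not the full schema, and file names in your download may differ.

```python
import json

def iter_records(lines):
    """Parse newline-delimited JSON records, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Tiny inline sample in the same shape as a business record (fields are
# illustrative assumptions, not the complete schema):
sample = '{"business_id": "abc123", "city": "Madison", "stars": 4.5}\n'
records = list(iter_records(sample.splitlines()))
```

In practice you would pass an open file handle (e.g. for the businesses file) to `iter_records` instead of the inline sample.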



Introducing MOE: Metric Optimization Engine; a new open source, machine learning service for optimal experiment design

At Yelp we run a lot of A/B tests. By constantly trying new features and testing their impact, we are able to continue evolving our products and make them as useful as possible. However, running online A/B tests can be expensive (in opportunity cost, user experience, or revenue) and time consuming (to achieve statistical significance).

Furthermore, many A/B tests boil down to parameter selection (more of an A/A’ test, where a feature stays the same, and only the parameters change). Given a feature, we want to find the optimal configuration values for the constants and hyperparameters of the feature as quickly as possible. This can be analytically impossible for many systems. We need to treat these systems like black boxes where we can observe only the input and output. We want some combination of metrics (the objective function) to go up or down, but we need to run expensive, time-consuming experiments to sample this function for each set of parameters.

MOE, the Metric Optimization Engine, is an open source, machine learning tool for solving these global, black box optimization problems in an optimal way. MOE implements several algorithms from the field of Bayesian Global Optimization. It solves the problem of finding optimal parameters by building and fitting a model of the objective function given historical information using Gaussian Processes. MOE then finds and returns the point(s) of highest expected improvement. These are the points that will have the highest expected gain over the best historical samples seen so far. For more information see the documentation and examples.
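To make the approach concrete, here is a minimal, self-contained NumPy/SciPy sketch (not MOE’s actual code) of the two ingredients described above: a Gaussian Process posterior fit to historical samples, and the closed-form expected improvement used to pick the next point. The kernel choice, length scale, and sample values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance between two sets of 1-D points.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    # Standard GP regression: posterior mean and variance at x_test.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_train
    # Prior variance k(x, x) = 1 for this kernel, minus the reduction
    # from conditioning on the observed data.
    var = np.ones(len(x_test)) - np.einsum("ij,ji->i", K_s.T @ K_inv, K_s)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    # Closed-form EI for maximization:
    # EI = (mu - y_best) * Phi(z) + sigma * phi(z), z = (mu - y_best) / sigma
    sigma = np.sqrt(var)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Historical samples of the (expensive) objective - illustrative values:
x_train = np.array([0.1, 0.4, 0.9])
y_train = np.array([0.2, 0.8, 0.3])

# Candidate parameter settings to consider next:
x_test = np.linspace(0.0, 1.0, 101)
mu, var = gp_posterior(x_train, y_train, x_test)
ei = expected_improvement(mu, var, y_train.max())

# The point of highest expected improvement is the next suggested sample.
x_next = x_test[np.argmax(ei)]
```

In MOE itself these pieces are implemented with considerably more care (and exposed through its REST, Python, and C++ interfaces); this sketch only illustrates the underlying math.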

Here are some examples of when you could use MOE:

  • Optimizing a system's click-through rate (CTR). MOE is useful when evaluating CTR requires running an A/B test on real user traffic, and getting statistically significant results requires running this test for a substantial amount of time (hours, days, or even weeks). Examples include setting distance thresholds, ad unit properties, or internal configuration values.

  • Optimizing tunable parameters of a machine-learning prediction method. MOE can be used when calculating the prediction error for one choice of the parameters takes a long time, which might happen because the prediction method is complex and takes a long time to train, or because the data used to evaluate the error is huge. Examples include deep learning methods or hyperparameters of features in logistic regression.

  • Optimizing the design of an engineering system. MOE helps when evaluating a design requires running a complex physics-based numerical simulation on a supercomputer. Examples include designing and modeling airplanes, the traffic network of a city, a combustion engine, or a hospital.

  • Optimizing the parameters of a real-world experiment. MOE can help guide design when every experiment needs to be physically created in a lab or very few experiments can be run in parallel. Examples include chemistry, biology, or physics experiments or a drug trial.

We want to collect information about the system as efficiently as possible, while finding the optimal set of parameters in as few attempts as possible. We want to find the best trade-off between gaining new information about the problem (exploration) and using the information we already have (exploitation). This is an application of optimal learning. MOE uses techniques from this field to solve this problem in an optimal way.

MOE provides REST, Python and C++ interfaces. A MOE server can be spun up within a Docker container in minutes. The black box nature of MOE allows it to optimize any number of systems, requiring no internal knowledge or access. By using MOE to inform parameter exploration of a time-consuming process like running A/B tests, performing expensive batch simulations, or tuning costly models, you can optimally find the next best set of parameters to sample, given any objective function. MOE can also help find optimal parameters for heuristic thresholds and configuration values in any system. See the examples for more information.
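As a sketch of what talking to the REST interface might look like, here is a small stdlib-only Python client. The endpoint path, port, and JSON field names below are assumptions based on our reading of the MOE docs at the time; verify them against the current documentation before relying on them.

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint for "next points by expected improvement"; check the
# MOE docs for the actual path and port on your deployment.
MOE_URL = "http://localhost:6543/gp/next_points/epi"

def build_payload(points_sampled, dim, num_to_sample=1):
    # Historical samples: each entry is (parameter point, observed
    # objective value, sampling noise/variance of that observation).
    return {
        "domain_info": {"dim": dim},
        "gp_historical_info": {
            "points_sampled": [
                {"point": p, "value": v, "value_var": var}
                for p, v, var in points_sampled
            ]
        },
        "num_to_sample": num_to_sample,
    }

def next_points(points_sampled, dim):
    body = json.dumps(build_payload(points_sampled, dim)).encode()
    req = Request(MOE_URL, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        # Response field name is likewise an assumption from the docs.
        return json.load(resp)["points_to_sample"]
```

A typical loop would call `next_points` with everything sampled so far, run the suggested experiment, append the result, and repeat.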

MOE is available on GitHub. Try it out and let us know what you think at opensource+moe@yelp.com. If you have any issues please tell us about them along with any cool applications you find for it!


May the Yelps be with You

May brings us talks from the Python meetup, another mind-blowing talk from Designers + Geeks, and a talk from Products That Count on brand naming. I’m also excited to give you a sneak peek into June: we’re hosting our second annual WWDC after party. We promise this after party will be so good, you won’t want to leave for the hotel lobby (even if R. Kelly himself invites you).

For our Pythonista readers, let’s take a deeper look into the upcoming Python meetup. Packaging turns out to be an important part of any language. Great packaging encourages language adoption, focuses the community on 1-2 of the best solutions to a problem, and encourages modular design. But it’s also a surprisingly tricky problem: support for a variety of operating systems, interactions with compiled libraries, organization of namespaces, programmatic specification of dependencies, and discovery and documentation are just some of the problems that need to be tackled. We haven’t even covered the difference between installing packages “locally” vs. system-wide, and the implications for deploying a set of packages!

Luckily, next week Noah Kantrowitz is going to help us sort through these issues with two presentations covering Python packaging and deployments. In between the main talks, we’ll see lightning talks and get a chance to mingle and ask questions. Hope you can join us!

Mark your calendars now for next month’s WWDC: Yelp is opening our doors for an after party to top them all! Meet some of the cool cats who work on our award-winning iPhone app and get an inside look at Yelp life. There will be plenty of 5-star hors d’oeuvres and wine served from our own customized barrels! Please RSVP here and don’t forget to bring your conference badge.