Engineering Blog

November 6th, 2014

Yelp Dataset Challenge Round 3 Winners and Dataset Tools for Round 4

Yelp Dataset Challenge Round 3 Winners

We recently opened the fourth round of the Yelp Dataset Challenge. This announcement included an update to the dataset, adding four new international cities and bringing the total number of reviews in the dataset to over one million. You can download it and participate in the challenge here. Submissions for this round are open until December 31, 2014. See the full terms for more details.

With the opening of our fourth iteration of the challenge, we closed the third round, which ran from February 1, 2014 to July 31, 2014. We are proud to announce two grand prize winners of the $5,000 award from round three:

  • Jack Linshi from Yale University with his entry “Personalizing Yelp Star Ratings: A Semantic Topic Modeling Approach.” Jack proposed an approximation of a modified latent Dirichlet allocation (LDA) in which term distributions of topics are conditional on star ratings, allowing topics to have an explicit sentiment associated with them.
  • Felix W. from Princeton University with his entry “On the Efficiency of Social Recommender Networks.” Felix constructed metrics for measuring the efficiency of a network in disseminating information/recommendations and applied them to the Yelp social graph, discovering that it is quite efficient.

These entries were selected from many submissions for their technical and academic merit. For a full list of all previous winners of the Yelp Dataset Challenge, head over to the challenge site.

Dataset Example Code

We maintain a repository of example code to help you get started playing with the dataset. These examples show different ways to interact with the data and how to use our open source Python MapReduce tool mrjob with the data.

The repository includes scripts for interacting with the data in different ways and for running mrjob jobs against it.
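For context, the dataset ships as one JSON object per line. A minimal sketch of loading reviews and tallying star ratings (the `stars` field name is taken from the review schema; the inline sample stands in for the real file) might look like:

```python
import json
from collections import Counter

def star_histogram(lines):
    """Tally star ratings from an iterable of JSON review lines."""
    counts = Counter()
    for line in lines:
        review = json.loads(line)
        counts[review["stars"]] += 1
    return counts

if __name__ == "__main__":
    # Inline sample standing in for the review file in the dataset download.
    sample = [
        '{"business_id": "abc", "stars": 5, "text": "Great tacos!"}',
        '{"business_id": "abc", "stars": 4, "text": "Pretty good."}',
        '{"business_id": "xyz", "stars": 5, "text": "Loved it."}',
    ]
    print(star_histogram(sample))  # Counter({5: 2, 4: 1})
```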

Other Tools

There are many ways to explore the vast data within the Yelp Dataset Challenge dataset. Below are a few of the many cool tools that can be used with our data:

CartoDB is a cloud-based mapping, analysis, and visualization engine that can transform reviews into insightful visualizations. They recently wrote a blog post demonstrating how to use their tools to gain interesting insights from the Las Vegas portion of the dataset.

Statwing is a tool used to clean data, explore relationships, and create charts quickly. They loaded the dataset into their system for people to play with and explore interesting insights.

Yelp Dataset Challenge Round 4

This dataset contains over one million reviews from five cities around the world, along with all of the associated businesses, tips, check-ins, and users, as well as the social graph. We are excited to see what you come up with!


November 5th, 2014

Reflections from Grace Hopper (Part 2)

Welcome back! Today we have Wei, Rachel, Jen, Virginia and Anusha sharing their experiences. Wei is an engineer on the consumer team and she brings amazing user experiences to our customers. Rachel and Jen are both engineers on our international team, bringing the power of Yelp to all of our international communities. Virginia works as an engineer on our partnerships team and Anusha is an engineer on our infrastructure team.

Overall, we all had a blast getting to know each other, meeting other amazing women in the industry, hearing some great stories from inspiring women role models, and sourcing some future talent for Yelp. If you missed us at the career fair and are interested in working at Yelp, check out our careers page. We are always looking for talented women engineers to grow our community.

Wei W.

Software Engineer, Mobile Site

I was very inspired by Yoky Matsuoka’s talk. She reminded us to constantly evaluate our level of passion and engagement with what we’re doing, and to change course if we find those levels waning. As a professional tennis player turned robotics professor turned VP of technology at Nest, Yoky is proof that our interests may lead us down many different paths throughout our lives, and that, it turns out, is actually okay.

Rachel and Wei eating ice-cream

We loved the afternoon ice-cream snacks

Rachel Z.

Software Engineer, International

My time at Grace Hopper was split between interviewing and attending talks. I was a bit overwhelmed by the volume of students coming by our career booth every day, but I was also really glad that we got to interview some strong women engineers while at the conference. It was great to see the interest in technical as well as future leadership possibilities.

Jo Miller’s talk on Winning at the Game of Office Politics was fascinating. I always viewed office politics as evil. Her talk provided a way to look at it from a different angle and assured us that it’s possible to navigate office politics without becoming a political animal.

Some other personal highlights: speaking with industry folks during workshops, eating ice cream, dancing at the parties, and reconnecting with friends. A couple of things I hope the conference improves next year: fewer scheduling conflicts between interesting talks, and more room in the most popular ones.

Jen W. taking a break from interviews


Jen W.

Software Engineer, International

I have been working in the industry for over 10 years, always in male-dominated workplaces. What struck me the most was being surrounded by so many women. It was inspiring to meet people from such a wide diversity of experiences and backgrounds, all of them bright, eager, and happy to chat (Hi Yenny! Hey Kanak!).

Those who are in school or looking at career growth will truly benefit from attending future Grace Hopper conferences. For the rest of us, it’s still a great opportunity to meet new people, attend the amazing (in more ways than one!) plenary panels and learn what’s going on in tech in the rest of the world.

Our booth looked awesome and was very popular for its swag :) Clearly all of us were trying to tweet about it while we took this picture!


Virginia T.

Software Engineer, Partnerships

For me, the most useful and interesting workshop was “The Dynamics of Hyper-Effective Teams.” I loved that there was role-playing involved. For instance, do you know someone on your team who’s an “Airtime Dominator,” someone who speaks at least 20% of the time? Or a “Silent Expert,” the one who has all the knowledge but rarely speaks up? What about “The Naysayer” who constantly refutes others’ ideas?

The workshop concluded with volunteers sharing their stories and advice on working with these personalities. Having personally been through some similar experiences, it was an eye-opener to learn how these (sometimes clashing) personalities can effectively work together. Can’t wait to experiment with it!

Anusha R.

Software Engineer, Infrastructure

I loved attending Grace Hopper for the first time this year! There were a bunch of inspirational speakers at the conference. Just like my co-worker and friend Wei, I also enjoyed Yoky’s talk about how her passions led to her finding her career path. I was also inspired by Barbara Birungi, one of the award winners. Her non-profit is helping women in Uganda get involved in technology through mentorship and coaching.

There was a good mix of technical and career building talks and workshops. I attended several career building sessions and enjoyed the workshops on office politics, difficult conversations and building hyper-effective teams.

But most of all, I enjoyed talking to the students who visited our booth at the career fair. Many of them had interesting projects to talk about. I was excited to see this level of talent and wish them the best in their careers.

November 4th, 2014

Reflections from Grace Hopper (Part 1)

As you probably heard, Yelp attended Grace Hopper this year. Nine of our software engineers from different teams went and, for many of us, it was our first time.

It was a unique experience to see so many talented women in one place. In addition to the talks and panel discussions, we also had the opportunity and pleasure to represent Yelp at the career fair. It was amazing to see a consistent flow of students and industry talent, all happy Yelp users, stopping by our booth to speak with us and tell us their stories of using Yelp.

We had such a great time that we wanted to share some highlights with you. This is the first of two posts where each of us has added an excerpt from our experience in our own words.

Our group at Grace Hopper

Our first day at the career fair. From left: Jen F., Wei W., Susanne L., Tasneem M., Carmen J., Emily F., Virginia T.

Tasneem M.

Engineering Manager, Ads

I joined Yelp about a month ago and was super excited to be part of this journey with fun and interesting women from the company. I have worked in the software industry since the early 2000s and have grown as an engineer, a manager and a leader partly because I have had inspiring role models throughout my career. My mentors have challenged me to take on opportunities that I felt I wasn’t quite ready for.

My experience at Grace Hopper was a reminder of how far I have come and yet how far I still can go. It has re-inspired me to continue mentoring women and help them take charge of their careers in a male dominated industry. I am also motivated to help tackle the diversity issues within my local community.

The highlights for me were being at the career fair, attending the keynotes and technology lightning talks. Having spent a lot of time at the career fair, I was impressed by the talent and deep interest in data mining and machine learning. I look forward to seeing some of you join our growing community of awesome engineers at Yelp. I also loved Pat Kirkland’s talk on “Executive Presence.” I appreciated her role-playing and practical guidance on the different personas (prey, predator and partner) and have been able to successfully experiment with some of her tips at work. The lightning talks were a great way to hear several unique stories and perspectives within an hour. I hope that we can be on stage next year sharing some of our experiences.

Susanne L.

Database Engineer, Operations

I am a database engineer at Yelp, working on a team where we all make sure that our persistent stores are reliable, scalable, fast and give our Yelp consumers a pleasant experience with our product. I thought it was awesome to see so many great women in computing coming to Grace Hopper. Some of them were looking for an internship or a full time position and they stopped by our booth genuinely interested in Yelp. It was exciting to hear how our product makes consumers happy.

A lot of these young women had remarkable experiences but didn’t really know what to highlight and how to sell themselves. My suggestion to some of them was to:

  • Consider having an “elevator pitch” prepared. Sell yourself in 60 seconds: try a pitch that includes your name, year, major, minor (if you have one), and why you want this job.
  • Prepare to talk about a distinguishing skill or project. For example, “I did a hackathon, where I developed an Android app.” Be prepared to answer any questions about the project and what you learned from it.
  • Know what you are looking for. “What kind of internship at Yelp would you be interested in?” Backend/Frontend/Mobile/Web? Which programming language do you like? Why? Avoid “I don’t know, anything is fine” unless you are a freshman!

Jen F.

Systems Administrator, Corporate Infrastructure

I have been at Yelp since August 2012. For me, the talk I will remember most is “Beyond the Buzzwords: Test-Driven Development,” where I got a quick overview of how one can practice test-driven development (TDD). While TDD isn’t the most exciting technical topic, the speaker, Sabrina B. Williams from Google, did an awesome job making it relatable. She shared a live demo of a project from concept through tests to a fully implemented program.

The keynotes were full of rock stars of the technology world who were also surprisingly engaging (and occasionally controversial!). Who hasn’t gone to a technology conference keynote expecting a glorified, easily forgettable sales pitch? There was none of that at Grace Hopper.

I also appreciated how diverse the talks were – from recommendations on how to get your first job to dealing with office politics to in-depth technical talks. However, one thing I’d love to see more at Grace Hopper next year is talks about advancing your career in non-managerial roles. All in all, this is a great conference and I only wish I had been able to participate when I was just getting into the industry!

October 31st, 2014

Scaling Traffic from 0 to 139 Million Unique Visitors

At LAUNCH Scale last week, I gave a talk to over 75 co-founders (CEOs and CTOs) on how we’ve scaled traffic here at Yelp. It brought back memories of Darwin biting through our ethernet cable and reminded me of the run-up to our IPO, making sure we had enough capacity to handle the expected surge in traffic from the world’s press (and more recently, the launch of Yelp in Hong Kong!). For close to 8 years, I’ve had the privilege to work alongside some of the best engineers in the world and have seen the meticulous work and thought it takes to scale a site to serve over a hundred million unique visitors.


I joined Yelp in early 2007 as a software engineer coming from Google, where I had spent the previous 4 years. On my very first day I was handed the search engine and asked to “improve it.” At the time we were handling approximately 200,000 searches per day, and 86,400 of those (one per second) were from the load balancer doing health checks! We had one primary database running MySQL with a couple of replication slaves running a mix of InnoDB and MyISAM tables (side note: MyISAM isn’t great when your databases hard fail). We were using Apache without gzip enabled (pro tip: enable it!) and our data science toolkit was: cat, grep, wc, gnuplot, and awk. Because Yelp was born pre-AWS, we had to operate our own data centers.

As our traffic grew, our infrastructure had to scale with it. We started using a CDN to host our static content. The one MySQL database with slaves was scaled by vertically sharding it, moving tasks that were write-heavy and non-user-interactive to different databases (e.g. clicks/impressions/phone calls for advertisers). Another high-impact win our team had for improving database performance was moving our hosts to FusionIO PCIe cards as our primary storage. On the data center side, our operations team moved from having one data center to having many. Given our traffic make-up, we decided that our new data centers would be “read-only” and that we would have a separate primary data center where all writes happen. This made scaling our read-only traffic much more straightforward. We now had data centers closer to users, allowing us to use DNS to geographically load balance our traffic, making the experience faster for users. We’ve also been able to leverage Amazon EC2 using AWS Direct Connect, which allows our engineering teams to bring up hardware whenever they need it. It’s been awesome removing the hardware barrier for getting to production.

As our traffic scaled, our logging infrastructure needed to keep up as well. We started off using syslog-ng and rsync to handle logs stored on an NFS server and lots of disks. In October 2008 we moved to using scribe (now a custom branch), which has served us very well over the past 5+ years that we’ve been using it. We take the logs scribe aggregates and move them into Amazon S3 for storage, which makes using EMR on AWS seamless. This is why in 2009 we open sourced mrjob, which allows any engineer to write a MapReduce job without contending for resources. We’re only limited by the number of machines in an Amazon data center (which is an issue we’ve rarely encountered). Real-time analytics are much better than periodically run batch jobs, so recently we open sourced Pyleus, which allows anyone to write Storm topologies using Python. Another powerful tool we created, called MOE, allows anyone to optimize the parameters of any function (e.g. ad or search ranking functions). We’re extremely proud of our open source contributions and hope to have many more in the future.

For the past 10 years, the Yelp Engineering team has worked really hard to build a scalable architecture with the ability to develop and push code to the site multiple times a day. For those of you who enjoy working on these types of problems, make sure to check out our careers page. We’re always hiring!

If you’d like to know more on how we scaled the site, check out the slides from my presentation at LAUNCH Scale and feel free to reach out to me on Twitter at @stopman.

October 15th, 2014

Introducing Pyleus: An Open-source Framework for Building Storm Topologies in Pure Python

Yelp loves Python, and we use it at scale to power our websites and process the huge amount of data we produce.

Pyleus is a new open-source framework that aims to do for Storm what mrjob, another open-source Yelp project, does for Hadoop: let developers process large amounts of data in pure Python and iterate quickly, spending more time solving business-related problems and less time concerned with the underlying platform.

First, a brief introduction to Storm. From the project’s website, “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”

A Pyleus topology consists of, at minimum, a YAML file describing the structure of the topology, declaring each component and how tuples flow between them. The pyleus command-line tool builds a self-contained Storm JAR which can be submitted to any Storm cluster.

When it comes to massive data processing demos, “word count” is a classic. Since Storm operates on “unbounded streams of data,” we can’t calculate a total count for each unique word: the input stream could continue indefinitely. Instead, our topology will maintain a monotonically increasing counter for each unique word, and log the value each time we see it.
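Stripped of Storm, the counting behavior described above is just a per-word monotonically increasing counter; a minimal sketch:

```python
from collections import defaultdict

class WordCounter:
    """Maintain a monotonically increasing count per unique word."""

    def __init__(self):
        self.counts = defaultdict(int)

    def count(self, word):
        # Increment and return the running count for this word.
        self.counts[word] += 1
        return self.counts[word]

if __name__ == "__main__":
    counter = WordCounter()
    for w in "lorem ipsum lorem".split():
        print(w, counter.count(w))  # prints: lorem 1, ipsum 1, lorem 2
```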

So how can you build a word count Storm topology using Pyleus? All you need is a pyleus_topology.yaml and a few Python components.

For a simple demonstration, you need to know about the three core concepts in Storm:

  • A tuple is the unit of data within a Storm topology, flowing into and out of processing components.
  • Spouts are components that feed tuples into a topology. Usually, a spout consumes data from an external source, like Kafka or Kinesis, then emits records as tuples.
  • Bolts subscribe to the output streams of one or more other spouts and bolts, do some processing, then emit tuples of their own.

This topology has three components: a spout that emits a random line of “lorem ipsum” text, a bolt that splits lines into words, and a bolt that counts and logs occurrences of each word.

Here are the contents of pyleus_topology.yaml:
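(The original listing isn’t reproduced here; the sketch below is reconstructed from the component descriptions that follow, so the module paths are assumptions.)

```yaml
name: word_count

topology:
    - spout:
        name: line-spout
        module: word_count.line_spout

    - bolt:
        name: split-words
        module: word_count.split_words
        groupings:
            - shuffle_grouping: line-spout

    - bolt:
        name: count-words
        module: word_count.count_words
        groupings:
            - fields_grouping:
                component: split-words
                fields: ["word"]
```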

The spout configuration is self-explanatory, but the bolts must indicate the tuple streams to which they are to subscribe. The split-words bolt subscribes to line-spout with a shuffle_grouping—this means that tuples emitted from line-spout should be evenly and randomly distributed amongst all instances of split-words, be there one, five, or fifty.

count-words, however, uses a fields_grouping on the ‘word’ field. This forces all tuples emitted from split-words with the same value of ‘word’ to go to the same instance of count-words. This allows the code in word_count.count_words to assume that it will “see” all occurrences of the same word within the same process.
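You can simulate that guarantee outside Storm: a fields grouping behaves like a deterministic hash of the grouped field modulo the number of bolt instances. The sketch below mimics the idea (it is not Storm's actual hash function):

```python
import zlib

def fields_grouping(word, num_instances):
    """Pick the bolt instance for a tuple based on a hash of its 'word' field.

    Mimics the routing property of Storm's fields_grouping: the same word
    always lands on the same instance, so per-word state stays local.
    """
    return zlib.crc32(word.encode("utf-8")) % num_instances

if __name__ == "__main__":
    instances = 5
    # Same word, same count-words instance, every time.
    assert fields_grouping("lorem", instances) == fields_grouping("lorem", instances)
```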



word_count/ is left as an exercise for the reader. (Or, you could just check out the full example on GitHub)

Now, running pyleus build in this directory will produce a file word_count.jar which you can submit to a Storm cluster with pyleus submit, or run locally for testing with pyleus local.

The code for a Pyleus topology can be very simple, but one feature in particular we are excited about is built-in virtualenv integration. Simply include a requirements.txt file alongside your pyleus_topology.yaml, and pyleus build will build a virtualenv that your code can use and embed it in the JAR. You can even re-use packaged Pyleus components, and refer to them directly in pyleus_topology.yaml!

Another team at Yelp has already developed a Pyleus spout for consuming data from an internal source and built a Python package for it. Now, others within the company can add one line to their requirements.txt and use the spout in their pyleus_topology.yaml without writing a single line of code.

Pyleus is beta software, but feedback and pull requests are happily accepted.

Get started using Pyleus by installing it with pip install pyleus, then check out the source on GitHub for more examples.