10/29/2010

mrjob: Distributed Computing for Everybody

Ever wonder how we power the People Who Viewed this Also Viewed... feature? How does Yelp know that people viewing Coit Tower might also be interested in the Filbert Steps and the Parrots of Telegraph Hill?

It's pretty much what you'd expect: we look at a few months of access logs, find sessions where people viewed more than one business, and collect statistics about pairs of businesses that were viewed in the same session.
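To make that pairing step concrete, here's a minimal sketch of the core idea in plain Python. This is just an illustration, not our production job: the session format and business IDs are made up, and the real thing runs as a MapReduce job over months of logs. For each session, emit every unordered pair of businesses viewed together, then tally the pairs.

```python
from collections import Counter
from itertools import combinations

def count_coviews(sessions):
    """Count how often each pair of businesses was viewed in the same session.

    `sessions` is an iterable of lists of business IDs, one list per session.
    Returns a Counter mapping sorted (biz_a, biz_b) pairs to co-view counts.
    """
    pair_counts = Counter()
    for viewed in sessions:
        # Sessions with only one business contribute no pairs.
        for pair in combinations(sorted(set(viewed)), 2):
            pair_counts[pair] += 1
    return pair_counts

sessions = [
    ['coit-tower', 'filbert-steps'],
    ['coit-tower', 'filbert-steps', 'telegraph-hill-parrots'],
    ['coit-tower'],  # single view: no pairs
]
print(count_coviews(sessions)[('coit-tower', 'filbert-steps')])  # 2
```

The pairs whose counts come out highest are the ones you'd surface as "People Who Viewed this Also Viewed...".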

Now here's the kicker: we generate on the order of 100GB of log data every day. How do we deal with terabytes of data in a robust and efficient manner? Distributed computing!

Today we want to share that power with you.

As you may have guessed, we use MapReduce. MapReduce is about the simplest way you can break down a big job into little pieces. Basically, mappers read lines of input and spit out tuples of (key, value); the framework then groups the tuples by key, and each key, along with all of its corresponding values, is sent to a reducer.

Here's a simple MapReduce job that does a word frequency count, written in our mrjob Python framework:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount().run()

In goes text, out come tuples of (word, count). You can give it your README file, or all of Project Gutenberg, and it'll scale gracefully.
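If you're curious what happens between those two functions, here's a rough, dependency-free simulation of the map, group-by-key, and reduce phases. This is only an illustration of the data flow, not how mrjob or Hadoop actually work (in a real cluster the map and reduce phases run in parallel across machines):

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")

def simulate_word_count(lines):
    # Map phase: each line becomes (word, 1) tuples.
    mapped = [(word.lower(), 1) for line in lines
              for word in WORD_RE.findall(line)]
    # Shuffle phase: group all values by key, as the framework would.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)
    # Reduce phase: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

print(simulate_word_count(["Hello world", "hello again"]))
# {'hello': 2, 'world': 1, 'again': 1}
```

To run the real job, you save it to a file and invoke it from the command line. By default mrjob runs everything locally in a single process, which is great for testing; pass -r emr and it spins up an EMR job flow for you instead.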

(More complex examples: PageRank, Text Classifier)

We used to do what a lot of companies do, which is run a Hadoop cluster. We had a dozen or so machines that we otherwise would have gotten rid of, and whenever we pushed our code to our webservers, we'd push it to the Hadoop machines.

This was kind of cool, in that our jobs could reference any other code in our code base.

It was also not so cool. You couldn't really tell whether a job was going to work at all until you pushed it to production. But the worst part was that most of the time our cluster sat idle, and then every once in a while a really beefy job would come along, tie up all of our nodes, and leave every other job waiting.

About a year ago we noticed that Amazon had a service called Elastic MapReduce (EMR for short) where you could essentially spin up virtual Hadoop clusters with as few or as many machines as you needed, and pay for it by the hour.

I was tasked with migrating our code over to EMR. Not just getting Python scripts to run on EMR, but getting our code base to run more or less as-is, with all of its 40+ library dependencies, C and Cython code that compiles into Python modules, and files that are different from development to production.

It took me a while, but on May 18, 2010, we retired our Hadoop cluster for good, and switched all of our production jobs over to EMR. Our framework, mrjob, is finally mature enough that we're releasing it as an open source library.

  • If you're new to MapReduce and want to learn about it, mrjob is for you.
  • If you want to run a huge machine learning algorithm, or do some serious log processing, but don't want to set up a Hadoop cluster, mrjob is for you.
  • If you have a Hadoop cluster already and want to run Python scripts on it, mrjob is for you.
  • If you want to migrate your Python code base off your Hadoop cluster to EMR, mrjob is for you.
  • (If you don't want to write Python, mrjob is not so much for you. But we can fix that.)

So try it out and let us know what you think! And send us cool examples so we can feature them on our blog. :)

If you have questions or encounter issues, feel free to contact me through Yelp or through GitHub. I really want using mrjob to be a smooth and productive experience for everyone.

By the way, here are some other features on our site that are powered by mrjob: