Big Data with Elastic MapReduce
The majority of our AWS usage is Simple Storage Service (S3) and Elastic MapReduce (EMR). We use these technologies because we want every engineer to be extremely effective and able to command a cluster of machines that would normally take another entire team just to manage. We want our engineers asking and answering questions like:
- “What category did the customer want when she searched for ‘Pool’?”
- “When will our mobile traffic eclipse our web traffic?”
- “Was this review written by someone with firsthand experience?”
At Yelp, whether you’re a senior engineer or an intern, if you want to test the next great data-driven product, you don’t have to cajole another team for resources or sit around waiting for your job to be scheduled. Deploying works the same way: no need to worry about interrupting someone else’s batch job, or filling out TPS reports estimating the time you need on the production cluster. Just ship the code; the boxes will be available when you need them.
Yelp AWS Optimization
Make no mistake: optimizing for developer time can mean trading off potential cost savings. While we believe the trade-off is worth it, that doesn’t mean we ignore our costs! Two of the specific ways we save money on EMR are:
- Re-use of job flows
- Buying reserved instances
We try to find ways of saving money that are invisible to other developers, and we base improvements on how developers want to use the resources available. Contrast this with the philosophy of making every developer justify and micro-optimize their costs. We build tools to multiply the effectiveness of fellow engineers, instead of having policies that divide their attention between business issues and implementation details. By sharing tools such as mrjob and EMRio, we’re not only letting Yelp developers better focus on business problems; we hope we’re letting other companies do the same.
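To make the reserved-instance trade-off mentioned above a little more concrete, here is a minimal sketch of the break-even arithmetic. It assumes a simple pricing model (a one-time upfront fee plus a discounted hourly rate), and every price in it is a made-up number for illustration, not an actual AWS rate or anything Yelp pays. The function names are hypothetical too; this is not how EMRio works internally.

```python
def break_even_hours(upfront_fee, on_demand_rate, reserved_rate):
    """Hours of use at which a reserved instance starts to pay for itself.

    Assumed pricing model: reserved = upfront_fee + hours * reserved_rate,
    on-demand = hours * on_demand_rate.
    """
    if reserved_rate >= on_demand_rate:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - reserved_rate)


def cheaper_option(hours_used, upfront_fee, on_demand_rate, reserved_rate):
    """Return which option costs less for the given number of hours."""
    on_demand_cost = hours_used * on_demand_rate
    reserved_cost = upfront_fee + hours_used * reserved_rate
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


if __name__ == "__main__":
    # Hypothetical prices, chosen only to keep the arithmetic easy to follow.
    UPFRONT_FEE = 1000.0      # one-time fee per reserved instance
    ON_DEMAND_RATE = 0.50     # dollars per instance-hour
    RESERVED_RATE = 0.25      # dollars per instance-hour after reserving

    hours = break_even_hours(UPFRONT_FEE, ON_DEMAND_RATE, RESERVED_RATE)
    print("Break-even after", hours, "instance-hours")
    print("Full year of use:", cheaper_option(8760, UPFRONT_FEE, ON_DEMAND_RATE, RESERVED_RATE))
    print("Occasional use:", cheaper_option(1000, UPFRONT_FEE, ON_DEMAND_RATE, RESERVED_RATE))
```

With these illustrative numbers, a reservation only pays off for an instance that is busy for a large share of the year, which is why it helps to look at how developers actually use the cluster before buying.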
I hope you enjoy the videos, and Happy Holidays!