At LAUNCH Scale last week, I gave a talk to over 75 co-founders (CEOs and CTOs) on how we’ve scaled traffic here at Yelp. It brought back memories of Darwin biting through our ethernet cable, and it reminded me of the run-up to our IPO, when we made sure we had enough capacity to handle the expected surge in traffic from the world’s press (and, more recently, the launch of Yelp in Hong Kong!). For close to 8 years, I’ve had the privilege of working alongside some of the best engineers in the world, and I’ve seen the meticulous work and thought it takes to scale a site to serve over a hundred million unique visitors.


I joined Yelp in early 2007 as a software engineer coming from Google, where I had spent the previous 4 years. On my very first day I was handed the search engine and asked to “improve it.” At the time we were handling approximately 200,000 searches per day, and 86,400 of those were the load balancer doing health checks (one per second: there are 86,400 seconds in a day)! We had one primary MySQL database with a couple of replication slaves running a mix of InnoDB and MyISAM tables (side note: MyISAM isn’t great when your databases hard fail, since it has no crash recovery). We were using Apache without gzip enabled (pro tip: enable it!), and our data science toolkit was cat, grep, wc, gnuplot, and awk. Because Yelp was born pre-AWS, we had to operate our own data centers.
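To make the gzip point concrete, here’s a quick back-of-the-envelope sketch in Python (the sample payload is made up, and real numbers will vary page by page) showing how well repetitive HTML compresses:

```python
import gzip

# A stand-in for a typical HTML page: markup is highly repetitive,
# which is exactly what DEFLATE-based compression handles well.
html = (b"<html><body>"
        + b"<div class='review'>Five stars, great food!</div>" * 500
        + b"</body></html>")

compressed = gzip.compress(html, compresslevel=6)  # a moderate level

print(f"raw: {len(html)} bytes, gzipped: {len(compressed)} bytes "
      f"({100 * len(compressed) / len(html):.1f}% of the original)")
```

On real pages the savings are often well over half the HTML/CSS/JS bytes for a small CPU cost; in Apache itself this is typically handled by mod_deflate.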

As our traffic grew, our infrastructure had to scale with it. We started using a CDN to host our static content. We scaled the single MySQL database with slaves by vertically sharding it, moving write-heavy, non-user-interactive workloads (e.g. clicks/impressions/phone calls for advertisers) onto separate databases. Another high-impact win for database performance was moving our hosts to Fusion-io PCIe flash cards as our primary storage. On the data center side, our operations team moved from having one data center to having many. Given our traffic make-up, we decided that our new data centers would be “read-only” and that we would keep a separate primary data center where all writes happen. This made scaling our read-only traffic much more straightforward. With data centers closer to users, we could use DNS to geographically load balance our traffic, making the experience faster for everyone. We’ve also been able to leverage Amazon EC2 via AWS Direct Connect, which lets our engineering teams bring up capacity whenever they need it. It’s been awesome removing the hardware barrier to getting to production.
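The post doesn’t include Yelp’s actual routing code, but a minimal sketch of the read/write split described above might look like the following; the hostnames, regions, and pick_host helper are all hypothetical:

```python
import random

# Hypothetical topology: one primary data center takes all writes;
# read-only replicas in each region serve everything else.
PRIMARY = "db-primary.example.com"
READ_REPLICAS = {
    "us-west": ["db-ro1.sfo.example.com", "db-ro2.sfo.example.com"],
    "us-east": ["db-ro1.iad.example.com"],
}

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

def pick_host(sql: str, user_region: str) -> str:
    """Route writes to the primary; spread reads across nearby replicas."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    if verb in WRITE_VERBS:
        return PRIMARY
    replicas = READ_REPLICAS.get(user_region, READ_REPLICAS["us-west"])
    return random.choice(replicas)

print(pick_host("SELECT * FROM reviews WHERE biz_id = 42", "us-east"))
print(pick_host("INSERT INTO clicks (ad_id) VALUES (7)", "us-east"))
```

DNS-based geographic load balancing does the same job one layer up: it steers each user’s requests to the nearest read-only data center before any application code runs.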

As our traffic scaled, our logging infrastructure needed to keep up as well. We started off using syslog-ng and rsync to handle logs stored on an NFS server backed by lots of disks. In October 2008 we moved to Scribe (now a custom branch), which has served us very well over the 5+ years we’ve been using it. We take the logs Scribe aggregates and move them into Amazon S3 for storage, which makes using Elastic MapReduce (EMR) on AWS seamless. That’s why in 2009 we open-sourced mrjob, which allows any engineer to write a MapReduce job without contending for cluster resources; we’re only limited by the number of machines in an Amazon data center (an issue we’ve rarely encountered). For many problems, real-time analytics beat periodically run batch jobs, so we recently open-sourced Pyleus, which lets anyone write Storm topologies in Python. Another powerful tool we created, MOE, lets anyone optimize the parameters of any function (e.g. an ad or search ranking function). We’re extremely proud of our open source contributions and hope to have many more in the future.
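To give a flavor of what mrjob makes easy, here’s the canonical word-count job (essentially the first example in the mrjob docs); the same script runs locally for testing, on your own Hadoop cluster, or on EMR:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count word occurrences across input files, e.g. logs pulled from S3."""

    def mapper(self, _, line):
        # Emit (word, 1) for every word on every input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the per-word counts emitted by all the mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Running it on EMR is a one-flag change (the bucket path here is a placeholder): python word_count.py -r emr s3://your-log-bucket/2014/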

For the past 10 years, the Yelp Engineering team has worked really hard to build a scalable architecture that allows us to develop and push code to the site multiple times a day. For those of you who enjoy working on these types of problems, make sure to check out our careers page. We’re always hiring!

If you’d like to know more about how we scaled the site, check out the slides from my presentation at LAUNCH Scale, and feel free to reach out to me on Twitter at @stopman.
