Engineering Blog

Autoscaling Mesos Clusters with Clusterman

Here at Yelp, we host a lot of servers in the cloud. In order to make our website more reliable—yet cost-efficient during periods of low utilization—we need to be able to autoscale clusters based on usage metrics. There are quite a few existing technologies for this purpose, but none of them really meet our needs of autoscaling extremely diverse workloads (microservices, machine learning jobs, etc.) at Yelp’s scale. In this post, we’ll describe our new in-house autoscaler called Clusterman (the “Cluster Manager”) and its magical ability to unify autoscaling resource requests for diverse workloads. We’ll also describe the Clusterman simulator,...

Yelp Dataset Challenge: Round 11 Winners

The eleventh round of the Yelp Dataset Challenge ran throughout the first half of 2018 and we received many impressive, original, and fascinating submissions. As usual, we were struck by the quality of the entries: keep up the good work, folks! Today, we are proud to announce the grand prize winner of the $5,000 award: “Generalized Latent Variable Recovery for Generative Adversarial Networks” by Nicholas Egan, Jeffrey Zhang, and Kevin Shen (from the Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science). The authors used a Deep Convolutional Generative Adversarial Network (DCGAN) to create photo-realistic pictures of food...

Migrating Kafka's Zookeeper With No Downtime

Here at Yelp we use Kafka extensively. In fact, we send billions of messages a day through our various clusters. Behind the scenes, Kafka uses Zookeeper for various distributed coordination tasks, such as deciding which Kafka broker is in charge of assigning partition leaders and storing metadata about the topics in its brokers. Kafka’s success within Yelp has also meant that our clusters have grown substantially from when they were first deployed. At the same time, our other heavy Zookeeper users (e.g., SmartStack and PaaSTA) have increased in scale, putting more load on our shared Zookeeper clusters. To alleviate this situation, we...

Joinery: A Tale of Un-Windowed Joins

Summary: At Yelp, we generate a wide array of high-throughput data streams spanning logs, business data, and application data. These streams need to be joined, filtered, aggregated, and sometimes even quickly transformed. To facilitate this process, the engineering team has invested a significant amount of time analyzing multiple stream processing frameworks, ultimately identifying Apache Flink as the best-suited option for these scenarios. We’ve now implemented a join algorithm using Flink, which we’re calling “Joinery.” It is capable of performing un-windowed one-to-one, one-to-many, and many-to-many inner joins across two or more keyed data streams. So, how does it work? In the...

TTL as a Service: Automatic Revocation of Stale Privileges

Security and usability are often at odds with one another, a fact that is best illustrated by access control. Deny everyone, and you’ll have a super secure system that no one can use; allow everyone, and you’ll maximize usability at the cost of security. The Principle of Least Privilege exists to balance security and usability by giving users only the minimum amount of access they need to do their job. This reduces the attack surface by preventing attackers from leveraging a compromised user’s important, albeit unused, privileges for vertical/horizontal escalation. The Problem: That said, there are a few key...

All About Yelp Hackathon

It’s time for our fall Hackathon! At Yelp, Hackathons are two-day events that provide unstructured time for our engineering and product teams to work on whatever may scratch their creative itch! Hackathon truly embodies our company values of “Playing Well with Others” and “Being Unboring,” as it invites us to participate in so many different ways. Engineers have the liberty to work on projects related to or completely outside the box of the Yelp product. We’ve seen many types of projects over the years from music videos and new photo classification algorithms to baking workshops, custom video games, and so...

A Guide to Software Engineering for the Visually Impaired

Introduction: My name is Abrar Sheikh, and I’m a backend engineer on Yelp’s Distributed Systems team. Our team enables real-time data transfers between microservices and different data stores by building streaming infrastructure on top of Kafka using technologies like Python, Scala, and Apache Flink. I suffer from a genetic disorder called Albinism, which is mainly characterized by two things: (1) lack of pigmentation in the body, resulting in a white skin and hair tone; and (2) severe loss of vision, which limits the ability to perform routine tasks such as driving, reading, or using computers. Growing up, I was fascinated by computers. Luckily...

The Yelp Production Engineering Documentation Style Guide

Documentation is something that many of us in software and site reliability engineering struggle with – even if we recognize its importance, it can still be a struggle to write it consistently and to write it well. While we in Yelp’s Production Engineering group are no different, over the last few quarters we’ve engaged in a concerted effort to do something about it. One of the first steps towards changing this process was developing our documentation style guide, something that started out as a Hackathon project late last year. I spoke about it when I was giving my talk on...
