Engineering Blog

Whose Code is it Anyway?

Improving Code Ownership at Yelp
In a prior blog post, Kent talked about how the Engineering Effectiveness (EE) organization was created at Yelp to reduce communication complexity between core teams and product teams. Core teams need to communicate infrastructure changes, manage the deprecation of libraries and tools, and evangelize new tooling to other teams at a regular cadence. EE has been investing in building tools that can communicate these changes and provide insights into what might make product teams more effective in shipping code quickly and safely. To measure the engineering effectiveness of Yelp, we need to measure...

Now You See Me: How NICE and PDQ plots Uncover Model Behaviors Hidden by Partial Dependence Plots

Many machine learning (ML) practitioners use partial dependence plots (PDPs) to gain insights into model behaviors. But have you run into situations where PDPs average two groups with different behaviors and produce curves applicable to neither? Are you longing for tools that help you understand detailed model behavior in a visually manageable way? Look no further! We are thrilled to share our newest model interpretation tools: the Nearby Individual Conditional Expectation (NICE) plot and its companion, the Partial Dependence at Quantiles (PDQ) plot. They highlight local behaviors and hint at how much we may trust such readings. A not NICE...

Orchestrating Cassandra on Kubernetes with Operators

This post is about how Yelp is transitioning from the management of Cassandra clusters in EC2 to orchestrating the same clusters in production on Kubernetes. We will start by discussing the EC2-based deployment we have used for the past few years, followed by an introduction to the Cassandra operator, its responsibilities, the core reconciliation workflow of the operator, and finally, the etcd locking we employ for cross-region coordination. Cassandra is a distributed wide-column NoSQL datastore and is used at Yelp for both primary and derived data. Yelp’s infrastructure for Cassandra has been deployed on AWS EC2 and ASG (Autoscaling Group)...

Tales of a Mobile Developer on Consumer Growth

Engineers on Yelp’s Consumer Growth team work closely with product managers, data scientists, and designers to increase user acquisition, engagement, and retention to fuel the rest of the business. The team is central to growing the number of active users on the Yelp platform. In-App Update, introduced in Android Pie, is a feature that prompts Android users to update the app while they’re using it and keeps them in the app while the update happens in the background. We thought it would be a valuable feature for consumers on Yelp and began researching and...

Minimizing read-write MySQL downtime

The relational database of choice at Yelp is MySQL and it powers much of the Yelp app and yelp.com. MySQL does not include a native high-availability solution for the replacement of a primary server, which is a single point of failure. This is a tradeoff of its dedication to ensuring consistency. Replacing a primary server is sometimes necessary due to planned or unplanned events, like an operating system upgrade, a database crash or hardware failure. This requires pausing data modifications to the database while the server is restarted or replaced and can mean minutes of downtime. Pausing data modifications means...

Introducing Folium: Enabling Reproducible Notebooks at Yelp

Jupyter notebooks are a key tool powering data work at Yelp. They allow us to do ad hoc development interactively and analyze data with visualization support. As a result, we rely on Jupyter to build models, create features, run Spark jobs for big data analysis, etc. Since notebooks play a crucial role in our business processes, it is important for us to ensure that notebook output is reproducible. In this blog post, we’ll introduce our notebook archive and sharing service called Folium and its key integrations with our JupyterHub that enable notebook reproducibility and improve ML engineering developer velocity. Folium...

Flink on PaaSTA: Yelp’s new stream processing platform runs on Kubernetes

At Yelp, we process terabytes of streaming data a day using Apache Flink to power a wide range of applications: ETL pipelines, push notifications, bot filtering, sessionization, and more. We run hundreds of Flink jobs, so without the right degree of automation, routine operations like deployments, restarts, and savepoints would take thousands of hours of developers’ time. The latest addition to our toolshed is a new stream processing platform built on top of PaaSTA, Yelp’s Platform as a Service. At its core, a Kubernetes operator automatically watches over the deployment and the...

The Dream Query: How we scope projects with GraphQL

At Yelp, new web pages and app screens are powered by GraphQL for fetching data. This blog post describes the Dream Query, a pattern our feature teams use when refactoring or creating new pages. (Check out our previous blog post to see how we dynamically codegen DataLoaders to implement the server layer!)
Scoping a new feature with GraphQL
Let’s jump in with an example! Imagine your team is tasked with creating the new version of the “Header component” for the website (we’ll use the Yelp.com website in our example). You may receive a design mock that looks like this:...
