Engineering Blog

Zero downtime Upgrade: Yelp’s Cassandra 4.x Upgrade Story

The Database Reliability Engineering team at Yelp seamlessly upgraded more than a thousand Cassandra nodes with zero downtime. This post takes you behind the scenes of our upgrade strategy, from planning sessions to flawless rollouts.
Background
Motivation
Apache Cassandra is a distributed wide-column NoSQL datastore and is used widely at Yelp for storing both primary and derived data. Yelp orchestrates Cassandra clusters on Kubernetes with the help of operators, as explained in our operator overview post. Upgrading from Cassandra 3.11 to 4.1 offered several observability and reliability improvements, in addition to performance gains. Based on public benchmarks, we expected to...

Building Biz Ask Anything: From Prototype to Product

Introduction
Users have access to a wealth of information on Yelp business pages. From reviews and photos to structured information, menus, and the Ask the Community feature, a single business page can be an ocean of content. At the same time, user expectations have evolved: people now expect immediate, direct answers. Sifting through dozens of reviews to find a simple fact can be time-consuming. Fortunately, advances in Large Language Models (LLMs) have given us a new set of tools, allowing us to tackle information retrieval and summarization tasks that were prohibitively complex just a few years...

How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation

Introduction
Modern advertising platforms are fast-paced and interconnected: even small adjustments can have ripple effects on how ads are shown, how budgets are spent, and the value advertisers get from their ad spend. At Yelp, Ad Budget Allocation means splitting each campaign’s spend between on‑platform inventory (our website, mobile site, and app) and off‑platform inventory (the Yelp Ad Network). We optimize this split to meet advertisers’ performance goals while growing overall revenue. Due to the complexity of the budget allocation system and its feedback loop, even small changes can lead to unexpected system‑wide effects. To help us safely evaluate changes,...

S3 server access logs at scale

Introduction
Yelp relies heavily on Amazon S3 (Simple Storage Service) to store a wide variety of data, from images and logs to database backups and more. Since this data is stored in the cloud, we need to carefully manage how it is accessed, secured, and eventually deleted, both to control costs and to uphold high standards of security and compliance. One of the core challenges in managing S3 buckets is gaining visibility into who is accessing your data (known as S3 objects), how frequently, and for what purpose. Without robust logging, it’s difficult to troubleshoot access issues, respond to security incidents, and ensure we...

Exploring CHAOS: Building a Backend for Server-Driven UI

A little while ago, we published a blog post on CHAOS: Yelp’s Unified Framework for Server-Driven UI. We strongly recommend reading that post first to gain a solid understanding of SDUI and the goals of CHAOS. This post builds on those concepts to delve into the inner workings of the CHAOS backend and how it generates server-driven content. To briefly recap, CHAOS is a server-driven UI framework used at Yelp. When a client wants to display CHAOS-powered content, it sends a GraphQL query to the CHAOS API. The API processes the query, requests the CHAOS backend to construct the configuration,...

Revenue Automation Series: Testing an Integration with Third-Party System

Background
As described in the second blog post of the Revenue Automation series, the Revenue Data Pipeline processes a large amount of data via complex logic transformations to recognize revenue. Thus, developing a robust production testing and integration strategy was essential to the success of this project phase. The existing testing process used the Redshift Connector for data synchronization once the report was generated and published to the data warehouse (Redshift). This introduced a latency of approximately 10 hours before the data was available in the data warehouse for verification. This delay impacted our ability to verify whether the changes were...

Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More

It has been over 3 years since we published our Nrtsearch blog post and over 4 years since we started using Nrtsearch, our Lucene-based search engine, in production. We have since migrated over 90% of Elasticsearch traffic to Nrtsearch. We are excited to announce the release of Nrtsearch 1.0.0 with several new features and improvements from the initial release.
Glossary
EBS (Elastic Block Store): Network-attached block storage volumes in AWS.
HNSW (Hierarchical Navigable Small World): A graph-based approximate nearest neighbor search technique.
Lucene: An open-source search library used by Nrtsearch.
S3: Cloud object storage offered in AWS.
Scatter-gather: A pattern...

Journey to Zero Trust Access

Glossary
ZTA: zero trust architecture
SAML: security assertion markup language (an SSO facilitation protocol)
Devbox: a remote server used to develop software
Zero Trust Access
Remote Future
Yelp is now a fully remote company, which means our employee base has become increasingly distributed across the world, making secure access to resources from anywhere a critical business function. Yelp historically used Ivanti Pulse Secure as the employee VPN, but due to the need for a more reliable solution, it became clear that a change was necessary to ensure secure and consistent access to internal resources. The Corporate Systems and Client Platform...
