The Yelp Reservations service (yelp_res) powers reservations on Yelp. It came to Yelp with the acquisition of Seatme in 2013 and is a Django service and webapp. It provides the reservation backend and logic for Yelp Guest Manager, our iPad app for restaurants, and handles the diner and partner flows that create reservations. It also serves a web UI and backend API for our Yelp Reservations app, which has been superseded by Yelp Guest Manager but is still used by many of our restaurant customers.

This service was built using a DB-centric architecture and uses a “DB sync” paradigm (clients maintain a local database with a copy of the data relevant to them) to sync data with legacy clients. It also relies on database triggers to enforce some business logic. The DB is PostgreSQL, which is not used anywhere else at Yelp, so only a small rotation of long-tenured employees knew Postgres well enough to do outage response. This caused issues with maintenance, visibility, and outage response times. The teams working on the Restaurants products are not infra teams, and the Yelp-wide infra teams (understandably) focus on Yelp-standard infrastructure. As a result, when we did see issues with Postgres, it was often a scramble to find people with relevant expertise.

So, we swapped out this DB in place for a Yelp-standard MySQL DB.

As restaurants rely on our product to run their business, this system can’t be taken offline for maintenance, and any data loss is unacceptable—we can’t have someone make a reservation and then have it disappear. This led to much of the complexity of this project, as switching gradually between two data stores on the fly introduced new challenges. Much of the existing documentation we could find on this used toy examples or assumed a clean stop, migration, and restart, so this was also somewhat unexplored territory (hence this blog post!).

Code

Django has MySQL support. As a proof of concept in mid-2022, we switched the development DB (which is local and set up as needed) to a MySQL DB and updated migration code. We got to the point where the service was starting, correctly setting up the DB in MySQL, and responding successfully to some requests. While this ended up being the easy part, it helped prove that the migration was feasible.
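
In practice, the proof of concept mostly amounted to pointing Django’s DATABASES setting at a MySQL backend and re-running migrations. A minimal sketch of that settings change (the database name, credentials, and options below are placeholders, not our actual configuration):

```python
# settings.py -- illustrative only; values are placeholders
DATABASES = {
    "default": {
        # Swap Django's Postgres backend for its built-in MySQL backend
        "ENGINE": "django.db.backends.mysql",
        "NAME": "yelp_res_dev",
        "USER": "dev_user",
        "PASSWORD": "dev_password",
        "HOST": "127.0.0.1",
        "PORT": "3306",
        "OPTIONS": {
            # Fail loudly on invalid data instead of silently truncating it
            "init_command": "SET sql_mode='STRICT_TRANS_TABLES'",
            "charset": "utf8mb4",
        },
    }
}
```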

Postgres-specific Functionality

Postgres has a lot of functionality that isn’t supported in MySQL. We also used some features that, while supported by MySQL, are not supported by our infra teams.

One example: Postgres has native support for array columns, which we used to store the schedule for each table at a restaurant as an array of integers. We re-implemented this behavior by packing the data into a string, which worked cleanly because the length of the array and the width of each element are constant.
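
The packing itself is straightforward once every element has a fixed width. A minimal sketch of the idea (the constants and helper names here are illustrative, not our actual code):

```python
# Illustrative helpers for storing a fixed-length integer schedule in a string column.
SLOT_COUNT = 96   # hypothetical: one slot per 15 minutes of the day
SLOT_WIDTH = 3    # hypothetical: each element fits in three digits

def pack_schedule(slots):
    """Encode a fixed-length list of small ints as a fixed-width string."""
    assert len(slots) == SLOT_COUNT
    return "".join(f"{slot:0{SLOT_WIDTH}d}" for slot in slots)

def unpack_schedule(packed):
    """Decode the fixed-width string back into the original list of ints."""
    return [
        int(packed[i:i + SLOT_WIDTH])
        for i in range(0, SLOT_COUNT * SLOT_WIDTH, SLOT_WIDTH)
    ]
```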

A more complicated set of changes was needed to get rid of database triggers. MySQL supports triggers in general, but our MySQL infrastructure does not. Our code used them to propagate data (triggering when certain database tables are changed) and to enforce constraints that prevent double-booking of tables.

For data propagation, our old system relied on DB changes to certain tables being published as Advanced Message Queuing Protocol (AMQP) events into RabbitMQ, which were then consumed by multiple clients subscribed to the changes relevant to them. This was powered by a Postgres-specific database extension that integrated with Postgres’ transaction management, ensuring clients never received a message until the corresponding transaction was committed. In our new system, we added logic to the Django model’s save() function to register a post-commit hook that publishes to a new AMQP topic, and refactored our code to eliminate “bulk” operations, which write to the DB without calling save(). This meant our existing watchers could listen to this new AMQP topic even when the updates go to MySQL tables. We also introduced transaction grouping: a universally unique identifier (UUID) is generated at the start of each transaction block and written along with the change data, so changes can be grouped by transaction. We then monitored both the existing topic and the new topic, and switched to the new topic only once the data matched.
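
The heart of the new publishing path is Django’s post-commit hook. The sketch below shows the shape of it, assuming a hypothetical publish_change() helper and topic name in place of our real AMQP plumbing:

```python
import threading
import uuid

from django.db import models, transaction

_txn_state = threading.local()

def start_transaction_uuid():
    """Called at the start of each transaction block to group its changes."""
    _txn_state.uuid = uuid.uuid4()
    return _txn_state.uuid

def publish_change(topic, table, pk, transaction_id):
    """Stub: the real version publishes an AMQP message to RabbitMQ."""

class PublishOnSaveModel(models.Model):
    """Hypothetical base model illustrating the post-commit publish in save()."""

    class Meta:
        abstract = True

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        txn_id = getattr(_txn_state, "uuid", None)
        # on_commit only fires after the surrounding transaction commits,
        # mirroring the guarantee the Postgres extension gave us.
        transaction.on_commit(
            lambda: publish_change(
                topic="yelp_res.db_changes",  # hypothetical topic name
                table=self._meta.db_table,
                pk=self.pk,
                transaction_id=str(txn_id),
            )
        )
```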

For preventing double-bookings, we had used a system called ‘Atomic Block Holds’. This used a DB trigger to raise an exception and prevent a write if a block on a table (a ‘block’ is a reservation, or anything else that means a table cannot be reserved at a certain time) would overlap with an existing block. To replicate this behavior without triggers, we created a new table called TableTimeSlotBlock, which contains rows keyed on the table id and the 15-minute timeslots of each existing blocked period. The application code then checks for conflicts and locks the rows (even if they don’t exist yet) by performing a SELECT … FOR UPDATE query. We put this logic earlier in the code path than the existing DB trigger, so by examining logs we could confirm that the trigger was no longer being exercised, meaning the new solution was at least as restrictive as the status quo.
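
In Django terms, the check looks roughly like the sketch below. The model fields, names, and error handling are illustrative, and the real logic also has to deal with locking ranges for rows that don’t exist yet, which is where the details get subtle:

```python
from django.db import models, transaction

class DoubleBookingError(Exception):
    pass

class TableTimeSlotBlock(models.Model):
    """One row per (table, 15-minute timeslot) that is blocked; fields are illustrative."""
    table_id = models.IntegerField()
    slot_start = models.DateTimeField()

    class Meta:
        unique_together = [("table_id", "slot_start")]

def create_block(table_id, slots):
    """Block the given timeslot datetimes for a table, or raise if any are taken."""
    with transaction.atomic():
        # Lock any existing rows for these slots so that concurrent requests
        # serialize on the conflict check instead of racing past it.
        conflicts = (
            TableTimeSlotBlock.objects
            .select_for_update()
            .filter(table_id=table_id, slot_start__in=slots)
        )
        if conflicts.exists():
            raise DoubleBookingError(f"table {table_id} is already blocked")
        for slot in slots:
            TableTimeSlotBlock.objects.create(table_id=table_id, slot_start=slot)
```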

To migrate to this system, we also had to create rows in this new table for all future reservations, and since each existing ‘block’ covered multiple timeslots, that meant adding millions of records to the new table.

Release

This was the scary part. We wanted to be able to release the new DB gradually and be able to roll back to Postgres if needed. This meant that we needed to keep both DBs in sync for some period of time. Django has multi-DB support, but that is intended for writing/reading different things to different DBs, not keeping data exactly in sync across multiple DBs.

To achieve this for writes, we:

  • Added a new model called AlsoWriteToMysqlModel in the inheritance hierarchy of all models in our code. This model redefined save() and other object-level DB write functions to write first to a ‘primary’ DB, and then save the object to the ‘secondary’ DB (this and the transaction decorator below are sketched after this list)

  • Did the same with AlsoWriteToMysqlQuerySet and queryset operations for all querysets in our code
    • In Django, not all operations are performed on objects; for example, you can call delete() on a filtered QuerySet, which performs a single DB query with that filter and never loads the actual objects.
  • Added post_save and pre_delete signal handlers for models we don’t control (like the User model or third party models) that do the same
    • We could have used this technique for all models, but we felt that having the logic inside the model, where we could, was easier to reason about and kept the DB writes as close together as possible.
  • Replaced the default Django transaction decorator with a decorator that nested a Postgres transaction inside a MySQL transaction. This meant that almost any DB failure would roll back both DBs, as long as we were in a transaction.
    • The exception is failures at MySQL commit time: DB triggers sometimes made Postgres commits fail, while MySQL commits should always succeed unless there’s an infra issue. We learned this the hard way. We originally reversed the order in an attempt to reduce the risk of introducing new failing Postgres writes during the rollout, and some transactions then failed due to DB triggers after their writes had already been committed to MySQL, leaving the data inconsistent across the databases. This is an interesting example where “playing it safe” in one dimension actually caused a bug.
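
A simplified sketch of the write path described above. The database alias names and the exact override surface are assumptions here; the real code also covers deletes, querysets, and the signal-handled models:

```python
from django.db import models, transaction

PRIMARY_DB = "postgres"   # flipped to "mysql" later in the rollout
SECONDARY_DB = "mysql"

class AlsoWriteToMysqlModel(models.Model):
    """Base model that mirrors every object-level write to the secondary DB."""

    class Meta:
        abstract = True

    def save(self, *args, **kwargs):
        kwargs.pop("using", None)  # we control the destination DBs explicitly
        # Write to the primary first; if that fails we never touch the secondary.
        super().save(*args, using=PRIMARY_DB, **kwargs)
        super().save(*args, using=SECONDARY_DB, **kwargs)

def dual_db_atomic(func):
    """Replacement transaction decorator: the Postgres transaction is nested
    inside the MySQL one, so Postgres commits first and a trigger-induced
    Postgres commit failure still rolls back the MySQL transaction."""
    def wrapper(*args, **kwargs):
        with transaction.atomic(using="mysql"):
            with transaction.atomic(using="postgres"):
                return func(*args, **kwargs)
    return wrapper
```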

For reads, we:

  • Added logic to the router (the Django class which determines which DB we read/write to) to separate the ‘read db’ and the ‘write db’
  • Added middleware to set a flag if and when we wanted a request to read from MySQL, which was respected by the router
    • This flag was set before any DB reads/writes in the middleware stack, to ensure each request only reads from one DB (the router and middleware are sketched below)
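
A rough sketch of the router and middleware, assuming a thread-local flag and a hypothetical request_should_read_mysql() rollout check:

```python
import threading

_request_state = threading.local()

def request_should_read_mysql(request):
    """Stub for the rollout logic that gradually ramps reads onto MySQL."""
    return False

class ReadFromMysqlMiddleware:
    """Decides once, before any queries run, whether this request reads from MySQL."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        _request_state.read_from_mysql = request_should_read_mysql(request)
        try:
            return self.get_response(request)
        finally:
            _request_state.read_from_mysql = False

class DualDatabaseRouter:
    """Django database router that separates the 'read db' from the 'write db'."""

    def db_for_read(self, model, **hints):
        if getattr(_request_state, "read_from_mysql", False):
            return "mysql"
        return "postgres"

    def db_for_write(self, model, **hints):
        # Writes keep going to the current primary; the dual-write model layer
        # above handles mirroring them to the secondary.
        return "postgres"
```

The router is wired in with Django’s DATABASE_ROUTERS setting, and the middleware sits early enough in the stack that the flag is in place before any query runs.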

During the release, we first kept reads on Postgres, to keep identical behavior as the status quo while also writing to MySQL. This let us cross-check the databases and fix inconsistencies and bugs at our leisure without affecting customers. We then gradually switched requests to read from MySQL, then switched the write logic to write to MySQL first, and finally (several months later) turned off Postgres writes entirely and cleaned up much of the code we had written.

The release process went relatively smoothly over the course of several months, with a few surprises we describe below.

Surprises & Interesting Takeaways

  • Originally, we planned a transition period where the ‘primary’ database could differ on a per-request basis. However, this causes issues with autoincrement primary keys. Specifically, PostgreSQL maintains a sequence that is incremented only when a row is inserted without the primary key set, which means you should either always set the key or never set it. Otherwise, each write with the key set (like when we write to MySQL and then save the object to Postgres) leaves the sequence behind the actual maximum key, causing a future Postgres write with the key unset to fail. This took some time to figure out during rollout, as the symptom was a small number of errors in status-quo flows, but no errors in the MySQL-pinned requests.

  • Django names DB savepoints with random strings. ProxySQL, which in our infrastructure sits between clients and the databases, stores query digests for use in metrics. These digests are meant to be generic representations of queries that don’t depend on the actual data written or read, but savepoint names are included in them, so each query using a savepoint got a unique digest. This led to escalating ProxySQL memory usage and a few production incidents until we figured it out. We fixed it by changing a setting in our ProxySQL instances.

  • Switching from ‘bulk’ operations to individual object-level operations is significant and can lead to logic issues (since things like save() aren’t called in bulk operations) and performance issues (since an order of magnitude more DB queries can be executed).
    • In one instance, this meant rewriting an archival batch job to use raw SQL, but otherwise it turned out that MySQL could easily handle the volume of writes we do.
  • Performing an initial data load (backfilling) early is very useful for testing, but we should have done a full, clean, second backfill once we had fixed all the bugs. Instead, we patched specific broken data as we discovered it, which led to bugs caused by data written by since-fixed code that a clean backfill would have avoided.

  • Don’t overlook other users of the database. Our analytics pipeline was getting data directly from Postgres, and moving it over to use MySQL ended up being time-consuming. This was the final blocker to decommissioning the old database.

  • Using the Yelp-standard stack improved performance. This isn’t because MySQL is inherently more performant, but because using the same stack as the rest of the company means we benefit from many people’s efforts monitoring and optimizing our database performance.

Acknowledgements

This was a large project that took the better part of a year to implement. In the interests of brevity, I’ve focused on a subset of the work Restaurants did, but it was all vital for the success of this project. Special thanks to the Database Reliability Engineering and Production Engineering teams, and everyone from Restaurants who worked on this, especially Boris Madzar, Carol Fu, and Daniel Groppe.
