As Yelp has grown over the years, we've amassed huge collections of data - the collective output of the actions of tens of millions of users. Analyzing this data helps us improve the user experience across the site, from ranking businesses and extracting "review highlights" to fighting spam and keeping the site secure. These tasks involve long-running batch processes that analyze large logs and database tables, with workflows sometimes composed of five or more dependent processes. Managing these workflows with traditional scheduling tools - most notably the cron family - eventually caused us to breach the "complexity comfort zone" that surrounds any large engineering project. But then we had a vision, and it is that vision we are releasing today as tron.
A tron “Job” is a coherent goal, composed of one or more processing steps. Collecting business page metrics on Yelp is a Job and within this Job we aggregate business data, data mine the information, and send an email notification with the metrics. A Job is not a simple atomic command but rather several steps we call Actions. Dividing our Jobs into Actions allows us to pinpoint failures and restart them accordingly. Additionally, we may uniquely configure each Action. Only when all the Actions complete successfully is the Job instance marked a success.
A common need when configuring a Job is managing the prerequisites for certain Actions within it. For example, before you can decide which business pages were the most viewed this month, you must first tally how many times each business page was viewed. These dependencies are strict: if tallying the monthly business views fails, it is useless (and possibly dangerous) to compute the top viewed business pages. To handle this, we say that the “Get Top Viewed Businesses” Action requires the “Tally Business Views” Action.
An Action can require any number of other Actions. The result is a directed acyclic graph (DAG) of dependencies, as in the example below.
The diagram above depicts the full Job of "Collecting Business Metrics". When one of the Job instances is kicked off, the Action that starts is "Tallying Monthly Business Views" because it is the only Action that has no dependencies. Once it finishes successfully, both “Getting Top 10 Businesses Viewed” and “Getting High Fluctuation Businesses” can start. And you can only “Send Email with Metrics” if the "Top 10" and "High Fluctuation" Actions complete successfully. If any Action fails along the way, no dependent Actions will start.
Note on circular dependencies: Circular dependencies (if allowed) would cause some Actions to never run. This is solved in configuration by only allowing you to reference Actions defined previously in the file.
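To make the dependency rules concrete, here is a minimal Python sketch of the idea, not tron's actual implementation. The function names, the state strings ("succeeded", "failed", "skipped") and the Action names are assumptions made for illustration. It rejects any reference to an Action that has not been defined yet, which is what rules out cycles, and it only starts an Action once everything it requires has succeeded.

```python
from collections import OrderedDict


class ConfigError(Exception):
    """Raised when an Action requires a name that has not been defined yet."""


def build_job(action_specs):
    """Validate (name, requires) pairs in file order and return an ordered mapping.

    Each Action may only require Actions defined before it, so the
    definition order is already a valid run order and no cycle can form.
    """
    actions = OrderedDict()
    for name, requires in action_specs:
        for dep in requires:
            if dep not in actions:
                raise ConfigError(f"{name!r} requires undefined Action {dep!r}")
        actions[name] = list(requires)
    return actions


def run_job(actions, run_action):
    """Run each Action whose requirements all succeeded.

    run_action(name) is a caller-supplied callable returning True on success.
    Returns a dict mapping Action name to its final state.
    """
    states = {}
    for name, requires in actions.items():
        if all(states[dep] == "succeeded" for dep in requires):
            states[name] = "succeeded" if run_action(name) else "failed"
        else:
            # A requirement failed or was skipped, so this Action never starts.
            states[name] = "skipped"
    return states


def job_succeeded(states):
    """A Job instance succeeds only when every Action succeeded."""
    return all(state == "succeeded" for state in states.values())


# Hypothetical names mirroring the "Collecting Business Metrics" example.
METRICS_JOB = [
    ("tally_views", []),
    ("top_10", ["tally_views"]),
    ("high_fluctuation", ["tally_views"]),
    ("send_email", ["top_10", "high_fluctuation"]),
]

if __name__ == "__main__":
    job = build_job(METRICS_JOB)
    # Simulate a failure in the "top_10" Action.
    print(run_job(job, lambda name: name != "top_10"))
```

This sketch runs Actions one at a time in definition order. A real scheduler could start independent Actions, such as the "Top 10" and "High Fluctuation" steps, at the same time.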
tron runs in a Master/Slave configuration: the tron daemon runs on the Master and sends commands to the Slaves over SSH. This allows us to:
- keep all of our configuration in one place
- apply load balancing techniques
- easily schedule jobs that span multiple machines
The figure below shows the Master/Slave relationship. The Master runs the tron daemon and the daemon opens SSH connections to the Slaves when it's time to run an Action instance.
tron then executes the Action’s command via SSH and collects stdout, stderr and the exit status. The results are viewable through tronview (one of our tron tools).
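As a rough illustration of collecting the three results of a run, the sketch below shells out to the standard `ssh` client with Python's `subprocess` module. This is a stand-in technique chosen for brevity, not a description of how the tron daemon manages its connections. `ActionResult`, `ssh_argv` and `run_and_collect` are hypothetical names.

```python
import subprocess
import sys
from typing import NamedTuple


class ActionResult(NamedTuple):
    stdout: str
    stderr: str
    exit_status: int


def ssh_argv(node, command):
    """Build the argument list that runs `command` on `node` via the ssh client."""
    return ["ssh", node, command]


def run_and_collect(argv):
    """Run a command and capture its stdout, stderr and exit status."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return ActionResult(proc.stdout, proc.stderr, proc.returncode)


if __name__ == "__main__":
    # Runs a local command so the example works without a remote host.
    # To target a Node, pass ssh_argv("some-node", "some command") instead.
    result = run_and_collect([sys.executable, "-c", "print('tallied')"])
    print(result.exit_status, result.stdout.strip())
```

Because ssh exits with the remote command's exit status, the same capture logic applies whether the command runs locally or on a Node.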
Note on Nodes: We added features into tron that allow you to customize which Actions are run on which Nodes. The default is that all Actions within a Job instance run on the same Node.
A large part of a batch process scheduler is, well, the scheduling. Similar to cron, tron can run Jobs at fixed time intervals as well as on more interesting schedules, such as every Monday and Friday at 3 p.m. We have also added tron variables that you can put in an Action’s command to interpolate the scheduled date or time. This is especially useful if a script requires a date or time argument. For instance, if you want to tally all business views for a given date, you can use a tron variable to pass the script that date. tron variables are also persistent for a given Job instance, so if you cancel a Job instance and restart it days later, it passes the same date or time to the script as originally planned.
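The persistence of tron variables can be sketched as follows. This is an illustrative model only: the `JobRun` class, the variable names `shortdate` and `shorttime`, and the `%(name)s` placeholder style are assumptions for the example, not a statement of tron's configuration syntax. Each Job instance stores the time it was scheduled for and renders commands from that value rather than from the current clock.

```python
from datetime import datetime


class JobRun:
    """One instance of a Job, pinned to the time it was scheduled for."""

    def __init__(self, scheduled_for):
        self.scheduled_for = scheduled_for

    def context(self):
        # Values derive from the scheduled time, never from "now", so a
        # restart days later renders exactly the same command.
        return {
            "shortdate": self.scheduled_for.strftime("%Y-%m-%d"),
            "shorttime": self.scheduled_for.strftime("%H:%M"),
        }

    def render(self, command_template):
        """Fill the placeholders in an Action's command for this Job instance."""
        return command_template % self.context()


if __name__ == "__main__":
    run = JobRun(datetime(2010, 9, 6, 15, 0))
    # Prints: tally_views.py --date 2010-09-06
    print(run.render("tally_views.py --date %(shortdate)s"))
```

Rendering from the stored schedule time is what lets a restarted instance reprocess the day it was originally meant for, instead of the day it happens to be rerun.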
tronview (a tron tool) displays tron status on four levels:
- All Jobs: Displays all Jobs, their status and last success
- Job Runs: Displays the Job history for a given Job
- Action Runs: Displays all Action instances for a given Job instance
- The Action: Displays information on the Action instance including stdout/stderr
tron also has a web interface under active development.
When something goes terribly wrong in the execution of an Action, it goes into a failed state along with the Job containing it. Sometimes this is fine with the user and tron starts the next Job instance when it is scheduled, but if the user wants to resolve this Action they can use tronctl. tronctl allows the user to restart the failed Action and continue the Job. Alternatively the user can restart the entire Job instance (and reprocess succeeded Actions). If the failure is worse than terrible, tronctl also allows users to disable an entire Job, then re-enable it once everything is resolved.
tron is written in Python and is available on our GitHub account.
tron depends on the following open source projects:
We have many Jobs running on tron and we are in the process of converting more of our cron-scheduled batch processes. tron is still in active development and more documentation and features are on the way. In particular, we plan to add:
- Improved web interface
- Handling multiple configuration files
- Monitoring Tools (email, IRC)
- Smarter Load Balancing
If you want to learn more about tron, read the tron wiki.