Distributed tracing at Yelp
Prateek A., Software Engineer
- Apr 1, 2016
Yelp is powered by more than 250 services, from user authentication
to ad delivery. As the ecosystem of services has grown since 2011, performance
introspection has become critical. The tool we chose to gain visibility was
Zipkin, an open source distributed tracing framework. Since most of our
services are built with Pyramid, we built Pyramid instrumentation,
pyramid_zipkin, and a Swagger client decorator, swagger_zipkin.
Below, we walk you through how we use these powerful tools at Yelp and how you
can leverage them for your organization.
Why Distributed Tracing is Critical in an SOA Environment
If you get an alert after a deployment that the average timing for a particular endpoint has increased by 2-3x, how would you start debugging it? For us, the answer is distributed tracing: the process of gathering timing information from all the services that get invoked when an endpoint is called. In the diagram below, we can see that out of a total of ~1.8s, 0.65s was consumed by one service, allowing us to investigate further. As more services get called, this becomes immensely helpful in debugging performance regressions.
Another benefit of distributed tracing is that it helps service owners know about their consumers. Using aggregation and dependency graph tools, the service owner can check which upstream services rely on them. When a particular service wants to deprecate their API, they can notify all their consumers about it. With tracing, they can track when all traffic to the deprecated API stops, knowing it’s safe to remove the API.
To understand distributed tracing we first need to cover some terminology. There are two key terms that need to be explained:
- span: information about a single client/server request and response
- trace: a tree of spans representing all the service calls triggered during the lifecycle of a request
As shown in the diagram above, a span lifetime starts with the request invocation from the client (Service A) and ends when the client receives back the response from the server (Service B).
To create a full featured trace, the system needs to ingest proper span information from each service. The four points necessary to build a span, as seen in the diagram above, are (in chronological order):
- Client Send (CS): timestamp when upstream service initiated the request.
- Server Receive (SR): timestamp when downstream service receives the request.
- Server Send (SS): timestamp when downstream service sends back the response.
- Client Receive (CR): timestamp when upstream service receives back the response.
CS and CR constitute the client side span information, and SR and SS the server side span information.
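These four timestamps are enough to break a request's latency down into client-observed time, server-side time, and network overhead. A minimal sketch of that arithmetic (the Span class and its field names here are illustrative, not Zipkin's actual span model):

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Illustrative only; not Zipkin's real span representation."""
    cs: float  # Client Send: client initiates the request
    sr: float  # Server Receive: server receives the request
    ss: float  # Server Send: server sends back the response
    cr: float  # Client Receive: client receives the response

    @property
    def client_duration(self):
        # Total latency observed by the upstream service.
        return self.cr - self.cs

    @property
    def server_duration(self):
        # Time spent inside the downstream service handler.
        return self.ss - self.sr

    @property
    def network_overhead(self):
        # Round-trip time not accounted for by the server.
        return self.client_duration - self.server_duration

span = Span(cs=0.00, sr=0.05, ss=0.40, cr=0.46)
print(round(span.client_duration, 2))   # 0.46
print(round(span.server_duration, 2))   # 0.35
```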
Zipkin provides a nice UI to visualize traces and sort them by conditions like longest duration. At Yelp, we use Kafka as the primary transport for Zipkin spans and Cassandra as its datastore.
Zipkin consists of three basic parts:
- collector: reads from a transport (like Kafka) and inserts the traces to a datastore (like Cassandra)
- query: reads the traces from the datastore and provides an API to return the data in JSON format
- web UI: exposes a nice UI for the user to query from. It makes API calls to the query and renders the traces accordingly
These components are explained in more detail in the Zipkin docs.
One critical component to the system is the tracer. A tracer gets hooked into a service (which is to be traced) and its responsibility is to send the service trace information to the Zipkin collector while making sure it does not interfere with the service logic and does not degrade service performance.
Openzipkin/zipkin provides tracer support for various languages like Scala, Java and Ruby, but currently not Python, which we use extensively. There is a third party tracer available for Django but we wanted a solution for Pyramid so we implemented a Pyramid Python tracer which integrates with our services and sends data to Zipkin. The tracer is called pyramid_zipkin.
Server side span: Introducing pyramid_zipkin
Most of our services run on Pyramid with uWSGI on top for load balancing and monitoring. Pyramid has a feature called tweens, which are similar to decorators and act like a hook for request and response processing. pyramid_zipkin is a tween which hooks into the service and records SR and SS timestamps to create a span. A span will be sent to the collector based on two conditions:
- If the upstream request contains the X-B3-Sampled header, the span is recorded only when the header's value is “1”; otherwise it is discarded.
- If no such header is present, pyramid_zipkin assumes the current service to be the root of the call stack. It rolls a die to decide whether to record the span or not. The probability is based on a configuration option, tracing_percent, which can be different for each service.
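Those two rules can be sketched as follows. This is a simplified illustration assuming a plain dict of headers and a hypothetical tracing_percent value, not pyramid_zipkin's actual implementation:

```python
import random

TRACING_PERCENT = 1.0  # hypothetical per-service config value, in percent

def should_sample(headers):
    """Decide whether to record a span for the current request.

    A simplified sketch of the two sampling rules above.
    """
    sampled = headers.get('X-B3-Sampled')
    if sampled is not None:
        # An upstream service already made the decision for us.
        return sampled == '1'
    # No header: treat this service as the root of the call stack
    # and roll a die weighted by the configured tracing percentage.
    return random.random() * 100 < TRACING_PERCENT

print(should_sample({'X-B3-Sampled': '1'}))  # True
print(should_sample({'X-B3-Sampled': '0'}))  # False
```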
pyramid_zipkin provides other features like recording custom annotations and
binary annotations. Complete documentation is available here.
Connecting spans: Introducing swagger_zipkin
Before getting into how we record client side spans, let’s explain how we connect
Zipkin spans. This is done by sending
trace_id information in each
request’s header. We do this with the help of our open source package swagger_zipkin.
Most of our services have a corresponding Swagger schema attached to
them. Services talk to each other using our in-house Swagger clients, swaggerpy
(supports schema v1.2) and bravado (supports schema v2.0).
swagger_zipkin provides a decorator to wrap these clients, which enables sending zipkin-specific headers
with requests. Behind the scenes it calls
pyramid_zipkin’s API to get the necessary
header information. More details are available in the documentation.
A sample service call wraps the Swagger client with the swagger_zipkin decorator and then invokes operations on the wrapped client as usual.
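As a rough illustration of the wrapping pattern, here is a self-contained sketch. FakeClient, ZipkinDecoratedClient, and get_zipkin_headers are hypothetical stand-ins, not swagger_zipkin's real API; the point is only that the decorator injects B3 trace headers into every outgoing request:

```python
import uuid

def get_zipkin_headers():
    # Stand-in for reading the current trace context; in practice the
    # decorator asks pyramid_zipkin for these values.
    return {
        'X-B3-TraceId': uuid.uuid4().hex[:16],
        'X-B3-SpanId': uuid.uuid4().hex[:16],
        'X-B3-Sampled': '1',
    }

class FakeClient:
    """Minimal stand-in for a swaggerpy/bravado client."""
    def call(self, operation, headers=None):
        return {'operation': operation, 'headers': headers or {}}

class ZipkinDecoratedClient:
    """Wraps a client so every call carries the B3 trace headers."""
    def __init__(self, client):
        self.client = client

    def call(self, operation, headers=None):
        headers = dict(headers or {})
        headers.update(get_zipkin_headers())
        return self.client.call(operation, headers=headers)

client = ZipkinDecoratedClient(FakeClient())
response = client.call('get_business')
print(sorted(response['headers']))
# ['X-B3-Sampled', 'X-B3-SpanId', 'X-B3-TraceId']
```

Because the wrapper forwards calls to the underlying client, callers use the decorated client exactly as they would the bare one.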
Client side span: Leveraging Smartstack
It might seem obvious that the client spans (CS and CR) could easily be recorded by swagger_zipkin, since it is already used to decorate the client calls. However, using swagger_zipkin can lead to incorrect numbers when the request is made asynchronously and the response is stored as a non-blocking future. The client may ask for the result from this future at any point, which could be five or even ten seconds after the response was actually returned by the server. The decorator would include this waiting time in CR, resulting in misleading numbers.
We need a way that's not dependent on whether the call was made by a synchronous or an asynchronous client. We leverage our SmartStack setup to get the CS and CR numbers. All of Yelp's services are on SmartStack,
which means an upstream service connects to localhost with the port number assigned
to the downstream service. A local HAProxy listens on this port, which
then forwards the request to the actual machine hosting the service. Conveniently,
HAProxy can be configured to log requests and include relevant headers, timestamps,
and durations. Thus, by parsing HAProxy logs, we find CS (origin timestamp) and CR (origin timestamp + time taken for response) and send the client span directly to the Zipkin collector.
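The derivation from a log line can be sketched as follows. The log line layout here ("epoch_ms duration_ms service") is hypothetical; real HAProxy output depends on the configured log-format string:

```python
import re

# Hypothetical layout: "<epoch_ms> <duration_ms> <service>".
LOG_PATTERN = re.compile(
    r'(?P<timestamp_ms>\d+) '   # when HAProxy accepted the request (CS)
    r'(?P<duration_ms>\d+) '    # total time until the response came back
    r'(?P<service>\S+)'         # downstream service name
)

def client_span_from_log(line):
    """Derive client-side CS and CR timestamps from one log line."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    cs = int(m.group('timestamp_ms'))
    cr = cs + int(m.group('duration_ms'))
    return {'service': m.group('service'), 'cs': cs, 'cr': cr}

print(client_span_from_log('1459500000000 650 ad_delivery'))
# {'service': 'ad_delivery', 'cs': 1459500000000, 'cr': 1459500000650}
```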
Zipkin components: Powered by PaaSTA
As discussed earlier, to complete the infrastructure we need to deploy the three Zipkin components (collector, query, and web UI). These are deployed as PaaSTA services. This part was relatively simple, as Zipkin already provides Dockerfiles for each of these components. PaaSTA can read these Dockerfiles and launch the containers.
Wrapping things up
This blog post gave an overview of Zipkin as a distributed tracing tool and how Python services can be integrated with it. Zipkin is still under heavy active development thanks to its healthy open source community. We are trying to make its Python support more powerful and friendlier to work with; we built pyramid_zipkin and swagger_zipkin for this purpose. Go check them out!
Want to help build such distributed systems tools?
At Yelp we love building systems we can be proud of, and we are proud of our robust infrastructure. If you do too, check out the Software Engineer - Infrastructure positions on our careers page!