Engineering Blog

September 29th, 2014

Intern Project: Real-Time Log Tailing with Franz, our Kafka Wrapper

At Yelp, we often need to analyze data in real time. Right now most of our data is aggregated and logged using a system called Scribe. This means that real-time processes currently depend on tailing Scribe, reading messages as they are added to the end of the logs.

Simply tailing the logs means that any time a tailer is not running, due to maintenance or an unexpected error, it misses all the information logged during that time. We want a more flexible system that allows processes to pick up where they left off, so our analyses account for all of the data available.

In comes Kafka, a different kind of logging system. One of the big differences from Scribe is that Kafka provides you with the ability to start tailing from any point in the log, allowing past messages to be reprocessed if necessary. We’re already starting to use Kafka for some applications at Yelp, and all the data being written to Scribe is also available from Kafka.

One thing has been stopping us from switching to Kafka for real-time processing: the lack of a simple, efficient way to tail Kafka. What we do have is an internal service with an HTTP API for interacting with our Kafka clusters, named Franz. As an HTTP service, though, it can only respond to requests, not initiate communication. That means a client tailing Kafka with this API must poll continuously to see if new messages have been written. In addition, Franz’s current API requires clients to keep track of low-level, Kafka-specific details about their position in the logs, further inhibiting adoption.

This summer, I worked on adding a high-level WebSocket endpoint to Franz, designed to improve the Kafka tailing experience. WebSocket is a relatively new internet protocol that lets clients and servers establish long-lived, two-way connections with each other. With such an endpoint, a client can simply open one WebSocket connection with Franz to start tailing Kafka. As messages become available, Franz uses the connection to forward them, without any further action from the client. The endpoint also manages each client's position, so that reconnecting clients automatically resume from where they left off.

Because this is the first time we’re using WebSocket at Yelp, I had a lot of freedom in the implementation. The existing parts of Franz were implemented in Java with Dropwizard, but Dropwizard is only designed for regular HTTP endpoints. Ultimately, I decided to use Atmosphere, a Java framework that supports WebSocket, and added an Atmosphere servlet to the Dropwizard environment.

Using the endpoint is fairly straightforward: you can establish a connection with a Python client like ws4py or tornado.websocket, or even a Chrome extension. Once you're connected, you send a topic message:

{
    "topic": "service_log",
    "batch_size": 20
}

where batch_size specifies the maximum number of Kafka messages you want per message from Franz. Franz will then start streaming messages back to you, at most batch_size at a time. A response from Franz is pretty simple: it's just an object containing an array of messages, each with some metadata:

{
    "messages": [
        { "topic": "<topic_name>", "partition_id": <partition_id>, "offset": <offset>, "message": <json_dict>},
        ...
    ]
}
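To make the exchange concrete, here's a minimal sketch of helpers a client might use. The payload shapes come straight from the examples above, but the function names are hypothetical; in practice they'd be wired into a WebSocket library such as ws4py or tornado.websocket.

```python
import json


def topic_message(topic, batch_size=20):
    """Build the JSON payload that asks Franz to start streaming a topic."""
    return json.dumps({"topic": topic, "batch_size": batch_size})


def unpack_batch(raw):
    """Parse one Franz response into (partition_id, offset, message) tuples,
    keeping offsets handy in case the client wants to track its position."""
    batch = json.loads(raw)
    return [(m["partition_id"], m["offset"], m["message"])
            for m in batch["messages"]]
```

A client would send `topic_message(...)` right after the connection opens, then call `unpack_batch(...)` on each frame Franz pushes back.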

We’re currently working on wrapping up the project and many people and teams at Yelp are excited to use it. The first user will likely be the real-time ad metrics display. Since most Yelp applications are written in Python, there is also a Python client in progress to facilitate service use of the new interface. I’m looking forward to my project being used across the organization!

September 18th, 2014

Hackathon 14: Puzzles, Pizza and Projects Galore!

Three times a year the entire engineering team at Yelp gets together and does innovative (sometimes crazy) things like launching a 3D printer into space or flying a quadcopter with human wings…err…arms or figuring out whether Cronuts are more popular than Donuts. We call this tri-annual event…Hackathon! It’s a festival celebrating innovation, creativity, and technical badassery where our smart, talented and witty engineers get 48 hours to work on anything they like. Needless to say, a relentless supply of delicious food also plays a key role in this event.

Our hackers showing off their projects in a science fair style exhibition


The 14th version of our Hackathon, which was held this past month, saw around 80 projects across all of our engineering offices, covering a wide variety of topics ranging from mining our rich dataset to developing visualization tools to building robots.

Sometimes our engineers de-stress by attempting to put together ridiculously hard monochromatic jigsaw puzzles custom created with an inside joke


Shahid C., one of our intern extraordinaires this summer, worked on a project that he calls “Yelp Boost,” a nifty visualization tool that tries to address the age-old economics question of supply and demand. Shahid echoes what sounds like a fundamental tenet of Yelponomics 101, “If we can figure out where the demand for a product greatly outweighs supply, we could recommend business owners to set up their shops in those locations in order to meet this demand and boost their sales!” To determine these supply and demand logistics, Shahid dug deep into our search logs and came up with real time visualizations that look like this:

image02

The heat map (the light blue to intense red) represents an increasing demand for pizza, while the red dots with green halos represent pizzerias in San Francisco. You see those big red blobs with dropped pins inside them? There is a high demand for pizza there, but unfortunately, there aren’t many pizzerias nearby. Hmm…wonder what could be done about that.

Did I mention that we also built robots during Hackathon 14? Apart from the physical ones that could roam around and shoot Nerf darts at you, a team of engineers (Aditya M., Anthony M., Jon M. and Kurtis F.) built a different kind of robot: one that tries to understand Yelp the same way traditional web crawlers do. Affectionately called BotBot, it's a web crawler that shows our engineering team how crawlers like Googlebot, Yahoo Slurp, and Bingbot discover and index our content. The team created this useful simulation using Scrapy to crawl the site and pull out links, and Selenium to process pages with JavaScript content.

image03

Pretty cool, eh?

Have the creative engineering gears in your brain started turning? Check out our exciting product and engineering job openings at www.yelp.com/careers and apply today! Who knows, you may be showing off your killer idea at Yelp Hackathon 15.

September 10th, 2014

Yelp Sponsors Women Who Code (WWCode)

We’re happy to announce that we are one of the first official sponsors of Women Who Code! WWCode is an organization whose goal is to help women excel in technology careers.

WWCode and Yelp started working together three years ago when the meetup group was created. We’ve hosted many of their events ranging from Ruby workshops to discussion panels including CEOs and CTOs. Since their launch, WWCode has grown to 14,000 members across 14 countries worldwide. By sponsoring their new non-profit (as of this July!), we’re excited and happy to help them achieve their goals of expanding into 50 cities worldwide by 2015 with 1 million members by 2019.

As WWCode expands, they’re looking to reach out to top technical universities around the nation in order to introduce women to engineering at a younger age. We will be partnering with them at their first pilot university, Waterloo, this fall.

“Yelp has supported most of Women Who Code’s major events over the past three years,” said Alaina Percival, WWCode CEO. “Collaborating with Yelp will be key in reaching our goal of being in 50 cities by the end of the year.”

Want to get involved? You can help support WWCode by attending one of their upcoming events listed here:

Grace Hopper Practice Talks
September 23, 2014
Doors open: 6:30PM
Yelp HQ

Women Who Code Fundraiser – Applaud Her
October 23, 2014
Doors open: 6:00PM
Zendesk HQ

September 8th, 2014

Making Sense of CSP Reports @ Scale

CSP is Awesome

Content Security Policy isn’t new, but it is so powerful that it still feels like the new hotness. The ability to add a header to HTTP responses that tightens user-agent security rules and reports on violations is a huge win. Don’t want to load scripts from a third-party domain? Set a CSP and don’t. Trouble with mixed content warnings on your HTTPS domain? Set a CSP and let it warn you when users are seeing mixed content. Realistically, adding new security controls to a website and codebase as large as Yelp’s needs to be a gradual process. If we applied the new controls all at once, we’d end up breaking our site in unexpected ways, and that’s just not cool. Fortunately, CSP includes a reporting feature – a “lemme know what would happen, but don’t actually do it” mode. By using CSP reporting, Yelp is able to find and fix problems related to new CSP controls before they break our site.
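As a concrete illustration, a report-only policy is just a header string; with the Report-Only variant the browser enforces nothing and only POSTs violation reports to the report-uri. A sketch of assembling one (the directive values here are illustrative):

```python
# Each directive is a (name, value) pair; order is preserved in the header.
directives = [
    ("default-src", "https:"),
    ("script-src", "https:"),
    ("style-src", "https:"),
    ("report-uri", "https://biz.yelp.com/csp_report"),
]

# The Report-Only header name tells the browser to report, not enforce.
header_name = "Content-Security-Policy-Report-Only"
header_value = "; ".join("%s %s" % (name, value) for name, value in directives)
```

Switching `header_name` to `Content-Security-Policy` later turns the same policy into an enforced one.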

Reading a Sample CSP Report

CSP reports are JSON documents POSTed from a user’s browser to Yelp. An example report might look like:

{
 "csp-report": {
   "document_uri": "https://biz.yelp.com/foo",
   "blocked_uri": "http://www.cooladvertisement.bro/hmm?x=asdfnone",
   "referrer": "https://biz.yelp.com",
   "source_file": "https://biz.yelp.com/foo",
   "violated_directive": "script-src https:",
   "original_policy": "report-uri https://biz.yelp.com/csp_report; default-src https:; script-src https:; style-src https:"
 }
}

This report says, “I went to https://biz.yelp.com/foo but it loaded some stuff from cooladvertisement.bro over HTTP, and I showed a mixed content warning.” Looks like www.cooladvertisement.bro needs to be loaded over HTTPS and then all will be good.

Making Sense of CSP Reports @ Scale

It’s easy to read a single CSP report, but what if you’re getting thousands of reports a minute? At that point you need some smart tools and to work with the data to make sense of everything coming in. We wanted to reduce noise as much as possible, so we took a few steps to do just that.

Get rid of malformed or malicious reports

Not all reports are created equal. Some are missing required fields and some aren’t even JSON. If you have an endpoint on your website where users can POST arbitrary data, there will be a lot of noise mixed in with the signal. The first thing we do is discard any reports that aren’t well-formed JSON or don’t contain the necessary keys.
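A sketch of that first pass, assuming reports arrive as raw POST bodies; the function name and the required-key set are an illustrative subset, not our exact validation:

```python
import json

# Keys a report must have before we bother processing it (illustrative subset).
REQUIRED_KEYS = {"document_uri", "blocked_uri", "violated_directive"}


def parse_report(body):
    """Return the csp-report dict, or None if the POST body is malformed."""
    try:
        report = json.loads(body)["csp-report"]
    except (ValueError, KeyError, TypeError):
        return None  # not JSON at all, or not shaped like a CSP report
    if not REQUIRED_KEYS.issubset(report):
        return None  # well-formed JSON, but missing required fields
    return report
```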

Massage the reports to make them easier to aggregate

It was helpful to group similar reports and apply the Pareto principle to guide our efforts in addressing CSP reports. We take any URI in the report and chop it down to its domain, getting rid of the uniqueness of nonces, query params, and unique IDs, which makes the reports much easier to group:

{
 "csp-report": {
   "document_uri": "https://biz.yelp.com/foo",
   "document_uri_domain": "biz.yelp.com",
   "blocked_uri": "http://www.cooladvertisement.bro/hmm?x=asdfnone",
   "blocked_uri_domain": "www.cooladvertisement.bro",
   ...
 }
}
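The chopping step itself is simple; a sketch using the standard library's URL parser (the field names follow the examples above, the function name is hypothetical):

```python
from urllib.parse import urlparse

# URI-valued fields that may appear in a report.
URI_FIELDS = ("document_uri", "blocked_uri", "source_file")


def add_domains(report):
    """For each URI field present, add a <field>_domain key holding just the
    hostname, so reports group by domain rather than by full, unique URI."""
    for field in URI_FIELDS:
        uri = report.get(field)
        if uri:
            report[field + "_domain"] = urlparse(uri).netloc
    return report
```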

Discard unhelpful reports

Surprisingly, you’ll see a whole lot of stuff that’s not really about your website when you start collecting reports. We found some good rules to discard the unhelpful data.

blocked_uri and source_file must start with http

We see loads of reports with browser-specific URI schemes: stuff related to extensions or the inner workings of a browser, like chromeinvoke:// or safari-extension://. Since we can’t fix these, we ignore them. source_file is an optional field in a CSP report, so we apply this rule to source_file only when it has a value.

document_uri must match the subdomain the report was sent to

If we’re looking at CSP reports that were sent to biz.yelp.com, then we’re only interested in reports about documents on biz.yelp.com. All sorts of strange proxy services and client-side ad injectors will end up serving a modified copy of your page and generating reports for things you can’t fix.
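Both discard rules can be sketched as a single predicate; the function name and shape are hypothetical, but the checks mirror the rules described above:

```python
from urllib.parse import urlparse


def is_actionable(report, expected_host):
    """Apply both discard rules: http(s) URIs only, and the document must
    live on the subdomain this report endpoint serves."""
    if not report["blocked_uri"].startswith("http"):
        return False  # browser-internal scheme, e.g. safari-extension://
    source_file = report.get("source_file")
    if source_file and not source_file.startswith("http"):
        return False  # optional field; only checked when it has a value
    # document_uri must match the host the report was POSTed to.
    return urlparse(report["document_uri"]).netloc == expected_host
```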

Retain some useful data

We don’t want to lose useful data that came in as part of the POST request, so we tack it onto the report. Info like the user agent can be super helpful in tracking down an “Oh… that’s only happening on the iPhone” issue.

{
 "csp-report": {
   "document_uri": "https://biz.yelp.com/foo",
   "document_uri_domain": "biz.yelp.com",
   ...
 },
 "request_metadata": {
    "server_time": 1408481038,
    "yelp_site": "biz",
    "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53",
    "remote_ip": "127.0.0.1"
  }
}

Throw this all in a JSON log

Once we’ve got a nice, well-formed JSON report with some helpful extras, we throw it into a log. Logs are aggregated from our various datacenters and made available as a stream to our analysis tools.
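Putting the last two steps together, a sketch of wrapping a massaged report with request metadata and serializing it as one JSON log line (the helper name and exact metadata fields are illustrative, following the example above):

```python
import json
import time


def log_line(report, user_agent, remote_ip, yelp_site):
    """Wrap the massaged report with request metadata and serialize it as a
    single JSON line, ready for downstream aggregation and analysis."""
    entry = {
        "csp-report": report,
        "request_metadata": {
            "server_time": int(time.time()),  # when we received the report
            "yelp_site": yelp_site,           # which property it came from
            "user_agent": user_agent,
            "remote_ip": remote_ip,
        },
    }
    return json.dumps(entry, sort_keys=True)
```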

Visualize, monitor, and alert for the win

The Yelp security team is a huge fan of Elasticsearch/Logstash/Kibana.  Like we do with pretty much any log, we throw these CSP reports into our ELK cluster and visualize the results.

CSP Kibana Dashboard

This image shows a rapid decrease in incoming CSP report volume after fixing a page that caused mixed content warnings

From there it’s easy for our engineers to view trends, drill into specific reports, and make sense of reports at scale. We’re also adding monitoring and alerting to the reports in our Elasticsearch cluster so it can let us know if report volumes rise or new issues crop up.

Give it a try

We’re making sense of CSP reports at scale and that’s super useful in monitoring and increasing web application security. We’d love to hear from you about how you’re using CSP. Let us know at opensource@yelp.com.

September 4th, 2014

Maker’s Day – Or How Thursdays Became Every Yelp Engineer’s Dream

It’s been a busy past few years here at Yelp Engineering. With our 10th anniversary this year and our recent launch in Chile, we think it’s safe to say we’re on to something. But it would be foolish of us to stop here. At the end of the day, we’re engineers: we live for the fact that there are still so many challenging problems to solve, features to improve, and datasets to explore. With the size of the projects we’re tackling nowadays, our Engineering and Product Management teams need to be in constant contact to coordinate development, testing, and release. We fully embrace the rapid, iterative process characteristic of Agile, Scrum, and XP, so you’ll often see a product manager and engineer hashing out an idea at one of our large built-in whiteboards or in the team’s pod. Soon enough, though, we found ourselves with schedules like this:

makers-day-calendar

Coordination is incredibly important, but we also need time to actually build all those cool features we come up with. That’s why, about a year ago, we introduced Maker’s Day here at Yelp.

So what is Maker’s Day? The concept is pretty simple: meetings, interviews, and general interruptions aren’t allowed for engineers on Thursdays. Some teams even cancel standups on those days while others use them as a quick way to unblock folks so that there are fewer disruptions later on. If any questions come up, we use email instead of showing up at a person’s desk or pinging them over IM. Outside of those general guidelines, how engineers use Maker’s Day is really up to them: some make it into a long, uninterrupted coding period, others prefer it for reviewing designs and diving deep into a topic. And by the way, for those engineering managers out there, this applies to us, too.

We’re certainly not the first to come up with this idea. Back in 2009, Paul Graham, in his “Maker’s Schedule, Manager’s Schedule” essay, described how the partners at Y Combinator were implementing the idea: “You can’t write or program well in units of an hour. That’s barely enough time to get started.” Craig Kerstiens of Heroku mentioned, as part of his How Heroku Works series, how the value of Maker’s Day had increased exponentially as the company grew. Intel even jumped into the discussion with hard facts from their “Quiet Time” pilot. Closer to the Python community, Daniel Greenfeld tweeted what everyone was thinking back in 2012.

So how has Maker’s Day done here at Yelp? We don’t have spreadsheets of numbers to prove its success. However, on Thursdays, you’ll notice the engineering floors are a tad quieter, and folks are eager to get to their desks and jump into whatever task they’ve lined up for that day. That’s enough for us to stick with it.

In the end, Maker’s Day was a good step, but we don’t think it’s the be-all and end-all solution. Similar to our software development strategy, we’re constantly iterating on our processes within Engineering. If you love thinking about these kinds of problems, we’re always looking for great Engineering Managers to help grow our talented team of engineers.