Engineering Blog

Introducing LogFeeder - A log collection system

Introduction The collection and processing of logs is essential to good security. One of the primary functions of a security team is to keep organizations safe by eliminating blind spots in infrastructure. Breach investigations without logs result in a lot of guesswork. Worse, the activities of an attacker can easily remain undiscovered without adequate logging. To ensure we have a robust log storage and visualization platform, we use Elasticsearch, Logstash and Kibana (ELK). These tools form part of the toolset that we use in our Security Incident and Event Monitoring (SIEM) solution. ElastAlert is the primary means by which alerts...

Continue reading

Celebrating the Women of Yelp: AWE the Book

As a recruiter, I talk to a lot of people about what it’s like to work at Yelp. Most often, I find myself answering questions about the work environment and individual growth opportunities. During my four and a half years at Yelp, I would summarize the people here as very sharp and intelligent, while also humble and open minded. This spirit has fostered an environment that encourages individuals to learn by trying things for themselves (new hires get to push code out their first week!) and empowers them to ask questions. This collaborative work culture invites tremendous opportunity and gives...

Continue reading

Making 30x performance improvements on Yelp’s MySQLStreamer

Introduction MySQLStreamer is an important application in Yelp’s Data Pipeline infrastructure. It’s responsible for streaming high-volume, business-critical data from our MySQL clusters into our Kafka-powered Data Pipeline. When we rolled out the first test version of MySQLStreamer, the system operated at under 100 messages/sec. But for it to keep up with our production traffic, the system needed to process upwards of thousands of messages/sec (MySQL databases at Yelp on an average receive over hundreds of millions of data manipulation requests per day, and tens of thousands of queries per second). In order to make that happen, we used a variety...

Continue reading

Yelp Dataset Challenge Round 11 Announcement and Kaggle Weekly Kernel Award

Yelp Dataset Challenge Round 11 Is On! The eleventh round of the Yelp Dataset Challenge has opened. It will run until June 30, 2018. As in the past, the Yelp Dataset Challenge gives college students access to reviews and businesses from 11 metropolitan areas scattered over 4 different countries. This time around, there are a staggering 5.26 million reviews written by 1.3 million users about 175,000 businesses, as well as 146,350 check-ins and 1.1 million tips left by these users. Moreover, we have added photos about these businesses in a separate file, for convenience. With such a trove of data,...

Continue reading

Scaling Gradient Boosted Trees for CTR Prediction - Part II

Growing Cache Friendly Trees In case you missed it, Part I of this blog series outlines how we built a distributed machine learning platform to train gradient boosted tree (GBT) models on large datasets. While we were able to observe significant improvements in offline metrics, the resulting models are too large for standard XGBoost prediction libraries to meet our latency requirements. As result, we were unable to launch the models in production as we needed to serve ads within 50ms (p50) and evaluating these large models caused time-out exceptions. This article will discuss how we compressed and reordered the trees...

Continue reading

Scaling Gradient Boosted Trees for CTR Prediction - Part I

Building a Distributed Machine Learning Pipeline As a part of Yelp’s mission to connect people with great local businesses, we help businesses reach potential customers through advertising and organic search results. The goal of Yelp’s advertising platform is to show relevant businesses to users without compromising their experience with the product. In order to do so, we’ve built a click-through-rate (CTR) prediction model that determines whether or not to serve ads from a particular advertiser. The predicted CTR determines how relevant the business is to the user’s intention and how much would need to be bid to beat a competitor...

Continue reading

Yelp Dataset Challenge Round 9 Winner

Yelp Dataset Challenge Round 9 Winners The ninth round of the Yelp Dataset Challenge ran throughout the first half of 2017 and, as usual, we received a large number of highly impressive and interesting submissions. Needless to say, we were struck by the quality of the entries: keep up the good work! Today, we are proud to announce the grand prize winner of the $5,000 award: “CORALS: Who are My Potential New Customers? Tapping into the Wisdom of Customers’ Decisions” by Ruirui Li, Chelsea J-T Ju, Jyunyu Jiang, and Wei Wang (from the Department of Computer Science of the University...

Continue reading

Keeping Yelp two steps ahead: How we built GSET to protect employee email

Earlier this year, Gmail users across the globe were affected by one of the largest phishing attacks of its kind. Yelp emails were among the many corporate email systems that experienced this Google Docs phishing attack. Fortunately, our security engineers had already prepared for this level of security threat and were able to delete the suspicious emails before impacting employees. As phishing attacks have become more and more prevalent, the need for new tools and countermeasures to protect users has become more important than ever. According to the last IBM X-Force Threat Intelligence Index report, the amount of spam email...

Continue reading