Engineering Blog

AMIRA: Automated Malware Incident Response and Analysis

Brave malware analysts at Yelp have spent a lot of time looking at the digital forensics from potentially infected macOS systems, gathered using our open source project, OSXCollector. Early on, we automated parts of the analysis process, augmenting the initial set of digital forensics collected from the machines with information gathered from threat intelligence APIs and internal blacklists. This involved identifying potentially suspicious domains, URLs, and file hashes. Even so, our approach to the analysis still required a certain degree of configuration and manual maintenance, which was tedious for the malware response team. In this blog post I will...

More Than Just a Schema Store

This post is part of a series covering Yelp's real-time streaming data infrastructure. Our series explores in depth how we stream MySQL and Cassandra data in real time, how we automatically track and migrate schemas, how we process and transform streams, and finally how we connect all of this into data stores like Redshift, Salesforce, and Elasticsearch. Read the posts in the series: Billions of Messages a Day - Yelp's Real-time Data Pipeline; Streaming MySQL tables in real-time to Kafka; More Than Just a Schema Store; PaaStorm: A Streaming Processor; Data Pipeline: Salesforce Connector; Streaming Messages from Kafka into Redshift in near...

How We Scaled Our Ad Analytics with Apache Cassandra

On the Ad Backend team, we recently moved our ad analytics data from MySQL to Apache Cassandra. Here’s why we thought Cassandra was a good fit for our application, and some lessons we learned that you might find useful if you’re thinking about using Cassandra! Why Cassandra? First, a little bit about our application. We have over 100,000 paying advertisers. Every day, we calculate the number of views and clicks each ad campaign received the previous day, along with the amount of money each campaign spent. With these analytics, we generate bills and many different types of reports. Back in...

Yelp Dataset Challenge Round 6 Winner

The sixth round of the Yelp Dataset Challenge ran throughout the second half of 2015, and we were really impressed with the projects and ideas that came out of the challenge. Today, we are proud to announce the grand prize winner of the $5,000 award: “Topic Regularized Matrix Factorization for Review Based Rating Prediction” by Jiachen Li, Yan Wang, Xiangyu Sun, Chengliang Lian, and Ming Yao (from the Language Technologies Institute, School of Computer Science, at Carnegie Mellon University). The authors created a recommender system to inform Yelpers about which business they might be...

Streaming MySQL tables in real-time to Kafka

This post is part of a series covering Yelp's real-time streaming data infrastructure. Our series explores in depth how we stream MySQL and Cassandra data in real time, how we automatically track and migrate schemas, how we process and transform streams, and finally how we connect all of this into data stores like Redshift, Salesforce, and Elasticsearch. Read the posts in the series: Billions of Messages a Day - Yelp's Real-time Data Pipeline; Streaming MySQL tables in real-time to Kafka; More Than Just a Schema Store; PaaStorm: A Streaming Processor; Data Pipeline: Salesforce Connector; Streaming Messages from Kafka into Redshift in near...

Yelp API v3 Developer Preview

For the past few months we’ve been working on revamping our API based on your feedback asking for more Yelp data and functionality. Today, we’re excited to announce that the newest version of our API is entering developer preview. What’s new? We’re exposing two new features as part of the developer preview: autocomplete and transaction search. As a user performs a search, autocomplete will help them find what they want (some might even say we have the ability to read their minds). With autocomplete, a user’s search experience will feel much more intuitive. The API now exposes a search endpoint...

Billions of Messages a Day - Yelp's Real-time Data Pipeline

This post is part of a series covering Yelp's real-time streaming data infrastructure. Our series explores in depth how we stream MySQL and Cassandra data in real time, how we automatically track and migrate schemas, how we process and transform streams, and finally how we connect all of this into data stores like Redshift, Salesforce, and Elasticsearch. Read the posts in the series: Billions of Messages a Day - Yelp's Real-time Data Pipeline; Streaming MySQL tables in real-time to Kafka; More Than Just a Schema Store; PaaStorm: A Streaming Processor; Data Pipeline: Salesforce Connector; Streaming Messages from Kafka into Redshift in near...

Yelp Hackathon 19: Color Code

One of the values that we cherish at Yelp is to “Be Unboring”. It’s the quality of never accepting “standard” as okay and the guiding principle for creating new and remarkable things. One of the ways we foster that creative spirit at Yelp Engineering is through our internal hackathons. Many remarkable projects have come out of them - some of them open sourced, some of them revealing interesting demographic insights and some of them pushing the boundaries of science & technology. The 19th edition of our internal hackathon wasn’t any different. Close to 80 fantastic projects across our engineering offices...
