At Yelp, we’re very big fans of Amazon’s RedShift data warehouse. We have multiple deployments of RedShift with different data sets in use by product management, sales analytics, ads, SeatMe and many other teams.
Calling RedShift a simple fork of Postgres 8.4 minimizes a lot of the work the RedShift team has done, but RedShift does share a common code ancestry with PG 8.4. This means that much of the advanced query functionality of Postgres is available, which, combined with the petabyte scale of RedShift, offers some amazingly powerful analytics tools.
Because most of the PG 8.4 query syntax is available, I often find that directly referencing the Postgres 8.4 documentation for query syntax is more readable and useful than trying to navigate Amazon’s version of the same documentation.
CTEs (Common Table Expressions): Scaling Your Queries
When dealing with OLAP (online analytical processing, or warehousing) queries, especially against snowflake schemas, it’s very common for the number of joins in a query to get large. Snowflake schemas are those where dimension tables are designed to be joined to other dimension tables, which is typical when portions of a transactional schema are mirrored into the data warehouse. RedShift (like Postgres) is well optimized for large numbers of joins, but unfortunately our brains are not.
Correctness of analytics queries is paramount; basing your business decisions on faulty data can be an extremely costly mistake. One of the reasons SQL has gotten a bad reputation for doing analytics work is complexity; a traditional procedural language has functions and procedures that let you encapsulate and compose blocks of logic, while SQL does not.
CTEs (Common Table Expressions) bring this same power of encapsulation and composability to SQL; by allowing you to compose multiple independent queries into a single statement, you can write SQL that is much clearer and easier to verify and debug.
An Example CTE: Using The WITH Statement
One of the common things we have to do inside the SeatMe codebase is determine when a restaurant’s opening and closing times for various meals occur (internally referred to as scheduled shifts). It’s very common to compute things based on these scheduled times, such as how busy the restaurant is. A (much simplified) version of this query looks like:
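A rough sketch of the shape of such a two-join query, with SQLite standing in for RedShift and an entirely hypothetical schema (the real SeatMe tables aren’t shown in this post):

```python
import sqlite3

# Hypothetical schema: restaurant -> schedule -> shift, two joins total.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE schedule (restaurant_id INTEGER, shift_id INTEGER, day_of_week INTEGER);
CREATE TABLE shift (id INTEGER PRIMARY KEY, name TEXT, start_time TEXT, end_time TEXT);
""")
conn.execute("INSERT INTO restaurant VALUES (1, 'Cafe Foo')")
conn.execute("INSERT INTO shift VALUES (10, 'dinner', '17:00', '22:00')")
conn.execute("INSERT INTO schedule VALUES (1, 10, 5)")

# Resolve each restaurant's scheduled shift times.
rows = conn.execute("""
    SELECT r.name, s.name AS shift_name, s.start_time, s.end_time
    FROM restaurant r
    JOIN schedule sc ON sc.restaurant_id = r.id
    JOIN shift s ON s.id = sc.shift_id
""").fetchall()
print(rows)
```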
The query itself, with its 2 joins, is understandable and independently verifiable. We then use this with a CTE in our analytics to compute things like reservations per shift.
Conceptually you’ve created a temporary table called scheduled_shifts with the results of the first query that you can join against in the second query.
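A sketch of that pattern, again with SQLite standing in for RedShift and a hypothetical schema: the WITH clause defines scheduled_shifts once, and the main query joins reservations against it to count reservations per shift:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schedule (restaurant_id INTEGER, shift_name TEXT,
                       start_time TEXT, end_time TEXT);
CREATE TABLE reservation (restaurant_id INTEGER, time TEXT);
INSERT INTO schedule VALUES (1, 'lunch', '11:00', '14:00'),
                            (1, 'dinner', '17:00', '22:00');
INSERT INTO reservation VALUES (1, '12:30'), (1, '18:00'), (1, '19:15');
""")

# The CTE behaves like a temporary table named scheduled_shifts that the
# main query can join against.
rows = conn.execute("""
    WITH scheduled_shifts AS (
        SELECT restaurant_id, shift_name, start_time, end_time
        FROM schedule
    )
    SELECT ss.shift_name, count(r.time) AS reservations
    FROM scheduled_shifts ss
    LEFT JOIN reservation r
      ON r.restaurant_id = ss.restaurant_id
     AND r.time BETWEEN ss.start_time AND ss.end_time
    GROUP BY ss.shift_name
    ORDER BY ss.shift_name
""").fetchall()
print(rows)
```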
One of the benefits of using CTEs when composing queries is that if they are getting re-used frequently, you can create a view over the same statement. If performance of the statement being used in the CTE is a concern and the data can be cached without hurting correctness, you can also trivially create a temporary table with the results of the CTE with only minimal change and very low risk to the overall query correctness.
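For example, promoting a CTE body to a view, or caching its results as a temporary table, is a small and low-risk change (sketched here with SQLite standing in for RedShift and a made-up table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule (restaurant_id INTEGER, shift_name TEXT)")
conn.execute("INSERT INTO schedule VALUES (1, 'dinner')")

# Re-use the CTE body as a view...
conn.execute("""
    CREATE VIEW scheduled_shifts AS
    SELECT restaurant_id, shift_name FROM schedule
""")
# ...or materialize it as a temporary table when caching the data
# doesn't hurt correctness.
conn.execute("""
    CREATE TEMP TABLE scheduled_shifts_cached AS
    SELECT * FROM scheduled_shifts
""")
rows = conn.execute("SELECT * FROM scheduled_shifts_cached").fetchall()
print(rows)
```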
It does bear saying: CTEs in both RedShift and Postgres represent an optimization barrier. When using a CTE the optimizer is unable to perform optimizations across the query in the body of the CTE and the main query, though it does optimize each of them individually. While this can be an issue, in the real world we’ve found the conceptual benefits greatly outweigh the performance drawbacks.
Window Functions or “Dang, I didn’t know SQL could do that.”
PostgreSQL Window Functions, which are available in RedShift, are extremely complex and difficult to explain. My goal here is to give a broad overview of the concepts and enough information to encourage people to try them out. Ultimately you’ll need to read and refer to the PostgreSQL documentation on Window Functions and Window Function Calls, along with the tutorial when using them in your own queries.
Window functions are a special class of analytic functions that are applied to windows of rows. Windows are defined by an OVER (...) clause which defines a set of rows related to the current row to which the function applies. While that’s extremely abstract, the diverse functionality available from the different window functions doesn’t really lend itself to a simpler definition.
The two main components of the window are:
PARTITION BY which is the logical analog of GROUP BY in a traditional query. Window functions that aggregate will do so across all rows in the partition.
ORDER BY which is required for certain functions that look forward or backward in the window.
There is also the FRAME clause, which is harder to cover and which I won’t go into in depth. The frame is another logical concept used only by functions that operate relative to the frame (like first_value / last_value).
I think of window functions as falling into two categories:
Functions that are also available as traditional analytics functions, such as count, sum, avg, etc.
Functions that are only available when using windows, such as lead, lag, ntile, etc. These expose truly novel functionality unavailable without using windows.
For functions that are also available when using GROUP BY, the primary advantage of using them with window functions is it becomes possible to do multiple different grouping operations in a single query. When combined with the power of subqueries and CTEs, this can let you do very powerful business logic all in a single statement.
It would be natural to assume that doing multiple grouping operations in a single query would be just as costly in terms of execution time as doing multiple single operations. In practice, we haven’t seen this to be the case. There is of course a cost, but we typically see it be much smaller than a 100% overhead depending on the query and the grouping.
Window Function Examples: Using Multiple Traditional Aggregates
The following query illustrates the use of multiple count functions over different partitions to compute the percent of reservations that a given restaurant accounts for by locality (city). The things to note in this query are:
Use of DISTINCT: Since window functions append columns to each row, without a DISTINCT operator, the query will give you back 1 row for every row in the join. For this query, this would be 1 row per reservation rather than a row per restaurant and city.
The two count operations each have a different PARTITION BY clause, one counting by restaurant and the other counting by locality.
The final query, which references the two columns produced by the window function in a CTE and computes a percentage using them.
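Putting those pieces together, a runnable miniature of the pattern looks like this (SQLite standing in for RedShift, requires SQLite >= 3.25 for window functions; the schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservation (restaurant TEXT, locality TEXT);
INSERT INTO reservation VALUES
  ('A', 'SF'), ('A', 'SF'), ('A', 'SF'), ('B', 'SF');
""")

# Two counts with different PARTITION BY clauses; DISTINCT collapses the
# one-row-per-reservation output back to one row per restaurant/locality.
rows = conn.execute("""
    WITH counts AS (
        SELECT DISTINCT
            restaurant,
            locality,
            count(*) OVER (PARTITION BY restaurant) AS by_restaurant,
            count(*) OVER (PARTITION BY locality)   AS by_locality
        FROM reservation
    )
    SELECT restaurant, locality,
           100.0 * by_restaurant / by_locality AS pct
    FROM counts
    ORDER BY restaurant
""").fetchall()
print(rows)
```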
Window Function Examples: Using ntile To Compute Percentiles
Frequently, Yelp needs to look at distributions of user activity and compute percentile buckets based on their activity. The query below uses the ntile function to augment a per-user count of lifetime review behavior. Things to note about this query:
The ntile(100) window is PARTITIONed BY signup_country, so it computes percentiles per signup country.
The ntile(100) window is ORDERed BY review_count, which means the rows are bucketed in order of review_count.
Each row gets a number from 1 to 100 – the logical bucket that the row falls into – added as a new column called percentile.
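A tiny runnable version of the idea (SQLite standing in for RedShift, requires SQLite >= 3.25; ntile(4) is used so the small sample divides evenly, where the post uses ntile(100) for percentiles):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_reviews
                (user_id INTEGER, signup_country TEXT, review_count INTEGER)""")
# Four made-up users in one country, with review counts 1..4.
conn.executemany("INSERT INTO user_reviews VALUES (?, ?, ?)",
                 [(i, "US", i) for i in range(1, 5)])

# ntile buckets the rows, in review_count order, within each country.
rows = conn.execute("""
    SELECT user_id, review_count,
           ntile(4) OVER (PARTITION BY signup_country
                          ORDER BY review_count) AS quartile
    FROM user_reviews
    ORDER BY review_count
""").fetchall()
print(rows)
```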
I’ve touched on two of the most powerful features for RedShift analytics, window functions and CTEs, but there’s a lot more functionality in Postgres, much of which is also in RedShift. One of my favorite Postgres talks is Postgres: The Bits You Haven’t Found, which showed me a whole huge set of Postgres functionality, including first exposing me to window functions.
In addition, brushing up on your psql chops pays dividends over time as you become fluent with the advanced functionality in the Postgres CLI. If you write a lot of date-based reports, which I suspect we all do, I would also recommend digging into the date/time functionality (in particular date_trunc). date_trunc makes doing date-based roll-ups extremely fast and easy, letting you quickly truncate dates to useful intervals like months, quarters, and weeks.
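To give a flavor of what date_trunc does, here is a pure-Python analog of date_trunc('month', …) – the real function is built into Postgres and RedShift:

```python
from datetime import datetime

def trunc_month(dt):
    """Pure-Python analog of Postgres date_trunc('month', dt):
    zero out everything below the month."""
    return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# All timestamps in January roll up to the same bucket.
print(trunc_month(datetime(2015, 1, 14, 9, 30)))
```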
2014 was a busy year for us at Yelp Engineering and with 2015 having just kicked off, we wanted to highlight some of our most exciting events from last year.
Yelp loves open source. We love contributing back to the developer and open source communities and last year was no exception to that. We saw a large number of our engineers contributing to many different projects. In July, to help organize all of the projects we worked on, we launched a new open source site along with the revamped developer site.
Over the course of the year, we open sourced 16 new projects, and saw 3,000 commits across the 48 projects we have on GitHub. Some of the biggest projects we released, including MOE (source), dockersh (source), pre-commit (source), and pyleus (source), saw a lot of excitement from the developer community and we’re very happy to have open sourced them!
Yelp Engineers at Grace Hopper
We partner a lot with the developer community to give people and groups ample opportunities to learn, interact, and share knowledge with each other. Throughout the year, our engineers attended 15 different conferences, like LAUNCH Scale and Grace Hopper. At LAUNCH Scale, VP of Engineering Michael Stoppelman shared a little bit about how we’ve scaled traffic at Yelp, discussing some of the changes we made to our infrastructure to handle the increased traffic over time. Our engineers returning from Grace Hopper were excited to share some of their experiences and feedback from the conference, hoping to help more women enter the tech industry.
One of the things we’re very proud of is our partnership with Women Who Code, which, in September, grew even stronger when we became one of the first official sponsors, helping them achieve their goal of connecting 1 million women in tech.
Three times a year, all the engineers at Yelp gather for two days of hacking, food, and fun. We put together some Yelpy puzzles, flew remote-controlled sharks, and designed our own henna tattoos. Last year, our engineers participated in three separate hackathons, where we saw a total of 221 projects between the three.
On to 2015!
There’s already a lot of fun and exciting things planned for 2015 so if you want to be part of the fun follow us on Twitter and Facebook to catch all the announcements!
We use Macs a lot at Yelp, which means that we see our fair share of Mac-specific security alerts. Host based detectors will tell us about known malware infestations or weird new startup items. Network based detectors see potential C2 callouts or DNS requests to resolve suspicious domains. Sometimes our awesome employees just let us know, “I think I have like Stuxnet or conficker or something on my laptop.”
When alerts fire, our incident response team’s first goal is to “stop the bleeding” – to contain and then eradicate the threat. Next, we move to “root cause the alert” – figuring out exactly what happened and how we’ll prevent it in the future. One of our primary tools for root causing OS X alerts is OSXCollector.
OSXCollector is an open source forensic evidence collection and analysis toolkit for OS X. It was developed in-house at Yelp to automate the digital forensics and incident response (DFIR) our crack team of responders had been doing manually.
Performing Forensics Collection
The first step in DFIR is gathering information about what’s going on – forensic artifact collection if you like fancy terms. OSXCollector gathers information from plists, sqlite databases, and the local filesystem, then packages it in an easy-to-read and easier-to-parse JSON file.
osxcollector.py is a single Python file that runs without any dependencies on a standard OS X machine. This makes it really easy to run collection on any machine – no fussing with brew, pip, config files, or environment variables. Just copy the single file onto the machine and run it. sudo osxcollector.py is all it takes.
Details of Collection
The collector outputs a .tar.gz containing all the collected artifacts. The archive contains a JSON file with the majority of the information. Additionally, a set of useful logs from the target system is included.
The collector gathers many different types of data including:
install history and file hashes for kernel extensions and installed applications
details on startup items including LaunchAgents, LaunchDaemons, ScriptingAdditions, and other login items
OS quarantine, the information OS X uses to show ‘Are you sure you wanna run this?’ when a user is trying to open a file downloaded from the internet
file hashes and source URL for downloaded files
a snapshot of browser history, cookies, extensions, and cached data for Chrome, Firefox, and Safari
user account details
email attachment hashes
The docs page on GitHub contains a more in depth description of collected data.
Performing Basic Forensic Analysis
Forensic analysis is a bit of an art and a bit of a science. Every analyst will see a bit of a different story when reading the output from OSXCollector – that’s part of what makes analysis fun.
Generally, collection is performed on a target machine because something is hinky: anti-virus found a file it doesn’t like, deep packet inspection observed a callout, endpoint monitoring noticed a new startup item, etc. The details of this initial alert – a file path, a timestamp, a hash, a domain, an IP, etc. – are enough to get going.
OSXCollector output is very easy to sort, filter, and search for manual forensic analysis. By mixing a bit of command-line-fu with some powerful tools like grep and jq, a lot of questions can be answered. Here are just a few examples:
Get everything that happened around 11:35
Just the URLs from that time period
Just details on a single user
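Since the output is newline-delimited JSON, the same questions can also be asked from Python. A sketch of the three examples above, using illustrative field names rather than the exact OSXCollector schema:

```python
import json

# Hypothetical OSXCollector-style JSON lines, one dict per artifact.
lines = [
    '{"timestamp": "2014-12-05 11:35:02", "url": "http://example.com",'
    ' "osxcollector_username": "alice"}',
    '{"timestamp": "2014-12-05 09:12:44", "osxcollector_username": "bob"}',
]
entries = [json.loads(line) for line in lines]

# Get everything that happened around 11:35.
around = [e for e in entries
          if e.get("timestamp", "").startswith("2014-12-05 11:3")]
# Just the URLs from that time period.
urls = [e["url"] for e in around if "url" in e]
# Just details on a single user.
alice = [e for e in entries if e.get("osxcollector_username") == "alice"]
print(urls)
```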
Performing Automated Analysis with OutputFilters
Output filters process and transform the output of OSXCollector. The goal of filters is to make it easy to analyze OSXCollector output. Each filter has a single purpose. They do one thing and they do it right.
For example, the FindDomainsFilter does just what it sounds like: it finds domain names within a JSON entry. The domains are added as a new key to the JSON entry. For example, given the input:
the FindDomainsFilter would add an osxcollector_domains key to the output:
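A toy version of what a FindDomainsFilter-style pass might do – this is an illustration of the idea, not the actual OSXCollector implementation:

```python
from urllib.parse import urlparse

def find_domains(entry):
    """Sketch of a FindDomainsFilter-like pass: pull domain names out of
    any URL-shaped values and add them under a new key."""
    domains = set()
    for value in entry.values():
        if isinstance(value, str) and value.startswith(("http://", "https://")):
            host = urlparse(value).hostname
            if host:
                domains.add(host)
    if domains:
        # Return a copy augmented with the new key, leaving input untouched.
        entry = dict(entry, osxcollector_domains=sorted(domains))
    return entry

print(find_domains({"url": "http://www.example.com/page"}))
```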
This enhanced JSON entry can now be fed into additional OutputFilters that perform actions like matching domains against a blacklist or querying a passive DNS service for domain reputation information.
FindDomainsFilter
Finds domain names in OSXCollector output and adds an osxcollector_domains key to JSON entries.
Compares data against user defined blacklists and adds an osxcollector_blacklist key to matching JSON entries.
Analysts should create blacklists for domains, file hashes, file names, and any known hinky stuff.
RelatedFilesFilter
Breaks an initial set of file paths into individual file and directory names and then greps for these terms. The RelatedFilesFilter is smart and ignores usernames and common terms like bin or Library.
This filter is great for figuring out how evil_invoice.pdf ended up on a machine. It’ll find browser history, quarantines, email messages, etc. related to a file.
ChromeHistoryFilter and FirefoxHistoryFilter
Builds a really nice browser history sorted in descending time order. The output is comparable to looking at the history tab in the browser but contains more info such as whether the URL was visited because of a direct user click or visited in a hidden iframe.
Threat API Filters
OSXCollector output typically has thousands of potential indicators of compromise like domains, URLs, and file hashes. Most are benign; some indicate a serious threat. Sorting the wheat from the chaff is quite a challenge. Threat APIs like OpenDNS, VirusTotal, and ShadowServer use a mix of confirmed intelligence and heuristics to augment and classify indicators and help find the needle in the haystack.
OpenDNS RelatedDomainsFilter
Looks up an initial set of domains and IPs with the OpenDNS Umbrella API and finds related domains. Threats often involve relatively unknown domains or IPs. However, the second-generation related domains often relate back to known malicious sources.
OpenDNS & VirusTotal LookupDomainsFilter
Looks up domain reputation and threat information in VirusTotal and OpenDNS.
These filters use heuristics to determine what is suspicious. They can create false positives, but usually a download from a domain marked as suspicious is a good lead.
ShadowServer & VirusTotal LookupHashesFilter
Looks up hashes with the VirusTotal and ShadowServer APIs. VirusTotal acts as a blacklist of known malicious hashes while ShadowServer acts as a whitelist of known good file hashes.
AnalyzeFilter – The One Filter to Rule Them All
AnalyzeFilter is Yelp’s one filter to rule them all. It chains all the previous filters into one monster analysis. The results – enhanced with blacklist info, threat APIs, related files and domains, and even pretty browser history – are written to a new output file.
Then Very Readable Output Bot takes over and prints out an easy-to-digest, human-readable, nearly-English summary of what it found. It’s basically equivalent to running all of the filters above in sequence and then letting a wise-cracking analyst explain the results to you. The Very Readable Output Bot even suggests new values to add to your blacklists.
This thing is the real deal and our analysts don’t even look at OSXCollector output until after they’ve run the AnalyzeFilter.
Yelp has used MySQL since the beginning, and in more recent years has adopted the Percona fork. We like MySQL because it fits our data needs, is configurable, fast, and scalable. The Percona fork includes performance enhancements and additional features on top of stock MySQL; we have used Percona’s training services in the past and are currently a support customer. MySQL also has a *huge* online community. I love that a quick search using your favorite search engine will quickly get you started down the right path to solving most problems!
A Little More Detail
We’re running roughly 100 database servers between our dev, staging, and production environments. Our database is split up into several functional shards. Different database servers house different data sets, with different tables and schemas, but all of the data for a single table resides within one shard. The InnoDB storage engine provides us with the ability to handle a fair number of concurrent transactions at a time on masters, and we rely upon it for our transactional consistency. To scale database reads, we use MySQL statement-based replication to propagate data from master DBs to local and remote replicas, which reside in 2+ read-only data centers/locations. Our replication hierarchy uses an intermediate master database in each datacenter to reduce replication bandwidth and to facilitate easy master promotions:
MySQL is also extremely portable, running on almost anything, and has a massive number of variables to tune your performance if you’re inclined. We’re a Python shop, so we use the MySQLdb connector, and SQLAlchemy.
Tools we use with MySQL
At Yelp, we have a saying: “it’s not in production until it’s monitored.” We monitor MySQL with Nagios and Sensu, and graph metrics using Ganglia, Graphite, and Grafana. We also use the Percona Toolkit’s command-line utilities to analyze database query traffic, log deadlocks, collect data during times of database stress, verify the consistency of our data between replicas and MySQL versions, perform online schema changes, and kill poorly performing queries.
Yelp DBAs <3 Developers: schema changes
In order to keep Yelp fresh and deliver new features, we push new versions of our application multiple times a day. This means we have to support frequent schema changes too! There are two rules for our schema changes to keep things moving smoothly:
They must be backwards-compatible with the previous version of the application
They must either be online operations, or we must be able to run them using pt-online-schema-change
What’s so important about backwards-compatible schema changes? While I love dropping columns and tables just as much as (if not more than!) the next person, part of pushing often is also being able to safely roll back. If we need to do something like drop a column, we will first deploy code that no longer uses the column, then take the extra step of renaming it, verify that things look good, and finally drop the column, each over its own push. We also want to keep in mind that neither database changes nor rolling out a new version of our application happens in a single instant. We always make our database/schema changes before the code deployment part of a push, and there can be a period of time during deployment when two versions of our application are running.
Thus we make our database changes before the code is deployed, and we use pt-online-schema-change to prevent replication delay as the schema change is made. pt-online-schema-change alters a table online by:
Creating an empty copy of the table to alter
Running the alter statement against the new, empty copy
Copying the rows from the original table into the new table
Replacing the original table with the new one
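The four steps above can be sketched in miniature, with SQLite standing in for MySQL and a made-up table – the real pt-online-schema-change additionally uses triggers and chunked copying to keep the copies in sync while writes continue:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
""")

# 1. + 2. Create an empty copy of the table with the alteration already
#    applied (here: a new email column).
conn.execute("CREATE TABLE _users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# 3. Copy the rows from the original table into the new table.
conn.execute("INSERT INTO _users_new (id, name) SELECT id, name FROM users")
# 4. Replace the original table with the new one.
conn.executescript("""
DROP TABLE users;
ALTER TABLE _users_new RENAME TO users;
""")

rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
print(rows)
```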
Yelp DBAs <3 Developers: queries and ORMs
This wasn’t part of my original presentation but there was an interesting question asked after my talk: do we use an ORM (Object Relational Mapping) or do we write out all of our SQL as direct queries or stored procedures? As I mentioned above, we do use SQLAlchemy and while we have some hand-written queries, we prefer to use the ORM. This debate exists because ORMs can write some, well, interesting (read: awful) queries that can really hurt database performance. However, they also greatly enhance developer productivity! We choose to favor getting new tables and the code that uses them developed and out in the wild quickly, and then iterate from there if we find that specific queries need to be optimized by hand.
We have a couple of practices and/or tools that assist us here:
Monitoring: we monitor our databases to alert us if application or database timings change
We have a “gross query checker,” which I’ve blogged about before: it alerts us to poorly performing queries before we leave the development environment
We run pt-kill: a tool from the Percona Toolkit that will kill long-running queries in production
In the slides, you can also find a couple of notes about backups, where to find additional resources when working with MySQL and some book titles I’ve read that I think are worth checking out!
The open, supportive community surrounding MySQL fits in well with Yelp Engineering’s culture of collaboration, playing well with others, and technical tenacity. On top of providing the databases, documentation, and tools, it’s important to make sure that our developers have an easy venue to discuss ideas and ask questions. Once a week, the DBA team at Yelp hosts DBA office hours – a time when we’re available to answer any and all database questions, talk about performance, swap ideas, or just generally chat.
At Yelp, we solve a lot of different information retrieval problems ranging from highlighting reviews for businesses to recommending nearby restaurants. We recently improved one system called business matching, moving from a naive ranking function to a machine learned approach using Learning to Rank.
The Business Matching Problem
Business matching is a system that accepts a description of a business (e.g. name, location, phone number) and returns the Yelp businesses that description best fits. We can use this in several different contexts. For example, in 2012 Yelp announced a partnership spearheaded by the City of San Francisco to allow municipalities to publish restaurant inspection information to Yelp. The data feed shared by the City of San Francisco contained restaurants with business pages on Yelp but we didn’t know the exact business IDs needed to correctly update their information.
Originally, we had a number of similar ad-hoc implementations across different engineering teams but, as we grew, concerns like codebase complexity, testability and maintenance costs, as well as matching quality, started coming up. To solve these problems, we built a unified system to replace all the existing ad-hoc implementations. The system, similar to other information retrieval systems, takes a semi-structured description of a business (e.g. name, location, phone number) as input, searches through our database and returns a ranked list of Yelp businesses with scores that measure how well they match the description.
We use Elasticsearch a lot at Yelp. It should come as no surprise that here we use it as our core engine. Originally, the system architecture looked fairly simple: it normalized the input, built a big Elasticsearch query, searched the index, got the results, and did some filtering. In order to get the most relevant results, we needed to leverage information included in the input for each component. Components can be things like “name,” “location text,” “geo-distance,” “phone number,” etc., which meant that each query we sent to Elasticsearch contained subqueries for each component. For example, we could build a name subquery that matches business names using TF-IDF scoring logic and a distance subquery that gives businesses higher scores if they are closer to the input location. The final query logic would linearly combine scores for each subquery or component and output a final score for each business candidate.
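The original linear scoring amounts to a weighted sum of per-component scores. A sketch of the idea – the weights and component names below are made up for illustration:

```python
# Hand-tuned weights per component (hypothetical values).
WEIGHTS = {"name": 2.0, "geo_distance": 1.0, "phone": 3.0}

def final_score(component_scores):
    """Linearly combine each component's subquery score into one
    final score for a business candidate."""
    return sum(WEIGHTS[c] * s for c, s in component_scores.items())

# A candidate with a decent name match, a so-so distance, and an
# exact phone match.
print(final_score({"name": 0.8, "geo_distance": 0.5, "phone": 1.0}))
```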
To measure the effectiveness of the system, we built an evaluation framework for it. A sample of past business matching requests was pulled and we manually labeled the returned businesses as correct or incorrect matches. These results created the “gold dataset,” and the system was then evaluated by running against this dataset and recording whether it returned the correct matching businesses for each request. Some standard information retrieval metrics like Precision/Recall/F1 are used to measure matching quality.
Addressing The Downsides
The system worked fairly well but there’s always room for improvement. As you can imagine, there are many things we need to balance in the scoring logic here:
How much weight should we give for address text matching?
Is phone number a good indicator for matching?
Is linearly combining each component score a good solution?
Originally, we addressed these questions by doing ad hoc experiments manually. We would poke around with different ideas, changing values and seeing if the F score improved. If we found a change where the evaluations looked good, we would push this change. However, performing these manual ad-hoc experiments is expensive and time-consuming.
To make things even harder, different clients may want to use the business matching system to solve different problems. Each client might want a slightly different ranking logic. Combining this time consuming task with the expense of ad-hoc manual parameter tuning led to a large amount of human time being spent on tuning the system. We reached a point where we felt that improving ranking logic by running ad-hoc experiments manually was reaching a bottleneck. In order to optimize the system further we wanted to have some automated, algorithmic solution.
Learning to Rank
Learning to Rank is a method of automatically creating a ranking model by using machine learning. In our business matching problem, we wanted to build a machine learning model that would optimize our ranking algorithm in an automatic and systematic way.
How it works
Given the nature of our problem, we adopted the “Pointwise Approach,” which approximates matching businesses as a regression problem. More specifically, for each candidate business we want the machine learning model to predict a numeric score, which serves as the estimate of how well that business matches the input description. As for features, we included the scores for each component returned from Elasticsearch. Some additional features that reflect the original ranking given by Elasticsearch are included as well. We generated the training data by running a regular evaluation on the gold dataset and recording the relevant information returned from Elasticsearch.
Putting all of this together, we now have our new architecture. We still construct the query to Elasticsearch and get results containing a list of candidate businesses with scores for each component/subquery. But instead of linearly combining them, we throw them into the trained rescoring model and, finally, a ranked list of rescored business candidates is returned. Essentially, for this system we’re using Elasticsearch (which is really good at getting a pool of potentially relevant candidates) as a “recall engine”. We extract the core ranking logic out of Elasticsearch and use the more powerful machine learning algorithms to achieve better matching quality.
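A minimal sketch of the pointwise idea: fit a regression model mapping per-component scores to a label (1.0 = correct match, 0.0 = incorrect), then re-rank candidates by the model’s prediction. A tiny hand-rolled gradient-descent linear fit stands in for the real model, and all data below is made up:

```python
# Training data: [name_score, phone_score] per candidate, with labels.
train_X = [[0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.1, 0.3]]
train_y = [1.0, 0.0, 1.0, 0.0]

# Fit a linear regression by stochastic gradient descent.
w = [0.0, 0.0]
lr = 0.1
for _ in range(2000):
    for x, y in zip(train_X, train_y):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]

def rescore(x):
    """Predicted match quality for one candidate's component scores."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Re-rank candidates by the model's predicted score instead of the
# hand-tuned linear combination.
candidates = {"biz_a": [0.85, 0.9], "biz_b": [0.15, 0.2]}
ranked = sorted(candidates, key=lambda b: rescore(candidates[b]), reverse=True)
print(ranked)
```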
Instead of searching for a set of optimal parameters of a linear ranking function manually, the learning algorithm makes this optimization process flexible. Now the ranking function can be either linear or non-linear by applying different learning algorithms with the parameters of the ranking function being learned in an automatic and systematic way. Moreover, since we are frequently updating the database, the TF/IDF and other scoring factors change over time. The new machine learning ranking model provides certain stability on top of Elasticsearch. Our evaluation results showed that our new learning to rank approach boosted F1 score from 91% to 95%.
With these improvements, we can treat our business matching system as a general business retrieval framework that can be configured for new problems or clients, solving a much broader set of problems. For example, we are currently working on improving our business deduplication system, which discovers and automatically merges duplicate businesses in our data. To use the business matching system for this, we just need to retrain the machine learning model with a modified set of features on a different training set.
By using machine learning to approach our business matching problem, our retrieval system’s matching quality significantly improved and also became more flexible, more stable and more powerful.