One of the most engaging features of Yelp is our photo and video gallery. When you visit a Yelp Business Page in the mobile app, there is a photo at the top of the page to provide visual context. It also serves as a compelling entry point to our photo viewer when you pull it down. We wanted the same effect on our mobile site, so we set out to develop a smooth animation to pull down this photo and delight mobile web users with the experience they’re used to in our mobile applications.
I was tasked with implementing this animation as part of my internship. Having little prior experience, all I knew was that when the user touches to pull down on the photo, its CSS properties should update over time to generate what is pictured above.
To feel smooth, this animation needs to run at 60 frames per second (fps). That sets our frame budget at 16ms: 16ms to perform all the animation-related work needed to render each frame. Sounds like a challenge!
Scoping Out Animation
The first step was to place the background image behind the top of the page so that we could later expand and scale it.
After completing this, I started to manually test out how the image would expand and thought about which CSS properties were needed to accomplish the animation effect.
When the photo is being pulled down, several CSS properties should be animated over time:
margin-top, to control the photo’s top offset
opacity, to fade-in the photo as the user pulls down and to fade-out the rest of the business details page as the user pulls down
height/width, to scale the photo up as the user pulls down
Using Chrome Developer Tools, I manually fiddled with the CSS properties of the respective DOM elements to replicate the desired animation, and after some experimentation things looked okay. Still left to do: animate those CSS properties based on the user’s touch movements.
Handling Touch Movements
While researching touch events, I learned that, on mobile devices, three main events are triggered when a user touches the screen: touchstart, touchmove, and touchend.
Hence the following action plan to handle touch movements:
On touchstart: keep track of the initial y coordinate (let’s call it initialY) and store it for future comparisons.
On touchmove: get the current vertical coordinate (currentY) and compare:
if currentY > initialY, the user is pulling down on the photo. In this case, (currentY – initialY) represents how far the user has pulled down so far, and can serve as the basis for the CSS property updates (more on that later).
if currentY < initialY, the user is trying to scroll normally
On touchend: redirect the user to our photo viewer or animate the page back to its original state
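The plan above can be sketched as a small state tracker. This is illustrative only, not Yelp’s actual code; the function and event-wiring names are invented:

```javascript
// Tracks a single pull gesture: start/move/end mirror the three touch events.
function makePullTracker() {
  let initialY = null;
  return {
    // touchstart: remember where the finger landed
    start(y) { initialY = y; },
    // touchmove: positive delta means the user is pulling the photo down
    move(y) {
      const delta = y - initialY;
      return delta > 0 ? { pulling: true, distance: delta }
                       : { pulling: false, distance: 0 };
    },
    // touchend: decide whether to open the photo viewer or snap back
    end(y) {
      return y - initialY > 0 ? 'open-photo-viewer' : 'restore';
    },
  };
}

// Wiring it to real DOM events might look like:
// photo.addEventListener('touchstart', e => tracker.start(e.touches[0].clientY));
// photo.addEventListener('touchmove',  e => tracker.move(e.touches[0].clientY));
```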
After coding this, I started to test how well touch events integrate with the CSS animations.
Initial Test Results
As seen below, we were blowing through our 16ms frame budget, resulting in a significantly laggy and choppy animation.
This is a screenshot from Chrome’s profiling tool in frames mode while the pull-down animation is running. Each vertical bar represents a frame: its height indicates the time it took to compute, and its coloring indicates the type of work the browser did to compute it. Read more here.
Two things are causing this:
margin-top, height, and width are poor CSS properties to animate. Since updates can’t be offloaded to the GPU, animating on any of these properties takes a heavy toll on the browser, especially on mobile.
Each touchmove event triggers the rendering of a new frame. This is too much rendering work for the renderer, which explains the dropped frames and the choppy animation.
As Jon Raasch explains in a post on HTML5 Hub: “The renderer tends to choke on the large number of rendering tasks, and often isn’t able to render a certain frame before it has already received instructions to render the next. So, even though the browser renders as many frames of the animation as possible, the dropped frames still make for a choppy-looking animation, which is not to mention the performance implications of overloading the processor with as many tasks as possible.”
The fix was to animate cheaper, compositor-friendly CSS properties instead:
transform: translateY(), to control the photo’s top offset
opacity, to fade-in the photo as the user pulls down
opacity, to fade-out the rest of the business details page as the user pulls down
transform: scale(), to scale the photo up as the user pulls down
In addition to using more efficient CSS properties, I promoted each DOM element taking part in this animation to its own layer by styling them with transform:translateZ(0). This is essential because it offloads rendering work to the GPU and prevents layout thrashing (since the animated elements are on their own layers, the non-animated elements don’t need to be re-laid-out/re-painted).
To prevent frames from being dropped due to too many rendering requests, I used requestAnimationFrame. requestAnimationFrame takes a callback that executes right before the browser paints the next frame. Essentially, the browser pulls for work at each frame, instead of us pushing work at it for each new touch event. This lets concurrent animations fit into a single reflow/repaint cycle, and the resulting consistent frame rate makes the animation look much smoother.
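The touchmove-coalescing pattern can be sketched as follows. This is a hedged illustration, not the production code; the scheduler is written so the frame-applying logic is injectable and the rAF dependency is passed in:

```javascript
// touchmove handlers only record the latest pull distance; the browser then
// pulls that value once per frame via requestAnimationFrame.
function makeFrameScheduler(raf, applyFrame) {
  let latestDelta = 0;
  let ticking = false;
  return function onMove(delta) {
    latestDelta = delta;
    if (!ticking) {
      // Coalesce many touchmove events into a single frame's worth of work.
      ticking = true;
      raf(() => {
        ticking = false;
        applyFrame(latestDelta); // e.g. set transform/opacity here
      });
    }
  };
}

// In the browser it might be wired up like this (names illustrative):
// const onMove = makeFrameScheduler(
//   cb => window.requestAnimationFrame(cb),
//   d => { photo.style.transform = `translateY(${d}px) translateZ(0)`; });
```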
Problems solved, but could the implementation be better?
I had the essentials for a neat photo pull down animation. However, the animation was composed of several independent animations on different DOM elements. Manually computing CSS properties’ values at each frame was unnecessarily complex. I needed a more standard & organized solution to create and run DOM animations.
GitHub and Shifty to the Rescue!
Using Shifty would provide us with:
the calculation of CSS properties at a certain point in an animation, given a start/end value and a desired duration
the ability to easily seek to a certain point in an animation
However, the things Shifty wouldn’t provide us were:
the ability to directly apply the calculated CSS properties to specific DOM elements
the ability to orchestrate multiple animations simultaneously
How can we build on top of Shifty to help us with our use cases?
The first class is called DomTweenable. It’s the same as a Tweenable object from Shifty, except that you can attach a DOM element to the tween. When you seek to a specific point in the DomTweenable’s tween, the interpolated CSS properties are automatically applied to that DOM element.
The second class is called Timeline. It’s the same as a Tweenable object from Shifty, except that you can attach multiple DomTweenable objects at specific points in the Timeline’s tween. When you seek to a specific point in the Timeline’s tween, it seeks each of the DomTweenable objects to (specified position – its starting position on the timeline).
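A minimal stand-in for those two classes might look like this. To keep it self-contained, the sketch uses a simple linear interpolation instead of delegating to Shifty’s Tweenable, and all names and APIs are illustrative:

```javascript
// Tweens CSS numeric properties and applies them straight to a DOM element.
class DomTweenable {
  constructor(element, { from, to, duration }) {
    Object.assign(this, { element, from, to, duration });
  }
  seek(ms) {
    // Clamp progress to [0, 1] so seeking outside the tween is safe.
    const t = Math.min(Math.max(ms / this.duration, 0), 1);
    for (const prop of Object.keys(this.from)) {
      const value = this.from[prop] + (this.to[prop] - this.from[prop]) * t;
      this.element.style[prop] = value;
    }
  }
}

// Orchestrates several DomTweenables, each starting at its own offset.
class Timeline {
  constructor() { this.entries = []; }
  add(domTweenable, startsAt) { this.entries.push({ domTweenable, startsAt }); }
  seek(ms) {
    // Seek each child at (specified position - its starting position).
    for (const { domTweenable, startsAt } of this.entries) {
      domTweenable.seek(ms - startsAt);
    }
  }
}
```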
We now have an easy way to animate multiple DOM elements and create smooth animations! And hey, look! Every frame comfortably under our 16ms budget:
Thanks to Simon Boudrias and Arnaud Brousseau for the help!
At Yelp, we’re very big fans of Amazon’s RedShift data warehouse. We have multiple deployments of RedShift with different data sets in use by product management, sales analytics, ads, SeatMe and many other teams.
Calling RedShift a simple fork of Postgres 8.4 minimizes a lot of the work the RedShift team has done, but RedShift does share a common code ancestry with PG 8.4. This means that much of the advanced query functionality of Postgres is available, which, when combined with the petabyte scale of RedShift, offers some amazingly powerful analytics tools.
Because most of the PG 8.4 query syntax is available, I often find that directly referencing the Postgres 8.4 documentation for query syntax is more readable and useful than trying to navigate Amazon’s version of the same documentation.
CTEs (Common Table Expressions): Scaling Your Queries
When dealing with OLAP (online analytical processing, or warehousing) queries, especially with snowflake schemas, it’s very common for the number of joins in a query to get large. Snowflake schemas are those where dimension tables are designed to be joined to other dimension tables, which is typical when portions of a transaction schema are mirrored into the data warehouse. RedShift (and Postgres) are well optimized for large numbers of joins, but unfortunately our brains are not.
Correctness of analytics queries is paramount; basing your business decisions on faulty data can be an extremely costly mistake. One of the reasons SQL has gotten a bad reputation for doing analytics work is complexity; a traditional procedural language has functions and procedures that let you encapsulate and compose blocks of logic, while SQL does not.
CTEs (Common Table Expressions) bring this same power of encapsulation and composability to SQL; by allowing you to compose multiple independent queries into a single statement, you can write SQL that is much clearer and easier to verify and debug.
An Example CTE: Using The WITH Statement
One of the common things we have to do inside the SeatMe codebase is determine when a restaurant’s opening and closing times for various meals occur (internally referred to as scheduled shifts). It’s very common to compute things based on these scheduled times, such as how busy the restaurant is. A (much simplified) version of this query looks like:
The query itself, with its 2 joins, is understandable and independently verifiable. We then use this with a CTE in our analytics to compute things like reservations per shift.
Conceptually you’ve created a temporary table called scheduled_shifts with the results of the first query that you can join against in the second query.
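Since the original queries aren’t reproduced here, here is a hedged sketch of the shape of such a statement; the table and column names are invented, not SeatMe’s actual schema:

```sql
-- Illustrative only: table/column names are invented.
WITH scheduled_shifts AS (
    SELECT r.restaurant_id, s.shift_name, s.starts_at, s.ends_at
    FROM restaurant r
    JOIN shift_schedule ss ON ss.restaurant_id = r.restaurant_id
    JOIN shift s ON s.shift_id = ss.shift_id
)
-- Join reservations against the "temporary table" defined above
SELECT ss.restaurant_id,
       ss.shift_name,
       COUNT(res.reservation_id) AS reservations_per_shift
FROM scheduled_shifts ss
LEFT JOIN reservation res
       ON res.restaurant_id = ss.restaurant_id
      AND res.scheduled_time BETWEEN ss.starts_at AND ss.ends_at
GROUP BY ss.restaurant_id, ss.shift_name;
```

The inner query stays independently runnable and verifiable, which is exactly the encapsulation benefit described above.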
One of the benefits of using CTEs when composing queries is that if they are getting re-used frequently, you can create a view over the same statement. If performance of the statement being used in the CTE is a concern and the data can be cached without hurting correctness, you can also trivially create a temporary table with the results of the CTE with only minimal change and very low risk to the overall query correctness.
It does bear saying: CTEs in both RedShift and Postgres represent an optimization barrier. When using a CTE the optimizer is unable to perform optimizations across the query in the body of the CTE and the main query, though it does optimize each of them individually. While this can be an issue, in the real world we’ve found the conceptual benefits greatly outweigh the performance drawbacks.
Window Functions or “Dang, I didn’t know SQL could do that.”
PostgreSQL Window Functions, which are available in RedShift, are extremely complex and difficult to explain. My goal here is to give a broad overview of the concepts and enough information to encourage people to try them out. Ultimately you’ll need to read and refer to the PostgreSQL documentation on Window Functions and Window Function Calls, along with the tutorial when using them in your own queries.
Window functions are a special class of analytic functions that are applied to windows of rows. Windows are defined by an OVER (...) clause which defines a set of rows related to the current row to which the function applies. While that’s extremely abstract, the diverse functionality available from the different window functions doesn’t really lend itself to a simpler definition.
The main components of a window are:
PARTITION BY which is the logical analog of GROUP BY in a traditional query. Window functions that aggregate will do so across all rows in the partition.
ORDER BY which is required for certain functions that look forward or backward in the window.
The FRAME clause, which is harder to cover and which I won’t go into in depth. The frame is another logical concept used only by functions that operate relative to the frame (like first_value / last_value).
I think of window functions as falling into two categories:
Functions that are also available as traditional analytics functions, such as count, sum, avg, etc.
Functions that are only available when using windows, such as lead, lag, ntile, etc. These expose truly novel functionality unavailable without using windows.
For functions that are also available when using GROUP BY, the primary advantage of using them with window functions is it becomes possible to do multiple different grouping operations in a single query. When combined with the power of subqueries and CTEs, this can let you do very powerful business logic all in a single statement.
It would be natural to assume that doing multiple grouping operations in a single query would be just as costly in terms of execution time as doing multiple single operations. In practice, we haven’t seen this to be the case. There is of course a cost, but we typically see it be much smaller than a 100% overhead depending on the query and the grouping.
Window Function Examples: Using Multiple Traditional Aggregates
The following query illustrates the use of multiple count functions over different partitions to compute the percent of reservations that a given restaurant accounts for by locality (city). The things to note in this query are:
Use of DISTINCT: Since window functions append columns to each row, without a DISTINCT operator, the query will give you back 1 row for every row in the join. For this query, this would be 1 row per reservation rather than a row per restaurant and city.
The two count operations each have a different PARTITION BY clause, one counting by restaurant and the other counting by locality.
The final query, which references the two columns produced by the window function in a CTE and computes a percentage using them.
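A hedged sketch of what such a query might look like (table and column names invented):

```sql
-- Illustrative only. Two counts with two different PARTITION BY clauses;
-- DISTINCT collapses the per-reservation rows back to one row per
-- restaurant/locality pair.
WITH reservation_counts AS (
    SELECT DISTINCT
        r.restaurant_id,
        r.locality,
        COUNT(*) OVER (PARTITION BY r.restaurant_id) AS restaurant_reservations,
        COUNT(*) OVER (PARTITION BY r.locality)      AS locality_reservations
    FROM reservation res
    JOIN restaurant r ON r.restaurant_id = res.restaurant_id
)
SELECT restaurant_id,
       locality,
       100.0 * restaurant_reservations / locality_reservations AS pct_of_locality
FROM reservation_counts;
```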
Window Function Examples: Using ntile To Compute Percentiles
Frequently, Yelp needs to look at distributions of user activity and compute percentile buckets based on their activity. The query below uses the ntile function to augment a per-user count of lifetime review behavior. Things to note about this query:
The ntile(100) is PARTITION BY signup_country, so it will compute the percentile per signup country.
The ntile(100) is ORDER BY review_count, which means the rows will be bucketed in order of the review_count.
Each row gets a number from 1 to 100, which is the logical bucket the row falls into, added as a new column called percentile.
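A hedged sketch of such a query (table and column names invented):

```sql
-- Illustrative only: ntile(100) buckets users into percentiles of review
-- activity, computed separately within each signup country.
SELECT user_id,
       signup_country,
       review_count,
       ntile(100) OVER (PARTITION BY signup_country
                        ORDER BY review_count) AS percentile
FROM (
    SELECT u.user_id,
           u.signup_country,
           COUNT(rv.review_id) AS review_count
    FROM user_profile u
    LEFT JOIN review rv ON rv.user_id = u.user_id
    GROUP BY u.user_id, u.signup_country
) per_user_reviews;
```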
I’ve touched on two of the most powerful features for Redshift analytics, window functions and CTEs, but there’s a lot more functionality in Postgres, much of which is also in RedShift. One of my favorite Postgres sessions is Postgres: The Bits You Haven’t Found, which showed me a whole huge set of Postgres functionality, including first exposing me to window functions.
In addition, brushing up on your psql chops pays dividends over time as you become fluent with the advanced functionality in the Postgres CLI. If you write a lot of date-based reports, which I suspect we all do, I would also recommend digging into the date/time functionality (in particular date_trunc). date_trunc makes date-based roll-ups extremely fast and easy, letting you quickly truncate dates to useful units like months, quarters, and weeks.
2014 was a busy year for us at Yelp Engineering and with 2015 having just kicked off, we wanted to highlight some of our most exciting events from last year.
Yelp loves open source. We love contributing back to the developer and open source communities and last year was no exception to that. We saw a large number of our engineers contributing to many different projects. In July, to help organize all of the projects we worked on, we launched a new open source site along with the revamped developer site.
Over the course of the year, we open sourced 16 new projects, and saw 3,000 commits across the 48 projects we have on GitHub. Some of the biggest projects we released, including MOE (source), dockersh (source), pre-commit (source), and pyleus (source), saw a lot of excitement from the developer community and we’re very happy to have open sourced them!
Yelp Engineers at Grace Hopper
We partner a lot with the developer community to give people and groups ample opportunities to learn, interact, and share knowledge with each other. Throughout the year, our engineers attended 15 different conferences, like LAUNCH Scale and Grace Hopper. At LAUNCH Scale, VP of Engineering Michael Stoppelman shared a little bit about how we’ve scaled traffic at Yelp, discussing some of the changes we made to our infrastructure to handle the increased traffic over time. Our engineers returning from Grace Hopper were excited to share some of their experiences and feedback from the conference, hoping to help more women enter the tech industry.
One of the things we’re very proud of is our partnership with Women Who Code, which, in September, grew even stronger when we became one of the first official sponsors, helping them achieve their goal of connecting 1 million women in tech.
Three times a year, all the engineers at Yelp gather for two days of hacking, food, and fun. We put together some Yelpy puzzles, flew remote-controlled sharks, and designed our own henna tattoos. Last year, our engineers participated in three separate hackathons where we saw a total of 221 projects between the three.
On to 2015!
There’s already a lot of fun and exciting things planned for 2015 so if you want to be part of the fun follow us on Twitter and Facebook to catch all the announcements!
We use Macs a lot at Yelp, which means that we see our fair share of Mac-specific security alerts. Host based detectors will tell us about known malware infestations or weird new startup items. Network based detectors see potential C2 callouts or DNS requests to resolve suspicious domains. Sometimes our awesome employees just let us know, “I think I have like Stuxnet or conficker or something on my laptop.”
When alerts fire, our incident response team’s first goal is to “stop the bleeding” – to contain and then eradicate the threat. Next, we move to “root cause the alert” – figuring out exactly what happened and how we’ll prevent it in the future. One of our primary tools for root causing OS X alerts is OSXCollector.
OSXCollector is an open source forensic evidence collection and analysis toolkit for OS X. It was developed in-house at Yelp to automate the digital forensics and incident response (DFIR) our crack team of responders had been doing manually.
Performing Forensics Collection
The first step in DFIR is gathering information about what’s going on – forensic artifact collection, if you like fancy terms. OSXCollector gathers information from plists, sqlite databases, and the local filesystem, then packages it in an easy-to-read and easier-to-parse JSON file.
osxcollector.py is a single Python file that runs without any dependencies on a standard OS X machine. This makes it really easy to run collection on any machine – no fussing with brew, pip, config files, or environment variables. Just copy the single file onto the machine and run it. sudo osxcollector.py is all it takes.
Details of Collection
The collector outputs a .tar.gz containing all the collected artifacts. The archive contains a JSON file with the majority of the information. Additionally, a set of useful logs from the target system is included.
The collector gathers many different types of data including:
install history and file hashes for kernel extensions and installed applications
details on startup items including LaunchAgents, LaunchDaemons, ScriptingAdditions, and other login items
OS quarantine, the information OS X uses to show ‘Are you sure you wanna run this?’ when a user is trying to open a file downloaded from the internet
file hashes and source URL for downloaded files
a snapshot of browser history, cookies, extensions, and cached data for Chrome, Firefox, and Safari
user account details
email attachment hashes
The docs page on GitHub contains a more in depth description of collected data.
Performing Basic Forensic Analysis
Forensic analysis is a bit of an art and a bit of a science. Every analyst will see a bit of a different story when reading the output from OSXCollector – that’s part of what makes analysis fun.
Generally, collection is performed on a target machine because something is hinky: anti-virus found a file it doesn’t like, deep packet inspection observed a callout, endpoint monitoring noticed a new startup item, etc. The details of this initial alert – a file path, a timestamp, a hash, a domain, an IP, etc. – are enough to get going.
OSXCollector output is very easy to sort, filter, and search for manual forensic analysis. By mixing a bit of command-line-fu with some powerful tools like grep and jq, a lot of questions can be answered. Here are just a few examples:
Get everything that happened around 11:35
Just the URLs from that time period
Just details on a single user
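For instance, against a one-JSON-entry-per-line output file. The file name, entries, and key names below are illustrative, not real incident data:

```shell
# Build a tiny stand-in incident file to demo the technique.
printf '%s\n' \
  '{"date":"2014-01-01 11:35:01","url":"https://example.com/evil","osxcollector_username":"alice"}' \
  '{"date":"2014-01-01 09:00:00","file_path":"/tmp/ok","osxcollector_username":"bob"}' > incident.json

# Everything that happened around 11:35 (11:32-11:38):
grep '2014-01-01 11:3[2-8]' incident.json

# Just the URLs from that time period (if jq is installed):
# grep '2014-01-01 11:3[2-8]' incident.json | jq 'select(has("url")) | .url'

# Just details on a single user:
grep '"osxcollector_username":"alice"' incident.json
```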
Performing Automated Analysis with OutputFilters
Output filters process and transform the output of OSXCollector. The goal of filters is to make it easy to analyze OSXCollector output. Each filter has a single purpose. They do one thing and they do it right.
For example, the FindDomainsFilter does just what it sounds like: it finds domain names within a JSON entry. The domains are added as a new key to the JSON entry. For example, given the input:
the FindDomainsFilter would add an osxcollector_domains key to the output:
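A sketch of that transformation, with an illustrative entry:

```
Input entry:
{"url": "https://biz.yelp.com/login"}

After FindDomainsFilter:
{"url": "https://biz.yelp.com/login",
 "osxcollector_domains": ["biz.yelp.com", "yelp.com"]}
```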
This enhanced JSON entry can now be fed into additional OutputFilters that perform actions like matching domains against a blacklist or querying a passive DNS service for domain reputation information.
FindDomainsFilter
Finds domain names in OSXCollector output and adds an osxcollector_domains key to JSON entries.
FindBlacklistedFilter
Compares data against user-defined blacklists and adds an osxcollector_blacklist key to matching JSON entries.
Analysts should create blacklists for domains, file hashes, file names, and any known hinky stuff.
RelatedFilesFilter
Breaks an initial set of file paths into individual file and directory names and then greps for those terms. The RelatedFilesFilter is smart and ignores usernames and common terms like bin or Library.
This filter is great for figuring out how evil_invoice.pdf ended up on a machine. It’ll find browser history, quarantines, email messages, etc. related to a file.
ChromeHistoryFilter and FirefoxHistoryFilter
Builds a really nice browser history sorted in descending time order. The output is comparable to looking at the history tab in the browser but contains more info such as whether the URL was visited because of a direct user click or visited in a hidden iframe.
Threat API Filters
OSXCollector output typically has thousands of potential indicators of compromise like domains, URLs, and file hashes. Most are benign; some indicate a serious threat. Sorting the wheat from the chaff is quite a challenge. Threat APIs like OpenDNS, VirusTotal, and ShadowServer mix confirmed threat intelligence with heuristics to augment and classify indicators and help find the needle in the haystack.
OpenDNS RelatedDomainsFilter
Looks up an initial set of domains and IPs with the OpenDNS Umbrella API and finds related domains. Threats often involve relatively unknown domains or IPs, but the second generation of related domains often relates back to known malicious sources.
OpenDNS & VirusTotal LookupDomainsFilter
Looks up domain reputation and threat information in VirusTotal and OpenDNS.
These filters use a heuristic to determine what is suspicious. It can create false positives, but usually a download from a domain marked as suspicious is a good lead.
ShadowServer & VirusTotal LookupHashesFilter
Looks up hashes with the VirusTotal and ShadowServer APIs. VirusTotal acts as a blacklist of known malicious hashes while ShadowServer acts as a whitelist of known good file hashes.
AnalyzeFilter – The One Filter to Rule Them All
AnalyzeFilter is Yelp’s one filter to rule them all. It chains all the previous filters into one monster analysis. The results, enhanced with blacklist info, threat API data, related files and domains, and even pretty browser history, are written to a new output file.
Then the Very Readable Output Bot takes over and prints out an easy-to-digest, human-readable, nearly-English summary of what it found. It’s basically equivalent to running all of the previous filters chained together and then letting a wise-cracking analyst explain the results to you. The Very Readable Output Bot even suggests new values to add to your blacklists.
This thing is the real deal and our analysts don’t even look at OSXCollector output until after they’ve run the AnalyzeFilter.
Yelp has used MySQL since the beginning, and in more recent years has adopted the Percona fork. We like MySQL because it fits our data needs and is configurable, fast, and scalable. The Percona fork includes performance enhancements and additional features on top of stock MySQL; we have used their training services in the past and are currently a support customer. MySQL also has a *huge* online community. I love that a quick search using your favorite search engine will quickly get you started down the right path to solving most problems!
A Little More Detail
We’re running roughly 100 database servers between our dev, staging, and production environments. Our database is split up into several functional shards. Different database servers house different data sets, with different tables and schemas, but all of the data for a single table resides within one shard. The InnoDB storage engine provides us with the ability to handle a fair number of concurrent transactions at a time on masters, and we rely upon it for our transactional consistency. To scale database reads, we use MySQL statement-based replication to propagate data from master DBs to local and remote replicas, which reside in 2+ read-only data centers/locations. Our replication hierarchy uses an intermediate master database in each datacenter to reduce replication bandwidth and to facilitate easy master promotions:
MySQL is also extremely portable, running on almost anything, and has a massive number of variables to tune your performance if you’re inclined. We’re a Python shop, so we use the MySQLdb connector, and SQLAlchemy.
Tools we use with MySQL
At Yelp, we have a saying: “it’s not in production until it’s monitored.” We monitor MySQL with Nagios and Sensu, and graph metrics using Ganglia, Graphite, and Grafana. We also use the Percona Toolkit‘s command-line utilities to analyze database query traffic, log deadlocks, collect data during times of database stress, verify data consistency between replicas and MySQL versions, perform online schema changes, and kill poorly performing queries.
Yelp DBAs <3 Developers: schema changes
In order to keep Yelp fresh and deliver new features, we push new versions of our application multiple times a day. This means we have to support frequent schema changes too! There are two rules for our schema changes to keep things moving smoothly:
They must be backwards-compatible with the previous version of the application
They must either be online operations, or we must be able to run them using pt-online-schema-change
What’s so important about backwards-compatible schema changes? While I love dropping columns and tables just as much as (if not more than!) the next person, part of pushing often is also being able to safely roll back. If we need to do something like drop a column, we will first deploy code that no longer uses it, then rename the column, verify that things look good, and finally drop it, each over its own push. We also want to keep in mind that neither database changes nor rolling out a new version of our application happens in a single instant. We always make our database/schema changes before the code deployment part of a push, and there can be a period of time during deployment when two versions of our application are running.
Thus we make our database changes before the code is deployed, and we use pt-online-schema-change to prevent replication delay as the schema change is made. pt-online-schema-change alters a table online by:
Creating an empty copy of the table to alter
Running the alter statement against the new, empty copy
Copying the rows from the original table into the new table
Replacing the original table with the new one
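The four steps above roughly correspond to SQL like the following sketch. This is a simplification with invented table/column names; the real tool also creates triggers so that writes to the original table reach the copy mid-migration, and copies rows in chunks:

```sql
-- 1. Create an empty copy of the table to alter
CREATE TABLE _mytable_new LIKE mytable;

-- 2. Run the alter statement against the new, empty copy
ALTER TABLE _mytable_new ADD COLUMN notes TEXT;

-- 3. Copy the rows from the original table into the new table
INSERT INTO _mytable_new (id, data) SELECT id, data FROM mytable;

-- 4. Atomically replace the original table with the new one
RENAME TABLE mytable TO _mytable_old,
             _mytable_new TO mytable;
```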
Yelp DBAs <3 Developers: queries and ORMs
This wasn’t part of my original presentation but there was an interesting question asked after my talk: do we use an ORM (Object Relational Mapping) or do we write out all of our SQL as direct queries or stored procedures? As I mentioned above, we do use SQLAlchemy, and while we have some hand-written queries, we prefer to use the ORM. This debate exists because ORMs can write some, well, interesting (read: awful) queries that can really hurt database performance. However, they also greatly enhance developer productivity! We choose to favor getting new tables and the code that uses them developed and out in the wild quickly, and then iterate from there if we find that specific queries need to be optimized by hand.
We have a couple of practices and/or tools that assist us here:
Monitoring: we monitor our databases to alert us if application or database timings change
We have a “gross query checker,” which I’ve blogged about before: it alerts us to poorly performing queries before we leave the development environment
We run pt-kill: a tool from the Percona Toolkit that will kill long-running queries in production
In the slides, you can also find a couple of notes about backups, where to find additional resources when working with MySQL and some book titles I’ve read that I think are worth checking out!
The open, supportive community surrounding MySQL fits in well with Yelp Engineering’s culture of collaboration, playing well with others, and technical tenacity. On top of providing the databases, documentation, and tools, it’s important to make sure that our developers have an easy venue to discuss ideas and ask questions. Once a week, the DBA team at Yelp hosts DBA office hours – a time when we’re available to answer any and all database questions, talk about performance, swap ideas, or just generally chat.