03/31/2014

MySQL replication, error 1159, and why you might want to upgrade

At Yelp, we use MySQL replication to provide the performance we need to serve our 120 million monthly unique visitors. In this post Josh Snyder walks us through some of the internals of MySQL’s replication system, and discusses an issue that we saw in production.


MySQL replication and the slave I/O thread’s purpose

On a MySQL slave, the task of replication is split between two threads: the I/O and SQL threads. The I/O thread’s job is to download the log of events executed on the master and write them to disk in the relay log. The SQL thread then reads and executes the downloaded events on the slave. For the purposes of this post we’ll discuss the I/O thread.

The slave I/O thread connects to a master server using the MySQL protocol, just like any other client. It executes certain housekeeping queries on the master, and immediately afterward asks for a dump of the binary log starting at a specific log file and position. Once the slave I/O thread requests the binlog dump, the connection will be devoted entirely to the binlog stream until the TCP stream is closed. This means that the results of the housekeeping queries mentioned above will need to be cached for use during the entire lifetime of the TCP stream. Practically speaking this is not a problem, because it’s very cheap to break and reestablish the connection.

What happens when the connection breaks?

Nothing right away. The MySQL slave can’t recognize the difference between silence due to a lack of events in the binary log and silence due to connection issues.

The slave I/O thread can take a guess, though, which is where slave_net_timeout and MASTER_HEARTBEAT_PERIOD come in.

The manual describes slave_net_timeout as, “the number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect.” This is correct, but the actual behavior is somewhat more subtle. As implemented, slave_net_timeout is the number of seconds that any read on the slave I/O socket will block before timing out. It’s passed directly to setsockopt(2) as SO_RCVTIMEO. Therefore, if the I/O thread ever spends more than slave_net_timeout seconds reading from its socket, the read will time out and an error will be returned. MySQL handles this error internally as ER_NET_READ_INTERRUPTED, and its ultimate reaction is to terminate the connection and initiate a new one.
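To make the timeout behavior concrete, here is a minimal Python sketch of a blocking read that gives up after a fixed number of seconds. It uses Python's socket.settimeout() rather than SO_RCVTIMEO itself, so it only approximates what MySQL does. The helper name read_or_give_up and the socketpair standing in for the master/slave connection are assumptions made for illustration, not MySQL code.

```python
import socket


def read_or_give_up(sock, timeout_seconds):
    """Return the bytes read, or None if nothing arrived within the timeout."""
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # This is the point where the slave would drop the connection
        # and start a new one.
        return None


def demo():
    # A connected socket pair stands in for the master/slave TCP stream.
    master_side, slave_side = socket.socketpair()
    try:
        idle = read_or_give_up(slave_side, 0.2)   # master is silent
        master_side.sendall(b"event")
        busy = read_or_give_up(slave_side, 0.2)   # master sent something
        return idle, busy
    finally:
        master_side.close()
        slave_side.close()
```

Note that the reader cannot tell from the timeout alone whether the peer was idle or unreachable, which is exactly the ambiguity described above.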

A small value of slave_net_timeout could then pose a problem. If slave_net_timeout is two seconds, MySQL will terminate and reconnect the replication stream whenever the master spends more than two seconds without publishing an event to its binary log. To remedy this, the slave can request that the master send periodic heartbeats using the MASTER_HEARTBEAT_PERIOD replication setting (which defaults to half of slave_net_timeout). If the master heartbeat period elapses and the master has no events to send, it will insert a Heartbeat_log_event into the binary log stream. These events go over the wire, but are never written to disk; they don’t even get recorded in the slave’s relay log.

Heartbeat events serve to “fill the silence” created by an idle master. Busy masters (i.e. those that always have an event to send within one heartbeat period) will never send heartbeat events. This can be verified on a slave by checking that the Slave_received_heartbeats status variable is never incremented when the master is busily accepting writes.
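The "fill the silence" rule can be sketched as a small simulation. This is a toy model, not the master's actual scheduling code: the function name binlog_stream, the integer timestamps, and the choice to let a real event win when it lands exactly on a heartbeat boundary are all assumptions.

```python
def binlog_stream(event_times, heartbeat_period, end_time):
    """Return the (time, kind) packets a master would send up to end_time.

    A heartbeat is emitted only when heartbeat_period elapses with nothing
    sent; real events reset the clock.
    """
    packets = []
    last_sent = 0

    def fill_silence(until):
        nonlocal last_sent
        while last_sent + heartbeat_period < until:
            last_sent += heartbeat_period
            packets.append((last_sent, "heartbeat"))

    for t in sorted(event_times):
        fill_silence(t)
        packets.append((t, "event"))
        last_sent = t
    fill_silence(end_time)
    return packets


# A busy master (one event per second) never needs a heartbeat.
busy = binlog_stream([1, 2, 3, 4, 5], heartbeat_period=2, end_time=6)

# An idle master fills the gap between its two events with heartbeats.
idle = binlog_stream([1, 10], heartbeat_period=2, end_time=12)
```

In the idle case the slave never waits longer than one heartbeat period for a packet, so a slave_net_timeout of twice that period is not tripped by an idle but healthy master.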

What does a slave reconnection look like?

I set up a master-slave MySQL pair within Docker containers, with an active binary log stream coming down from the master. I set slave_net_timeout to 30 seconds and then broke replication by removing the replication TCP flow from the tables of my stateful firewall.

Using iptables/conntrack to break a replication connection

First, set up a firewall rule that uses conntrack to disallow all invalid TCP sessions:

root@mysql_master # iptables -A INPUT -p tcp --dport 3306 -m state --state INVALID -j LOG
root@mysql_master # iptables -A INPUT -p tcp --dport 3306 -m state --state INVALID -j DROP

Be sure to disable conntrack’s TCP loose mode, which will otherwise allow TCP ACKs to establish valid connections:

root@mysql_master # echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

Then simply delete the conntrack entry for MySQL replication, and our stateful firewall will block any MySQL TCP sessions that were in flight:

root@mysql_master # conntrack -D -p tcp --dport 3306

With slave_net_timeout set to 30 seconds, I broke replication using the steps above. As expected, the slave I/O thread stopped advancing. Seconds_Behind_Master displays zero, because the slave SQL thread is not “actively processing updates”. Nothing of consequence happens until slave_net_timeout elapses.

Once that happens, MySQL reconnects and Seconds_Behind_Master spikes to...wait what?

Seconds_Behind_Master: 1634
...
Last_IO_Errno: 0

How did that happen? We spent only the thirty-second slave_net_timeout disconnected from the master host, but the slave SQL thread thinks it’s a whopping 1634 seconds behind. This is due to MySQL bug #66921, discussed by Domas Mituzas here. Succinctly, when the connection is reestablished, the slave SQL thread re-executes the format description event from the master’s binary log, noting its timestamp as the latest executed event. Seconds_Behind_Master becomes equal to the age of the format description event in the master’s binary log file, which in this case was 1634 seconds.

MySQL doesn’t bother to print to the error log that the I/O thread recovered from a connection failure. It considers these kinds of issues routine, so it gracefully recovers without comment. This presents an issue for a database administrator who would like to know how frequently reconnects are occurring. So far the only way I’ve found to gather this information is to interrogate the master about the age of its ‘Binlog Dump’ threads. [1]

select id, host, time_ms, date_sub(now(), interval time second) as started
from information_schema.processlist
where command = 'Binlog Dump'
order by time_ms;
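If you poll that query on a schedule, the reconnect count can be derived from how the thread's start time moves between polls. The sketch below assumes you have already collected (poll_time, thread_age) pairs in seconds for one slave's 'Binlog Dump' thread. The function name, the sample data, and the tolerance are hypothetical, and several reconnects between two polls will be counted as one.

```python
def count_reconnects(samples, tolerance=2):
    """Count how often a Binlog Dump thread's start time jumped forward.

    samples: (poll_time, thread_age) pairs in seconds, in polling order.
    """
    reconnects = 0
    previous_start = None
    for poll_time, thread_age in samples:
        started = poll_time - thread_age
        if previous_start is not None and started - previous_start > tolerance:
            # The thread is younger than it should be: the slave reconnected.
            reconnects += 1
        previous_start = started
    return reconnects


# Polled once a minute; the thread age resets twice.
samples = [(0, 500), (60, 560), (120, 15), (180, 75), (240, 3)]
reconnect_count = count_reconnects(samples)
```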

In some cases, the I/O thread’s attempt to reconnect will fail. This will cause a cheery error message to appear in the error log:

[ERROR] Slave I/O: error reconnecting to master 'repl@169.254.0.5:3306' - retry-time: 60 retries: 86400, Error_code: 2003

Error code 2003 is the familiar “Cannot connect to MySQL server” error. Replication isn’t currently happening, but MySQL will gracefully handle the situation by continuing its attempts to reconnect to the master [2]. The error message means that the slave I/O thread will reconnect at 60 second (MASTER_CONNECT_RETRY) intervals, and will persist in doing so 86400 (MASTER_RETRY_COUNT) times.
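As a rough model of that retry policy, the loop below calls a connect function until it succeeds or the retry budget runs out. It is a sketch under assumptions: the function name reconnect, the injectable sleep, and the exact ordering of attempts and sleeps are mine, and MySQL's real loop may differ in detail.

```python
import time
from datetime import timedelta


def reconnect(connect, connect_retry=60, retry_count=86400, sleep=time.sleep):
    """Call connect() until it succeeds; return (connection, failed_attempts)."""
    for failed_attempts in range(retry_count):
        try:
            return connect(), failed_attempts
        except ConnectionError:
            # Corresponds to waiting MASTER_CONNECT_RETRY seconds.
            sleep(connect_retry)
    return None, retry_count


# Time spent sleeping between attempts before the slave gives up,
# using the values from the log line above.
max_retry_window = timedelta(seconds=60 * 86400)
```

With the values in the log line, the slave keeps trying for 60 days before giving up, which is why a failed reconnect is usually a delay rather than an outage.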

An annoying outage

On one occasion, packet loss due to a network provider issue caused some of Yelp’s MySQL 5.5 series databases to print different and scarier messages:

[ERROR] Slave I/O: The slave I/O thread stops because a fatal error is encountered when it try to get the value of SERVER_ID variable from master. Error: , Error_code: 1159
[ERROR] The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: Error_code: 1593

perror informs us that these errors are network timeout and fatal slave errors, respectively.

$ perror 1159 1593
MySQL error code 1159 (ER_NET_READ_INTERRUPTED): Got timeout reading communication packets
MySQL error code 1593 (ER_SLAVE_FATAL_ERROR): Fatal error: %s

In both cases, MySQL did not gracefully handle the situation. The slave I/O thread did not attempt to reconnect: it was stone dead. A human had to intervene to restart replication. These problems arose because the I/O thread wasn’t correctly handling errors when doing its start-of-connection housekeeping queries. Our investigation revealed that each error message implicates a different code site.

The first error, ER_NET_READ_INTERRUPTED, is already familiar to us. It’s the error that occurs when slave_net_timeout elapses without receiving any data from the master. In this case, the replication client executed SHOW VARIABLES LIKE 'SERVER_ID', and did not receive a response in time. As explained above, MySQL drops the connection and reconnects when it encounters this error during a binlog dump. At this specific code site, MySQL handles errors a little differently. Instead of catching and handling ER_NET_READ_INTERRUPTED, the code checks for any network error. The code that matches the error codes looks like:

bool is_network_error(uint errorno)
{
  if (errorno == CR_CONNECTION_ERROR ||
    errorno == CR_CONN_HOST_ERROR ||
    errorno == CR_SERVER_GONE_ERROR ||
    errorno == CR_SERVER_LOST ||
    errorno == ER_CON_COUNT_ERROR ||
    errorno == ER_SERVER_SHUTDOWN)
    return TRUE;
  return FALSE;
}

As you can see, ER_NET_READ_INTERRUPTED is missing from this list. Because of this, MySQL decided to terminate the slave I/O thread instead of gracefully reconnecting.
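To see the effect of the missing entry, here is the same check restated in Python, once with the original list and once with the timeout error added. This only illustrates the idea; it is not the Percona patch, which may be structured differently. The numeric values are the codes these names carry in MySQL's client and server error headers, and the set names are hypothetical.

```python
# Error codes as defined in MySQL's errmsg.h / mysqld_error.h.
CR_CONNECTION_ERROR = 2002
CR_CONN_HOST_ERROR = 2003
CR_SERVER_GONE_ERROR = 2006
CR_SERVER_LOST = 2013
ER_CON_COUNT_ERROR = 1040
ER_SERVER_SHUTDOWN = 1053
ER_NET_READ_INTERRUPTED = 1159

# The list from the C++ snippet above.
ORIGINAL_NETWORK_ERRORS = frozenset({
    CR_CONNECTION_ERROR,
    CR_CONN_HOST_ERROR,
    CR_SERVER_GONE_ERROR,
    CR_SERVER_LOST,
    ER_CON_COUNT_ERROR,
    ER_SERVER_SHUTDOWN,
})

# Hypothetical variant that also treats a read timeout as retryable.
NETWORK_ERRORS_WITH_TIMEOUT = ORIGINAL_NETWORK_ERRORS | {ER_NET_READ_INTERRUPTED}


def is_network_error(errorno, retryable=ORIGINAL_NETWORK_ERRORS):
    """Return True if the I/O thread should reconnect instead of stopping."""
    return errorno in retryable
```

With the original set, error 1159 falls through to the fatal path; with the extended set it is handled like any other dropped connection.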

The second error was similar. Setting the heartbeat period on the master is one of the housekeeping queries done by the slave before requesting a binlog dump. The query looks like SET @master_heartbeat_period=%s. Due to the way this query’s result was handled, any error at all would cause death of the slave I/O thread.

We worked with Percona to get fixes for these two issues included in the Percona Server 5.5.36 release. The patch, written by Vlad Lesin, also includes numerous test cases to exercise this behavior.

Rubber Meets Road

I wanted to see Vlad’s change in practice, so I constructed a testbed to mimic packet loss. I chose some very aggressive settings in an attempt to get a high frequency of connection breakage.

On the master, I left my existing stateful firewall rules in place. I then emulated 60% packet loss:

tc qdisc add dev eth0 root netem drop 60%

On the slave, I set

SET GLOBAL slave_net_timeout=2;
CHANGE MASTER TO master_connect_retry=1;

I continuously wrote into the master at a rate of 5 queries/second, with each event being 217 bytes. This translates to 1085 bytes/second of binlog traffic.

Using Percona Server 5.5.36 as a baseline, I used tcpdump to determine that these network conditions caused MySQL to reconnect 74 times over a five minute period. At no point did the I/O thread fail entirely. Percona Server 5.5.35 is a different story. Over a five minute period under the same conditions, MySQL reconnected 69 times. On five occasions the I/O thread failed completely:

3x [ERROR] Slave I/O: The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: , Error_code: 1593
2x [ERROR] Slave I/O: The slave I/O thread stops because a fatal error is encountered when it try to get the value of SERVER_ID variable from master. Error: , Error_code: 1159

Based on these results, I’d say the fix is effective. Moreover, it seems unlikely that there are any further bugs of this kind in the reconnection code.

In conclusion

I was very happy to have a checkout of MySQL handy while debugging this. In general, I’ve found the MySQL replication source code to be readable enough (if not simple), especially when you’re searching for a specific error code. Most importantly, though, I wouldn’t be able to live without the ability to spin up a throwaway MySQL instance for testing.

Footnotes

  [1] Well, I lied. I know of one other way. You can configure replication in a way that causes the slave I/O thread to print a warning when it connects. The warning will be printed to the slave’s error log. I’ve seen this, for instance, when a MySQL 5.6 replica connects to a MySQL 5.5 master.
  [2] Whether your application will gracefully handle the resulting replication delay is a different story entirely.

03/06/2014

March Events at Yelp HQ: Python Madness & More!

Love Python but can’t make it all the way to Montreal for PyCon this year? Never fear, Yelp is hosting a sneak peek of many of the talks! Instead of buying plane tickets, a hotel room, and a conference pass, you can kick back, drink beer, and have pizza with us! Next week’s Python meetups, both PyLadies and SF Python, will be hosting practice talks for the conference. Let’s dive into two of the talks: a novice session on machine learning, and an intermediate talk on working with external languages.

At a high level, Melanie Warrick will get us started on machine learning. “Big Data” and analytics don’t have to be just scary buzzwords. Python is a great language to start tinkering in this area, and Melanie will show us how! Join us March 11th at the PyLadies event to learn more. Note, this event is open to both PyLadies and PyGents!

On the low-level side of things, Christine Spang will describe best practices in invoking subprocesses and wrapping C code. Subprocesses can be used to run other binaries, shell processes, or even other Python processes you’d like to keep in a separate namespace. But understanding all the options is important to using that power effectively and correctly. “Dropping down” to C code can be an effective way to speed up a Python program, and Python provides an idiomatic way to interact with highly optimized libraries. SF Python will host this talk and others on March 12th.

And there’s plenty more where that came from! Lada Adamic, a data scientist at Facebook, will model information spread at the Products that Count meetup, and at Designers + Geeks we’ll hear about cutting edge techniques to create narrative-driven experiences in the world around us.

See you at Yelp!

02/27/2014

Introducing RequestBucketer: A system for putting HTTP requests in named buckets

JR Heard has many big projects under his belt, and this week we'll get to learn about one of the most recent. Yelp pushes new code almost every day, so it's no surprise we get new features every week. But how do we make sure they're working as intended? JR describes one element of our solution below!


Let's talk about features. Building new features is super fun. Improving pre-existing ones is fantastic, too. What would be less fantastic would be if your new feature turned out to crumble under production load, or if your untested-gut-feeling improvement to an old feature ended up causing people to use it less. Here at Yelp, we don’t have to worry about that too often, thanks to a system we use both for rolling out new features and for allocating percentages of traffic into the different branches of our A/B tests. Let me tell you about it!

Context

Over the years, we’ve found that the best way to build a big new feature is to break it into small pieces and push each bit to production as it’s completed. There are about a thousand reasons for this, most of which will be familiar to those who’ve worked on a large, long-lived software project (Facebook and The Guardian know what I’m talking about). Fast iteration cycles mean that we get to see how our feature works in the wild much more quickly; on top of that, no matter how well-tested your code is, there’s just no substitute for the peace of mind you get from seeing it run on live traffic.

Of course, when we’re working on a giant new feature that completely replaces an existing page (e.g. our homepage redesign a year and a half ago, not to mention our recent business page redesign!), we can’t just suddenly replace the old page with a blank “Hello world!” page and ask that our users bear with us for a few months. Instead, for each big feature like this, we used to end up writing a function that looked something like:

def should_show_new_homepage(self, request):
  if internal_ip(request.remote_addr):
    return True

  if self.user_id in config.homepage_rollout_user_ids:
    return True

  if self.device_id in config.homepage_rollout_device_ids:
    return True

  if self.user.is_elite and config.homepage_rollout_is_active_for_elite_users:
    return True

  return False

This function lets us control who gets to see our new feature-in-progress; essentially, it implements the logic that lets us whitelist a request into seeing our new feature. So this is great - the only people who get to see our feature-in-development are the people who are supposed to be seeing it, and our users don’t have to put up with an unfinished feature while we implement a redesign.

The catch here is that we’ve got a lot of people working on lots of features. Writing one of these functions from scratch for each feature was a clear violation of the DRY principle. Worse: even though we code-review every line of code we write before shipping it, none of us was comfortable with the possibility of accidentally launching an incomplete feature due to mistakenly including a `not` in the wrong place the fiftieth time we wrote one of these functions. We decided to build a tool to solve this problem once and for all.

Design Constraints

Our ideal tool would be something that took in a string like ‘foo_shiny_feature’ and returned a string like ‘enabled’, ‘disabled’, or possibly some other string(s) depending on the semantics of the feature being gated. Our solution would have to satisfy the following requirements:

Traffic allocation
We should be able to say that, for instance, 5% of traffic gets to see our new feature and 95% of traffic doesn’t.
Extensible whitelisting
We should be able to whitelist users into (or out of!) a particular feature in a number of ways (more on this below), and it should be very simple for maintainers to add new ways to whitelist requests.
Speed
We should be able to quickly ask about the status of hundreds of features/experiments over the course of serving a Web request.
Idiot-proof
One of the main motivations behind building this tool was to minimize the chance of accidentally launching an in-progress feature, so it should have as little room for operator error as possible.
Multi-Purpose
We would want to use this tool for other things besides feature rollouts: for instance, we would also like to use it to distribute traffic among cohorts in A/B tests.

 

We came up with a solution we call RequestBucketer, and we’ve been using it in production for about a year now. You interact with it like this:

request_bucketer.get_bucket('my_shiny_button_experiment')
# => 'bright_red'

request_bucketer.is_feature_enabled('new_biz_page')
# => True

request_bucketer.is_feature_disabled('a_service_being_load_tested')
# => False

RequestBucketer

RequestBucketer gets its name because it lets you say: “My feature has these four buckets; these two buckets have special whitelisting behavior; and here are all four buckets’ traffic percentages. Here’s an HTTP request: what bucket does it fall into?”

Let’s be more specific about what I mean when I talk about how buckets can have “special whitelisting behavior.” Toward the start of a new feature’s life, we want to make sure that the only people who actually see that new feature are the engineers working on it. We can do this in a couple of ways:

  • We can whitelist access to the feature based on the ID of a request’s logged-in user, so that engineers can see the feature from their home computer if they’re logged into yelp.com.
  • That doesn’t let our engineers test out how the feature behaves for logged-out users - to cover that case, we can whitelist access to the feature based on a device-specific ID.

 

Later on, once the feature’s working well enough that it can be beta-tested by other folks, we have a couple of other whitelisting tools at our disposal:

  • We can say that any request that originates from within our internal corporate network gets to see our new feature, but usually we won’t want to do this until the feature is pretty fully-functional, so that other departments don’t have to deal with our feature-in-progress.
  • We also like to roll out features to certain types of logged-in users. For instance, when we added the ability for users to write reviews from their mobile devices, our Elites got to play with that feature weeks before anyone else. We also have a team of Community Managers in cities across the globe, and we love to collect early feedback on new features by giving our CMs early access.

 

RequestBucketer is backed by a simple YAML file with a bunch of entries (we call them BucketSets) that look like this:

foo_shiny_feature:
   type: *FEATURE_RELEASE # as opposed to, for instance, *EXPERIMENT
   buckets:
    disabled:
      percentage: 90
      whitelist:
        user_ids:
          - *JRHEARD_USER_ID # jrheard is a curmudgeon, doesn't want the new version until it's done
    dark_launch:
      percentage: 0
    enabled:
      percentage: 10
      whitelist:
        user_types:
          - *ELITE
          - *COMMUNITY_MANAGER
        ip_ranges:
          - *ALL_INTERNAL_IPS
        yuvs:
          - *CONSUMER_YUVS
          - *PM_YUVS
        user_ids:
          - *MOBILE_TEAM_USER_IDS
          - *WING_USER_ID
          - *A_HOUSECAT_USER_ID

In the example above, when we check what bucket a given request falls into for `foo_shiny_feature`, we’ll first check the buckets’ whitelists. For instance, if my boss Wing is logged in, he’ll be in the ‘enabled’ bucket, guaranteed. If a request isn’t whitelisted into any buckets at all (e.g. it’s made from an IP outside of the Yelp corporate network and doesn’t have a whitelisted user-id or device ID), we’ll fall back to the buckets’ traffic percentages. As you’d expect, 10% of those requests will be assigned to the ‘enabled’ bucket, and the other 90% will be assigned to the ‘disabled’ bucket.
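The lookup order described above (whitelists first, then traffic percentages) can be sketched in a few lines of Python. The post does not say how RequestBucketer maps a request onto a percentage, so the hashing scheme here is an assumption, as are the names percentage_slot and request_key, the user IDs, and the cut-down config that only whitelists by user ID.

```python
import hashlib

# Simplified stand-in for the YAML entry above; the IDs are made up.
BUCKET_PERCENTAGES = {"disabled": 90, "dark_launch": 0, "enabled": 10}
USER_ID_WHITELISTS = {"disabled": {7}, "enabled": {101, 102}}


def percentage_slot(bucket_set_name, request_key):
    """Map a request deterministically onto a slot in the range 0..99."""
    raw = f"{bucket_set_name}:{request_key}".encode("utf-8")
    return int(hashlib.sha256(raw).hexdigest(), 16) % 100


def get_bucket(bucket_set_name, request_key, user_id=None):
    # 1. A whitelist match wins outright.
    for bucket, user_ids in USER_ID_WHITELISTS.items():
        if user_id in user_ids:
            return bucket
    # 2. Otherwise fall back to the traffic percentages.
    slot = percentage_slot(bucket_set_name, request_key)
    upper_bound = 0
    for bucket, percentage in BUCKET_PERCENTAGES.items():
        upper_bound += percentage
        if slot < upper_bound:
            return bucket
    raise ValueError("bucket percentages must sum to 100")
```

Hashing the bucket set name together with a stable request key keeps a given device in the same bucket across requests, while different features split traffic independently of one another.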

“Hold on a second,” astute readers say - “what happens if jrheard is logged in and is making a request from an internal IP?” Great question! To deal with situations like this, RequestBucketer has a simple concept of a “whitelist match specificity.” Simply put: some types of whitelisting are more specific than others - a device ID is more specific than a logged-in user ID, and a logged-in user ID is more specific than an IP range. If a request has a whitelist match in multiple buckets, the bucket with the most specific match wins. This is all easily configurable, and as you teach RequestBucketer about new ways to whitelist requests, it’s super-simple to teach it how specific these new whitelist matches are - it looks a lot like this:

WHITELIST_MATCH_SPECIFICITY_ORDERING = [
    WhitelistMatchSpecificity.SPECIFIC_DEVICE,
    WhitelistMatchSpecificity.SPECIFIC_USER,
    WhitelistMatchSpecificity.TYPE_OF_USER,
    WhitelistMatchSpecificity.IP_RANGE,
]
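Given an ordering like that, picking the winning bucket is a one-liner. The sketch below is a guess at the shape of the logic, not RequestBucketer's code: the enum values, the resolve_bucket name, and the idea of passing in a mapping from bucket name to match specificity are all assumptions.

```python
from enum import Enum


class WhitelistMatchSpecificity(Enum):
    SPECIFIC_DEVICE = "specific_device"
    SPECIFIC_USER = "specific_user"
    TYPE_OF_USER = "type_of_user"
    IP_RANGE = "ip_range"


# Most specific first, as in the listing above.
WHITELIST_MATCH_SPECIFICITY_ORDERING = [
    WhitelistMatchSpecificity.SPECIFIC_DEVICE,
    WhitelistMatchSpecificity.SPECIFIC_USER,
    WhitelistMatchSpecificity.TYPE_OF_USER,
    WhitelistMatchSpecificity.IP_RANGE,
]


def resolve_bucket(matches):
    """Pick the bucket whose whitelist match is the most specific.

    matches maps bucket name -> specificity of that bucket's match.
    Returns None when no bucket whitelisted the request.
    """
    if not matches:
        return None
    return min(
        matches,
        key=lambda bucket: WHITELIST_MATCH_SPECIFICITY_ORDERING.index(matches[bucket]),
    )


# jrheard, logged in, from an internal IP: the user-ID match beats the IP range.
jrheard_bucket = resolve_bucket({
    "enabled": WhitelistMatchSpecificity.IP_RANGE,
    "disabled": WhitelistMatchSpecificity.SPECIFIC_USER,
})
```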

Questions?

RequestBucketer’s a simple system, and we use it so frequently that I launched a feature with it halfway through writing this blog post. We use it to power our experiments system, too - but that’s a discussion for another post. Have any questions about how we use RequestBucketer in production or comments on its design? Let us know in the HN discussion thread!

02/20/2014

Yelp Dataset Challenge Round 2 Winner and New Data

The Challenge

The second round of the Yelp Dataset Challenge opened in May 2013, giving students access to our massive Phoenix Academic Dataset, with reviews and businesses from the greater Phoenix metro area. The Yelp team is very excited to provide the academic community with a rich dataset over which to train and extend their models and research. We encourage students to take advantage of this wealth of data to develop and extend their own research in data analysis and machine learning. Students who submit their research are eligible for cash awards and incentives for publishing and presenting their findings.

The dataset was downloaded by thousands of students around the world. From the completed entries we have selected David W. Vinson of University of California, Merced as the Round 2 winner with his submission “Valence Constrains the Information Density of Messages.”

Updating and Extending the Dataset

We are excited to announce that we have updated and extended the original Phoenix Academic Dataset! The original dataset, released in March 2013, has been well-received by the academic community, and has already been cited in papers and included in presentations around the world. For more information on past winners and their papers, please check out the Yelp Dataset Challenge site.

The new dataset builds upon this foundation by not only refreshing it with new content created over the past year but also including new data like business attributes, the social graph and tips.

The new Phoenix Academic Dataset incorporates the following updates and new data types:

  • Businesses - 15,585 (+4,048 new businesses!)
  • Business Attributes - 111,561 (new!)
  • Check-in Sets - 11,434 (+3,152 new check-in sets!)
  • Tips - 113,993 (new!)
  • Users - 70,817 (+26,944 new users!)
  • User Connections - 151,516 (new!)
  • Reviews - 335,022 (+105,115 new reviews!)

This new data is available for immediate download at www.yelp.com/dataset_challenge and replaces the previous Phoenix Academic Dataset. We are eagerly anticipating seeing the projects and research that will be built using this data. We are especially excited to see the research related to the new content: from micropost analysis on tips to inferring business attributes from reviews to mining the rich social graph for insights. We look forward to what you come up with!

Round 3 is Now Live

Along with the updated dataset, we’re also happy to announce the next iteration of the Yelp Dataset Challenge. The challenge will be open to students in the US and Canada and will run from February 11, 2014 to July 31, 2014. See the website for the full terms and conditions. This data can be used to train a myriad of models and extend research in many fields. So download the dataset now and start using our data right away!

02/11/2014

Yelp’s got style (and the guide to back it up)

We’re excited to publicly share Yelp’s styleguide — a living document that we’ve used internally since May 2013 to create visual consistency across Yelp and reduce technical debt with modular, reusable markup and styles. To get this project off the ground, our front-end and design teams worked together more closely than ever before. We’re very pleased to be able to share it with anyone interested in hearing how we moved to a fast-paced UI design and development model.

This kind of design and development requires a lot of discipline across product and engineering teams. “There are no special cases” has become our mantra. When working on a new feature, we hold fast to these rules:

1. Use the pre-established patterns.

2. No, really, please use the pre-established patterns.

3. If the pre-established patterns do not solve your design problem, you have two options:

a. Alter a pre-existing pattern to solve your problem, and implement that change across all of Yelp via a change to the styleguide.

b. Establish a new pattern, and integrate it into the canon of Yelp’s UI patterns for future use.

Now, onto the whys and hows:

Yelp has been evolving at a rapid pace for nearly 10 years. Each new feature has brought improvements to the product, but also introduced more and more markup and styles. Our front-end code base was getting out of control, and while the Photoshop mock-up to code workflow wasn’t exactly broken, it wasn’t as efficient as it could be either.

As designs for our new business listing page began to take shape, it became clear we were establishing the future of Yelp’s look and feel. This would be the last time we’d go from Photoshop mock-up to coding the UI from scratch. Even before development on the new page began, we started pulling components out of the design and building them in the styleguide. Using Sass mixins, we applied the new grid system to all of our existing layouts. With the components already built, it was easy to refresh the search page and homepage with the typography, forms and containers from the styleguide. 

Solve something once, why solve it again?

The styleguide surfaces existing solutions for designers as well as developers. When working on a new feature, the designer doesn’t need to think about how to call out information on the page: the “island” pattern does just that.

When implementing the front-end for the same feature, the developer doesn’t have to think about how to build that island from scratch. They can simply use the documented markup. No new CSS necessary.

This saves a lot of time and frees up mind space so we can stop thinking about what certain tabs should look like and instead focus on designing and building engaging user interfaces.

It’s alive, it’s alive!

Our patterns are documented in live production code, so what you see in the styleguide is exactly what you’ll see on the site. This is great for cross-browser testing our components and, unlike static design documents, there’s no need to worry about it becoming out of date.

We explored a number of options for live CSS documentation. An early version ran on PyKSS (the Python flavor of Kyle Neath’s KSS), a framework for generating live styleguides from descriptions and markup in CSS comments. We enjoyed working with PyKSS, but ultimately chose to develop a custom solution. This allowed us to make use of existing partial templates. It also made it easier to provide code snippets that developers should actually include in templates, rather than the markup that those snippets output.

No component left behind

The styleguide makes cross-site changes a breeze. Since development on the business listing page began, we’ve made several visual tweaks to our patterns. Our grays got warmer. Our islands lost their embossed look. Each style was updated in one file, and the changes were reflected everywhere the component was used, as well as in the styleguide.

See it in action



References

While creating our styleguide we took inspiration from a number of awesome sources:

 


Want to work closely with top-notch designers to build engaging user interfaces for 120 million monthly visitors? We’re hiring!

02/07/2014

Fab February Events @Yelp

It’s a big month for awesome events at Yelp HQ! Of special interest to those wondering how Yelp uses Hadoop is the Hadoop User Group on the 19th. I’ll be presenting on where Hadoop fits into our Big Data stack. Unfortunately, Hadoop isn’t quite pixie dust you can sprinkle on data and transmorph it into insights. To get the most out of the system, we’re careful what we feed in, how we schedule jobs, and what we do with the results. Please RSVP to join me for a great discussion!

Other meetups this month include the always diverse and interesting Designers + Geeks, Python, and Data Science groups.  For those of you working hard at Developer Week, take a load off and refuel at Yelp's free official after party - food, drinks, and shenanigans guaranteed. See you at Yelp!

02/03/2014

3D Printing at Yelp: Space, Drones and Monopoly

Yoni D. is a UI Designer on the mobile team by day, but at night he transforms into a hardware hacker extraordinaire. Actually, the hacking often occurs during the day, too... and sometimes on weekends. Well, I better let Yoni explain it himself!


The desktop 3D printing evolution has taken the world by storm. For me, it all started a little over two years ago when Yelp bought its engineers a MakerBot Thing-O-Matic 3D Printer to play around with. The sounds and smells that accompany desktop 3D printing have since become second nature to our corner of the office.

While we haven't yet found a way for 3D printing to help our users connect with great local businesses, it has been a great driver of innovation and fun in our engineering team. What started out as a few engineers printing toys has since spiralled into super creative projects that have stirred up great support from everyone here at Yelp.

It was a group of mobile developers that assembled Yelp’s first 3D printer. I was quickly drawn to the machine after seeing it print a ridiculously bad-looking (but functional) whistle. Soon after, a small group of engineers and I rallied behind printing our own copy of the infamous Turtle Shell Racer.

The large parts of the Turtle Shell Racer pushed the Thing-O-Matic to its limits and forced us to tweak and modify the printer constantly. The result, however, was amazing! We turned a spool of plastic and a few lunch breaks and evenings into a cute little R/C car. Seeing how far we had come inspired us to find out what designs we could come up with ourselves.

We wouldn’t have to wait long to put our imagination to the test. A few times a year, Yelp has a two-day period called Hackathon where our engineers set aside their everyday responsibilities to work on projects that are purely innovative and fun. At the next Hackathon, my team and I decided to try and send an iPhone to Space via weather balloon. The idea was to Check-In from Space using Yelp’s mobile app.

The balloon’s tracking page, which we shared with our Yelp colleagues, proved so popular that the third-party service behind it soon went down under the traffic. With the help of 3D printed parts of our own design and a lot of tenacity, our iPhone did reach the edge of Space. However, on reentry we lost our balloon somewhere in the hills near Pyramid Lake in Reno, Nevada.

It wasn’t until the next Hackathon a few months later, when we returned with an autonomous reconnaissance drone loaded with custom 3D printed parts, that we were able to locate and recover our balloon payload. Our drone consisted of a Bixler airframe, an Ardupilot system, a GoPro camera and a custom short range “First Person View” system.

Image02
Image00

Amazed by the drone we had managed to build, we decided to build a bigger and better drone to explore and share the possibilities of this new technology. We would also open source all our code, parts and 3D models. To our amazement, the “Burrito Bomber” concept drone we made and its accompanying video went viral. Before long our drone and its Thing-O-Matic 3D printed parts had a quarter of a million views on YouTube and had been covered by CNN, The Huffington Post, Forbes and more! They seem to really like the idea over at Amazon too. You're welcome, Jeff.

 

Our next project would truly test if the sky’s the limit for 3D printing. We decided to try and print in the Earth’s stratosphere at an altitude of 100,000 feet. Making a printer that’s light enough to be carried by a weather balloon but able to 3D print in an icy -50° Fahrenheit would be our biggest challenge yet. Using a modified Printrbot Simple, custom g-code and a LiPo battery, we decided to give it a shot. After only one failed attempt, we succeeded in printing a small Yelp / Printrbot logo at no less than 111,159 feet. That’s a full year before NASA’s 3D printer will join the 20-mile-high club.

Image03
Image05

Not all 3D printed projects here at Yelp involve boy toy themes like space or drones though. Currently we’re experimenting with 3D scanning using a MakerBot Digitizer and printing on the more reliable MakerBot Replicator 2. At our most recent Hackathon we clay sculpted and 3D scanned our Yelp mascots before 3D printing them at “Honey, I Shrunk the Kids” size. Printing at a resolution of 100 microns and spray painting the models gave us the perfect game pieces to accompany a custom Yelp Monopoly board. Oh, and obviously we also open sourced them (we even added a free bonus goat).


Image04

Image06

Now, in an effort to start cloning talented engineers, I’ve been using a Microsoft Kinect to scan engineers and print their busts. So far the clones have been lacking in the productivity department though. If you’re interested in building a great product, playing around with fun technology and want to save me the trouble of trying to perfect 3D printing cloning technology, you should think about applying to Yelp.

12/19/2013

Yelp Internship Program Summer 2013

This summer, our interns spent their weekdays in downtown San Francisco alongside full-time engineers developing some of Yelp’s latest releases: mobile reviews, Yelp’s launch in Brazil, and others still in the works. Mark M. and Olivia G., interns from our mobile and community teams, shared a little about their projects. Though rivals on their college campuses, our Cal and Stanford interns put aside their differences to engineer some really neat stuff.


Markm

Mark M. (Mobile Team - UC Berkeley)

“I spent the summer creating an entirely new framework built around data analytics. My tool allows us to visualize and analyze critical user flows within our mobile apps and perform and monitor A/B tests. The most exciting part of this project was that I was allowed to work on the full stack from start to finish. From using mrjob (Yelp’s open-source MapReduce framework on Hadoop), to collecting and parsing the important information, to building an entirely new front-end interface using JavaScript frameworks such as Angular.js and D3, my project definitely taught me a lot about different technologies! One of my favorite parts about Yelp is that the interns don’t feel like interns, they feel like full-time engineers. I felt like a full-time engineer because of the large scope of my project and because of how autonomous I was allowed to be in deciding the technical design and implementation of the tool. After I gave a quick technical talk about my project to the entire engineering department, there was interest from many of the different teams at Yelp to integrate their metrics projects with mine. I could immediately tell that the project I had been working on was extremely useful and will continue to make an impact in the future.”

 

Grubert_headshot_face

Olivia G. (Community Team - Stanford)

“I joined the Yelp team on April Fool's Day 2013 and was, respectfully, only mildly pranked on that first day. Luckily, my internship offer was not a joke, and I quickly settled into our team project to revamp The Weekly Yelp. The Weekly Yelp is a fun newsletter that gets sent to over 100 distinct markets, nearly half of which are international, and reaches millions of subscribers. Our project sought to improve every step of providing the content, from the way our data was structured and stored, to how featured content was selected and presented, to the process by which the emails were queued, rendered, and sent. In my six months on the team, I have been able to help on this project, from creating a new repository for our code right at the outset, to one of the final essential tasks of wiring up the email templates and queueing them to send to our subscribers. A few takeaways from this project: deciding on naming conventions is hard, but worth it; fit the use cases of the past, but don't let old code restrict your thinking; and designing a flexible structure will not only help future users but may also help you during development, as decisions are made and changed. I have loved making concrete progress on The Weekly Yelp project every day and have loved working with such a supportive team and company. But watch out for darts!”

12/06/2013

Cool New Space, Cool New Tech

On Wednesday, November 20, Yelp opened up the doors of our new space and invited tech industry friends to come by and see what we’re up to. Scott Clark and John B., two of our engineers, gave presentations about current technology challenges we’re working on here at Yelp. Search Engineering Manager Chris T. gives us a play-by-play below!


The crowd started arriving at our new building in San Francisco even before the official start time of 6pm. I ended up chatting with some folks in front of the building and did not get up to the party on the 8th floor until around 6:30. By then it was packed! The bar was serving up drinks as fast as they could and hors d'oeuvres were brought to attendees by servers walking through the crowd.

Image00

Normally, our coffee-bar. Tonight, our bar-bar. Only empty at this moment because tech talks were going on.

_MG_0730

Another shot of the new space, a bit earlier in the day.

After enjoying some food, drinks and conversation, our tech talks started. First up was Scott, who gave a great talk about how to apply techniques from optimal learning - such as bandit algorithms and Bayesian global optimization - to automatically improve the performance of our experiment framework. These techniques are already being applied in production here at Yelp.

Next up was John, who described how we’ve integrated ElasticSearch at Yelp. Prior to working with ElasticSearch, we had built out search using custom services. John’s talk gives a great overview of some of the reasons we chose ES for our future development, as well as general tips for folks building out search on their site.

Video and slides of the tech talks:

After the tech talks concluded, everyone mingled on our 8th Floor. We marked off various corners for topical discussions, such as Data Mining, Backend, Ops and so on. A lot of people ran into folks they had crossed paths with previously and were happy to both reminisce and discuss the new things they’re working on.

Keep following this blog for more updates about the exciting meetups and events we host, like the upcoming Intern Networking event, Yelp NITE, Python meetups and more.

11/07/2013

Whoa! That Embedded Web View Looks Hot in Your iOS App!

This post comes to us from Allen C., an engineer on our mobile team. The mobile team has dozens of innovations under their belt, and today Allen explains how the iOS team uses HTML views to quickly roll out features that originate on the web.


In the third quarter of 2013, the Yelp mobile app was used on more than 11 million unique mobile devices on a monthly average basis. We’re continuously pushing the envelope to make the app user experience as great as possible. A common requirement in our app is displaying embedded web content for a variety of different features. One such feature is our new Yelp Platform, which allows users to order food from participating businesses directly from our site and mobile apps. In this blog post we’re going to walk you through building a seamless embedded web content experience for your native app on iOS.

Why do you need to embed web content? Well, sometimes it makes sense in order to take advantage of the great mobile website you've already built. Other times, the content you want may only exist on the web. Here are some techniques that we use at Yelp to display gorgeous web content, while preserving the great experience our users have come to expect from the app.

The typical method for displaying web content in an iOS app is to create a UIWebView and pass it a URL to load. If you only do that, you might end up with something that looks like this:

Image01

This works, but it can be a jarring experience for users, depending on what type of web content you are showing. It feels as though they've left your app and entered a scaled-down version of Safari. One striking example, highlighted in the screen capture above, is the lack of a dedicated loading graphic - your user will see a blank screen while waiting for the first page to load. There are several things that are not optimal about this experience:

  1. The look and feel of the web view may be entirely different than the rest of the app.
  2. Navigation and page transitions within the web view are most likely different than the rest of the app.
  3. There is no way for the user's interaction with web content to directly affect native views in the app.

Since problem 1 depends largely on the specific app's look and feel, this post will focus on overcoming problems 2 and 3. To tame UIWebView we need to give it a controller that implements UIWebViewDelegate. UIWebViewDelegate is a protocol that defines a set of methods which give the view controller significantly expanded control over its web view. We'll be focusing on only one of those methods:

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType;

UIWebView calls this method on its delegate whenever it is about to load a URL request, either as a result of a user's action or because the app loaded it programmatically. (Delegation is how Apple’s UIKit framework implements the Model View Controller pattern; a view’s delegate is typically its controller.) The method receives 3 arguments: the web view itself, the URL request about to load, and a navigation type. The return value is where things get interesting: it's a flag that tells the UIWebView whether or not to actually load the URL request. If you return NO, the web view will simply not load the request and do nothing. However, that doesn’t mean our app will do nothing - we will add our own code in this method which will specify the app’s response to this URL request.
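To make the wiring concrete, here is a minimal sketch of a controller that registers itself as the web view's delegate and logs each request before allowing it. The class name `EmbeddedWebViewController` and the example.com URL are placeholders for illustration, not names from the Yelp app.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical controller name, used only for illustration.
@interface EmbeddedWebViewController : UIViewController <UIWebViewDelegate>
@property (nonatomic, strong) UIWebView *webView;
@end

@implementation EmbeddedWebViewController

- (void)viewDidLoad {
  [super viewDidLoad];

  self.webView = [[UIWebView alloc] initWithFrame:self.view.bounds];
  self.webView.autoresizingMask =
      UIViewAutoresizingFlexibleWidth | UIViewAutoresizingFlexibleHeight;
  // Make this controller the delegate so it is consulted before every load.
  self.webView.delegate = self;
  [self.view addSubview:self.webView];

  // Placeholder URL; substitute the page you actually want to embed.
  NSURL *url = [NSURL URLWithString:@"https://example.com/"];
  [self.webView loadRequest:[NSURLRequest requestWithURL:url]];
}

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType {
  // Log each request so you can see what the web view asks to load.
  NSLog(@"About to load %@ (navigation type %ld)", request.URL, (long)navigationType);
  // YES lets the web view load the request in place; NO cancels the load.
  return YES;
}

@end
```

Returning YES everywhere reproduces the default behavior; the techniques in the rest of this post come from choosing when to return NO instead.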

Native Looking Transitions

The first thing we can do is create native looking animations when loading web pages. Typically, when a user clicks a link in a web view, the web view displays a loading screen and then displays the new page in place once it is finished loading, just like on a mobile browser, and this is what will happen if we return YES in the delegate method. Instead of doing that, we will return NO as discussed above, and then we can take the URL request that the web view wanted to load, and open it in another web view! The new web view can animate into the screen in whatever way best matches the look and feel of our app. The pseudo-code to do this looks roughly like this:

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType {
  
  // Load the first request in place, because there is no web view currently showing
  if (self.makingFirstRequest) {
    self.makingFirstRequest = NO;
    return YES;
  }

  // The web view that is currently showing originated the request
  if (webView == self.visibleWebView) {
    [self.hiddenWebView loadRequest:request];

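    // `duration` is the transition length in seconds, defined elsewhere
    [UIView animateWithDuration:duration animations:^{
      // Some desired animation here
    } completion:^(BOOL finished) {
      // Swap roles so the newly loaded web view becomes the visible one
      UIWebView *oldVisibleWebView = self.visibleWebView;
      self.visibleWebView = self.hiddenWebView;
      self.hiddenWebView = oldVisibleWebView;
    }];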
    return NO;
  }

  return YES;
}

Voila! We've just roughly implemented native looking animations between web page transitions. This isn't complete yet, but the basic idea is here. One thing we learned while implementing this is that not every new URL request should be loaded in a new web view and animated in. For example, a site might load an iframe which relies on being part of the original page. In this case, just opening the iframe URL in a new web view would be incorrect.
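One way to filter out requests like the iframe case is sketched below. This is a commonly used heuristic rather than something this post confirms the Yelp app does: for sub-frame loads, the request's `mainDocumentURL` usually points at the enclosing page and so differs from the request URL. The helper name `ShouldOpenInNewWebView` is hypothetical.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical helper: returns YES only for top-level navigations the user started.
static BOOL ShouldOpenInNewWebView(NSURLRequest *request,
                                   UIWebViewNavigationType navigationType) {
  // Heuristic: for iframe and other sub-frame loads, mainDocumentURL is
  // typically the URL of the enclosing page, not the URL being requested.
  BOOL isMainFrame = [request.URL isEqual:request.mainDocumentURL];

  // Only treat taps on links and form submissions as page transitions;
  // redirects, reloads and script-driven loads stay in the current web view.
  BOOL userInitiated =
      (navigationType == UIWebViewNavigationTypeLinkClicked ||
       navigationType == UIWebViewNavigationTypeFormSubmitted);

  return isMainFrame && userInitiated;
}
```

In the delegate method, you would hand the request to the hidden web view and return NO only when this helper returns YES, and return YES otherwise so the request loads in place. Test the heuristic against your own site, since how it behaves depends on the way your pages load their content.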

Web View Events

We can extend the same concept in order to implement dynamic interactions between web content and the native app. In this case, we simply define a new URL scheme: for example, 'mobile-event'. When web content needs to interact with the native app, it can simply tell the browser to open a URL with this scheme. At Yelp, we do this by having the mobile site load an iframe with this custom URL and immediately close it. The app will detect this URL being opened, and must respond appropriately to the "web view event" in the delegate method. Here is some pseudo-code:

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType {

  // Detect a web view event
  if ([request.URL.scheme isEqualToString:@"mobile-event"]) {
  
    // Execute code here for the event

    // Make sure to return NO or the web view will try to load a fake URL
    return NO;
  }
  
  // Execute normal URL request handling logic here, then let the web view load the request
  return YES;
}

What the application does in response to a web view event is context dependent, but one example is to either load a new non-web view, or to pop back to an existing view on the navigation stack. This allows the flow on our web content to integrate seamlessly with the native app.
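As a rough sketch of what "responding appropriately" could look like, the helper below parses an event URL and pops the navigation stack. The URL layout (`mobile-event://<event-name>?key=value`), the event name `order-complete`, and the function name `HandleWebViewEvent` are all assumptions made for illustration; the post does not describe the actual event format. Note that `queryItems` requires iOS 8 or later.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical event name; the post does not list the events Yelp uses.
static NSString *const kOrderCompleteEvent = @"order-complete";

// Assumes event URLs of the form mobile-event://<event-name>?key=value
// Returns YES if the URL was a web view event (the delegate should then return NO).
static BOOL HandleWebViewEvent(NSURL *url, UINavigationController *navigationController) {
  if (![url.scheme isEqualToString:@"mobile-event"]) {
    return NO; // Not a web view event; the caller should handle it normally.
  }

  // Collect the query string into a dictionary of event parameters.
  NSURLComponents *components = [NSURLComponents componentsWithURL:url
                                           resolvingAgainstBaseURL:NO];
  NSMutableDictionary<NSString *, NSString *> *params = [NSMutableDictionary dictionary];
  for (NSURLQueryItem *item in components.queryItems) {
    if (item.value != nil) {
      params[item.name] = item.value;
    }
  }

  // The host component carries the event name under the assumed URL layout.
  if ([components.host isEqualToString:kOrderCompleteEvent]) {
    NSLog(@"Order complete, parameters: %@", params);
    // Pop back to the previous native view on the navigation stack.
    [navigationController popViewControllerAnimated:YES];
  }
  return YES;
}
```

Inside the delegate method, a call such as `if (HandleWebViewEvent(request.URL, self.navigationController)) return NO;` keeps the event handling in one place and ensures the web view never tries to load the fake URL.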

Putting it All Together - Yelp Platform

Let's look at an example from Yelp's new Platform feature, which allows users to order food from participating businesses directly from the Yelp iOS app (this is available on web and Android too, but let’s focus on iOS right now). The Yelp Platform flow is currently implemented through the mobile site and displayed in the iOS app on a web view. From the Yelp business page, the user can tap the Order Pickup or Delivery button, which loads a web view starting the platform flow on the order form.

Image00 Image02

From there the web view controller uses native looking transitions to animate the menu onto the screen.

Image05

Once the user reaches the checkout page and completes the purchase, our mobile site sends a web view event, notifying the iOS app that the purchase is complete. The iOS app then pops back to the business view, now with a nice little alert that the order has been placed and an email confirmation has been sent.

Image04 Image03

What’s Next

We’ve currently got several new features in the works for integrating web content into the Yelp mobile apps and making our user experience that much better. Hopefully, this post will also give you a few ideas for how your own iOS apps can integrate dynamic native looking web content.