Engineering Blog

November 12th, 2014

Scaling Elasticsearch to Hundreds of Developers

Yelp uses Elasticsearch to rapidly prototype and launch new search applications, and moving quickly at our scale raises challenges. In particular, we often encounter difficulty making changes to query logic without impacting users, as well as finding client library bugs, problems with multi-tenancy, and general reliability issues. As the number of engineers at Yelp writing new Elasticsearch queries grew, our Search Infrastructure team was having difficulty supporting the multitude of ways engineers were finding to send queries to our Elasticsearch clusters. The infrastructure we designed for a single team to communicate with a single cluster did not scale to tens of teams and tens of clusters.

Problems we Faced with Elasticsearch at Yelp

Elasticsearch is a fantastic distributed search engine, but it is also a relatively young datastore with an immature ecosystem. Until September 2013, there was no official Python client. Elasticsearch 1.0 only came out in February of 2014. Meanwhile, Yelp has been scaling search using Elasticsearch since September of 2012 and as early adopters, we have hit bumps along the way.

We have hundreds of developers working on tens of services that talk to tens of clusters. Different services use different client libraries and different clusters run different versions of Elasticsearch. Historically, this looked something like:

Figure 1: Yelp Elasticsearch Infrastructure

What are the Problems?

  • Developers use many different client libraries, and we have to support all of them.
  • We run multiple versions of Elasticsearch, mostly 0.90.1, 1.0.1 and 1.2.1 clusters.
  • Multi-tenancy is often not acceptable for business critical clients because Elasticsearch cannot offer machine level resource controls and its JVM level isolation is still in development.
  • Having client code spread over a multitude of services and applications makes auditing and changing client code hard.

These problems all derive from Elasticsearch’s inevitably wide interface. Elasticsearch developers have explicitly chosen a wide interface that is hard to defend due to lack of access controls, which makes sense given the complexity they are trying to express with an HTTP interface. However, it means that treating Elasticsearch as just another service in a Service Oriented Architecture rapidly becomes difficult to maintain. The Elasticsearch API is continually evolving, sometimes in backwards incompatible ways, and the client libraries built on top of that API are continually changing as well, which ultimately means that iteration speeds suffer.

Change is Hard

As we scaled usage of Elasticsearch here at Yelp, it became harder and harder to change existing code. To illustrate these concerns let us consider two examples of developer requests on the infrastructure mentioned in Figure 1:

Convert Main Web App to use the RequestsES Client Library

This involves finding all the query code in our main web app and then, for each one:

1. Create secondary paths that use RequestsES
2. Set up RequestBucketer groups.
3. Write duplicate tests.
4. Deploy the change.
5. Remove duplicate tests.

We can make the code changes fairly easily but deploying our main web app takes a few hours and we have a lot of query code that needs to be ported. This would take significant developer time due to the amount of complexity involved in deploying our main web application. The high developer cost of changing this code outweighs the infrastructure benefits, which means this change is not pursued.

Convert Service 4 to elasticsearch-py and move them to Cluster 4

Service 4’s SLA has become stricter and they can no longer tolerate the downtime caused by Service 1’s occasionally expensive facet queries. Service 4’s developers also want the awesome reliability features that Elasticsearch 1.0 brought, such as snapshot and restore. Unfortunately, our version of the YelpES client library does not support 1.X clusters, but the official Python client does, which is OK because engineers in Search Infrastructure are experts in porting YelpES code to the official Python client. Alas, we do not know anything about Service 4. This means we have to work with the team that owns Service 4, have them build parallel paths, and tell them how to communicate with the new cluster. This takes significant developer time because of coordination overhead between our teams.

It is easy to see that as the number of developers grows, these development patterns just do not scale. Developers are continually adding new query code in various services, using various client libraries, in various programming languages. Furthermore, developers are afraid to change existing code because of long deployment times and business risk. Infrastructure and operations engineers must maintain multi-tenant clusters housing clients with completely different uptime requirements and usage patterns.

Everything about this is bad. It is bad for developers, infrastructure engineers, and operations engineers, and it leads to the following lesson learned:

Systems that use Elasticsearch are more maintainable when query code is separated from business logic

Our Solution

Search Infrastructure at Yelp has been employing a proxy service we call Apollo to separate the concerns of the developers from the implementation details so that now our infrastructure looks like this:

Figure 2: Apollo

Key Design Decisions

Isolate infrastructure complexity

The first and foremost purpose of Apollo is to isolate the complexity of our search infrastructure from developers. If a developer wants to search reviews from their service, they post a JSON blob:

{"query_text": "chicken tikka masala", "business_ids": [1, 2, 3] }

to an Apollo url:

apollo-host:1234/review/v3/search

The developer need never know that this is doing an Elasticsearch query using the elasticsearch-py client library, against an Elasticsearch cluster running in our datacenter that happens to run Elasticsearch version 1.0.1.
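From the caller's side, such a request could look roughly like the sketch below. It uses only the Python standard library and builds the request without sending it. The helper name build_review_search_request is hypothetical, the http:// scheme is an assumption, and the host, port and path are just the placeholder values from the example above.

```python
import json
import urllib.request

# Placeholder endpoint from the example above; the http:// scheme is assumed.
APOLLO_URL = "http://apollo-host:1234/review/v3/search"


def build_review_search_request(query_text, business_ids, url=APOLLO_URL):
    """Build (but do not send) a JSON POST for the review search interface."""
    payload = {"query_text": query_text, "business_ids": list(business_ids)}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_review_search_request("chicken tikka masala", [1, 2, 3])
# urllib.request.urlopen(request) would send it. Note that nothing in the
# calling code mentions Elasticsearch, a client library, or a cluster.
```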

Validating all incoming and outgoing JSON objects with json-schema ensures that interfaces are respected, and because these schemas ship with our client libraries we are able to check interfaces in calling code, even when that calling code is written in Python.
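To make the idea concrete, here is a minimal sketch of checking a request against an input schema. The schema is an assumption inferred from the example blob above, and validate is a hypothetical hand-rolled checker that understands only a tiny subset of JSON Schema (type, required, properties, items). It stands in for a full json-schema validator so that the example stays dependency-free.

```python
# Hypothetical input schema, inferred from the example request above.
REVIEW_SEARCH_INPUT = {
    "type": "object",
    "required": ["query_text", "business_ids"],
    "properties": {
        "query_text": {"type": "string"},
        "business_ids": {"type": "array", "items": {"type": "integer"}},
    },
}

_PY_TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance conforms."""
    expected = _PY_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for "integer".
    wrong_type = not isinstance(instance, expected)
    if wrong_type or (expected is int and isinstance(instance, bool)):
        return ["%s: expected %s" % (path, schema["type"])]
    errors = []
    if schema["type"] == "object":
        for key in schema.get("required", []):
            if key not in instance:
                errors.append("%s: missing required key %r" % (path, key))
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, "%s.%s" % (path, key)))
    elif schema["type"] == "array" and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], "%s[%d]" % (path, index)))
    return errors


good = {"query_text": "chicken tikka masala", "business_ids": [1, 2, 3]}
bad = {"query_text": "chicken tikka masala", "business_ids": ["1"]}
```

Because the schema is plain data, the same dictionary can be shipped with a client library and checked on the calling side before a request is ever sent.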

Make it easy to iterate on query code

Every query client is isolated in its own client module within Apollo, and each client is required to provide an input and output schema that governs what types of objects the client should accept and return. Each such interface is bound to a single implementation of a query client, which means that in order to write a non-backwards compatible interface change, one must write an entirely new client that binds to a new version of the interface. For example, if the interface to review search changes, developers write a separate module and bind it to /review/v4/search, while continuing to have the old module bound to /review/v3/search. No more “if else” experiments, just self contained modules that focus on doing one thing well.
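A minimal sketch of this kind of binding, assuming a hypothetical bind decorator and an in-process registry (Apollo's actual module layout is not shown in this post). Each interface path maps to exactly one implementation, so an incompatible change means registering a new path instead of branching inside the old client. The limit field in v4 is invented purely for illustration.

```python
REGISTRY = {}  # interface path -> the single query client bound to it


def bind(path):
    """Bind a query client to an interface path, refusing double bindings."""
    def decorator(client):
        if path in REGISTRY:
            raise ValueError("%s is already bound to %s" % (path, REGISTRY[path].__name__))
        REGISTRY[path] = client
        return client
    return decorator


@bind("/review/v3/search")
def review_search_v3(request):
    # Stand-in for building and running the real Elasticsearch query.
    return {"version": 3, "query_text": request["query_text"]}


@bind("/review/v4/search")
def review_search_v4(request):
    # v4 changes the interface (a hypothetical required "limit"), so it lives
    # in a new client while v3 keeps serving existing callers unchanged.
    return {"version": 4, "query_text": request["query_text"], "limit": request["limit"]}


def dispatch(path, request):
    """Route a request to whichever client is bound to the path."""
    return REGISTRY[path](request)
```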

A key feature of per module versioning is that developers can iterate on their query client independently and the Apollo service is continuously delivered, ensuring that new query code hits production in tens of minutes. Each client can also be selectively turned off or redirected to another cluster if they are causing problems in production.

As for language, we chose Python due to Yelp’s mature Python infrastructure and the ease with which consumers could quickly define simple and complicated query clients. For a high throughput service like Apollo, Python (or at least Python 2) is usually the wrong choice due to high resource usage and poor concurrency support, but by using the excellent gevent library for concurrency and the highly optimized JSON parsing library ujson, we were able to scale Apollo to extremely high query loads. In addition, these libraries are all drop-ins, so clients do not have to design concurrency into their query logic; it comes for free. At peak load Apollo with gevent can do thousands if not tens of thousands of concurrent Elasticsearch queries on a single uwsgi worker process, which is pretty good compared to the single concurrent query that normal Python uwsgi workers can achieve.

Make it easy to iterate on infrastructure

Because the only thing that lives in Apollo is code that creates Elasticsearch queries, it is easy to port clients to new libraries or move their client to a different cluster in a matter of minutes. The interface stays the same and end to end tests ensure functionality is not broken.

Another key capability is that from the start we designed these modules to implement a simple interface that is composable. This composable-first architecture has allowed us to provide wrappers like:

  • SlowQueryLogger: A unary wrapper that logs any slow requests to a log for auditing and monitoring.
  • Tee: A binary wrapper that allows us to make requests to two clients but only wait on results from one of them. This is useful for dark launching new clients or load testing new clusters.
  • Mux: A n-ary wrapper that directs traffic between many clients. This is useful for gradual rollouts of new query code or infrastructure.

As an example, let us assume there are two query clients which differ only in the client library they use and Elasticsearch version they expect: ReviewSearchClient and OfficialReviewSearchClient. Furthermore, let us say our operations engineer has just provisioned a new shiny cluster running Elasticsearch 1.2.1 that lives in the cloud and is ready to be load tested. An example composition of these clients within Apollo might be:
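A minimal sketch of what such a composition could look like follows. The wrapper classes are hypothetical re-creations written from the descriptions above, not Apollo's actual code: the constructor signatures, the convention that a client is any callable taking a request dict, the thread-based Tee (standing in for gevent), and the stand-in clients are all assumptions.

```python
import random
import threading
import time


class SlowQueryLogger:
    """Unary wrapper: record any request that takes longer than threshold_ms."""
    def __init__(self, client, threshold_ms, log):
        self.client, self.threshold_ms, self.log = client, threshold_ms, log

    def __call__(self, request):
        start = time.monotonic()
        try:
            return self.client(request)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            if elapsed_ms > self.threshold_ms:
                self.log.append((request, elapsed_ms))


class Tee:
    """Binary wrapper: answer from primary, fire-and-forget to secondary."""
    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def __call__(self, request):
        # Daemon thread: the caller never waits on the secondary client.
        threading.Thread(target=self._ignore_errors, args=(request,), daemon=True).start()
        return self.primary(request)

    def _ignore_errors(self, request):
        try:
            self.secondary(request)
        except Exception:
            pass  # a failing dark launch must not affect the caller


class Mux:
    """N-ary wrapper: pick one client per request according to weights."""
    def __init__(self, clients, weights):
        # weights is a plain mutable list, so it can be adjusted at runtime.
        self.clients, self.weights = clients, list(weights)

    def __call__(self, request):
        return random.choices(self.clients, weights=self.weights, k=1)[0](request)


def make_client(name, calls):
    """Stand-in for a real query client; records calls instead of querying."""
    def client(request):
        calls.append(name)
        return {"served_by": name, "query_text": request["query_text"]}
    return client


calls, slow_log = [], []
mux = Mux(
    [make_client("ReviewSearchClient", calls),
     make_client("OfficialReviewSearchClient", calls)],
    weights=[1.0, 0.0],  # start with all traffic on the old client
)
cloud = make_client("OfficialReviewSearchClient@cloud", calls)  # load-test target
review_search = SlowQueryLogger(Tee(mux, cloud), threshold_ms=500, log=slow_log)

response = review_search({"query_text": "chicken tikka masala"})
```

Shifting traffic to the new client is then a matter of changing mux.weights, for example to [0.5, 0.5], with no change to the composition itself.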


This maps to the following request path:

Figure 3: Life of a Request in Apollo

In this short amount of Python code we achieved the following:

1. If any query takes longer than 500ms, log it to our slow query log for inspection
2. Send all traffic to a Mux that muxes between an old PyES implementation and our new official client implementation. We can change the Mux weights at runtime without a code push
3. Separately send traffic to a cloud cluster that we want to load test. Do not wait for the result.

Most importantly of all, we never had to worry about which consumers are making review search requests because there is a well defined interface that is well tested. Additionally, because Apollo uses Yelp’s mature Python service stack we have performance and quality metrics that can be monitored for this client, meaning that we do not have to be afraid to make these kinds of changes.

Revisiting developer requests

Now that Apollo exists, making changes goes from weeks to days, which means our organization can continue to be agile in the face of changing developer needs and backwards incompatible Elasticsearch versions. Let us revisit those developer requests now that we have Apollo:

Convert Main Web App to use the RequestsES Client Library

We have to find all the clients in Apollo that the Main Web App queries and implement their interfaces using the RequestsES client library. Then we wire up a Mux for each client that allows us to switch between the two implementations of the interface, deploy our code (~10 minutes) and gradually roll out the new code using configuration changes. From experience, query code like this can get ported in an afternoon. Having minute-long deploys to production makes all the difference because it means that you can get multiple pushes to production in one day instead of one week. Also, because the Elasticsearch query-crafting code is separate from all the other business logic, it is easier to reason about and feel confident in changes.

Convert Service 4 to elasticsearch-py and move them to Cluster 4

We can implement Service 4’s interface using the new client library, re-using existing tests to ensure functional equivalence between the two implementations. Then we set up a Tee to the new cluster to make sure our new code works and the cluster can handle Service 4’s load. Finally, we wait a few days to ensure everything works and then we change the query client to point at the new cluster. If we really want to be safe we can set up a Mux and gradually roll it over. This whole process takes a few days or less of developer time.

Infrastructure Win

Now that Yelp engineers can leverage Apollo, along with our real time indexing system and dynamic Elasticsearch cluster provisioning, they can develop search applications faster than ever. Whereas before Search Infrastructure was accustomed to telling engineers “unfortunately we can’t do that yet”, today we have the flexibility to support even the most ambitious projects.

Since the release of Apollo just a few months ago, we have ported every major Yelp search engine running on Elasticsearch to use Apollo as well as enabled dozens of new features to be developed by other teams. Furthermore, due to the power of Apollo we were able to seamlessly upgrade to Elasticsearch 1.X for a number of our clients where prior to this that would have been nearly impossible given our uptime requirements.

As for performance, we have found that the slight overhead of running this proxy has proved more than worth it in deployment, cluster reconfiguration, and developer iteration time, enabling us to make up for the request overhead by deploying big win refactors that improve performance.

At the end of the day Apollo gives us flexibility, fast deploys, new Elasticsearch versions, performant queries, fault tolerance and isolation of complexity. A small abstraction and the right interface turn out to be a big win.

November 11th, 2014

November Events at Yelp

This month we have a handful of exciting events and a few new ones! We kicked off the month with Hackathon 15.0 where Yelpers created and shared some amazing projects (more on that in a few weeks!).

Now that the hackathon dust has settled, we’re starting off by hosting the Bay Area Girl Geek Dinner to help encourage networking between girl geeks. On top of that, we’ll be at AnDevCon November 18 – 21 so make sure to find us there too.

Events happening at Yelp HQ:

  • Tuesday, November 11, 2014 – 5:30PM – Yelp Girl Geek Dinner (Bay Area Girl Geek Dinner)
  • Thursday, November 13, 2014 – 7:00PM – Hooked: How to Build Habit-Forming Products (Designers + Geeks)
  • Tuesday, November 18, 2014 – 6:30PM – Docker Meetup at Yelp (Docker)
  • Wednesday, November 19, 2014 – 6:45PM – You Are What You Buy (Products That Count)
  • Thursday, November 20, 2014 – 6:30PM – Become a Mad Scientist through Failure and Determination (BAKG)

November 6th, 2014

Yelp Dataset Challenge Round 3 Winners and Dataset Tools for Round 4

Yelp Dataset Challenge Round 3 Winners

We recently opened the fourth round of the Yelp Dataset Challenge. This announcement included an update to the dataset, adding four new international cities and bringing the total number of reviews in the dataset to over one million. You can download it and participate in the challenge here. Submissions for this round are open until December 31, 2014. See the full terms for more details.

With the opening of our fourth iteration of the challenge, we closed the third round, which ran from February 1, 2014 to July 31, 2014. We are proud to announce two grand prize winners of the $5,000 award from round three:

  • Jack Linshi from Yale University with his entry “Personalizing Yelp Star Ratings: A Semantic Topic Modeling Approach.” Jack proposed an approximation of a modified latent Dirichlet allocation (LDA) in which term distributions of topics are conditional on star ratings, allowing topics to have an explicit sentiment associated with them.
  • Felix W. from Princeton University with his entry “On the Efficiency of Social Recommender Networks.” Felix constructed metrics for measuring the efficiency of a network in disseminating information/recommendations and applied them to the Yelp social graph, discovering that it is quite efficient.

These entries were selected from many submissions for their technical and academic merit. For a full list of all previous winners of the Yelp Dataset Challenge, head over to the challenge site.

Dataset Example Code

We maintain a repository of example code to help you get started playing with the dataset. These examples show different ways to interact with the data and how to use our open source Python MapReduce tool mrjob with the data.

The repository includes scripts for

Other Tools

There are many ways to explore the vast data within the Yelp Dataset Challenge Dataset. Below are some examples of some of the many cool tools that can be used with our data:

CartoDB is a cloud based mapping, analysis, and visualization engine that shows you how you can transform reviews into insightful visualizations. They recently wrote a blog post demonstrating how to use their tools to gain interesting insights about the Las Vegas part of the dataset.

Statwing is a tool used to clean data, explore relationships, and create charts quickly. They loaded the dataset into their system for people to play with and explore interesting insights.

Yelp Dataset Challenge Round 4

Submissions for this round are open until December 31, 2014. See the full terms for more details. This dataset contains over one million reviews from five cities around the world, together with all of the associated businesses, tips, check-ins, and users, as well as the social graph. We are excited to see what you come up with!

November 5th, 2014

Reflections from Grace Hopper (Part 2)

Welcome back! Today we have Wei, Rachel, Jen, Virginia and Anusha sharing their experiences. Wei is an engineer on the consumer team and she brings amazing user experiences to our customers. Rachel and Jen are both engineers on our international team, bringing the power of Yelp to all of our international communities. Virginia works as an engineer on our partnerships team and Anusha is an engineer on our infrastructure team.

Overall, we all had a blast getting to know each other, meeting other amazing women in the industry, hearing some great stories from inspiring women role models, and sourcing some future talent for Yelp. If you missed us at the career fair and are interested in working at Yelp, check out yelp.com/careers. We are always interested in talented women engineers to grow our community.

Wei W.

Software Engineer, Mobile Site

I was very inspired by Yoky Matsuoka’s talk. She reminded us to constantly evaluate our level of passion and engagement with what we’re doing, and to change courses if we find those levels waning. As a professional tennis player turned robotics professor turned VP of technology at Nest, Yoky is proof that our interests may lead us on many different paths throughout our life and that, it turns out, is actually okay.

Rachel and Wei eating ice-cream

We loved the afternoon ice-cream snacks

Rachel Z.

Software Engineer, International

My time at Grace Hopper was split between interviewing and attending talks. I was a bit overwhelmed by the volume of students coming by our career booth every day but I was also really glad that we got to interview some strong women engineers while at the conference. It was great to see the interest in technical as well as future leadership possibilities.

Jo Miller’s talk on Winning at the Game of Office Politics was fascinating. I always viewed office politics as evil. Her talk provided a way to look at it from a different angle and assured us that it’s possible to navigate office politics without becoming a political animal.

Some other personal highlights: speaking with industry folks during workshops, eating ice cream, dancing at the parties, and reconnecting with friends. A couple of things I hope the conference improves upon next year are fewer scheduling conflicts between interesting talks and increased availability for the more popular talks.

Jen W. taking a break from interviews

Jen W.

Software Engineer, International

I have been working in the industry for over 10 years, always in male-dominated workplaces. What struck me the most was being surrounded by so many women. It was inspiring to meet people from such a wide diversity of experiences and backgrounds, all of them bright, eager, and happy to chat (Hi Yenny! Hey Kanak!).

Those who are in school or looking at career growth will truly benefit from attending future Grace Hopper conferences. For the rest of us, it’s still a great opportunity to meet new people, attend the amazing (in more ways than one!) plenary panels and learn what’s going on in tech in the rest of the world.

Our booth looked awesome and was very popular for its swag! Clearly all of us were trying to tweet about it while we took this picture!

Virginia T.

Software Engineer, Partnerships

For me, the most useful and interesting workshop was “The Dynamics of Hyper-Effective Teams.” I loved that there was role-playing involved. For instance, do you know someone on your team who’s an “Airtime Dominator,” someone who speaks at least 20% of the time? Or a “Silent Expert,” the one who has all the knowledge but rarely speaks up? What about “The Naysayer” who constantly refutes others’ ideas?

The workshop concluded with volunteers sharing their stories and advice on working with these personalities. Having personally been through some similar experiences, it was an eye-opener to learn how these (sometimes clashing) personalities can effectively work together. Can’t wait to experiment with it!

Anusha R.

Software Engineer, Infrastructure

I loved attending Grace Hopper for the first time this year! There were a bunch of inspirational speakers at the conference. Just like my co-worker and friend Wei, I also enjoyed Yoky’s talk about how her passions led to her finding her career path. I was also inspired by Barbara Birungi, one of the award winners. Her non-profit is helping women in Uganda get involved in technology through mentorship and coaching.

There was a good mix of technical and career building talks and workshops. I attended several career building sessions and enjoyed the workshops on office politics, difficult conversations and building hyper-effective teams.

But most of all, I enjoyed talking to the students who visited our booth at career fair. Many of them had interesting projects to talk about. I was excited to see this level of talent and wish them the best in their careers.

November 4th, 2014

Reflections from Grace Hopper (Part 1)

As you probably heard, Yelp attended Grace Hopper this year. Nine software engineers from different teams attended and, for many of us, it was our first time.

It was a unique experience to see so many talented women in one place. In addition to the talks and panel discussions, we also had the opportunity and pleasure to represent Yelp at the career fair. It was amazing to see a consistent flow of students and industry talent, all happy customers, stop by our booth to speak with us and tell us their stories of using Yelp.

We had such a great time that we wanted to share some highlights with you. This is the first of two posts where each of us has added an excerpt from our experience in our own words.

Our group at Grace Hopper

Our first day at the career fair. From left: Jen F., Wei W., Susanne L., Tasneem M., Carmen J., Emily F., Virginia T.

Tasneem M.

Engineering Manager, Ads

I joined Yelp about a month ago and was super excited to be part of this journey with fun and interesting women from the company. I have worked in the software industry since the early 2000s and have grown as an engineer, a manager and a leader partly because I have had inspiring role models throughout my career. My mentors have challenged me to take on opportunities that I felt I wasn’t quite ready for.

My experience at Grace Hopper was a reminder of how far I have come and yet how far I still can go. It has re-inspired me to continue mentoring women and help them take charge of their careers in a male dominated industry. I am also motivated to help tackle the diversity issues within my local community.

The highlights for me were being at the career fair, attending the keynotes and technology lightning talks. Having spent a lot of time at the career fair, I was impressed by the talent and deep interest in data mining and machine learning. I look forward to seeing some of you join our growing community of awesome engineers at Yelp. I also loved Pat Kirkland’s talk on “Executive Presence.” I appreciated her role-playing and practical guidance on the different personas (prey, predator and partner) and have been able to successfully experiment with some of her tips at work. The lightning talks were a great way to hear several unique stories and perspectives within an hour. I hope that we can be on stage next year sharing some of our experiences.

Susanne L.

Database Engineer, Operations

I am a database engineer at Yelp, working on a team where we all make sure that our persistent stores are reliable, scalable, fast and give our Yelp consumers a pleasant experience with our product. I thought it was awesome to see so many great women in computing coming to Grace Hopper. Some of them were looking for an internship or a full time position and they stopped by our booth genuinely interested in Yelp. It was exciting to hear how our product makes consumers happy.

A lot of these young women had remarkable experiences but didn’t really know what to highlight and how to sell themselves. My suggestion to some of them was to:

  • Consider having an “elevator pitch” prepared. Sell yourself in 60 seconds: try a pitch that includes your name, year, major, minor (if you have one), and why you want this job.
  • Prepare to talk about a distinguishing skill/project. For example, “I did a hackathon, where I developed an Android app...” Be prepared to answer any questions about the project and what you learned from it.
  • Know what you are looking for. “What kind of internship at Yelp would you be interested in?” Backend/Frontend/Mobile/Web? Which programming language do you like? Why? Avoid “I don’t know, anything is fine” unless you are a freshman!

Jen F.

Systems Administrator, Corporate Infrastructure

I have been at Yelp since August 2012. For me, the talk I will remember most is “Beyond the Buzzwords: Test-Driven Development” where I got a quick overview of how one can practice test-driven development (TDD). While TDD isn’t the most exciting technical topic, the speaker, Sabrina B. Williams from Google, did an awesome job making it relatable. She was able to share a live demo of a project from concept to the tests to fully-implemented program.

The keynotes were full of rock stars of the technology world who were also surprisingly engaging (and occasionally controversial!). Who hasn’t gone to a technology conference keynote and expected a glorified sales pitch that was easily forgettable? None of that was here at Grace Hopper.

I also appreciated how diverse the talks were – from recommendations on how to get your first job to dealing with office politics to in-depth technical talks. However, one thing I’d love to see more of at Grace Hopper next year is talks about advancing your career in non-managerial roles. All in all, this is a great conference and I only wish I had been able to participate when I was just getting into the industry!