09/23/2011

Calling All Data Miners!

Summers are tough here at Yelp's HQ in San Francisco. It's hard to keep up with all the hot new businesses, order the best food at every restaurant you visit, or even just find that fancy dive bar your friends have been chatting up. If you're anything like me, you aren't satisfied just using these features on Yelp - you want to tear them apart, see how they tick, and make them better. The trick? All these features are powered by our incredible review data.

The data

Yelp is providing the reviews for nearly seven thousand businesses at 30 universities for students and academics to research and explore, along with some examples to get things started. Check out the examples on our GitHub page, and get the data from our Academic Dataset page (you'll need to be associated with an academic institution to qualify for access).

Some numbers:

  • 30 schools
  • 3 data types (reviews, businesses, users)
  • Over 150k reviews of nearly 7k businesses

Cool, huh? How about we try building something with the dataset?

Positive shmositive

First off, let's define the problem we want to solve. Personally, I've always been interested in sentiment analysis, so the first thing I thought of was this: what are the most positive and negative words for each category? (The following results are generated by the positive_category_words example, in case you want to follow along.)

The simplest means of accomplishing this is to find the average star rating of all the reviews each word shows up in. We'll use mrjob for this, since we're working with a fair amount of data. Here's a simplified version of the job:

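from mrjob.job import MRJob


class PositiveWords(MRJob):

    def review_mapper(self, _, data):
        # Skip business and user records; only reviews carry text and stars.
        if data['type'] != 'review':
            return

        # words() is a tokenizing helper that this simplified version omits.
        # Using a set counts each word at most once per review.
        unique_words = set(words(data['text']))
        for word in unique_words:
            yield word, data['stars']

    def positivity_reducer(self, word, ratings):
        # avg() is likewise an omitted helper that averages the star ratings.
        yield word, avg(ratings)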

The full solution is implemented in positive_category_words/simple_global_positivity.py, and produces the following results:

The 10 most positive words in ranked order: gem, treasure, knowledgeable, impeccable, topnotch, incredible, talented, compliments, macadamia, perfection

The 10 most negative words in ranked order: worst, tasteless, flavorless, awful, rude, disgusting, horrible, terrible, poorly, zero
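
If you'd like to poke at the idea without setting up mrjob, the averaging step above can be sketched in plain Python. This is only an illustration, not the code from the examples repo: the words() tokenizer, the function name and the sample records are all made up for the sketch.

```python
import re
from collections import defaultdict


def words(text):
    """Lowercased alphabetic tokens (an assumed tokenizer, not the repo's)."""
    return re.findall(r"[a-z]+", text.lower())


def average_stars_by_word(records):
    """Map each word to the mean star rating of the reviews containing it."""
    totals = defaultdict(lambda: [0, 0])  # word -> [star sum, review count]
    for record in records:
        if record.get("type") != "review":
            continue
        # A set, so a word repeated within one review is counted once.
        for word in set(words(record["text"])):
            totals[word][0] += record["stars"]
            totals[word][1] += 1
    return {word: total / count for word, (total, count) in totals.items()}


# Hypothetical sample records, shaped like the fields the job above reads.
sample = [
    {"type": "review", "stars": 5, "text": "A hidden gem, great pasta"},
    {"type": "review", "stars": 1, "text": "Worst pasta ever. The worst!"},
    {"type": "business", "name": "Not a review"},
]
scores = average_stars_by_word(sample)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

Words that only appear in glowing reviews float to the top, words that only appear in angry ones sink, and a word like "pasta" that shows up in both lands in the middle.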

Making it better

So what's wrong? The first thing that stands out is that all these words are very general - 'gem,' in particular, shows up in over 17k reviews, and could be used to describe pretty much anything. Let’s break it down by category, instead. The code for this portion lives in positive_category_words/weighted_category_positivity.py, and closely mirrors simple_global_positivity, with the addition of a category term.
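
As a rough picture of what "adding a category term" means, here is a plain-Python sketch that keys the averages on (category, word) instead of just the word. It assumes reviews carry a business_id field and businesses carry business_id and categories fields, and it skips whatever weighting weighted_category_positivity.py applies, so treat it as a sketch of the idea rather than the example's actual code.

```python
from collections import defaultdict


def average_stars_by_category_word(records):
    """Map (category, word) to the mean stars of the matching reviews."""
    # Pass 1: look up each business's categories by its id.
    categories_by_business = {
        record["business_id"]: record["categories"]
        for record in records
        if record["type"] == "business"
    }
    # Pass 2: accumulate [star sum, review count] per (category, word).
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        if record["type"] != "review":
            continue
        unique_words = set(record["text"].lower().split())
        for category in categories_by_business.get(record["business_id"], []):
            for word in unique_words:
                totals[(category, word)][0] += record["stars"]
                totals[(category, word)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample data; field names are assumptions about the dataset.
sample = [
    {"type": "business", "business_id": "b1", "categories": ["Italian"]},
    {"type": "business", "business_id": "b2", "categories": ["Bars", "Italian"]},
    {"type": "review", "business_id": "b1", "stars": 5, "text": "heaven on a plate"},
    {"type": "review", "business_id": "b2", "stars": 2, "text": "rude manager"},
    {"type": "review", "business_id": "b2", "stars": 4, "text": "pizza heaven"},
]
category_scores = average_stars_by_category_word(sample)
print(category_scores[("Italian", "heaven")])
```

A review of a business with several categories counts toward each of them, which is why the same word can score differently under "Italian" and "Bars".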

So, for a category near-and-dear to my heart, Italian:

The 10 most positive words: melts, naples, exquisite, vino, promise, hooked, ovens, heaven, char, succulent

The 10 worst words: disgusting, worst, horrible, flavorless, terrible, tasteless, awful, rude, manager, subpar

These are looking a lot better, but there’s still room for improvement. We could use NLTK to distinguish parts of speech, and break down positive adjectives versus positive nouns. Or we could use a stemmer, and just look at root words (dream / dreams / dreaming might be better if they were bucketed together).
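
To make the bucketing idea concrete, here is a small sketch that merges word scores by root. The suffix-stripping naive_stem() below is a deliberately crude stand-in for a real stemmer (it will mangle plenty of words), and the per-word totals are invented for the example.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # checked in this order


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bucket_by_stem(word_totals):
    """Merge word -> (star sum, review count) into stem -> average stars."""
    buckets = defaultdict(lambda: [0, 0])
    for word, (star_sum, count) in word_totals.items():
        stem = naive_stem(word)
        buckets[stem][0] += star_sum
        buckets[stem][1] += count
    return {stem: total / count for stem, (total, count) in buckets.items()}


# Hypothetical per-word totals: (sum of stars, number of reviews).
word_totals = {
    "dream": (9, 2),
    "dreams": (5, 1),
    "dreaming": (4, 1),
    "rude": (3, 2),
}
stem_scores = bucket_by_stem(word_totals)
print(stem_scores)
```

Note that summing per-word counts double-counts a review that uses both "dream" and "dreams"; stemming before the words are de-duplicated per review would avoid that.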

Curious about the results for over 400 other categories? Check out our full examples page.

Going further

We've provided two more examples in our GitHub repo: a tool that guesses the category of a business given a review, and a simple Markov-chain review generator (a tool that, given a category and some starting text, fills in the rest of the review). But we're most excited about what you come up with. Head over to http://www.yelp.com/academic_dataset for information on how to get started and a list of the schools we're providing data for.

Have any other questions? Shoot us an email at dataset@yelp.com.

08/22/2011

Yelp Hackathon #5 Brings You KegTime and More

If you read this blog regularly, you're probably already familiar with Yelp Hackathons. They're 48-hour periods we set aside for anyone in the company to crank on projects that they're excited about.

In the past, our Hackathons have included special guests, piñatas and, oh yeah, hacking! This year's didn't disappoint, with 28 total projects built, a petting zoo in the office (What? Your company doesn't have those?) and a high school science fair-style set-up so our hundreds of Yelp employees in San Francisco could wander, sip some Belgian ale and take a gander at what their colleagues dreamed up.

We wanted to showcase some of our favorite projects, so once again we created a video that takes a humorous look at just a handful of our insanely creative employees. While these projects aren't (yet) live on Yelp, they may be something that we incorporate shortly! Take a look at Review Low-Lights, Side-by-Side and Kegmate 2.0 featuring KegTime.

Let us know what you think, and if you’re interested in joining our ranks, check out our job openings at http://www.yelp.com/careers.

07/09/2011

Day in the Life of a Yelp Engineer

Check out the latest post on the main Yelp blog about a day in the life of one of our engineers, JR Heard!

06/15/2011

Understanding Git

Yelp switched from Subversion to Git for source control over a year ago. As it turns out, however, Git (and distributed version control systems in general) can be daunting for developers to understand, especially if they're used to more traditional centralized versioning solutions. Git can also be intimidating for new developers just starting to use source control - it tends to assume that everyone is a power user, offering a great deal of power, but sometimes at the cost of user-friendliness.

In an effort to help more people get a deeper understanding of just what exactly is going on underneath the hood when it comes to Git, one of our recent weekly learning groups here at Yelp focused on Git fundamentals. Since we believe that others outside of Yelp might also benefit from a better understanding of one of the most powerful and yet free version control systems out there, we're making the video and slides of the session available for anyone who wishes to peruse them. Enjoy!


04/19/2011

Parcelgen: A Code Generator for Android Data Objects

When I switched from working on Yelp's iPhone app to our Android app, one of the first things I encountered was the radical difference between the equivalent classes to handle what I normally consider a "screen" or "page" of an app. On Android, an Activity handles what on iOS you use a UIViewController for, but they work in fundamentally different ways. One of the biggest differences is that on Android you can't just instantiate a new Activity and display it like you can with a UIViewController. Instead you create an Intent and tell the system to start an Activity based on that Intent.

This works fine for simple Activities, but things get complicated when you have a complex data object which you want to display in a new Activity. Android doesn't let you pass arbitrary objects between Activities. As this Android Google Group discussion points out, data passed to Activities must be placed in globally-accessible state, stored on the device's flash memory, or passed inside an Intent. A static singleton works for a few objects, but it requires having a place to store all the objects being passed and is generally considered poor design. If you use a static collection to pass multiple objects, the receiver must make sure to remove the objects lest they leak memory. Writing an object to flash is expensive, slow, and requires having a reliable way to write and read an object (such as SQLite or Serializable). Intent seems like it should be great, except that every object written to an Intent must either belong to a limited set of Java types and classes, or know how to read and write itself to a Parcel by implementing the Parcelable interface.

After trying a number of different techniques to pass objects between Activities, we began making the basic objects in our application implement the Parcelable interface. This greatly simplified the process of launching new Activities, and allowed our application to save all its objects when suspended via onSaveInstanceState(), so recovering from being stopped by the Android system was much easier to handle. As our application grew and we added more and more data objects, I tired of writing similar Parcel- and JSON-related code for each class.

Enter Parcelgen

Inspired by Android's existing code generation in aapt, instead of manually writing classes that implement Parcelable, I wrote a Parcelable code generator in Python. All it needs to know is the class name, instance variables and types, and a little metadata about each object to create.

I realized that I needed the exact same information to read an object from a Parcel that I needed to read an object from a JSONObject, so I enhanced my script to generate code to do that too.

How it Works

For each object to generate, write a small JSON description of the object's members and their types:

{
    "do_json": true,
    "package": "com.yelp.parcelgen",
    "props": {
        "String": [
            "id", "name", "imageUrl", "url", "mobileUrl",
            "phone", "displayPhone", "ratingImageUrl",
            "ratingImageUrlSmall", "snippetText", "snippetImageUrl"
        ],
        "int": ["reviewCount"],
        "double": ["distance"],
        "Location": ["location"] 
    },
    "json_map": {
        "ratingImageUrl": "rating_img_url",
        "ratingImageUrlSmall": "rating_img_url_small"
    }
}

This is the parcelable description for (a subset of) a Business returned by the Yelp API as used in a sample app I wrote for parcelgen.

Then run the parcelgen Python script to generate the Java code for the object:

$ python ~/parcelgen/parcelgen.py parcelables/Business.json src/

This creates two files: _Business.java and Business.java. _Business contains the Parcel and JSON reading/writing logic, and Business contains the static CREATOR field that Parcelable requires. Business doesn't have any dependencies on the object's properties. If the JSON description changes and you re-run parcelgen, _Business.java will be overwritten, but Business.java will not. This lets you add data and application logic to Business without losing the flexibility to modify the parcel description later (any members added to Business won't automatically get saved to a Parcel).
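
The two-file split described above can be illustrated with a toy generator in Python. To be clear, this is not parcelgen's source: the function names, the mFieldName naming convention, the flat output directory and the trimmed-down description are all assumptions made for the sketch, and the actual Parcel and JSON methods are left out entirely.

```python
import os
import tempfile

# A trimmed-down description in the same shape as the JSON above.
DESCRIPTION = {
    "package": "com.yelp.parcelgen",
    "props": {"String": ["id", "name"], "int": ["reviewCount"]},
}


def render_base(class_name, description):
    """Java source for the generated base class: one field per member."""
    lines = ["package %s;" % description["package"], ""]
    lines.append("// Generated file: overwritten on every run.")
    lines.append("abstract class _%s {" % class_name)
    for java_type, names in sorted(description["props"].items()):
        for name in names:
            field = "m" + name[0].upper() + name[1:]
            lines.append("    protected %s %s;" % (java_type, field))
    lines.append("    // Parcel and JSON read/write methods omitted in this sketch.")
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_subclass(class_name, description):
    """Java source for the hand-editable subclass."""
    return (
        "package %s;\n\n"
        "public class %s extends _%s {\n"
        "    // Hand-written application logic goes here.\n"
        "}\n" % (description["package"], class_name, class_name)
    )


def generate(class_name, description, out_dir):
    base_path = os.path.join(out_dir, "_%s.java" % class_name)
    sub_path = os.path.join(out_dir, "%s.java" % class_name)
    # The base class is always regenerated from the description.
    with open(base_path, "w") as f:
        f.write(render_base(class_name, description))
    # The subclass is written only once, so hand edits survive re-runs.
    if not os.path.exists(sub_path):
        with open(sub_path, "w") as f:
            f.write(render_subclass(class_name, description))
    return base_path, sub_path


with tempfile.TemporaryDirectory() as out_dir:
    base_path, sub_path = generate("Business", DESCRIPTION, out_dir)
    # Simulate a developer adding logic to the hand-editable class.
    with open(sub_path, "a") as f:
        f.write("// HAND EDIT\n")
    # Change the description and regenerate.
    DESCRIPTION["props"]["double"] = ["distance"]
    generate("Business", DESCRIPTION, out_dir)
    with open(base_path) as f:
        base_source = f.read()
    with open(sub_path) as f:
        sub_source = f.read()

print(base_source)
print(sub_source)
```

After the second run the base class picks up the new distance member while the hand edit in the subclass is still there, which is the whole point of keeping generated and hand-written code in separate files.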

Creating and Passing Parcelgen(erated) Objects

Want to pass an object to a new Activity in an Intent? Just use Intent.putExtra() (BusinessesActivity.java):

intent.putExtra("business", mBusiness);

Then, in your Activity's onCreate():

Business business = getIntent().getParcelableExtra("business");

Want to create a list of objects from a JSON array of dictionaries? Check out how the sample app does it:

JSONObject response = new JSONObject(result);
List<Business> businesses = JsonUtil.parseJsonList(
    response.getJSONArray("businesses"), Business.CREATOR);

Using parcelgen saves a lot of repetitive code and developer time in applications that are based around a web API. In Yelp for Android we use parcelgen to handle businesses, users, reviews, and other basic data objects. Whenever you tap a business from the search results list, it's passed to the Activity to display the business through a Parcel with the help of parcelgen.

Read Up

Detailed instructions on how to use parcelgen in your project, along with some more advanced features, are outlined in parcelgen's readme on GitHub.

A working sample Android application is also available on GitHub as a reference for how to use parcelgen.

Fork me on GitHub

Parcelgen is open source under the Apache 2.0 license and available for anyone's use and modification on GitHub. Please feel free to submit patches, feature requests, and the like there.

If you use parcelgen in a project, please drop us a line. We'd love to know about other people using it! You can contact me directly through Yelp or through GitHub.

04/06/2011

Gotta Bounce... to an Engineering Offsite

Here in Yelp Engineering, we spend lots of time working on useful and cool stuff for Yelp users of all sorts—consumers, business owners, and even our own Community Managers. Sometimes though, it's nice to take a break and go hang out with your coworkers for a bit.

What better place to "hang out" than a few feet up in the air? Our most recent offsite took us to House of Air, a giant indoor trampoline park. After getting a brief safety introduction from HoA's "Flight Crew" the team hit the trampolines, and well... I think these say it all:

[Photos: Waiting, People1, 3flip, Jason]

Of course, why stop at just trampolines, when you can add big red dodgeballs?

[Photo: Rules]

That's right - not only does House of Air have a giant trampoline grid, they also have a trampoline dodgeball coliseum. After some quick team organization, the dodgeball matches began:

[Photos: Dodgeball0, Dodgeball1, Dodgeball2, Dodgeball3]

More cool pictures are after the "jump"... though if you're on a mobile device, you might want to refrain!

03/21/2011

MySQL Minutiae & InnoDB Internals

At Yelp, we store nearly all of our data in MySQL. At any given time we're issuing tens of thousands of SQL queries per second to our database cluster, with some individual servers going above the 10k qps mark. The cluster holds billions of rows. In response to a lot of different problems, we've had to optimize the snot out of our MySQL installation, and we've learned some interesting things along the way.

A colleague and I recently gave a presentation to some of our coworkers, titled MySQL Minutiae & InnoDB Internals. The talk covered some good background knowledge that every developer should have about MySQL (transaction isolation levels, replication, etc.) and some more advanced topics such as InnoDB locking semantics.

There's some good stuff in the talk for people of almost all skill levels, but to whet your appetite I'm going to dive into an interesting deadlock example.

03/09/2011

After Hours Project: Kinect Hacking

Here at Yelp, we're passionate about building things; it's at the core of our engineering philosophy. In fact, we enjoy it so much that many of us keep on building after we finish work. I recently found some spare time to work on an interesting project with the Microsoft Kinect. I think it's a cool start and I've open sourced the code so that others can build something even cooler.


02/24/2011

Upcoming Tech Events at Yelp

Over the next couple of weeks, Yelp is going to be hosting two open-to-the-public events for members of the software development community:

PyPy Just-in-Time Interpreters

March 3rd, 6pm

Armin Rigo of the PyPy project will be giving a presentation on achievements made by PyPy, the "fastest, most compatible, and most stable 'alternative' Python interpreter." Special attention will be given to advancements in the area of dynamic (JIT) interpreters. You can find more information on SFPython's Meetup page. If you plan to come, make sure to RSVP at least a day in advance so that security will allow you into the building.

San Francisco Hadoop User Meetup

March 9th, 6pm

The third SF Hadoop meetup will be taking place at Yelp! The meetup is discussion-based, using an "unconference" format. Agenda and topics are determined at the beginning of the meeting (and anyone may volunteer a topic), but a proposed theme for the upcoming session is "integration." The session is expected to last approximately 2 hours, and more information is also available on the SF Hadoop Meetup page. Again, please make sure to RSVP at least a day in advance in order to be admitted past security.

02/18/2011

Towards Building a High-Quality Workforce with Mechanical Turk

In addition to having written over 15 million reviews, Yelpers also contribute hundreds of thousands of business listing corrections each year.  Not all of these corrections are accurate, though, and there are quite a few jokers out there (e.g. suggesting the aquariums category for popular seafood restaurants... very funny!).  Yelp is serious about the correctness of business listings, so in order to efficiently validate each and every change, we’ve turned to Amazon’s Mechanical Turk (AMT) as well as other automated methods.  We recently published a research paper [1] at the NIPS 2010 Workshop on Computational Social Science and the Wisdom of Crowds reporting on our experiences.

Vetting Workers On Test Tasks

Our experiences agree with several other studies in finding that the AMT workforce has many high-quality workers but also many spammers who don't perform tasks reliably. In particular, only 35.6% of workers passed our basic multiple-choice pre-screening test. We used expert-labeled corrections in order to test worker performance and found that the variance of worker accuracies was very high:

[Image: Image00]

Please see our paper for a full discussion of our observations. Previous studies have proposed mechanisms to correct for the sort of worker biases we observed. However, these mechanisms correct results as a post-processing step after workers have been paid for completing all tasks. Given our experiences and financial goals, we find that a useful mechanism must vet workers online as they complete tasks.
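
To give a feel for what vetting workers online (rather than in a post-processing step) could look like, here is a minimal Python sketch that scores each worker against expert-labeled tasks as answers arrive and cuts a worker off once their accuracy drops too low. It is not the mechanism from our paper: the WorkerVetter class, the 75% accuracy threshold and the five-task minimum are all hypothetical.

```python
from collections import defaultdict


class WorkerVetter:
    """Track per-worker accuracy on expert-labeled tasks as answers arrive."""

    def __init__(self, min_accuracy=0.75, min_gold_tasks=5):
        # Both thresholds are illustrative assumptions.
        self.min_accuracy = min_accuracy
        self.min_gold_tasks = min_gold_tasks
        self.correct = defaultdict(int)
        self.total = defaultdict(int)
        self.blocked = set()

    def record(self, worker_id, answer, expert_label):
        """Record one answer; return True if the worker may keep working."""
        if worker_id in self.blocked:
            return False
        self.total[worker_id] += 1
        if answer == expert_label:
            self.correct[worker_id] += 1
        accuracy = self.correct[worker_id] / self.total[worker_id]
        # Only judge a worker once enough expert-labeled answers are in.
        if self.total[worker_id] >= self.min_gold_tasks and accuracy < self.min_accuracy:
            self.blocked.add(worker_id)
            return False
        return True


vetter = WorkerVetter()
expert_labels = ["accept", "reject", "reject", "accept", "reject"]
careful = ["accept", "reject", "reject", "accept", "reject"]
sloppy = ["accept", "accept", "accept", "accept", "accept"]

for label, good, bad in zip(expert_labels, careful, sloppy):
    careful_ok = vetter.record("careful_worker", good, label)
    sloppy_ok = vetter.record("sloppy_worker", bad, label)

print(careful_ok, sloppy_ok)
```

Because the check runs after every answer, an unreliable worker stops receiving (and being paid for) tasks as soon as the evidence is in, instead of after the whole batch is done.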

Copyright © 2004-2010 Yelp