Comparing searches at the DNC and RNC
David K., Data Scientist
- Nov 4, 2016
Ah - politics. It’s frustrating, it’s funny, it’s serious business. It is perhaps one of the best windows we have into the nature of the human condition.
So we think it’s fitting that we combine politics with another window into human nature - search data. In particular, Yelp’s PR team was interested in looking at how the Republican and Democratic National Conventions affected what people looked for on Yelp. Could we confirm certain political stereotypes? Would there be surprises?
From a data analysis perspective, we need to state the question precisely before we can tackle it. We want to look at how the DNC and RNC affected people’s search behavior - but what specifically does that mean in terms of the data?
Well, we should obviously look at the searches made at the relevant time and place: during each convention, in the area surrounding the corresponding cities (Cleveland for RNC, Philadelphia for DNC). Now, Yelp stores a lot of information on these searches, including which business category (restaurants, bars, home services, etc.) the search was for. Looking at these business categories for these searches seems like a natural thing to do.
But can we merely count the number of searches made in a category? Can we just say, “There were 120 searches for ‘Guns and Ammo’ during the DNC, and 116 during the RNC”?
[Table: # of searches during DNC | # of searches during RNC]
Of course not. The conventions were held in two different cities, with different populations and levels of Yelp usage. One convention may simply have had more searches due to these factors.
Okay, then - can we look at what percent of all searches were made in a given category? Can we just state, “0.955% of the searches during the DNC were for ‘Cheese Steaks’, compared with 0.083% during the RNC”?
[Table: # of searches during DNC | # of searches during RNC | % of all searches during DNC | % of all searches during RNC]
Well, no - the two cities have different cultures and historical searches, and we need to take that into account. The DNC had a much greater search percentage for ‘Cheese Steaks’, but that was only because it was held in Philadelphia. If we stopped our comparison here, we’d be mostly characterizing the cities, and not the conventions.
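The "percent of all searches" figure used here is just a category's count divided by the total number of searches in the same time and place. A minimal sketch in Python, assuming the searches have already been tallied per category; the function name and the counts are made up for illustration and are not Yelp's actual data:

```python
def category_shares(counts):
    """Return each category's share of all searches, in percent."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no searches to normalize")
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up counts, chosen only so that the share matches the 0.955% quoted above.
dnc_counts = {"Cheese Steaks": 955, "everything else": 99_045}
shares = category_shares(dnc_counts)
print(f"Cheese Steaks: {shares['Cheese Steaks']:.3f}% of all searches")
```

Normalizing this way removes the effect of one city simply having more Yelp traffic than the other, but, as noted above, it does not remove differences in local taste.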
How about comparing each city with itself to mitigate those effects? We can compare the searches during the convention with the searches from the week before, and look at the changes. We would then get statements like “Searches for ‘Comfort Food’ increased during the RNC, from making up 0.091% of all searches to 0.112%”. Now that’s getting somewhere. That is an increase that we can actually attribute, at least in part, to the RNC.
[Table: % of all searches before RNC | % of all searches during RNC | change in % of searches, RNC]
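To make the before/during comparison concrete, here is a small sketch that takes the two ‘Comfort Food’ shares quoted above and reports the change both in percentage points and relative to the baseline week. The function name is ours, for illustration only:

```python
def share_change(before_pct, during_pct):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = during_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# 'Comfort Food' at the RNC, using the shares quoted in the text.
absolute, relative = share_change(before_pct=0.091, during_pct=0.112)
print(f"{absolute:+.3f} percentage points ({relative:+.1f}% relative)")
```

For ‘Comfort Food’ this works out to about 0.021 percentage points, or roughly a 23% relative increase over the week before.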
But we’re still not quite there yet. You can see this with something like ‘Churches’. During the RNC, searches for ‘Churches’ increased from making up 0.117% of all searches to 0.121% - this means that Republicans like churches, right?
[Table: % of all searches before RNC | % of all searches during RNC | change in % of searches, RNC]
Well, maybe. Or this may just be the effect of a bunch of out-of-towners coming into the city over a weekend, which is not actually particular to the RNC. In fact, the DNC saw a similar effect, except the difference was even larger: ‘Churches’ increased from 0.085% of all searches to 0.100% during the DNC. Similar effects, where both conventions increased a category’s share of all searches, can also be seen for ‘Gay Bars’ and ‘Pizza’.
[Table: change in % of searches, RNC | change in % of searches, DNC]
To improve the comparison further, we then need to look at the differences in these changes. We take the increase in the percentage of searches in a given category at each convention, and then take the difference of these values across the two conventions. For example, at the DNC, ‘Vegan’ food searches increased from 0.39% of all searches to 0.53%. But can we attribute that increase specifically to the DNC, and not to just an influx of out-of-towners or the effect of the searches being made later in the summer?
[Table: % of all searches before DNC | % of all searches during DNC | change in % of searches, DNC]
We have good reason to believe we can - because at the RNC, ‘Vegan’ searches did not really move at all. In fact, they decreased slightly compared to the week before, as a percentage of all searches. This gives us some confidence when we say that the increase in the ‘Vegan’ category is specifically attributable to the DNC.
[Table: % of all searches before DNC | % of all searches during DNC | change in % of searches, DNC | change in % of searches, RNC]
So, after considering all these things, we can now make a precise statement of the question. For each category, we want to find the percentage of all searches that are taken up by that category. We then calculate the change in this value compared to the week before the convention. Lastly, we take the difference in these changes between the DNC and the RNC. The categories with the most extreme differences are the ones that can be said to be most disproportionately affected by the national conventions.
[Table: change in % of searches, DNC | change in % of searches, RNC | difference in changes in %, DNC-RNC]
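The three steps just described (share, change versus the week before, difference of changes) can be sketched in a few lines of Python. This is an illustration under assumptions, not the code we actually ran: the function names are ours, and the counts are made up so that the ‘Churches’ shares and the DNC ‘Vegan’ shares match the ones quoted in the text. The RNC ‘Vegan’ counts are purely hypothetical, standing in for "decreased slightly".

```python
def shares(counts):
    """Percent of all searches taken up by each category."""
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}


def change_in_share(before_counts, during_counts):
    """Change in each category's share, in percentage points."""
    before, during = shares(before_counts), shares(during_counts)
    return {c: during.get(c, 0.0) - before.get(c, 0.0)
            for c in set(before) | set(during)}


def difference_in_changes(dnc_before, dnc_during, rnc_before, rnc_during):
    """Rank categories by (DNC change - RNC change), largest first."""
    dnc = change_in_share(dnc_before, dnc_during)
    rnc = change_in_share(rnc_before, rnc_during)
    diffs = {c: dnc.get(c, 0.0) - rnc.get(c, 0.0) for c in set(dnc) | set(rnc)}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Made-up counts out of 100,000 searches per period (see the note above).
dnc_before = {"Churches": 85, "Vegan": 390, "other": 99_525}
dnc_during = {"Churches": 100, "Vegan": 530, "other": 99_370}
rnc_before = {"Churches": 117, "Vegan": 400, "other": 99_483}  # Vegan: hypothetical
rnc_during = {"Churches": 121, "Vegan": 395, "other": 99_484}  # Vegan: hypothetical

ranked = difference_in_changes(dnc_before, dnc_during, rnc_before, rnc_during)
for category, diff in ranked:
    print(f"{category:10s} {diff:+.3f} percentage points (DNC - RNC)")
```

For ‘Churches’, the quoted figures give (0.100 - 0.085) - (0.121 - 0.117) = 0.015 - 0.004 = 0.011 percentage points. That is a small positive value, consistent with both conventions pushing the category up by a similar amount.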
But this gives us a final value that’s difficult to explain and interpret. So, when we report the results, we just give the percentage increase in the number of searches in these heavily affected categories, for each convention.
This methodology can certainly be refined a great deal, but it’s enough to give us a decent first look.
Getting to the final story
Obviously, we can do more to improve our list. We can put error bars on the final values, and look for statistical significance in these differences. We can consider only searches made by non-local users. We can consider a different and more comprehensive “baseline” search set, instead of just looking at one week prior. We can use different cutoff points for throwing out small categories, or use different metrics to measure the changes.
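As an illustration of the first refinement, error bars, here is one simple way to attach a margin of error to a change in share. It uses the standard normal approximation for a difference of two proportions and treats every search as an independent draw, which is a simplifying assumption. The function name and the totals are hypothetical; only the 0.091% and 0.112% shares come from the text.

```python
import math


def share_change_with_margin(before_hits, before_total,
                             during_hits, during_total, z=1.96):
    """Return (change, margin), both in percentage points.

    Uses the normal approximation for a difference of two proportions;
    z=1.96 corresponds to roughly a 95% interval.
    """
    p_before = before_hits / before_total
    p_during = during_hits / during_total
    std_err = math.sqrt(p_before * (1 - p_before) / before_total
                        + p_during * (1 - p_during) / during_total)
    return 100.0 * (p_during - p_before), 100.0 * z * std_err


# Hypothetical totals of 100,000 searches per week; the hit counts are chosen
# only to reproduce the 0.091% and 0.112% shares quoted for 'Comfort Food'.
change, margin = share_change_with_margin(91, 100_000, 112, 100_000)
print(f"change = {change:+.3f} +/- {margin:.3f} percentage points")
```

The width of the margin depends heavily on how many searches a category receives, which is also why a cutoff for small categories matters: with few searches, a large-looking change can sit well inside the noise.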
But before we dive into more sophisticated data analysis techniques, we should consider the original goal: we wanted a story we can tell about the national conventions, using Yelp data. So then, what is the easiest thing we can do to make an interesting story about the DNC and RNC?
We already have a list, which we believe reflects the effects of the national conventions, at least in part. Additional data analysis techniques will slightly rearrange the order of the categories. This will make the final list more “accurate”, but that may not be necessary - it’s not like we’re trying to derive each party’s platforms from Yelp’s data.
And for that purpose, the low-hanging fruit is to make the story more interesting by focusing on interesting categories. Consider ‘Adult Entertainment’, for example. Seeing how the conventions impacted that category is far more interesting than determining exactly where the ‘Restaurants’ category falls on the overall list. In fact, the ‘Adult Entertainment’ category would be more interesting even if it turned out not to have been influenced by the conventions at all.
So, as the last part of this project, we curated the above list for interesting categories which also showed strong signs of being impacted by the national conventions. The results are presented on the Yelp blog.
For you statistics geeks out there, we also did a Bayesian analysis on whether the categories we presented really leaned the way we said. It turned out that for over 90% of the categories, we are at least 90% sure, and very frequently more than 99% sure, of the direction of the relative difference in searches due to the national conventions. Even for the least certain category, we’re more than 75% sure. So our conclusions are quite defensible statistically, even though “interesting” was our main goal.
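We don't spell out the Bayesian model here, so the following is only one plausible way to estimate "how sure are we of the direction", not a description of the analysis we ran. It assumes a uniform Beta(1, 1) prior on each share, treats searches as independent, and uses Monte Carlo draws from the resulting Beta posteriors. The function name, the totals, and the RNC counts are hypothetical.

```python
import random


def prob_dnc_change_larger(dnc_before, dnc_during, rnc_before, rnc_during,
                           draws=20_000, seed=0):
    """Estimate P(DNC change in share > RNC change in share).

    Each argument is a (hits, total) pair for one category in one period.
    Assumes a uniform prior, so each share has a Beta(1 + hits, 1 + misses)
    posterior; the probability is estimated by sampling.
    """
    rng = random.Random(seed)

    def sample_share(hits, total):
        return rng.betavariate(1 + hits, 1 + total - hits)

    wins = 0
    for _ in range(draws):
        dnc_change = sample_share(*dnc_during) - sample_share(*dnc_before)
        rnc_change = sample_share(*rnc_during) - sample_share(*rnc_before)
        if dnc_change > rnc_change:
            wins += 1
    return wins / draws


# 'Vegan': DNC shares of 0.39% and 0.53% are from the text; the totals of
# 100,000 searches and the RNC counts are made up for illustration.
p = prob_dnc_change_larger(
    dnc_before=(390, 100_000), dnc_during=(530, 100_000),
    rnc_before=(400, 100_000), rnc_during=(395, 100_000),
)
print(f"P(DNC change > RNC change) is approximately {p:.3f}")
```

A value near 1 (or near 0) means the data strongly favor one direction, while a value near 0.5 means the category could lean either way.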
The final product is a unique mix - yes, it does confirm some political stereotypes. But there are also surprises that go against these stereotypes, and still other elements that don’t fit neatly into a typical political narrative yet humanize the people gathered at these conventions. It’s a small, slightly unexpected glimpse into human nature that serves as both window and mirror.
Our methodology is incredibly flexible. It can be used to generate many other comparisons similar to the DNC/RNC story. Do you want to know how California is different from Texas? Simply compare the searches from the two states, in the same way we compared the searches from a convention city to the week before. Do you want to know how California and Texas celebrate the 4th of July differently? Compare California searches during the 4th of July to the week before, and do the same for Texas. Then compare them against each other. Other questions of this kind include:
- How are red states and blue states different from one another?
- What are the most distinctive categories in each city in relation to the rest of the country?
- How do the people in Japan and Australia celebrate Christmas differently?
- How did the searches in Rio de Janeiro change from 2015 to 2016, as it prepared for the Olympics?
- What do people search for in the evenings vs. during the day?
- How do the day-to-evening search changes compare in New York vs. in Hawaii?
If we’re willing to expand the methodology just a bit, we can bring in things like user information and slice by dimensions other than business categories. Then we can address questions like:
- Which categories are more popular among the Yelp Elites vs. regular users?
- At what times of day are Elites more active compared to regular users?
- Which state has the most, or least, gender disparity in what each gender searches for?
As you can see, there’s a lot we can do with this. You can expect more upcoming stories - in fact, the red state/blue state analysis is now up on the Yelp blog!