Google-Wide Association Studies

April 26, 2016 by Sam Wang

I will comment on the East Coast primaries at the end of the post. First I will write about something more interesting: Google Correlate!

>>>

In human genetics there is a form of analysis called a genome-wide association study (“GWAS”). In this kind of analysis, the researcher looks for bits of DNA that show up more often in people with some trait or disease. Motivations for doing this kind of study include (a) finding genetic variations that contribute to a condition, so they can be studied; and (b) providing a way of estimating the chance that a condition will occur. However, GWAS is full of challenges. One of my research interests is autism. Autism is strongly driven by combinations of genes, yet GWAS has only succeeded in identifying a small fraction of the risk. Many of these bits of DNA have all kinds of other effects (this is a project in my lab…and hey, I’m recruiting!).

The Google Correlate method for political prediction is analogous to GWAS…but better! In this analogy, Google search terms are the “genes.” Thousands (maybe millions) of Google search terms are statistically associated with the share of a state’s vote that goes to Donald Trump, Ted Cruz, John Kasich, Hillary Clinton, or Bernie Sanders. Some of these terms make intuitive sense; others are mind-bending.

I wrote about this idea the other day. (To learn how the method works, and to do it yourself, read this first.) Today I want to explain a little further. I will show some fascinating and often hilarious results.

Reader N. turned me on to Google Correlate, which is basically part of the engine behind Google Trends. Correlate takes a pattern you give it – baked bean sales by state, robbery rates over time, or whatever – and gives back the search terms that have a similar pattern. There are billions of search terms – similar to the number of DNA “letters” in the human genome.

N. created a text file of vote shares by state. Here is what the first lines of a file for Trump would look like:

Iowa,0.513
New Hampshire,0.186
South Carolina,0.357
Nevada,0.302
Alabama,0.306
Alaska,0.492
Arkansas,0.455
Georgia,0.347
Massachusetts,0.125
…

These Trump support numbers are fractions of the total Trump+Cruz+Kasich vote. Percentages are okay too, since Google Correlate rescales everything to a range of -1 to +1. Typos are not okay – Correlate is very unforgiving of misspelled state names.
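For readers who want to build such a file themselves, here is a minimal sketch of one way to do it. The raw vote counts below are invented purely for illustration (they are not actual returns), and the names `RAW_VOTES`, `KNOWN_STATES`, `three_way_share`, and `build_upload` are hypothetical helpers, not anything Correlate provides. The state-name check is only a stand-in for the kind of typo that Correlate would reject.

```python
import csv
import io

# Invented vote counts, for illustration only (NOT real returns).
RAW_VOTES = {
    "Iowa": {"Trump": 450, "Cruz": 520, "Kasich": 30},
    "New Hampshire": {"Trump": 1000, "Cruz": 330, "Kasich": 450},
}

# Spellings we accept; only the states used in this sketch are listed.
KNOWN_STATES = {"Iowa", "New Hampshire"}


def three_way_share(votes, candidate):
    """Return the candidate's fraction of the combined Trump+Cruz+Kasich vote."""
    total = votes["Trump"] + votes["Cruz"] + votes["Kasich"]
    return votes[candidate] / total


def build_upload(raw_votes, candidate):
    """Build 'State,share' lines in the layout shown above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for state, votes in raw_votes.items():
        # Catch misspelled state names before uploading.
        if state not in KNOWN_STATES:
            raise ValueError(f"unrecognized state name: {state!r}")
        writer.writerow([state, f"{three_way_share(votes, candidate):.3f}"])
    return buffer.getvalue()


trump_csv = build_upload(RAW_VOTES, "Trump")
print(trump_csv)
```

Swapping `"Trump"` for `"Cruz"` or `"Kasich"` would give the other two files.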

If you upload Trump, Cruz, and Kasich files, Correlate gives back a list of the most-correlated search terms. Those lists look like this:

These lists were generated using vote-share data that excluded Ohio (Kasich’s home state) and Wyoming (an unusually nonrepresentative voting process, even by the standards of caucuses). That was N.’s decision, after playing around with the data a bit.

The way to read the table is as follows: state-by-state, Trump support is correlated with the frequency of “DeGrassi season 13” with a correlation coefficient of 0.7438. Why? If a term shows up on this list, it doesn’t necessarily mean that the person doing the search supports Candidate X. It could also mean that relatives or neighbors of Candidate X’s voters tend to make that search.
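To make “correlation coefficient” concrete, here is a minimal sketch of the standard Pearson calculation, which I am assuming is the kind of score Correlate reports. The vote shares are the first five lines of the file above; the search-frequency numbers are invented for illustration, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each without the common 1/n factor.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Vote shares from the example file (Iowa, NH, SC, Nevada, Alabama).
vote_share = [0.513, 0.186, 0.357, 0.302, 0.306]
# Invented per-state search frequencies for some term (illustration only).
search_freq = [1.2, -0.8, 0.3, 0.1, -0.2]

r = pearson(vote_share, search_freq)
print(f"correlation: {r:.4f}")
```

A value near +1 means the term is searched most where the candidate does best; a value near 0 means no linear relationship across states.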

To return to the GWAS analogy, such indirect connections are essential to genomic analysis: the snippet of DNA that is tracked is usually close to, but hardly ever identical to, the snippet that causes a trait. In genetics, even when there is a causal connection, it is not so obvious what is going on. For example, one gene, whose protein was thought to be mainly for how blood cells adhere to one another, turns out to be important in how synapses adhere as well – with implications for schizophrenia. The point is, we should be careful to avoid overinterpreting or overgeneralizing from the search terms above. But we should also keep our minds open about what we find!

Now, to examine some of these “hits”:

John Kasich. Places that like Kasich are richer in some fairly policy-wonkish search terms: “net cost,” “renewable portfolio standard,” the economist Joseph Stiglitz, Financial Times writer Martin Wolf, and Vox writer Dylan Matthews. These terms have a ring of plausibility. They might be good fodder for small talk…if you are talking with a Kasich supporter!

But then there are terms that I don’t entirely understand: Route 73 and Haven Pizza. Maybe someone can explain those to me. It is also true that with billions of search terms to choose from, occasionally a correlation will arise by chance. These might be false positives.
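The chance-correlation worry can be illustrated with a small simulation. This is only a sketch under simple assumptions: 30 “states,” a made-up random vote-share pattern, and search terms that are pure noise. The function name `best_chance_correlation` and all the numbers are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_chance_correlation(n_states, n_terms, seed=0):
    """Largest |r| between a random target and n_terms pure-noise 'terms'."""
    rng = random.Random(seed)
    target = [rng.random() for _ in range(n_states)]
    best = 0.0
    for _ in range(n_terms):
        noise = [rng.random() for _ in range(n_states)]
        best = max(best, abs(pearson(target, noise)))
    return best


few = best_chance_correlation(n_states=30, n_terms=10)
many = best_chance_correlation(n_states=30, n_terms=5000)
print(f"best of 10 noise terms:   {few:.3f}")
print(f"best of 5000 noise terms: {many:.3f}")
```

The more terms you search, the stronger the best meaningless match becomes, which is why a single odd hit deserves some skepticism.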

Ted Cruz. Many Cruz-related search terms are related to domestic life of a certain kind: family photos, felt Christmas stockings, scentsy plug ins, balloon animals, Baby Trend car seats, and DIY cribs. Easy enchiladas are particularly Cruz-y. Mmmm, enchiladas. And udder covers…I wasn’t expecting that one. Maybe the Cruz campaign could start distributing Cruz-themed udder covers!

Donald Trump. Note that the correlations are weaker. That could be because Trump support is broad-based in the Republican Party. Or it could be that the connection between the voter and the Google-searcher is indirect (i.e. they are different individuals who live near one another).

At first, this list was quite puzzling. A prominent cluster of search terms is pop culture-related: DeGrassi (that’s a TV show that focuses on the problems of teens), Kids’ Choice Awards, and Nickelodeon star Alexa Nikolas. And…“never had a boyfriend.” Wait, what? I thought Trump voters skewed older.

N.’s coworker had a thought. According to correlations from Neil Irwin and Josh Katz at The Upshot, Trump supporters are abundant in communities where many people have not finished high school, or places with lots of mobile home dwellers and “old economy” jobs like manufacturing. She suggests that in some families, the children of Trump voters are the only ones in the house who use the Internet. And at least some of them are looking for entertainment and relationship advice online. I do not think it is totally advisable to ask Google about boyfriends…but these searchers are helping us to predict voting patterns. So…you go, girls!

>>>

The final step in making a prediction is converting these search terms to predicted vote share. Each of the search terms has a correlation score attached to it. One could do a weighted sum using those scores…but they are not that different, so N. calculated a simple average of the terms’ coefficients, state by state. She then did a simple linear regression between that average and the vote-share for that candidate. The resulting linear fit is a formula that can be used to predict vote share in places that have not voted yet. N. repeated this for the other two candidates. The result is a set of inferred values of vote share – what a genomics researcher would call “imputed” values.
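The averaging-and-regression step can be sketched as follows. This is a guess at the general shape of the calculation, not N.’s actual code: I am assuming that each search term comes with a normalized per-state search level, and every number, state label, and function name below is invented for illustration.

```python
def mean(values):
    """Simple (unweighted) average."""
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented normalized search levels for three top terms, per state.
SEARCH_LEVELS = {
    "State A": [0.9, 1.1, 1.0],
    "State B": [-0.5, -0.3, -0.4],
    "State C": [0.2, 0.0, 0.1],
    "State D": [0.6, 0.4, 0.5],   # has not voted yet
}
# Invented vote shares for the states that have already voted.
VOTE_SHARE = {"State A": 0.50, "State B": 0.22, "State C": 0.32}

# Step 1: average the terms, state by state.
signal = {state: mean(levels) for state, levels in SEARCH_LEVELS.items()}

# Step 2: regress vote share on that average, using voted states only.
voted = sorted(VOTE_SHARE)
slope, intercept = fit_line([signal[s] for s in voted],
                            [VOTE_SHARE[s] for s in voted])

# Step 3: apply the fitted line to states that have not voted.
predicted = {s: intercept + slope * signal[s]
             for s in SEARCH_LEVELS if s not in VOTE_SHARE}
print(predicted)
```

Repeating the same three steps with each candidate’s own term list and vote shares would give one imputed value per candidate per state.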

And what about today’s primaries? The PEC poll estimate and the Correlate-based estimate are in close agreement:

[Figure: GOP 2016, Google Correlate estimate vs. poll median, 25 April 2016]

Because none of these five states have voted yet, none of them went into the Google-Correlate approach. The two green columns of estimates were made independently of one another. Today we will see how these two methods do. They both suggest that in the East Coast primary, Donald Trump will get about 150 delegates (this includes 43 out of 54 district-level Pennsylvania delegates, whose estimated faithfulness is 0.8), over 85% of the total to be given today.

I have one final thought on the analogy with GWAS. These days, modelers who attempt to predict votes from demographic factors are relying on linkages that they suppose to exist between populations and voting behavior. The education/Trump connection is an example of that. In human genetics, that would be called a “candidate gene” approach. Such approaches can turn up a good result if the hypothesis happens to be true. Often, they lead to results that have not held up so well over time.

In both human genetics and political modeling, constructing a good model requires the modeler to be an extremely good guesser. Google Correlate has the potential to search millions of possibilities at once. I think a clever modeler could really go to town with this tool.

>>>

Democrats, I have not forgotten you. Here, as an added bonus, are search terms that correlate with Clinton and Sanders support. The age/race divide is quite apparent. Where Sanders supporters are found, you can evidently get quinoa soup, PBS Nova, and people who do not know what “wonky” means. And baked goods. A lot of baked goods. It is a veritable cornucopia of Stuff White People Like.

Near Clinton supporters it’s cheap bedroom furniture, Nicki Minaj fans, and pink hoverboard shoppers. And “career in” – Google auto-complete as a job counselor!

Update, April 26th, 11:00pm: In today’s East Coast primaries, Google Correlate did better than any poll-based method:

[Figure: Google Correlate performance, East Coast primaries]
