Protected: Two ways to estimate primary outcomes without polls (transcript)
I will comment on the East Coast primaries at the end of the post. First I will write about something more interesting: Google Correlate!
>>>
In human genetics there is a form of analysis called a genome-wide association study (“GWAS”). In this kind of analysis, the researcher looks for bits of DNA that show up more often in people with some trait or disease. Motivations for doing this kind of study include (a) finding genetic variations that contribute to a condition, so they can be studied; and (b) providing a way of estimating the chance that a condition will occur. However, GWAS is full of challenges. One of my research interests is autism. Autism is strongly driven by combinations of genes, yet GWAS has only succeeded in identifying a small fraction of the risk. Many of these bits of DNA have all kinds of other effects (this is a project in my lab…and hey, I’m recruiting!).
The Google Correlate method for political prediction is analogous to GWAS…but better! In this analogy, Google search terms are the “genes.” Thousands (maybe millions) of Google search terms are statistically associated with how strongly a state votes for Donald Trump, Ted Cruz, John Kasich, Hillary Clinton, or Bernie Sanders. Some of these terms make intuitive sense; others are mind-bending.
I wrote about this idea the other day. (To learn how the method works, and to do it yourself, read this first.) Today I want to explain a little further. I will show some fascinating and often hilarious results.
Reader N. turned me on to Google Correlate, which is basically part of the engine behind Google Trends. Correlate takes a pattern you give it – baked bean sales by state, robbery rates over time, or whatever – and gives back the search terms that have a similar pattern. There are billions of search terms – similar to the number of DNA “letters” in the human genome.
N. created a text file of vote shares by state. Here is what the first lines of a file for Cruz would look like:
Iowa,0.513
New Hampshire,0.186
South Carolina,0.357
Nevada,0.302
Alabama,0.306
Alaska,0.492
Arkansas,0.455
Georgia,0.347
Massachusetts,0.125
…
These Cruz support numbers are fractions of the total Trump+Cruz+Kasich vote. Percentages are okay too, since Google Correlate rescales everything to a range of -1 to +1. Typos are not okay – Correlate is very unforgiving of misspelled state names.
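For readers who want to build such a file themselves, here is a minimal Python sketch. It assumes you already have raw vote counts per state for the three candidates; the function name `shares_csv` and the counts below are invented for illustration and are not real results.

```python
import csv
import io

def shares_csv(raw_votes, candidate):
    """Return "State,share" lines, where share is the candidate's
    fraction of the combined vote of all candidates listed per state."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for state, counts in raw_votes.items():
        total = sum(counts.values())
        writer.writerow([state, f"{counts[candidate] / total:.3f}"])
    return buf.getvalue()

# Hypothetical counts, for illustration only (not actual election data).
raw_votes = {
    "State A": {"Trump": 500, "Cruz": 300, "Kasich": 200},
    "State B": {"Trump": 250, "Cruz": 250, "Kasich": 500},
}

print(shares_csv(raw_votes, "Kasich"), end="")
```

State names are passed through unchanged, so any misspelling in the input would end up in the file.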
If you upload Trump, Cruz, and Kasich files, Correlate gives back a list of the most-correlated search terms. Those lists look like this:
These lists were generated using vote-share data that excluded Ohio (Kasich’s home state) and Wyoming (an unusually nonrepresentative voting process, even by the standards of caucuses). That was N.’s decision, after playing around with the data a bit.
The way to read the table is as follows: state-by-state, Trump support is correlated with the frequency of “DeGrassi season 13” with a correlation coefficient of 0.7438. Why? If a term shows up on this list, it doesn’t necessarily mean that the person doing the search supports Candidate X. It could also mean that relatives or neighbors of Candidate X’s voters tend to make that search.
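To make "correlation coefficient" concrete, here is a small sketch of the textbook Pearson correlation between two state-by-state series. This is an assumption about the kind of number being reported; the post does not describe Correlate's internal computation, and the values below are made up.

```python
import math

def pearson(x, y):
    """Textbook Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, one entry per state, in the same state order.
vote_share = [0.51, 0.19, 0.36, 0.30, 0.31]
search_popularity = [1.2, -0.8, 0.3, 0.1, -0.2]

print(round(pearson(vote_share, search_popularity), 4))
```

A value near +1 means the search term is popular in the same states where the candidate does well. It says nothing about who is doing the searching.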
To return to the GWAS analogy, such indirect connections are essential to genomic analysis: the snippet of DNA that is tracked is usually close to, but hardly ever identical to, the snippet that causes a trait. In genetics, even when there is a causal connection, it is not so obvious what is going on. For example, one gene, whose protein was thought to be mainly for how blood cells adhere to one another, turns out to be important in how synapses adhere as well – with implications for schizophrenia. The point is, we should be careful to avoid overinterpreting or overgeneralizing from the search terms above. But we should also keep our minds open about what we find!
Now, to examine some of these “hits”:
John Kasich. Places that like Kasich are richer in some fairly policy-wonkish search terms: “net cost,” “renewable portfolio standard,” the economist Joseph Stiglitz, Financial Times writer Martin Wolf, and Vox writer Dylan Matthews. These terms have a ring of plausibility. They might be good fodder for small talk…if you are talking with a Kasich supporter!
But then there are terms that I don’t entirely understand: Route 73 and Haven Pizza. Maybe someone can explain those to me. It is also true that with billions of search terms to choose from, occasionally a correlation will arise by chance. These might be false positives.
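The false-positive worry can be illustrated with a toy simulation: correlate one random "vote share" pattern against many random "search terms" and watch the best match improve as the pool grows. Everything here is synthetic noise, and the number of states and terms are arbitrary assumptions.

```python
import math
import random

def pearson(x, y):
    """Textbook Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n_states = 25    # assumed number of states that have voted
n_terms = 5000   # tiny compared with the real pool of search terms

# One random target pattern and many unrelated random "terms".
target = [random.gauss(0, 1) for _ in range(n_states)]
correlations = [
    pearson(target, [random.gauss(0, 1) for _ in range(n_states)])
    for _ in range(n_terms)
]

best_of_10 = max(correlations[:10])
best_of_all = max(correlations)
print(f"best of 10 random terms:   {best_of_10:.3f}")
print(f"best of {n_terms} random terms: {best_of_all:.3f}")
```

With only a few dozen states, pure noise can produce respectable-looking correlations once enough terms are searched, which is why individual hits deserve skepticism.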
Ted Cruz. Many Cruz-related search terms are related to domestic life of a certain kind: family photos, felt Christmas stockings, scentsy plug ins, balloon animals, Baby Trend car seats, and DIY cribs. Easy enchiladas are particularly Cruz-y. Mmmm, enchiladas. And udder covers…I wasn’t expecting that one. Maybe the Cruz campaign could start distributing Cruz-themed udder covers!
Donald Trump. Note that the correlations are weaker. That could be because Trump support is broad-based in the Republican Party. Or it could be that the connection between the voter and the Google-searcher is indirect (i.e. they are different individuals who live near one another).
At first, this list was quite puzzling. A prominent cluster of search terms is pop culture-related: Degrassi (that’s a TV show that focuses on the problems of teens), Kids’ Choice Awards, and Nickelodeon star Alexa Nikolas. And…“never had a boyfriend.” Wait, what? I thought Trump voters skewed older.
N.’s coworker had a thought. According to correlations from Neil Irwin and Josh Katz at The Upshot, Trump supporters are abundant in communities where many people have not finished high school, or places with lots of mobile home dwellers and “old economy” jobs like manufacturing. She suggests that in some families, the children of Trump voters are the only ones in the house who use the Internet. And at least some of them are looking for entertainment and relationship advice online. I do not think it is totally advisable to ask Google about boyfriends…but these searchers are helping us to predict voting patterns. So…you go, girls!
>>>
The final step in making a prediction is converting these search terms to predicted vote share. Each of the search terms has a correlation score attached to it. One could do a weighted sum using those scores…but they are not that different, so N. calculated a simple average of the terms’ coefficients, state by state. She then did a simple linear regression between that average and the vote-share for that candidate. The resulting linear fit is a formula that can be used to predict vote share in places that have not voted yet. N. repeated this for the other two candidates. The result is a set of inferred values of vote share – what a genomics researcher would call “imputed” values.
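Here is a rough sketch of that final step, under stated assumptions: each search term comes with a per-state value, those values are averaged into a single index per state, and an ordinary least-squares line is fit between the index and the known vote shares. The function names, state labels, and numbers are all hypothetical; this is an illustration of the idea, not N.'s actual code.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def composite_index(term_scores):
    """Simple (unweighted) average of all terms' values, state by state."""
    states = set.intersection(*(set(s) for s in term_scores.values()))
    return {st: mean([scores[st] for scores in term_scores.values()])
            for st in states}

def predict_vote_share(term_scores, known_shares):
    """Fit on states that have voted, then impute the rest."""
    index = composite_index(term_scores)
    train = [st for st in known_shares if st in index]
    slope, intercept = fit_line([index[st] for st in train],
                                [known_shares[st] for st in train])
    return {st: intercept + slope * index[st]
            for st in index if st not in known_shares}

# Hypothetical per-state values for two search terms.
term_scores = {
    "term one": {"A": -1.0, "B": 0.0, "C": 1.0, "D": 2.0},
    "term two": {"A": -0.5, "B": 0.5, "C": 0.5, "D": 1.0},
}
# Hypothetical vote shares for states that have already voted.
known_shares = {"A": 0.325, "B": 0.425, "C": 0.475}

print(predict_vote_share(term_scores, known_shares))
```

Swapping the simple average for a correlation-weighted one would only require changing `composite_index`.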
And what about today’s primaries? The PEC poll estimate and the Correlate-based estimate are in close agreement:
Because none of these five states have voted yet, none of them went into the Google-Correlate approach. The two green columns of estimates were made independently of one another. Today we will see how these two methods do. They both suggest that in the East Coast primary, Donald Trump will get about 150 delegates (this includes 43 out of 54 district-level Pennsylvania delegates, whose estimated faithfulness is 0.8), over 85% of the total to be given today.
I have one final thought on the analogy with GWAS. These days, modelers who attempt to predict votes from demographic factors are relying on linkages that they suppose to exist between populations and voting behavior. The education/Trump connection is an example of that. In human genetics, that would be called a “candidate gene” approach. Such approaches can turn up a good result if the hypothesis happens to be true. Often, they lead to results that have not held up so well over time.
In both human genetics and political modeling, constructing a good model requires the modeler to be an extremely good guesser. Google Correlate has the potential to search millions of possibilities at once. I think a clever modeler could really go to town with this tool.
>>>
Democrats, I have not forgotten you. Here, as an added bonus, are search terms that correlate with Clinton and Sanders support. The age/race divide is quite apparent. Where Sanders supporters are found, you can evidently get quinoa soup, PBS Nova, and people who do not know what “wonky” means. And baked goods. A lot of baked goods. It is a veritable cornucopia of Stuff White People Like.
Near Clinton supporters it’s cheap bedroom furniture, Nicki Minaj fans, and pink hoverboard shoppers. And “career in” – Google auto-complete as a job counselor!
Update, April 26th, 11:00pm: In today’s East Coast primaries, Google Correlate did better than any poll-based method:
Very enjoyable and interesting post. About “haven pizza,” New Haven style pizza is a type of pizza that originated in Connecticut, and has spread out from there. https://en.wikipedia.org/wiki/New_Haven-style_pizza How that connects with John Kasich, I’m not quite sure.
This is mindbendingly interesting. It’s rare to find an analysis that is so completely new. I admit I had expected search terms high on the lists to be more obviously associated with the candidates (religion and abortion with Cruz, unemployment benefits with Trump, etc).
This is more like Neural Nets, where peeking at the values at the hidden nodes gives you almost no insight. One can come up with “just so” stories that sound pretty plausible (like Trump voters’ kids searching for help getting a boyfriend) but how to confirm that?
very creative technique. very similar also to attribution modeling.
if pollsters get data on browsing habits for respondents, they can run a relatively easy analysis to tie specific searches more concretely to voting habits. To Dr Wang’s point though, this still would be correlational, not causal.
Interesting. Most pundits think Trump will win around 100 delegates, +/- 5 or 6 today. If he wins 154 delegates, he will cross the psychologically important mark of 1000 before the race goes to Indiana.
Wow. When sentient computers arrive, we won’t be able to understand them.
Very interesting. Many thanks for publishing it. Why did you not publish predictions for Bernie and Hillary using the same method?
Because I think that race is over.
It would be nice to see the Dem predictions not because we can’t see the ultimate result in the race but because they might help test the model.
Do you have the text files with the candidates vote shares available somewhere?
A couple of posts ago Sam posted the .csv files for Trump, Cruz, Kasich. The post is titled “Two independent ways of predicting GOP primaries lead to highly similar forecasts”. If you go to the comments, Sam responds to one of them by posting links to the files
What’s the Joe Stiglitz connection to Kasich? Or is it just he’s most popular in districts with liberals?
I’m still a little bit confused with the final step to make predictions. So I know that “Degrassi season 13” is correlated with Trump at 0.7438. I then average this with the corr coefficients of the term for the other candidates?
I’m really missing this final step. Anybody wants to explain?
When he says “Each of the search terms has a correlation score attached to it.” he meant a relative popularity score of the search term for each state, not what he shows in the picture. you can download an example csv here: https://www.google.com/trends/correlate/csv?e=id%3Au4WEFVag_a7&t=all Those state by state popularity numbers are linearly correlated to the state by state vote shares. You then take an average over all of the 100 search terms for each state, regress that average against the vote share, and then use the average and the regression to predict the unknown states.
That link doesn’t seem to work, try this one and click “Export data as CSV” https://www.google.com/trends/correlate/search?e=id:u4WEFVag_a7&t=all
I cannot vote for a man correlated to quinoa soup.
white quinoa (more mushy) or
red quinoa(more crunchy)
I think it’s more likely that the baked goods are strongly correlated with each other despite people searching for them independently? So because he’s correlated with one of them, he’s closely correlated with many of them.
(Also, this wouldn’t be affected by another Sanders – what we’re getting correlations for is the voting percentage pattern.)
Colonel Sanders?
Intuitively, I prefer your modified neighbor-joining technique. But that’s probably because I don’t completely understand this one.
Route 73 is probably not a random correlate with Kasich. Route 73 is a stretch of scenic highway in southern Ohio.
I’d be interested in seeing Clinton Sanders with open primary/caucus vs closed. With his huge advantage with independent voters (the majority of voters) this would be much more informative and predictive.
Google correlate is pure comedy.
However, many search terms [pizza, slang, pop culture, highways with sunset views] may tie in with student demographics. Surely one of the main reasons to research Kasich is a compare and contrast school assignment. Extra credit exam essay _with_ browsing privileges? Whatevs.
So is this person doing basically the same technique?
https://www.reddit.com/r/hillaryclinton/comments/4fqofz/candied_nuts_predictions_update_all_hail_candied/
You search for “wonkish” when you don’t know what it is OR when you want to read something wonkish…
You search for ‘wonkish’ when you’re reading Paul Krugman’s blog for the first time. Wanna bet?
Udder covers are a brand of nursing cover. So they VERY much fit in with the family searches (esp for a conservative mother who would be more likely to cover up while breastfeeding)