Princeton Election Consortium

A first draft of electoral history. Since 2004

Two independent ways of predicting GOP primaries lead to highly similar forecasts

April 22nd, 2016, 3:15pm by Sam Wang

Today’s update: Trump median 1285 delegates (IQR: 1239-1320). Probability of a pledged majority is 75%.

Despite the usual complaints, primary polls do reasonably well when aggregated. For understanding a state, having no polls at all is far worse. As the joke goes, “That restaurant’s food is terrible. And such small portions!”

Unfortunately, we have no public polls for the Republican primary in Indiana (update – just in, we have Trump +6%, very close to both of today’s estimates, Trump +7% and Trump +5%). Indiana is pivotal to whether Donald Trump can get to a majority of pledged delegates. You’d think data pundits would rush to fill this void. But that has not been the case.

For any data pundit, the absence of polling is a serious problem when the race is anywhere close to a tie. At The New York Times, The Upshot has made a demographics-based effort, but I believe that calculation missed Wisconsin (and lacks details). The Great Argental Satan seems to favor Cruz for Indiana in a fairly vague way. He and his staff make extremely weak use of demographics-based analysis, perhaps appropriately so; as far as I am aware, their approach is not strong enough to repair inaccurate polling (for instance, in the Michigan Democratic primary). For better performance, what is needed is a method that uses state-level information more specific than general demographic composition.

Which brings us to today’s topic. I will show you two independent methods for estimating Trump/Cruz/Kasich support without demographics or polls. This is a long post.

Bottom line: the two methods agree in all important respects. Trump is favored in the remaining Eastern states, including Indiana. Cruz is favored in the remaining states west of the Mississippi (Washington, Oregon, Nebraska, and South Dakota). The only point of disagreement is New Mexico. Both methods indicate that Trump is on a path to more than 1237 pledged delegates.

The first method is the border-county-based analysis that I use for PEC’s calculation (see the banner above). The second method was suggested to me by PEC reader N., using Google Correlate.

As I wrote the other day, all four states surrounding Indiana (Illinois, Michigan, Kentucky, and Ohio) have already voted. Looking at 17 previous states: if Cruz finished ahead of Trump in more border counties, he always won; if Trump had more border counties, he almost always won. Overall, this measure is correct in 15 of 17 cases (88%), and one miss, Wisconsin, is right at the threshold.
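As a sketch, the border-county tally can be computed like this. The county names and margins below are invented for illustration; they are not the actual data behind the 15-of-17 figure:

```python
# Hypothetical Trump-minus-Cruz margins (percentage points) in the
# counties of neighboring states that border the state being predicted.
# All names and numbers are made up for illustration.
border_margins = {
    "County A": +4.0,
    "County B": -2.5,
    "County C": +7.1,
    "County D": +1.2,
    "County E": -0.8,
}

def border_county_call(margins):
    """Predict the primary winner as whichever candidate carried
    more of the border counties."""
    trump = sum(1 for m in margins.values() if m > 0)
    cruz = sum(1 for m in margins.values() if m < 0)
    if trump > cruz:
        return "Trump"
    if cruz > trump:
        return "Cruz"
    return "toss-up"

print(border_county_call(border_margins))  # Trump carries 3 of 5 counties
```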

Here is a full list that includes future states.

These are the margins that I am using to calculate the overall distribution of outcomes.


Now let’s consider a second, independent way of inferring the preferences of a state’s primary voters. Princeton Election Consortium reader N. used Google Correlate, which is basically the inverse of Google Trends: you give it an overall pattern (for instance, a list of states with the corresponding vote shares), and it returns the search terms whose state-by-state patterns most closely match the input. N. then used those terms to predict states that haven’t voted yet.

The results agree closely with the border-county-based estimates above—including a narrow Trump win in Indiana. In some cases it goes out on a limb: it predicts Trump +53 in Delaware, making that state the Trumpiest in the country. The tool also predicts a similar Trump margin in Rhode Island.

To test the method, N. removed states from the data, then re-ran the correlations without them. This gives pretty good retroactive predictions. For example, for Wisconsin it predicts Cruz 46.6, Trump 38.5, Kasich 14.9, whereas the vote was Cruz 49.5, Trump 36.0, Kasich 14.5. So right there it performs as well as polls, and better than demographic models and pundits.

It also worked in an extreme state, Idaho:
Predicted: Trump 28.3% Cruz 61.8% Kasich 9.9%
Actual: Trump 34.7% Cruz 56.1% Kasich 9.2%

Here’s the Google Correlate-based prediction for the remaining states. N. used relative vote share between Trump/Cruz/Kasich here, so in early states where Rubio did well, the numbers don’t capture what the true vote shares would have been.

Using poll medians as an independent measure of voter opinion, we can see if these numbers pass a smell test. N.’s outputs have the same rank order (i.e., Trump first, then Cruz second, or whatever) in all six states (NY, CT, MD, PA, CA, NJ). The correlation coefficient is +0.87.

However, the exact percentages were not quite on target for this week’s election. In NY the polls were Trump 54%, Cruz 18%, Kasich 22%; the vote was 60%, 15%, 25%. In that case N.’s output was too favorable to Kasich and too unfavorable to Trump.
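For concreteness, such a smell test can be run by computing the Pearson correlation between poll medians and model outputs and checking whether both series imply the same ordering. The numbers below are placeholders, not the actual six-state data behind the +0.87 figure:

```python
import math

# Placeholder poll medians vs. model outputs (percent) for several
# state/candidate pairs; illustrative values only.
polls = [54, 18, 22, 48, 30, 19]
model = [50, 21, 26, 45, 33, 22]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rank-order agreement: do both series sort the entries the same way?
same_order = (sorted(range(len(polls)), key=polls.__getitem__)
              == sorted(range(len(model)), key=model.__getitem__))

print(f"r = {pearson(polls, model):.2f}, same order: {same_order}")
```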

At first, I was surprised to see Cruz favored in Oregon and Washington. Originally I had skipped doing the border-county analysis there because those states use proportional allocation. It didn’t seem critical. But wait! Six out of six counties bordering Washington favored Cruz over Trump. In Oregon, it’s 4 counties for Trump, 5 counties for Cruz. So the two methods basically match. That is awesome.

N. did curate the inputs a bit. She left out Ohio, because she found that “including it dramatically skewed Kasich’s correlates in a way that really didn’t look like it made sense.” And of course this would underestimate Kasich by a large margin. Taking the same leave-out approach for Texas didn’t change Cruz’s pattern much, consistent with Kasich’s status as a favorite-son candidate with highly focused geographic appeal.

I’ll let N. describe exactly what she did in her own words.  As you will see, it would be easy to replicate:

I just took the top 100 correlated search terms and did a state-by-state regression against vote share to fill in the remaining states. Specifically, what I did was:

(1) make three CSV text files listing the vote share for each candidate in each state so far (Note: I used “percentage of the Trump+Cruz+Kasich vote”—using percentage of the total vote would give different results in states where Rubio/etc did well. There are various ways you can handle that, with arguments for each.)

For example, make a text file called cruz.csv, like this:
New Hampshire,0.3351
for all the states for which you have data.

(2) upload them to Google Correlate’s state-map tool. There’s an easy “download CSV” button. It will treat any states not in the file as unknowns.

(3) download the CSV dump for the 100 most correlated search terms for each, which gives the popularity of each term by state.

(4) for all the remaining states that haven’t voted, take the simple average of the popularity (in the CSV) of all 100 search terms in that state, which gives an overall score. That score is (roughly) linearly related to vote share, since that is what Google Correlate was selecting for in each search term.

(5) do the linear regression back to vote share using those search term scores, which lets you fill in the unknown states.
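Steps (4) and (5) can be sketched minimally as follows. The state names, scores, and vote shares are synthetic stand-ins; the real inputs would be the per-state averages computed from the downloaded Google Correlate CSVs:

```python
# state -> (mean popularity across the 100 correlated terms, vote share)
# for states that have voted; synthetic illustration values.
voted = {
    "State A": (0.10, 0.30),
    "State B": (0.40, 0.45),
    "State C": (0.80, 0.62),
}
# Mean correlate scores for states that have not voted yet.
unvoted_scores = {"State D": 0.55, "State E": 0.25}

# Step (5): ordinary least-squares fit of vote share on score.
xs = [s for s, _ in voted.values()]
ys = [v for _, v in voted.values()]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Impute vote shares for the unvoted states from their scores.
predicted = {state: intercept + slope * score
             for state, score in unvoted_scores.items()}
for state, share in predicted.items():
    print(f"{state}: {share:.1%}")
```

The same machinery supports the leave-out validation described earlier: drop a voted state from the fit, refit on the rest, and compare the held-out prediction against that state's actual result.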


As an overall validation, let’s compare the outputs of the border-county and Google Correlate approaches. The two outputs show an amazing degree of correspondence:

The correlation coefficient is +0.76, and the two methods disagree on only one expected winner – in New Mexico, where N.’s approach gives a tiny advantage to Trump. Such a low level of disagreement is not bad at all.

I do note that the Google Correlate method tends to assign larger win margins. If these come true, Trump would have quite a decisive advantage in getting above 1,237 delegates.


N. sent me this analysis on Tuesday. Since that time, three Indiana polls have been leaked. They show a near-tie with Trump barely ahead. However, I would take this news with a grain of salt. I am suspicious of leaked data generally. Often it’s spin from the trailing side, to make things look closer than they are. (Update: More importantly, a WTHR/HPI poll was just released showing Trump ahead by 6%. This closely matches our two methods, which average Trump +5%.)

Both our methods face several real tests in the coming weeks. Indiana might not be so hard. But from a modeling standpoint, the harder tests are Washington, Oregon, and New Mexico.

New Mexico is of particular interest. It is the only state where N.’s approach and mine give opposite results. Since Google Correlate can “reach into” the center of a state, I might actually believe its prediction slightly more — a near-tie, Trump +1%. New Mexico allocates its delegates proportionally, so the two models differ by only 4 delegates. In any event, New Mexico will be a useful test for comparing the two approaches.

Finally, to quote N. again, “I’ll also feel pretty good if there’s a Trump blowout in Delaware. Just because it’s such an odd little state in so many ways.” And on cue, here is a new survey: 55% for Trump, and an additional 12% undecided. If Trump gets many of those undecided voters, his total could end up pretty close to the Google Correlate-based prediction of 69%. Make Delaware Great Again!

Tags: 2016 Election · President

28 Comments so far

  • Phil

    You’re forgetting one major thing. This is America; we don’t live in a democracy. It’s an oligarchy (according to your fellow Princeton faculty members Gilens and Page, in their 2014 publication), and they have made the system so we only have an illusion of choice. Most of Penn’s delegates are unbound, and if you don’t think the GOP will literally bribe them, you should think again.

    WV is also likely a win-the-state, lose-the-delegates trap for Trump.

  • Bill Herschel

    “For instance, the sudden movement to House Republicans in the 3rd week of September 2014. There was no obvious correlated event, but Google might know!”

    I will bet a million dollars that the correlated event was Obama’s refusal to close the borders to travelers from Africa on account of Ebola.

    Obama went with the CDC and was right. The electorate went with TV terror and elected a Republican Congress.

    We have an Ebola Congress.

  • Marc Shepherd

    For the “boundary counties” test, does it matter how many boundaries there are?

    Washington State is bounded by Canada, the ocean, Oregon, and Idaho. Oregon hasn’t voted yet. Canada and the ocean will never vote. I am somewhat skeptical that Idaho has much predictive power over the entire state of Washington. Maybe eastern Washington, but not the whole state.

    • Sam Wang

      Your concern is probably minor. SD of most states across districts is ~5%. Go look up 2012 results at The Green Papers.

      An exception might occur if a party’s strength is far from the border. NY Dems for example. Borders may be better suited for GOP primaries.

  • Shelly

    You may not think well of Nate Silver’s methods, but at least he puts his projections for every contest out there. I used to count on you for every state, but where are you? We need your work!

    • Sam Wang

      I actually do think he provides a service. It is not how I would do it, but when he does the numerical part, it is good enough for consumers. Lately he seems not to be doing it so much. He’s become more like the other pundits.

      Plus, he is paid for the work. For me this is a secondary activity to my main work. As the free alternative, I can update things as they seem needed…but I’ll never be comprehensive.

    • Matt McIrvin

      Sam’s big thing has always been to aggregate state poll data to get a picture of the general election campaign (especially during presidential years). We’re only now getting close to the point in the 2016 cycle where it makes sense to start doing that. The primary campaigns aren’t completely settled, though they’re getting close. A lot of states don’t actually have any general-election polls yet, and in many others the data are stale and sparse.

  • Henry Wilton

    Perhaps I misunderstood N.’s method, but isn’t it essentially the definition of overfitting? (Obviously the fact that the other method seems to support its conclusions is somewhat reassuring.)

    • Sam Wang

      There is a risk of that. Let me quote from some of my correspondence with her. I had suggested the possibility of extracting error bars from her last step of the fit, or by comparing outputs with outcomes in states that have voted. The reply:

      Given how much potential for overfitting there is in selecting from the billions of possible search terms, the uncertainty comes not from the regression strength, but from the uncertainty in whether the correlation holds up outside the states-that-voted-so-far data set….You could check against the vote shares in the previous states, to get the correlation to put error bars on it, but I really think that would be almost guaranteed to be underestimating the real error, because of the overfitting problem I mentioned, and I don’t know by how much. Which is of course right at the heart of the idea of error in the first place! A better solution there is to try removing individual states or groups of states from the dataset and see how well they’re predicted if you run the whole process on the others.

      This is why I quoted those results for Wisconsin and Idaho. Also, toward the end I emphasize the match between the two independent methods, and also why I did not do more with the exact amplitude of the imputed margins.

  • Frank

    For all intents and purposes, the general campaign starts May 1. “The cake is baked” is the common metaphor, but I liken it more to “the concrete is setting.” On the GOP side, they will need to take a jackhammer to it in Cleveland.

    As for the other side, I would warn against being overconfident. Six months from May to Election Day is a long time. I predict Trump will close the gap to within 2-3% by November 1.

  • Amitabh Lath

    Amazing. Just amazing. This is paradigm shifting work. N. is a genius.

    And for two completely different techniques to give similar answers is remarkable.

    I am burning to know what these search terms are. How much do they vary by state? Are there one or two or a dozen that have more pull than the rest, or do all 100 contribute equally?

    Are there any search terms that have a large weight but are completely counterintuitive, i.e., non-political, something you wouldn’t otherwise associate with primary voting?

    • Sam Wang

      Well, you can get them yourself. I will probably run some of these and post them.

      (Update: can’t get it to work. Will try again later. Or someone do it and send it to me? Files are here: Trump Cruz Kasich)

    • Amitabh Lath

      I wonder if the phrases like “Cruz mathematically eliminated” have a large negative weight on his support. There are polls that indicate that Republican primary voters care about convention shenanigans but how is that connected to voting?

      Google correlate can tell you!

      In fact N.’s technique has potential to shine a light on several mysterious sudden moves that polling cannot catch.

      For instance, the sudden movement to House Republicans in the 3rd week of September 2014. There was no obvious correlated event, but Google might know!

  • Lorem

    Two (probably) good new ideas at once! Excellent! Looking forward to seeing them validated.

    Am I correct in reading that N’s calculation does not weight the 100 different search terms by their degree of correlation with outcomes (or by rank, if that’s not available)? If so, why not? That seems like a plausible thing to do.

    • Stanlee

      Anything but equal weights would greatly increase the risk of overfitting on a spurious correlation with a search term or two. You’re already tempting fate by taking 100 out of billions, but at least averaging them gives you a chance to extract signal from any real predictors. It might work even better with a larger set of correlates, if Google can output it, since the larger the number, the better the odds that the false predictors cancel each other out when averaged.

    • Sam Wang

      Apparently, the strengths of all the correlations were pretty similar—within about 7% between the most- and least-correlated term. Generally seems like a good idea though.

  • truedson

    Fascinating stuff. Changing the subject slightly: some guy from North Dakota on the RNC committee was on CNBC today claiming that even if Trump got to 1237, it was irrelevant. He was going to propose an amendment that if someone got 1 delegate vote, they were eligible to win on the first ballot! So 1 delegate is great, but a majority is useless….

  • Bill Herschel

    I think that what came out of New York or, if you will, did not come out of New York was the death of Ted Cruz. He’s finished.

    Thus, even if Trump is shy of a majority, I think it becomes impossible to deny him the nomination. The damoiselle obèse has sung.

    As for the Great Argental Satan, he let his feelings get in his way. He may be finished too.

  • LondonBob

    The problem for the Plains and Mountain West is that prior elections were caucuses, and many of those states had big Mormon populations. SD, Montana, and Nebraska are all primaries.

    Surprised how accurate “I Side With” has been for areas of Cruz strength.

    • LondonBob

      The Idaho results in the non-Mormon west of the state suggest a much closer race in that region. Hopefully we will get a poll from there soon.

  • Scott Dougan

    Correction: there have been several occasions this season on which polls-only and polls-plus disagreed. Usually the difference is only in magnitude; sometimes the two disagree on the predicted winner. Usually these disagreements sort themselves out by Election Day. And yes, Silver did build a demographic model for this election, which they say has been more predictive than the polls. For Indiana, in fact, they favored Cruz in a vague way because Illinois, Ohio, and Kentucky were weak states for Trump. Now that polling is in, I assume that will affect their analysis.

    • Sam Wang

      Well, of course they sort out by Election Day. They have to. This assertion has an element of untestability – if polls overcome the model, then what is the procedure for validating the model?

      In regard to Indiana, that sounds like baloney to me. Trump led Cruz by an average of 10.5 points in those three states.

  • Sam Lively

    The polls-plus forecast of California did disagree markedly with the polls-only forecast up until the latest polls blew Trump’s Cali lead into the double digits.

    • Sam Wang

      And then once polls are in, polls win. The point is that those forecasts have an awfully large uncertainty.

      My point today is that I am presenting two methods that are completely independent, yet point in the same direction. I find that interesting.

  • Joe

    I think this might be a transcription error re: New Mexico – I see a 1-point win in the above chart and a 36-point win in the below chart. Possibly NJ got transcribed as NM? That would also create much less distinction between the methods.
