Two independent ways of predicting GOP primaries lead to highly similar forecasts

April 22nd, 2016, 3:15pm by Sam Wang

Today’s update: Trump median 1285 delegates (IQR: 1239-1320). Probability of a pledged majority is 75%.

Despite the usual complaints, primary polls do reasonably well when aggregated. To understand a state, it is far worse to have no polls at all. As the joke goes, “That restaurant’s food is terrible. And such small portions!”

Unfortunately, we have no public polls for the Republican primary in Indiana (update – just in, we have Trump +6%, very close to both of today’s estimates, Trump +7% and Trump +5%). Indiana is pivotal to whether Donald Trump can get to a majority of pledged delegates. You’d think data pundits would rush to fill this void. But that has not been the case.

For any data pundit, the absence of polling has been a serious problem if the question is anywhere close to a tie. At the New York Times, The Upshot has made a demographics-based effort, but I believe that calculation missed Wisconsin (and lacks details). The Great Argental Satan seems to favor Cruz for Indiana in a fairly vague way. He and his staff make extremely weak use of demographics-based analysis, perhaps appropriately so; as far as I am aware, their approach is not strong enough to repair inaccurate polling (for instance, the Michigan Democratic primary). For better performance, there is a need for a method that uses state-level information that is more specific than general demographic composition.

Which brings us to today’s topic. I will show you two independent methods for estimating Trump/Cruz/Kasich support without demographics or polls. This is a long post.

Bottom line: the two methods agree in all important respects. Trump is favored in the remaining Eastern states, including Indiana. Cruz is favored in the remaining states west of the Mississippi (Washington, Oregon, Nebraska, and South Dakota). The only point of disagreement is New Mexico. Both methods indicate that Trump is on a path to more than 1237 pledged delegates.

The first method is the border-county-based analysis that I use for PEC’s calculation (see the banner above). The second method was suggested to me by PEC reader N., using Google Correlate.

As I wrote the other day, all four states surrounding Indiana (Illinois, Michigan, Kentucky, and Ohio) have already voted. Looking at 17 previous states, if Cruz finished ahead of Trump in more border counties, then he always won. If Trump had more border counties, then he almost always won. Overall, this measure is correct in 15 out of 17 states, or 88% of the time. One miss, Wisconsin, is right at the threshold.

Here is a full list that includes future states.

These are the margins that I am using to calculate the overall distribution of outcomes.


Now let’s consider a second, independent way of inferring the preferences of a state’s primary voters. Princeton Election Consortium reader N. used Google Correlate, which is basically the inverse of Google Trends: you give it an overall pattern (for instance, vote share in a list, with the corresponding states), and it gives back what search terms have the most similar patterns as the input. N. then used these to predict states that haven’t voted yet.

The results agree closely with the border-county-based estimates above—including a narrow Trump win in Indiana. In some cases it goes out on a limb: it predicts Trump +53 in Delaware, making that state the Trumpiest in the country. The tool also predicts a similar Trump margin in Rhode Island.

To test the method, N. removed states from the data, then re-ran the correlations without them. This gives pretty good retroactive predictions. For example, for Wisconsin it predicts Cruz 46.6, Trump 38.5, Kasich 14.9, whereas the vote was Cruz 49.5, Trump 36.0, Kasich 14.5. So right there it performs as well as polls, and better than demographic models and pundits.

It also worked in an extreme state, Idaho:
Predicted: Trump 28.3% Cruz 61.8% Kasich 9.9%
Actual: Trump 34.7% Cruz 56.1% Kasich 9.2%
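To put a number on "pretty good," here is a minimal sketch that scores the two leave-out predictions quoted above. The figures are the ones in this post; the function names and the choice of mean absolute error as the yardstick are my own illustration, not necessarily how N. evaluated her results.

```python
# Leave-out predictions vs. actual three-way vote shares, as quoted in the post.
HOLDOUTS = {
    "Wisconsin": {
        "predicted": {"Cruz": 46.6, "Trump": 38.5, "Kasich": 14.9},
        "actual": {"Cruz": 49.5, "Trump": 36.0, "Kasich": 14.5},
    },
    "Idaho": {
        "predicted": {"Trump": 28.3, "Cruz": 61.8, "Kasich": 9.9},
        "actual": {"Trump": 34.7, "Cruz": 56.1, "Kasich": 9.2},
    },
}


def mean_abs_error(predicted, actual):
    """Average absolute miss, in percentage points, across candidates."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


def winner(shares):
    """Candidate with the largest share."""
    return max(shares, key=shares.get)


def summarize(holdouts):
    """Per-state error and whether the predicted winner matched the real one."""
    rows = {}
    for state, d in holdouts.items():
        rows[state] = {
            "mae": round(mean_abs_error(d["predicted"], d["actual"]), 2),
            "winner_correct": winner(d["predicted"]) == winner(d["actual"]),
        }
    return rows


if __name__ == "__main__":
    for state, row in summarize(HOLDOUTS).items():
        print(state, row)
```

Both held-out states get the winner right; the average miss per candidate is about 2 points in Wisconsin and about 4 points in Idaho.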

Here’s the Google Correlate-based prediction for the remaining states. N. used relative vote share between Trump/Cruz/Kasich here, so in early states where Rubio did well, the numbers don’t capture what the true vote shares would have been.

Using poll medians as an independent measure of voter opinion, we can see if these numbers pass a smell test. N.’s outputs have the same rank order (i.e. Trump first, then Cruz second, or whatever) in all six states (NY, CT, MD, PA, CA, NJ). The correlation coefficient is +0.87.

However, the exact percentages were not quite on target for this week’s election. In NY the polls were Trump 54%, Cruz 18%, Kasich 22%; voting was 60%, 15%, 25%. In that case N.’s output was too favorable to Kasich/unfavorable to Trump.
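The same kind of smell test can be run on the New York numbers just quoted. The sketch below rescales the poll figures to a Trump+Cruz+Kasich share (the convention N. describes for her inputs), checks rank order, and reports the per-candidate miss. Treating the quoted vote figures as already being three-way shares is my assumption (they sum to 100), and the helper names are hypothetical.

```python
def three_way_share(raw):
    """Rescale percentages so the listed candidates sum to 100."""
    total = sum(raw.values())
    return {name: 100.0 * value / total for name, value in raw.items()}


def rank_order(shares):
    """Candidates sorted from highest to lowest share."""
    return sorted(shares, key=shares.get, reverse=True)


# New York figures as quoted in the post.
NY_POLLS = {"Trump": 54.0, "Cruz": 18.0, "Kasich": 22.0}  # sums to 94
NY_VOTE = {"Trump": 60.0, "Cruz": 15.0, "Kasich": 25.0}   # sums to 100

poll_share = three_way_share(NY_POLLS)
vote_share = three_way_share(NY_VOTE)  # a no-op here, kept for symmetry

same_order = rank_order(poll_share) == rank_order(vote_share)
# Positive means the candidate did better in the vote than in the polls.
misses = {c: round(vote_share[c] - poll_share[c], 1) for c in NY_VOTE}

if __name__ == "__main__":
    print("same rank order:", same_order)
    print("vote minus polls (three-way share):", misses)
```

On a three-way basis the New York polls had the order right, with Cruz overstated by about 4 points and Trump and Kasich each understated by roughly 2 to 3 points.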

At first, I was surprised to see Cruz favored in Oregon and Washington. Originally I had skipped doing the border-county analysis there because those states use proportional allocation. It didn’t seem critical. But wait! Six out of six counties bordering Washington favored Cruz over Trump. In Oregon, it’s 4 counties for Trump, 5 counties for Cruz. So the two methods basically match. That is awesome.
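For readers who want the border-county rule spelled out, here is a minimal sketch of the winner call: count the neighboring counties each candidate carried and pick whoever has more. The county counts are the ones given above for Washington and Oregon; the function name and the handling of ties are my own assumptions, and this covers only the winner call, not the margins used in the delegate calculation.

```python
def border_county_call(trump_counties, cruz_counties):
    """Call a state for whoever led in more of its bordering counties."""
    if cruz_counties > trump_counties:
        return "Cruz"
    if trump_counties > cruz_counties:
        return "Trump"
    return "tie"  # assumption: the post does not say how ties are handled


# Counts of bordering counties won by each candidate, as stated in the post.
BORDER_COUNTS = {
    "Washington": {"Trump": 0, "Cruz": 6},
    "Oregon": {"Trump": 4, "Cruz": 5},
}

calls = {
    state: border_county_call(c["Trump"], c["Cruz"])
    for state, c in BORDER_COUNTS.items()
}

# Track record quoted earlier in the post: right in 15 of 17 states.
hit_rate = 15 / 17

if __name__ == "__main__":
    print(calls)
    print(f"historical hit rate: {hit_rate:.0%}")
```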

N. did curate the inputs a bit. She left out Ohio, because she found that “including it dramatically skewed Kasich’s correlates in a way that really didn’t look like it made sense.” And of course this would underestimate Kasich by a large margin. Taking the same leave-out approach for Texas didn’t change Cruz’s pattern much, consistent with Kasich’s status as a favorite-son candidate with highly focused geographic appeal.

I’ll let N. describe exactly what she did in her own words. As you will see, it would be easy to replicate:

I just took the top 100 correlated search terms and did a state-by-state regression against vote share to fill in the remaining states. Specifically, what I did was:

(1) make three CSV text files listing the vote share for each candidate in each state so far (Note: I used “percentage of the Trump+Cruz+Kasich vote”—using percentage of the total vote would give different results in states where Rubio/etc did well. There are various ways you can handle that, with arguments for each.)

For example, make a text file called cruz.csv, like this:
New Hampshire,0.3351
for all the states for which you have data.

(2) upload them to Google Correlate’s state-map tool. There’s an easy “download CSV” button. It will treat any states not in the file as unknowns.

(3) download the CSV dump for the 100 most correlated search terms for each, which gives the popularity of each term by state.

(4) for all the remaining states that haven’t voted, take the simple average of the popularity (in the CSV) of all 100 search terms in that state, which gave an overall score. That score is (roughly) linearly correlated with vote share, since that’s what Google Correlate was looking for in each search term.

(5) do the linear regression back to vote share using those search term scores, which lets you fill in the unknown states.
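Steps (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not N.'s actual code: the dictionaries stand in for the downloaded CSVs (I have not reproduced Google Correlate's file layout), the state and term names are placeholders, the numbers are made-up toy values, and the fit is an ordinary least-squares line. In practice you would run it once per candidate.

```python
from statistics import mean


def state_scores(term_popularity):
    """Step 4: average the popularity of all search terms within each state.

    term_popularity maps search term -> {state: popularity}.
    """
    states = set.intersection(*(set(by_state) for by_state in term_popularity.values()))
    return {
        s: mean(by_state[s] for by_state in term_popularity.values())
        for s in sorted(states)
    }


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fill_unknown_states(term_popularity, known_share):
    """Step 5: regress vote share on score, then predict states with no vote yet."""
    scores = state_scores(term_popularity)
    voted = [s for s in scores if s in known_share]
    slope, intercept = fit_line(
        [scores[s] for s in voted], [known_share[s] for s in voted]
    )
    return {
        s: slope * scores[s] + intercept for s in scores if s not in known_share
    }


# Toy inputs (hypothetical): two search terms, three states that have voted, one that has not.
TOY_TERMS = {
    "term_1": {"State A": 0.0, "State B": 1.0, "State C": 2.0, "State D": 3.0},
    "term_2": {"State A": 2.0, "State B": 3.0, "State C": 4.0, "State D": 5.0},
}
TOY_KNOWN_SHARE = {"State A": 0.2, "State B": 0.3, "State C": 0.4}

if __name__ == "__main__":
    print(fill_unknown_states(TOY_TERMS, TOY_KNOWN_SHARE))
```

Because each candidate is fit separately, the three predicted shares for a state need not sum to exactly 1; whether and how to renormalize them is a choice left to whoever replicates this.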


As an overall validation, let’s compare the outputs of the border-county and Google Correlate approaches. The two outputs show an amazing degree of correspondence:

The correlation coefficient is +0.76, and the two methods disagree on only one expected winner: New Mexico, where N.’s approach gives a tiny advantage to Trump. Such a low level of disagreement is not bad at all.

I do note that the Google Correlate method tends to assign larger win margins. If these come true, Trump would have quite a decisive advantage in getting above 1,237 delegates.


N. sent me this analysis on Tuesday. Since that time, three Indiana polls have been leaked. They show a near-tie with Trump barely ahead. However, I would take this news with a grain of salt. I am suspicious of leaked data generally. Often it’s spin from the trailing side, to make things look closer than they are. (Update: More importantly, a WTHR/HPI poll was just released showing Trump ahead by 6%. This closely matches our two methods, which average Trump +5%.)

Both our methods face several real tests in the coming weeks. Indiana might not be so hard. But from a modeling standpoint, the harder tests are Washington, Oregon, and New Mexico.

New Mexico is of particular interest. It is the only state where N.’s approach and mine give opposite results. Since Google Correlate can “reach into” the center of a state, I might actually believe its prediction slightly more: a near-tie, Trump +1%. New Mexico has proportional representation, so the two models give a difference of only 4 delegates. In any event, New Mexico will be a useful test for comparing the two approaches.

Finally, to quote N. again, “I’ll also feel pretty good if there’s a Trump blowout in Delaware. Just because it’s such an odd little state in so many ways.” And on cue, here is a new survey: 55% for Trump, and an additional 12% undecided. If Trump gets many of those undecided voters, his total could end up pretty close to the Google Correlate-based prediction of 69%. Make Delaware Great Again!

