Obama: "Next year someone else will be standing here in this spot, and it's anyone's guess who she will be." #WHCD pic.twitter.com/XVFxZUzKeJ

— Huffington Post (@HuffingtonPost) May 1, 2016

General-election matchup polls (*e.g.* Clinton v. Trump) started to become informative in February. In May, they tell us quite a lot – and give a way to estimate the probability of a Hillary Clinton victory.

First, let us examine the primary evidence. Wlezien and Erikson have gathered presidential preference polls from 1952-2008:

These graphs show that during the year of the general election, polls gradually converge to a point that is close to the actual November outcome.

Wlezien and Erikson expressed their findings in terms of correlation coefficients. In early February (about 280 days from the election), the correlation between polls and November outcomes is +0.2, where 0.0 corresponds to no relationship and +1.0 indicates a perfect relationship. The correlation rises to +0.9 by October. However, this measure is not easily used by consumers of polls.

Instead, a more intuitive measure is how far polls tend to move over time.

To calculate this box-and-whisker plot I also included 2012 data (spreadsheet here). Positive values indicate that the Democratic candidate did worse in November than in polls. The box indicates the interquartile range, *i.e.* the middle 50%, and the whiskers indicate the range. The red points indicate two outliers: the elections of 1964 (Johnson v. Goldwater) and 1980 (Carter v. Reagan v. Anderson). In both of those years, May polls overestimated support for the Democratic candidate by over 10 percentage points. For obvious reasons, Republican-leaning pundits like to write about 1980. But that is one case out of 16 elections.

Instead of such cherry-picking, it is more accurate to include these outliers as part of an analysis of all 16 elections. The full range and estimated standard deviation of poll-outcome differences look like this:

On average, polls have little or no bias relative to November, but have some variation, which is what we care about. That variation is quantified by the standard deviation (SD). I estimated SD using median absolute deviation (MAD), and verified this approach using interquartile range divided by 1.35. For March and April, the standard deviation is around 4 percentage points.
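For readers who want to try the robust SD estimate described above, here is a minimal Python sketch. The list `diffs` is made-up illustrative data, not the actual poll-minus-outcome values from the spreadsheet, and the function names are my own. The constant 1.4826 is the usual factor that converts MAD to an SD for normally distributed data; 1.35 is the IQR-to-SD ratio mentioned in the text.

```python
import statistics

def sd_from_mad(values):
    """Estimate SD as 1.4826 times the median absolute deviation from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return 1.4826 * mad

def sd_from_iqr(values):
    """Cross-check: interquartile range divided by 1.35."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 1.35

# Hypothetical poll-minus-outcome differences (percentage points), with one outlier.
diffs = [-4, -2, -1, 0, 1, 2, 4, 11]

print("SD from MAD:", round(sd_from_mad(diffs), 2))
print("SD from IQR:", round(sd_from_iqr(diffs), 2))
print("Ordinary sample SD:", round(statistics.stdev(diffs), 2))
```

The point of the comparison is that a single outlier inflates the ordinary sample SD far more than it moves either robust estimate, which is presumably why a robust estimator suits a 16-election sample containing 1964 and 1980.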

The November outcome should be within 1 SD of current polls approximately two-thirds of the time. Hillary Clinton’s polling margin over Donald Trump is currently +8% (median of 19 pollsters since mid-March) – twice the standard deviation. Based on past years, how likely is it that Trump can catch up? It is possible to convert Clinton’s lead to a probability using the t-distribution*, which can account for outlier events like 1964 and 1980. Using this approach, **the probability that Trump can catch up by November is 9%, and the probability that Clinton will remain ahead of Trump is 91%**.** This probability doesn’t take into account Electoral College mechanisms. But since the bias of the Electoral College is quite small, it does not make a difference in the calculation.

I should note that the polls have been telling us this information for some time. In the first half of March, Clinton led Trump by a median of 9 percentage points. Using an SD of 4.5 percentage points, her win probability would come out as 93%. So today’s estimate has been knowable for several months.

This is a result that may excite Democrats. However, it is subject to change. For example, the SD increases to about 7% in June, which combined with a lead of Clinton +8% corresponds to an 83% win probability, less certain than today. And of course the polls could change. I don’t know why polls would be less predictive in summer. Maybe general election campaign events drive polls away from where they would naturally go otherwise. Post-convention bounces would be examples of such events.
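The lead-to-probability conversion can be reproduced without MATLAB or Excel. The sketch below is a minimal Python version that assumes, as in the footnoted formulas, a t-distribution with 3 degrees of freedom; for that case the CDF has a simple closed form, so no statistics library is needed. The function names are my own.

```python
import math

def t3_cdf(t):
    """CDF of Student's t-distribution with 3 degrees of freedom (closed form)."""
    x = t / math.sqrt(3)
    return 0.5 + (x / (1 + x * x) + math.atan(x)) / math.pi

def win_probability(margin, sd):
    """Probability that the current leader is still ahead in November."""
    return t3_cdf(margin / sd)

def effective_sd(sd, systematic=2.0):
    """Combine poll-movement SD with an assumed systematic polling error."""
    return math.hypot(sd, systematic)

# The three cases discussed in the text: today, early March, and June.
for margin, sd in [(8, 4.5), (9, 4.5), (8, 7)]:
    print(f"lead {margin:+d}, SD {sd}: {win_probability(margin, sd):.0%}")

print("effective SD:", round(effective_sd(4.0), 2))
```

Run as written, the loop gives roughly 91%, 93%, and 83%, matching the figures quoted above, and the effective SD comes out near 4.5.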

This estimate is also independent of other factors, such as the state of the economy and Clinton and Trump’s net favorability/unfavorability. Most such factors should already be partially baked into the polls, and therefore might not add much information. Now that polls are predictive, they give us a more direct measure of what will happen in November.

**In MATLAB: prob=tcdf(clinton_trump_margin/4.5,3). In Excel: =1-TDIST(clinton_trump_margin/4.5,3,1)*

***Modified to allow for the possibility of systematic error in polls. I assumed that polls will be off systematically by +/-2%, even on Election Eve. Calculating the effective SD using the formula sqrt(SD^2 + 2^2) gives an effective standard deviation of 4.5% instead of 4%.*

Today I write about the PEC delegate snapshot. It is based on data posted here. All polls are current, including Trump +6% in Indiana (n=3 polls). Based on Tuesday’s voting, in which Cruz underperformed polls by a median of 4 percentage points, I will no longer assign a Cruz bonus. Note that Trump overperformed polls by a median of 8 percentage points.

As of today, for recently-unpolled states (NE,WV,OR,WA,MT,NM,SD) I will start using Google Correlate-based estimates. Of those states, Trump is favored in West Virginia (34 delegates) and is near-tied in Oregon and Washington (proportional representation). The rest are Cruz states.

Put through the PEC delegate simulator, the median delegate count is 1333 (interquartile range 1304-1339). The probability of getting to 1237 delegates is 98%:

What if we assume that Trump will lose Indiana? In that case the median drops to 1284 delegates (interquartile range 1278-1287). The probability of getting to 1237 is now 97%:

The 1% change in probability is inconsequential. The main effect of forcing a Cruz win in Indiana is to reduce uncertainty in the delegate count, which you can see in the narrowing of the histogram.

Close states (Oregon, Washington, and New Mexico) happen to use proportional rules, so they contribute very little uncertainty. Winner-take-all or nearly-winner-take-all (i.e. district-level rule) states are either strong Cruz (Nebraska, Montana, and South Dakota) or strong Trump (West Virginia, California, and New Jersey).

Most of the remaining uncertainty comes from district-level races in California. With California polls showing Trump +18% (Google Correlate says Trump +31%), it will take a highly coordinated effort by Cruz and Kasich to pick up many of its 53 districts. They would use geographic information like this Sextant Strategies survey to guide their efforts. At the moment, the likeliest outcome is for Trump to get at least 160 out of 172 delegates in the Golden State.

How do we know this? Two reasons. The first is that national polls have been stable for four weeks, since March 22. The second is the remarkable success of a predictive method based on Google Correlate, which relies solely on past voting and web search patterns – and does not use polls or demographics at all. Here is how PEC and *N.*’s Google Correlate method did:

PEC, using a simple approach based on polls and border counties, did well. So did Google Correlate, when its results were fed through the delegate rules process.

Even more remarkable is the Google Correlate-based estimate of vote share. The chart below uses Google Correlate-based predictions (“Google-Wide Association Study” results) based on data excluding the candidates’ home states, and combines both Democratic and Republican primaries:

I’ll update this as more information becomes available, especially if I can find demographics-based predictions. At this moment, the closest thing I have is “538 demo,” which indicates an early estimate made by FiveThirtyEight assuming some kind of state-by-state constant shift. Google Correlate did notably better.

The “media data pundit” state-of-the-art is demographics, which can give some predictive information, for instance in this year’s Democratic nomination. However, this year’s multicandidate Republican race has seemingly not lent itself well to such an approach. Demographic variables like %evangelicals are crude proxies for voter preference. Amazingly, Google search terms like “DeGrassi season 13” do much better. A future challenge is to understand why.

>>>

In today’s PEC update, Trump’s median projected delegate count is 1308 (interquartile range 1281-1330), with a 94% probability of getting to 1237. The histogram looks like this:

The pre-East Coast primary delegate estimate was 1303 (IQR 1271-1326, probability 90%). The main effect of yesterday’s voting was to reduce uncertainty. I think it is reasonable to say that yesterday’s voting ended any realistic doubts about Trump being the eventual nominee. That is on a par with previous Republican nomination races: Romney and McCain each came to be regarded by their party as the presumptive nominee in late April.

Since Google Correlate estimates larger margins than PEC does, it should give even less uncertainty. To demonstrate that, I have fed Google Correlate-based vote estimates into the PEC delegate estimator. Trump’s median projected delegate count then becomes 1334 (interquartile range 1306-1341), with a 99.6% probability of getting to 1237. The Google Correlate-based histogram looks like this:

I had briefly considered switching to Google Correlate-based estimates as the official PEC estimate. However, that approach does have a latent assumption that whatever new voting comes in is enough to capture any swings in the race. Before I take any such step, I have to think about the implications of that. Though come to think of it, polls have the same problem – less so in frequently-polled states, more so in rarely-polled states.

**Update, 9:00pm:** New poll medians/boundary-based estimates and Google Correlate predictions are here.

More calculations from *N.* after the jump.

*N.* has been playing around with dropping various states from the Google Correlate algorithm, and finds that dropping the three candidates’ home states tends to reduce the error of the imputation. *N.* found that distinguishing between primaries and caucuses didn’t help. She then estimated the overall error by dropping one additional state and calculating its imputed value, then repeating for four states that have voted so far (ID, MO, WI, and NC). The average error was 3.4%.

Here is the scorecard that *N.* is keeping.

For the record, I do not expect to do particularly well on N.’s measure of vote share. My goal this year was to estimate delegates.

**11:00pm:** *N.* may not be able to spell my name, but she can predict election results without using any polls at all.

>>>

In human genetics there is a form of analysis called a genome-wide association study (“GWAS”). In this kind of analysis, the researcher looks for bits of DNA that show up more often in people with some trait or disease. Motivations for doing this kind of study include (a) finding genetic variations that contribute to a condition, so they can be studied; and (b) providing a way of estimating the chance that a condition will occur. However, GWAS is full of challenges. One of my research interests is autism. Autism is strongly driven by combinations of genes, yet GWAS has only succeeded in identifying a small fraction of the risk. Many of these bits of DNA have all kinds of other effects (this is a project in my lab…and hey, I’m recruiting!).

The Google Correlate method for political prediction is analogous to GWAS…but better! In this analogy, Google search terms are the “genes.” Thousands (maybe millions) of Google search terms are statistically associated with the frequency at which a state votes for Donald Trump, Ted Cruz, John Kasich, Hillary Clinton, or Bernie Sanders. Some of these terms make intuitive sense; others are mind-bending.

I wrote about this idea the other day. (To learn how the method works, and to do it yourself, **read this first**.) Today I want to explain a little further. I will show some fascinating and often hilarious results.

Reader *N.* turned me on to Google Correlate, which is basically part of the engine behind Google Trends. Correlate takes a pattern you give it – baked bean sales by state, robbery rates over time, or whatever – and gives back the search terms that have a similar pattern. There are billions of search terms – similar to the number of DNA “letters” in the human genome.

*N.* created a text file of vote shares by state. Here is what the first lines of a file for Trump would look like:

Iowa,0.513

New Hampshire,0.186

South Carolina,0.357

Nevada,0.302

Alabama,0.306

Alaska,0.492

Arkansas,0.455

Georgia,0.347

Massachusetts,0.125

…

These Trump support numbers are fractions of the total Trump+Cruz+Kasich vote. Percentages are okay too, since Google Correlate rescales everything to a range of -1 to +1. Typos are *not* okay – Correlate is very unforgiving of misspelled state names.
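Here is one way such a file could be prepared, as a minimal Python sketch. The helper names, the `KNOWN_STATES` set (deliberately only a subset), and the vote counts passed to `three_way_share` are hypothetical; the two CSV rows are simply the first rows of the example file above. The state-name check is there because of the misspelling problem just mentioned.

```python
import csv
import io

# Subset of valid names for illustration; a real check would list every state.
KNOWN_STATES = {"Iowa", "New Hampshire", "South Carolina", "Nevada"}

def three_way_share(candidate_votes, all_three_votes):
    """Candidate's fraction of the combined Trump+Cruz+Kasich vote."""
    return candidate_votes / sum(all_three_votes)

def correlate_csv(rows):
    """Return CSV text of (state, share) rows, rejecting unknown state names."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for state, share in rows:
        if state not in KNOWN_STATES:
            raise ValueError(f"unrecognized state name: {state!r}")
        writer.writerow([state, round(share, 3)])
    return buf.getvalue()

# Hypothetical raw vote counts, only to show the share calculation.
print(three_way_share(50, [50, 30, 20]))

# First two rows of the example file shown above.
print(correlate_csv([("Iowa", 0.513), ("New Hampshire", 0.186)]))
```

The text returned by `correlate_csv` can be saved to a `.csv` file and uploaded; a misspelled name such as "New Hampshre" raises an error here rather than failing later at upload time.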

If you upload Trump, Cruz, and Kasich files, Correlate gives back a list of the most-correlated search terms. Those lists look like this:

These lists were generated using vote-share data that excluded Ohio (Kasich’s home state) and Wyoming (an unusually nonrepresentative voting process, even by the standards of caucuses). That was *N.*’s decision, after playing around with the data a bit.

The way to read the table is as follows: state-by-state, Trump support is correlated with the frequency of “DeGrassi season 13” with a correlation coefficient of 0.7438. Why? If a term shows up on this list, it doesn’t necessarily mean that the person doing the search supports Candidate X. It could also mean that relatives or neighbors of Candidate X’s voters tend to make that search.

To return to the GWAS analogy, such indirect connections are essential to genomic analysis: the snippet of DNA that is tracked is usually close to, but hardly ever identical to, the snippet that causes a trait. In genetics, even when there is a causal connection, it is not so obvious what is going on. For example, one gene, whose protein was thought to be mainly for how blood cells adhere to one another, turns out to be important in how synapses adhere as well – with implications for schizophrenia. The point is, we should be careful to avoid overinterpreting or overgeneralizing from the search terms above. But we should also keep our minds open about what we find!

Now, to examine some of these “hits”:

**John Kasich.** Places that like Kasich are richer in some fairly policy-wonkish search terms: “net cost,” “renewable portfolio standard,” the economist Joseph Stiglitz, *Financial Times* writer Martin Wolf, and *Vox* writer Dylan Matthews. These terms have a ring of plausibility. They might be good fodder for small talk…if you are talking with a Kasich supporter!

But then there are terms that I don’t entirely understand: Route 73 and Haven Pizza. Maybe someone can explain those to me. It is also true that with billions of search terms to choose from, occasionally a correlation will arise by chance. These might be false positives.

**Ted Cruz.** Many Cruz-related search terms are related to domestic life of a certain kind: family photos, felt Christmas stockings, scentsy plug ins, balloon animals, Baby Trend car seats, and DIY cribs. Easy enchiladas are particularly Cruz-y. Mmmm, enchiladas. And udder covers…I wasn’t expecting that one. Maybe the Cruz campaign could start distributing Cruz-themed udder covers!

**Donald Trump.** Note that the correlations are weaker. That could be because Trump support is broad-based in the Republican Party. Or it could be that the connection between the voter and the Google-searcher is indirect (i.e. they are different individuals who live near one another).

At first, this list was quite puzzling. A prominent cluster of search terms is pop culture-related: DeGrassi (that’s a TV show that focuses on the problems of teens), Kids’ Choice Awards, and Nickelodeon star Alexa Nikolas. And…“never had a boyfriend.” Wait, what? I thought Trump voters skewed older.

*N.*’s coworker had a thought. According to correlations from Neil Irwin and Josh Katz at The Upshot, Trump supporters are abundant in communities where many people have not finished high school, or places with lots of mobile home dwellers and “old economy” jobs like manufacturing. She suggests that in some families, the children of Trump voters are the only ones in the house who use the Internet. And at least some of them are looking for entertainment and relationship advice online. I do not think it is totally advisable to ask Google about boyfriends…but these searchers *are* helping us to predict voting patterns. So…you go, girls!

>>>

The final step in making a prediction is converting these search terms to predicted vote share. Each of the search terms has a correlation score attached to it. One could do a weighted sum using those scores…but they are not that different, so *N.* calculated a simple average of the terms’ coefficients, state by state. She then did a simple linear regression between that average and the vote-share for that candidate. The resulting linear fit is a formula that can be used to predict vote share in places that have not voted yet. *N.* repeated this for the other two candidates. The result is a set of inferred values of vote share – what a genomics researcher would call “imputed” values.
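The averaging-then-regression step can be sketched in a few lines of Python. This is an illustration under assumptions, not *N.*’s actual code: I take “average of the terms’ coefficients, state by state” to mean the mean of each state’s per-term search scores, and the state labels and numbers below are invented so that the fit is easy to check by hand.

```python
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def impute_vote_share(term_scores, vote_share):
    """term_scores: state -> list of per-term search scores.
    vote_share: state -> known share, for states that have already voted.
    Returns the predicted share for every state without a known result."""
    avg = {state: mean(scores) for state, scores in term_scores.items()}
    voted = [s for s in avg if s in vote_share]
    slope, intercept = fit_line([avg[s] for s in voted],
                                [vote_share[s] for s in voted])
    return {s: slope * avg[s] + intercept for s in avg if s not in vote_share}

# Invented example: states A-C have voted, state D has not.
term_scores = {"A": [-1.0, -0.5], "B": [0.0, 0.5], "C": [1.0, 1.5], "D": [0.5, 1.0]}
vote_share = {"A": 0.325, "B": 0.425, "C": 0.525}

print(impute_vote_share(term_scores, vote_share))
```

A weighted sum using each term’s correlation score would be a one-line change to the averaging step, though as noted above the scores are similar enough that it may not matter much.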

And what about today’s primaries? The PEC poll estimate and the Correlate-based estimate are in close agreement:

Because none of these five states have voted yet, none of them went into the Google-Correlate approach. The two green columns of estimates were made independent of one another. Today we will see how these two methods do. They both suggest that in the East Coast primary, Donald Trump will get about 150 delegates (this includes 43 out of 54 district-level Pennsylvania delegates, whose estimated faithfulness is 0.8), over 85% of the total to be given today.

I have one final thought on the analogy with GWAS. These days, modelers who attempt to predict votes from demographic factors are relying on linkages that they suppose to exist between populations and voting behavior. The education/Trump connection is an example of that. In human genetics, that would be called a “candidate gene” approach. Such approaches can turn up a good result if the hypothesis happens to be true. Often, they lead to results that have not held up so well over time.

In both human genetics and political modeling, constructing a good model requires the modeler to be an extremely good guesser. Google Correlate has the potential to search millions of possibilities at once. I think a clever modeler could really go to town with this tool.

>>>

Democrats, I have not forgotten you. Here, as an added bonus, are search terms that correlate with Clinton and Sanders support. The age/race divide is quite apparent. Where Sanders supporters are found, you can evidently get quinoa soup, PBS Nova, and people who do not know what “wonky” means. And baked goods. A lot of baked goods. It is a veritable cornucopia of Stuff White People Like.

Near Clinton supporters it’s cheap bedroom furniture, Nicki Minaj fans, and pink hoverboard shoppers. And “career in” – Google auto-complete as a job counselor!

**Update, April 26th, 11:00pm:** In today’s East Coast primaries, Google Correlate did better than any poll-based method:

Kasich, asked whom his supporters in Indiana should vote for: “I’ve never told them not to vote for me. They ought to vote for me.”

— Thomas Kaplan (@thomaskaplan) April 25, 2016

One day after the announcement of cooperation between Team Cruz and Team Kasich, John Kasich has already gone off script. I question whether this alliance will hold.

For the goal of stopping Trump, avoiding division is important, not just for Indiana’s 57 delegates next Tuesday, but also for California, where Cruz and Kasich are dividing the non-Trump support.

Even if their efforts stick, Cruz and Kasich may be too late. Today’s PEC delegate calculation is driven in large part by two polls in California showing leads of 18 and 27 percentage points for Trump over Cruz. Kasich’s support is unlikely to make up this difference. At best, the campaigns will need district-level information to coordinate their efforts. That seems difficult.

I also have a smattering of polls that confirm the missing-poll imputations I published over the weekend. *N.*’s Google Correlate method gave 57% for Trump in Connecticut; the 3-poll median is 54%. Correlate gave 66% for Rhode Island, and the midpoint of two fresh polls is 58.5%.

Today’s GOP delegate calculation includes an assumption of Trump +1% for Indiana (a median poll margin of 7%, minus a 6-point bonus for Cruz that I’ve written about before). I am now also accounting for faithless Pennsylvania delegates with a multiplicative factor of 0.8 (i.e. one-fifth of delegates will renege). This is based on the Tribune-Review data (see this post for an explanation).

Statement from Cruz campaign saying it will focus on IN, clearing path for Kasich in OR and NM; Kasich campaign agrees. All to stop Trump.

— Jake Tapper (@jaketapper) April 25, 2016

The main news this week is probably Tuesday’s primaries, when Trump may come close to sweeping Connecticut, Delaware, Maryland, Pennsylvania, and Rhode Island. But what to make of today’s bargain between Team Kasich and Team Cruz?

I gotta say, this looks like a suboptimal deal for Kasich. Based on polls, three states clearly show evidence for a divided-field effect: Pennsylvania, Maryland, and Indiana. Some poll medians:

State | Trump | Cruz | Kasich |
---|---|---|---|
Pennsylvania | 45.5% | 26.0% | 23.5% |
Maryland | 43.0% | 24.0% | 27.0% |
Indiana | 40.0% | 33.0% | 20.0% |

Certainly this deal could turn the tables in Indiana, an important state for Trump. But I have some questions:

(1) Why were Pennsylvania and Maryland left out of this deal? In a one-on-one match against Trump, Cruz might have a slightly better shot in Pennsylvania, and Kasich might have a better shot in Maryland. Is it too late to do anything about those two states? Or were the two candidates unable to agree on who should withdraw?

(2) Why would Cruz give up Oregon (May 17th) and New Mexico (June 7th)? These seem like odd choices, since Cruz is probably stronger than Kasich in both states – at least based on imputations that I published here recently. If a survey comes out showing Cruz stronger than Kasich in either state, surely Team Cruz would be tempted to renege on the deal.

Perhaps more important from an anti-Trump standpoint, Oregon and New Mexico are proportional states. Withdrawing from either will have hardly any effect on Trump’s delegates. It’s basically rearranging the Kasich/Cruz total. Hmmm..maybe that’s a big part of the deal: a barter of delegates.

An anti-Trump deal like this would have been more effective a few weeks ago. Now, it looks a bit too late. Other interpretations?

There is some chance that, by flipping Indiana, this maneuver could help hold Donald Trump below the 1,237 delegates needed for a pledged majority. But it seems to me that if he is a handful of delegates below that threshold, his campaign could find some way to make up the difference from the 140 or so uncommitted delegates remaining.

I have provided predictions separate from polls. I find it confusing to mix up a secret-saucey prediction with hard polling numbers. Unlike other sites’ evaluations [NYT] [538], the approaches presented here are transparent. It is easy to do the predictive calculations yourself, and I hope you will try them out!

>>>

As I pointed out in March, Kasich’s win in Ohio was good for Trump because it kept the anti-Trump opposition divided. In Pennsylvania, Donald Trump is polling at a median of 42% (n=4 polls, April 7-18), with a divided opposition (Ted Cruz at 26%, John Kasich at 23.5%). There is little doubt about who will win the popular vote there on Tuesday.

In the overall PEC calculation of expected GOP delegates, Trump should receive all 71 of Pennsylvania’s delegates based on voting. I’ve written about this calculation and assumption before. Today, to expand upon those thoughts…

Despite the lack of suspense, horserace commentators are attempting to inject uncertainty by going off about the delegate-selection rule in Pennsylvania, where delegates appear on the ballot without any indication of which candidate they support. As the argument goes, Trump could be deprived of 54 district-level delegates (3 in each of the 18 Congressional districts).

I say: meh.

The problem with the argument is that the Republican National Committee has made it clear that delegates should commit on the first ballot to a statewide or district-level winner. Pennsylvania Republicans may have nominal freedom, but they know perfectly well that they should vote for their district’s winner. Since district-by-district standard deviation of vote share is typically around 5%, Trump’s 16-percentage-point lead will probably translate to winning all 18 districts, a clean sweep.

Support for the idea of delegate fidelity comes from the Tribune-Review, where Tom Fontaine and Salena Zito contacted Pennsylvania’s 162 delegate candidates, and 110 responded to their survey. The results suggest that the majority feel pressure to comply with voter wishes:

Only 21 respondents – 19% of the total – explicitly said they would vote for someone other than Trump. Let’s say that the 54 winning delegates are representative of this group. It seems to me that there would be fewer Cruz-supporting delegates if their preferences become known to voters by Tuesday. But let us imagine that Cruz picks up 19% of district-level delegates, 10 in all.

The PEC overall delegate calculation counts all Pennsylvania delegates as being committed. It currently puts Trump at a median of 1285 delegates; the probability of getting at least 1237 delegates for a majority is 75%. Under such uncertain conditions (I regard probabilities of 20-80% as uncertain), losing 1 delegate would reduce the probability by approximately 0.5%*. So if ten Pennsylvania delegates are faithless, then the probability of Trump getting a majority drops to 70%.
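The arithmetic behind that estimate is simple enough to write down. The Python sketch below uses only numbers from this post (54 district-level delegates, a 19% faithless rate, a 75% baseline, and roughly 0.5% lost per delegate); the function names are my own, and the linear rule is only an approximation that holds while the probability sits in the uncertain 20-80% band.

```python
def faithless_delegates(district_delegates=54, faithless_rate=0.19):
    """Expected number of district-level delegates who vote against Trump."""
    return round(district_delegates * faithless_rate)

def adjusted_probability(base_prob, delegates_lost, drop_per_delegate=0.005):
    """Linear approximation: each lost delegate costs about 0.5 percentage points."""
    return base_prob - delegates_lost * drop_per_delegate

lost = faithless_delegates()
print("faithless delegates:", lost)
print(f"probability of a majority: {adjusted_probability(0.75, lost):.0%}")
```

Once the expected delegate count moves well away from 1237 in either direction, the per-delegate effect shrinks and a fixed 0.5% step would overstate the change.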

Of course, there are unpredictable multiplier effects here. If Trump does poorly enough that he drops below 1237 delegates, any indecisive delegates may feel at liberty to vote against Trump. And if Trump stays above 1237 delegates without Pennsylvania, Pennsylvania delegates may decide to abandon any thoughts of straying.

Anyway, from a modeling standpoint, considering delegate attitudes, a simple approach that comes fairly close to reality is to assume that Pennsylvania delegates will go with their voters’ wishes. From that, the reader can apply a correction based on human factors. So that’s what I am doing. I leave it to readers to work out for themselves what they think will happen in alternative scenarios.

*Update, April 25th: I am starting to lean toward using the Tribune-Review survey to add a “faithlessness factor” in which the number of district-level delegates in Pennsylvania would be multiplied by 0.8. The difference is small, but it would actually capture the probable loss of about 10 delegates. It would also minimize the discrepancy between the model and the eventual outcome.*

**This is true for now. If Trump’s expected number of delegates gets farther above 1237, the race will become more certain and the effect of losing a delegate will become much smaller. And of course the converse is true: if he falls short, the race again becomes more certain in the other direction – and the effect of losing a single delegate again becomes smaller.*

Despite the usual complaints, primary polls do reasonably well when aggregated. To understand a state, it is far worse to have no polls at all. As the joke goes, “That restaurant’s food is terrible. And such small portions!”

Unfortunately, we have no public polls for the Republican primary in Indiana *(update – just in, we have Trump +6%, very close to both of today’s estimates, Trump +7% and Trump +5%)*. Indiana is pivotal to whether Donald Trump can get to a majority of pledged delegates. You’d think data pundits would rush to fill this void. But that has not been the case.

For any data pundit, the absence of polling has been a serious problem if the question is anywhere close to a tie. At the *New York Times*, The Upshot has made a demographics-based effort, but I believe that calculation missed Wisconsin (and lacks details). The Great Argental Satan seems to favor Cruz for Indiana in a fairly vague way. He and his staff make extremely weak use of demographics-based analysis, perhaps appropriately so; as far as I am aware, their approach is not strong enough to repair inaccurate polling (for instance, the Michigan Democratic primary). For better performance, there is a need for a method that uses state-level information that is more specific than general demographic composition.

Which brings us to today’s topic. I will show you two independent methods for estimating Trump/Cruz/Kasich support without demographics or polls. This is a long post.

Bottom line: the two methods agree in all important respects. Trump is favored in the remaining Eastern states, including Indiana. Cruz is favored in the remaining states west of the Mississippi (Washington, Oregon, Nebraska, and South Dakota). The only point of disagreement is New Mexico. Both methods indicate that Trump is on a path to more than 1237 pledged delegates.

The first method is the border-county-based analysis that I use for PEC’s calculation (see the banner above). The second method was suggested to me by PEC reader *N.*, using Google Correlate.

As I wrote the other day, all four states surrounding Indiana (Illinois, Michigan, Kentucky, and Ohio) have already voted. Looking at 17 previous states, if Cruz finished ahead of Trump in more border counties, then he always won. If Trump had more border counties, then he almost always won. Overall, this approach works 15 out of 17 times. One miss, Wisconsin, is right at the threshold. So this measure is correct 15/17=88% of the time.

Here is a full list that includes future states.

These are the margins that I am using to calculate the overall distribution of outcomes.

>>>

Now let’s consider a second, independent way of inferring the preferences of a state’s primary voters. Princeton Election Consortium reader *N.* used Google Correlate, which is basically the inverse of Google Trends: you give it an overall pattern (for instance, vote share in a list, with the corresponding states), and it gives back the search terms whose patterns are most similar to the input. *N.* then used these to predict states that haven’t voted yet.

The results agree closely with the border-county-based estimates above—including a narrow Trump win in Indiana. In some cases it goes out on a limb: it predicts Trump +53 in Delaware, making that state the Trumpiest in the country. The tool also predicts a similar Trump margin in Rhode Island.

To test the method, *N.* removed states from the data, then re-ran the correlations without them. This gives pretty good retroactive predictions. For example, for Wisconsin it predicts Cruz 46.6, Trump 38.5, Kasich 14.9, whereas the vote was Cruz 49.5, Trump 36.0, Kasich 14.5. So right there it performs as well as polls, and better than demographic models and pundits.

It also worked in an extreme state, Idaho:

Predicted: Trump 28.3% Cruz 61.8% Kasich 9.9%

Actual: Trump 34.7% Cruz 56.1% Kasich 9.2%
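To put a number on those two retroactive tests, here is a short Python sketch that computes the mean absolute error per candidate from the Wisconsin and Idaho figures quoted above. The choice of mean absolute error is my own; it is not necessarily the error measure *N.* uses in her scorecard.

```python
def mean_abs_error(predicted, actual):
    """Average absolute gap, in percentage points, across candidates."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)

def winner(shares):
    """Candidate with the largest share."""
    return max(shares, key=shares.get)

# (predicted, actual) vote shares as quoted in the text.
results = {
    "Wisconsin": ({"Cruz": 46.6, "Trump": 38.5, "Kasich": 14.9},
                  {"Cruz": 49.5, "Trump": 36.0, "Kasich": 14.5}),
    "Idaho": ({"Trump": 28.3, "Cruz": 61.8, "Kasich": 9.9},
              {"Trump": 34.7, "Cruz": 56.1, "Kasich": 9.2}),
}

for state, (pred, act) in results.items():
    print(state, round(mean_abs_error(pred, act), 1),
          "winner called correctly:", winner(pred) == winner(act))
```

By this measure the misses are about 2 points per candidate in Wisconsin and about 4 in Idaho, with the winner called correctly both times.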

Here’s the Google Correlate-based prediction for the remaining states. *N.* used relative vote share between Trump/Cruz/Kasich here, so in early states where Rubio did well, the numbers don’t capture what the true vote shares would have been.

Using poll medians as an independent measure of voter opinion, we can see if these numbers pass a smell test. *N.*’s outputs have the same rank order (i.e. Trump first, then Cruz second, or whatever) in all six states (NY, CT, MD, PA, CA, NJ). The correlation coefficient is +0.87.

However, the exact percentages were not quite on target for this week’s election. In NY the polls were Trump 54%, Cruz 18%, Kasich 22%; voting was 60%, 15%, 25%. In that case *N.*’s output was too favorable to Kasich/unfavorable to Trump.

At first, I was surprised to see Cruz favored in Oregon and Washington. Originally I had skipped doing the border-county analysis there because those states use proportional allocation. It didn’t seem critical. But wait! Six out of six counties bordering Washington favored Cruz over Trump. In Oregon, it’s 4 counties for Trump, 5 counties for Cruz. So the two methods basically match. That is awesome.

*N.* did curate the inputs a bit. She left out Ohio, because she found that “including it dramatically skewed Kasich’s correlates in a way that really didn’t look like it made sense.” And of course this would underestimate Kasich by a large margin. Taking the same leave-out approach for Texas didn’t change Cruz’s pattern much, consistent with Kasich’s status as a favorite-son candidate with highly focused geographic appeal.

I’ll let *N.* describe exactly what she did in her own words. As you will see, it would be easy to replicate:

*I just took the top 100 correlated search terms and did a state-by-state regression against vote share to fill in the remaining states. Specifically, what I did was:*

*(1) make three CSV text files listing the vote share for each candidate in each state so far (Note: I used “percentage of the Trump+Cruz+Kasich vote”—using percentage of the total vote would give different results in states where Rubio/etc did well. There are various ways you can handle that, with arguments for each.)*

*For example, make a text file called cruz.csv, like this:*

* Iowa,0.485*

* New Hampshire,0.3351*

* Virginia,0.34851*

* for all the states for which you have data.*

*(2) upload them to Google Correlate’s state-map tool. There’s an easy “download CSV” button. It will treat any states not in the file as unknowns.*

*(3) download the CSV dump for the 100 most correlated search terms for each, which gives the popularity of each term by state.*

*(4) for all the remaining states that haven’t voted, take the simple average of the popularity (in the CSV) of all 100 search terms in that state, which gave an overall score. That score is (roughly) linearly correlated with vote share, since that’s what Google Correlate was looking for in each search term.*

*(5) do the linear regression back to vote share using those search term scores, which lets you fill in the unknown states.*
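Steps (4) and (5) can be sketched in a few lines. This is a minimal illustration, not *N.*’s actual code. The state labels, search-term popularity values, and vote shares below are hypothetical, and I assume the popularity numbers have already been read out of the downloaded CSV into a dictionary.

```python
import numpy as np

# HYPOTHETICAL data for illustration only; not real Google Correlate output.
# term_popularity[state] = popularity of each correlated search term there.
term_popularity = {
    "A": [1.2, 0.8, 1.0],
    "B": [0.1, 0.3, 0.2],
    "C": [-0.9, -1.1, -1.0],
    "D": [0.5, 0.7, 0.6],  # has not voted yet
}
# Known shares (fraction of the Trump+Cruz+Kasich vote) for one candidate.
known_share = {"A": 0.50, "B": 0.38, "C": 0.20}

def state_scores(popularity):
    """Step 4: simple average of search-term popularity in each state."""
    return {state: float(np.mean(values)) for state, values in popularity.items()}

def fill_in_unknown(popularity, known):
    """Step 5: regress vote share on score, then predict states with no vote."""
    scores = state_scores(popularity)
    voted = sorted(known)
    x = np.array([scores[s] for s in voted])
    y = np.array([known[s] for s in voted])
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    return {s: slope * scores[s] + intercept for s in scores if s not in known}

predictions = fill_in_unknown(term_popularity, known_share)
```

With real data there would be about 100 terms per candidate rather than three. The fit is done separately for each candidate, so the three predicted shares in a state need not add up to exactly 100%.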

>>>

As an overall validation, let’s compare the outputs of the border-county and Google Correlate approaches. The two outputs show an amazing degree of correspondence:

The correlation coefficient is +0.76, and the two methods disagree on only one expected winner – in New Mexico, where *N.*’s approach gives a tiny advantage to Trump. Such a low level of disagreement is not bad at all.
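For anyone who wants to run the same kind of comparison, here is a minimal sketch that computes the correlation between two sets of predicted margins and lists the states where the methods pick different winners. The margins below are hypothetical placeholders (Trump minus Cruz, in percentage points), not the actual outputs of either method.

```python
import numpy as np

def compare_methods(margins_a, margins_b):
    """Return (Pearson r, states where the two methods pick different winners)."""
    states = sorted(set(margins_a) & set(margins_b))
    a = np.array([margins_a[s] for s in states], dtype=float)
    b = np.array([margins_b[s] for s in states], dtype=float)
    r = float(np.corrcoef(a, b)[0, 1])
    # A sign flip in the margin means the predicted winner differs.
    disagree = [s for s, x, y in zip(states, a, b) if np.sign(x) != np.sign(y)]
    return r, disagree

# HYPOTHETICAL margins for four unnamed states.
border_county = {"S1": 10, "S2": -5, "S3": -3, "S4": 20}
google_correlate = {"S1": 18, "S2": -9, "S3": 1, "S4": 30}

r, disagreements = compare_methods(border_county, google_correlate)
```

In this toy example the second method assigns larger margins but calls only one state differently.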

I do note that the Google Correlate method tends to assign larger win margins. If these come true, Trump would have quite a decisive advantage in getting above 1,237 delegates.

>>>

*N.* sent me this analysis on Tuesday. Since that time, three Indiana polls have been leaked. They show a near-tie with Trump barely ahead. However, I would take this news with a grain of salt. I am suspicious of leaked data generally. Often it’s spin from the trailing side, to make things look closer than they are. *(Update: More importantly, a WTHR/HPI poll was just released showing Trump ahead by 6%. This closely matches our two methods, which average Trump +5%.)*

Both our methods face several real tests in the coming weeks. Indiana might not be so hard. But from a modeling standpoint, the harder tests are Washington, Oregon, and New Mexico.

New Mexico is of particular interest. It is the only state where *N.*’s approach and mine give opposite results. Since Google Correlate can “reach into” the center of a state, I might actually believe its prediction slightly more: a near-tie, Trump +1%. New Mexico has proportional representation, so the two models give a difference of only 4 delegates. In any event, New Mexico will be a useful test for comparing the two approaches.

Finally, to quote *N.* again, “I’ll also feel pretty good if there’s a Trump blowout in Delaware. Just because it’s such an odd little state in so many ways.” And on cue, here is a new survey: 55% for Trump, and an additional 12% undecided. If Trump gets many of those undecided voters, his total could end up pretty close to the Google Correlate-based prediction of 69%. Make Delaware Great Again!

More thoughts… The opinion is very short (barely over 10 pages) and is narrowly written, as Amy Howe at SCOTUSblog writes. It doesn’t seem to do much more than state that population deviations of up to 10 percent are acceptable. It does as little as possible, while still upholding the Commission’s work. Several people, including me, filed briefs arguing that statistically, there was no measurable partisan offense. But the Court ruled in a way that did not require them to take a position on whether an injury had occurred.

The narrowness of the opinion might reflect a new direction for the Chief Justice. On the one hand, many Supreme Court opinions are unanimous, or nearly so. On the other hand, in recent years the Justices haven’t shrunk back from a series of 5-4 votes on controversial cases that roll back worker rights, voting rights, and other areas of standing law.

This redistricting case could have been such a situation. FantasySCOTUS had it as being anywhere from 5-4 (favoring Harris) to a 9-0 majority. As written, the opinion stayed well clear of matters regarding the Voting Rights Act, an area where the most conservative wing (Thomas, Alito, Roberts) doesn’t accept current law. Now that conservatives are probably headed for being in a 5-4 minority, Roberts may discover a new-found love for consensus – as a means of slowing down any advance for liberal priorities.

I initially became interested in this case as a means of advocating for a gerrymandering standard. With today’s decision, that didn’t work out. However, other cases are coming down the pike: Whitford v. Nichol in Wisconsin, and Shapiro v. McManus in Maryland (for a description, see pp. 36-41 of my article). The time is ripe for the Court to take up this question. If not this year, then next year. Onward!
