Princeton Election Consortium

Innovations in democracy since 2004

Among Republicans, Trump supporters have slightly lower incomes. But what really differentiates them?

May 7th, 2016, 1:39pm by Sam Wang

First, the news clips. At The New Yorker, John Cassidy digs further into the question of data journalism vs. data punditry, and cites PEC favorably. He thinks data journalism is at its best when it isn’t trying to make predictions, but helps us understand what is happening now. I mostly agree with that, though I do think future predictions can be useful if they are transparent and put the assumptions on the table where we can see them.

Also, some long perspectives from Scott Lemieux at the New Republic and from me at the Daily News on Trump’s long odds. Based on a current margin of Clinton +7%, I put Clinton’s poll-based November win probability* at 70%. That’s not taking into account my observation yesterday that Trump’s ascent shows that “The Republican Party is broken. It probably broke slowly, from 1994 to 2014.” This is addressed in part by a long analysis piece by Patrick Healy and Jonathan Martin in today’s NYT.

Now let us turn to a recent offering by FiveThirtyEight. They gave income information on Trump voters (which is good data journalism practice!) – and then created a false impression that Trump voters are well-off (which is questionable data punditry). Let me explain.

First let me praise FiveThirtyEight for showing the data, which revealed the problem with their headline. I think that is good practice on their part.

Now to the claim.

Based on exit polls, Trump voters have a median income of $72,000, above the national median. Therefore, writes FiveThirtyEight in its headline and lede, these voters are more affluent than the press would have you believe. This claim has been picked up by USA TODAY and Money magazine. In the case of USA TODAY, the original analysis has been mangled somewhat, which can happen when the original article has a misleading headline.

However, the exit-poll statistics describe Trump voters from Republican primaries. Republican primary voters are not representative of all voters. For example, they are better off than Democratic primary voters; Trump, Cruz, and Kasich voters all have this characteristic. So the $72,000 figure is consistent with “Republican primary voters are better-off than Democratic primary voters.” Which is not news.

However, the same data allow a within-group comparison, which tells a different story. Trump voters have a slightly lower median income than Cruz supporters…and a lot lower than Kasich supporters. I should point out that even this measure is hard to interpret, since each group of voters contains a mixed bag of different incomes. Still, these numbers do support the idea that Trump voters tend to have lower incomes than Republican voters as a whole.

Another problem with the income analysis is that Trump supporters and Cruz supporters differ in income by only $1,000. This is a very small difference. There are far better differentiators among the different types of voters. Here are two.

First, see this excellent piece by The Upshot’s Neil Irwin and Josh Katz, “The Geography of Trumpism.” Counties with Trump support correlate with counties where voters have less education, work in old-economy jobs, and when asked about their ethnicity say “I’m an American.” Irwin and Katz get their conclusions from a correlational analysis of many demographic variables. They are telling a deeper story with data than what the $72,000 income statistic appears to tell.

Another way to get a better picture comes from Google-wide Association Studies. Recall that Google search terms did better than polls in predicting primary outcomes. Those search terms tell a story that is far more like what The Upshot says than FiveThirtyEight:

These search terms, to the extent that we can interpret them, point toward Kasich voters being the most affluent of the three groups of Republican voters. This is consistent with the exit-poll data when viewed more broadly.

Tags: 2016 Election · President

61 Comments so far ↓

  • DZK

    To be fair to Nate, he did make sure to mention that the Trump primary voters were worse-off in income than Cruz + Kasich. However, he should have differentiated between “supporters” and “voters”, since it’s been long-established that low-income earners are much less likely to vote in the first place.

    I think his wider point was that Trump’s support isn’t limited to Rust Belt/Appalachian whites. In the recent primaries he’s been winning outright majorities of those earning between 100K and 200K, as in Indiana.

    • Sam Wang

      I agree that there is good stuff in that article. But the headline is misleading, and USA TODAY mangled it pretty well. However, you are also missing the real point in the data: among GOP voters, Trump voters have lower income.

      I think the wider point that Trump supporters are not all hillbillies is not interesting at all. It falls trivially from the fact that Trump has started to clear 50 percent of Republican voters. Is it really necessary to gin up a bad analysis to make this point?

    • AySz88

      Oddly, there does seem to be an impression among Republicans (especially Trump surrogates, and supporters, and perhaps Trump himself) that he can win working-class *Democratic* voters in the general election, based on the perceived strength among “working-class” Republicans.

      (A quick search yielded this Reuters article from August, making a connection to blue-collar Reagan Democrats, etc.)

      That notion might be what was being aimed at there.

    • Matt McIrvin

      I’ve been seeing articles using Trump’s strength in West Virginia to suggest that he ought to be able to flip Ohio, Pennsylvania, Michigan and Wisconsin. But I don’t think there’s actually much similarity between those states as a whole. West Virginia is a place where, in 2012, several counties voted for a convicted felon over Obama in the Democratic primary. I think some people are still thinking of WV as the mostly-Democratic-voting state that it was before 2000. Currently it may be the reddest state in the union. It’s far redder than Virginia, which is the reverse of the pre-2000 situation.

      Trump will probably carry the parts of Pennsylvania that most resemble West Virginia, sure.

  • David Lynx

    Sam, you deserve the end zone dance. Question: today you give Clinton a 70% chance but a couple of days ago you were at 91%. Seems like a big swing in a short period of time, or am I confusing two different types of calculations?

    • Sam Wang

      To be blunt, 91% was a mistake that I have to fix. That post has a statement to such an effect. I am waiting until I have a little more to say than that…also, I have other work to do.

    • bks

      I’m not enjoying this exercise of trying to predict the outcome from the Trump vs. Clinton national polling. Obama and Romney were a rock-solid 50-50 in the national polling on election day and Obama won 51-47 in the actual vote and the electoral college was 332-206.

      Maybe I’ll take a nap till Labor Day.

    • Matt McIrvin

      We’ll have a decent amount of state polling by late summer. There are already polls in most of the swing states, but they’re sparse enough that we can probably be misled by small-number effects.

    • Josh

      In the week leading up to the 2012 election, nine major polls came out. Five had Obama up, three were a tie, and one had Romney. Not quite “a rock-solid 50-50”.

    • bks

      Rock-solid median of 48-48.

  • N

    Hey Sam, N here! Great point about the within-group comparison. I think you’re right.

    I just want to remind your readers that the Google Correlate search term lists are proxies for states, not proxies for searches by the supporters themselves. I think this difference can be pretty important!

    It’s easy to imagine that if Google Correlate existed in 1968, it might report that George Wallace’s vote shares correlated with search terms popular with African-Americans. This could lead you to think Wallace had lots of black supporters! He didn’t, of course; it’s just that states with large black populations were the ones where white people were invested in fighting for segregation.
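The Wallace hypothetical is an instance of what statisticians call the ecological fallacy. A minimal sketch of the effect follows; the state compositions and support rates are invented for illustration and are not 1968 data.

```python
# Hypothetical illustration of the ecological fallacy; all numbers are invented.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Assumed share of each state's population that is black.
black_share = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40]

# Assumption: no black voter supports the candidate, while support among
# white voters rises with the state's black population share.
black_support = 0.0
white_support = [0.05 + 1.5 * b for b in black_share]

# The statewide vote share mixes the two groups by population.
state_share = [
    b * black_support + (1 - b) * w
    for b, w in zip(black_share, white_support)
]

r = pearson(black_share, state_share)
print(f"state-level correlation: {r:.2f}")
print(f"support among black voters: {black_support:.0%}")
```

In this toy setup the state-level correlation is strongly positive even though support within the group is zero by construction, which is the trap described above.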

    • Amitabh Lath

      Hi N., great analysis. It’s rare to see such a totally new take on a problem. Could you elaborate on what you mean by “proxies for states” and not “proxies for supporters”? How do you then form the columns labelled for the candidates?

    • N

      Amitabh, thank you for the kind words!

      Let me try to answer your question. The columns show search terms which correlate with the candidate’s primary vote share in each state. In other words, Trump’s column lists things which are searched for a lot in states where Trump did well, and not searched for very much in states where he did poorly.

      It’s tempting to stereotype from these search terms, but it’s important to remember that these aren’t necessarily things searched for by the candidate’s supporters themselves. They’re just things that are popular in states where the candidate is also popular. In other words, they tell you about the state, not the supporters themselves.
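The ranking step described here can be sketched roughly as follows. This is not Google Correlate's actual implementation, which is not described in this thread; it only illustrates ranking terms by state-level correlation. The term names, search frequencies, and vote shares are all invented.

```python
# Toy sketch: rank search terms by how well their per-state frequency tracks
# a candidate's per-state vote share. All data below are invented.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical vote shares for one candidate in five states that have voted.
vote_share = [0.25, 0.33, 0.41, 0.47, 0.55]

# Hypothetical relative search frequency of each term in the same five states.
search_freq = {
    "term_a": [1.0, 1.4, 1.7, 2.1, 2.4],  # rises with vote share
    "term_b": [2.0, 1.9, 2.1, 1.9, 2.0],  # no clear trend
    "term_c": [3.0, 2.6, 2.1, 1.9, 1.2],  # falls as vote share rises
}

# Sort terms from most positively to most negatively correlated.
ranked = sorted(
    ((pearson(freq, vote_share), term) for term, freq in search_freq.items()),
    reverse=True,
)
for r, term in ranked:
    print(f"{term}: r = {r:+.2f}")
```

Note that each input is one number per state. Nothing in the calculation identifies who in the state did the searching.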

      The George Wallace hypothetical shows how important this distinction can be. Failing to understand it could lead you to think Wallace was supported by African-Americans, since he did well in states with large black populations.

      To use another example, Trump’s top searches are consistently teen cultural icons and shows like Degrassi. Does this mean Trump voters love high school dramas and tween pop stars? I guess it’s possible, but seems more likely that—as my friend suggested—Trump voters live in states where the children are more tech-savvy than the adults, and so their interests make up a proportionally larger share of searches.

      Or maybe there’s a totally different explanation that someone clever will figure out! But without data on searches by the actual supporters themselves—which only Google has—it’s hard to know for sure.

    • Amitabh Lath

      Okay, so Google Correlate is somewhat like a Neural Net in that you “train” it with states that have already voted. And the search phrases are like the values stored at hidden nodes, which are pretty meaningless by themselves (although people do try to make sense of them).

      I will refrain from making any Einstein/hidden variables comments although I’m itching to because I just got done teaching undergrad quantum for 3 semesters.

      I remember when NN techniques first started being used in my field (particle physics) there was pushback from the old timers, and now of course it’s so common not just for analyses but also everyday detector calibrations.

      I foresee myriad uses for your technique, beyond politics.

    • Josh

      Not trying to quibble, but…if there were nine polls and five had Obama up, then wouldn’t the median poll, by definition, have Obama up?
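The point about the median can be checked directly. With nine polls the median is the fifth-ranked value, and only four polls showed a tie or a Romney lead. In the sketch below only the signs follow the tally given in the thread; the magnitudes are invented.

```python
import statistics

# Hypothetical margins (Obama minus Romney, in points). Only the signs follow
# the tally in the thread: five Obama leads, three ties, one Romney lead.
margins = [3, 2, 1, 1, 1, 0, 0, 0, -1]

median_margin = statistics.median(margins)
print(median_margin)
```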

  • Amitabh Lath

    How do the Google correlate lists tell you anything about incomes? Degrassi fans are poor? I’m missing something.

    I do notice that the coefficients for Trump are bad. In fact the lowest coefficients for Cruz (0.80) and Kasich (0.89) are bigger than the highest for Trump (0.74, the Degrassi season 13 fans).

    • Sam Wang

      Well, it is a speculation. To be more precise, I think they are consistent with:

      1) (Trump) Low internet use among adults >> teens do the searching >> teen searches.

      2) (Kasich) Searches for economists and economic writers >> more affluent communities.

      I agree that it is not logically required that Trump and Kasich voters be affluent or not. But in conjunction with the Irwin/Katz evidence, I am willing to make the speculation.

      Again, note that we are talking about just part of the distribution.

    • Amitabh Lath

      Yes, that does make sense: assume that kids’ searches dominating means adults are not using the computer. Or at least, not doing many searches.

    • alurin

      The google data may have in aggregate predicted the primary outcomes better than the polls, but it seems to me that there are a lot of assumptions involved in interpreting the GWAS search terms.

  • Amitabh Lath

    Walk me through the 30% probability for Trump. I’m trying to cobble together 270 EV and having problems if I start without Florida’s 29.
    I can pile on Iowa, North Carolina, Ohio, Colorado, but they’re not enough.

    Trump must win Florida to have any shot, and with 80% unfavorable with Hispanics and ex-Gov. Bush coming out early and strongly against him, it seems very low probability. In fact to have a 30% overall he would have to have greater than 30% prob for FL. Current polling has Clinton ahead by nearly double digits.
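The electoral arithmetic can be checked quickly. This sketch assumes a baseline of Romney's 206 electoral votes from 2012 (the comment does not state its starting point) and uses the 2012-2020 apportionment. North Carolina is already part of the 206, so it adds nothing.

```python
# Electoral-vote tally; the baseline of Romney's 2012 states is an assumption.
ROMNEY_2012 = 206  # includes North Carolina (15 EV), which Romney carried
NEEDED = 270

# Electoral votes under the 2012-2020 apportionment.
pickups = {"Iowa": 6, "Ohio": 18, "Colorado": 9}
FLORIDA = 29

without_florida = ROMNEY_2012 + sum(pickups.values())
with_florida = without_florida + FLORIDA

print(f"without Florida: {without_florida} (short by {NEEDED - without_florida})")
print(f"with Florida:    {with_florida}")
```

Under these assumptions the listed states leave Trump well short without Florida, and adding Florida still leaves the total just under 270.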

    • Sam Wang

      Hillary is at +7% over Trump nationally. SD on that drift between now and November is 12% (not 6%, I made a bonehead error – Wlezien data is vote share, not vote margin). The one-tailed probability on 7/12 sigma is 70%.
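The 70% figure can be roughly reproduced from the lead and the SD. The comment does not say which distribution is used, so this sketch tries two: a normal distribution and, purely as an assumption, a Student t distribution with 3 degrees of freedom, which has fatter tails and a simple closed-form CDF. The function names are made up for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def t3_cdf(z):
    """CDF of Student's t with 3 degrees of freedom (closed form)."""
    u = z / math.sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi

def win_probability(lead, sd, cdf):
    """One-tailed probability that the lead stays above zero."""
    return cdf(lead / sd)

lead, sd = 7.0, 12.0  # points of margin, as quoted in the comment
print(f"normal:            {win_probability(lead, sd, normal_cdf):.0%}")
print(f"t (3 df, assumed): {win_probability(lead, sd, t3_cdf):.0%}")
```

Both choices land close to the quoted value, so the estimate is not very sensitive to the distribution when the lead is this small relative to the SD.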

    • Amitabh Lath

      I see. This estimate is based on national numbers. What is the SD on Wlezien data if you restrict yourself to the years 2000 and later?

    • Sam Wang

      At this point in the race, SD=2.4% for the Democratic two-party vote share. Therefore the SD of the two-candidate margin is twice as large, 4.8%. Less variable than pre-2000 years. Perhaps a consequence of polarization and/or the slow GOP crackup (these might be synonymous).

      If this held up, then Clinton’s lead would be 1.5*SD, pretty large, corresponding to a probability of 88%. But on what basis would we think that this year will play out like 2000-2012?
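The doubling follows because, in a two-party contest, the margin is a linear function of one party's share. Writing $s$ for the Democratic two-party share in percent and $m$ for the margin (notation introduced here, not in the original comment):

```latex
m = s - (100 - s) = 2s - 100
\quad\Longrightarrow\quad
\sigma_m = 2\,\sigma_s = 2 \times 2.4\% = 4.8\%,
\qquad
\frac{7\%}{4.8\%} \approx 1.46 \approx 1.5\,\sigma_m .
```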

    • Andrew

      “Walk me through the 30% probability for Trump. I’m trying to cobble together 270 EV and having problems if I start without Florida’s 29. I can pile on Iowa, North Carolina, Ohio, Colorado, but they’re not enough.”

      In my view Trump can win by cobbling together a new map that does not correlate to the basic 1988-2012 electoral map.

      In this map, Trump is not going to win CO, VA, IA, OR, MN, or WI.

      Instead he holds the Romney states and looks to pick up NH, PA, OH, and MI in the industrial north, and FL and NV in the sandy south.

      I actually think Trump is more likely to win DE and NJ before he wins CO and VA, where he is strongly unpopular in northern Virginia and Denver.

      I base this statement on the levels of relative support shown on the ISideWith maps. Ignoring the absolute values of the numbers on the maps, the maps have been very accurate showing where Trump would do well or poorly vs. Cruz and Kasich.

      With that in mind, look at the relative strength Trump shows vs. Hillary.

      Ordering the states by relative strength on the map shows Trump wins if he takes first FL and NV, ME District 2, OH and MI and holds UT(!). His insurance against a surprise loss elsewhere becomes wins in DE, NJ, PA, NM, NH, IA and CT in that order according to the relative support.

      I would say this looks to be a crazy election season, and the states to really watch are Michigan and Utah – can Trump win the one and hold the other?

    • Sam Wang

      At first I thought this was crazy…but it has the advantage of being totally testable. And anyway, times change. For example, the 1976 Carter-Ford electoral map is unrecognizable by today’s standards.

    • Amitabh Lath

      Thanks. I asked because the Wlezien data for years 2000 and on does look qualitatively different, so it is reasonable to wonder if this is due to a) higher voter polarization causing decisions to solidify earlier and earlier (GOP crackup as you eloquently put it), or b) just a random thing.

      I guess it’s also possible that c) pollsters in the mid-20th century were just horrible, and the large scatter is just uncontrolled systematics.

      Anyway, if a) (or c) then the true SD is almost x3 smaller. Worth noting.

    • Amitabh Lath

      Andrew, what is the provenance of this isidewith map? What data is it based on?

      If we are looking for a large scale shifting of the map in just one 4-year cycle, the shift from 1976 to 1980 is notable. If Trump can transform the 2012 map by that much, then you are right my perturbative approach to Trump 270 is invalid.

      And just as Iran was the key to all the red in the 1980 map, it could well be again, for instance if they explode a nuclear device they have been building in secret.

      But until it goes boom I stick with polls.

    • Andrew


      “Andrew, what is the provenance of this isidewith map? What data is it based on?”

      It’s based on a freely available online survey which geolocates you and places your candidate and policy support on a map showing how your positions compare to those of other voters near you and nationwide.

      If you look at the percentage numbers on the map, Trump and Sanders supporters are obviously overrepresented.

      The Trump vs. Clinton map I linked to is based on 471,000 individual recent responses since February 2016. You can change the settings to go further back in time or pick different candidates to compare. The Sanders-Clinton map has 518,000 responses.

      It’s interesting that if you toggle the “All Responses” option, it increases to 1,555,000 responses (since November 2015, it says) and the percentages change in favor of Hillary (probably because Hillary has been a candidate for a longer period of time), but the relative strength of Trump vs. Hillary on the map doesn’t really change.

      It is also interesting that if you choose Hillary vs. Cruz, you get a traditional 1988-2012 type Democrat vs. Republican map, which seems to show Cruz would have done about the same as Romney.

      It’s not a scientific survey, but it does pass the “smell test” that Trump is not just a very different type of candidate, but a candidate with very different support areas.

    • Matt McIrvin

      If you do look at the absolute numbers on the ISideWith map, though, you see that Trump is winning almost everywhere, which suggests to me that this is a sample predominantly of white people. That wouldn’t be much of a problem for predicting geographic variations in the Republican primary vote. It would be a big problem for predicting geographic variations in the general election vote.

      The few polls that have been taken in Mississippi show that Trump’s win margin is much smaller there than it is in, say, West Virginia, whereas this map shows the opposite. I suspect that’s race.

    • Matt McIrvin

      …To your statement about Michigan and Utah, I’d add that I think Arizona and the South are more in play than they used to be. Utah is an oddball because the electorate has a huge number of Mormons, who are only a large fraction of the population in one or two other states, and there seems to be a specific Mormon revulsion for Donald Trump. I’m more interested in how the non-white vote is shaping up, but a lot of the relevant states aren’t polled much because they are considered safe red states.

    • ChrisK

      We have good reason to think that the map will look different and has more potential variation than other post-2000 elections. Trump is not a standard candidate ideologically. In principle, a reasonable GOP candidate who was against free trade and against the Iraq War would be a scary GE candidate for a Dem.

      As for the ISideWith maps: today we get a poll with GA within the margin of error. If that’s Trump’s map, he has heaps of work to do.

    • Amitabh Lath

      I would be extremely careful with (nicer way to say “avoid like Ebola”) any poll that does not use random selection (or weight for selection biases, as YouGov does).

  • JohnD

    I have a strong suspicion that we’re looking at correlation, and assuming causality. My hypothesis is that Trump voters are driven by xenophobia and racism – which has a stronger incidence in lower income households. But the driver of racist/anti-immigrant appeal cuts across income levels and gets him over 50% of Republican voters.

    • Andrew

      “My hypothesis is that Trump voters are driven by xenophobia and racism”

      According to the exit polling, while Trump is certainly winning the xenophobic vote, the majority of his support comes from people concerned about jobs and economics. His message on that score is unabashedly protectionist – he calls out companies that left or are leaving the US and he actually suggests a 35% tariff in his stump speeches. His talk on protectionism and economics occupies much more of his speeches than talk about immigrants.

      I’m a Trump supporter, and this is the primary draw for me along with his unabashedly nationalist non-interventionist/anti-war America First foreign policy. Trump is what I would call a Teddy Roosevelt Republican and so am I. People are befuddled by him because we haven’t seen a Roosevelt Republican in a prominent national position since Senator Robert Taft in 1953. Apparently it is still a wildly popular position among the Republican voters.

    • whirlaway

      Thomas Frank says Trump support base correlates “even better with deindustrialization and despair, with the zones of economic misery that 30 years of Washington’s free-market consensus have brought the rest of America.”

  • Andrew

    “These search terms, to the extent that we can interpret them, point toward Kasich voters being the most affluent of the three groups of Republican voters. ”

    It’s consistent with the actual results too.

    For example, Kasich won one county in New York – Manhattan, obviously the wealthiest point in the state.

    In Pennsylvania Kasich won just ten townships and boroughs – Lower Merion, Narberth, Radnor, Easttown, Tredyffrin (the Main Line); Swarthmore and Rose Valley (PA’s version of Princeton and a small town outside it); and out by Pittsburgh the towns of Mt. Lebanon, Sewickley, and Fox Chapel, which are the three wealthiest suburbs. In Philadelphia City, Kasich won the wards covering Center City and Chestnut Hill.

  • E L

    If Trump’s meeting with Ryan Thursday is a disaster, then Trump’s chances move very close to zero.

    • Mark F.

      I honestly think 70% is too high right now, if you take other things into consideration besides the polls.

    • Brian

      If Trump’s meeting goes badly, Ryan’s chances of winning his August 9th primary may drop close to zero as well.

  • Paul

    I have been following this site now since 2004 and it is excellent (I am a biology prof, so my life is data analyses). I must admit though that I am baffled by the degree of uncertainty that much of the press discuss surrounding this election. I just looked at a breakdown of the Romney/Obama vote (Roper Public Opinion Rsch, Cornell Univ.). Obama swept the African, Asian and Hispanic American vote by huge margins and won the female vote with 55%. He only had 39% of the white male vote, but more women vote than men and there are more women than men. Can anyone possibly imagine this changing in Trump’s favour? Clinton has a high unfavourability rating, but when she is actually elected to something (NY Senate) or appointed (Secretary of State) she has a high favourability rating – i.e., it cannot be that deep-seated except amongst those who would never vote for Obama – yet he won twice. My take on all this is what I tell my students about statistical analyses – everything, unless it is absolutely identical, is different, and if there is a robust difference, no matter what analysis you use, even despite the fact that some reviewer believes you should use a different analysis, nothing is going to change that robust difference. In the end, demographics won out in 2012; those demographics have only increased in the favour of the Democrats and will relentlessly continue to do so. Nothing is going to change that until bigotry ceases to be a force in politics.

  • Ed Wittens Cat

    Dr Wang says: “a surprise can happen. The most extreme such case happened in 1980, the year of the Iran hostage crisis and a national energy crisis. Jimmy Carter led Ronald Reagan in May surveys by about 10 points, but ended up losing by almost as much.”
    If the failed Eagle Claw op tanked Carter’s re-election bid, how would a failed Mosul op affect Hillary’s lead?
    Obama did the OBL “raid” to better his re-election chances in 2012, very much haunted by Eagle Claw. It succeeded.
    But Mosul may not.

    • Matt McIrvin

      If you look at Carter’s approval ratings, he was deep in the red before the Iran hostage crisis began, because of the economy and the perception of him as an uninspiring leader. I don’t think Iran tanked him; I think he got a temporary rally-round-the-flag spike that got him up a little above 50% approval at the beginning of the hostage crisis, and gradually returned to his past performance as the situation dragged on with no apparent hope of resolution.

      Meanwhile, people gradually came to regard Ronald Reagan as acceptable. Support for John Anderson seems to have served as a kind of gateway, with people moving from supporting Carter to Anderson to Reagan.

      The gradually increasing acceptability will probably happen to Trump with Republicans, if not with anyone else. But Obama was never as unpopular as Carter in the first place. His Presidency has been unusual in how stable his approval numbers are, actually: after a very short honeymoon period they’ve never been outside of the range of 40% to 55%.

    • Scott

      I think it’s a huge stretch to say Obama did Bin Laden to improve his reelection chances. He was not behind and in fact stood to lose much if the raid failed. To me it was a courageous move – and the right move for the country. It was not the right move for Obama’s personal political career.

    • Ed Wittens Cat

      “If you look at Carter’s approval ratings, he was deep in the red before the Iran hostage crisis began, because of the economy and the perception of him as an uninspiring leader.”

      Matt, I’m specifically referring to Dr Wang’s 20 point poll inversion in the presidential. Certainly approval could be a contributing factor.
      I think you are discounting the importance of Eagle Claw.
      Obama certainly recognized it, staging the OBL “raid” right before 2012 election.

    • Andrew

      You do know that the bin Laden raid occurred almost two years before that election?

  • JayBoy2k

    I have been pulled away from politics by my day job. I am definitely interested in this topic. I am not sure what question is being asked on Republican Primary voters’ income and how it relates to NE outcomes. Is there some likelihood that upper income Republican primary voters would choose to vote Democratic in 2016? That seems far-fetched. On weekends I spend time sharing my hobby with a Cruz supporter. He is greatly disappointed with recent events and, like a lot of other Republicans I talk to, will never vote for Hillary, but is likely not to vote for Trump. He disagrees with Trump on Policy that to some extent Trump shares with Democrats: Social Policy, Planned Parenthood, Transgender bathrooms, Abortion, Fiscal Policy, Healthcare, etc. There is going to be a significant set of Republicans that sit out this election, and that could contribute as a primary factor to a Democratic landslide.
    Sam points to a broken Republican Party that has failed to either educate or satisfy the needs of its voters since the early 1990s. Now combine that with a rebellion of the Conservative and Establishment arms of the Republican Party to sit out the election, and you may not need very much from the Democratic Party except to sit back and reap the bonanza.
    We are in the middle of a political transformation, likely to be analyzed for decades on what factors determined outcomes. While I appreciate any raw data (Republican Primary voter Income levels), I am having trouble fitting that into a premise with a predictive outcome.

    • Matt McIrvin

      I’ll believe it when I see it. I suspect that nearly every Republican who is currently saying they could never vote for Trump will convince themselves he’s an acceptable choice by November. You can already see it starting to happen with high-profile politicians.

  • anonymous

    Trying to think through how/why the Google Correlate method may work. It seems to find the state-wise pattern of frequency of search terms that correlate best with the state-wise pattern of voters choosing a particular candidate. So if you take a state where the voting has not yet occurred, and you want to predict the vote, the frequency of that particular search term in that state is predictive of the candidate’s vote share. Averaging over a larger number of correlating search terms reduces the chances of errors due to any one search term correlating by fluke with the known states.

    But why do those search terms correlate with the vote share? The simplest explanation is that the type of voters voting for a particular candidate also perform certain google searches that are not performed by voters of the other candidates. Sam thinks that it may not be the voters themselves, but their family members, based on the strange correlating search terms seen. In this comment thread, N notes that it may be any pattern that correlates by state with the voting pattern for a candidate, e.g. Wallace’s vote share may have indirectly correlated with African-American topics of interest in 1968. This point dissociates the voters (or their family members) from a direct link with the search terms, but maintains an indirect relationship.

    The main conceptual problem for me is the thought that the search terms should not correlate with the vote share unless they are exclusive to the voters of a particular candidate, either directly or indirectly. If ‘Degrassi’ is searched for by Trump voters (or their family members), it should not be searched for by voters/family members of opposing candidates, otherwise the correlation would suffer. It would have been so much easier to be comfortable with the method if the search terms showing up could be connected more readily to the candidate.

    • Amitabh Lath

      According to N. the searches do not have to be done by the candidate’s voters or anyone related to them. They just have to be done in that state.

    • anonymous

      @ Amitabh Lath

      Agree, but if the correlations are not due to some sort of underlying relationship between the search terms and the candidate vote share (whether direct or indirect), then it is hard to understand why the search term pattern should be predictive of the vote share. Right now, it seems like too much of a black box. Imagining a 50 dimensional space (for 50 states), and given a large number of vectors (in this case gazillions of possible google search terms), one might be able to find a specific vector (in this case, a particular search term) that is very similar to another vector of special interest (in this case, a candidate’s vote share) in 49 dimensions. In the absence of an underlying relationship between the two vectors, the two vectors could, however, be completely different in the 50th dimension. Predictions in the 50th dimension would not work in this case. The method does seem to work, but I don’t understand how it would work in the absence of a direct/indirect relationship between the search term and the vote share.
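The worry about unrelated vectors can be simulated. The comment imagines matching on 49 dimensions and predicting the 50th; this sketch instead holds out 10 of 50 "states" so that a held-out correlation can be computed. Everything here is random noise by construction, and the sizes (2000 terms, top 20 kept) are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
N_TRAIN, N_HOLDOUT, N_TERMS, TOP_K = 40, 10, 2000, 20
n_states = N_TRAIN + N_HOLDOUT

# Pure noise: neither the "vote share" nor any "search term" carries signal.
target = [random.gauss(0, 1) for _ in range(n_states)]
terms = [[random.gauss(0, 1) for _ in range(n_states)] for _ in range(N_TERMS)]

def train_r(term):
    """Correlation with the target on the states that have 'already voted'."""
    return pearson(term[:N_TRAIN], target[:N_TRAIN])

def holdout_r(term):
    """Correlation with the target on the held-out states."""
    return pearson(term[N_TRAIN:], target[N_TRAIN:])

# Keep the terms that look best on the training states.
best = sorted(terms, key=train_r, reverse=True)[:TOP_K]

mean_train = sum(train_r(t) for t in best) / TOP_K
mean_holdout = sum(holdout_r(t) for t in best) / TOP_K
print(f"mean r on training states: {mean_train:.2f}")
print(f"mean r on held-out states: {mean_holdout:.2f}")
```

With no underlying relationship, the selected terms look well correlated on the training states but the correlation should largely vanish on the held-out ones. If real search terms keep predicting new states, that suggests some relationship exists, even if it is indirect.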

    • dk


      You are correct that the method can’t work unless there is some direct/indirect relationship between the search terms and the vote share. What’s so intriguing about the method is that it does work, even when we have no idea what that relationship is!

    • Amitabh Lath

      Anonymous, I’ve been thinking of it more like a Neural Net, where the 100s of specific search terms are like the hidden nodes/layers which you normally do not get to examine.

      Just as with a NN, you set up the machine with a set of inputs, train it on a known data sample (states that have already voted), and then set it loose on the unknown data (states that are yet to vote). The machine learns (i.e., sets values and connections between the nodes and layers), then applies that to the unknown data and spits out an answer, but the inner workings are quite opaque.

      The pro is that because it can squeeze small biases out of loosely correlated input variables, it can be a much more powerful discriminant than simple analytic analyses. The con is that one has no real idea of how it formed its answers.

      When they work, NN techniques are amazing, but when they fail it is difficult or impossible to figure out where they went wrong. There was discussion in the particle physics community about using machine learning techniques (NN, Boosted Decision Trees, etc.) on the Higgs boson search, but eventually we went with a straightforward old-fashioned bump hunt.

    • Amitabh Lath

      Speaking of Neural Nets, there is an anecdote about an early NN, implemented in analog circuitry, used by the Army to differentiate between Soviet and American tanks.

      It did amazingly well when trained on photographs of tanks, picking out photos of Soviet tanks. But in the field it failed miserably.

      The problem turned out to be the photographs. American tank photos were bright color glossies but the Soviet ones were dark grainy spy photos. So the NN basically learned to go by brightness. Moral of the story is that your training dataset may not be the same as your analysis dataset.

    • 538 Refugee

      The whole process might be more interesting than I thought for the general election. There will be plenty of poorly polled ‘reliable’ states that could be interesting to look at and compare to ‘battleground’ states. This may have been brought up already, but the volume on the site is getting heavy for this early in the race and I’m sure I’m missing some posts.

    • Lorem

      I vaguely suspect that if we had actual infinite search terms, you’d probably be right, anonymous, and it’d be a crapshoot (at least before we carefully weighted them by their importance?). But we actually have a finite number of search terms, and 40-50 dimensions is a whole enormous lot of space.

      It’s actually enough space that even among a trillion search terms, I expect that very few would end up close to a 40-state match by chance* – and we almost certainly have fewer than a trillion significant search terms. So, the ones that do end up close are quite likely to have some real relationship with the data, and if that relationship holds up in 40 states, it’s pretty reasonable to expect it to hold up in the rest.

      *As a thought experiment, if the search terms were independent and random across states, we’d expect only 1 in 1.2*10^19 (that is, 1 in 3^40) to end up in the same 1/3 of the distribution as a given vector in every one of 40 states. Since they aren’t independent across states, the odds are probably much, much better, but exponentiation is still a mighty thing.

      (This also sort of assumes Google is using some sort of Euclidean distance metric, and they probably aren’t, but I think it’s still a reasonable way to think of things.)
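
      The arithmetic in the footnote can be checked in a few lines. This is a sketch under the thought experiment's own assumptions (terms independent across states, three equal-probability bins per state, 40 states); the function names are hypothetical, and the trillion-term figure is the upper bound mentioned above, not a measured count.

```python
def chance_match_odds(n_states=40, bins=3):
    """Return N such that an independent random term lands in the same bin
    as the target vector in every state with probability 1 in N."""
    return bins ** n_states


def expected_chance_matches(n_terms, n_states=40, bins=3):
    """Expected number of purely accidental all-state matches among n_terms."""
    return n_terms / chance_match_odds(n_states, bins)


if __name__ == "__main__":
    odds = chance_match_odds()
    print(f"1 in {odds:.2e}")  # 3**40, about 1.2 * 10**19
    print(f"expected accidental matches in a trillion terms: "
          f"{expected_chance_matches(10**12):.1e}")
```

      Under these assumptions, even a trillion candidate terms yield far fewer than one accidental 40-state match on average, which is the point of the footnote. Correlation between states would raise that figure, as noted.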

  • Andrew

    “flip Ohio, Pennsylvania, Michigan and Wisconsin. But I don’t think there’s actually much similarity between those states as a whole.”

    There is a lot of similarity between Pennsylvania and Michigan, starting with successful local GOP politics and each having a large Democratic city (Philly and Detroit). Illinois is an extreme version of this because metro Chicago is so much more dominant in the state.

    Ohio is a bit different in that it lacks a single big city, which is why it is more Republican. Wisconsin is more truly Midwestern and much more similar to Minnesota in overall demographics.

    “Currently it may be the reddest state in the union.”

    That honor will go indefinitely to Nebraska, Kansas, Wyoming and Idaho, not West Virginia.

  • Mark J

    Obviously, if present polls hold up, Clinton wins handily. Trump’s chances lie in a wave development. What starts The Wave at a football game? A couple people and a crowd wanting some action.

    I believe a Trump wave has at least a 50/50 chance of developing this summer. There is no doubt in my mind that the crowd is susceptible to it happening. What might prevent The Wave from developing at a football game is action on the field that diverts the fans’ attention. Hillary Clinton’s history and plodding style do not seem to me to be good wave-preventers.

    So who are the couple of guys that could trigger The Wave? ISIS. Trump. A love-fest GOP convention followed by a riots-in-the-streets Democratic convention. A black swan. A market collapse.

    The more I think about it, the more likely it seems to me. 70/30. That’s about right.

  • Mark F.

    Actually, Obama was ahead of Romney in the poll averages the day before election day. The race was not a tossup.

  • Mark F.

    The final state polls in 2012 showed Obama with a very likely EC victory. Florida was a tossup, and it was indeed the closest state. But Obama didn’t need it.

    • bks

      Not sure if you’re responding to me, or not, but if you are, I was discussing the national polls per the article above, not the state polls. I’m a big believer in the power of counting the EV based on state-by-state polls.
