Princeton Election Consortium

Innovations in democracy since 2004


Feeding Karl Rove a bug

November 9th, 2012, 1:40am by Sam Wang

Today’s PEC news clips: USA Today, the Philadelphia Inquirer, the LA Times, Atlantic Monthly, and the Daily Princetonian.

Early on Election Night, the New Hampshire results made clear that the state polls were on target, just as they were in 2000-2008 – more accurate than national polls. At that point it seemed more interesting to watch Fox News for reactions. At first they were filled with confidence of a Romney win. As data came in, a funereal air fell over the proceedings. And as is well known by now, Karl Rove became wrapped up in his calculations and had to be called out by Megyn Kelly.

Rove gave every appearance of genuinely believing that Romney would win. Similarly, Team Romney (and many pundits) thought that professional pollsters as a group were off base. This is a case of motivated reasoning: selective questioning of polls that they found disagreeable. It afflicted the whole right-wing media structure.

Do such biases ever help? What about analytical improvements, like the layers added at FiveThirtyEight? Today I report that by a quantitative measure of prediction error, we did as well in Presidential races as Nate Silver, and on close Senate races, we did substantially better – 10 out of 10, compared with his 8 out of 10. Let’s drill into that a little.

For us the keys to success were (a) a high-quality data feed, and (b) avoiding the insertion of biases. Indeed, Mark Blumenthal and Andrew Scheinkman at HuffPost Pollster gave us great data. After that we chose a median-based, polls-only approach to minimize pollster biases.

I will be honest and say that an Election Eve test is not very interesting. Long-term predictions are of greater importance – as well as other ways that aggregation adds value, like tracking ups and downs, as we did. By Election Eve, anyone who is looking at the data honestly can figure out what will happen the next day. Still, let us go along with this week’s media frenzy.

First, the obvious: of the 51 races, one was essentially a coin toss – Florida. Nate Silver, Drew Linzer, and Simon Jackman won the coin toss; Scott Dillon and I lost (though I briefly made a good guess). Is there a better way to quantify this?

One way is to look at our final polling margins, compared with returns.

Whenever a candidate led in pre-election polls, he won. This was true even for a margin of Romney +1% (NC). Evidently state polls have a systematic error of less than 1% – as good as 2008! (Also, as in 2008, pre-election polls substantially underestimated actual margins, this year by a factor of 0.8 +/- 0.3. Majority-party voters in nonswing states like to vote – or minority-party voters don’t.)
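The shrinkage factor can be estimated with a one-line least-squares fit. This is only a sketch: the margins below are made up for illustration, not the actual state data.

```python
# Hypothetical final poll medians vs. actual returns (D-minus-R margins,
# in percentage points). Fit polls ~= slope * actual through the origin;
# a slope below 1 means polls understate the eventual margins.
actual = [10.0, -8.0, 5.0, -3.0, 1.5, 12.0]
polls  = [ 7.5, -6.5, 4.0, -2.5, 1.0,  9.5]

slope = sum(p * a for p, a in zip(polls, actual)) / sum(a * a for a in actual)
# slope comes out near 0.78 for these illustrative numbers
```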

Since Florida was a coin toss, it is better to examine our state win probabilities, as suggested at Science 2.0. The closer the probabilities are to 1.00, the more confident they are. Probability should also measure the true frequency of an event. If I say a probability is 0.80, I expect to be wrong 1 out of 5 times. Our record of 50 out of 51 (counting Florida as a loss) means that our average probability should have been about 0.98. It was 0.97.
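That frequency argument can be checked in a couple of lines; a sketch assuming, purely for illustration, a uniform 97% confidence across all 51 races:

```python
# Expected number of misses given win probabilities: each race contributes
# 1 minus the probability assigned to its more likely outcome.
probs = [0.97] * 51   # assumed uniform confidence, for illustration
expected_misses = sum(1 - max(p, 1 - p) for p in probs)
# about 1.5 expected misses, consistent with going 50 for 51
```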

This can be quantified using the Brier score, as described by Simon Jackman at HuffPost Pollster. This score is the average of the squared deviations from a perfect prediction. For example, if Obama won a race that we said was 90% probable, that’s a score of (1.0-0.9)^2 = 0.01. If we were only 70% sure, the score is (1.0-0.7)^2 = 0.09. The average score for all 51 races is the Brier score. Lower is better: the score rewards being correct, and rewards high confidence when it turns out to be justified.
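The two worked examples translate directly into code (a minimal sketch, with win probabilities and 0/1 outcomes as parallel lists):

```python
def brier(probs, outcomes):
    """probs: forecast P(win) for one side; outcomes: 1 if that side won.
    Lower is better: 0.0 is perfect, 0.25 is coin-flipping."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

ex_90 = round(brier([0.9], [1]), 2)  # the 90%-sure example: 0.01
ex_70 = round(brier([0.7], [1]), 2)  # the 70%-sure example: 0.09
```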

For the Presidential races, the Brier scores come out to

Presidential                      Brier score   Normalized Brier
100% confidence in every result   0.0000        1.000
Princeton Election Consortium     0.0076        0.970
FiveThirtyEight                   0.0091        0.964
Simon Jackman                     0.0099        0.960
Random guessing                   0.2500        0.000

We appear to be slightly better than our very able colleagues. The additional factors used by the FiveThirtyEight model include national polls and maybe some other parameters. It seems that these parameters did not help.

A more interesting case is the Senate, where the 10 closest races had these probabilities:

State            538 D win %   PEC D win %
Arizona           4%            12%
Connecticut      96%            99.8%
Indiana          70%            84%
Massachusetts    94%            96%
Missouri         98%            96%
Montana          34%            69%
Nevada           17%            27%
North Dakota      8%            75%
Virginia         88%            96%
Wisconsin        79%            72%

Note that a number of these races (Indiana, Montana, North Dakota, Virginia) were races I designated as knife-edge at ActBlue.

The cases where the win probability pointed in the opposite direction from the outcome (FiveThirtyEight’s Montana and North Dakota numbers above) are not exactly errors – but they are mismatched probabilities. The Brier scores come out to

Senate race                     Brier score   Normalized Brier
100% confidence in results      0.000         1.000
Princeton Election Consortium   0.039         0.844
FiveThirtyEight                 0.221         0.116
Random guessing                 0.250         0.000

In this case, additional factors used by FiveThirtyEight – “fundamentals” – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available.

Update: I have added a normalized Brier score, defined as 1 – 4×(Brier score). This is a more intuitive measure: 1.0 is perfect, and 0 is no better than chance. Thanks to Nils Barth. I’ll update this post with more information shortly.
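The normalization is a linear rescaling, so it preserves the ranking in the tables above:

```python
def normalized_brier(b):
    """Map Brier score b onto 1 = perfect, 0 = chance, negative = worse."""
    return 1 - 4 * b

pec_presidential = normalized_brier(0.0076)  # ~0.970, as in the table
chance = normalized_brier(0.25)              # 0.0
```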

Tags: 2012 Election · President · Senate

210 Comments so far ↓

  • Shawn Huckaby

    Welcome back! We wondered what you were up to (beyond the media appearances). Glad to see you’re not resting on your laurels.

  • MarkS

    Digging even deeper … for Montana, Nate’s “state fundamentals” appear to have been the reason for his wrong call, but in North Dakota, Nate does not list at all the polls from Pharos and Mellman that appear to be the basis for your correct call. I knew that Nate weighted polls, but not that he ignored some completely …

    Well, it doesn’t matter! And thanks to Act Blue, I contributed to those Democratic victories!

    Now it’s time to start writing to the Senators that I helped elect to call out the vital importance of filibuster reform, and how much I’m counting on them to do something about it!!!

    • Some Body

      Might be an issue of using a different information source for polls. Nate excludes leaked campaign polls (don’t know if this is the case with Mellman, probably not), but at least for Pharos that might have been simple oversight.

      Also, this could perhaps be interpreted as an argument against adjusting for house effects, but there are counterarguments to that one.

    • Sam Wang

      The hard part is knowing which thing to adjust. Choices:

      (1) leave ties alone, i.e. call them ties
      (2) adjust house effects
      (3) bring in the national trend, which appeared to be moving toward Obama
      (4) take my lumps and say that polls-alone gets us 50/51, and move on from this parlor trick where I dropped a card!

    • Froggy

      Nate didn’t count the Pharos polls because he suspected they weren’t legitimate.

    • Hayford Peirce

      I tried many, many times to contribute to the people running against Michele Bachmann and Paul Ryan. I went to their official websites and clicked on the Contribution buttons. In both cases I was directed to ActBlue. I then filled out the necessary info and pushed the appropriate button. I would then get an icon whirring around while my donation was processed — forever and forever. The transaction was never completed.

      I tried this five or six times with my primary browser, Chrome. It never worked for either of them.

      I then tried it using Internet Explorer 8 and the most recent version of Firefox.

      Same story. Impossible to make a donation.

      I sent emails to both websites telling them of my problem and asking how I could contribute.

      In neither case did I ever get a reply.

      In my opinion, both these candidates were so incompetent that they deserved to lose.

      I live in Arizona — but I went to the website of Sen. McCaskill in Mo. and of three other candidates for the House scattered around the country. In all four cases I was able to donate without problems.

      So what is the matter with Act Blue? In my particular case, at least, it actively kept me from making donations.

    • Steve16748

      Absolutely – everybody should demand filibuster reform on practical and fairness grounds, and demand voting reform and an end to the hours-long lines on red-blooded self-respect / moral grounds.

  • Paul

    How about adding Linzer’s scores? I’d also be curious about your larger take on his approach. It’s just one election, but his model sure looks like an Oracle right now — essentially called the outcome in late June.

    • Sam Wang

      You’re right – I should include him. Thank you. I think he did quite well. However, if I recall, his probabilities are substantially underconfident. I need to look at that.

    • Drew

      Sam – I think I emailed you my state win probabilities on Tuesday morning, but if you don’t have them, I’ll resend. If anything, they were overconfident rather than underconfident… overall I think you’ll find they did very well. For example,

    • Steve16748

      Yes, I think you should look at Professor Linzer’s probabilities again, Dr. Wang. For the swing states, where there were lots of polls, his confidence intervals seem to be point-blank bullseyes in every case – particularly Florida, just a gnat’s width off the 50% line.

  • Marvin8

    Sam and Nate combined make for a fearsome dynamic duo. Thanks to both of you, I actually managed to sleep a little bit last week. Love you both. Now, would you go and take some much deserved R & R?

  • AJSdownunder

    I think you have stated the case with Occam’s razor sharpness.

    Meanwhile there is a mixed commentary going on at Brad DeLong’s blog regarding your views of the uncertainties at 538.

  • AySz88

    What are the bars in the figure – one standard error? No bars = not enough polling?

    There seems to be something odd going on in that figure – the final poll margins seem to be underestimating the actual ones consistently. ( Discouragement/enthusiasm? Or something else? )

    Do the other years have this same effect? Depending on how one is calculating poll-vs-vote error one might get different measures as to the reliability of state polls. Could this account for any of why PEC and 538 disagree on how confident we can be in the state polls? Ironically enough, this would be a “systemic” error that doesn’t show up as a constant bias across all states. I could imagine that this might cause Silver to underestimate the reliability of state polling near the 50% mark?

    • Sam Wang

      No bars = no polls, so we had to fill in the previous margin. In the code someplace we put an error bar in to allow calculation of a probability. Since the Meta-Analysis only cares about the probability, which for these states is 0 or 1, the error bar is not a critical parameter.

      In regard to the decreased slope – good eye. Yes, it happened in 2008. Follow that link.

      I think the difference in confidence between PEC and 538 may arise from the use of national polls over there, since he wrote about the large uncertainties. Maybe state polls too – I don’t know. However, I do not think that estimating probability with great exactitude is in his interest, given that he is trying to appeal to a wide audience that does not mind uncertainty.

  • Paul Crowley

    This “Brier score” is a very weird measure. Another score that motivates giving your exact probability is the log of the probability that you gave the final result – higher (less negative) is better.
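A sketch of the log score the comment describes, with probabilities and 0/1 outcomes as parallel lists (hypothetical inputs):

```python
import math

def mean_log_score(probs, outcomes):
    """Average log-probability assigned to what actually happened.
    Higher (less negative) is better; unlike the Brier score, it
    penalizes confident misses without bound."""
    return sum(math.log(p if o == 1 else 1 - p)
               for p, o in zip(probs, outcomes)) / len(probs)
```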

  • Laura

    The Atlantic unjustly put you in the second tier of its “best pundits” article. I love all of the chart-and-graph goodness on 538, but you win 1,000 weegie points for pegging the race early.

  • James McDonald

    Is there any standard way to combine something like the Brier score with a notion of predictive power that rewards earlier predictions?

    It seems a bit tricky, because you want to reward people who got it right early on, but only if they continued to get it right up to the end.

    For example, however they did it, I’m impressed that Votamatic seems to have converged on the final EV result very early on and barely deviated from it over the course of the campaign. Intuitively, that should be given more predictive credence than someone who gyrated wildly but then nailed things the night before the election. [Btw, not a criticism at all of sites that were tracking the current mood of the populace — different measures for different goals.]
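One ad-hoc way to formalize that intuition (not a standard named score, just an illustration) is to average the Brier score over the whole campaign:

```python
def time_averaged_brier(daily_probs, outcome):
    """daily_probs: forecast P(win) on each day; outcome: 1 or 0.
    Averaging over days means a sharp final-day call cannot erase
    months of wild gyration."""
    return sum((p - outcome) ** 2 for p in daily_probs) / len(daily_probs)

steady  = time_averaged_brier([0.80, 0.82, 0.85, 0.90], 1)
gyrator = time_averaged_brier([0.50, 0.20, 0.60, 0.95], 1)
# steady scores lower (better) even though gyrator's final call was sharper
```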


  • Martin

    HA, as i stated before REVENGE OF THE NERDS lives on!

    There ought to be a Mount Rushmore of election forecasters! Thanks so much for what you guys do! I can’t express enough how thankful i am for this! You and Nate kept me sane throughout this long bleeping election during those times i wanted to just shove my head down my toilet with these damn pundits and media members with their constant Ro-mentum BS!

  • pechmerle

    Thanks to Sam’s Act Blue recommended places to put a buck to best use, I donated (small) sums to just four Senate races. ND, IND, MA, and VA. Sam batted 1.000 for me, and I couldn’t be happier.

    (Well, yes, I could, if the O team hadn’t so badly misconceived their approach to the first debate, weakening their upward pull on Dem house chances. But you can’t have everything — not this year anyway.)

    • David

      Every sitting president seems to flub his first debate. They get out of practice and fatigue has set in. Well, not Bill Clinton but he was born with political talent far beyond most.

    • pechmerle

      David, interesting article in the NYT today on the insider view of debate prep on both sides, how much & many warnings O got of just the effect you (& many others) describe, and O’s failure to take them seriously. The story of the reactions inside the O campaign team while monitoring the first debate live is fascinating in a morbid sort of way.

  • wheelers cat

    Dr. Wang, you changed your FL EV estimate because the axis of evil (Rasmussen, Gravis, ARG) distorted the model’s forecast with a few late polls.
    You assumed nonparametrics would remove their bias.
    But it was asymmetrical bias. Non-Gaussian.

    Mirror mirror on the wall
    who is the fairest pollster of all?

    • wheelers cat

      Dr. Wang.
      PEC was the major influence on my 332 EV estimate. I took all my data from here. But I didnt change from 332 because Rasmussen set off my cheater detection module.
      I made a series of 2×2 payoff matrices that demonstrated Ras actually could maximize payoff by selective cheating because of asymmetrical enthusiasm.
      In essence I believed Ras would be wrong enough that FL was actually going Obama.
      It would be interesting to see if some sort of cheater detection could be quantified and incorporated into the models.

    • Froggy

      wheelers, could you provide some more details on your payoff matrices. What sort of “payoff” are you referring to? Affecting the election, things like the PEC model, or something else? What “enthusiasm” is involved? And what is the “cheating”? Is it manipulating the numbers, or do you have something else in mind?

    • wheelers cat

      Froggy I used asymmetrical enthusiasm in the GOP base as one payoff. I took the estimated value from RAND actually (likelihood), because they were sampling a captive population. In essence RAND solved the responder problem by paying $2 for each survey returned.
      For another matrix I used the payoff as increased contributions from Republican donors– which probably isnt independent.
      I ran simulations then to repeat the experiment.
      Both payoffs privileged cheating over the payoff of maintaining the Rasmussen reputation.
      I got the experimental design from Hofstadter’s Metamagical Themas, where he was testing the existence of the Superrational with the Platonia Dilemma.
      I mean, its just a crude approximation and not a rigorous test.
      A more rigorous test would quantify Rasmussen against both other pollsters and events…
      And you see…he didnt have to change his methodology. He failed to capture the cell phone demographies and the hispanic vote in NEV and CO in 2010. There was no market force pressuring him to change. Everyone consumed Rasmussen.
      The aggregators just averaged and weighted his stuff and still used it. Wang, Silver, Blumenthal, etc.
      But I believe his frequent polls and cavalier treatment of hispanic and cell demographies (dependent im sure) pulled the aggregations off course.

    • wheelers cat

      and this is pretty standard classic evo theory of cooperation and cheater detection stuff.
      im just not as good at explaining as Dr. Wang.
      I think Rasmussen had an effect on all the aggregators. They all used his data and hes probably the most prolific pollster. In Dr. Wang’s case I just dont know if the CLT is as proof against cheating as it is against random variation.

    • MarkS

      Drew Linzer identified the outlying pollsters (Rasmussen, Gravis, ARG) on Nov.2.
      Scroll down to the graph, which is quite striking.

  • Some Body

    I actually think this post (or the presidential part of it) is a bit premature. The final results are not in yet (and in some places, notably Ohio, won’t be in for a couple more weeks). None of the races will be flipped, but the margins will change, probably in Obama’s direction in most swing states.

    Why am I mentioning that? Because in terms of margin, the swing state polls actually don’t seem to have been that accurate this year. There was a considerable systematic error in most swing states, but in Romney’s direction. Obama is up by 7 points or so in Iowa and NH. The poll median has him up by 2. With other states the margins are off by less (so far; we should wait), and are off in Obama’s direction in OH (I expect this to change by the time we have the final result, though). But on the whole, this argues for a smaller level of confidence, not a higher one.

    • Some Body

      Sorry – 6 pts in IA and NH. 7 in WI (4.5 in the median).

    • BNW

      I second that. Looking more closely at the exact margins of victory is crucial.

    • Sam Wang

      That is interesting. I don’t think much will be changed by the residual counting that remains. But I do agree that a fine-grained look at swing states is in order.

  • charles

    538 will be a lot closer on the margin than you were once the ballots are counted. Obama went up one point on the provisional ballots, late counted last time.
    Your work is really good, but I do not see the value in the prediction. The meta-margin was 2.46 and the prediction 2.2; 2004 repeats itself.
    You would have gotten a lot closer to the result by just trusting the math – your strong point.

  • BNW

    Jackman writes: “For example, while the averages compiled by HuffPost Pollster and the other polling aggregators were correct in forecasting Obama the winner of the key swing states, HuffPost’s averages understated the president’s victory margin by 2 to 3 percentage points in Wisconsin, Nevada, Iowa, New Hampshire and Colorado (as of this writing, based on the current AP vote count).”

  • Pat

    Actually, it is interesting to note that FiveThirtyEight’s margins of victory were closer to the actual results. Taking all the closest states, the average absolute difference from the actual victory margins is smaller than for PEC, Votamatic, and Pollster.
    I am keeping a spreadsheet, but as mentioned above, we might want to wait a little for the final margins to be known before making too firm conclusions.

  • Pat

    There seem to be millions of votes still left to be counted.
    Any idea when we can finally expect the dust to settle? Within a week, will we have most of the remaining votes tallied?

  • Nils von Barth

    Hi Sam,
    Thanks for the analysis!

    Three comments to follow, in separate comments:

  • Nils von Barth

    (( Normalize Brier score? ))

    How about normalizing the Brier score as 1 – 4*B?

    The scale of Brier scores (0 = perfect; guessing 50% for everything guarantees 0.25) is confusing if you’re not used to it – a 0–1 range is clearer. Perhaps clearest is 1 – 4*B, so 1 = perfect, 0 = chance, negative = worse than chance. This “Normalized Brier” accords with intuition: 100% = perfect, 0% = no information, negative = worse than nothing. (From a quick look at the literature, various normalizations of the Brier score seem to be used in different contexts, so this seems an OK term.)

  • Nils von Barth

    (( Comparison set ))

    How about separate “whole nation” and “swing states” computations?

    Comparison set makes a big difference in Brier score – computing both for whole nation (to show overall uncertainty) and swing states (to show hard-to-tell range) would be interesting. Using a common category also helps one compare scores, to see how Presidential and Senate predictions compare.

    Since the Brier score is an average over whichever races are included, padding the set with safe states improves your score – thus the “whole nation President” and “swing states Senate” Brier scores are not comparable (apples and oranges). Let’s say that 40/50 states were 0%/100% sure – then guessing 50% for the 10 remaining states already gets you B = 0.05.

    Using normalized Brier score (1–4*B, as above) and using only 10 states for Presidential gives (approximately) 85%, 82%, 80%, while normalized scores for Senate are 84% (Sam) and 12% (Nate). (Presidential swing state scores are slightly better b/c no penalty for not being 0%/100% on other states.) By this measure, Sam’s Presidential swing states and Senate swing states are about the same (and both very good), while Nate’s Presidential swing states are v. good, but Senate only just better than chance.
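The 40-safe-states arithmetic above checks out numerically:

```python
def brier(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# 40 races called with certainty (all correct) plus 10 races at 50%:
probs    = [1.0] * 40 + [0.5] * 10
outcomes = [1] * 50   # at p = 0.5 the outcome doesn't change the score

whole_nation = brier(probs, outcomes)            # 0.05
swing_only   = brier(probs[40:], outcomes[40:])  # 0.25, pure chance
```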

  • Nils von Barth

    (( Scatter plot? ))

    How about a scatter plot of predictions vs. outcomes?

    This gives an interesting overall shape, and outliers jump out.

    A scatter plot, as Simon Jackman does, is quite informative, both for a single set of predictions and for comparing two sets of predictions, especially a close-up of the transition region. (Formally, x = win prediction %, y = (2-party) outcome %.)

    Should be a sigmoid curve (roughly), crossing at (50%, 50%) with sharpness of transition showing confidence of predictions; bias is if “crosses” above or below 50% outcome. Especially interesting data points are:
    * Wrong side – correct outcomes are in NE and SW (upper right / lower left), showing wins for >50%, losses for <50%. Any predictions in NW or SE are (binary) misses.
    * Ranking mismatches – when ranking of probabilities disagrees with final outcomes, either due to incorrect ranking of red/blueness, or due to particularly strong or weak confidence in prediction, say due to extensive or missing polling.
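The “wrong side” test in the first bullet is easy to state in code; a sketch with hypothetical (win probability, two-party vote share) pairs:

```python
def right_side(win_prob, vote_share):
    """True if the point falls in the NE or SW quadrant of the scatter:
    the favored side (win_prob > 0.5) actually won the two-party vote."""
    return (win_prob > 0.5) == (vote_share > 0.5)

hit  = right_side(0.90, 0.53)  # NE quadrant: predicted win, got a win
miss = right_side(0.34, 0.52)  # NW quadrant: a binary miss
```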

  • BNW

    Professor Wang – In the Jackman article you link to, he also calculates the Root Mean Square Error and Median Absolute Error for his state-by-state point predictions. How did PEC, 538, and Jackman perform compared to one another using that measure?
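For reference, the two error measures in the question are straightforward to compute; the vote shares below are made up for illustration:

```python
import math
import statistics

def rmse(pred, actual):
    """Root mean square error of point predictions."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def median_abs_error(pred, actual):
    """Median absolute error, less sensitive to a single bad state."""
    return statistics.median(abs(p - a) for p, a in zip(pred, actual))

pred   = [52.0, 48.5, 50.2, 55.0]  # hypothetical two-party vote shares (%)
actual = [53.1, 47.9, 51.5, 54.2]
```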

    • Pat

      BNW. Good idea: I just updated the spreadsheet with the median absolute error and root mean square error (link above).

    • BNW

      Nice work, Pat. I’d love to hear Professor Wang’s view on analyzing the RMSE and MAE for prediction of vote share.

    • BNW

      With the results available as of Nov. 8, the RMSE for the state-by-state vote share predictions comes out to:

      Silver: 1.93
      Jackman: 2.15
      Margin of Error: 2.15
      Linzer: 2.21
      DeSart/Holbrook: 2.40

      If you average the vote share predictions from these 5 models, the RMSE goes down to 1.67

      Unfortunately, they have not performed the calculation for Professor Wang’s vote share projections.

      Also, the numbers need to be refreshed with updated results now available.

    • Sam Wang

      It wasn’t a primary goal in our analysis. We put very little effort into estimating margins in nonswing states on the grounds that it had little practical consequence. The numbers are available on this site.

    • BNW

      These calculations of RMSE for projected state-by-state vote share use the results available Nov 9:

      Nate Silver: 1.81659
      Josh Putnam: 2.05217
      DeSart & Holbrook: 2.17918
      Simon Jackman: 2.240148
      Drew Linzer: 2.468036
      Margin of Error: 2.520504
      Wang & Ferguson: 2.777933

      Their calculation of state-by-state Brier scores is:

      Drew Linzer: 0.00384326
      Wang/Ferguson: 0.007615686
      Nate Silver: 0.00911372
      Simon Jackman: 0.00971369
      DeSart/Holbrook: 0.01605542
      Margin of Error: 0.05075311

    • BNW

      Professor Wang – Thank you for the reply. That is entirely understandable. At the same time, predicting the state-win probabilities may not have been the top priority of all sites. So, using the Brier score should not be the sole or primary way of analyzing the results from different sites. Brier score and RMSE seem to be reasonable ways to look at the projections produced by various sites, to see where their relative strengths and weaknesses are, while taking each site’s priorities into account.

      As well, I think the issue raises interesting questions for forecasters and those who follow forecasting. Predicting win probabilities, or predicting vote share: which of the two projections should be a higher priority for forecasting models, and why? Why should one be a higher priority than the other? Or are the two projections each more useful for different purposes: win probabilities more useful for some purposes, vote share more useful for others? Etc.

      Professor Wang, it would be great to hear some insights from your perspective. Why do you put a higher priority on win probabilities than on vote share? In your model, do more accurate vote share predictions result in more accurate win probability predictions, or is that incorrect? If so, wouldn’t that be a reason to give the two equal priority?

  • Paul G

    I have a question about the order of states. The median electoral vote this year was located in Colorado, just as it was in 2008. All along, if you had allocated Obama the states that went for Kerry or Gore plus Nevada (all of which Obama won by 9+ in ’08), that would have been 263 electoral votes. So it was clear that he needed Ohio, Virginia, or Colorado to push him over the top.

    But according to all reports during the campaign, the Obama team felt most confident about Ohio. For example, this report the day before the election:

    “Chicago… feels nearly as certain of carrying Ohio; and that Obama is just a tad ahead in Virginia. As for Colorado… Team Obama believes… too close to call.”

    But in the end, Obama won Colorado by about 5, Virginia by about 3, and Ohio by less than 2 (at current count). This is also exactly what happened in ’08, when he won Colorado by nearly 9, Virginia by more than 6, and Ohio by less than 5.

    So why were the Obama strategists/pollsters off in their assessment of these states? Part of the Colorado thing might be a mini-Nevada effect, for example the public polls also underestimated Obama, as they did in 08, and the same for Bennet in 10, when polls showed Buck by 3. But still, now we have 2 consecutive election cycles in which Democrats have performed in the order Colorado > Virginia > Ohio. It would behoove pollsters to learn how to poll these states a bit better.

    • Paul G

      Oh, one partial answer to my own question is that the size of the margin may not be the only factor in assessing probability – the stability of the lead is important too. Maybe the Obama campaign was more certain the lead was durable in Ohio for some reason. Still don’t see why though – it ended up a little too close for comfort…

    • Pat

      Yes, I suppose this has to do with the “elasticity” of the state. Same with Pennsylvania: it was virtually uncontested, but ended up with a much closer margin than other battleground states (Iowa, New Hampshire, Nevada, Wisconsin).

    • Paul G

      Elasticity is definitely a consideration, but I’m still skeptical of the evidence that Ohio was more solid for the Dems than Colorado.

      In fact, if you look at the list of the 9 competitive swing states + the 3 “moustache” states that Romneyland claimed were tightening, the order of the D-R margin from 2008 is almost identical.

      From lowest to highest, according to current totals:


      As you can see, the only changes to the list are PA moved up a couple of spots and MN moved back a couple of spots.

    • Paul G

      Oops PA and CO are reversed on the second list.

    • Some Body

      With CO and VA, the Hispanic vote sort of suggests itself (take into account that most polls underestimated not so much the Hispanic turnout, as Obama’s margin of victory with it, probably as a result of not doing any polling in Spanish).

      But the substantial misses of the polling with IA, NH and WI (as of now) – now that’s more interesting.

  • Elliot

    Are you going to follow up on your analysis of whether the aggregate House of Representatives vote corresponded to Republicans retaining the majority, or was a result of redistricting? If Dems actually won a majority of the aggregate vote, it is very important to communicate that to the chattering class, which persists in reporting the GOP House majority as a reflection of the “will of the People.”

  • mediaglyphic

    What about the House, Dr. Wang? I seem to recall we were predicting 210 +/- 10. Any idea what the others were predicting? Not sure if one can do a Brier on this one, as I doubt people had probabilities for each seat.

    • VG

      Hi Sam,
      Your analysis of Nate versus PEC above is great, except people will only remember 50/50 versus 49/50, not 0.0091 versus 0.0076. I do wish your work got the same level of attention his does, since your approach is better.

      I have one specific suggestion: I know you prefer not to be underconfident; but Florida was a clear case where you knew the margin was tied, and that the actual vote margin would be less than 50K votes. Why not simply call it tied instead of tossing a coin?

      If you had done so, you would have been correct 50/50 states, and would probably get the same kudos that Nate is (deservedly) getting.

    • Sam Wang

      Totally understood. That’s what happens when one loses a coin toss.

      I agree, calling a tie would have been better. That was a judgment error on my part.

    • wheelers cat

      but what if the PEC model was just more sensitive to Rasmussen effect in FL for some reason?
      because its simpler?

  • David Zuckman

    I too am seeing a trend line drawn through the actual data that has a greater slope than the ‘perfect prediction’ 45-degree line. So this means that the predictions were actually more accurate in the closer races, and were a bit overly conservative in the less competitive ones, with the conservative-bias error proportional to the real-result margin, yes? Is this an algorithm problem, or a problem with an inaccuracy of polling?

    Also, my take-away understanding of the Brier score is that even though PEC didn’t quite call as many states right as, say, Nate Silver did, it gets a better “grade” because PEC made more confident (smaller margin of error) predictions, is that right? This was my feeling about Mr. Silver’s results–that they were a bit overly “hedged.”

    Appreciate your work…

    • Sam Wang

      (1) Polling inaccuracy. Accuracy is less critical when races are not close. Therefore pollsters probably do not get much feedback in those cases to get better.

      (2) Brier score: your interpretation is correct. However, see my response to VG below. My own damn fault for thinking I had to make a call on Florida.
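
      For readers wondering how the Brier score discussed above works, here is a minimal sketch in Python. The probabilities are made up for illustration, not the actual PEC or 538 numbers:

```python
def brier_score(forecasts):
    """Mean squared difference between forecast probabilities and outcomes.

    forecasts: iterable of (p_win, won) pairs, where p_win is the
    predicted probability of a win and won is 1 if the win happened,
    0 otherwise. Lower is better: a perfect confident forecast scores
    0.0, and calling every race a coin flip scores 0.25.
    """
    pairs = list(forecasts)
    return sum((p - won) ** 2 for p, won in pairs) / len(pairs)

# Hypothetical calls: a confident correct call, a hedged correct call,
# and a 50/50 toss-up that goes the "wrong" way.
calls = [(0.99, 1), (0.80, 1), (0.50, 0)]
print(round(brier_score(calls), 4))  # 0.0967
```

      Note how the score rewards confidence when you are right: hedging a correct call at 0.80 instead of 0.99 costs you, which is why a polls-only model with sharp probabilities can beat a more hedged one on this metric.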

  • Lanny

    Sam, have you done (or will you do) any analysis of the voting of demographic subgroups that appears to have been so decisive in this election? I have the sense that there are some emerging received wisdoms — e.g., the “gender gap” favoring the President — that could benefit from a deeper look ….

  • PeterFormaini

    Excellent work, Sam!

    Actually, I ended up getting all 9 battleground states correct in my final EV map – but I used a time-honored method for calling Florida for Obama: since it was a TIE statistically, I merely looked at the TREND of the polls over the final 2 weeks – and since they were definitely trending from Romney to Obama, I broke the tie in his favor.

    No muss – no fuss! (Plus I used the psychological component of assuming that the political attempt to block Democratic votes would only infuriate Democrats all the more – and thus help Obama in the end.)

  • WingGirl

    Just wanted to thank you for preserving my sanity this week. Cheers!

  • Mike M

    Dr. Sam, in one of your interviews you said that, because they have so much more money, the campaigns can do even more with polling data than you can. I have always believed that, with Carter supposedly being told he had lost weeks before the election as an example. If that is true, how is it possible that Romney was surprised when he lost? Did they lie to him? Was his analytic team incompetent?

    • Reason

      Yep, you are the man, Sam. But you still need to update FL! Also, when will you start doing aggregates on the Va and NJ elections?

    • Sam Wang

      I was pondering this. The only thing I can think is that they were gaming out the most hopeful scenario, and began to believe it themselves. Ultimately, their polling shop had to be run by a few people. If they set up an internal conversation where they mainly trusted one another, then they might become impervious to external criticism – especially if they mistook it as being driven by partisanship as opposed to data. That’s what I mean by motivated reasoning.

    • Venkat Ranganathan

      Excellent work! Thanks for all the hard work. Regarding motivated reasoning, wouldn’t that also explain many people looking at PEC, 538, etc. to feel confident about a Democratic victory? Of course, the candidates have to have their pollsters provide the right information – otherwise, there is no more value in having a team doing polling analysis than in watching FIXed news!

    • Mike M

      If he actually had a polling team that couldn’t poll, I am sure glad he didn’t get to be president.

    • wheelers cat

      Dr. Wang could it have been asymmetrical ideology and “common sense” bias? Rasmussen had the most SNT influence and supported the dem oversampling thesis that the Right bought into.
      They are still in denial.

    • E L

      I’m guessing that his pollsters perhaps knew Romney was losing but did not tell him, either because they knew how committed he had been for 7 long years and did not want to crush him, or because they were afraid, from long experience with CEO types, that bad news causes the death of the messenger. From all reports, Romney was not at all prepared for the results.

    • Michael

      “If that is true how is it possible that Romney was surprised when he lost? Did they lie to him? Was his analytic team incompetent?”

      This may not be all that different from what others have said, but I think they made up their minds about a turnout model before they ever even started polling. They couldn’t believe they were losing because they couldn’t believe that Obama voters were actually going to vote.

  • Vasyl

    To be perfectly clear, you must admit that 538 had a bit more uncertainty because it incorporated some chance of systematic bias among all the polls, one way or the other.
    So, since in this election there was no such bias (state polls provided the best information, and you used it without including that extra uncertainty), your result is better by Brier score. Had it been an election with a systematic bias, it would have been quite a bit worse.
    Not to diminish your modeling – I’m a big fan of both models.

  • Neal J. King

    The campaign I feel bad about, which everyone seemed to miss, was the NV Senate race, with Berkley (D) vs. Heller (R). This also came down to the wire, but no one highlighted it.

    I feel slightly guilty, because that race was the one where I drew the line (I spent a lot of money on out-of-state Senate and House races; not much compared to Adelson’s millions, but proportionately a lot more than I probably should have). If I had had any suspicion that she would come so close, I would have tossed in a little more.

  • Jay Bryant

    I give you 51/51, Sam. You said Florida would be very close, and it was. You admitted that calling it red was a guess, but you made it clear that your real call was just that it was very close. Close enough for me.

    As for Nate’s “fundamentals,” I think they bake in the stereotypes people have about states. “Well, Montana is one of those big mountain west states. Of course it’s red by nature.” As you just demonstrated, ’tis better to kick those kinds of assumptions to the curb and stick to the polls.

  • wheelers cat

    Dr. Wang, in your jackaroyd appearance, you said something that leads me to believe you feel the GOP hasn’t really taken away any lessons from this.
    Do you think there is going to be an intraparty war between the tea party faction (fundamentalists) and the reformers?
    And what are your tea leaves for what will happen in 2014?

  • rudy

    Sam, is it just possible that the Obama (and Romney) campaigns were monitoring your and Nate’s blogs and coming to the same conclusions? Or even possible that they ran methods like yours on their own private polling data?

    I must say Obama campaign team exuded confidence well before polling began (“I will shave my moustache” comment) and Cutter’s satisfied smirk when PA was declared for Obama on election night.

  • David Zuckman

    Sam, regarding your reply to my question above about the increased inaccuracies in the results for the higher-margin races: you reply that it is polling inaccuracy, but that doesn’t totally explain what I’m seeing in the Pres-margins-returns-2012.jpg graphic. If the higher-margin races simply had less accurate polling due to less polling, I’d expect to see a 45-degree trend line implied by the numbers, but with a broader vertical scatter of points at the high-margin ends, equally distributed above and below the 45-degree line. Instead I see a discernibly consistent trend line of higher slope than the 45-degree line. Why would the polling at both high-margin ends of the graph have consistently too-conservative results?

  • Tangotiger

    I think the Margin of Victory test is not only better, but also more intuitive. By that measure, 538’s model was off by 4 points, while PEC’s was off by 5 points.

    Otherwise, if you look at “probability”, you have so many states at the 99% level that it becomes noise in the data. I think there should be a big difference between calling a state at 100% with a 20-point margin of victory and calling it at 100% with a 10-point margin, even if both ultimately are a 100% probability of a win.

  • Khan

    With all the money Karl Rove has stolen from those billionaires on the backs of–and we’re just going to leave it at shaky, opaque–data, he is lucky a bug is all he’ll be eating.

    Also, how great is it that #DrunkNateSilver was a top 10 trending topic on Twitter?

    • wheelers cat

      i think #drunknatesilver is punishment enough for Nate.
      wow, I loved this.

      “Wang, the Princeton professor, believes pundits and computer-aided analysts can coexist. ‘It’s possible to be Homer and write about the wine-dark sea,’ he said. ‘But sometimes you want the guy with the thermometer.’”

      It’s possible to be an uberl33t Poll Jedi and quote T.S. Eliot and Homer.
      Third culture intellectuals FTW.

    • Khan

      @wheelers cat

      My favorite was #DrunkNateSilver is wandering around NYC telling people the day they’re going to die.

    • wheelers cat

      and the regular commenters here know that I have never entirely trusted Nate since he pulled a post debunking the dem-oversampling myth in July and refused to replace it.

    • Khan

      @wheelers cat

      I can understand the skepticism. I generally trust Nate because I understand how his model functions. Sure, he could tweak something on the back end that might have the appearance of normality, but then it comes down to whether he is a trustworthy person, and whether the integrity of the NYT adds to that credence.

      On the other hand, Nate is still under the purview of the NYT editors, and although his model might be safe from their reach, his editorializing is not.

    • Froggy

      Wheelers cat, if we’re talking about pulled posts let’s not neglect Dr. Wang’s mid-October post on Nate Silver’s error bars, which was only briefly on line before it disappeared, never to be seen again. (I have a copy on my computer at home, should anyone be tempted to try and deny its existence.)

    • Froggy

      When I wrote, “should anyone be tempted to try and deny its existence,” I certainly didn’t mean that Dr. Sam would do that – he’s far too much of a stand-up guy for that sort of behavior. There was actually some good stuff in that “lost” post regarding sources of error, things that might have made their way into later posts.

    • Sam Wang

      Hmmm, I actually thought I recycled most of that stuff in other posts. It’s ok – remind me (and everyone) of what it said. I have been pondering whether it’s of interest. If I recall, some of it was fairly nerdy inside baseball…

  • Shawn Huckaby

    Fascinating discussion on NPR this morning about how Allan Lichtman uses “geophysical” principles to model and forecast election outcomes, sometimes years in the future. He’s been right for the last 8 cycles, and called this one for Obama in January 2010!

    Better watch your back Dr. Wang…

  • Jim Janis

    Thanks, Sam. You and Nate helped keep me and my family sane these past few weeks as we fretted about the election. Your work is very important and we appreciate it very much.

  • Pat

    Here is a link to a spreadsheet I made, comparing for each state the margins given by 538, PEC, Pollster and Votamatic with the actual (preliminary) result.
    The average error for the 10 closest states is given below the table.

    Not sure if Pollster and Votamatic gave a national popular vote estimate. You are welcome to point out any mistakes.

  • BNW

    Re: Pat et al. above

    Is it accurate to say that PEC was more accurate in projecting probabilities, by the Brier score, and 538 was more accurate in projecting the vote share & margin of victory?

    If so, this raises an interesting question. Which is more important in assessing the results produced by a projection model: the accuracy of the probabilities, using a Brier score, or the accuracy of the vote share/vote margin? Which one, and why?

    Or are they equally important? Or is each projection more useful for different purposes: more accurate probabilities for some purposes, more accurate vote shares/margins for others?

    • Pat

      It seems so, though admittedly an average (absolute) error of 1.89% for PEC versus 1.46% for FiveThirtyEight over the 10 closest states is not that huge a difference.
      In general, it looks like all the aggregators (i.e., the polls) missed in the same direction, mostly underestimating Obama’s support in the swing states, with the notable exception of Ohio (and possibly North Carolina).

    • BNW

      For the margins, I can’t do them myself at this moment, maybe this weekend, but I’m interested in seeing calculations of the root mean square error and the median absolute error across all 50 states.

      It also seems that a .0015 difference in the Brier score is not a huge difference.

      Yes, erring mostly in the same direction, but by different degrees and with, as Pat noted, exceptions. It would be interesting to see a state-by-state breakdown of these numbers.
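
      The two error metrics mentioned above take only a few lines to compute. A sketch with hypothetical predicted-versus-actual margins (percentage points, D minus R), not the spreadsheet’s real values:

```python
import math
from statistics import median

def rmse(pred, actual):
    """Root mean square error between predicted and actual margins."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def median_abs_error(pred, actual):
    """Median of the absolute errors; robust to a few badly missed states."""
    return median(abs(p - a) for p, a in zip(pred, actual))

# Hypothetical margins for four states (illustrative numbers only)
predicted = [2.0, -1.5, 6.0, 0.5]
actual = [3.0, -0.5, 7.5, 0.9]
print(round(rmse(predicted, actual), 4))   # 1.05
print(median_abs_error(predicted, actual))  # 1.0
```

      The two metrics answer different questions: RMSE is pulled up by a handful of badly missed non-competitive states, while the median absolute error reflects the typical state.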

    • Pat

      OK, the spreadsheet has been updated to include all of PEC’s state margins, taken from Sam’s file.
      (Note that these predicted margins differ slightly from the Power of Your Vote table.)

    • BNW

      Pat – Sorry, I only took a hurried look at your spreadsheet before going to work. I see you already did a fifty state calculation.

      I wonder what it was in the 538 model that made it more accurate on vote share. My first hypothesis on what made the biggest difference would be the use of “state fundamentals” particularly for non-swing states, where there was less polling data.

    • Pat

      Yes, this is especially clear in non-competitive states like Hawaii or Tennessee, where state polls alone were probably rare and missed the final margin by quite a bit. Nate’s fundamentals apparently helped in those cases.
      It also seems to have helped a bit (to a smaller extent) in swing states, but not sure why.
      We may just have to wait a little longer for the final results to come out.

  • bks

    You guys all did great, but only because of high-quality poll data and because you did not let your own biases influence your science. Will the same pollsters be good in 2016, or is it stochastic? Did Scotty Rasmussen have a finger on the scales, or just bad luck? And will the focus on poll aggregators in 2016 feed *forward* into the results because of anticipatory actions by the pollsters themselves?


    • wheelers cat

      I predict Rasmussen will be discovered to have had a finger on the scale with bad methodology.
      the invisible hand of the market is going to take care of him.

    • E L

      @cat two universes exist in the US: Fox/Drudge/Limbaugh and Reality. So there may be two invisible hands for quite some time.

    • wheelers cat

      I do not have any faith in Intrade as predictive or as a “performance benchmark”. But I do believe in competitive performance, bidding theory, and evo theory of cooperation.
      Rove and Rasmussen are going to see their market value fall.

  • Jason

    Sam, correct me if I am wrong, but I believe your model produced a predicted vote-share by state as well as standard errors about that estimate (which is ultimately what was used to create the win probabilities).

    Have you checked yet how the standard errors on the state-by-state vote shares performed? For example, was the vote-share for Obama within the 95% confidence interval about 95% of the time?

    I ask because a major concern with your model was that you were understating the uncertainty due to bias in state polls (it’s clear from the result that there was no large systematic bias, but there may have been idiosyncratic bias your model missed, causing it to overstate the certainty of the result).
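
    The calibration check Jason describes can be sketched as follows, assuming one has per-state predicted margins, standard errors, and actual results. The numbers below are illustrative, not PEC’s actual outputs:

```python
from statistics import NormalDist

def coverage(preds, ses, actuals, level=0.95):
    """Fraction of actual margins falling inside the predicted interval.

    If the standard errors are well calibrated, this comes out near
    `level`; a much lower value means the model was overconfident.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for a 95% interval
    hits = sum(abs(a - p) <= z * se for p, se, a in zip(preds, ses, actuals))
    return hits / len(preds)

# Hypothetical per-state predicted margins, standard errors, and results
# (percentage points); the third state misses its interval.
preds = [2.0, -1.0, 5.0, 0.5]
ses = [1.5, 1.5, 2.0, 1.0]
actuals = [3.0, -0.8, 9.5, 0.4]
print(coverage(preds, ses, actuals))  # 0.75
```

    With only ~50 states, the observed coverage is itself noisy, so a single election can only weakly distinguish well-calibrated intervals from modestly overconfident ones.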

  • Mark Sillman

    Three comments:

    (1) The one significant difference between Dr. Sam and Nate was in Montana. Nate’s site shows that the polls he used favored Tester, but his model also includes the state partisan tendency, and that tipped it towards the Republican. Nate was wrong this time.

    (2) A broader issue: Oddly, Sam and Nate may be doing a disservice by bringing these accurate predictions to public attention. If it were widely known that Obama was a 95% favorite to win the election, then people might not have stood for hours in line to vote – and that might have caused Romney to win! We may be better off with ignorant pundits, or with a moratorium on polling during the final week.

    (3) That’s a nasty, in-your-face headline: lumping Karl Rove and Nate Silver together!

  • Steve

    As one who has been interested in election modeling for a long time, I first became interested in what Rasmussen was doing back when he first started. Initially I was very curious as to how his robocall methods would fare, since he was able to get reasonably decent sample sizes.

    But, as time wore on, I discovered that he was way off on individual states, particularly in 2000. As I studied his methods more, I found that he was not using acceptable weightings (i.e., his R/D/I assumptions versus demographic stats.)

    Now we have the fact that his robocall methods are not supplemented with cell phone contacts. In short, his work is biased and not reliable at all. I have found that while he has gotten lucky on at least one national percentage call (2008), he is awful on state calls. Nate Silver has already written about this. Moreover, I suspect that his robocalling methods have problems that bias his results. I cannot nail this down, but that is what I think. Perhaps some of you who are much more astute than I am can shed some light on this.

    I would very much like to think that the market will censure him, but I am afraid that he will always have a right-wing segment that likes to hear his results – not all that different from those same people listening to Rush Limbaugh. Intentional or not (I say it is intentional), he leans R and panders to those who want to hear that.

    In conclusion, I view Rasmussen as a total disgrace to statistics. But I fear he will still be around for the same reasons that Limbaugh is.

    • wheelers cat

      but everyone still used him.
      number of polls dropped from 1700 in 2008 to 1200 in 2012.
      ALL the aggregators were held hostage to regulatory capture and the cartel of the red house effect pollsters.

      let’s say Silver now refuses to use Rasmussen data in the future. The NYT will say that is “unfair”.
      And it’s inefficient.
      We just need to understand how to remove error accurately from poll houses that exhibit asymmetrical political bias behavior.

    • 538 Refugee

      Since Nate weights his polls, he doesn’t have to refuse to use any poll – that is his basic model anyhow. His predictions were still pretty solid overall. It’s not like there aren’t left-leaning polls, which is why Sam’s median works. My real issues with 538 are that Nate now writes like he has a word quota, and the paywall. Every once in a while I’d have to ‘toss my cookies’ to continue. Not a big deal, but annoying.
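
      The weighting-versus-median point can be illustrated in a couple of lines: a single house-effect outlier drags a mean of polls, but the median shrugs it off. The margins below are hypothetical, not any real state’s polls:

```python
from statistics import mean, median

# Hypothetical poll margins for one state (Obama minus Romney, in
# percentage points), with one house-effect outlier leaning R.
polls = [3.0, 2.5, 3.5, 3.0, -2.0]

print(mean(polls))    # 2.0 -- the outlier drags the mean toward R
print(median(polls))  # 3.0 -- the median ignores the outlier
```

      A weighted average can achieve a similar effect, but it requires deciding how much to down-weight each pollster, whereas the median is robust without any per-pollster judgment calls.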

  • Christian

    I agree with many comments stating that any evaluation should be delayed until all vote counts are certified.

    I also think that any evaluation of a model’s success should be done by an independent third-party. Someone with the statistical knowledge but with no interest in demonstrating that one model is “better” than another. I think Andrew Gelman at Columbia would be an awesome person to do this. It seems some other commenters are doing this now by building a spreadsheet.

    It would also be great to see an evaluation that included past predictions. Not sure how to set this up, but some metric to evaluate a model’s prediction at -6 months, -3 months, -1 month, and -1 day from election. For instance, Votamatic had a steady prediction of 332 Obama EVs, plus or minus a few bumps, for a very long time.

  • bks

    Also worth noting that Romney himself believed in “unskewing” the polls!


  • Fred

    Nate is getting a lot of well deserved recognition but Sam Wang deserves every bit as much.

Leave a Comment