Feeding Karl Rove a bug

November 9, 2012 by Sam Wang

Today’s PEC news clips: USA Today, the Philadelphia Inquirer, the LA Times, Atlantic Monthly, and the Daily Princetonian.

Early on Election Night, the New Hampshire results made clear that the state polls were on target, just as they were in 2000-2008 – more accurate than national polls. At that point it seemed more interesting to watch Fox News for reactions. At first they were filled with confidence of a Romney win. As data came in, a funereal air fell over the proceedings. And as is well known by now, Karl Rove became wrapped up in his calculations and had to be called out by Megyn Kelly.

Rove gave every appearance of genuinely believing that Romney would win. Similarly, Team Romney (and many pundits) thought that professional pollsters as a group were off base. This is a case of motivated reasoning: selective questioning of polls that they found disagreeable. It afflicted the whole right-wing media structure.

Do such biases ever help? What about analytical improvements, like the layers added at FiveThirtyEight? Today I report that by a quantitative measure of prediction error, we did as well in Presidential races as Nate Silver, and on close Senate races, we did substantially better – 10 out of 10, compared with his 8 out of 10. Let’s drill into that a little.

For us, the keys to success were (a) a high-quality data feed and (b) avoiding the insertion of biases. Indeed, Mark Blumenthal and Andrew Scheinkman at Pollster.com gave us great data. After that we chose a median-based, polls-only approach to minimize pollster biases.

I will be honest and say that an Election Eve test is not very interesting. Long-term predictions are of greater importance – as well as other ways that aggregation adds value, like tracking ups and downs, as we did. By Election Eve, anyone who is looking at the data honestly can figure out what will happen the next day. Still, let us go along with this week’s media frenzy.

First, the obvious: of the 51 races, one was essentially a coin toss – Florida. Nate Silver, Drew Linzer, and Simon Jackman won the coin toss; Scott Dillon and I lost (though I briefly made a good guess). Is there a better way to quantify this?

One way is to look at our final polling margins, compared with returns.

Whenever a candidate led in pre-election polls, he won. This was true even for a margin of Romney +1% (NC). Evidently state polls have a systematic error of less than 1% – as good as 2008! (Also, like 2008, pre-election polls substantially underestimated actual margins, this year by a factor of 0.8 +/- 0.3. Majority-party voters in nonswing states like to vote – or minority-party voters don’t.)

Since Florida was a coin toss, it is better to examine our state win probabilities, as suggested at Science 2.0. The closer the probabilities are to 1.00, the more confident they are. Probability should also measure the true frequency of an event. If I say a probability is 0.80, I expect to be wrong 1 out of 5 times. Our record of 50 out of 51 (counting Florida as a loss) means that our average probability should have been about 0.98. It was 0.97.

This can be quantified using the Brier score, as described by Simon Jackman of Pollster.com. This score is the average of the squared deviations from a perfect prediction. For example, if Obama won a race that we said was 90% probable, that’s a score of (1.0-0.9)^2 = 0.01. If we were only 70% sure, the score is (1.0-0.7)^2 = 0.09. The average score for all 51 races is the Brier score. The Brier score rewards being correct – and rewards high confidence.
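For readers who want to check the arithmetic, here is a minimal sketch of the calculation in Python. The probabilities below are illustrative placeholders, not our actual state-by-state numbers.

    # Minimal sketch of the Brier-score arithmetic described above.
    def brier_score(forecasts):
        # forecasts: list of (win_probability, outcome) pairs,
        # where outcome is 1 if the candidate won and 0 if he lost.
        return sum((outcome - p) ** 2 for p, outcome in forecasts) / len(forecasts)

    example = [
        (0.90, 1),   # 90% confident, correct:  (1.0 - 0.90)^2 = 0.01
        (0.70, 1),   # 70% confident, correct:  (1.0 - 0.70)^2 = 0.09
        (0.97, 0),   # 97% confident, wrong:    (0.0 - 0.97)^2 ~ 0.94
    ]
    print(brier_score(example))   # ~ 0.35; lower is better, 0.25 is random guessing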

For the Presidential races, the Brier scores come out to

Presidential                        Brier score    Normalized Brier
100% confidence in every result     0.0000         1.000
Princeton Election Consortium       0.0076         0.970
FiveThirtyEight                     0.0091         0.964
Simon Jackman                       0.0099         0.960
Random guessing                     0.25           0.000

We appear to be slightly better than our very able colleagues. The additional factors used by the FiveThirtyEight model include national polls and maybe some other parameters. It seems that these parameters did not help.

A more interesting case is the Senate, where the 10 closest races had these probabilities:

State            538 D win %    PEC D win %
Arizona          4%             12%
Connecticut      96%            99.8%
Indiana          70%            84%
Massachusetts    94%            96%
Missouri         98%            96%
Montana          34%            69%
Nevada           17%            27%
North Dakota     8%             75%
Virginia         88%            96%
Wisconsin        79%            72%

Note that a number of these races (Indiana, Montana, North Dakota, Virginia) were races I designated as knife-edge at ActBlue.

I have indicated in red the cases where the win probability pointed in the opposite direction from the outcome. These are not exactly errors – but they are mismatched probabilities. The Brier scores come out to

Senate race                       Brier score    Normalized Brier
100% confidence in results        0.000          1.000
Princeton Election Consortium     0.039          0.844
FiveThirtyEight                   0.221          0.116
Random guessing                   0.250          0.000

In this case, additional factors used by FiveThirtyEight – “fundamentals” – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available.

Update: I have added a normalized Brier score, defined as 1 - 4*(Brier score). This is a more intuitive measure. Thanks to Nils Barth. I’ll update this post with more information shortly.
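As a quick check, applying that definition to the Brier scores in the tables above reproduces the normalized column (a minimal sketch; the numbers are copied from the Presidential table):

    # Normalized Brier = 1 - 4*(Brier score): 1.0 is perfect, 0.0 is random guessing.
    for name, brier in [("Princeton Election Consortium", 0.0076),
                        ("FiveThirtyEight", 0.0091),
                        ("Simon Jackman", 0.0099),
                        ("Random guessing", 0.25)]:
        print(name, round(1 - 4 * brier, 3))   # 0.97, 0.964, 0.96, 0.0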

92 Comments

Jason says:

You know it is extremely insecure of you to constantly benchmark yourself against Nate Silver. It’s quite childish as well. You should stop. It makes you look bad. -A Happy Democrat

Sam Wang says:

…as opposed to someone you never heard of? hmmm. For good or bad, he is currently the benchmark.

wheelers cat says:

no it doesn’t.
don’t you believe in the competition of the Holy “Freed” Market?
Competition will deliver the most accurate model.

howard says:

it’s perfectly natural for the top poll prediction sites to benchmark one another. instead of looking at it in personal terms, look at it as just standard academic practice. whenever researchers present a new prediction approach in a paper, they often compare training/holdout errors of their proposed model against those of other well-known/industry-standard approaches.

Kerr says:

There are myriad ways to say it. I prefer, “Competition breeds innovation”. It makes no one inherently look bad. This isn’t a “receive a medal for participating” affair. Those who follow poll aggregators are interested in how the variants fare against one another.

Muhahahahaz says:

I hope Jason is being facetious?
It’s very natural to compare well-known models. That’s the academic way! The whole point is to evaluate their strengths and weaknesses. Then we will know which models should be used under which circumstances.
It’s just like something you would see in an academic paper (which I’m sure Dr. Wang writes plenty of). It’s very refreshing to see, especially when contrasted with the constant bickering that we see in the political arena.

azoomer says:

Disagree as well.
True geeks ™ were waiting for the comparisons. We shouldn’t fall for the same trap the conservatives fell into. Ideas should be tested rigorously and not declared as dogma by fiat.
Waiting for the analysis of the other aggregators.
Thanks.

Berkeley W D says:

SAM is the benchmark, as Nate knows at least subconsciously.

PollyUSA says:

How else would we know how the different models have performed? Sam is an honest broker.

Missiv says:

I find it kind of funny, the various tiny rules everyone appears to be putting on data collectors. Numbers don’t lie; they don’t care who the winners and losers are. Numbers exist. Who would he compare himself to, Unskewed?

BNW says:

Comparisons are an essential part of evaluating different models.
However, there seems to be a consensus in the comments here that the original post has limitations: it uses Brier scores as its only method of comparison, when RMSE for vote share is also a meaningful method.
gwern (below) has calculated the RMSE:
Nate Silver: 1.81659
Simon Jackman: 2.240148
Drew Linzer: 2.468036
Wang & Ferguson: 2.777933
Regarding vote share margins, Professor Wang posted: “It wasn’t a primary goal in our analysis. We put very little effort into estimating margins in nonswing states on the grounds that it had little practical consequence.”
Those decisions are entirely understandable, and this puts PEC’s high RMSE into an appropriate perspective. At the same time, predicting the state-win probabilities may not have been the primary goal in other models. There are equally understandable, equally practical reasons for deciding to make vote-share predictions the primary goal and making modeling decisions that are conservative on win-probabilities.
Whatever decisions may be built into models, Muhahahahaz is right: “The whole point is to evaluate their strengths and weaknesses. Then we will know which models should be used under which circumstances.”
For these reasons, Brier scores should not be the only method used for comparing models: RMSE should be used as well.
Per Professor Wang’s comment above, another comparison should be the RMSE for swing states only, rather than all states.
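A minimal sketch of the RMSE computation being compared here (margins in percentage points, Obama minus Romney; the four example states reuse the medians and returns Vicente quotes later in this thread, so this is an illustration, not any forecaster’s actual code):

    from math import sqrt

    def rmse(predicted, actual):
        # Root-mean-square error between predicted and actual state margins.
        return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

    # CO, FL, IA, NH: predicted median margin vs. actual margin.
    predicted = [2.0, 0.0, 2.0, 3.0]
    actual    = [4.7, 0.6, 5.6, 5.8]
    print(round(rmse(predicted, actual), 2))   # ~ 2.67 for these four states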

BNW says:

Using gwern’s data for the swing states of NH, VA, NC, FL, PA, OH, WI, IA, CO, NV, and using updated values from Pollster for the election results, to 1 decimal place, the RMSE values work out to:
Linzer: 0.637965516
Silver: 0.64807407
Putnam: 0.696769331
DeSart: 0.712741187
Wang: 0.751165761
Jackman: 0.820365772
MoE: 1.692040189
gwern’s original values for all 50 states and D.C. were:
Silver: 1.81659
Putnam: 2.05217
DeSart: 2.17918
Jackman: 2.240148
Linzer: 2.468036
Margin of Error: 2.520504
Wang: 2.777933
Looking just at the above swing states, compared to 50 states and D.C., Wang jumps over MoE and Jackman, moving from #7 to #5. The most significant difference is Linzer, who moves from #5 to #1.

badni says:

Maybe polls are even more accurate than we think, but some human judgment is needed on who is a pollster (as opposed to a messenger).
If you re-run your final prediction excluding the polls widely believed to be biased (I think that would be Ras, ARG, and Gravis), do the errors become smaller? What about for 2008?

wheelers cat says:

Here is an observation.
Project ORCA crashed and burned.
http://www.balloon-juice.com/2012/11/09/fail-whale/
I think the GOP needs to recruit nerds far more than it needs to recruit hispanics. Sasha Issenberg said it too, that micro-targeting and SNT research is happening exclusively on the Left.

MAT says:

In fairness, Project Houdini (a similar project on the D side) crashed and burned about midafternoon on election day 2008. Rolling out a project in a single day that requires a huge amount of scaling to non-technical users, with zero room for error, is nearly impossible for even highly competent rollout teams (which doesn’t seem to have been the case here, given the list of basic blocking-and-tackling-level screwups being reported in the press).
Regardless of who is running the show – there are always growing pains. But ORCA is a glaring example of one side literally being 4+ years behind the other. It’s even more cringe-worthy when you realize ORCA was website-based, rather than phone-app-based. There are good reasons for doing it that way (it’s much easier for the developers), but it’s not anywhere near the cutting edge. In this particular war, one side is being very heavily outgunned and it shows.
There are plenty of conservative programmers. It’s the rest of the creative/academic universe, as you rather elegantly put it in a previous post, where one side is very badly lacking.

MAT says:

Having reflected more on this, however, it occurs to me that betting the US Presidency on a single day rollout of software that had never been tested in the field is unbelievably, colossally and monumentally stupid.

Partha Neogy says:

@ wheelers cat
Hi Wheelers,
Silver, Wang and Linzer pretty much nailed the election. If ORCA had “succeeded,” we would be wondering where Silver et al. went wrong. So, I am not sure whether ORCA really failed, or whether its potential was over-hyped by the people who sold it to their sponsors.

wheelers cat says:

Hai Partha!
I’d add Simon Jackman. And yes, you an’ MAT are correct.
7 months for a software project of that magnitude??
jamais de ma vie.
what were they thinking???
But it still seems to me, all the R&D, all the cutting edge stuff, is happening in the Obama campaign.
I worked GOTV, and the micro-targeting was powerful. SNT pumped influence and obligation along all the connections, turned out the volunteers, who turned out the voters.
Force amplification.
I agree with MAT…a four year lag.

Some Body says:

The whole thing, I must say, sounds a bit too Orwellian for my liking.

MAT says:

Making this even worse, ORCA could have easily been field tested during early voting. NC, for example, provides a downloadable list of who early votes, updated every evening. A trial run could have been done with results compared to the actual data.
Also, I know for a fact that in my county in NC, no request was ever made for credentials for R observers in our polls. So at least here, no data would have been captured, even if the thing had worked perfectly.

PollyUSA says:

I was micro-targeted by the GOP and their affiliates (Faith and Freedom, Americans for Prosperity, PAGOP…)
What’s interesting is that I was being called as a sporadic PA registered republican.
I haven’t voted in PA since 2007 because I moved to California. Starting three days before the election I was called eleven times on my California landline. They were calling as if I was a PA voter.
They also called and asked for my daughters by name as well. (both were registered republicans in PA and would also be considered sporadic voters because they also moved to Cali in 2007)

TAW says:

It seems to me that most of the gain from micro-targeting and other tech advantages that the Democrats have would *not* be picked up by polling.
For example, if micro-targeting is used to turn an unlikely voter into a voter — then the polls would either have to figure this out in advance or miss the vote.
In 2012, the close states had a margin for Obama larger than predicted in state polls.
Better technology can produce a systematic bias – and in this election could explain the larger-than-predicted margins for Obama.

PKWash says:

Sam: Just wanted to thank you for all the insightful (and accurate) analysis leading up to and now after the election. I was (and am) an avid reader of the 538, but began reading your blog about 3 weeks ago. Together, I had the most accurate view I think I could have gotten, short of mind-reading, as to how the election would turn out. Really settled my nerves. Thanks again — and, please (if possible), keep it up!

dsm says:

Professor Wang,
Thanks very much for this analysis, although as a layperson I should mention that the relevance of a Brier Score (“normalized” or not) is somewhat beyond me.
In your “Virtually Speaking” interview you touched on a point that was much easier for this layperson to understand. Specifically, you mentioned that 538’s “cautious” probabilities suggested that Silver ought to have missed three or four calls. Is that something you can expand upon without resorting to Brier scores?
Thanks again.

Peter D says:

I find it interesting that the trend was for a candidate to *outperform* his predicted margin of victory. I believe that historically the tendency has been for underperformance of about 20-30%. I wonder why this changed?

P G Vaidya says:

Please do not stop now.
This site is getting more and more interesting. I like Paul Crowley’s idea of taking a log. It could take us in the direction of measuring information content of predictions (which, as you know, would be the negative of entropy).
I also like Wheeler’s Cat’s game theoretic 2 by 2 matrices.
Let me add one rather mundane metric. As I have stated before, I do not gamble. (I keep repeating the story when Isaac Asimov confessed to his father that he had lost ten dollars on a bet. His father said, “thank God, I hate to think what would have happened if you had won!”)
On the other hand, from Poincare to Mandelbrot, mathematics has benefited by taking a detached academic view of Gambling.
So, when Intrade has cooled down, we can ask one question: ” Was there an optimal strategy for a gambler to place his bets, everyday, for the last three months based on, say, Dr. Wang’s results, vs. Nate Silver’s results?”
Theoreticians only. Let me restate that I would enjoy this site less if in 2016, greedy gamblers pollute this pristine site.

Steve says:

I do think that Sam Wang, Nate Silver, Drew Linzer, Simon Jackman (and perhaps I am missing some others?) should be knighted for their work. Nobel Prize? They all are great! Particularly Sam!
A bit more on Rasmussen: If I understand it right, Fox News — other than him being a guest on their programs — is not using him. They have their own in-house group that appears to be not too shabby. They got put on the spot by Rove’s Baghdad Bob impersonation routine on election night— and stood their ground. That whole Fox thing was one of the weirdest things I have ever seen on TV.
With regard to lousy pollsters like Rasmussen one of the beauties of Sam’s methods (such as medians, etc.) is that the outlier polls are not influencing the results when there are plenty of state polls. For me, in a non-scientific manner, every time I saw a Rasmussen poll, I automatically tended to mentally subtract 4 points from the result and move it to the Dem side.

badni says:

A consistently biased poll is not an outlier. It is simply a weight on one side of the scale. If there are 3 biased (in the colloquial sense) pollsters on one side, and just one on the other, then the median will consistently be two polls’ worth to that side as compared to the unbiased median.
As long as there is a balanced number of biased polls, this is fine. I think Sam has been lucky so far. One more pop-up firm like Gravis, and I think he would have missed several calls. I think he needs to scrutinize how to deal with this as poll-aggregator manipulation of this kind becomes a more aggressively used strategy.
Medians do not solve for this issue.

wheelers cat says:

I actually think Rasmussen and Gallup caused Dr. Wang to miss the coin flip on FL.
The PEC model kindof black boxes poll inputs.
The CLT ensures smoothing of outliers, not cheaters or antique methodology.
I was protected by my cheater detection module and 2×2 payoff matrices, Nate was protected by his red house effect weighting.

wheelers cat says:

yes badni, you got it.
asymmetrical political behavior bias.

Neal Rosendorf says:

I’d like to add Scott Elliot at Election Projection to your list. Although ardently conservative, he unhesitatingly called the election for Obama in the days leading up to this past Tuesday (like Dr. Wang, he called FL for Romney; and he projected 303 Obama on the morning of Election Day, accompanied by expressions of sadness over the inescapable numbers). Unsurprisingly, the many Conservative Websites I was scanning in the lead-up to Nov. 6th gave Election Projection no coverage. Kudos to a scrupulous conservative political analyst, sadly atypical in 2012.

Some Body says:

Cat — Gallup is irrelevant, though. They didn’t poll Florida; just nationally.

wheelers cat says:

Some Body, I understand that you are just itching to mansplain it all to me.
Just don’t.

Joe Allen says:

Poll Aggregation Dodged a Bullet, but seriously underestimated poll-based error
The strength of Obama’s victory, and the fact that he happened to hold small leads in swing state polls prior to the election has obscured a highly inconvenient fact: the poll aggregates were actually substantially in error, and it was only because the error happened to favor Obama in this instance (something that was not predictable in advance from the polls) that the predictions didn’t go disastrously awry.
All one has to do is compare the RCP average for swing states to the actual results, and then ask what if the discrepancies had all gone the opposite way (i.e. in Romney’s direction, not Obama’s—an outcome that was equally likely given the assumptions of the poll aggregations). In that case, Romney would have won the following states and hence the election! (Numbers in parentheses give the RCP final average first, followed by the discrepancy between that average and the final result; these numbers indicate the discrepancy would have been enough to swing the state to Romney had it been in his favor instead of Obama’s. Note that no states would have swung the other way if all the discrepancies had simply been reversed.)
CO (1.5, -3.2), FL (0, -0.6), IA (2.4, -3.1), NH (2.0, -3.8), VA (0.3, -2.7), NV (2.8, -3.8)
In short, there was more than enough true error in the poll-based estimates to swing the election either way…and only because it went toward Obama do the poll aggregation models come out looking OK.
To be clear, I think the poll modeling approach is great, and that both Nate Silver and Sam Wang do wonderful work…I just think both need to build more error into their estimates going forward and to recognize that poll-based estimates are NOT “sampling votes,” but instead doing something that as a psychologist I recognize to be far more uncertain: sampling one predictor of individuals’ future behavior. In fairness to 538, the results above are based on the RCP average…Nate Silver’s final nowcast projections actually did somewhat better, though errors were still quite sizeable.

Khan says:

I’m sorry, but you missed the ball entirely.
The actual results mirrored the swing state medians almost exactly.

Joe Allen says:

Not sure where you get this Khan, Obama significantly outperformed the swing state estimates (both medians and means) for the states listed above (see Sam’s ‘geek’s guide’ pdf for the actual estimates).

Vicente says:

PEC’s results were also better – RCP excludes some polls, so it seems weird to use it to drive the point you’re trying to make.
Sam’s medians from the geek’s guide, followed by actuals per Politico, and your discrepancy:
CO: O+2/O+4.7/+2.7
FL: Tie/O+0.6/+0.6
IA: O+2/O+5.6/+3.6
NH: O+3/O+5.8/+2.8
VA: O+2/O+3/+1
NV: O+2.5/O+4.6/+2.1
NC: R+1/R+2.2/+1.2
Leaving aside Florida (since it was tied in the estimate – and effectively was nearly tied in reality), that seems to leave only Colorado and Iowa in the camp of states where things would have flipped if the error hadn’t been in favor of Obama. That said, North Carolina would have flipped as well if the error had been in favor of Obama there, so 2 in one direction and 1 in the other.
Not sure how to interpret that, but I’m not sure it makes for a compelling argument that the state polls badly missed and dodged the bullet only because the error was in Obama’s favor.

Some Body says:

@Vicente — But the discrepancy does show there was room for systematic error in the polls (looking at the numbers, it might have been chiefly the ground game of the last days that explains it, assuming NC was given up on in the end). Which is an argument for keeping more room for uncertainty in the prediction (a la Silver and Jackman’s win probabilities).

joeljcj2 says:

i think the argument being made is that if a projection favors O by 2%, and O wins by 5% , the 3% error in the projection is so large that had it gone the other way the result would have been wrong and the predictor appear wrong. the argument that since the errors all went in the same direction of underpredicting , the error is hidden by the result.—this sort of makes sense if the goal is proper prediction of the size of the victory, but given that just the binary question of who wins is interesting, underprediction is not a sin. if we bet on ali to win, we don’t care if he wins in the 6th round or the 9th. ( only gamblers playing point spreads , etc. are concerned, and of course ali who must fight 3 more rounds )

Sam Wang says:

My goal has largely been to anticipate outcomes with practical governing consequences: Electoral College, Senate, House. Along the way, other quantities fall out, such as popular vote margin and state win probabilities.
State margins are of interest mainly insofar as they gauge win probabilities. Thus I have been more interested in Z-scores, i.e. Z=margin/SEM. This is why I integrated over longer time in the home stretch: to reduce SEM and thus maximize Z. For margins, the quantities in the Power of Your Vote might be better.
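A rough sketch of that tradeoff, with illustrative numbers rather than the actual PEC estimator:

    from math import sqrt

    margin = 2.0        # hypothetical median margin, in percentage points
    poll_spread = 3.0   # hypothetical spread (sigma) of individual poll margins

    # Pooling more polls shrinks the estimated SEM roughly as 1/sqrt(N),
    # so the same margin yields a larger Z = margin / SEM.
    for n_polls in (3, 6, 12):
        sem = poll_spread / sqrt(n_polls)
        print(n_polls, round(margin / sem, 2))   # Z ~ 1.15, 1.63, 2.31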

wheelers cat says:

kindof OT, but I find this fascinating.
More on the ORCA debacle.
http://andrewsullivan.thedailybeast.com/2012/11/romneys-bumbling-bureaucratic-campaign.html
What the right needs is more nerds, not more hispanics. But 94% of scientists are NOT REPUBLICAN.
Its unlikely hispanics are going to turn right anyways…at least not in time.
http://andrewsullivan.thedailybeast.com/2012/11/righting-themselves-on-immigration.html
And I doubt the base will buy into it.
Look at birtherism. The GOP elites made a full court press to strike down birtherism, and its more prevalent than ever.
conservative backfire effect in action.

robotslave says:

But “the right” already has plenty of nerds; there is a well-known libertarian streak in the ranks of programmers and other software professionals.
The failures of the ORCA system shouldn’t be read as a failure of the GOP to attract top talent; rather, it should be read as an unfortunate but mundane outcome in a large software project.
To this day, most such projects fail, particularly on the first iteration, regardless of the political makeup of the programmers (and there is generally quite a lot in such failures that can’t be attributed to the programmers at all; management and product design contribute significantly).
This isn’t news about politics, this is merely an embarrassing but unsurprising outcome in the industry of software production.
People do learn from failure, though, so it would be an enormous mistake to assume that the GOP software toolkit will collapse, fall short, or otherwise fail in the next election cycle.

538 Refugee says:

Let me add a little more ‘beef’ to my problem with 538. He is boasting about correctly making 51/51 electoral calls, but at the top of the page his “product” still shows a forecast of 313 electoral votes? That was always an unlikely total. At least it finally dropped the fractional part. 😉

JLH says:

The 538 page still shows 313.0 for me. But that’s an average over the course of many simulations. If you look at the probabilities you get 332 electoral votes as the most probable outcome, or if you add up each state call you get to 332. That’s why people refer to his forecast as being 332.
I do think the 51/51 is kind of silly (although I haven’t heard Silver or many others claim otherwise) because of Florida. It seems if you had Florida close then it shouldn’t matter if you picked heads or tails. But I haven’t seen 538 (or Drew Linzer or Simon Jackman) claim their models were better because they won the coin toss, so I’m not sure there’s any need for beef.

Ryan says:

313 is a weighted average. His Mode prediction (most likely case) was 332.

Sam Wang says:

and here 312 was the median.

Bruce Wayne says:

I don’t get your “beef”.
The 313 electoral votes you’re referring to is the MEAN electoral vote count for Obama from all of the 538 model runs. That wasn’t a prediction for Obama’s final tally. To get that, you have to look at 538’s plot of the electoral vote distribution produced by the model. The largest spike, representing the most frequent outcome, was at 332 for Obama. The second largest spike was at 303. So, in other words, the 538 model was split between awarding FL to Romney or Obama.
Also, Sam Wang posted a near-exact prediction. At the top of this webpage it says Obama 312, Romney 226. I’m guessing you somehow missed that. So why don’t you have a “beef” with PEC?

538 Refugee says:

Batman. Nate never owned the 332 figure as his final prediction. Yes his map has it if you add it up yourself but what number do you see when you go to his site? Did the Times make him leave it ambiguous? I just double checked. I see no final prediction thread. Don’t get me wrong, I have a lot of respect for Nate and what he does. I am just a little disappointed in what I see now that he is with the Times. I think he has lost a little of that independent edge.

azoomer says:

There would always be at least two predictions. The mean and the mode.
Ns mean =313, mode =332
Sw mean =312, mode =303
Don’t take it against silver that he’s displaying the mean more prominently. He got the mode right, and for most of the media, that’s what mattered.
Similarly, sw also highlights the mean instead of the mode. However, if Florida had gone for R, we’d all be saying that sw “got” 51/51. And someone would call out the 312 forecasted ev.

BNW says:

Wang and Silver (and Jackman, and probably others) all predicted that FL was the closest state, that the outcome was razor thin, and that it was so close as to be a coin toss in terms of making a prediction beforehand.
As for Silver’s final forecast, he wrote: “In the final pre-election forecast at FiveThirtyEight, the state of Florida was exceptionally close. Officially, Mr. Obama was projected to win 49.79 percent of the vote there, and Mr. Romney 49.77 percent, a difference of two-hundredths of a percentage point.”

Ken Dogson says:

See you in four years.

Some Body says:

You’re missing out on all the interesting stuff in between!

JLH says:

I think Vasyl and Jason touched on this in earlier comments, but I’m curious: should we really expect that if a model gave 5 states an 80% probability of being won by a candidate, only 4 of the 5 should be won by that candidate for the probability to be accurate?
These results are not independent of each other. If a forecast based on polling is off for one state, it is likely to be off by a similar amount in another state. The errors should be correlated. It doesn’t make sense to me to say that getting 50 out of 51 means the average probability should be .98, when those 51 outcomes are related to each other.
I would think the probabilities should be tested over many elections, rather than many states in the same election. So if you have five states at 80% probability over five elections, then I would expect to see the number correct at something like 5, 4, 5, 1, 5 rather than 4, 3, 4, 5, 4.
Is there a way to “score” the probabilities provided in this election on their own? I’m not sure there is, since even the Senate contests are polled by many of the same companies and so any poll bias would permeate through those races as well. I think it would be better to leave the question of how good the probability estimates were until after these models are used over more independent elections.
So we can see which model had the best accuracy this time, but I don’t think the question of which model is best overall can really be answered right now.
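(As a toy illustration of the difference, assuming five states each forecast at 80%, with either fully independent errors or a single fully shared error; both cases have the same expected number of correct calls, but very different distributions:)

    import random

    def correct_independent(n=5, p=0.8):
        return sum(random.random() < p for _ in range(n))

    def correct_fully_correlated(n=5, p=0.8):
        return n if random.random() < p else 0   # one shared draw decides every state

    trials = 100_000
    ind = [correct_independent() for _ in range(trials)]
    cor = [correct_fully_correlated() for _ in range(trials)]
    print(sum(ind) / trials, sum(cor) / trials)   # both average about 4 correct,
    # but the correlated case is always 5-of-5 or 0-of-5 in this toy model, never 4-of-5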

Sam Wang says:

The statement about independence is incorrect. Outcomes may change in the future, and if there is systematic error then it would be symmetric. Therefore the expectation value is the same.
I am very interested in the fact that this error persists among my readers.

JLH says:

Thanks for the response, Sam. I admit I’m not a regular reader, so perhaps you’ve addressed this in the past. Are you saying that the results of each state vote relative to the polls are not correlated at all? (Forgive me if I’m not using the right terminology, hopefully what I am saying is clear anyway.) In other words, would we not expect that if the polls were wrong by a point towards Obama in Ohio that they very well might also be wrong by a point towards Obama in Pennsylvania and Indiana as well?
Why would systematic error be symmetric? Of course it’s possible that it would be symmetric, especially if it was based on sampling error, but if the polling errors came from decisions about how to model the likely voter universe then it could just as easily lead to a consistent bias towards one candidate. Right? This seems obvious to me but I might be missing something.
I’d appreciate it if you or others could explain what I’m not seeing. Thanks!

Some Body says:

@Sam — The symmetry wouldn’t be felt that much in the Pres. results this year, though. You had many states with a small Obama lead, but only one (NC) with a small Romney lead.

538 Refugee says:

I guess this is what Nate Silver does with his state fundamentals weighting? It ignores what the campaign may be doing in any individual state though. I don’t watch much TV but my wife does so I did see some of the media blitz. What happened in Ohio didn’t affect PA much in that regard. What about the ground game? Or lack thereof in terms of beached whales?

Lenny says:

Glad that you are looking at this – I’m hoping Nate gets around to analyzing the strengths and weaknesses of his model vs. other possibilities in public way sometime soon.
Though it seems clear that the state polls were the best way to analyze the close Senate races at the last minute (which is, as you said, not really the gold standard for how great a prediction is), I’d like to see the following stuff before dismissing Nate’s “extra layers” completely:
1) What about the states that weren’t as close and were thus more sparsely polled? How accurate were the “fundamentals” there? If they were also pretty bad, then maybe Nate’s model just had a poor method of including fundamentals – were there any other models that had a better one?
2) The long-term predictions. My initial take on your numbers here is really that Nate wasn’t aggressive enough in removing the effect of state fundamentals from his “now-cast” – that, like the economics info, that information is already priced into the polls by the time the election happens.

Matt McIrvin says:

Next question: long-term predictions, or: Was Drew Linzer just lucky?

Stephanie Jones says:

Thank you for all the insightful (and accurate) analysis leading up to and now after the election. Could you comment on the Bayesian analyses and how those compared with the more traditional analyses? By the way, I enjoy 538 also, so it is interesting to see the discussions of the relative differences. Thanks again — and, please (if possible), keep it up! I look forward to seeing different issues (other than the election) covered.

Bill Bethke says:

My understanding of the 538 model is that all the “fundamentals” drop out before the final prediction (on the basis that by then they are fully baked into the polls). If that’s correct (and maybe it’s not), then are national polls the only source of discrepancy? And if that’s correct, could one argue the national polls may be a reasonable source of data in the Presidential race BUT also a source of “noise” (to use a Nate Silver word) in Senate projections?
And, last, maybe there’s value in looking at this two or three ways in order to see different things. I’m not sure I see PEC and 538 (and perhaps others) as an either/or proposition.

Bruce Wayne says:

538 seems to be a lot more generous when allocating uncertainty. Days before the election, 538 gave Romney an 18% chance of winning, based largely on the possibility of all the polls being systematically biased against Romney. PEC, on the other hand, largely discounted this possibility and gave Romney a 1-2% chance of winning.
We still have no way of knowing which approach is better. If 538’s approach is right, then sooner rather than later we will have an election season where all the polls are systematically biased against one candidate. If that does happen, a model purely driven by polls, such as PEC’s, will be at a huge disadvantage and will likely produce large misses.
However, if such a scenario is truly unlikely, then 538 will consistently beat its own probabilities. In other words, if 538 predicts an 80% chance of winning when it’s actually 99%, they will always make the right call.
Even then, I’m not sure if this counts as a feature or a flaw. 538 has a much wider target audience, many of whom are laymen who don’t appreciate the fact that things will occur despite the odds. If Nate Silver actually got 1 in 5 of his 80% predictions wrong, people would be less inclined to trust him (when in reality, they should trust him more). So, Silver may have an incentive to downplay his own confidence.

Allan Rosenberg says:

Bruce, if someone consistently gets 80% of his predictions wrong, that is just as useful as if he/she consistently gets 80% of his predictions right. You just have to reverse whatever predictions he/she makes.

Dan B says:

I’m pretty sure the reason you got the North Dakota senate race right and Nate Silver didn’t is that you included the Pharos Research polls (the only ones which showed her ahead) and he didn’t. Differences in methodology are in the noise.

Sam Wang says:

There have been polls showing her ahead all season. Those are the recent ones.

Allan Rosenberg says:

Sam, do you have any thoughts on how to compare the accuracy or utility of pre-election night polls? The best I’ve come up with is to pretend that each is a prediction market, then see whether you could have used either to make money in the other.

Ross C says:

There is much work to be done in the public/pundit sphere. Tonight, on “Inside Washington,” Charles Krauthammer insisted that Hurricane Sandy stopped Romney’s momentum. He stated that before the hurricane, Romney was up 5-6 points in the polls. I know, consider the source, but still.

Dean says:

Nate Silver and Dr. Wang are on the right side of facts and history. There is certainly room for debate and introspection about how each model works, and of course comparison and criticism, but look at what happened on the other side, in the fact-challenged “echo chamber” of the right wing. Right wing poll “analysts” fooled themselves and ultimately Romney, who believed his own polls and thought he had Ro-mentum.

Alex says:

I have found the site and your approach interesting this election season, notwithstanding the slight obsession with claiming superiority over FiveThirtyEight. Somehow your attempts to tone it down have made it no less obvious or irritating.
I strongly suspect that Florida would not have been so much of a “coin toss” if you had gotten it right, and that this blog post would have been considerably shorter for lack of needing to demonstrate why you’re superior despite having presented the inferior prediction.

Sam Wang says:

I can think of one way you can avoid irritation! Seriously, I agree this is a parlor trick (though I don’t agree about the bottom line).
It is more important to focus on governance and practical outcomes: where to put effort, specific downticket races, Senate/House races, and long-term prediction. Thus my obsession with the House, ungratifying for the lack of precision but so important. Also, thus the focus on ActBlue/CrossroadsGPS.
Now get off my lawn, ya no good kid!

wheelers cat says:

Alex this shouldn’t have to be addressed twice in the same post, but this is Just What Science Does.
It drives out the best explanation, the best model.
And the nerdcore loves to pick apart the mechanism and dwell lovingly over the detail. Its so we can improve the models for the next election, its a process.
Now if thats not your taste, we totes understand.
So get off Dr. Wang’s lawn and go cultivate your sour grapes elsewhere.

538 Refugee says:

Nate weighs economic factors into his prediction. Look at the model in Colorado that used ONLY economic factors. Complete whiff, even though the model supposedly was 100% for past elections. Registered voters, and even more so likely voters, are probably those most likely to be paying attention to the news on a daily basis, so those factors are probably already ‘cooked’ into the polls.

Hcareney says:

Dr.Wang: I just wanted to thank you for your statistical model, your insightful comments and for being a touchstone in a world of hype. I’ve been following your website for the last three months and, during those days when I gave into the roar of an imminent Romney victory, I would turn to your website and find solace in actual numbers. Thanks so much for your work. (and thanks to Andrew and everyone who wrote comments, which I read and learned from.)

RJB says:

Sam,
I’m hoping you’ll write a post clarifying something I’ve been puzzling over. How should we assess the reliability of your (or Nate Silver’s) performance in this election?
The Brier scores make sense as a measure of an aggregator’s performance in this election, but how should we think of the sample size? I hear many people emphasizing the number of races called correctly, but those are hardly statistically independent results. As I understand the meta-margin approach, you are assuming it makes sense to think of there being a single underlying variable that must be added to all race results, and then ask how extreme its realization would need to be in order to make the race a dead heat. That would mean that all of the races we observed this week still gave us only a single observation on the accuracy of the aggregator. So why should I believe that you or Nate Silver will do well next time around? After all, if that single unobservable variable had been different, it would have changed all of the race outcomes!
Bottom line: if you were creating a model to project the accuracy of your model and Nate Silver’s model in the next election, what would that model look like, and how wide would its uncertainty bars be? Exactly the same as the uncertainty bars on the morning of this past election (because they were then reflecting only the single unobservable variable)? Would the number of races in this election have any effect on the uncertainty bars?

Sam Wang says:

State polls were unbiased in 2008 and 2004 to <1%, and probably for 2000 as well. It requires some digging on the site (Ryan Lizza at TNR collected that). I would say that is n=150, or n=30 if you only count swing states. 
There could be future failure, but this involves a failure of the polling profession. At this point my polls-only aggregation approach is secure – and in any event transparent and lacking an easy opportunity for much adjustment. 

Aaron Andalman says:

Perhaps a simple likelihood calculation would be a better measure for comparing model quality than the Brier score, i.e. what is the probability that the final vote percentages for each state would be drawn from your model’s estimated vote distributions?
While the Brier score is based on election outcomes, it fails to capitalize on lots of information relevant to assessing model quality because it only considers the discretized outcomes.

zenger says:

Popular vote margin is now 2.7 percentage points.
http://politicalwire.com/archives/2012/11/10/obama_formally_wins_florida.html
Seems I’ve seen that figure somewhere before.

zenger says:

Ah, yes.
ELECTORAL PREDICTION Popular Vote Meta-Margin Obama +2.76%.

gwern says:

> A more interesting case is the Senate, where the 10 closest races had these probabilities:
What are your respective Brier scores when doing all ~30 states? Is there a list of all the win % and Senate margins?
(Previously asked on Twitter: http://twitter.com/gwern/status/267093264073621504 )

Sam Wang says:

For all nonlisted races my implicit statement was that they were certain. I direct activist interest (on both sides) based on knife-edginess. I regarded them as 95% bets. This may not be enough for your purposes.

gwern says:

95% bets? Well, OK, I’m fine with listing them all as 0.95 in the relevant direction if you are.

bsk says:

Any chance of seeing your RMSE?

gwern says:

bsk, I’ve taken the liberty of calculating a number of RMSEs myself in http://appliedrationality.org/2012/11/09/was-nate-silver-the-most-accurate-2012-election-pundit/ / https://docs.google.com/document/d/1Rnmx8UZAe25YdxkVQbIVwBI0M-e6VARrjb0KdgMEVhk/edit
For Wang & Ferguson’s 2012 Presidential state margins, I got an RMSE of 2.777933. Including the electoral & popular vote, the RMSE is 2.725015.

Brash Equilibrium says:

You’re right. Differences between the predictive ability of poll aggregators on the eve of the election do not enlighten us about the relative predictive power of the models. Better to calculate Brier scores and RMSEs on the full history of predictions for each state, and also calculate Brier scores on the full history of electoral vote win probabilities. I’ve written an article at Malark-O-Meter arguing that we should construct model weights based on Brier scores and RMSEs, and then aggregate the aggregates proportional to their relative predictive power.
http://www.malarkometer.org/1/post/2012/11/some-modest-proposals-for-weighting-election-prediction-models-when-model-averaging.html#.UJ-8huQ0WSo
One last thing. If you look at electoral vote prediction alone, Drew Linzer basically beats everyone else in terms of long term predictive ability. I predict that he may have done the same on other measures as well.

TAW says:

Republicans – Worst Cycle Ever for Polling
http://www.politico.com/blogs/burns-haberman/2012/11/republicans-worst-cycle-ever-for-polling-149229.html
Interesting that their internal polling was so different and so much worse than both Public and Democratic polling.
I would suggest that this is partially a function of their deficits in technology vis a vis the Democrats.
It also seems like it would be wise to separate pure forecasting from ‘aspirational’ management information systems and data.
In the private sector, business plans are generally targets or goals and specifically understood *not* to be forecasts. And firms that confuse them tend to pay a price for that ambiguity.

Alan Houston says:

I had predicted the President to win Virginia, Colorado, and Florida. I computed how many days he led from May 1 to Nov 5 in state polls, the November 5th polls, the November 1 to 5 trend, and the 2008 margin. My model had the President winning Florida by 50,000 votes.
I emailed those results to friends on Sunday. Was Karl Rove REALLY surprised?

538 Refugee says:

“Was Karl Rove REALLY surprised?”
Cherry-picking data seemed to work well for the GWB administration so, yeah, he was blindsided. I seem to vaguely remember something about some yellowcake uranium, or was it a bridge in Brooklyn? Whichever, we ended up in another war. Point is, you are talking about someone who believes they can do no wrong. Literally.

emory mayne says:

Oh Dear!
A couple of comments here.
1. Competition does not make a prediction better. Informational analysis makes a prediction closer to a future outcome (sorry, free marketeers).
2. Benchmarking 538. Why would you not benchmark 538? Silver has done an outstanding job, and has become a pundit whipping boy in the process.
3. Professional pride in beating the prediction of a high benchmark is rightful cause for celebration. PEC outperformed 538 in this cycle – something about numbers.
4. Courage, yes a little courage to put yourself out there, with no absolute guarantee you have done absolutely everything right. Yep, it takes a little of that too. Is Dr. Wang brave, or does he just lack imagination? Both PEC and 538 could be toast right now.
Thanks

Gary says:

Agreed with emory mayne – chasing someone else’s “good” path rather than The Truth leads to a phenomenon known variously as “target fixation”, “groupthink”, and “lemming boy”.
Wang seems to be on his own path informationally, though by both his and Silver’s account either of the two of them could’ve been the “winner”. Kudos, Wang, for trying to speak to the few people who can understand that you were at least as right as he was.
Now how to get that pesky press to understand without billions of advertising dollars?

Thoughtful says:

I have to say, living in London, where there were at least a dozen major bookmakers running a book on every state, every race, etc., the odds were never better than 1/3 on Obama for the national EV and, for example, 2/7 on Iowa; I did get 6/4 on Obama winning Florida, but that was even money on the day. Nevada 1/10. EV 1/5 was best on the day. Countless examples. Yes, very profitable and tax free! How is it that the GOP was completely oblivious to the world’s bookmakers agreeing with Sam, Nate and Simon? Even Intrade never gave Romney better than a 40/60 chance. Dogma and delusion seem to be endemic in the GOP.

Ned says:

I am perplexed. Nate Silver and PEC consistently claimed in the days leading up to the election that the likelihood of an Obama victory was considerably higher than the likelihood implied in the Intrade betting line. If professional pollsters were so confident, why not put money down — indeed, a good deal of money, which presumably would have brought the betting line closer to the predictions of the modelers? Indeed, several days before the election PEC was claiming that the likelihood of an Obama victory was well above 90 percent, whereas Intrade was mostly in the low 60s. That is a very big margin, as any regular gambler could tell you. So how to explain the variance – and in particular, why aren’t the polling pros in a position to make some serious money here?

Sean says:

I found it particularly gratifying to watch Karl crash and burn, and waste so much money unsuccessfully. Good times.
