We shall not cease from exploration And the end of all our exploring Will be to arrive where we started And know the place for the first time. -T...
Feeding Karl Rove a bug
Early on Election Night, the New Hampshire results made clear that the state polls were on target, just as they were in 2000-2008, and more accurate than national polls. At that point it seemed more interesting to watch Fox News for reactions. At first the network's commentators were confident of a Romney win. As data came in, a funereal air fell over the proceedings. And as is well known by now, Karl Rove became wrapped up in his calculations and had to be called out by Megyn Kelly.
Rove gave every appearance of genuinely believing that Romney would win. Similarly, Team Romney (and many pundits) thought that professional pollsters as a group were off base. This is a case of motivated reasoning: selective questioning of polls that they found disagreeable. It afflicted the whole right-wing media structure.
Do such biases ever help? What about analytical improvements, like the layers added at FiveThirtyEight? Today I report that by a quantitative measure of prediction error, we did as well in Presidential races as Nate Silver, and on close Senate races, we did substantially better – 10 out of 10, compared with his 8 out of 10. Let’s drill into that a little.
For us the keys to success were (a) a high-quality data feed and (b) avoiding the insertion of biases. Indeed, Mark Blumenthal and Andrew Scheinkman at Pollster.com gave us great data. After that we chose a median-based, polls-only approach to minimize pollster biases.
I will be honest and say that an Election Eve test is not very interesting. Long-term predictions are of greater importance – as well as other ways that aggregation adds value, like tracking ups and downs, as we did. By Election Eve, anyone who is looking at the data honestly can figure out what will happen the next day. Still, let us go along with this week’s media frenzy.
First, the obvious: of the 51 races, one was essentially a coin toss – Florida. Nate Silver, Drew Linzer, and Simon Jackman won the coin toss; Scott Dillon and I lost (though I briefly made a good guess). Is there a better way to quantify this?
One way is to look at our final polling margins, compared with returns.
Whenever a candidate led in pre-election polls, he won. This was true even for a margin of Romney +1% (NC). Evidently state polls have a systematic error of less than 1%, as good as 2008! (Also, like 2008, pre-election polls substantially underestimated actual margins, this year by a factor of 0.8 +/- 0.3. Majority-party voters in nonswing states like to vote, or minority-party voters don’t.)
Since Florida was a coin toss, it is better to examine our state win probabilities, as suggested at Science 2.0. The closer the probabilities are to 1.00, the more confident they are. Probability should also measure the true frequency of an event. If I say a probability is 0.80, I expect to be wrong 1 out of 5 times. Our record of 50 out of 51 (counting Florida as a loss) means that our average probability should have been about 0.98. It was 0.97.
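As a quick sanity check on that comparison, here is a minimal sketch of the arithmetic. The variable names are mine, chosen for illustration; the only inputs are the figures quoted above (51 races, 50 correct calls, an average stated probability of 0.97).

```python
# Calibration check using only the figures quoted in the text.
n_races = 51
n_correct = 50            # counting Florida as a loss
mean_stated_prob = 0.97   # average probability reported in the text

# Fraction of races actually called correctly.
observed_hit_rate = n_correct / n_races

# Number of correct calls a calibrated forecaster would expect
# when the average stated probability is 0.97.
expected_correct = mean_stated_prob * n_races

print(f"observed hit rate: {observed_hit_rate:.2f}")
print(f"expected correct calls: {expected_correct:.1f} of {n_races}")
```

If the stated probabilities are well calibrated, the expected number of correct calls should sit close to the observed 50, and it does.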
This can be quantified using the Brier score, as described by Simon Jackman of Pollster.com. This score is the average of the squared deviations from a perfect prediction. For example, if Obama won a race that we said was 90% probable, that’s a score of (1.0-0.9)^2 = 0.01. If we were only 70% sure, the score is (1.0-0.7)^2 = 0.09. The average score for all 51 races is the Brier score. The Brier score rewards being correct, and it rewards high confidence in correct calls; lower is better, and 0 is a perfect score.
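For readers who want to try this at home, the calculation can be sketched in a few lines of Python. The function name `brier_score` and the two-race input are illustrative assumptions, not our actual forecast data; outcomes are coded as 1 if the event happened and 0 if it did not.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and outcomes.

    probabilities: forecast probability of the event for each race (0.0 to 1.0)
    outcomes:      1 if the event happened in that race, 0 if it did not
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    if not probabilities:
        raise ValueError("need at least one forecast")
    squared_errors = [(o - p) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The two single-race examples from the text: the event happened (outcome = 1)
# after a 90% forecast and after a 70% forecast.
score_90 = brier_score([0.9], [1])
score_70 = brier_score([0.7], [1])

# Averaging over both hypothetical races gives their joint Brier score.
score_both = brier_score([0.9, 0.7], [1, 1])

print(round(score_90, 4), round(score_70, 4), round(score_both, 4))
```

The single-race scores reproduce the 0.01 and 0.09 worked out above, and the two-race average lands halfway between them.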
For the Presidential races, the Brier scores come out to
|Forecast||Presidential Brier score||Normalized Brier|
|100% confidence in every result||0.0000||1.000|
|Princeton Election Consortium||0.0076||0.970|
We appear to be slightly better than our very able colleagues. The additional factors used by the FiveThirtyEight model include national polls and maybe some other parameters. It seems that these parameters did not help.
A more interesting case is the Senate, where the 10 closest races had these probabilities:
|State||538 D win %||PEC D win %|
Note that a number of these races (Indiana, Montana, North Dakota, Virginia) were races I designated as knife-edge at ActBlue.
I have indicated in red the cases where the win probability pointed in the opposite direction from the outcome. These are not exactly errors, but they are mismatched probabilities. The Brier scores come out to
|Forecast||Senate race Brier score||Normalized Brier|
|100% confidence in results||0.000||1.000|
|Princeton Election Consortium||0.039||0.844|
In this case, additional factors used by FiveThirtyEight – “fundamentals” – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available.
Update: I have added a normalized Brier score, defined as 1 - 4 × (Brier score). This is a more intuitive measure. Thanks to Nils Barth. I’ll update this post with more information shortly.
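The normalization is simple enough to sketch in code. The function name `normalized_brier` is my own label for illustration. The factor of 4 makes sense because a forecaster who says 50% for every race earns a Brier score of 0.25, so the normalized score runs from 0 for pure coin-flipping to 1 for perfect, fully confident predictions.

```python
def normalized_brier(brier):
    """Rescale a Brier score so that 1.0 is perfect and 0.0 matches always saying 50%."""
    return 1 - 4 * brier


# The two Princeton Election Consortium Brier scores from the tables in this post.
pec_presidential = normalized_brier(0.0076)
pec_senate = normalized_brier(0.039)

# Reference points: always forecasting 50% gives a Brier score of 0.25,
# and perfect forecasts give 0.
coin_flip = normalized_brier(0.25)
perfect = normalized_brier(0.0)

print(round(pec_presidential, 3), round(pec_senate, 3), coin_flip, perfect)
```

These reproduce the normalized values in the tables (0.970 and 0.844).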