We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.
-T.S. Eliot
A Presidential/downticket prediction challenge
Last month (“Using predictions in the service of ideals and profit,” Sept. 23) I asked what makes a good prediction. I made an analogy to hurricane forecasting. Predictions should:
- Be precise, allowing us to pinpoint a narrow range of outcomes.
- Change relatively little in the long term, giving us time to plan in advance.
- Give a true sense of the uncertainty – a sense of knowing what we don’t know.
Can we use these criteria to evaluate political prediction models?
Here is my first prediction (“A real prediction for November (Take 1),” Aug. 3):
…and here we are today.
The red strike zone, which captures the middle two-thirds of predicted outcomes, has stayed in the same general place:
- August: Obama 285-339 EV (Meta-Margin Obama +3.0 +/- 2.2%).
- Today: Obama 287-327 EV (Obama +2.3 +/- 1.2%).
These numbers have one property of a good prediction: consistency. But are they right? We’ll learn more in a few weeks.
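As an aside for quantitatively-minded readers, here is a minimal sketch of how such a middle-two-thirds band can be extracted from a model's output. PEC's actual snapshot comes from an exact distribution of EV outcomes computed from state polls; the Monte Carlo stand-in below, and its mean and sigma, are invented purely for illustration.

```python
# A minimal sketch: extract the middle two-thirds of a distribution of
# simulated electoral-vote outcomes. The mean (307) and sigma (20) are
# invented placeholders, not PEC's actual calculation.
import random

random.seed(0)
# Hypothetical simulated Obama EV totals (stand-ins for a real model's output).
simulated_ev = sorted(random.gauss(307, 20) for _ in range(10000))

lo = simulated_ev[len(simulated_ev) // 6]       # 1/6 quantile
hi = simulated_ev[5 * len(simulated_ev) // 6]   # 5/6 quantile
print(f"Middle two-thirds band: {lo:.0f}-{hi:.0f} EV")
```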
Today I propose some criteria by which to evaluate predictions. These criteria are applicable to both the Presidential race and (with modification) to downticket races as well.
This year, Presidential political predictions have come in multiple flavors:
- Purely poll-based (Princeton Election Consortium, Electoral-Vote.com),
- Econometric start-of-season models (political scientists),
- Mixed (FiveThirtyEight, Votamatic),
- Wisdom-of-crowds (InTrade), and
- Expert-based evaluations (Charlie Cook, other pundits).
Which of these is best – and what does “best” mean?
Here I propose five benchmarks for these different types of model. After the election, we will score as many of them as we can.
(1) Final EV outcome based on last prediction. For poll-based calculations, this should be the easy one, because Election Eve polls do so well. In past elections, the current PEC algorithm missed by 0 EV (no error) in 2004 and 1 EV in 2008. FiveThirtyEight should do well also.
However, for econometrically-based models, this test is not trivial. One well-established model by Abramowitz predicts a narrow Obama re-election, while Ray Fair’s model is on the fence. Drew Linzer’s Votamatic has stayed consistently near Obama 332 EV. In a newer model that reliably generates headlines, two Colorado political scientists think Romney will get over 330 EV. As the Dire Straits song goes: Two men say they’re Jesus. One of them must be wrong.
The benchmark: EV error.
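For concreteness, the scoring is a single subtraction. In the sketch below the numbers are chosen to illustrate a 1 EV miss of the kind described above for 2008; they are examples, not a claim about any model's actual final figures.

```python
# Benchmark (1): absolute error of a final electoral-vote prediction.
def ev_error(predicted_ev, actual_ev):
    """Absolute EV error of a final prediction."""
    return abs(predicted_ev - actual_ev)

print(ev_error(364, 365))  # illustrative numbers -> a 1 EV miss
```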
(2) Vote share. This is a straightforward tallying, for instance of the two-candidate margin.
The benchmark: Deviations of Election Eve predictions from the actual outcomes in the 50 states and DC.
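A sketch of the tally, with invented margins for three states standing in for the real Election Eve data across all 50 states plus DC:

```python
# Benchmark (2): per-state deviations of the predicted two-candidate
# margin from the actual result. All states and margins below are
# hypothetical placeholders.
predicted_margin = {"OH": 2.0, "FL": -1.0, "VA": 1.5}  # hypothetical, in %
actual_margin = {"OH": 3.0, "FL": 0.9, "VA": 3.9}      # hypothetical, in %

deviations = {s: predicted_margin[s] - actual_margin[s] for s in predicted_margin}
print(deviations)  # -> {'OH': -1.0, 'FL': -1.9, 'VA': -2.4}
```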
(3) Long-term predictive value. Predictions are most useful if they give information far in advance. For example, on July 14th the major aggregators’ snapshots were: Electoral-vote.com Obama 297 EV, FiveThirtyEight 302.5 EV, Princeton 312 EV, and RealClearPolitics 332 EV. If the final outcome is Obama 290 EV, then that day’s best performer would be Electoral-vote.com, with an error of 7 EV.*
The benchmark: Compare final outcome with median predictions from every day of the campaign. Score using the median absolute deviation.
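A sketch of this scoring, using placeholder daily medians and the hypothetical Obama 290 EV outcome from the example above:

```python
# Benchmark (3): score every day's median prediction against the final
# outcome, then summarize with the median absolute deviation. The daily
# numbers are placeholders, not any aggregator's real history.
from statistics import median

daily_ev_medians = [297, 302, 312, 332, 305, 296, 291]  # hypothetical
final_ev = 290  # hypothetical final outcome, as in the example above

errors = [abs(p - final_ev) for p in daily_ev_medians]
print("Median absolute deviation:", median(errors))  # -> 12
```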
(4) Uncertainties. A good prediction has the essential quality of conveying what we don’t know – the known unknowns, as Donald Rumsfeld memorably put it. This is underappreciated. For example, readers have pointed me to a House model at the Monkey Cage. Its uncertainties are large, so it conveys little information about the final outcome. In this respect, a median/mean prediction is not informative on its own.
The benchmark: Compare a model’s stated uncertainties with its actual deviations from the final outcome. In principle, the average error should be about 1 sigma. To be evaluated: Princeton Election Consortium and other models that give uncertainties.
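A sketch of that comparison, with invented prediction/sigma/outcome triples; for an honest model, errors divided by the stated sigma should average near 1:

```python
# Benchmark (4): compare a model's stated uncertainty with its realized
# error. The (prediction, sigma, outcome) triples below are invented.
data = [
    (2.3, 1.2, 3.1),   # predicted margin %, stated sigma %, actual margin %
    (1.0, 2.0, -0.5),
    (4.0, 1.5, 2.8),
]

normalized_errors = [abs(pred - actual) / sigma for pred, sigma, actual in data]
print("Mean error in sigma units:", sum(normalized_errors) / len(normalized_errors))
```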
(5) Accurate probabilities. If a forecaster says the win probability is 80%, in principle he/she is saying that in five such cases, he/she will be right about four times. It is common to be accidentally underconfident. FiveThirtyEight habitually underreports confidence (or overstates uncertainty), as does InTrade.
The benchmark: Evaluation is possible if a forecaster gives win probabilities for many outcomes, for example the 50 states and DC, Senate races, and/or House races. Compare the sum of the stated win probabilities with the number of races correctly called (those assigned >50% probability). If the two are similar, then the probabilities are accurate.
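In code, the check is a short tally. The five races below are hypothetical; each one already lists the probability assigned to the favorite (>50%), so the sum of those probabilities should roughly match the number of favorites who actually won.

```python
# Benchmark (5): a calibration tally over hypothetical races.
races = [
    (0.95, True),   # (probability given to the favorite, favorite won?)
    (0.80, True),
    (0.70, False),
    (0.60, True),
    (0.55, True),
]

expected = sum(p for p, _ in races)           # sum of win probabilities
observed = sum(won for _, won in races)       # correct calls
print(f"Expected {expected:.2f} correct calls, observed {observed}")
```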
*As we collect information from various aggregators, it matters what each one intends its numbers to mean. Many aggregators (ourselves included) give a daily snapshot of the current state of the race. In those cases, benchmarks (1) and (2) should be applied to the last day’s numbers. In other cases, predictions are made explicitly, so comparison (3) is fair. And so on.
Many of the benchmarks above can also be applied, with some adjustment, to Senate and House races. After the election I’ll do my best to collect information on these benchmarks. If anyone cares to help, I’d be delighted.
Please suggest different benchmarks in comments.
Thanks to Andrew Gelman and Inbad Inbad for discussion.