Tarnished Silver: Assessing the new king of stats - Macleans.ca

Nate Silver’s attackers don’t know what they’re talking about. (Nor do his defenders.)


The whole world is suddenly talking about election pundit Nate Silver, and as a longtime heckler of Silver I find myself at a bit of a loss. These days, Silver is saying all the right things about statistical methodology and epistemological humility; he has written what looks like a very solid popular book about statistical forecasting; he has copped to being somewhat uncomfortable with his status as an all-seeing political guru, which tends to defuse efforts to make a nickname like “Mr. Overrated” stick; and he has, by challenging a blowhard to a cash bet, also damaged one of my major criticisms of his probabilistic presidential-election forecasts. That last move even earned Silver some prissy, ill-founded criticism from the public editor of the New York Times, which could hardly be better calculated to make me appreciate the man more.

The situation is that many of Nate Silver’s attackers don’t really know what the hell they are talking about. Unfortunately, this gives them something in common with many of Nate Silver’s defenders, who greet any objection to his standing or methods with cries of “Are you against SCIENCE? Are you against MAAATH?” If science and math are things you do appreciate and favour, I would ask you to resist the temptation to embody them in some particular person. Silver has had more than enough embarrassing faceplants in his life as an analyst that this should be obvious.

But, then, the defence proffered by the Silverbacks is generally a bit circular: if you challenge Silver’s method they shout about his record, and if you challenge his record they fall back on “Science is always provisional! It proceeds by guesswork and trial-and-error!” The result is that it doesn’t matter how far or how often wrong Silver has actually been—or whether he adds any meaningful information to the public stockpile when he does get things right. He can’t possibly lose any argument, because his heart appears to be in the right place and he talks a good game.

Both those things count. Silver is a terrific advocate for statistical literacy. But it is curious how often he seems to have failed upward almost inadvertently. Even this magazine’s coverage of Silver mentions the means by which he first gained public notice: his ostensibly successful background as a forecaster for the Baseball Prospectus website and publishing house.

Silver built a system for projecting future player performance called PECOTA—a glutinous mass of Excel formulas that claimed to offer the best possible guess as to how, say, Adam Dunn will hit next year. PECOTA, whose contents were proprietary and secret and which was a major selling point for BPro, quickly became an industry standard for bettors and fantasy-baseball players because of its claimed empirical basis. Unlike other projection systems, it would specifically compare Adam Dunn (and every other player) to similar players in the past who had been at the same age and had roughly the same statistical profile.

For most players in most years, Silver’s PECOTA worked pretty well. But the world of baseball research, like the world of political psephology, does have its cranky internet termites. They pointed out that PECOTA seemed to blunder when presented with unique players who lack historical comparators, particularly singles-hitting Japanese weirdo Ichiro Suzuki. More importantly, PECOTA produced reasonable predictions, but they were only marginally better than those generated by extremely simple models anyone could build. The baseball analyst known as “Tom Tango” (a mystery man I once profiled for Maclean’s, if you can call it a profile) created a baseline for projection systems that he named the “Marcels” after the monkey on the TV show Friends—the idea being that you must beat the Marcels, year-in and year-out, to prove you actually know more than a monkey. PECOTA didn’t offer much of an upgrade on the Marcels—sometimes none at all.
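To make the idea of a "monkey" baseline concrete, here is a minimal sketch of the kind of extremely simple model a projection system should have to beat: a weighted average of a player’s last three seasons, pulled toward the league average. It is not Tango’s published method; the function name, the 5/4/3 weights, the 1,200 plate appearances of regression, and every number in the usage lines are assumptions chosen for illustration.

```python
def baseline_projection(seasons, league_rate, weights=(5, 4, 3), regress_pa=1200):
    """Project a rate stat (e.g. hits per plate appearance) for next season.

    seasons     -- list of (events, plate_appearances), most recent season first
    league_rate -- league-average rate for the same stat
    weights     -- illustrative recency weights (an assumption, not a spec)
    regress_pa  -- illustrative amount of league-average performance mixed in
    """
    weighted_events = 0.0
    weighted_pa = 0.0
    for (events, pa), w in zip(seasons, weights):
        weighted_events += w * events
        weighted_pa += w * pa
    # Regression to the mean: add regress_pa plate appearances at the league rate.
    return (weighted_events + regress_pa * league_rate) / (weighted_pa + regress_pa)


def mean_abs_error(projected, actual):
    """Average absolute miss across players; lower is better."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)


# Hypothetical player: three seasons of (hits, plate appearances), newest first.
hypothetical_player = [(150, 600), (140, 600), (160, 600)]
print(round(baseline_projection(hypothetical_player, league_rate=0.25), 4))
```

The test for a fancier system is then simple to state: run both over the same players and seasons, and see whether its `mean_abs_error` is reliably lower than the baseline’s.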

PECOTA came under added scrutiny in 2009, when it offered an outrageously high forecast—one that was derided immediately, even as people waited in fear and curiosity to see if it would pan out—for Baltimore Orioles rookie catcher Matt Wieters. Wieters did have a decent first year, but he has not, as PECOTA implied he would, rolled over the American League like the Kwantung Army sweeping Manchuria. By the time of the Wieters Affair, Silver had departed Baseball Prospectus for psephological godhood, ultimately leaving his proprietary model behind in the hands of a friendly skeptic, Colin Wyers, who was hired by BPro. In a series of 2010 posts by Wyers and others called “Reintroducing PECOTA”—though it could reasonably have been entitled “Why We Have To Bulldoze This Pigsty And Rebuild It From Scratch”—one can read between the lines. Or, hell, just read the lines.

Behind the scenes, the PECOTA process has always been like Von Hayes: large, complex, and full of creaky interactions and pinch points… The numbers crunching for PECOTA ended up taking weeks upon weeks every year, making for a frustrating delay for both authors of the Baseball Prospectus annual and fantasy baseball players nationwide. Bottlenecks where an individual was working furiously on one part of the process while everyone else was stuck waiting for them were not uncommon. To make matters worse, we were dealing with multiple sets of numbers.

…Like a Bizarro-world subway system where texting while drunk is mandatory for on-duty drivers, there were many possible points of derailment, and diagnosing problems across a set of busy people in different time zones often took longer than it should have. But we plowed along with the system with few changes despite its obvious drawbacks; Nate knew the ins and outs of it, in the end it produced results, and rebuilding the thing sensibly would be a huge undertaking. We knew that we weren’t adequately prepared in the event that Nate got hit by a bus, but such is the plight of the small partnership.

…As the season progressed, we had some of our top men—not in the Raiders of the Lost Ark meaning of the term—look at the spreadsheet to see how we could wring the intellectual property out of it and chuck what was left. But in addition to the copious lack of documentation, the measurables from the latest version of the spreadsheet I’ve got include nice round numbers like 26 worksheets, 532 variables, and a 103 MB file size. The file takes two and a half minutes to open on this computer, a fairly modern laptop. The file takes 30 seconds to close on this computer. …We’ve continued to push out PECOTA updates throughout the 2010 season, but we haven’t been happy with their presentation or documentation, and it’s become clear to everyone that it’s time to fix the problem once and for all.

For the record, the Wieters Bug turned out to be a problem highly specific to Wieters; in Silver’s “copiously undocumented” rat’s nest of a model, there was a blip in the coefficients for the two different minor leagues in which Wieters had played in 2008, and BPro did not have time to ransack the spreadsheets looking for the possible error. The Ichiro Problem, by contrast, is intractable by ordinary statistical means; there are just a few players who are so unusual that a forecaster is as well off, or better off, falling back on intuition and first-principles reasoning. (Unless, that is, he has better data. Today’s PECOTA is able to break batting average into finer-grained statistical components in the hope of detecting Ichiros more perceptively.)

If the history of Silver’s PECOTA is new to you, and you’re shocked by brutal phrases like “wring the intellectual property out of it and chuck what was left”, you should now have the sense to look slightly askance at the New PECOTA, i.e., Silver’s presidential-election model. When it comes to prestige, it stands about where PECOTA was in 2006. Like PECOTA, it has a plethora of vulnerable moving parts. Like PECOTA, it is proprietary and irreproducible. That last feature makes it unwise to use Silver’s model as a straw stand-in for “science”, as if the model had been fully specified in a peer-reviewed journal.

Silver has said a lot about the model’s theoretical underpinnings, and what he has said is all ostensibly convincing. The polling numbers he uses as inputs are available for scrutiny, if (but only if) you’re on his list of pollsters. The weights he assigns to various polling firms, and the generating model for those weights, are public. But that still leaves most of the model somewhat obscure, and without a long series of tests—i.e., U.S. elections—we don’t really know that Nate is not pulling the numbers out of the mathematical equivalent of a goat’s bum.

Unfortunately, the most useful practical tests must necessarily come by means of structurally unusual presidential elections. The one scheduled for Tuesday won’t tell us much, since Silver gives both major-party candidates a reasonable chance of victory and there is no Ross Perot-type third-party gunslinger or other foreseeable anomaly to put desirable stress on his model. Silver defended his probabilistic estimate of the horserace this week by pointing out that other estimates, some based on simpler models and some based on betting markets, largely agree with his.

This is true, and it leaves us with only the question of what information Silver’s model may actually be adding to the field of alternatives. The answer could conceivably be “Less than none”, if his model (or his style of model-building) is inherently prone to getting the easy calls right and blowing up completely in the face of more difficult ones. (Taraji P. Henson Alert!) It is worth pointing out that a couple of statisticians have given us a potential presidential equivalent of the Marcels—a super-simple model that nailed the electoral vote the last two times (and that actually is fully specified).

It is also worth pointing out that Silver built a forecasting model for the 2010 UK election, which did turn out to be structurally unusual because of the strong Lib Dem/Nick Clegg performance. Silver got into squabbles with British analysts whose models were too simple for his liking, and the whole affair was an exemplar of what Silver’s biggest fans imagine his role to be: the empiricist hard man, crashing in on the psephological old boys’ club and delivering two-fisted blasts of rugged science. It did not go well in the end, as his site’s liveblog of the returns records:

10:00 PM (BST). BBC exit poll predicts Conservatives 307, Labour 255, LibDems 59.

10:01 PM (BST). That would actually be a DROP for Lib Dems from the last election.

10:02 PM (BST). BBC nerd says: “The exit polls are based on uniform behavior”, a.k.a. uniform swing. So we haven’t really learned anything about whether uniform swing is the right approach; it’s baked into the projection.

10:07 PM (BST). We would obviously project a more favorable result than just 307 seats for Conservatives on those numbers. Calculating now.

10:11 PM (BST). If the exit polls are right but the seat projections are based on uniform swing, we would show a Conservative majority on those numbers.

10:13 PM (BST). Here is what our model would project… [Cons 341, Lab 219, Lib Dem 62]

The final result? Conservatives 306, Labour 258, Liberal Democrats 57. The BBC’s projection from exit polls, using simple uniform-swing assumptions to forecast the outcome of a very wrinkly three-sided race, was so accurate as to be almost suspicious. And how was Silver’s performance after being basically given the national vote shares for the parties? Perhaps it’s best to draw the veil of charity over that.

Which, in fact, seems to be what has happened. Lucky thing for Silver’s reputation!—but then, he has always been lucky.
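For anyone who would rather lift the veil, the 2010 UK comparison can be scored with a few lines of arithmetic. The sketch below uses only the seat totals quoted in the liveblog and the final result above. Summing the absolute seat misses across the three parties is my choice of yardstick, not one either forecaster used, and the variable and function names are hypothetical.

```python
# Seat totals as quoted above (Conservatives, Labour, Liberal Democrats).
FINAL_RESULT = {"Con": 306, "Lab": 258, "LD": 57}

FORECASTS = {
    "BBC exit poll (uniform swing)": {"Con": 307, "Lab": 255, "LD": 59},
    "Silver's model": {"Con": 341, "Lab": 219, "LD": 62},
}


def total_seat_error(forecast, actual):
    """Sum of absolute seat misses across parties (an illustrative metric)."""
    return sum(abs(forecast[party] - actual[party]) for party in actual)


for name, forecast in FORECASTS.items():
    print(f"{name}: off by {total_seat_error(forecast, FINAL_RESULT)} seats in total")
```

By that measure the exit-poll projection misses by 6 seats in total and Silver’s by 79, most of it from overshooting the Conservatives by 35 and undershooting Labour by 39.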