Big Data: Bad science on steroids?

Experts struggle with how to tell the signal from the noise


Jared Zimmerman / Flickr

In case you didn’t realize, we’re living in the era of big data. From sequencing the molecules of human life to divine our futures, to capturing fodder on Twitter to predict disease outbreaks, big data’s potential is massive. It’s the new black gold, after all, and it can cure cancer, transform business and government, foretell political outcomes, even deliver TV shows we didn’t know we wanted.

Or can it? Despite these big promises, the research community isn’t sold. Some say vast data collections—often user-generated and scraped from social media platforms or administrative databases—are not as prophetic or even accurate as they’ve been made out to be. Some big data practices are downright science-ish. Here’s why:

Low signal-noise ratio

Let’s start with genome sequencing, an example of micro-level big data capture. A newly published commentary in the journal Nature argued that, for most common diseases, screening people’s genomes is excessive and ineffective. That’s because, so far, genetic differences have been poor predictors of most diseases in most people. For example, with obesity or type 2 diabetes, many folks who have a genetic variant don’t get the disease or condition, while others do. If we started screening everybody, we’d have too many false positives and false negatives to make such an exercise worthwhile.
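
The arithmetic behind that last claim can be sketched in a few lines. The numbers below (a 10 per cent prevalence, and a variant test that flags 60 per cent of future cases and clears 60 per cent of everyone else) are illustrative assumptions, not figures from the Nature commentary, and the function names are made up for this example.

```python
def screening_outcomes(population, prevalence, sensitivity, specificity):
    """Split a screened population into the four possible outcomes.

    All inputs are hypothetical; none come from the Nature commentary.
    """
    affected = population * prevalence
    unaffected = population - affected
    true_pos = affected * sensitivity      # flagged, and do get the disease
    false_neg = affected - true_pos        # cleared, but get the disease
    true_neg = unaffected * specificity    # cleared, and stay healthy
    false_pos = unaffected - true_neg      # flagged, but stay healthy
    return {
        "true_pos": round(true_pos),
        "false_neg": round(false_neg),
        "false_pos": round(false_pos),
        "true_neg": round(true_neg),
    }


def positive_predictive_value(outcomes):
    """Share of flagged people who actually go on to develop the disease."""
    flagged = outcomes["true_pos"] + outcomes["false_pos"]
    return outcomes["true_pos"] / flagged


outcomes = screening_outcomes(
    10_000, prevalence=0.10, sensitivity=0.60, specificity=0.60
)
print(outcomes)
print(f"{positive_predictive_value(outcomes):.0%} of positive results are correct")
```

With these made-up inputs, 600 of the 4,200 people flagged are true positives, roughly one in seven, and 400 real cases are missed altogether.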

The commentary’s lead author, Dr. Muin Khoury of the Centers for Disease Control and Prevention in the U.S., told Science-ish that the problem with big data genomics is that it’s hard to tell the signal from the noise. In the past, researchers working with smaller data sets would have a relatively modest number of statistically significant associations that were the result of chance. “In the genome era, when people started looking at millions of variants in a data set, they lowered their signal-noise ratio.” In other words, the bigger the data, the more opportunity to find correlations—that may or may not actually tell us something real.
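
Khoury’s point is a multiple-comparisons effect, and a small simulation makes it concrete. The sketch below assumes every variant tested is pure noise, that the tests are independent, and that “significant” means p < 0.05; under those assumptions each p-value is uniformly distributed between 0 and 1. The function names and figures are illustrative, not taken from the commentary.

```python
import random


def expected_chance_hits(num_tests, alpha=0.05):
    """Average number of 'significant' results when no real effect exists."""
    return num_tests * alpha


def chance_of_any_hit(num_tests, alpha=0.05):
    """Probability that at least one pure-noise test looks significant."""
    return 1 - (1 - alpha) ** num_tests


def simulate_chance_hits(num_tests, alpha=0.05, seed=0):
    """Draw one null p-value per test and count how many fall below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_tests))


for num_tests in (20, 1_000, 100_000):
    print(
        num_tests,
        expected_chance_hits(num_tests),
        round(chance_of_any_hit(num_tests), 3),
        simulate_chance_hits(num_tests),
    )
```

Under these assumptions, a study that tests a million noise-only variants should expect about 50,000 “significant” associations by chance alone, which is why a bigger haystack does not make the needles easier to find.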

Nassim Taleb, the author of Antifragile, does a good job of explaining why bogus statistical relationships may be more common in the big data era, pointing out the parallel challenges with observational studies in health research. “In observational studies, statistical relationships are examined on the researcher’s computer. In double-blind cohort experiments, however, information is extracted in a way that mimics real life. The former produces all manner of results that tend to be spurious more than eight times out of 10.”

Hypothesis-free science-ish

Big data also goes hand-in-hand with “hypothesis-free science,” which some say departs from scientific principles. Instead of starting with a hypothesis and working out which data you would need to test it, researchers cast around for associations in data sets that are already available. Dr. Helen Wallace (PhD), director of the genetic science public interest group GeneWatch, put it this way: “If you don’t have a good scientific hypothesis about what causes a disease, you’ll probably end up measuring the wrong data in the first place.”

Screening the genome of everyone at birth, for example, seeks out the genetic basis for disease but doesn’t account for things like environmental exposures or lifestyle factors, which may actually be important predictors of sickness. Wallace likens this to forecasting the weather by only measuring the temperature with a thermometer. “You would miss out on other key weather predictors, like barometric pressure.”

Big quality?

There are also questions about the integrity of big data. A recent article on Google Flu Trends showed that the online tracker massively overestimated the year’s flu season. For Rumi Chunara, a researcher who works on the big data infectious disease surveillance project HealthMap, it comes down to the quality of the data. “Sometimes you can find relationships, things that come up that could be happenstance, and could be confounded by something you’re not paying attention to,” she said.

The Google Flu Trends misfire may have been caused by this year’s hysteria in the media around influenza. People read the headlines about the deadly flu season and hit the Internet for information. Google Flu Trends calculated those extra searches as reflecting actual flu sufferers, when they were actually a big media feedback loop.
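
That feedback loop is easy to reproduce in a toy model. The sketch below is not Google’s algorithm, which this article does not describe; it simply assumes a tracker calibrated at a time when each flu case produced a fixed number of searches, then adds a hypothetical burst of searches from healthy people reacting to headlines. Every name and number in it is invented for illustration.

```python
def estimate_cases(observed_searches, searches_per_case):
    """Naive tracker: assume every flu-related search comes from a sick person."""
    return observed_searches / searches_per_case


# Hypothetical numbers, chosen only to illustrate the mechanism.
SEARCHES_PER_CASE = 2.0   # calibration learned in a quiet season
true_cases = 1_000        # people who actually have the flu
media_searches = 1_500    # healthy people searching after reading headlines

quiet_week = true_cases * SEARCHES_PER_CASE
hyped_week = quiet_week + media_searches

quiet_estimate = estimate_cases(quiet_week, SEARCHES_PER_CASE)
hyped_estimate = estimate_cases(hyped_week, SEARCHES_PER_CASE)

print(quiet_estimate)  # matches true_cases
print(hyped_estimate)  # inflated by media-driven searches
```

In this made-up week the tracker reports 1,750 cases when only 1,000 exist, because it has no way to tell worried readers from sick ones.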

Similarly, HealthMap does not capture every infectious disease around the world—only the ones that are reported in the news media. This limits tracking to events that are picked up by media outlets in the 12 languages HealthMap’s algorithms are designed to detect. So, even though researchers are constantly tweaking their algorithms to limit error, infectious disease trends can be more media construct than reality.

Still, Chunara made a key point: “You have issues with every kind of data set.” That’s certainly true when it comes to official flu-tracking mechanisms. She sees big data as an adjunct to other more traditional approaches, not something that will supplant them. “We’re not going to be able to match some gold standard but we need to think about what aspects these new data sets bring,” she said. “These methods provide information our traditional methods don’t.” And they’ll do it faster.

It’ll take time for the scientific community to recalibrate its methods for the big data era. For now, just be wary of bad science on steroids.

Science-ish is a joint project of Maclean’s, the Medical Post and the McMaster Health Forum. Julia Belluz is the senior editor at the Medical Post. Got a tip? Message her on Twitter @juliaoftoronto

  1. Bad science is one step along the way to better science.

    • And also one step on the way to disaster. As long as we continue to believe that we can “perfect” predictive analytics through better modeling and more data, we’ll continue to set ourselves up for everything from nuclear meltdowns to financial collapses.

      • The 2008 global economic meltdown was predicated on fraud, not bad science. The flaky “Gaussian Copula” formula was merely a tool used by con men to run a pump-and-dump market manipulation scheme. The “masters of the universe” were playing musical chairs, pumping up the value of junk-mortgage-backed securities. When the music stopped they pocketed vast sums of money while average investors and taxpayers were left holding the bag.

        Economics has always been held back from becoming a science because of corrupt vested interests. Hopefully, economists will find a way to weed out the quacks and charlatans. (The survival of civilization probably depends on it.)

        As for scientific models, scientists are inherently skeptical, which means bad models will be challenged and exposed. The problem regarding nuclear power plants, for example, is political accountability. So it wouldn’t be bad science that causes a meltdown, it would be corrupt politics.

        • Economics never was a science. It’s only pretended to be one. And that’s when it got into trouble. It was, until about 70 years ago, considered a philosophical pursuit. A series of “what if” questions and an awful lot of tinkering and “heuristics”. If you’re delusional enough to think that the fancy equations used by economists somehow reflect what’s really happening in the marketplace, have at ‘er. Just don’t fool yourself into believing you’re saving civilization from disaster. Avoiding disaster will involve ignoring the predictions of sophists, which includes those who practice macroeconomics.

          The biggest tragedy of 2008 is that we still believe a handful of Wall Street crooks caused it. The fact is, every consumer, every home buyer, every government, and every corporation who binged on debt participated. The crooks just figured out how to take full advantage of it. That’s not to excuse them, but we all bear responsibility in giving them the matches to play with. Ever see a bidding war on a house? Those don’t happen unless credit is cheap. And if a housing bubble doesn’t happen, there’s trillions less mortgage debt for the Wall Street arsonists to play with. You don’t need complex macroeconomic models to tell you that.

          • It’s absurd to suggest macroeconomics can’t benefit from the scientific method or that the field should be ignored altogether. The fact is, removing government influence over the economy is a form of macroeconomics (free-market ideology), and one that has proven disastrous, causing two global economic meltdowns.

            The 2008 meltdown was most certainly caused by a handful of crooks. It was the result of *predatory* subprime mortgage lending and those junk mortgages being bundled up and sold as complex investments whose risk one needed to be a rocket scientist to calculate. Canada didn’t suffer a banking meltdown because we had centrist regulations in place that prevented it.

            It also doesn’t make sense to blame people for getting conned into signing onto bad mortgages or getting caught up in a speculative bubble. It’s up to government to prevent people from getting exploited by con men and to rein in speculative asset bubbles before they wreak havoc.

            The real problem is free-market ideology which fails to meet its objectives time after time after time. It is nothing more than economic anarchy. It is the problem, not the cure.

  2. The problem with hypothesis-driven research is that it can introduce operator bias unless the scientist is impeccably careful. It’s easy to become so enamoured with an idea that experiments are designed to provide support for it instead of challenging its validity. In some cases, such obsession is a good thing. Had Barry Marshall not doggedly stuck to his guns in pursuing the radical idea that H. pylori was a causative agent for stomach ulcers, we’d likely still be in the dark. Likewise, careful testing of big datasets can reduce the chance of bogus associations. The best approach is likely a combination of the two, as their weaknesses are distinct.

    The fallibility in science is usually the operator. Data doesn’t mislead; it’s the design and interpretation where things usually go wrong.

  3. People are deeply, deeply stupid. A data set will imply exactly what the person who paid for its collection wants it to imply.
    And I say, long shall this be the case.

  4. As mentioned in the article, Nassim Taleb is way ahead of everyone else in calling BS on this nonsense. Whether you call it “creeping scientism” or “big data” or something else, we are becoming a society of stats monkeys that lives and dies by “data”, regardless of its dubious benefits. These days CEOs won’t make decisions unless they have some “data” to justify them. When their company blows up, they can say “we’re only as good as the data we’re given.” Political leaders make the same excuse.

    Parasitic consulting firms too numerous to mention are pimping their expensive services to gullible governments and corporations all the time. And when their “predictive analytics” and similar phony methods fail in spectacular fashion, they make the case for more and better data. In other words, we can do better but you need to pay us more. Seems we just love to fool ourselves into believing that with just a little more data, we can make better predictions. This might be true in very narrow applications, but we shouldn’t delude ourselves into believing we can predict economic outcomes, the timing and severity of disease outbreaks, or the emergence of revolutions and other social phenomena, with big data and predictive modeling. To do so is to confuse science with sophistry.

  5. Economics IS a science. I agree with the explanation about wrong interpretation: we try to subjectively explain complex data based on objective criteria. Of course it may go wrong.

  6. Good article! I suspect we will be hearing more like it as industry attempts to colonize science.