“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.”

– Ronald Fisher, statistician and evolutionary biologist

Evidence-based decision-making is superior to intuition-based decision-making. If you disagree, please feel free to build a bridge or skyscraper on footings designed by engineers who prefer gut feel to empirically tested formulas.

And then come all the caveats, because as much as KJR has been a strong proponent of evidence-based decision-making, there are plenty of ways to go about it that are far inferior to, not to mention far more expensive than, your average Magic 8 Ball®.

The most obvious (and not this week’s topic) is the popular pastime of solving for the number — of hiding intuition-based decision-making inside evidence-oriented clothing. Before big-data analytics became popular, Excel was the preferred tool for this job.

The Hadoop ecosystem includes far more sophisticated ways to reach the same foregone conclusions. Apply the right filters and shop around among the different statistical tests available to you in even the sparsest of statistical packages, and if you can’t come up with the answer you want, you aren’t using your imagination.
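To make the point concrete, here’s a minimal sketch of my own (ordinary Python, not anything Pew, comScore, or any particular analyst actually uses): run enough candidate comparisons through a standard significance test and a few will “succeed” on pure noise.

```python
# A toy demonstration of test-shopping: 100 comparisons on pure noise,
# a handful of which will come out "significant" at the usual 0.05 cutoff.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups drawn from the SAME distribution; any "difference" is noise.
group_a = rng.normal(size=(100, 30))   # 100 candidate metrics x 30 observations
group_b = rng.normal(size=(100, 30))

significant = [
    m for m in range(100)
    if stats.ttest_ind(group_a[m], group_b[m]).pvalue < 0.05
]

# At alpha = 0.05, expect roughly five "discoveries" even though
# nothing real is going on.
print(f"{len(significant)} of 100 comparisons came out 'significant'")
```

Multiple-comparison corrections exist for exactly this reason; shopping around is, in effect, a way of pretending they don’t.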

But even with the best of intentions and no desire, conscious or otherwise, to distort, statistical analysis holds plenty of pitfalls, even for professionals.

Take this recent correction request, filed by CNN with the Pew Research Center. As reported by the Washington Post’s Erik Wemple, a recent Pew study concluded that last January, Foxnews.com had more unique visitors than CNN.com.

CNN’s complaint: Pew’s analysis …

Uses a custom entity, [E] Foxnews.com, for Fox News against raw site-level property metrics, [S] for CNN.com. This is not an apples-to-apples comparison since a custom entity may contain a collection of other URLs that remain hidden. As it turns out, we learned from our inquiry to comScore that Fox News’ custom entity is also comprised of a variety of off-site traffic assignment letters (TALs) and, as such, is not truly the audience of foxnews.com but instead is assigned traffic from other sites that is reallocated back to Fox News even though the visitor did not consume said content on foxnews.com.

I won’t comment as to whether the use of TALs is legitimate or not, on the grounds that I’m not remotely qualified to do so. If you’re interested, here’s a link for more on the topic.

Presumably, Pew’s analysts are properly qualified, but (1) might not have been aware that comScore included TALs in its Foxnews.com tallies; or (2) might have concluded that including them in web traffic statistics is legitimate.

Which gets us to your big-data repository. One of the attractions of NoSQL technologies like Hadoop is that you can pretty much dump data into them without worrying too much about how the data are organized. That’s addressed during the analysis phase, which is why another descriptor for this family of technologies is “schema on demand” (or, more commonly, “schema on read”).

It’s reasonably well-known that this also means a lot of the data being dumped into these “data lakes” has not been subjected to much in the way of cleansing. That’s almost the point of it: Hadoop and its brethren are adept at storing huge streams of inbound data (hence “big data”). They wouldn’t be so adept at it if some pre-processor had to cleanse it all first.
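If that sounds abstract, here’s a toy illustration of schema-on-demand, in plain Python with invented field names rather than anything resembling a real Hadoop deployment: the lake accepts whatever shows up, and the analyst decides later what the fields mean and what counts as clean.

```python
# Hypothetical sketch: raw records go in as-is; structure and cleansing
# are applied only at analysis time. Field names are invented for illustration.
import json

# Ingest: append whatever arrives. No validation, no fixed schema.
raw_lake = [
    json.dumps({"site": "foxnews.com", "visitors": 120, "source": "custom_entity"}),
    json.dumps({"site": "cnn.com", "uniques": 115}),    # different field name
    json.dumps({"site": "cnn.com", "uniques": "n/a"}),  # dirty value, stored anyway
]

# Analysis: the analyst decides, after the fact, what counts as a visitor count.
def visitor_count(record: dict):
    value = record.get("visitors", record.get("uniques"))
    return value if isinstance(value, int) else None  # cleansing happens here, not at load

totals: dict = {}
for line in raw_lake:
    rec = json.loads(line)
    count = visitor_count(rec)
    if count is not None:
        totals[rec["site"]] = totals.get(rec["site"], 0) + count

print(totals)  # {'foxnews.com': 120, 'cnn.com': 115}
```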

You have to pay the piper sometime, of course. In this case, it means you’ve shifted work from those who program data-loading into traditional data warehouses to those who analyze data stored in non-traditional NoSQL data lakes.

What’s less well recognized is what Pew’s analysts did or didn’t address with the TAL question: With traditional data warehouses, professional analysts make decisions like this as a conscious part of designing their extract, transform, and load (ETL) processes.

They might miss something subtle too, of course … there never are any guarantees when those pesky human beings are involved … but at least there’s a defined task and assigned responsibility for taking care of the problem.

Not that this means it’s all taken care of: Whatever filtering decisions data warehouse analysts might consciously make while implementing the system will usually turn into hidden assumptions inherited by those who analyze the data stored there later on.
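Here’s a hypothetical sketch of how that happens; the field names and the TAL-style rule are inventions of mine, not anything taken from the Pew or comScore data. A filtering decision coded once into the ETL pipeline quietly shapes every analysis anyone runs against the warehouse afterward.

```python
# Hypothetical ETL pipeline: the designer's one-time decision to drop
# reallocated off-site traffic becomes an invisible assumption downstream.
from typing import Iterator

def extract(raw_rows: list) -> Iterator[dict]:
    yield from raw_rows

def transform(rows: Iterator[dict]) -> Iterator[dict]:
    for row in rows:
        # The warehouse designer decided, once, to exclude reallocated
        # off-site traffic. Analysts querying the warehouse later never
        # see this rule; they just inherit it.
        if row.get("traffic_type") == "reallocated":
            continue
        yield row

def load(rows: Iterator[dict], warehouse: list) -> None:
    warehouse.extend(rows)

warehouse: list = []
raw = [
    {"site": "foxnews.com", "uniques": 100, "traffic_type": "on_site"},
    {"site": "foxnews.com", "uniques": 25,  "traffic_type": "reallocated"},
]
load(transform(extract(raw)), warehouse)

# The 25 reallocated uniques are simply gone; nothing in the warehouse
# records that they ever existed or why they were dropped.
print(sum(r["uniques"] for r in warehouse))  # 100
```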

At least with schema-on-demand, analysts have to make these decisions consciously, and so are aware of them.

If, that is, they’re knowledgeable enough to be aware of the need to make them all.

Which is why, whether your analytics strategy is built on a traditional data warehouse or a schema-on-demand data lake, you need the services of a professional data scientist.

Or, as we used to call them, statisticians.