Wednesday, February 12, 2014

“The numbers are where the scientific discussion should start, not end.”

An editorial posted today in Nature, “Scientific method: Statistical errors,” summarizes a growing chorus of arguments against an over-reliance on p-values in scientific research. Were the article speaking only to statisticians, I'd say it was preaching to the choir. However, most people who use statistics are not statisticians.

Also, over the course of my relatively short career, I've seen tiny p-values affixed to results I simply don't believe are real or robust, and larger p-values attached to results that I believe are both real and robust. I once received the comment, "This p-value is only 0.07. Why are we even talking about this?" The reason, of course, is that the p-value is only one measure of the validity of your results. At the end of the day, we have to think critically about all the evidence in front of us, which can include non-statistical measures like past experience, intuition, or anecdotes.

To be clear, I am not saying that we can safely disregard statistical evidence in favor of gut feeling or whatever bias you bring with you. Rather, critical thinking is required across the board. If we skewer the p-value without learning the real lesson, we'll just find some other "gold standard" dogma of statistical significance down the road and be back where we started.
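To make the "one metric" point concrete, here is a minimal sketch (the groups, sample sizes, and effect size are all invented for illustration) of how a huge sample can produce a vanishingly small p-value for an effect far too small to matter in practice:

```python
import math
import random
import statistics

random.seed(42)

# Two hypothetical groups differing by a trivial 0.01 units, but with a
# million observations each -- sheer sample size drives the p-value down.
n = 1_000_000
a = [random.gauss(0.00, 1.0) for _ in range(n)]
b = [random.gauss(0.01, 1.0) for _ in range(n)]

mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
var_a, var_b = statistics.variance(a), statistics.variance(b)

# Two-sample z-test (reasonable at this sample size).
z = (mean_b - mean_a) / math.sqrt(var_a / n + var_b / n)
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Cohen's d: the standardized effect size the p-value ignores.
d = (mean_b - mean_a) / math.sqrt((var_a + var_b) / 2)

print(f"p = {p:.2e}, Cohen's d = {d:.3f}")
```

The p-value comes out tiny, yet d is around 0.01, far below even the conventional "small effect" threshold of 0.2: statistically significant, practically negligible.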

Some of the quotes I enjoyed (in addition to the title of this post):
“The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.”

“Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. [...] If the various methods come up with different answers, he says, 'that's a suggestion to be more creative and try to find out why', which should lead to a better understanding of the underlying reality.”

“But significance is no indicator of practical relevance, he says: We should be asking, 'How much of an effect is there?', not 'Is there an effect?'”
And data miners/data scientists, take note of the second quote: approaching the same problem from different directions and with different methods should yield similar or mutually supporting results. If it doesn't, why not? That may be an even more interesting question than the one you set out to answer.
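As a toy illustration of that advice (all numbers here are made up), the sketch below attacks the same two-sample comparison with two different methods, a parametric z-test and a nonparametric permutation test, and checks whether they tell the same story:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical measurements for two groups with a genuine difference.
a = [random.gauss(0.0, 1.0) for _ in range(100)]
b = [random.gauss(0.7, 1.0) for _ in range(100)]

observed = statistics.fmean(b) - statistics.fmean(a)

# Method 1: parametric two-sample z-test.
se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
p_param = math.erfc(abs(observed / se) / math.sqrt(2))  # two-sided

# Method 2: permutation test -- shuffle the group labels and ask how often
# a difference at least as large as the observed one arises by chance.
pooled = a + b
trials = 5000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[:100]) - statistics.fmean(pooled[100:])
    if abs(diff) >= abs(observed):
        count += 1
p_perm = (count + 1) / (trials + 1)

print(f"parametric p = {p_param:.4f}, permutation p = {p_perm:.4f}")
```

Here the two p-values broadly agree. If they disagreed sharply, that disagreement would itself be the interesting question: perhaps a skewed distribution, outliers, or a violated assumption is lurking in the data.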
