Friday, February 21, 2014

I can haz buzzwords?

(Update 11/20/2014 - Sentences were updated to be more technically accurate with respect to bias and consistency.)

Catty title aside, this post takes a good swing at defining terms we hear thrown around about data these days, and it mostly does a good job.

I particularly like the definitions for big data (it's mostly about your tools) and data science (the whole enchilada).

But I vehemently disagree with its characterization of statistics as the discipline for small data. Ha.

One powerful method in statistics involves taking things to infinity and seeing what happens. (Yes, I'm talking about theoretical statistics now.) But these asymptotic properties and distributions are how statistics can quantify uncertainty around a small dataset. We know what it should look like; we can compare it to what we see. Because such properties were derived from infinity, they should only match your large dataset better, so long as it's representative of the population of interest.*

But that last bit is key. When dealing with big data, experimental design really matters. Get enough biased data and you'll be uber confident in a wrong result. So if you collected data poorly (given your question), you'll be extremely confident in a bad result. (Bias doesn't go away by adding more observations. A biased sample doesn't become unbiased just because it's bigger.)
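
To make that concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: a hypothetical population that is normal with mean 50 and standard deviation 10, and a hypothetical collection process that only ever records values above 45. The names (`biased_sample`, `mean_ci`, `run_biased_sampling`) are mine, not from any library. The 95% interval uses the usual normal approximation.

```python
import random
import statistics

TRUE_MEAN = 50.0  # hypothetical population mean
TRUE_SD = 10.0    # hypothetical population standard deviation
CUTOFF = 45.0     # hypothetical selection rule: values at or below this are never recorded


def biased_sample(n, rng):
    """Collect n observations, silently dropping anything at or below CUTOFF."""
    kept = []
    while len(kept) < n:
        x = rng.gauss(TRUE_MEAN, TRUE_SD)
        if x > CUTOFF:
            kept.append(x)
    return kept


def mean_ci(data):
    """Sample mean with a normal-approximation 95% confidence interval."""
    n = len(data)
    m = statistics.fmean(data)
    se = statistics.stdev(data) / n ** 0.5
    return m, m - 1.96 * se, m + 1.96 * se


def run_biased_sampling(sizes=(100, 10_000, 200_000), seed=0):
    """Return (n, mean, ci_low, ci_high) for each sample size."""
    rng = random.Random(seed)
    return [(n, *mean_ci(biased_sample(n, rng))) for n in sizes]


if __name__ == "__main__":
    for n, m, lo, hi in run_biased_sampling():
        print(f"n={n:>7}  mean={m:.3f}  95% CI=({lo:.3f}, {hi:.3f})  true mean={TRUE_MEAN}")
```

Under these assumptions the sample mean settles near 55 rather than 50, and the interval around it keeps shrinking as n grows. More data buys more confidence in the same wrong number.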

And if you're using an estimator on "big" data, I hope you know its asymptotic properties, because you're operating in the asymptote. An inconsistent estimator is really bad if you have millions of observations.**
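
One textbook case of an inconsistent estimator is the least-squares slope when the predictor is measured with noise (attenuation bias). The sketch below is my own illustration, not something from the post I'm responding to. It assumes a true slope of 2, a standard normal predictor, and independent measurement noise with standard deviation 1; the function names are hypothetical. Under those assumptions the slope estimate converges to 2 × 1 / (1 + 1) = 1, not 2.

```python
import random

TRUE_SLOPE = 2.0  # hypothetical true effect of x on y
NOISE_SD = 1.0    # hypothetical measurement error added to x before we see it


def ols_slope(xs, ys):
    """Simple least-squares slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def slope_with_noisy_x(n, rng):
    """Simulate y from the true x, then regress y on a noisy copy of x."""
    xs_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in xs_true]
    xs_seen = [x + rng.gauss(0.0, NOISE_SD) for x in xs_true]
    return ols_slope(xs_seen, ys)


def run_attenuation(sizes=(100, 10_000, 200_000), seed=0):
    """Return (n, estimated_slope) for each sample size."""
    rng = random.Random(seed)
    return [(n, slope_with_noisy_x(n, rng)) for n in sizes]


if __name__ == "__main__":
    for n, slope in run_attenuation():
        print(f"n={n:>7}  estimated slope={slope:.4f}  true slope={TRUE_SLOPE}")
```

The estimates get steadier as n grows, but they steady around 1, half the true effect. Millions of observations only make the wrong answer more precise.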

The author of the post has statistics nailed for a culture of conservatism, however. I often say that my job is to "be the wet blanket in the room." Statisticians will keep you honest, if frustrated. But if you really need to pull the trigger on something fast, grab a computer scientist. They get the job done and I tip my hat to them for it.

However, one of the keys to success, I think, is knowing whether you're better served by the fast and workable solution or the slow but confident solution. That depends on context, and the wrong choice can be embarrassing at best. C'est la vie.

* Caveats go with this sentence.

** One way to look at inconsistency is asymptotic bias. If your estimator is inconsistent, then your estimates converge to the wrong answer as your sample gets larger.

*** Update for clarity: Win-Vector is a blog I like very much, lest anyone think otherwise. Many of us in this thing we call data science come from different disciplines with different perspectives and limited understanding of other disciplines. (I knowingly open myself to criticism and admit my own fallibility in saying this.) As data science evolves (perhaps into its own independent discipline) we'll all benefit from cross-disciplinary exchanges.
