"There are two major misconceptions among the data science community that I’ve observed:
1. Data scientists erroneously assume that the “big” in big data constitutes a solution overcoming selection biases in data. Somehow that the data are “big” implies that they represent a full population, and, therefore, are not observing a self-selected group of individuals that may be driving their significant results.
2. However, irrespective of whether a scientist could in fact collect data for an entire population, e.g. all tweeters in the universe, “big” does not imply that the relationships in the data are somehow now causal and/or are observed without bias. Experiments remain necessary components to identifying causal vs. correlated relationships.
For those data scientists who understand (1) and are running experiments, a frequently overlooked or not well understood practice, particularly with regards to studying big data collected from online behavior, is computing the “power” of an experiment, or the probability of not committing a Type II error or the probability of correctly rejecting a false null hypothesis."

I, of course, am hijacking the meaning as evidence for something I wrote in my earlier post: "big data" and data science do not make traditional statistics obsolete by any means.
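
To make the notion of "power" in the quote concrete, here is a minimal sketch of a power calculation. It is not taken from the quoted post. It assumes a two-sided, two-sample z-test with equal group sizes and a known standardized effect size, and the function name `two_sample_power` and its parameters are my own hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumptions (illustrative, not from the quoted post):
    - effect_size is the true standardized difference in means (Cohen's d)
    - both groups have n_per_group observations and a known common variance
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # rejection threshold under the null
    # Mean of the test statistic when the true effect is effect_size
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # A small true effect at several sample sizes
    for n in (100, 1_000, 10_000):
        print(n, round(two_sample_power(0.05, n), 3))
```

Power is one minus the Type II error rate, so a low value means the experiment will often fail to detect an effect that really exists. Under these assumptions the sketch shows power rising with sample size for a fixed effect; a large sample helps with power, but it does nothing about selection bias or causality, which is the point of the quote.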