Wednesday, April 2, 2014

Google Flu + Data Science and Statisticians (again)

In the past week, several people have emailed me articles on the fallout from Google Flu’s big flop. One Financial Times article in particular stood out to me as an excellent statement on the state of data science vis-a-vis statistical rigor.

A common criticism of big data/data science is that exuberance has led folks to believe, erroneously, that they can ignore basic statistical principles if they have “big data” and that statistics is only for “small data.” This seems to be what happened to Google Flu, and the Financial Times makes that same case.

Big data and data science have most certainly been overhyped recently. It may be tempting to see this as justification that big data/data science is just media buzz. However, the technology that makes acquiring these data easy is here to stay.

Reconciling statistical best-practices and big data is actively being discussed in the data science and big data communities. (I can point to this post at Data Community DC as one piece of evidence.) There are also several university statistics programs that are actively bringing statistical rigor to big data/data science issues. (Stanford and Berkeley come immediately to mind.)

My tactic when discussing these issues in professional settings has been to be a voice of caution if the audience is excited and a voice of optimism if the audience is skeptical. Google Flu is a perfect example of the promise and peril associated with big data and data science: the timing and volume of the data can add to predictive power, but poor design can lead to models that confidently point to the wrong answer. (We statisticians call this “bias”.)
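
To make the bias point concrete, here is a toy simulation in Python. It is only a sketch under made-up assumptions and has nothing to do with how Google Flu was actually built: the true flu rate (10%), the sample sizes, and the `over_report` factor (sick people being three times as likely to show up in the big data set) are all hypothetical numbers chosen for illustration.

```python
import math
import random
import statistics


def mean_ci(sample, z=1.96):
    """Approximate 95% confidence interval for the mean of a sample."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se


def simulate(seed=0, true_rate=0.10, small_n=500, big_n=50_000, over_report=3.0):
    """Compare a small random sample with a large but biased one.

    All parameter values are illustrative assumptions, not real flu data.
    """
    rng = random.Random(seed)

    # Small, well-designed sample: each person is sick with probability true_rate.
    small = [1 if rng.random() < true_rate else 0 for _ in range(small_n)]

    # Big, poorly designed sample: sick people are `over_report` times more
    # likely to end up in the data, so the share of sick records is inflated.
    p_biased = (true_rate * over_report) / (true_rate * over_report + 1 - true_rate)
    big = [1 if rng.random() < p_biased else 0 for _ in range(big_n)]

    return mean_ci(small), mean_ci(big)


if __name__ == "__main__":
    small_ci, big_ci = simulate()
    print("true rate:           0.100")
    print("small random sample: %.3f to %.3f" % small_ci)
    print("big biased sample:   %.3f to %.3f" % big_ci)
```

Under these assumptions the big sample produces a much narrower interval than the small one, but that interval is centered near 25% rather than the true 10%. More data shrinks the uncertainty around the wrong answer; it does nothing to remove the bias.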
