Sunday, January 12, 2014

Data Science and the Future of Statistics

The post that follows is an outline of a (long overdue) paper I've been working on with a professor-turned-colleague of mine. I want to get the ideas out of my head and open to comment quickly, hence this post. But more than just getting an idea on paper, this post is really a statement of what this blog is about and where I am taking my career as a statistician.

A bit of background

There has been quite a bit of hand-wringing and debate about the future of statistics within the academic side of the discipline over the last several years, in response to the increased prominence of data science over traditional statistics in academia, business, government, and public perception. In an article in AMSTAT News aptly titled "Aren't We Data Science?", then-ASA president Marie Davidian summarizes these concerns.
"Many [statisticians] have expressed concern that these and other data-oriented initiatives have been or are being conceived on your campuses without involvement of or input from the department of statistics or similar unit. I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to “small data” and “traditional” “tools” for their analysis, while data science is focused on Big Data, Big Questions, and innovative new methods. I’ve also heard about presentations on data science efforts by campus and agency leaders in which the word “statistics” was not mentioned. On the flip side, I have heard from statistics faculty frustrated at the failure of their departments to engage proactively in such efforts."
This concern is not new, though considering the author and source of publication, it has again risen in prominence in the minds of statisticians, prompting a renewed back-and-forth debate over whether or not statistics is data science and/or whether statistics should or should not engage with data science.

Interestingly, it has been my perception that this debate is largely relegated to statistics academia. Applied statisticians in industry tend to be very focused on their immediate objectives, much more likely to cross discipline boundaries to accomplish those objectives, and in general are more "data sciency" than their academic colleagues. And with a few exceptions, the data scientists that I know tend to hold statistics (and mathematics) knowledge as very fundamental to doing data science "right." Which is to say, they don't perceive much of a schism at all and I think many would argue that data science is making statistics more important, not less.

Nevertheless, times are changing and change requires adaptation.

But what are "Big Data" and "Data Science"?

Both of the above terms lack a single clear definition. I suspect that a substantial portion of the debate in the data science and statistics communities is born of unclear definitions. People who might otherwise agree are unwittingly talking past each other. So, for the sake of clarity in this article, let me explain what these terms mean to me.

Arguably the more confusing of these is “big data.” Originally, “big data” referred to gigantic data sets, terabytes in size, which pose significant technical challenges for storage, transfer, and computation. This definition is often described by the 3V model: volume, velocity, and variety of data. More recently, however, the term has been used colloquially to refer to a broad range of activities where data are central. Rather than a technical definition, this more recent use could refer to the awakening of non-scientists to the idea that data are important.

Data science is a multidisciplinary field that involves the use and study of data for various purposes, and is actually close to Webster’s definition of statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” However, data science’s roots are largely in computer science, and the field is by no means limited to numerical data.

There have been numerous attempts to better define data science. A popular Venn diagram produced in 2010 depicted data science as the intersection of “math and statistics knowledge”, “hacking skills”, and “substantive expertise”. A more recent update by another blogger, pictured below, contends that data science is the union of these skills and possibly more. And while there has been much discussion of the elusive “data scientist” who makes hundreds of thousands of dollars per year, a consensus has been forming more recently that data science is best performed by teams of experts from each of the involved disciplines.


What changed?

Even if one were to adopt the broad definition that statistics is the study of data, and thus "data science" (an argument that has been made many times), there have been substantive changes in the world of data that make things different.

First, data have proliferated. This isn't about the volume of data in a "big data" sense, but rather that data sets exist in large numbers everywhere. While some data sources are better (less biased, less noisy, more complete, etc.) than others, it has become much easier to gather data in general, especially through the web. Now everyone has the ability to do some basic data analysis. More people are analyzing data, coming from different disciplines and using different approaches. So much data makes spurious correlations easier to find; subject-matter expertise then becomes all the more important to contextualize and assess the validity of the findings from a data analysis.
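To make the point about spurious correlations concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the function names are hypothetical, and the "data sets" are pure Gaussian noise with no real relationship to the target. It compares the strongest correlation found when searching 5 unrelated series with the strongest found when searching 2,000.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def max_spurious_correlation(n_obs, n_candidates, seed=0):
    """Largest |correlation| between a noise target and unrelated noise series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise, so any correlation is spurious.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


few = max_spurious_correlation(n_obs=30, n_candidates=5)
many = max_spurious_correlation(n_obs=30, n_candidates=2000)
print(f"best of 5 candidates:    {few:.2f}")
print(f"best of 2000 candidates: {many:.2f}")
```

With only 30 observations per series, searching thousands of unrelated candidates will typically turn up a correlation that looks substantial, even though none of the series has anything to do with the target. That is the situation in which domain knowledge has to do the work the data cannot.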

Tangential to the above, data has gotten popular. The need for compelling visualizations and narratives that convey a complicated story in simple, direct ways has increased. It is no longer the case (if it ever was) that only scientists, trained to deal with complexity, are the consumers of data products. Now anyone with a Blogger account can track the usage statistics of their blog, for example.

Next, as models have become more accurate, they have also become more complex. Ensembles of models are almost always better predictors than any one model. However, ensemble methods make understanding causality very difficult or impossible. And while these models are empirically accurate, their asymptotic properties are often unknown. And because of the number of variables in the model, an additional question arises, "asymptotic to what?" One could take any or all of the number of observations, predictors, models in the ensemble, etc. to infinity and possibly arrive at different solutions. In the age of "big data" asymptotic properties matter.
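One intuition for why ensembles tend to predict well can be sketched with a toy simulation. This is an idealized illustration, not a description of any particular method: it assumes each "model" is just the truth plus independent Gaussian noise, and the function name and parameters are hypothetical. Real models have correlated errors, so the gain in practice is smaller than what this shows.

```python
import random


def mse_single_vs_ensemble(n_models=25, n_points=2000, noise_sd=1.0, seed=1):
    """Compare squared error of one noisy predictor with an average of many."""
    rng = random.Random(seed)
    single_sq = 0.0
    ensemble_sq = 0.0
    for _ in range(n_points):
        truth = rng.uniform(-1, 1)
        # Assumption: every model is unbiased with independent errors.
        preds = [truth + rng.gauss(0, noise_sd) for _ in range(n_models)]
        single_sq += (preds[0] - truth) ** 2
        average = sum(preds) / n_models
        ensemble_sq += (average - truth) ** 2
    return single_sq / n_points, ensemble_sq / n_points


single_mse, ensemble_mse = mse_single_vs_ensemble()
print(f"single model MSE: {single_mse:.3f}")
print(f"ensemble MSE:     {ensemble_mse:.3f}")
```

Under these assumptions the averaged prediction has roughly 1/25 of the error variance of a single model. Note what the sketch does not give you: the average of 25 models has no single set of coefficients to interpret, which is the trade-off described above.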

Finally, data got bigger in the "big data" sense. Storing, moving, and processing terabytes of data is no small task and not one that is at all "statistical" in nature. There has always been a working relationship between statistics and computer science in the design of statistical software. Now these software engineering tasks are prominent at every phase of many research projects as the volume, velocity, and variety of data must be managed efficiently if any useful analysis is to come from a project.

Whither statistics?

If we assume that the above are true, that changes in the way data are stored, analyzed, and consumed have led us to this new thing called "data science" and that data science is inherently multidisciplinary, where can statistics best place itself for the future?

The more things change, the more they remain the same.

In an age of "big data" the role of the statistician remains largely unchanged, though the models and the distributions may be new or at least less-studied than others. As we move towards more complex statistical and machine-learned models, there is still a need to understand the properties of and to get inferences from these models, beyond just prediction.

For example, suppose a data scientist develops an empirically accurate deep learning model predicting whether or not (or how likely it is that) a patient will develop a particular type of cancer in the next 5 years. How can the doctor recommend that the patient change his or her behaviors to reduce the risk of cancer without knowing which variables influence the prediction and in which ways, lacking explicit model structure and parameters? What if a positive prediction may require costly and disruptive preemptive procedures? How must doctor and patient balance the uncertainty of a positive prediction against those costs without a distribution from which to construct a prediction interval? What if, unbeknownst to doctor, patient, and data scientist, the model predictions are inconsistent with respect to the number of visible predictive units, of which there are thousands?

These questions are very fundamental problems that statisticians have been studying for a long time in different contexts. But now, we are using different estimation methods and some of these fundamental questions must be answered anew. At the risk of sounding grandiose, it is as though we are re-discovering linear regression and how to use it with confidence. (Pun intended.)

In fact, some of these areas of research are already being tackled. Within the realm of random forests, Gerard Biau, Luc Devroye, and Gabor Lugosi (of France, Canada, and Spain respectively) have demonstrated the consistency of random forests and other averaging classifiers. Stefan Wager, Trevor Hastie, and Bradley Efron from Stanford have proposed methods for standard errors of predictions from bootstrapped and bagged learners. In a forthcoming paper, Abhijit Dasgupta and his co-authors propose a method for estimating effect size of predictors in "black box" algorithms like random forest or deep-learning classifiers.

And in a world that is almost constantly streaming data, careful research design and data collection are as important as ever. Sometimes it may seem like common sense to realize you have a biased sample, but how many articles have you seen about the link between social media and the Arab Spring? (I will explicitly point out that those in developing countries most likely to have the economic means to use social media are not a representative sample of the populations of those same countries.) This is especially important in areas where the source of data is born of the internet and requires people to implicitly or explicitly opt in. These are challenges that survey statisticians face regularly.
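The opt-in problem can be sketched with a small simulation. All of the numbers here are made up for illustration: a hypothetical population in which 20% of people are affluent, affluent people are more likely both to be online and to hold a particular opinion, and the analyst only observes the people who are online.

```python
import random


def simulate_opt_in_bias(pop_size=100_000, seed=2):
    """Return (population rate, opt-in sample rate, opt-in sample size)."""
    rng = random.Random(seed)
    population_total = 0
    sample_total = 0
    sample_n = 0
    for _ in range(pop_size):
        affluent = rng.random() < 0.2
        # Hypothetical rates: the opinion is more common among the affluent.
        holds_opinion = rng.random() < (0.7 if affluent else 0.3)
        # Hypothetical rates: the affluent are far more likely to be online.
        online = rng.random() < (0.8 if affluent else 0.1)
        population_total += holds_opinion
        if online:
            sample_total += holds_opinion
            sample_n += 1
    return population_total / pop_size, sample_total / sample_n, sample_n


pop_rate, sample_rate, sample_n = simulate_opt_in_bias()
print(f"true population rate:   {pop_rate:.2f}")
print(f"rate among online only: {sample_rate:.2f} (n = {sample_n})")
```

Under these assumed rates the true share is about 0.38, while the online-only sample suggests something closer to 0.57. The sample is tens of thousands of people, so collecting more of the same data does nothing to close the gap; only design or reweighting can.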

But sometimes things just change.

While many of the fundamental problems facing statisticians are the same, the applications and environment are different. Statistics education, particularly at the graduate level, must adapt. As data gets "bigger" and research and applications become more multidisciplinary, the need for a statistician to communicate and collaborate with a wide range of professionals and laypeople increases.

First, future statistics education should provide a minimum competency in the fundamentals of computer science. Statisticians are not known for being able to program well or across platforms; this must change. While it may not be the statistician who is optimizing an approach to scale it up, the statistician must work closely with software engineers to develop solutions that can scale and to ensure that the scaled solution still has the properties of the statistical solution developed. The best way for this to happen is if both the statistician and the computer scientist understand at least the basics of the technical and conceptual challenges the other faces.

Next, many of the examples and basic applications taught in foundation statistics courses may need to be updated. For example, an understanding of ensemble methods is going to be as important to a statistician's basic knowledge as linear regression. And it wouldn't hurt to have spent at least one lecture learning about concepts like Zipf's and Heaps' laws as analysis of unstructured data from text becomes more common.
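For readers who haven't met them: Zipf's law says that a word's frequency in a text is roughly inversely proportional to its frequency rank, and Heaps' law says that the number of distinct words grows sublinearly with the length of the text. A minimal sketch of the two quantities involved is below. The function names and the toy sentence are my own illustration; a sentence this short only shows the mechanics, and you would need a real corpus to see either law emerge.

```python
from collections import Counter


def rank_frequency(tokens):
    """Zipf-style table: (rank, word, count), most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def vocabulary_growth(tokens):
    """Heaps-style curve: number of distinct words seen after each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()
table = rank_frequency(tokens)
growth = vocabulary_growth(tokens)

for rank, word, count in table[:3]:
    print(rank, word, count)
print("distinct words after each token:", growth)
```

On a large corpus, plotting log(count) against log(rank) from the first function, and log(vocabulary size) against log(tokens read) from the second, is the usual way to eyeball whether the two laws roughly hold.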

Finally, communication to non-technical audiences is becoming more important. Anecdotally, this is an area where many technical fields, including statistics, are already weak. But as data products become more and more mainstream, an effective statistician must be able to get his or her message heard by a wider variety of audiences. Many statistics programs require a statistics workshop course to this end. Perhaps courses on effective data visualization would also be helpful.

In sum...

It may have once been true that statistics was data science, but moving forward data science is a fully multidisciplinary field. That said, statistics is unambiguously one of those disciplines and its role remains largely unchanged. Statisticians are still needed for research design, uncertainty quantification, and the derivation and interpretation of the properties of models and methods. I would argue, then, that the role of the statistician is to keep the science in data science.

2 comments:

  1. Hi Tommy, I think that the early adopter curve should be considered here on this emerging field of data science. In emerging and less defined fields, early adopters tend to be entrepreneurs (Chief Data Scientist, Non-Employer Data Science Consultant, etc). What happens when the hype slows down and many more companies understand how to build data science teams? Will we go back to calling them statisticians, as the stats field catches up on leveraging computing power or will the field of statistics be forever marginalized? You mentioned that the Data Science subject is a lightly attended event at ASA's JSM. Under the ACM, KDD yearly conferences are quite large and they have begun to embrace the "Data Science" title: http://www.kdd.org/kdd2016/
