Thursday, January 30, 2014

Talk it out

Abhijit Dasgupta on the need for cross talk between disciplines in data science and statistics.

This quote summarizes something I come across regularly, coming from a stats background but reading a lot of machine learning papers:
The new guys are coming up against the same brick walls as the earlier researchers, and there seems to be a lack of understanding among the new researchers of the path already travelled (since the keywords are different and not necessarily directly related, Google Scholar fails).
I wonder how much that goes in the other direction as well.

Lies or contextually relevant reporting?

In a recent blog post titled Lies, Damn Lies. “Data Journalism” and Charts That Don’t Start at 0, the author takes issue with the chart accompanying the below tweet from Heidi N. Moore.

Instead, we are told, this chart should have its y-axis start at zero. And in so doing, we see that the employment ratio is "nowhere near a 'ski jump.'" (The data are also broken out between men and women.)

Really? It is important to provide context when displaying data. In the context of these data, is it realistic that this employment ratio would ever be zero or even near zero? No? Then zero has no business being on the chart's axis and including it is the real distortion here.

But if I were to take issue with both of these charts, my peeve is that levels of a time series are often misleading. Let's, instead, look at the percent change in this ratio from a year earlier.
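
For anyone who wants to try that transformation, here is a minimal sketch of a year-over-year percent change. The function name and the monthly figures are hypothetical placeholders, not the actual employment series, and the sketch assumes evenly spaced monthly observations with no gaps.

```python
def year_over_year_pct_change(values, periods_per_year=12):
    """Percent change of each observation from the same period one year earlier."""
    changes = []
    for i in range(periods_per_year, len(values)):
        prior = values[i - periods_per_year]
        changes.append(100.0 * (values[i] - prior) / prior)
    return changes


# Hypothetical monthly ratios (in percent): a flat year followed by a decline.
ratios = [63.0] * 12 + [62.4, 61.8, 61.0, 60.3, 59.8, 59.4]

for change in year_over_year_pct_change(ratios):
    print(round(change, 2))
```

The first twelve observations have no year-earlier value, so the output is twelve points shorter than the input.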

Whoa! What's up with that historically-low drop starting in 2008? It's almost like "falling off a cliff" or "a ski jump" or whatever hyperbole you choose.

Bottom line: context is important, not arbitrary axis rules. While Heidi Moore's chart was not perfect, it still got the right message across within the context of the story: employment took a nose dive going into the recession. In fact, such a drop is unprecedented over the history of the displayed data.

Monday, January 13, 2014

Katya Vasilaky on the continued need for good experimental design in a world of big data

Posted on Data Community DC's blog, Dr. Katya Vasilaky writes:
"There are two major misconceptions among the data science community that I’ve observed:
  1. Data scientists erroneously assume that the “big” in big data constitutes a solution overcoming selection biases in data. Somehow that the data are “big” implies that they represent a full population, and, therefore, are not observing a self-selected group of individuals that may be driving their significant results.
  2. However, irrespective of whether a scientist could in fact collect data for an entire population, e.g. all tweeters in the universe, “big” does not imply that the relationships in the data are somehow now causal and/or are observed without bias. Experiments remain necessary components to identifying causal vs. correlated relationships.
For those data scientists who understand (1) and are running experiments, a frequently overlooked or not well understood practice, particularly with regards to studying big data collected from online behavior, is computing the “power” of an experiment, or the probability of not committing a Type II error or the probability of correctly rejecting a false null hypothesis."
I, of course, am hijacking the meaning as evidence for something written in my earlier post: "big data" and data science do not make traditional statistics obsolete by any means.
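
To make the "power" in that quote concrete, here is a rough sketch for the simplest textbook case: a two-sided, two-sample z-test with equal group sizes and a known standard deviation. The function name and the example inputs are my own assumptions for illustration. Real online experiments usually need more care (unknown variance, unequal groups, multiple comparisons).

```python
import math
from statistics import NormalDist


def power_two_sample_z(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (known sigma, equal n)."""
    z = NormalDist()
    # Standard error of the difference between the two group means.
    se = sigma * math.sqrt(2.0 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se
    # Probability the test statistic lands in either rejection region
    # when the true difference in means equals `effect`.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical scenario: a small true effect of 0.05 standard deviations.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(power_two_sample_z(effect=0.05, sigma=1.0, n_per_group=n), 3))
```

Power here is one minus the Type II error rate. It rises with sample size, but it says nothing about whether the sample was selected without bias.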

Links I like 01/13/2014

Sunday, January 12, 2014

Data Science and the Future of Statistics

The post that follows is an outline of a (long overdue) paper I've been working on with a professor-turned-colleague of mine. I want to get the ideas out of my head and open to comment quickly, hence this post. But more than just getting an idea on paper, this post is really a statement of what this blog is about and where I am taking my career as a statistician.

A bit of background

There has been quite a bit of hand wringing and debate about the future of statistics within the academic side of the discipline over the last several years in response to the increased prominence of data science over traditional statistics in academia, business, government, and public perception. In an article in AMSTAT News aptly titled "Aren't We Data Science?" then ASA president Marie Davidian summarizes these concerns.
"Many [statisticians] have expressed concern that these and other data-oriented initiatives have been or are being conceived on your campuses without involvement of or input from the department of statistics or similar unit. I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to “small data” and “traditional” “tools” for their analysis, while data science is focused on Big Data, Big Questions, and innovative new methods. I’ve also heard about presentations on data science efforts by campus and agency leaders in which the word “statistics” was not mentioned. On the flip side, I have heard from statistics faculty frustrated at the failure of their departments to engage proactively in such efforts."
This concern is not new, though considering the author and source of publication, it has again risen in prominence in the minds of statisticians, prompting a renewed back-and-forth debate over whether or not statistics is data science and/or whether statistics should or should not engage with data science.

Interestingly, it has been my perception that this debate is largely relegated to statistics academia. Applied statisticians in industry tend to be very focused on their immediate objectives, much more likely to cross discipline boundaries to accomplish those objectives, and in general are more "data sciency" than their academic colleagues. And with a few exceptions, the data scientists that I know tend to hold statistics (and mathematics) knowledge as very fundamental to doing data science "right." Which is to say, they don't perceive much of a schism at all and I think many would argue that data science is making statistics more important, not less.

Nevertheless, times are changing and change requires adaptation.

But What are "Big Data" and "Data Science"?

Both of the above terms lack a single clear definition. I suspect that a substantial portion of the debate in the data science and statistics communities is born of unclear definitions. People who might otherwise agree are unwittingly talking past each other. So, for the sake of clarity in this article, let me explain what these terms mean to me.

Arguably the more confusing of these is “big data.” Originally, “big data” referred to gigantic data sets, terabytes in size, which pose significant technical challenges for storage, transfer, and computation. This definition is often described by the 3V model: volume, velocity, and variety of data. More recently, however, the term has been used colloquially to refer to a broad range of activities where data are central. Rather than a technical definition, this more recent use could refer to the awakening of non-scientists to the idea that data are important.

Data science is a multidisciplinary field that involves the use and study of data for various purposes, and is actually close to Webster’s definition of statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” However, data science’s roots are largely in the computer science field and are by no means limited to numerical data.

There have been numerous attempts to better define data science. A popular Venn diagram produced in 2010 depicted data science as the intersection of “math and statistics knowledge”, “hacking skills”, and “substantive expertise”. A more recent update by another blogger, pictured below, contends that data science is the union of these skills and possibly more. And while there has been much discussion of the elusive “data scientist” who makes hundreds of thousands of dollars per year, a consensus has been forming more recently that data science is best performed by teams of experts from each of the involved disciplines.

What changed?

Even if one were to adopt the broad definition that statistics is the study of data, and thus "data science" (an argument that has been made many times), there have been substantive changes in the world of data that make things different.

First, data have proliferated. This isn't about the volume of data in a "big data" sense, but rather that data sets exist in large numbers everywhere. While some data sources are better (less biased, less noisy, more complete, etc.) than others, it has become much easier to gather data in general, especially through the web. Now everyone has the ability to do some basic data analysis. More people are analyzing data, coming from different disciplines and using different approaches. So much data makes spurious correlations easier to find; subject matter expertise then becomes all the more important to contextualize and assess the validity of the findings from a data analysis.
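
The point about spurious correlations is easy to demonstrate with a small simulation. The sketch below generates variables that are pure noise, so any "strong" correlation it finds is spurious by construction. The function names, the 0.4 threshold, and the sample size of 30 are arbitrary assumptions of mine, chosen only for illustration.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_pairs(n_vars, n_obs=30, threshold=0.4, seed=0):
    """Count pairs of independent noise variables with |r| above the threshold."""
    rng = random.Random(seed)
    # Every variable is independent standard normal noise.
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(1 for a, b in combinations(data, 2) if abs(pearson(a, b)) > threshold)


for k in (5, 20, 80):
    print(k, "variables:", count_spurious_pairs(k), "pairs with |r| > 0.4")
```

The number of pairs grows roughly with the square of the number of variables, so the count of chance "findings" grows with it even though nothing real is there.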

Tangential to the above, data has gotten popular. The need for compelling visualizations and narratives that convey a complicated story in simple, direct ways has increased. It is no longer the case (if it ever was) that only scientists, trained to deal with complexity, are the consumers of data products. Now anyone with a Blogger account can track the usage statistics of their blog, for example.

Next, as models have become more accurate, they have also become more complex. Ensembles of models are almost always better predictors than any one model. However, ensemble methods make understanding causality very difficult or impossible. And while these models are empirically accurate, their asymptotic properties are often unknown. And because of the number of variables in the model, an additional question arises, "asymptotic to what?" One could take any or all of the number of observations, predictors, models in the ensemble, etc. to infinity and possibly arrive at different solutions. In the age of "big data" asymptotic properties matter.
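
A toy simulation gives some intuition for why averaging helps. It is a deliberately idealized sketch: each "model" is just the truth plus independent noise, which real ensemble members are not, and every name and parameter is a hypothetical choice of mine.

```python
import random
from statistics import mean


def mse_single_vs_ensemble(n_models=25, n_trials=2000, noise_sd=1.0, seed=0):
    """Compare squared error of one noisy predictor with the average of many."""
    rng = random.Random(seed)
    truth = 0.0
    single_sq, ensemble_sq = [], []
    for _ in range(n_trials):
        # Idealized assumption: model errors are independent with equal variance.
        preds = [truth + rng.gauss(0, noise_sd) for _ in range(n_models)]
        single_sq.append((preds[0] - truth) ** 2)
        ensemble_sq.append((mean(preds) - truth) ** 2)
    return mean(single_sq), mean(ensemble_sq)


single_mse, ensemble_mse = mse_single_vs_ensemble()
print("single model MSE:", round(single_mse, 3))
print("ensemble MSE:    ", round(ensemble_mse, 3))
```

The sketch also shows the "asymptotic to what?" problem in miniature: here the error shrinks as `n_models` grows, which is a different limit from letting the number of observations grow.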

Finally, data got bigger in the "big data" sense. Storing, moving, and processing terabytes of data is no small task and not one that is at all "statistical" in nature. There has always been a working relationship between statistics and computer science in the design of statistical software. Now these software engineering tasks are prominent at every phase of many research projects as the volume, velocity, and variety of data must be managed efficiently if any useful analysis is to come from a project.

Whither statistics?

If we assume that the above are true, that changes in the way data are stored, analyzed, and consumed have led us to this new thing called "data science" and that data science is inherently multidisciplinary, where can statistics best place itself for the future?

The more things change, the more they remain the same.

In an age of "big data" the role of the statistician remains largely unchanged, though the models and the distributions may be new or at least less-studied than others. As we move towards more complex statistical and machine-learned models, there is still a need to understand the properties of and to get inferences from these models, beyond just prediction.

For example, suppose a data scientist develops an empirically accurate deep learning model predicting whether or not (or how likely it is that) a patient will develop a particular type of cancer in the next 5 years. How can the doctor recommend that the patient change his or her behaviors to reduce the risk of cancer without knowing which variables influence the prediction and in which ways, lacking explicit model structure and parameters? What if a positive prediction may require costly and disruptive preemptive procedures? How must doctor and patient balance the uncertainty of a positive prediction against those costs without a distribution from which to construct a prediction interval? What if, unbeknownst to doctor, patient, and data scientist, the model predictions are inconsistent with respect to the number of visible predictive units, of which there are thousands?

These questions are very fundamental problems that statisticians have been studying for a long time in different contexts. But now, we are using different estimation methods and some of these fundamental questions must be answered anew. At the risk of sounding grandiose, it is as though we are re-discovering linear regression and how to use it with confidence. (Pun intended.)

In fact, some of these areas of research are already being tackled. Within the realm of random forest: Gerard Biau, Luc Devroye, and Gabor Lugosi (of France, Canada, and Spain respectively) have demonstrated the consistency of random forest and other averaging classifiers. Stefan Wager, Trevor Hastie, and Bradley Efron from Stanford have proposed methods for standard errors of predictions from bootstrapped and bagged learners. In a forthcoming paper, Abhijit Dasgupta and his co-authors propose a method for estimating effect size of predictors in "black box" algorithms like random forest or deep-learning classifiers.

And in a world that is almost constantly streaming data, careful research design and data collection are as important as ever. Sometimes it may seem like common sense to realize you have a biased sample, but how many articles have you seen about the link between social media and the Arab Spring? (I will explicitly point out that those in developing countries most likely to have the economic means to use social media are not a representative sample of the populations of those same countries.) This is especially important in areas where the source of data is born of the internet and requires people to implicitly or explicitly opt in. These are challenges that survey statisticians face regularly.
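
A small simulation illustrates why sheer sample size does not rescue an opt-in sample. Everything here is hypothetical: the trait being measured, the opt-in probabilities (0.9 versus 0.1), and the sample sizes are assumptions chosen only to make the effect visible.

```python
import random
from statistics import mean


def simulate_opt_in_bias(population_size=100_000, srs_size=500, seed=0):
    """Compare a huge opt-in sample with a small simple random sample."""
    rng = random.Random(seed)
    # Hypothetical trait (say, standardized income) across a whole population.
    population = [rng.gauss(0, 1) for _ in range(population_size)]
    # Opt-in sample: people with above-average values are far more likely to appear.
    opt_in = [x for x in population if rng.random() < (0.9 if x > 0 else 0.1)]
    # Simple random sample: everyone has the same chance of selection.
    srs = rng.sample(population, srs_size)
    return {
        "population_mean": mean(population),
        "opt_in_mean": mean(opt_in),
        "opt_in_n": len(opt_in),
        "srs_mean": mean(srs),
        "srs_n": len(srs),
    }


for name, value in simulate_opt_in_bias().items():
    print(name, round(value, 3))
```

Under these assumptions the opt-in sample is roughly a hundred times larger than the random sample, yet its mean sits much further from the population mean.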

But sometimes things just change.

While many of the fundamental problems facing statisticians are the same, the applications and environment are different. Statistics education, particularly at the graduate level, must adapt. As data gets "bigger" and research and applications become more multidisciplinary, the need for a statistician to communicate and collaborate with a wide range of professionals and laypeople increases.

First, future statistics education should provide a minimum competency in the fundamentals of computer science. Statisticians are not known for being able to program well or across platforms; this must change. While it may not be the statistician who optimizes an approach to scale it up, the statistician must work closely with software engineers to develop solutions that can scale and to ensure that the scaled solution still has the properties of the statistical solution developed. The best way for this to happen is if both the statistician and the computer scientist understand at least the basics of the technical and conceptual challenges the other faces.

Next, many of the examples and basic applications taught in foundation statistics courses may need to be updated. For example, an understanding of ensemble methods is going to be as important to a statistician's basic knowledge as linear regression. And it wouldn't hurt to have spent at least one lecture learning about concepts like Zipf's and Heaps' laws as analysis of unstructured data from text becomes more common.
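
For readers who have not met these two laws: Zipf's law says a word's frequency is roughly inversely proportional to its frequency rank, and Heaps' law says vocabulary size grows sublinearly with the length of a text. The sketch below only computes the two raw quantities involved (a rank-frequency table and a vocabulary growth curve). The function names and the toy sentence are mine, and a text this short cannot actually exhibit either law.

```python
from collections import Counter


def rank_frequency(tokens):
    """Zipf's law input: (rank, word, count) tuples, most frequent first."""
    counts = Counter(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def vocabulary_growth(tokens):
    """Heaps' law input: number of distinct words seen after each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy example; a real analysis would use a large corpus and proper tokenization.
tokens = "the cat sat on the mat and the dog sat on the log".split()

print(rank_frequency(tokens)[:3])
print(vocabulary_growth(tokens))
```

On a real corpus, one would typically plot both on log-log axes and look for approximately straight lines.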

Finally, communication to non-technical audiences is becoming more important. Anecdotally, this is an area where many technical fields, including statistics, are already weak. But as data products become more and more mainstream, an effective statistician must be able to get his or her message heard by a wider variety of audiences. Many statistics programs require a statistics workshop course to this end. Perhaps courses on effective data visualization would also be helpful.

In sum...

It may have once been true that statistics was data science, but moving forward data science is a fully multidisciplinary field. That said, statistics is unambiguously one of those disciplines and its role remains largely unchanged. Statisticians are still needed for research design, uncertainty quantification, and the derivation and interpretation of the properties of models and methods. I would argue, then, that the role of the statistician is to keep the science in data science.