Thursday, May 8, 2014

Jeff Leek asks questions near and dear to my heart

In the latest post on Simply Statistics, Jeff Leek asks some good questions:

  1. Given the importance of statistical thinking why aren't statisticians involved in these initiatives?
  2. When thinking about the big data era, what are some statistical ideas we've already figured out?
I'd say that (1) is changing, if slowly. But (2) is a good message for non-statistical folks in the data science community. Statistics is a field that is both wide and deep. There are many pressing data science problems that have been addressed in some fashion by someone in the statistics community. In many cases, we don't need to reinvent the wheel.

One area that I see as being quite underdeveloped in data science is how it approaches time series data. For that, we should look to the econometricians as much as to the statisticians. (What's the difference between an econometrician and a statistician? About $15K a year. Boom.) I am a fan of David Hendry's approach and I think the data science community would like it as well. He calls it "general to specific" modeling, and I've seen a similar approach used to build machine-learning models.

Oh, but I'm off topic...

Anyway, the title of Leek's post, "Why big data is in trouble: they forgot about applied statistics," is a bit melodramatic. Big data isn't in trouble because big data isn't going anywhere. (By big data, I mean the concept of a data-driven world.) As I said in an earlier post,
 It may be tempting to see [Google Flu] as justification that big data/data science is just media buzz. However, the technology that makes acquiring these data easy is here to stay. Reconciling statistical best-practices and big data is actively being discussed in the data science and big data communities. 

I look forward to the day that "data science" applications become mainstream in the statistics community. Then we'll really be cookin' with Crisco!


  1. can you elaborate on why you think 'data science' hasn't developed approaches to time-series data?

    (i'd like a diversion from studying for my time series analysis course final)

  2. Haha. Of course, time series analysis is hard!

    At the end of the day, I don't see sophisticated time series forecasting coming out of data science. The data vis folks just plot series next to each other over time (in very pretty ways), but trends can be very misleading. The ML folks tend to make Markovian assumptions in their models. That assumption rarely holds in real-world temporal data, which is part of what makes time series challenging. But you're probably neck-deep in that stuff now. :)

    I am also excited to see what ML can do in a time series context, which is still dominated by regression so far as I can tell.

  3. this comes down to modeling (understanding through eqns) vs (black box) machine learning predicting. ARIMA, *GARCH, all models created by econometricians are only halfway decent b/c there was some economic intuition behind creating them. i give the ML folks credit for being creative but i just can't accept that a (general) algorithm is going to perform as well as a model. even if the algos can predict just as well... what do you learn from it?? isn't that what data science is about? ..trying to extract knowledge from data?

    and as an aside: i don't like it when viz is regarded only as pretty pictures rather than as a tool for understanding, say, by trying to visualize structure in the data.