Thursday, April 24, 2014

From 1998 (!): Statistics should be rebranded as "data science."


Identity of statistics in science examined
C. F. Jeff Wu, professor of statistics, will present “Statistics = Data Science” at 4:10 p.m. Nov. 10 in Rackham Amphitheater. The lecture, in honor of Wu’s appointment to the H. C. Carver Collegiate Professorship in Statistics, will focus on the identity of statistics in science. Contrary to the perception of statistics as tables and figures, Wu characterizes statistical work as data modeling, analysis and decision making. He will conclude his lecture by proposing that statistics be renamed “data science” and statisticians “data scientists.”

I'm still processing this... 1998! That was 16 years ago. More on Jeff Wu is available on Wikipedia.

Tuesday, April 22, 2014

It's all about the Beta

I am presenting a paper at this year's Joint Statistical Meetings (JSM). (It is also my first JSM.) The abstract is below.
Latent Dirichlet Allocation (LDA) is a popular hierarchical Bayesian model used in text mining. LDA models corpora as mixtures of categorical variables with Dirichlet priors. LDA is a useful model, but it is difficult to evaluate its effectiveness; the process that LDA models is not how people generate real language. Monte Carlo simulation is one approach to generating data where the "right" answers are known a priori. But sampling from the Dirichlet distributions that are often used as priors in LDA does not generate corpora with the property of natural language known as Zipf's law. We explore the relationship between the Dirichlet distribution and Zipf's law within the framework of LDA. Considering Zipf's law allows researchers to more easily explore the properties of LDA and make more informed a priori decisions when modeling real textual data.
I will cut to the chase: If you generate data with a process mimicking LDA, the term frequency of the generated corpus depends only on beta, the Dirichlet parameter for topics distributed over words. Alpha factors out and sums to one.
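To make the claim concrete, here's a minimal simulation of the LDA generative process in Python with numpy. All parameter values (numbers of documents, topics, vocabulary size, and the hyperparameters themselves) are hypothetical choices for illustration. It generates two corpora with the same beta but wildly different alphas; the sorted (rank-frequency) term-frequency curves come out nearly identical, consistent with the statement above.

```python
import numpy as np

def generate_lda_corpus(D, N, K, V, alpha, beta, rng):
    """Simulate the LDA generative process.

    D docs of N tokens each, K topics, a V-word vocabulary, and
    symmetric Dirichlet hyperparameters alpha (doc-topic) and
    beta (topic-word). Returns a length-V array of term counts.
    """
    # One topic-word distribution per topic, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(V, beta), size=K)
    counts = np.zeros(V, dtype=int)
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))  # this doc's topic mixture
        z = rng.choice(K, size=N, p=theta)        # a topic for each token
        for k, n_k in zip(*np.unique(z, return_counts=True)):
            words = rng.choice(V, size=n_k, p=phi[k])
            np.add.at(counts, words, 1)
    return counts

# Same beta (and the same seed, so both runs draw the same topics phi),
# but two very different alphas.
low = generate_lda_corpus(D=500, N=100, K=10, V=500, alpha=0.1,
                          beta=0.1, rng=np.random.default_rng(1))
high = generate_lda_corpus(D=500, N=100, K=10, V=500, alpha=10.0,
                           beta=0.1, rng=np.random.default_rng(1))

# Rank-frequency (Zipf-style) curves: sort term frequencies descending.
f_low = np.sort(low)[::-1] / low.sum()
f_high = np.sort(high)[::-1] / high.sum()
print(np.abs(f_low - f_high).sum())  # small: alpha barely moves the curve
```

Plot `f_low` and `f_high` on log-log axes against rank and the two curves sit on top of each other; beta is what shapes them.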

What does it mean? You'll have to come see me talk to find out. ;)

If you'll be there, it's session 617 on the last day of the conference, August 7. They've got me slated for 8:30 AM; don't drink too much the night before.

Some LDA resources I've found helpful:


Jonathan Chang's lda package for R. (It converges much faster than topicmodels. I am personally not a fan of topic modeling with MALLET, in R or Java.)

Wikipedia uses LDA as an example of a Dirichlet Multinomial distribution. (For the record, and with no offense to David Blei or any of the other brilliant folks doing topic modeling research, this Wikipedia example is much easier to understand than any "official" explanation I've read in a research paper so far.)

The BEST short paper on Gibbs sampling to fit/learn an LDA model.

What makes LDA better than pLSA? Why is Gibbs sampling different from variational Bayes? It's all about the priors, stupid.

Goldwater, Griffiths, and Johnson almost scooped me (in 2011). While they aren't as explicit about the link between LDA and Zipf's law as I am (will be?), they present a general framework for linguistic models of which LDA is a specific case.

Hey, are you modeling language? You should be reading Thomas Griffiths. At the very least, read this article. Ok, ok. If you're actually interested in understanding causality in language models, then you should read Griffiths.
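Since two of the links above are about Gibbs sampling and the role of the priors, here's a minimal, unoptimized sketch of the standard collapsed Gibbs sampler for LDA in Python. The toy corpus and the hyperparameter values are hypothetical, chosen purely for illustration; the point is that the Dirichlet priors enter only as the pseudo-counts alpha and beta in each token's conditional.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha, beta, iters, rng):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of integer word ids in [0, V).
    alpha, beta: symmetric Dirichlet hyperparameters.
    Returns the topic-word count matrix after `iters` full sweeps.
    """
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
    ndk = np.zeros((len(docs), K))  # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # tokens assigned to each topic
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment...
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # ...then resample it. The priors appear only as the
                # pseudo-counts alpha and beta in this conditional.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Toy corpus: two groups of docs with disjoint vocabularies (0-3 vs 4-7).
docs = ([[0, 1, 2, 3, 0, 1] for _ in range(5)]
        + [[4, 5, 6, 7, 4, 5] for _ in range(5)])
nkw = collapsed_gibbs_lda(docs, K=2, V=8, alpha=0.1, beta=0.01,
                          iters=50, rng=np.random.default_rng(0))
```

With a corpus this separable, the sampler should put the two vocabulary halves into different topics, but Gibbs output is stochastic (and topic labels are arbitrary), so inspect `nkw` rather than expecting a fixed labeling.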

Friday, April 18, 2014

Friday links April 18, 2014

This is the "where did you go?" edition of Friday links. My excuse is a pretty lame one: I've been organizing my files. Specifically, I have data on three laptops, a desktop, an external hard drive, four thumb drives, three cell phones... You get the idea. I've been separating the wheat from the chaff, storing the former, and then wiping the disks. It's more time-consuming than I'd thought.

Nevertheless, you care about links and not my excuses!


More on how R works, AKA how to not fail at coding in R.

Someone just brought r/dataisbeautiful to my attention.

SAS vs R

"Instead of programming people to act like robots, why not teach them to become programmers, creative thinkers, architects, and engineers?"


Wednesday, April 2, 2014

Google Flu + Data Science and Statisticians (again)

Several people have emailed me with articles on the fallout from Google Flu's big flop in the past week. A Financial Times article in particular stood out to me as an excellent statement on the state of data science vis-a-vis statistical rigor.

A common criticism of big data/data science is that exuberance has caused folks to erroneously believe they can ignore basic statistical principles if they have “big data” and that statistics is only for “small data.” This seems to be what happened to Google Flu and the Financial Times makes that same case.

Big data and data science have most certainly been overhyped recently. It may be tempting to see this as confirmation that big data/data science is just media buzz. However, the technology that makes acquiring these data easy is here to stay.

Reconciling statistical best practices and big data is actively being discussed in the data science and big data communities. (I can point to this post at Data Community DC as one piece of evidence.) There are also several university statistics programs that are actively bringing statistical rigor to big data/data science issues. (Stanford and Berkeley come immediately to mind.)

My tactic when discussing these issues in professional settings has been to be a voice of caution if the audience is excited and a voice of optimism if the audience is skeptical. Google Flu is a perfect example of the promise and peril associated with big data and data science; the timing and volume of the data can add predictive power, but poor design can lead to models that confidently point to the wrong answer. (We statisticians call this "bias.")