Thursday, November 5, 2015

More on statisticians in data science

The November issue of AMSTAT News has published an opinion piece by yours truly on the identity of statisticians in data science. My piece starts on page 25 of the print version. The online version is here. A quote:

I am not convinced that statistics is data science. But I am convinced that the fundamentals of probability and mathematical statistics taught today add tremendous value and cement our identity as statisticians in data science.
Please read the whole thing.

Friday, May 8, 2015


I had an accident yesterday.

What happened? When creating a matrix of zeros I accidentally typed matrix() instead of Matrix().

What's the difference? 4.8 terabytes versus less than one GB. I was creating a document-term matrix of about 100,000 documents with a vocabulary of about 6,000,000 tokens. This is the thing with linguistic data: one little mistake is the difference between working on a MacBook Air with no fuss and something that would make a supercomputer choke. (Anyone want to get me a quote on hardware with 5 TB of RAM?)
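For a sense of scale, here's a back-of-the-envelope sketch (in Python rather than R, and the tokens-per-document figure is a made-up assumption):

```python
# Dense vs. sparse storage for a 100,000 x 6,000,000 document-term matrix.
n_docs, n_terms = 100_000, 6_000_000

# Dense: every cell stored as an 8-byte double, zeros included.
dense_bytes = n_docs * n_terms * 8
print(f"dense: {dense_bytes / 1e12:.1f} TB")  # dense: 4.8 TB

# Sparse (CSR): only nonzero counts are stored. Assume each document
# touches ~200 unique terms (a hypothetical figure for illustration).
nnz = n_docs * 200
sparse_bytes = nnz * (8 + 4) + (n_docs + 1) * 8  # values + column indices + row pointers
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")  # sparse: 0.24 GB
```

Almost every cell in a document-term matrix is zero, which is why the sparse representation wins by four orders of magnitude.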

Wednesday, January 14, 2015

Are microeconomists data scientists?

From The Economist:

Armed with vast data sets produced by tech firms, microeconomists can produce startlingly good forecasts of human behaviour. Silicon Valley firms have grown to love them: by bringing a cutting-edge economist in house, they are able to predict what customers or employees are likely to do next.

Sounds like data scientists to me. The article is here. There's a related piece here.

Friday, January 9, 2015

Introducing R-squared for Topic Models

I have a new working paper added to my publications page. The abstract reads:
This document proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or R2. Within the context of topic modeling, R2 has the same interpretation that it does when used in a broader class of statistical models. Reporting R2 with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. This paper proposes that R2 should be reported as a standard metric when constructing topic models.
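The coefficient of determination keeps its usual form here: R2 = 1 - SS_resid / SS_total, applied to reconstructed term counts. A minimal sketch of the idea in Python (the paper's exact construction may differ; `theta` and `phi` are generic stand-ins for the estimated document-topic and topic-word matrices, and the mean document is one reasonable choice of baseline):

```python
import numpy as np

def topic_model_r2(dtm, theta, phi):
    """Coefficient of determination for a topic model.

    dtm:   documents x words matrix of observed term counts
    theta: documents x topics matrix, rows are P(topic | document)
    phi:   topics x words matrix, rows are P(word | topic)
    """
    doc_lengths = dtm.sum(axis=1, keepdims=True)
    predicted = doc_lengths * (theta @ phi)   # expected term counts per document
    ss_resid = np.sum((dtm - predicted) ** 2)
    # Baseline: deviation from the mean document.
    ss_total = np.sum((dtm - dtm.mean(axis=0)) ** 2)
    return 1.0 - ss_resid / ss_total

# Tiny toy example: a model that reproduces the counts exactly gets R^2 = 1.
theta = np.array([[1.0, 0.0], [0.0, 1.0]])
phi = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]])
dtm = np.array([[2.0, 2.0, 0.0], [0.0, 2.0, 2.0]])
print(topic_model_r2(dtm, theta, phi))  # 1.0
```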
In researching this paper, I came across a potentially significant (in the colloquial sense) finding.

These properties of R2 compel a change in focus for the topic modeling research community. Document length is an important factor in model fit, whereas the number of documents is not. When choosing between fitting a topic model on short abstracts, full-page documents, or multi-page papers, it appears that more text is better. Second, since model fit is invariant for corpora over 1,000 documents (our lower bound for simulation), sampling from large corpora should yield reasonable estimates of population parameters. The topic modeling community has heretofore focused on scalability to large corpora of hundreds of thousands to millions of documents. Little attention has been paid to document length, however. These results, if robust, indicate that focus should move away from larger corpora and towards lengthier documents.


  1. Stop building topic models on tweets and stop using the 20 news groups data.
  2. Your argument of needing gigantic corpora (and algorithms that scale to them) is invalid.
  3. Citing points (1) and (2), I am prone to hyperbole.
  4. Citing points (1) through (3), I like numbered lists.

In all seriousness, if you take the time to read the paper and have any comments, please let me know. You can comment below, use the "contact me directly" tool on your right, or email me (if you know my email already). I'll be circulating this paper among friends and colleagues to get their opinions as well.

Tuesday, December 23, 2014

Labeling topics from topic models

My friend Charlie Greenbacker showed me LDAvis, which creates interactive apps for visualizing topics from topic models. As often happens, this led to a back and forth between Charlie and me covering a range of topic modeling issues. I've decided to share and expand on bits of the conversation over a few posts.

The old way

The traditional method for reporting a topic (let's call it topic X) is to list the top 3 to 5 words in topic X. What do I mean by "top" words? In topic modeling, a "topic" is a probability distribution over words. Specifically, "topic X" is really P( words | topic X ).

Here is an example from the empirical section of the R-squared paper I'm working on:

"mode"    "model"   "red"     "predict" "tim"     "data" 

 A few things pop out pretty quickly.

  1. These are all unigrams. The model includes bigrams, but they aren't at the top of the distribution. 
  2. A couple of the words seem to be truncated. Is "mode" supposed to be "model"? Is "tim" supposed to be "time"? It's really hard to tell without any context. (Even if these are truncated, it wouldn't greatly affect the fit of the model. It just makes it look ugly.)
  3. From the information we have, a good guess is that this topic is about data modeling or prediction or something like that.

The incoherence of these terms on their own requires topic modelers to spend massive amounts of time curating a dictionary for their final model. If you don't, you may end up with a topic that looks like topic 1 in this example. Good luck interpreting that!

(As a complete aside, it looks like topic 1 is the result of using an asymmetric Dirichlet prior for topics over documents. Those that attended my DC NLP talk know that I have ambivalent feelings about this.)

A new approach

I'm going to get a little theoretical on you: Zipf's law tells me that, in theory, the most probable terms in every topic should be stop words. Think about it. When I'm talking about cats, I still use words like "the", "this", "an", etc. waaaaay more than any cat-specific words. (That's why Zipf's law is, well... a law.)

Even if we remove general stop words before modeling, I probably have a lot of corpus-specific stop words. Pulling those out, while trying to preserve the integrity of my data, is no easy task. (It's also a little like performing surgery with a chainsaw.) That's why so much time is spent on vocabulary curation.

My point is that I don't think P( words | topic X ) is the right way to look at this. Zipf's law means that I expect the most probable words in that distribution to contain no contextual meaning. All that dictionary curation isn't just time consuming, it's perverting our data.

But what happens if we throw a little Bayes' Theorem at this problem? Instead of ordering words by P( words | topic X ), let's order them according to P( topic X | words ).
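Here's a toy illustration of the flip (all numbers are invented; `phi` rows are P( words | topic ) and `p_topic` is overall topic prevalence). Word 0 plays the role of a stop-word-like term that is highly probable in every topic:

```python
import numpy as np

# Two topics over a four-word vocabulary; rows of phi are P(word | topic).
phi = np.array([
    [0.50, 0.30, 0.15, 0.05],   # topic 0
    [0.40, 0.10, 0.10, 0.40],   # topic 1
])
p_topic = np.array([0.7, 0.3])  # overall prevalence of each topic

# Bayes' theorem: P(topic | word) = P(word | topic) * P(topic) / P(word)
joint = phi * p_topic[:, None]
p_topic_given_word = joint / joint.sum(axis=0, keepdims=True)

# Rank topic 0's words both ways:
order_old = np.argsort(-phi[0])                  # by P(word | topic)
order_new = np.argsort(-p_topic_given_word[0])   # by P(topic | word)
print(order_old)  # word 0 (common everywhere) ranks first
print(order_new)  # word 0 drops; words distinctive of topic 0 rise
```

Because word 0 is probable under both topics, it carries little evidence about which topic generated it, so the Bayes ordering demotes it.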

"prediction_model" "causal_inference" "force_field" "kinetic_model" "markov" "gaussian"

As I said to Charlie: "Big difference in interpretability, no?"

Full labels

I think that all of our dictionary curation hurts us beyond being a time sink. I think it makes our models fit the data worse. This has two implications: we have less trust in the resulting analysis and our topics are actually more statistically muddled, not less. 

We (as in I and some other folks who work with me) have come up with some automated ways to label topics. This method works by grouping documents together by topic and then extracting keywords from those documents. (The difference between my approach and my colleagues' is in the keyword extraction step.)

The method basically works like this. For each topic:
  1. Grab a set of documents with a high prevalence of that topic.
  2. In a document-term matrix of bigrams and trigrams, calculate P( words | that set of documents ) - P( words | overall corpus ).
  3. Take the n-gram with the highest score as your label.

My label for "topic X"?


It's not perfect. I have noticed that similar topics tend to get identical labels. The labeling isn't so good at picking up on subtle differences. Some topics are what I call "methods" rather than "subjects". (This is because most of my topic modeling is on scientific research papers.) The "methods" rarely have a high proportion in any document. The document simply isn't "about" its methods; it's about its subject. When this happens, sometimes I don't get any documents to go with a methods topic. The labeling algorithm just returns "NA". No bueno.
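For concreteness, the steps above might be sketched like this in Python (a minimal sketch: I'm glossing over how the high-prevalence documents get selected, documents are pre-tokenized into n-grams, and every name and value here is hypothetical):

```python
from collections import Counter

def label_topic(topic_docs, all_docs):
    """Label a topic by scoring each n-gram with
    P( ngram | topic's documents ) - P( ngram | overall corpus )
    and returning the top scorer. Documents are lists of n-gram tokens."""
    def rel_freq(docs):
        counts = Counter(g for doc in docs for g in doc)
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}
    p_topic = rel_freq(topic_docs)
    p_corpus = rel_freq(all_docs)
    return max(p_topic, key=lambda g: p_topic[g] - p_corpus.get(g, 0.0))

# Toy corpus of pre-extracted bigrams (made-up values):
corpus = [
    ["the_cat", "neural_net"],
    ["the_cat", "causal_inference"],
    ["the_cat", "neural_net", "deep_learning"],
]
high_prevalence = [corpus[0], corpus[2]]   # documents where the topic is prevalent
print(label_topic(high_prevalence, corpus))  # neural_net
```

Note that the corpus-wide "the_cat" scores near zero under the subtraction, which is exactly the point: terms common everywhere can't win the label.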

One last benefit

By not butchering the statistical signals in our documents by heavy-handed dictionary curation, we get some nice properties in the resulting model. One, for example, is that we can cluster topics together cleanly. So, I can create a nice hierarchical dendrogram of all my topics. (I can also use the labeling algorithm to label groups higher up on the tree if I want.)
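A sketch of how that clustering can work (in Python; the distance choice and all values are illustrative assumptions, not necessarily what I use): treat each topic's P( words | topic ) row as a probability distribution and compute pairwise distances between them.

```python
import numpy as np

# Toy stand-in for an estimated topic-word matrix: rows are P(word | topic).
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(50), size=10)   # 10 topics, 50-word vocabulary

# Hellinger distance between topics -- a natural metric for probability
# distributions, bounded in [0, 1].
root = np.sqrt(phi)
dist = np.sqrt(0.5 * ((root[:, None, :] - root[None, :, :]) ** 2).sum(axis=-1))

# `dist` can then be fed to hierarchical clustering (e.g.,
# scipy.cluster.hierarchy.linkage / dendrogram) to draw a topic tree.
print(dist.shape)  # (10, 10)
```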

You can check out one of the dendrograms I'm using for the R-squared paper by clicking here. The boxes are clusters of topics based on linguistic similarity and document occurrence. (It's easier to see if you zoom in.) It's a model of 100 topics on 10,000 randomly-sampled NIH grant abstracts. (You can get your own here.)

Wednesday, December 17, 2014

Notes on the culture of economics

I'm finally getting around to reading Piketty's Capital in the 21st Century. That and a project at work have put economics back at the front of my brain. I found the posts below interesting.

Paul Krugman says in "Notes on the Floating Crap Game (Economics Inside Baseball)"

So, academic economics is indeed very hierarchical; but I think it’s important to understand that it’s not a bureaucratic hierarchy, nor can status be conferred by crude patronage. The profession runs on reputation — basically the shared perception that you’re a smart guy. But how do you get reputation? [...] [R]eputation comes out of clever papers and snappy seminar presentations. 
[...]  Because everything runs on reputation, a lot of what you might imagine academic politics is like — what it may be like in other fields — doesn’t happen in econ. When young I would have relatives asking whether I was “in” with the department head or the senior faculty in my department, whether I was cultivating relationships, whatever; I thought it was funny, because all that mattered was your reputation, which was national if not global.

Not all that Krugman says is rosy for economists. Nevertheless, this is consistent with my experience when I was in economics. Econ has a hierarchical structure, but it's not based on patronage or solely on "length of service." For example, when I was at the Fed, the internal structure was quite hierarchical in terms of both titles and managerial responsibility. (It kind of reminded me of the military.) However, it also had a paradoxically "flat" culture. Ideas were swapped and debated constantly. Though I was a lowly research assistant, my forecasts were respected and my input listened to. I was no exception; this was just how we operated.

Simply Statistics brought another post to my attention. From Kevin Drum at Mother Jones: Economists are Almost Inhumanly Impartial.

Over at 538, a team of researchers takes on the question of whether economists are biased. Given that economists are human beings, it would be pretty shocking if the answer turned out to be no, and sure enough, it's not. In fact, say the researchers, liberal economists tend to produce liberal results and conservative economists tend to produce conservative results. This is unsurprising, but oddly enough, I'm also not sure it's the real takeaway here. [...]
 What I see is a nearly flat regression line with a ton of variance. [...] If these results are actually true, then congratulations economists! You guys are pretty damn evenhanded. The most committed Austrians and the most extreme socialists are apparently producing numerical results that are only slightly different. If there's another field this side of nuclear physics that does better, I'd be surprised.

(I'll leave it to you to check out the regression line in question.)

Simply Statistics's Jeff Leek has a different take:
I'm not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line.

I don't know what Leek is getting at exactly; maybe we agree. What I see is a nearly flat line through a cloud of points. My take isn't that economists are unbiased. Rather, their bias is generally uncorrelated with their ideology. That's still a good thing, right? (Either way, I am not one for the philosophy of p < 0.05 means it's true and p > 0.05 means it's false.)

Here's what I've told other people: microeconomics is about as close to a science as you're going to get. It's a lot like studying predator-prey systems in the wild. There's definitely stochastic variation, but the trends are pretty clear; not much to argue about. Macroeconomics, on the other hand, is a lot trickier. It's not that macroeconomists are any less objective than microeconomists. Rather, measurement and causality are much trickier. In the resulting vacuum, there's room for different assumptions and philosophies. This is what macroeconomists debate about.

Nevertheless, my experience backs up a comment to Drum's article:
Economists generally avoid and form consensus in regard to fringe theories. 

Translation: the differences in philosophy between macroeconomists aren't as big as you'd think. And they're tiny compared to our political differences.