Tuesday, December 23, 2014

Labeling topics from topic models

My friend Charlie Greenbacker showed me LDAvis, which creates interactive apps for visualizing topics from topic models. As often happens, this led to a back and forth between Charlie and me covering a range of topic modeling issues. I've decided to share and expand on bits of the conversation over a few posts.

The old way


The traditional method for reporting a topic (let's call it topic X) is to list the top 3 to 5 words in topic X. What do I mean by "top" words? In topic modeling a "topic" is a probability distribution over words. Specifically, "topic X" is really P( words | topic X), and the "top" words are the ones with the highest probability under that distribution.
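To make "top words" concrete, here's a minimal sketch in Python. The vocabulary and probabilities are made up for illustration (they are not from the model in this post), and top_words is a hypothetical helper name.

```python
# Toy topic: a probability distribution over a (hypothetical) vocabulary.
# The numbers are invented for illustration, not taken from a fitted model.
topic_x = {
    "model": 0.30,
    "data": 0.25,
    "predict": 0.20,
    "time": 0.15,
    "prediction_model": 0.10,
}


def top_words(word_probs, n=5):
    """Return the n words with the highest P(word | topic)."""
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return ranked[:n]


print(top_words(topic_x, n=3))  # ['model', 'data', 'predict']
```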

Here is an example from the empirical section of the R-squared paper I'm working on:

"mode"    "model"   "red"     "predict" "tim"     "data" 

 A few things pop out pretty quickly.

  1. These are all unigrams. The model includes bigrams, but they aren't at the top of the distribution. 
  2. A couple of the words seem to be truncated. Is "mode" supposed to be "model"? Is "tim" supposed to be "time"? It's really hard to tell without any context. (Even if these are truncated, it wouldn't greatly affect the fit of the model. It just makes it look ugly.)
  3. From the information we have, a good guess is that this topic is about data modeling or prediction or something like that.

The incoherence of these terms on their own requires topic modelers to spend massive amounts of time curating a dictionary for their final model. If you don't, you may end up with a topic that looks like topic 1 in this example. Good luck interpreting that!

(As a complete aside, it looks like topic 1 is the result of using an asymmetric Dirichlet prior for topics over documents. Those who attended my DC NLP talk know that I have ambivalent feelings about this.)


A new approach


I'm going to get a little theoretical on you: Zipf's law tells me that, in theory, the most probable terms in every topic should be stop words. Think about it. When I'm talking about cats, I still use words like "the", "this", "an", etc. waaaaay more than any cat-specific words. (That's why Zipf's law is, well... a law.)

Even if we remove general stop words before modeling, I probably have a lot of corpus-specific stop words. Pulling those out, while trying to preserve the integrity of my data, is no easy task. (It's also a little like performing surgery with a chainsaw.) That's why so much time is spent on vocabulary curation.

My point is that I don't think P( words | topic X) is the right way to look at this. Zipf's law means that I expect the most probable words in that distribution to contain no contextual meaning. All that dictionary curation isn't just time consuming, it's perverting our data.

But what happens if we throw a little Bayes' Theorem at this problem? Instead of ordering words by P( words | topic X), let's order them according to P( topic X | words), which Bayes' Theorem gives us as P( words | topic X) * P( topic X) / P( words).

"prediction_model" "causal_inference" "force_field" "kinetic_model" "markov" "gaussian"

As I said to Charlie: "Big difference in interpretability, no?"
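
Here's a rough sketch of that reordering in Python, using a toy two-topic model. Everything in it is an assumption for illustration: the vocabulary, the numbers in phi, the topic prevalences in p_topic, and the helper names. The only real ingredient is Bayes' Theorem itself.

```python
# phi[k][w] = P(word w | topic k); each row sums to 1. Toy numbers only.
vocab = ["the", "model", "prediction_model", "cat"]
phi = [
    [0.60, 0.30, 0.10, 0.00],  # topic 0: a "modeling" topic
    [0.65, 0.05, 0.00, 0.30],  # topic 1: a "cats" topic
]
# P(topic k): overall prevalence of each topic in the corpus (assumed).
p_topic = [0.5, 0.5]


def topic_given_word(phi, p_topic):
    """Apply Bayes' theorem: P(k | w) = P(w | k) P(k) / sum_j P(w | j) P(j)."""
    n_topics, n_words = len(phi), len(phi[0])
    result = [[0.0] * n_words for _ in range(n_topics)]
    for w in range(n_words):
        p_word = sum(phi[j][w] * p_topic[j] for j in range(n_topics))
        for k in range(n_topics):
            result[k][w] = phi[k][w] * p_topic[k] / p_word if p_word > 0 else 0.0
    return result


def rank_words(scores_for_topic, vocab):
    """Order vocabulary words from highest to lowest score."""
    order = sorted(range(len(vocab)), key=lambda w: scores_for_topic[w], reverse=True)
    return [vocab[w] for w in order]


gamma = topic_given_word(phi, p_topic)
print(rank_words(phi[0], vocab))    # ['the', 'model', 'prediction_model', 'cat']
print(rank_words(gamma[0], vocab))  # ['prediction_model', 'model', 'the', 'cat']
```

In the toy data, "the" is the most probable word in both topics, so it tops the first list. It says almost nothing about which topic you're in, so it sinks in the second list, and the word that only shows up in topic 0 floats to the top.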

Full labels


I think that all of our dictionary curation hurts us beyond being a time sink. I think it makes our models fit the data worse. This has two implications: we have less trust in the resulting analysis and our topics are actually more statistically muddled, not less. 

We (as in I and some other folks who work with me) have come up with some automated ways to label topics. They all work by grouping documents together by topic and then extracting keywords from the documents. (The difference between my approach and that of the other guys I work with is in the keyword extraction step.)

The method basically works like this. For each topic:
  1. Grab a set of documents with high prevalence of that topic.
  2. In a document term matrix of bigrams and trigrams, calculate P( words | that set of documents) - P( words in the overall corpus ).
  3. Take the n-gram with the highest score as your label.

My label for "topic X"?

"statistical_method"

It's not perfect. I have noticed that similar topics tend to get identical labels. The labeling isn't so good at picking up on subtle differences. Some topics are what I call "methods" rather than "subjects". (This is because most of my topic modeling is on scientific research papers.) The "methods" rarely have a high proportion in any document. The document simply isn't "about" its methods; it's about its subject. When this happens, sometimes I don't get any documents to go with a methods topic. The labeling algorithm just returns "NA". No bueno.
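
For the curious, the labeling steps can be sketched in a few lines of Python. This is a simplified illustration, not the code I actually run: the 0.5 prevalence cutoff, the toy documents, the theta matrix (share of each topic in each document), and the function names are all assumptions. It does reproduce the "NA" behavior for topics that never dominate a document.

```python
from collections import Counter


def ngrams(tokens, sizes=(2, 3)):
    """Yield bigrams and trigrams joined with underscores."""
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            yield "_".join(tokens[i:i + n])


def ngram_probs(docs):
    """Relative frequency of each n-gram across a list of tokenized documents."""
    counts = Counter(g for doc in docs for g in ngrams(doc))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def label_topics(docs, theta, threshold=0.5):
    """Label each topic; theta[d][k] is the share of topic k in document d."""
    corpus_probs = ngram_probs(docs)
    labels = []
    for k in range(len(theta[0])):
        # Documents with a high prevalence of topic k (cutoff is an assumption).
        subset = [doc for doc, row in zip(docs, theta) if row[k] >= threshold]
        subset_probs = ngram_probs(subset)
        if not subset_probs:
            labels.append("NA")  # no document is "about" this topic
            continue
        # Score = P(n-gram | topic's documents) - P(n-gram | whole corpus).
        scores = {g: p - corpus_probs[g] for g, p in subset_probs.items()}
        labels.append(max(scores, key=scores.get))
    return labels


docs = [
    "a statistical method for data".split(),
    "this statistical method works".split(),
    "good cat food is cheap cat food".split(),
]
theta = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.9, 0.0]]
print(label_topics(docs, theta))  # ['statistical_method', 'cat_food', 'NA']
```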


One last benefit


By not butchering the statistical signals in our documents with heavy-handed dictionary curation, we get some nice properties in the resulting model. One, for example, is that we can cluster topics together cleanly. So, I can create a nice hierarchical dendrogram of all my topics. (I can also use the labeling algorithm to label groups higher up on the tree if I want.)

You can check out one of the dendrograms I'm using for the R-squared paper by clicking here. The boxes are clusters of topics based on linguistic similarity and document occurrence. (It's easier to see if you zoom in.) It's a model of 100 topics on 10,000 randomly-sampled NIH grant abstracts. (You can get your own here.)
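
If you want to play with the clustering idea yourself, here's a bare-bones sketch in Python. To be clear about the assumptions: it compares topics only by their word distributions, using Hellinger distance and average linkage. Those are choices I'm making for the example (the dendrogram above also uses document occurrence), and the toy phi matrix and function names are made up.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def cluster_topics(phi):
    """Average-linkage agglomerative clustering; returns the list of merges."""
    clusters = {i: [i] for i in range(len(phi))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the pair of clusters with the smallest average distance.
        for x, a in enumerate(ids):
            for b in ids[x + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(hellinger(phi[i], phi[j]) for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# phi[k] = P(words | topic k) for three toy topics over a four-word vocabulary.
phi = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
for left, right, dist in cluster_topics(phi):
    print(left, right, round(dist, 3))
```

Each merge is one join in the dendrogram: the two similar topics pair up first, and the unrelated one joins last at a much larger distance.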




