The old way
"mode" "model" "red" "predict" "tim" "data"
A few things pop out pretty quickly.
- These are all unigrams. The model includes bigrams, but they aren't at the top of the distribution.
- A couple of the words seem to be truncated. Is "mode" supposed to be "model"? Is "tim" supposed to be "time"? It's really hard to tell without any context. (Even if these are truncated, it doesn't greatly affect the fit of the model; it just makes the labels ugly.)
- From the information we have, the best guess is that this topic is about data modeling or prediction, or something like that.
A new approach
"prediction_model" "causal_inference" "force_field" "kinetic_model" "markov" "gaussian"
As I said to Charlie: "Big difference in interpretability, no?"
- For each topic:
- Grab a set of documents with high prevalence of that topic.
- In a document-term matrix of bigrams and trigrams, calculate P(n-gram | that set of documents) - P(n-gram | the overall corpus) for every n-gram.
- Take the n-gram with the highest score as your label.
- Next topic
It's not perfect. Similar topics tend to get identical labels; the scoring isn't good at picking up on subtle differences between them. And some topics are what I call "methods" rather than "subjects". (This comes up because most of my topic modeling is on scientific research papers.) The "methods" topics rarely have a high proportion in any single document: a paper simply isn't "about" its methods, it's about its subject. When that happens, sometimes no documents qualify for a methods topic, and the labeling algorithm just returns "NA". No bueno.