I have a new working paper added to my publications page. The abstract reads:
This document proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or R². Within the context of topic modeling, R² has the same interpretation that it does when used in a broader class of statistical models. Reporting R² with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. This paper proposes that R² should be reported as a standard metric when constructing topic models.
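For readers who want something concrete, here is a minimal sketch of how such a statistic might be computed. It assumes the familiar definition R² = 1 − SSE/SST, with the squared errors summed over every cell of the document-term matrix; that choice is an assumption made for illustration, so check the paper for the exact formulation. The function name `topic_model_r2` and its arguments (`dtm`, `theta`, `phi`) are hypothetical.

```python
import numpy as np


def topic_model_r2(dtm, theta, phi):
    """Sketch of an R-squared for a topic model (assumed definition: 1 - SSE/SST).

    dtm   : (D, V) array of word counts per document
    theta : (D, K) array, each row a document's distribution over topics
    phi   : (K, V) array, each row a topic's distribution over words
    """
    dtm = np.asarray(dtm, dtype=float)
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)

    # Expected counts under the model: document length times the
    # model's word probabilities for that document.
    doc_lengths = dtm.sum(axis=1, keepdims=True)
    fitted = doc_lengths * (theta @ phi)

    # Error of the model versus error of the "mean document" baseline.
    sse = np.sum((dtm - fitted) ** 2)
    sst = np.sum((dtm - dtm.mean(axis=0)) ** 2)
    return 1.0 - sse / sst


if __name__ == "__main__":
    # Toy corpus: two documents, three words, two topics.
    dtm = np.array([[2, 2, 0], [0, 1, 1]])
    phi = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]])
    theta = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(topic_model_r2(dtm, theta, phi))
```

As with ordinary regression, a value near 1 means the fitted counts sit close to the observed counts, and a value near 0 means the model does little better than predicting the average document.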
In researching this paper, I came across a potentially significant (in the colloquial sense) finding.
These properties of R² compel a change in focus for the topic modeling research community. Document length is an important factor in model fit whereas the number of documents is not. When choosing between fitting a topic model on short abstracts, full-page documents, or multi-page papers, it appears that more text is better. Second, since model fit is invariant for corpora over 1,000 documents (our lower bound for simulation), sampling from large corpora should yield reasonable estimates of population parameters. The topic modeling community has heretofore focused on scalability on large corpora of hundreds-of-thousands to millions of documents. Little attention has been paid to document length, however. These results, if robust, indicate that focus should move away from larger corpora and towards lengthier documents.
1. Stop building topic models on tweets and stop using the 20 Newsgroups data.
2. Your argument of needing gigantic corpora (and algorithms that scale to them) is invalid.
3. Citing points (1) and (2), I am prone to hyperbole.
4. Citing points (1) through (3), I like numbered lists.
In all seriousness, if you take the time to read the paper and have any comments, please let me know. You can comment below, use the "contact me directly" tool on your right, or email me (if you know my email already). I'll be circulating this paper among friends and colleagues to get their opinions as well.