Tuesday, December 9, 2014

Simulating Realistic Data for Topic Modeling

Brian and I have finally submitted our paper to IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the culmination of a year of hard work. (There's more work yet to be done; I doubt we'll make it through peer review without having to revise.)

I presented our preliminary results at JSM in August, as described in this earlier post.

Here is the abstract.

Latent Dirichlet Allocation (LDA) is a popular Bayesian methodology for topic modeling. However, the priors in LDA analysis are not reflective of natural language. In this paper we introduce a Monte Carlo method for generating documents that accurately reflect word frequencies in language by taking advantage of Zipf’s Law. In developing this method we see a result for incorporating the structure of natural language into the prior of the topic model. Technical issues with correctly assigning power law priors drove us to use ensemble estimation methods. The ensemble estimation technique has the additional benefit of improving the quality of topics and providing an approximation of the true number of topics.

The rest of the paper can be read here.
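
For readers who want a feel for the idea before opening the paper, here is a minimal sketch of simulating a corpus whose word frequencies follow Zipf's Law. It is not the code or the exact procedure from the paper. The function names, the parameter values, and the choice to draw each topic from a Dirichlet centered on a Zipf curve are all assumptions made for illustration.

```python
import numpy as np


def zipf_probabilities(vocab_size, exponent=1.0):
    """Word probabilities proportional to 1 / rank**exponent (Zipf's Law)."""
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks ** exponent
    return weights / weights.sum()


def simulate_corpus(n_docs=100, n_topics=5, vocab_size=1000,
                    doc_length=200, alpha=0.1, beta_scale=500.0, seed=0):
    """Draw a document-term matrix from an LDA-style generative process.

    Assumption: each topic's word distribution is a Dirichlet draw whose
    mean is the Zipf curve, so corpus-wide frequencies stay roughly Zipfian.
    """
    rng = np.random.default_rng(seed)
    base = zipf_probabilities(vocab_size)

    # Topic-word distributions, shape (n_topics, vocab_size).
    phi = rng.dirichlet(beta_scale * base, size=n_topics)
    # Document-topic distributions, shape (n_docs, n_topics).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    dtm = np.zeros((n_docs, vocab_size), dtype=np.int64)
    for d in range(n_docs):
        # Mix the topics for this document, then sample its word counts.
        word_probs = theta[d] @ phi
        word_probs = word_probs / word_probs.sum()
        dtm[d] = rng.multinomial(doc_length, word_probs)
    return dtm, theta, phi


dtm, theta, phi = simulate_corpus()
```

Summing the columns of `dtm` and plotting the counts against rank on log-log axes is a quick way to check whether the simulated corpus looks Zipfian.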
