Tuesday, April 22, 2014

It's all about the Beta

I am presenting a paper at this year's Joint Statistical Meetings (JSM). (It is also my first JSM.) The abstract is below.
Latent Dirichlet Allocation (LDA) is a popular hierarchical Bayesian model used in text mining. LDA models corpora as mixtures of categorical variables with Dirichlet priors. LDA is a useful model, but it is difficult to evaluate its effectiveness; the process that LDA models is not how people generate real language. Monte Carlo simulation is one approach to generating data where the "right" answers are known a priori. But sampling from the Dirichlet distributions that are often used as priors in LDA does not generate corpora with the property of natural language known as Zipf's law. We explore the relationship between the Dirichlet distribution and Zipf's law within the framework of LDA. Considering Zipf's law allows researchers to more easily explore the properties of LDA and make more informed a priori decisions when modeling real textual data.
I will cut to the chase: If you generate data with a process mimicking LDA, the term frequency of the generated corpus depends only on beta, the Dirichlet parameter for topics distributed over words. Alpha, the Dirichlet parameter for documents distributed over topics, factors out and sums to one.
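Here is a rough way to check that claim by simulation. It is a sketch under my own assumptions, not the derivation from the paper: the function name `simulate_term_freq`, the corpus sizes, and the Zipf-shaped beta are all made up for illustration. The idea is that the expected share of each word is beta_w / sum(beta), so two corpora generated with very different alphas should both land near that vector.

```python
import numpy as np

def simulate_term_freq(alpha, beta, n_docs=500, doc_len=100, seed=0):
    """Simulate an LDA-like corpus and return its relative term frequencies."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)  # length K: prior for documents over topics
    beta = np.asarray(beta, dtype=float)    # length V: prior for topics over words
    phi = rng.dirichlet(beta, size=alpha.size)  # K x V topic-word distributions
    theta = rng.dirichlet(alpha, size=n_docs)   # D x K document-topic distributions
    # Per-document word probabilities with the topic assignments summed out.
    word_probs = theta @ phi
    word_probs /= word_probs.sum(axis=1, keepdims=True)  # guard against rounding drift
    counts = np.zeros(beta.size)
    for p in word_probs:
        counts += rng.multinomial(doc_len, p)
    return counts / counts.sum()

# Hypothetical setup: 20 topics, 50 words, beta proportional to 1/rank.
n_topics, vocab_size = 20, 50
zipf = 1.0 / np.arange(1, vocab_size + 1)
beta = 200.0 * zipf / zipf.sum()
expected = beta / beta.sum()  # expected term frequencies under this sketch

# Same beta, very different alphas.
freq_small_alpha = simulate_term_freq(np.full(n_topics, 0.1), beta, seed=1)
freq_large_alpha = simulate_term_freq(np.full(n_topics, 10.0), beta, seed=2)

# Largest absolute gap between simulated and expected frequencies.
print(np.abs(freq_small_alpha - expected).max())
print(np.abs(freq_large_alpha - expected).max())
```

The match holds in expectation only. With a finite number of topics and documents, the simulated frequencies will wobble around beta_w / sum(beta), and the wobble grows as the total mass of beta shrinks.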

What does it mean? You'll have to come see me talk to find out. ;)

If you'll be there, it's session 617 on the last day of the conference, August 7. They've got me slated for 8:30 AM; don't drink too much the night before.

Some LDA resources I've found helpful:


Jonathan Chang's lda package for R. (It converges much faster than topicmodels. I am personally not a fan of topic modeling with MALLET in R or Java.)

Wikipedia uses LDA as an example of a Dirichlet-multinomial distribution. (For the record, and with no offense to David Blei or any of the other brilliant folks doing topic modeling research, this Wikipedia example is much easier to understand than any "official" explanation I've read in a research paper so far.)

The BEST short paper on Gibbs sampling to fit/learn an LDA model.

What makes LDA better than pLSA? Why is Gibbs sampling different from variational Bayes? It's all about the priors, stupid.

Goldwater, Griffiths, and Johnson almost scooped me (in 2011). While they aren't as explicit about the link between LDA and Zipf's law as I am (will be?), they have a general framework for linguistic models of which LDA is a specific case.

Hey, are you modeling language? You should be reading Thomas Griffiths. At the very least, read this article. Ok ok. Only if you're actually interested in understanding causality in language models, you should read Griffiths.

2 comments:

  1. Congrats, Tommy! Will the conference proceedings be published online?

    1. Thanks, Charlie!

      Looking at the 2013 conference (http://www.amstat.org/meetings/jsm/2013/proceedings.cfm), the answer seems to be... "maybe?"

      So long as I'm not violating some legal agreement, I'll host a copy of my paper here after the conference.
