Tuesday, December 23, 2014

Labeling topics from topic models

My friend Charlie Greenbacker showed me LDAvis, which creates interactive apps for visualizing topics from topic models. As often happens, this led to a back-and-forth between Charlie and me covering a range of topic modeling issues. I've decided to share and expand on bits of the conversation over a few posts.

The old way


The traditional method for reporting a topic (let's call it topic X) is to list the top 3 to 5 words in topic X. What do I mean by "top" words? In topic modeling, a "topic" is a probability distribution over words. Specifically, "topic X" is really P(words | topic X).
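If you want to make that concrete, here's a minimal Python sketch of the old way. The phi matrix and vocabulary below are made up for illustration; in practice they'd be the topic-word matrix and vocabulary from your fitted model.

import numpy as np

# Hypothetical stand-ins: phi[k, v] = P(word v | topic k); vocab maps columns to words.
phi = np.array([[0.35, 0.25, 0.05, 0.30, 0.05],
                [0.05, 0.10, 0.60, 0.05, 0.20]])
vocab = np.array(["model", "predict", "cat", "data", "the"])

def top_words(phi, vocab, n=3):
    """The traditional report: the n most probable words under P(words | topic)."""
    return [vocab[np.argsort(row)[::-1][:n]].tolist() for row in phi]

print(top_words(phi, vocab))
# [['model', 'data', 'predict'], ['cat', 'the', 'data']]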

Here is an example from the empirical section of the R-squared paper I'm working on:

"mode"    "model"   "red"     "predict" "tim"     "data" 

 A few things pop out pretty quickly.

  1. These are all unigrams. The model includes bigrams, but they aren't at the top of the distribution. 
  2. A couple of the words seem to be truncated. Is "mode" supposed to be "model"? Is "tim" supposed to be "time"? It's really hard to tell without any context. (Even if these are truncated, it wouldn't greatly affect the fit of the model. It just makes it look ugly.)
  3. From the information we have, a good guess is that this topic is about data modeling or prediction or something like that.

The incoherence of these terms on their own requires topic modelers to spend massive amounts of time curating a dictionary for their final model. If you don't, you may end up with a topic that looks like topic 1 in this example. Good luck interpreting that!

(As a complete aside, it looks like topic 1 is the result of using an asymmetric Dirichlet prior for topics over documents. Those that attended my DC NLP talk know that I have ambivalent feelings about this.)


A new approach


I'm going to get a little theoretical on you: Zipf's law tells me that, in theory, the most probable terms in every topic should be stop words. Think about it. When I'm talking about cats, I still use words like "the", "this", "an", etc. waaaaay more than any cat-specific words. (That's why Zipf's law is, well...., a law.)

Even if we remove general stop words before modeling, I probably have a lot of corpus-specific stop words. Pulling those out, while trying to preserve the integrity of my data, is no easy task. (It's also a little like performing surgery with a chainsaw.) That's why so much time is spent on vocabulary curation.

My point is that I don't think P(words | topic X) is the right way to look at this. Zipf's law means that I expect the most probable words in that distribution to contain no contextual meaning. All that dictionary curation isn't just time-consuming; it's perverting our data.

But what happens if we throw a little Bayes' Theorem at this problem? Instead of ordering words by P(words | topic X), let's order them according to P(topic X | words).

"prediction_model" "causal_inference" "force_field" "kinetic_model" "markov" "gaussian"

As I said to Charlie: "Big difference in interpretability, no?"
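For anyone who wants to try the flip themselves, here's a rough Python sketch. The phi matrix and the marginal topic probabilities p_topic are made-up placeholders; in practice they'd come from your fitted model and the topic proportions in your corpus.

import numpy as np

# Hypothetical inputs: phi[k, v] = P(word v | topic k), p_topic[k] = P(topic k).
phi = np.array([[0.35, 0.25, 0.05, 0.30, 0.05],
                [0.05, 0.10, 0.60, 0.05, 0.20]])
p_topic = np.array([0.6, 0.4])
vocab = np.array(["model", "predict", "cat", "data", "the"])

# Bayes' theorem: P(topic k | word v) = P(word v | topic k) * P(topic k) / P(word v)
p_word = phi.T @ p_topic                   # marginal probability of each word
gamma = phi * p_topic[:, None] / p_word    # gamma[k, v] = P(topic k | word v)

k = 0  # "topic X"
reordered = vocab[np.argsort(gamma[k])[::-1]]
print(reordered[:3])  # the words most diagnostic of topic X, not merely the most frequent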

Full labels


I think that all of our dictionary curation hurts us beyond being a time sink. I think it makes our models fit the data worse. This has two implications: we have less trust in the resulting analysis, and our topics are actually more statistically muddled, not less.

We (as in me and some other folks I work with) have come up with some automated ways to label topics. This method works by grouping documents together by topic and then extracting keywords from those documents. (The difference between my approach and my colleagues' is in the keyword extraction step.)

The method basically works like this (a rough code sketch follows the list):
  1. For each topic:
  2. Grab a set of documents with a high prevalence of that topic.
  3. In a document-term matrix of bigrams and trigrams, calculate P(words | that set of documents) - P(words | the overall corpus).
  4. Take the n-gram with the highest score as your label.
  5. Next topic
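Here's the code sketch promised above. It's a minimal Python illustration of the recipe, not our actual implementation: the prevalence threshold, the n-gram range, and the use of scikit-learn's CountVectorizer are all assumptions made for the example.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def label_topics(docs, theta, threshold=0.25, ngram_range=(2, 3)):
    """Label each topic with the n-gram most over-represented in its documents.

    docs  : list of raw document strings
    theta : (num_docs x num_topics) matrix of P(topic | document)
    """
    # Document-term matrix of bigrams and trigrams over the whole corpus
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    dtm = vectorizer.fit_transform(docs)
    vocab = np.array(vectorizer.get_feature_names_out())

    corpus_counts = np.asarray(dtm.sum(axis=0)).ravel()
    p_corpus = corpus_counts / corpus_counts.sum()        # P(words | overall corpus)

    labels = []
    for k in range(theta.shape[1]):
        top_docs = np.where(theta[:, k] >= threshold)[0]  # docs where topic k is prevalent
        if len(top_docs) == 0:
            labels.append("NA")                           # no documents cleared the threshold; see the caveat below
            continue
        topic_counts = np.asarray(dtm[top_docs].sum(axis=0)).ravel()
        p_topic_docs = topic_counts / topic_counts.sum()  # P(words | that set of documents)
        labels.append(vocab[np.argmax(p_topic_docs - p_corpus)])
    return labels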
My label for "topic X"?

"statistical_method"

It's not perfect. I have noticed that similar topics tend to get identical labels. The labeling isn't so good at picking up on subtle differences. Some topics are what I call "methods" rather than "subjects". (This is because most of my topic modeling is on scientific research papers.) The "methods" rarely have a high proportion in any document. The document simply isn't "about" its methods; it's about its subject. When this happens, sometimes I don't get any documents to go with a methods topic. The labeling algorithm just returns "NA". No bueno.


One last benefit


By not butchering the statistical signals in our documents with heavy-handed dictionary curation, we get some nice properties in the resulting model. One, for example, is that we can cluster topics together cleanly. So, I can create a nice hierarchical dendrogram of all my topics. (I can also use the labeling algorithm to label groups higher up on the tree if I want.)
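As a rough illustration of the clustering step, here's one simple way to build a topic dendrogram in Python using only the topic-word distributions (Hellinger distance between rows of phi). The phi below is random stand-in data, and the real dendrogram described next also folds in document occurrence, which this sketch skips.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Stand-in data: 20 fake topics over a 500-word vocabulary; rows are P(words | topic).
rng = np.random.default_rng(42)
phi = rng.dirichlet(np.ones(500), size=20)
topic_labels = [f"topic_{k}" for k in range(phi.shape[0])]

# Hellinger distance between topic-word distributions:
# H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
hellinger = pdist(np.sqrt(phi)) / np.sqrt(2)

# Agglomerative clustering and a dendrogram of the topics
Z = linkage(hellinger, method="average")
dendrogram(Z, labels=topic_labels, leaf_rotation=90)
plt.tight_layout()
plt.show()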

You can check out one of the dendrograms I'm using for the R-squared paper by clicking here. The boxes are clusters of topics based on linguistic similarity and document occurrence. (It's easier to see if you zoom in.) It's a model of 100 topics on 10,000 randomly-sampled NIH grant abstracts. (You can get your own here.)





Wednesday, December 17, 2014

Notes on the culture of economics

I'm finally getting around to reading Piketty's Capital in the 21st Century. That and a project at work have put economics back at the front of my brain. I found the posts below interesting.

Paul Krugman says in "Notes on the Floating Crap Game (Economics Inside Baseball)":

So, academic economics is indeed very hierarchical; but I think it’s important to understand that it’s not a bureaucratic hierarchy, nor can status be conferred by crude patronage. The profession runs on reputation — basically the shared perception that you’re a smart guy. But how do you get reputation? [...] [R]eputation comes out of clever papers and snappy seminar presentations. 
[...]  Because everything runs on reputation, a lot of what you might imagine academic politics is like — what it may be like in other fields — doesn’t happen in econ. When young I would have relatives asking whether I was “in” with the department head or the senior faculty in my department, whether I was cultivating relationships, whatever; I thought it was funny, because all that mattered was your reputation, which was national if not global.

Not all that Krugman says is rosy for economists. Nevertheless, this is consistent with my experience when I was in economics. Econ has a hierarchical structure, but it's not based on patronage or solely "length of service." For example, when I was at the Fed, the internal structure was quite hierarchical in terms of both titles and managerial responsibility. (It kind of reminded me of the military.) However, it also had a paradoxically "flat" culture. Ideas were swapped and debated constantly. Though I was a lowly research assistant, my forecasts were respected and my input listened to. I was no exception; this was just how we operated.

Simply Statistics brought another post to my attention. From Kevin Drum at Mother Jones: Economists are Almost Inhumanly Impartial.

Over at 538, a team of researchers takes on the question of whether economists are biased. Given that economists are human beings, it would be pretty shocking if the answer turned out to be no, and sure enough, it's not. In fact, say the researchers, liberal economists tend to produce liberal results and conservative economists tend to produce conservative results. This is unsurprising, but oddly enough, I'm also not sure it's the real takeaway here. [...]
 What I see is a nearly flat regression line with a ton of variance. [...] If these results are actually true, then congratulations economists! You guys are pretty damn evenhanded. The most committed Austrians and the most extreme socialists are apparently producing numerical results that are only slightly different. If there's another field this side of nuclear physics that does better, I'd be surprised.

(I'll leave it to you to check out the regression line in question.)

Simply Statistics's Jeff Leek has a different take:
I'm not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line.

I don't know what Leek is getting at exactly; maybe we agree. What I see is a nearly flat line through a cloud of points. My take isn't that economists are unbiased. Rather, their bias is generally uncorrelated with their ideology. That's still a good thing, right? (Either way, I am not one for the philosophy that p < 0.05 means it's true and p > 0.05 means it's false.)

Here's what I've told other people: microeconomics is about as close to a science as you're going to get. It's a lot like studying predator-prey systems in the wild. There's definitely stochastic variation, but the trends are pretty clear; not much to argue about. Macroeconomics, on the other hand, is a lot trickier. It's not that macroeconomists are any less objective than microeconomists. Rather, measurement and causality are much trickier. In the resulting vacuum, there's room for different assumptions and philosophies. This is what macroeconomists debate about.

Nevertheless, my experience backs up a comment to Drum's article:
Economists generally avoid and form consensus in regard to fringe theories. 

Translation: the differences in philosophy between macroeconomists aren't as big as you'd think. And they're tiny compared to our political differences.

Tuesday, December 16, 2014

Empowering People with Machine Learning

From an article in the Wall Street Journal:

When system designers begin a project, they first consider the capabilities of computers, with an eye toward delegating as much of the work as possible to the software. The human operator is assigned whatever is left over, which usually consists of relatively passive chores such as entering data, following templates and monitoring displays.
This philosophy traps people in a vicious cycle of de-skilling. By isolating them from hard work, it dulls their skills and increases the odds that they will make mistakes. When those mistakes happen, designers respond by seeking to further restrict people’s responsibilities—spurring a new round of de-skilling.
Because the prevailing technique “emphasizes the needs of technology over those of humans,” it forces people “into a supporting role, one for which we are most unsuited,” writes the cognitive scientist and design researcher Donald Norman of the University of California, San Diego.
There is an alternative.
In “human-centered automation,” the talents of people take precedence. Systems are designed to keep the human operator in what engineers call “the decision loop”—the continuing process of action, feedback and judgment-making. That keeps workers attentive and engaged and promotes the kind of challenging practice that strengthens skills.
In this model, software plays an essential but secondary role. It takes over routine functions that a human operator has already mastered, issues alerts when unexpected situations arise, provides fresh information that expands the operator’s perspective and counters the biases that often distort human thinking. The technology becomes the expert’s partner, not the expert’s replacement.
Pushing automation in a more humane direction doesn't require any technical breakthroughs. It requires a shift in priorities and a renewed focus on human strengths and weaknesses.

Two thoughts come to mind:

First, there's Tyler Cowen's analogy of freestyle chess. He uses this analogy liberally in Average is Over. And the division of labor between human and computer in freestyle chess mirrors the above quote.

Second, I was taught the dichotomy of these two philosophies in the Marine Corps. I enlisted just before 9/11; ten-plus years of war may have changed the budgetary environment. But at the time, Marine infantry units did not have much of a budget. As a result, we trained ourselves (and our minds) first, and supplemented with what technology we could afford. On occasional training exercises with some other (unnamed) branches of the military, we observed that these other units were awash in technology, helpless without it, and not any better than us with it. (Think fancy GPS versus old GPS + map & compass.)

I believe the latter is an example of what another quote from the WSJ article warns about:

If we let our own skills fade by relying too much on automation, we are going to render ourselves less capable, less resilient and more subservient to our machines. 

Something to keep in mind as you're implementing your decision support systems.

Monday, December 15, 2014

Ukrain?

So, I've noticed a trend over the last few months in the blog's traffic. The vast majority of hits seem to be coming from domains ending in ".ru".


Of course, these are bots. (I am heartened to see that when you aggregate the referring URLs by site, twitter, meetup, and datasciencecentral are still close to the top.)

When looking at the geography of the traffic sources, I'm seeing a whole lot of... Ukrain?


Who knew stats were so popular in Ukrain? (Kidding.)

But seriously, this only started a few months ago. I'm wondering if the conflict in Ukraine has anything to do with this. It's conceivable that computers and servers are getting hijacked over there as part of the war. Anyone have any thoughts?

Thursday, December 11, 2014

Saved by plagiarism!

I am writing a paper on goodness-of-fit for topic models. (Specifically, I've derived an R-squared metric for use with topic models.) I came across this definition for goodness-of-fit in our friend, Wikipedia.

The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.

I love it! It's concise and to the point. But do I really want to cite Wikipedia in an article for peer review?

A Google search for the verbatim quote above reveals that this definition appears in countless books, papers, and websites without attribution. Did these authors plagiarize Wikipedia? Did Wikipedia plagiarize these authors? Who knows.

My solution: put the definition in quotes and attach a footnote. "This quote appears verbatim on Wikipedia and in countless books, papers, and websites."

Done.

Tuesday, December 9, 2014

Simulating Realistic Data for Topic Modeling

Brian and I have finally submitted our paper to IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the culmination of a year of hard work. (There's more work yet to be done; I doubt we'll make it through peer-review without having to revise.)

I presented our preliminary results at JSM in August, as described in this earlier post.

Here is the abstract.

Latent Dirichlet Allocation (LDA) is a popular Bayesian methodology for topic modeling. However, the priors in LDA analysis are not reflective of natural language. In this paper we introduce a Monte Carlo method for generating documents that accurately reflect word frequencies in language by taking advantage of Zipf’s Law. In developing this method we see a result for incorporating the structure of natural language into the prior of the topic model. Technical issues with correctly assigning power law priors drove us to use ensemble estimation methods. The ensemble estimation technique has the additional benefit of improving the quality of topics and providing an approximation of the true number of topics.

The rest of the paper can be read here.
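The paper's procedure is more involved than this, but as a toy illustration of the Zipf idea, here's how one might sample bag-of-words documents whose word frequencies follow a power law. The vocabulary size, document length, and corpus size below are all made up.

import numpy as np

rng = np.random.default_rng(0)

# A Zipfian word distribution: P(word with rank r) is proportional to 1 / r
vocab_size = 5000
ranks = np.arange(1, vocab_size + 1)
zipf_probs = (1.0 / ranks) / np.sum(1.0 / ranks)

def simulate_doc(doc_length, word_probs, rng):
    """Draw one bag-of-words document (a vector of word counts)."""
    return rng.multinomial(doc_length, word_probs)

# A toy corpus of 100 documents, 300 tokens each
corpus = np.vstack([simulate_doc(300, zipf_probs, rng) for _ in range(100)])
print(corpus.shape)  # (100, 5000)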

Monday, December 8, 2014

Look up

I've added a couple of pages to the blog here. The about me page has a quick bio. The publications and presentations page is where I'll be putting up my bragging-rights research portfolio.