Wednesday, November 26, 2014

Economics and Data Mining


He's mining for data.


I stumbled across this video.

Cosma Shalizi, a stats professor at Carnegie Mellon, argues that economists should stop "fitting large complex models to a small set of highly correlated time series data. Once you add enough variables, parameters, bells and whistles, your model can fit past data very well, and yet fail miserably in the future."

I think there's a bit of a conflation of problems here. Not all economic data sets are small. An economist friend of mine pointed out that he's been working with datasets that have millions of observations. I am told this is common in microeconomics.

Nevertheless, my experience is that "acceptable" econometric methods are overly conservative. As stated in the video, an economist saying someone is "data mining" is tantamount to an accusation of academic dishonesty. I was indoctrinated early in the ways of David Hendry's general-to-specific modeling, which is basically data mining (done intelligently). This, I think, made machine learning an intuitive move for me, and I've always thought that economics research would benefit greatly from machine learning methods.
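To make the "data mining, done intelligently" point concrete, here is a minimal sketch of the backward-elimination core of general-to-specific modeling: start from a general model with every candidate regressor and repeatedly drop the least significant one. This is only an illustration with made-up data, not Hendry's full GETS methodology (which also involves diagnostic testing and encompassing checks).

```python
import numpy as np

def general_to_specific(X, y, names, t_crit=2.0):
    """Backward elimination: fit the general model with all regressors,
    then repeatedly drop the one with the smallest |t-statistic| until
    every remaining coefficient has |t| >= t_crit."""
    keep = list(range(X.shape[1]))
    while keep:
        Xk = X[:, keep]
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (len(y) - len(keep))
        cov = sigma2 * np.linalg.inv(Xk.T @ Xk)
        t = beta / np.sqrt(np.diag(cov))
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_crit:
            break          # everything left is "significant"; stop
        keep.pop(worst)    # drop the weakest regressor and refit
    return [names[i] for i in keep]

# Simulated data: two relevant regressors, three pure-noise candidates.
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
noise = rng.normal(size=(n, 3))
X = np.column_stack([x1, x2, noise])
y = 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
print(general_to_specific(X, y, ["x1", "x2", "z1", "z2", "z3"]))
```

With strong signal and n = 500, the procedure should retain x1 and x2 and usually discard the noise regressors.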

There are some important caveats to all this. First, I don't see anyone beating out economics the same way computer science is sticking it to statistics. For "big data analytics" to live up to its hype, data scientists have to think a lot like economists, not the other way around. A big part of an economics education is economic thinking; this goes above and beyond statistical methods. Second (and more importantly), you should take anything I say here with a grain of salt. Though I have a background in (and profound love for) economics, I've never held a graduate degree in econ, and I've been out of the field (and professional network) for several years. My knowledge may be dated.

Even so, I'm happy to hear voices like Dr. Shalizi's. It adds to Hal Varian's paper on "big data" tricks for econometrics. Maybe instead of worrying about the AI singularity, we should be worrying about economists using machine learning and then taking all of our jobs. ;-)

Wednesday, November 19, 2014

What do you do when you see a bad study?

Debate: how should we respond in the face of a study using bad statistics? - This post actually has a bit of history to it, citing Andrew Gelman and Jeff Leek. I'd recommend clicking through and taking it all in.

Friday, November 14, 2014

LDA and Topic Models Reading List


A big thank you to everyone who came to see me talk about topic models at DC-NLP on Wednesday. I am grateful for the feedback that I received. I'd also like to give a big shout out to my co-author, Brian St. Thomas. Not only has his hard work made our research shine, but he is also the one who came up with the "balls and urns" graphic to explain topic models. Many people came up to me afterward saying how intuitive that was; I wish I could take the credit, but it was all Brian.

While I wait on approval from work to release my slides, I thought I'd put together an LDA-related reading list of many of my sources. I've done a bit of that before here. Some of those papers are also below, as well as others.

LDA Basics

  1. Rethinking LDA: Why Priors Matter (This is a good paper, though I am skeptical of the conclusion.)
  2. Comparison of topic models, their estimation algorithms, and priors. (Very underrated, MUST READ.)
  3. Incorporating Zipf's law in language models
  4. A note on estimating LDA with asymmetric priors

Evaluating LDA/Issues With LDA

  1. LDA is an inconsistent estimator
  2. Reading Tea Leaves: How humans interpret topic models (Also, MUST READ.)
  3. A coherence (cohesion?) metric for topic models. (Note: This metric has the issue of "liking" topics full of statistically-independent words. It is still useful though.)
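The post doesn't name the exact coherence metric, but a common document co-occurrence variant (in the style of Mimno et al.'s UMass coherence) can be sketched in a few lines. This is a toy illustration on a hypothetical corpus, not the specific metric from the paper above:

```python
import math
from itertools import combinations

def coherence(topic_words, docs):
    """Co-occurrence coherence for one topic: sum over word pairs of
    log((co-document frequency + 1) / document frequency). Pairs that
    never appear together in a document contribute negative terms."""
    df = {w: sum(w in d for d in docs) for w in topic_words}
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        co = sum((wi in d) and (wj in d) for d in docs)
        score += math.log((co + 1) / df[wj])
    return score

# Hypothetical corpus: each document is a set of word types.
docs = [{"cat", "dog", "pet"}, {"cat", "dog"}, {"stock", "bond"},
        {"stock", "bond", "market"}, {"cat", "market"}]
print(coherence(["cat", "dog"], docs))   # words that co-occur: higher
print(coherence(["dog", "bond"], docs))  # words that never co-occur: lower
```

A coherent topic ("cat", "dog") scores higher than an incoherent one ("dog", "bond") here, which is the behavior such metrics are designed to reward.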

Other Topic Models

  1. Spherical topic models. (My co-author assures me that these are consistent estimators; we've not yet implemented them though. Know anyone that has?) (Update 2:48: I was wrong, this model is *not* consistent but it could be. See Brian's note, below.)
  2. Dynamic topic models
  3. Ensembles of topic models (not our stuff, but from Jordan Boyd-Graber who is super smart and a friend of DC-NLP)

Other Stuff

  1. KERA keyword extraction used to label topics in one of my examples. (The paper applying it to LDA is forthcoming, however.)
  2. Rethinking Language: How probabilities shape the words we use (MUST READ, though not about topic modeling specifically.)
  3. David Blei's topic modeling website

From Brian on spherical topic models: "A small note on spherical topic models - the basic spherical topic model that is out there (SAM) is *not* a consistent estimator, but we have a framework to make a consistent estimator from my work on estimating mixtures of linear subspaces by tweaking the prior."

Statistics, Computer Science, and How to Move Forward

I'm still here! I took a break from blogging/Twitter/etc. over the last couple of months. My brain needed a break and I picked up a real hobby. But this blog isn't dead yet!

This month's issue of Amstat News features an editorial by Norman Matloff titled "Statistics Losing Ground to Computer Science." Provocative title, no?

I was expecting yet another article whose argument could be summed up as "get off of my lawn, you punk computer scientists!" When I read/hear these kinds of arguments from statisticians, I usually roll my eyes and move on with my life. But this time... I agreed.

Dr. Matloff's article is quite critical of CS research involving statistics. And maybe I'm getting crotchety, but I've run into many of these issues myself in my topic modeling research.  An exemplar quote is below.

Due in part to the pressure for rapid publication and the lack of long-term commitment to research topics, most CS researchers in statistical issues have little knowledge of the statistics literature, and they seldom cite it. There is much “reinventing the wheel,” and many missed opportunities.

The fact of the matter is, CS and statistics come from very different places culturally. This doesn't always lend itself to clear communication and cross-disciplinary respect.  Dr. Matloff touches on this mismatch. At one end...

CS people tend to have grand—and sometimes starry-eyed—ambitions. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a large crowd. But this mentality leads to an oversimplified view, with everything being viewed as a paradigm shift.

And at the other...

Statistics researchers should be much more aggressive in working on complex, large-scale, “messy” problems, such as the face recognition example cited earlier.

I 100% agree with the above. CS didn't start "overshadowing statistics researchers in their own field" simply because computer scientists "move fast and break things." Our own (statisticians') conservatism also stifled creativity and the ambition to solve grand problems, like facial recognition (or text analysis).

Dr. Matloff recommends several changes for statistics to make. I particularly like the suggestion that more CS and statistics professors have joint appointments. A criticism that I regularly hear from my CS colleagues is that many statisticians are mediocre programmers and lack pragmatism about the tradeoff between mathematical rigor and useful application. We've covered CS's sometimes cavalier attitude toward modeling above. Perhaps more joint appointments will not only influence faculty, but also educate students early about the needs and advantages of both approaches.


Monday, August 18, 2014

From JSM 2014: Steven Stigler's Seven Pillars of Statistics

The full list plus explanations can be found here.

In response to those who fall in the "more data means we don't have to worry about anything" camp:

The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the "root n rule." The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.

I have noticed lately that when I tell people they might be better off with a well-collected sample rather than trying to get "all the data," they look at me like I've lost my mind.
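The root n rule is easy to see in a quick simulation: quadrupling the sample size only halves the standard error of the mean. A minimal sketch (the specific numbers here are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def se_of_mean(n, reps=20000):
    """Empirical standard error of the sample mean for samples of size n,
    estimated by drawing many standard-normal samples."""
    means = rng.normal(0, 1, size=(reps, n)).mean(axis=1)
    return means.std()

se_10, se_40 = se_of_mean(10), se_of_mean(40)
# Four times the data buys only twice the precision (root n rule):
print(round(se_10 / se_40, 1))  # ≈ 2.0, not 4.0
```

That diminishing return is exactly why a well-collected sample can beat a sloppy attempt at "all the data."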

Then there's this:

Design: R. A. Fisher, in an address to the Indian Statistical Congress (1938) said "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." 

Of course, maybe I actually have lost my mind; I chose to be a statistician. :)