Wednesday, June 1, 2016

Weird Error: fatal error in wrapper code

I suppose I'm publishing this so that I can save the next programmer the effort of tracking down the source of a weird error thrown by mclapply.
fatal error in wrapper code
What? The cause, according to this, seems to be (I think) that mclapply is forking too many worker processes or taking up too much memory. This is pretty consistent with the code I was running at the time.

The solution, then, would be to fork fewer processes at a time. I'll re-run the code tonight and see if that fixes it. I'll update this post when I confirm or reject the solution.
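For reference, here's a minimal sketch of the kind of change I have in mind (my own illustration, not yet confirmed to fix anything; the per-item work below is just a placeholder):

    library(parallel)

    # mc.cores controls how many worker processes mclapply forks at once;
    # fewer forks means fewer simultaneous copies of the data competing for memory
    result <- mclapply(seq_len(1000), function(i) {
      sqrt(i)  # placeholder for the real per-item work
    }, mc.cores = 2)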

Monday, May 2, 2016


I (quietly) released an R package back in January, textmineR. It's a text mining tool designed for usability, following 3 basic principles:

  1. Syntax that is idiomatic to R (basically, useRs will find it intuitive)
  2. Maximal integration with the wider R ecosystem (use base classes as much as possible, use widely-adopted classes in all other cases)
  3. Scalable, because NLP data is pretty big

I implemented this in textmineR by doing the following (a short usage sketch comes after the list):
  1. Making the document term matrix (DTM) and term co-occurrence matrix (TCM) the central mathematical objects in the workflow
  2. Creating DTMs and TCMs in one step, with options given as arguments
  3. Expecting documents to be nothing more than character vectors
  4. Letting you store your corpus metadata however you want; I like data frames (which is probably what you created when you read the data into R anyway)
  5. Making DTMs and TCMs of class dgCMatrix from the Matrix package. These objects...
    1. are widely adopted
    2. have methods and functions that make their syntax familiar to users of base R dense matrices
  6. Writing wrappers for a bunch of topic models, so they take DTMs/TCMs as input and return similarly-formatted outputs
  7. Adding a bunch of topic model utility functions, some of which come from my own research (R-squared, anyone?)
  8. Building textmineR on top of text2vec, which is really, really, really wicked fast. (really)
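To make that concrete, here is a toy sketch of the workflow (my own illustration; function names and arguments follow textmineR's documentation and may differ a bit between versions):

    library(textmineR)

    # documents are just a character vector; keep metadata in a separate data frame
    docs <- c(doc1 = "statistics and machine learning on text data",
              doc2 = "topic models summarize large collections of text")

    # one step from raw text to a sparse DTM of class dgCMatrix
    dtm <- CreateDtm(doc_vec = docs,
                     doc_names = names(docs),
                     ngram_window = c(1, 1))  # unigrams only

    class(dtm)  # "dgCMatrix", from the Matrix package

    # topic model wrappers take the DTM directly and return plain R objects
    model <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)
    str(model$phi)    # topics-by-terms probabilities
    str(model$theta)  # documents-by-topics probabilities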
I gave a talk at Statistical Programming DC on Wednesday night announcing the release. The turnout was great (terrifying) and I received a bunch of great feedback.

I walked into work Friday morning and found a bug that affects all Windows users. :(

Fortunately, the development version is patched and ready to rock!

Unfortunately, I blew my one chance to get a rapid patch up on CRAN by pushing another patch the previous weekend. We'll have to wait until the end of May for me to push the latest patch. I'm afraid that all the goodwill that came from the talk will be squandered when half (or more) of the people who try it end up getting errors.

Hopefully, that will give me time to finish writing the vignette.

Thursday, November 5, 2015

More on statisticians in data science

AMSTAT News has published an opinion piece by yours truly on the identity of statisticians in data science in its November issue. My piece starts on page 25 of the print version. The online version is here. A quote:

I am not convinced that statistics is data science. But I am convinced that the fundamentals of probability and mathematical statistics taught today add tremendous value and cement our identity as statisticians in data science.
Please read the whole thing.

Friday, May 8, 2015


I made a mistake yesterday.

What happened? When creating a matrix of zeros, I accidentally typed matrix() instead of Matrix().

What's the difference? 4.8 terabytes versus less than one GB. I was creating a document term matrix of about 100,000 documents with a vocabulary of about 6,000,000 tokens. This is the thing with linguistic data: one little mistake is the difference between working on a MacBook Air with no fuss and something that would make a supercomputer choke. (Anyone want to get me a quote on hardware with 5 TB of RAM?)
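For the curious, here's the back-of-the-envelope version of the mistake (my reconstruction; the exact call details are illustrative):

    library(Matrix)

    n_docs  <- 1e5  # ~100,000 documents
    n_vocab <- 6e6  # ~6,000,000 tokens in the vocabulary

    # dense storage: every cell is an 8-byte double, zero or not
    n_docs * n_vocab * 8 / 1e12  # ~4.8 terabytes

    # what I meant to type: a sparse zero matrix, which stores almost nothing
    dtm <- Matrix(0, nrow = n_docs, ncol = n_vocab, sparse = TRUE)

    # what I actually typed (don't run this): a dense ~4.8 TB allocation
    # dtm <- matrix(0, nrow = n_docs, ncol = n_vocab)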

Wednesday, January 14, 2015

Are microeconomists data scientists?

From The Economist:

Armed with vast data sets produced by tech firms, microeconomists can produce startlingly good forecasts of human behaviour. Silicon Valley firms have grown to love them: by bringing a cutting-edge economist in house, they are able to predict what customers or employees are likely to do next.

Sounds like data scientists to me. The article is here. There's a related piece here.

Friday, January 9, 2015

Introducing R-squared for Topic Models

I have a new working paper added to my publications page. The abstract reads:
This document proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or R2. Within the context of topic modeling, R2 has the same interpretation that it does when used in a broader class of statistical models. Reporting R2 with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. This paper proposes that R2 should be reported as a standard metric when constructing topic models.
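For a rough sense of how this works, here is my own back-of-the-envelope sketch (not code from the paper, and the paper's exact definition may differ): R2 is the usual 1 - SSE/SST, where each document's fitted value is its expected term-count vector under the model.

    # dtm: documents-by-terms count matrix
    # theta: documents-by-topics probabilities, P(topic | document)
    # phi: topics-by-terms probabilities, P(term | topic)
    topic_model_r2 <- function(dtm, theta, phi) {
      dtm <- as.matrix(dtm)

      # expected term counts for each document under the model
      yhat <- rowSums(dtm) * (theta %*% phi)

      # the "mean document": average term-count vector across the corpus
      ybar <- colMeans(dtm)

      sse <- sum((dtm - yhat)^2)
      sst <- sum(sweep(dtm, 2, ybar)^2)

      1 - sse / sst
    }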
In researching this paper, I came across a potentially significant (in the colloquial sense) finding.

These properties of R2 compel a change in focus for the topic modeling research community. Document length is an important factor in model fit whereas the number of documents is not. When choosing between fitting a topic model on short abstracts, full-page documents, or multi-page papers, it appears that more text is better. Second, since model fit is invariant for corpora over 1,000 documents (our lower bound for simulation), sampling from large corpora should yield reasonable estimates of population parameters. The topic modeling community has heretofore focused on scalability on large corpora of hundreds-of-thousands to millions of documents. Little attention has been paid to document length, however. These results, if robust, indicate that focus should move away from larger corpora and towards lengthier documents. 

If these results hold up, my takeaways are:
  1. Stop building topic models on tweets and stop using the 20 news groups data.
  2. Your argument of needing gigantic corpora (and algorithms that scale to them) is invalid.
  3. Citing points (1) and (2), I am prone to hyperbole.
  4. Citing points (1) through (3), I like numbered lists.

In all seriousness, if you take the time to read the paper and have any comments, please let me know. You can comment below, use the "contact me directly" tool on your right, or email me (if you know my email already). I'll be circulating this paper among friends and colleagues to get their opinions as well.