Monday, May 2, 2016

textmineR



I (quietly) released an R package back in January, textmineR. It's a text mining tool designed for usability following 3 basic principles:


  1. Syntax that is idiomatic to R (basically, useRs will find it intuitive)
  2. Maximal integration with the wider R ecosystem (use base classes as much as possible, use widely-adopted classes in all other cases)
  3. Scalable, because NLP data is pretty big

I implemented this in textmineR by:
  1. Making the document term matrix (DTM) and term co-occurrence matrix (TCM) the central mathematical objects in the workflow
  2. Creating DTMs and TCMs in one step, with options given as arguments
  3. Expecting documents to be character vectors (nothing more)
  4. Letting you store your corpus metadata however you want. I like data frames (which is probably what you created when you read the data into R anyway)
  5. Making DTMs and TCMs of class dgCMatrix from the Matrix package. These objects...
    1. are widely adopted
    2. have methods and functions that make their syntax familiar to anyone used to base R dense matrices
  6. Writing wrappers for a bunch of topic models, so they take DTMs/TCMs as input and return similarly-formatted outputs
  7. Adding a bunch of topic model utility functions, some of which come from my own research (R-squared, anyone?)
  8. Building textmineR on top of text2vec, which is really, really, really wicked fast. (really)
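
To make the design above concrete, here's a minimal sketch of the underlying idea using only base R and the Matrix package. It is not textmineR's own code and it doesn't call any textmineR functions; the toy documents, the naive whitespace tokenizer, and the `meta` data frame are all made up for illustration.

```r
library(Matrix)

# Documents are just a named character vector (toy data, made up for this example)
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat on the log",
          doc3 = "cats and dogs")

# Corpus metadata lives wherever you like; here, a hypothetical data frame
meta <- data.frame(doc_id = names(docs),
                   year = c(2014, 2015, 2016),
                   stringsAsFactors = FALSE)

# Naive whitespace tokenization (an assumption; a real tokenizer does much more)
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# One (document, term) index pair per token
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# sparseMatrix() adds up the x values of duplicated (i, j) pairs,
# so a vector of ones turns into term counts
dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(names(docs), vocab))

is(dtm, "dgCMatrix")   # the DTM is a sparse matrix from the Matrix package

# Familiar base-R matrix syntax works on it
dim(dtm)
colSums(dtm)                          # corpus-wide term frequencies
dtm["doc1", "the"]                    # 2
recent <- dtm[meta$year >= 2015, ]    # subset rows using the metadata
```

The point is that nothing here needs a special corpus class: a character vector goes in, a dgCMatrix comes out, and the metadata stays in an ordinary data frame that you can use to index the matrix.
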

I gave a talk at Statistical Programming DC on Wednesday night announcing it. The turnout was great (terrifying) and I received a bunch of great feedback.

I walked into work Friday morning and found a bug that affects all Windows users. :(

Fortunately, the development version is patched and ready to rock!

Unfortunately, I blew my one chance to get a rapid patch up on CRAN by pushing another patch the previous weekend. We'll have to wait until the end of May for me to push the latest patch. I'm afraid that all the goodwill from the talk will be squandered when half (or more) of the people who try it end up getting errors.

Hopefully, that will give me time to finish writing the vignette.
