> "textmineR's number one concern is usability! Thanks for looking out for us @thos_jones #datadc #NLP pic.twitter.com/iIfFBNaALW" — Danielle Beaulieu (@andDunny), April 27, 2016
I (quietly) released an R package back in January: textmineR. It's a text mining tool designed for usability, following three basic principles:
- Syntax that is idiomatic to R (basically, useRs will find it intuitive)
- Maximal integration with the wider R ecosystem (use base classes as much as possible, use widely-adopted classes in all other cases)
- Scalability, because NLP data is pretty big
- Making the document term matrix (DTM) and term co-occurrence matrix (TCM) the central mathematical objects in the workflow
- Creating DTMs and TCMs in one step, with options given as arguments.
- Documents are expected to be character vectors (nothing more).
- You can store your corpus metadata however you want. I like data frames (which is probably what you created when you read the data into R anyway)
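In practice, that one-step creation looks roughly like the sketch below. The argument names are my best recollection of `CreateDtm`'s interface, not a guaranteed signature — check `?CreateDtm` for the authoritative list.

```r
library(textmineR)

# a corpus is just a character vector; metadata lives wherever you like
docs <- c(doc1 = "The cat sat on the mat.",
          doc2 = "Dogs and cats are natural text mining subjects.")

# one call, with preprocessing options passed as arguments
dtm <- CreateDtm(doc_vec = docs,            # the documents themselves
                 doc_names = names(docs),   # row names for the DTM
                 ngram_window = c(1, 1),    # unigrams only
                 lower = TRUE,
                 remove_punctuation = TRUE)

dim(dtm)  # documents x terms
```

No corpus object, no separate tokenization step: a character vector in, a sparse matrix out.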
- Making DTMs and TCMs of class dgCMatrix from the Matrix package. These objects...
- are widely adopted
- have methods and functions whose syntax will feel familiar to users of base R's dense matrices
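That familiarity is easy to demonstrate with the Matrix package directly — the same indexing and arithmetic you'd use on a base dense matrix just work on a dgCMatrix:

```r
library(Matrix)

# build a small sparse matrix of the same class textmineR uses for DTMs
m <- Matrix(c(1, 0, 2,
              0, 3, 0),
            nrow = 2, byrow = TRUE, sparse = TRUE)
class(m)    # dgCMatrix

# base-matrix syntax carries over unchanged
m[1, ]      # row indexing
colSums(m)  # column sums, e.g. term frequencies across a corpus
m %*% t(m)  # matrix multiplication
```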
- Writing wrappers for a bunch of topic models, so they take DTMs/TCMs as input and return consistently formatted outputs
- Adding a bunch of topic model utility functions, some of which come from my own research (R-squared, anyone?)
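A sketch of what that workflow looks like end to end — the function and argument names here (`FitLdaModel`, `CalcTopicModelR2`) are from memory and may differ in detail from the package, so treat this as an illustration and consult the textmineR docs:

```r
library(textmineR)

# assume `dtm` is a dgCMatrix document term matrix from CreateDtm()
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200)

# a consistent output format: phi is topics-over-terms,
# theta is documents-over-topics
str(model$phi)
str(model$theta)

# goodness of fit: the R-squared mentioned above
r2 <- CalcTopicModelR2(dtm = dtm, phi = model$phi, theta = model$theta)
r2
```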
- Building textmineR on top of text2vec, which is really, really, really wicked fast. (really)