Friday, November 16, 2018

textmineR v3.0 is here

textmineR version 3 (and up!) is here. This represents a major overhaul. The two most substantive changes are a native implementation of LDA and a more object-oriented take on topic models. The former allows for more flexibility in setting priors and a better Bayesian treatment of model fitting (e.g. averaging over the chain after a pre-determined burn in period). The latter enables a predict method for models, making textmineR's topic models have syntax similar to more traditional models in R. A longer list of changes is below.

  • Several functions that were slated for deletion in version 2.1.3 are now gone.
    • RecursiveRbind
    • Vec2Dtm
    • JSD
    • HellDist
    • GetPhiPrime
    • FormatRawLdaOutput
    • Files2Vec
    • DepluralizeDtm
    • CorrectS
    • CalcPhiPrime
  • FitLdaModel has changed significantly.
    • Now only Gibbs sampling is a supported training method. The Gibbs sampler is no longer wrapping lda::lda_collapsed_gibbs_sampler. It is now native to textmineR. It's a little slower, but has additional features.
    • Asymmetric priors are supported for both alpha and beta.
    • There is an option, optimize_alpha, which updates alpha every 10 iterations based on the value of theta at the current iteration.
    • The log likelihood of the data given estimates of phi and theta is optionally calculated every 10 iterations.
    • Probabilistic coherence is optionally calculated at the time of model fit.
    • R-squared is optionally calculated at the time of model fit.
  • Supported topic models (LDA, LSA, CTM) are now object-oriented, creating their own S3 classes. These classes have their own predict methods, meaning you do not have to do your own math to make predictions for new documents.
  • A new function SummarizeTopics has been added.
  • tm is no longer a dependency for stopwords. We now use the stopwords package. The extended result of this is that there is no longer any Java dependency.
  • Several packages have been moved from "Imports" to "Suggests". The result is a faster install and lower likelihood of install failure based on packages with system dependencies. (Looking at you, topicmodels!)
  • Finally, I have changed the textmineR license to the MIT license. Note, however, that some dependencies may have more restrictive licenses. So if you're looking to use textmineR in a commercial project, you may want to dig deeper into what is/isn't permissable.