Friday, November 16, 2018

textmineR v3.0 is here

textmineR version 3 (and up!) is here. This represents a major overhaul. The two most substantive changes are a native implementation of LDA and a more object-oriented take on topic models. The former allows for more flexibility in setting priors and a better Bayesian treatment of model fitting (e.g. averaging over the chain after a pre-determined burn-in period). The latter enables a predict method for models, giving textmineR's topic models a syntax similar to that of more traditional models in R. A longer list of changes is below.

  • Several functions that were slated for deletion in version 2.1.3 are now gone.
    • RecursiveRbind
    • Vec2Dtm
    • JSD
    • HellDist
    • GetPhiPrime
    • FormatRawLdaOutput
    • Files2Vec
    • DepluralizeDtm
    • CorrectS
    • CalcPhiPrime
  • FitLdaModel has changed significantly.
    • Gibbs sampling is now the only supported training method. The Gibbs sampler no longer wraps lda::lda_collapsed_gibbs_sampler; it is native to textmineR. It's a little slower, but has additional features.
    • Asymmetric priors are supported for both alpha and beta.
    • There is an option, optimize_alpha, which updates alpha every 10 iterations based on the value of theta at the current iteration.
    • The log likelihood of the data given estimates of phi and theta is optionally calculated every 10 iterations.
    • Probabilistic coherence is optionally calculated at the time of model fit.
    • R-squared is optionally calculated at the time of model fit.
  • Supported topic models (LDA, LSA, CTM) are now object-oriented, creating their own S3 classes. These classes have their own predict methods, meaning you do not have to do your own math to make predictions for new documents.
  • A new function SummarizeTopics has been added.
  • tm is no longer a dependency for stopwords. We now use the stopwords package. A knock-on effect is that there is no longer any Java dependency.
  • Several packages have been moved from "Imports" to "Suggests". The result is a faster install and lower likelihood of install failure based on packages with system dependencies. (Looking at you, topicmodels!)
  • Finally, I have changed the textmineR license to the MIT license. Note, however, that some dependencies may have more restrictive licenses. So if you're looking to use textmineR in a commercial project, you may want to dig deeper into what is/isn't permissible.
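
Two of the ideas above, averaging over the chain after a burn-in period and calculating the log likelihood of the data given phi and theta, can be made concrete with a small sketch. The code below is plain Python rather than R, and it is not textmineR's implementation. The function names average_after_burnin and log_likelihood are hypothetical, and the likelihood is computed only up to a constant that does not depend on phi or theta.

```python
import math


def average_after_burnin(samples, burnin):
    """Element-wise mean of the matrices sampled after the burn-in period.

    samples: list of matrices (lists of rows), one per Gibbs iteration.
    burnin:  number of initial iterations to discard.
    """
    kept = samples[burnin:]
    if not kept:
        raise ValueError("burnin must be smaller than the number of samples")
    n_rows, n_cols = len(kept[0]), len(kept[0][0])
    return [
        [sum(m[i][j] for m in kept) / len(kept) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def log_likelihood(dtm, theta, phi):
    """Sum over documents d and terms v of n_dv * log(sum_k theta_dk * phi_kv).

    dtm:   document-term counts, one row per document.
    theta: document-topic probabilities, one row per document.
    phi:   topic-term probabilities, one row per topic.
    """
    total = 0.0
    for d, row in enumerate(dtm):
        for v, count in enumerate(row):
            if count == 0:
                continue  # terms absent from the document contribute nothing
            p = sum(theta[d][k] * phi[k][v] for k in range(len(phi)))
            total += count * math.log(p)
    return total


# Toy chain of theta samples for one document and two topics (made-up numbers).
theta_chain = [[[1.0, 0.0]], [[0.5, 0.5]], [[0.25, 0.75]], [[0.75, 0.25]]]
theta_hat = average_after_burnin(theta_chain, burnin=2)
```

The first two samples are discarded as burn-in and the remaining two are averaged, so a single unlucky iteration has less influence on the final estimate than it would if only the last sample were kept.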

Friday, February 23, 2018

textmineR 2.1.0 is up

Over the weekend I released textmineR 2.1.0 to CRAN (current version here). The current version contains a couple of minor updates and 5 vignettes to get you up and running with text mining.

The vignettes cover the philosophy of textmineR, basic corpus statistics, document clustering, topic modeling, text embeddings (which is basically topic modeling of a term co-occurrence matrix), and building a basic document summarizer. That last vignette uses text embeddings plus a variation of the TextRank algorithm.

The other updates are relatively minor. @manuelbickle discovered that my implementation of CalcProbCoherence was scaled differently from what I'd intended. That's fixed, though it shouldn't affect the qualitative use of probabilistic coherence. Second, I realized that my documentation for CreateTcm was misleading. So, that's now fixed.
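
For readers unfamiliar with probabilistic coherence, here is a rough sketch of the idea in plain Python. It assumes the common description of the measure: for each pair of a topic's top terms (a, b), with a ranked above b, take P(b | a) - P(b) estimated from document frequencies, then average over the pairs. The function name prob_coherence is hypothetical, and the exact scaling used by CalcProbCoherence may differ from this unweighted mean.

```python
def prob_coherence(top_terms, docs):
    """Average of P(b | a) - P(b) over ordered pairs of a topic's top terms.

    top_terms: terms ordered from most to least probable within the topic.
    docs:      list of sets, each holding the terms present in one document.
    """
    n_docs = len(docs)
    diffs = []
    for i, a in enumerate(top_terms):
        docs_with_a = [d for d in docs if a in d]
        for b in top_terms[i + 1:]:
            # Share of all documents that contain b.
            p_b = sum(1 for d in docs if b in d) / n_docs
            # Share of the documents containing a that also contain b.
            if docs_with_a:
                p_b_given_a = sum(1 for d in docs_with_a if b in d) / len(docs_with_a)
            else:
                p_b_given_a = 0.0
            diffs.append(p_b_given_a - p_b)
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy corpus: "a" and "b" always appear together, "c" appears on its own.
toy_docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
score = prob_coherence(["a", "b"], toy_docs)
```

Terms that co-occur more often than chance push the score above zero, while statistically independent terms score close to zero. That is why a change in scaling should leave the ranking of topics largely intact.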