Tuesday, July 15, 2014

Recap: NLP toolkit (focus on R)

A quick synopsis of last week's DC2 presentation on NLP in Python and R: The talk was hosted jointly by Statistical Programming DC, Data Wranglers DC, and Natural Language Processing DC.

Charlie Greenbacker presented on NLP in Python. His code and write up is here.

I presented on NLP in R. My code, slides, and example data are here.

If I had to sum up the big take aways from the R bit...
  • Use R because you are focused on quantitative research. R has big advantages in the quant realm, but sharp edges in terms of memory usage and (sometimes) speed. If you are working on a general programming application, use a GPL.
  • The key data structure is a document term matrix (DTM).
  • Use sparse representations of the DTM so you don't run out of memory.
  • Use linear algebra wherever possible. R likes linear algebra; it's linear algebra functions (e.g. "%*%") are coded in C and are fast.
  • Parallelize wherever possible. You have many choices for easy parallelization in R. I like snowfall.
  • Remember, the DTM is a matrix. Once you have that, it's (mostly) just math from here on out. Have fun!

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.