Charlie Greenbacker presented on NLP in Python. His code and write-up are here.
I presented on NLP in R. My code, slides, and example data are here.
If I had to sum up the big takeaways from the R bit...
- Use R if you are focused on quantitative research. R has big advantages in the quant realm, but sharp edges in terms of memory usage and (sometimes) speed. If you are working on a general programming application, use a general-purpose language instead.
- The key data structure is a document term matrix (DTM).
- Use sparse representations of the DTM so you don't run out of memory.
- Use linear algebra wherever possible. R likes linear algebra; its linear algebra functions (e.g. "%*%") are coded in C and are fast.
- Parallelize wherever possible. You have many choices for easy parallelization in R. I like snowfall.
- Remember, the DTM is a matrix. Once you have that, it's (mostly) just math from here on out. Have fun!
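
To make the DTM points above concrete, here is a minimal sketch rather than the code from the talk. It assumes the Matrix package (which ships with standard R installs), and the three-document corpus and variable names are made up for illustration. It builds a sparse DTM by hand, then gets term counts and document-to-document cosine similarity with nothing but matrix operations.

```r
library(Matrix)  # sparse matrix classes; a recommended package bundled with R

# Toy corpus (hypothetical example data, not from the talk)
docs <- c(
  d1 = "r likes linear algebra",
  d2 = "sparse matrices save memory",
  d3 = "linear algebra on sparse matrices"
)

# Naive whitespace tokenizer; a real pipeline would also handle
# punctuation, stopwords, stemming, and so on
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# One (row, column) pair per token occurrence: row = document, column = term
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# sparseMatrix() sums the values of repeated (i, j) pairs, so this yields counts
dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(names(docs), vocab))

# Corpus-wide term frequencies are just column sums of the DTM
term_freq <- colSums(dtm)

# Cosine similarity: scale each row to unit length, then multiply by the transpose
row_norms <- sqrt(rowSums(dtm^2))
dtm_unit  <- Diagonal(x = as.numeric(1 / row_norms)) %*% dtm
cos_sim   <- tcrossprod(dtm_unit)  # documents x documents, still sparse

# d1 and d2 share no terms, so cos_sim[1, 2] is 0
```

Nothing here ever materializes a dense documents-by-terms matrix, which is the point of the sparse representation. On a real corpus you would swap the hand-rolled tokenizer for a proper text-processing package, but the math after the DTM is built stays the same.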