Thursday, December 14, 2017

A few things I'm working on...

I've got a few things in the pipeline over the next six months or so that I want to get out of my brain and onto paper. Some of them will even end up on this blog!

  1. A proper vignette for textmineR

    It turns out that "here you go, just read the documentation" isn't the best way to get people to use your package. I'm going to write a proper vignette that explains what problem(s) textmineR is trying to solve, describes its framework, and walks through lots of examples of text mining with textmineR.

  2. A derivation of neural nets using matrix math, with example R code

    I took a machine learning class this past semester. One of our assignments was to code a neural network (basically) from scratch. Almost every example I found had the mathematical derivation written as if it were in the middle of a "for" loop. I think this makes the notation cumbersome, and it doesn't let you take advantage of a vectorized programming language (like R). So, I did the derivations myself in matrix form (though I'm sure they exist somewhere else on the internet).
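    To give a flavor of what "matrix form" buys you, here's a minimal sketch (not the derivation I'll write up): a one-hidden-layer network with sigmoid activations and squared-error loss, where the forward and backward passes each run over the whole data set with matrix products instead of a loop over observations. The architecture, learning rate, and toy data here are all my own illustrative choices.

    ```r
    sigmoid <- function(z) 1 / (1 + exp(-z))

    set.seed(42)
    n <- 100; p <- 3; h <- 5                # observations, inputs, hidden units
    X <- matrix(rnorm(n * p), n, p)         # n x p design matrix
    y <- as.numeric(X[, 1] + X[, 2] > 0)    # toy binary target

    W1 <- matrix(rnorm(p * h, sd = 0.1), p, h)  # input-to-hidden weights
    W2 <- matrix(rnorm(h, sd = 0.1), h, 1)      # hidden-to-output weights

    lr <- 0.5
    for (i in 1:500) {
      # forward pass: the whole data set at once, no loop over rows
      H    <- sigmoid(X %*% W1)    # n x h hidden activations
      yhat <- sigmoid(H %*% W2)    # n x 1 predictions

      # backward pass, also in matrix form
      d2 <- (yhat - y) * yhat * (1 - yhat)   # n x 1 output-layer deltas
      d1 <- (d2 %*% t(W2)) * H * (1 - H)     # n x h hidden-layer deltas

      W2 <- W2 - lr * t(H) %*% d2 / n        # gradient steps, averaged over n
      W1 <- W1 - lr * t(X) %*% d1 / n
    }

    mean((yhat > 0.5) == y)  # training accuracy on the toy data
    ```

    The updates look just like the math: each delta and each gradient is a single matrix expression, so the code maps term-by-term onto the derivation rather than onto a loop index.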

  3. Calculating marginal effects for arbitrary predictive models

    I've had this idea kicking around in my head for quite some time. It builds on ICE curves and my friend Abhijit's work. Abhijit has volunteered to work with me to turn it into a proper paper and R package. For now, the code is here.

  4. Updating textmineR's code base

    I want to write my own implementations of some topic models in C++; I'm planning to do this over the summer. The main push is to write a parallel Gibbs sampler for LDA and to allow for asymmetric priors. I am (still) doing topic model research for my dissertation, so implementing some topic models from scratch will be good practice for me and (hopefully) useful to the community. I may also implement DTM and TCM calculation myself. If I do all of that, I may be able to change textmineR's license to the more permissive MIT license, which I'd like to do.

  5. Using topic models and/or word embeddings to track narrative arcs within (longer) documents

    So, I literally just thought of this last night as I was going to bed. The gist: build a topic model (either traditionally or by using word embeddings) off of a corpus, then predict topic distributions over a sliding window within a document. This should create several time series of topics. One can then use regime detection and lagging to parameterize how the narrative changes and relates to itself throughout the document. I have no idea if this will work or whether it's already been done.
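    The sliding-window part of the idea could be sketched like this. Everything here is hypothetical: `narrative_arc` and `toy_predict` are names I'm making up for illustration, and `toy_predict` (keyword counts over two fake "topics") stands in for a fitted topic model's prediction method.

    ```r
    # Slide a window through a tokenized document; each window position
    # yields one topic distribution, so the result is a matrix whose
    # columns are time series of topic proportions.
    narrative_arc <- function(words, predict_fun, window = 50, step = 10) {
      starts <- seq(1, max(length(words) - window + 1, 1), by = step)
      t(sapply(starts, function(s) predict_fun(words[s:(s + window - 1)])))
    }

    # Stand-in for a real model: smoothed keyword counts over two "topics"
    toy_predict <- function(chunk) {
      counts <- c(war  = sum(chunk %in% c("battle", "sword")),
                  love = sum(chunk %in% c("heart", "kiss")))
      (counts + 1) / sum(counts + 1)
    }

    set.seed(1)
    doc  <- sample(c("battle", "sword", "heart", "kiss", "the", "a"), 500, TRUE)
    arcs <- narrative_arc(doc, toy_predict)
    dim(arcs)  # windows x topics: the time series to feed regime detection
    ```

    Each column of `arcs` is a topic's trajectory through the document, which is the input the regime-detection step would consume.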

I'm hoping to get (1) and (2) out sometime between now and January 10 (basically before classes start again). I hope (3) will be done by JSM this August. (I guess that means I should submit an abstract?) And I hope (4) will be done by September (when Fall classes start, my last semester of classes). I have no idea if and when I'll tackle (5).