Thursday, December 14, 2017

A few things I'm working on...

I've got a few things in the pipeline over the next six months or so that I want to get out of my brain and onto paper. Some of them will even end up on this blog!


  1. A proper vignette for textmineR

    It turns out that "here you go, just read the documentation" isn't the best way to get people to use your package. I am going to write a proper vignette that explains what problem(s) textmineR is trying to solve, describes its framework, and gives lots of examples of text mining with textmineR.

  2. A derivation of neural nets using matrix math, with example R code

    I took a machine learning class this past semester. One of our assignments was to code a neural network (basically) from scratch. Almost every example I found had the mathematical derivation written as if it were in the middle of a "for" loop. I think this makes the notation cumbersome, and it doesn't let you leverage a vectorized programming language like R. So, I did the derivations myself (though I'm sure they exist somewhere else on the internet). A minimal sketch of what I mean appears after this list.

  3. Calculating marginal effects for arbitrary predictive models

    I've had this idea kicking around in my head for quite some time. It builds on ICE curves and my friend Abhijit's work. Abhijit has volunteered to work on it with me to turn it into a proper paper and R package. For now, the code is here.

  4. Updating textmineR's code base

    I want to write my own implementations of some topic models in C++. I'm planning to do this over the summer. The main push is to write a parallel Gibbs sampler for LDA and to allow for asymmetric priors. I am (still) doing topic model research for my dissertation. Implementing some topic models from scratch will be good practice for me and (hopefully) useful to the community. I may also implement DTM and TCM calculation on my own too. If I do all of that, I may be able to change textmineR's license to the more permissive MIT license. I'd like to do that.

  5. Using topic models and/or word embeddings to track narrative arcs within (longer) documents

    So, I literally thought of this last night as I was going to bed. The gist: build a topic model (either traditionally or using word embeddings) from a corpus, then predict topic distributions over a sliding window within a document. This should create several time series, one per topic. Then one can use regime detection and lagging to parameterize how the narrative changes and relates to itself throughout the document. I have no idea whether this will work or whether it's already been done.
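
Since (2) is freshest in my mind, here's a minimal sketch of what I mean by a vectorized derivation: one hidden layer, sigmoid activations, squared-error loss, and the whole batch of observations updated at once with matrix math instead of a "for" loop over rows. All names here are made up for illustration, and I've left out bias terms to keep it short.

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(42)
n <- 100; p <- 5; h <- 3              # observations, inputs, hidden units
X <- matrix(rnorm(n * p), n, p)       # each row is one observation
y <- matrix(rbinom(n, 1, 0.5), n, 1)  # binary targets

W1 <- matrix(rnorm(p * h, sd = 0.1), p, h)  # input-to-hidden weights
W2 <- matrix(rnorm(h, sd = 0.1), h, 1)      # hidden-to-output weights

lr <- 0.1
for (iter in 1:1000) {
  # forward pass: every observation at once
  H     <- sigmoid(X %*% W1)   # n x h hidden activations
  y_hat <- sigmoid(H %*% W2)   # n x 1 predictions

  # backward pass: still all matrix products, no loop over rows
  d_out <- (y_hat - y) * y_hat * (1 - y_hat)  # n x 1
  d_hid <- (d_out %*% t(W2)) * H * (1 - H)    # n x h

  W2 <- W2 - lr * t(H) %*% d_out / n
  W1 <- W1 - lr * t(X) %*% d_hid / n
}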

I'm hoping to get (1) and (2) out sometime between now and January 10 (basically, before classes start again). I hope (3) will be done by JSM this August. (I guess that means I should submit an abstract?) And I hope (4) will be done by September, when fall classes start for my last semester of coursework. I have no idea if or when I'll tackle (5).

Monday, October 16, 2017

textmineR has a logo



I was at the EARL conference in San Francisco a couple of months ago and got some inspiration from Airbnb. Airbnb has its own R package that it uses internally. To gin up interest and encourage employees to use it and contribute to it, they distributed swag.

So, in that vein, I present the textmineR logo. I'll be getting stickers made and throwing them around wherever I am.

I acknowledge this is the easy way out. I still need to write a vignette for textmineR. I'm probably being naive, but I expect to get several papers out when I'm done with classes and in dissertation land. So, maybe then?

Thursday, October 12, 2017

Anyone else think implementing back propagation from scratch is kind of fun?


from Twitter https://twitter.com/thos_jones

Wednesday, June 1, 2016

Weird Error: fatal error in wrapper code

I suppose I'm publishing this so that I can save the next programmer the effort of tracking down the source of a weird error thrown by mclapply.
fatal error in wrapper code
What? The cause, according to this, is (I think) that mclapply is using too many threads or taking up too much memory. That's pretty consistent with the code I was running at the time.

The solution, then, would be to use fewer threads. I'll re-run the code tonight and see if that fixes it. I'll update this post when I confirm or reject the solution.
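
In the meantime, the workaround looks something like this. The list and function below are stand-ins for my real code; the only change that matters is asking for fewer workers:

library(parallel)

big_list      <- as.list(1:1000)                           # stand-in for the real inputs
slow_function <- function(x) { Sys.sleep(0.001); sqrt(x) } # stand-in for the real work

results <- mclapply(
  big_list,
  slow_function,
  mc.cores = 4  # fewer forked processes means less memory in use at once
)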

Monday, May 2, 2016

textmineR



I (quietly) released an R package back in January: textmineR. It's a text mining tool designed for usability, following three basic principles:


  1. Syntax that is idiomatic to R (basically, useRs will find it intuitive)
  2. Maximal integration with the wider R ecosystem (use base classes as much as possible, use widely-adopted classes in all other cases)
  3. Scalable, because NLP data is pretty big

I implemented this in textmineR by
  1. Making the document-term matrix (DTM) and term co-occurrence matrix (TCM) the central mathematical objects in the workflow
  2. Creating DTMs and TCMs in one step, with options given as arguments
  3. Expecting documents to be nothing more than character vectors
  4. Letting you store your corpus metadata however you want. I like data frames (which is probably what you created when you read the data into R anyway)
  5. Making DTMs and TCMs of class dgCMatrix from the Matrix library. These objects...
    1. are widely adopted
    2. have methods and functions making their syntax familiar to base R dense matrices
  6. Writing wrappers for a bunch of topic models, so they take DTMs/TCMs as input and return similarly-formatted outputs
  7. Adding a bunch of topic model utility functions, some of which come from my own research (R-squared, anyone?)
  8. Building textmineR on top of text2vec, which is really, really, really wicked fast. (really)
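
To make that concrete, here's roughly what the core workflow looks like. (This is from memory, so argument names may differ slightly between versions; check the documentation.)

library(textmineR)

# documents are nothing more than a named character vector
docs <- c(doc_1 = "the cat sat on the mat",
          doc_2 = "a dog chased the cat around the mat")

# DTM creation happens in one step; options are just arguments
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))

class(dtm)  # "dgCMatrix", straight from the Matrix package
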
I gave a talk at Statistical Programming DC on Wednesday night announcing the package. The turnout was great (terrifying), and I received a bunch of great feedback.

I walked into work Friday morning and found a bug that affects all Windows users. :(

Fortunately, the development version is patched and ready to rock!

Unfortunately, I blew my one chance to get a rapid patch up on CRAN by pushing another patch the previous weekend. We'll have to wait until the end of May for me to push the latest patch. I'm afraid that all the goodwill that came from the talk will be squandered when half (or more) of the people who try it end up getting errors.

Hopefully, that will give me time to finish writing the vignette.

Thursday, November 5, 2015

More on statisticians in data science



The November issue of AMSTAT News includes an opinion piece by yours truly on the identity of statisticians in data science. My piece starts on page 25 of the print version. The online version is here. A quote:

I am not convinced that statistics is data science. But I am convinced that the fundamentals of probability and mathematical statistics taught today add tremendous value and cement our identity as statisticians in data science.
Please read the whole thing.


Friday, May 8, 2015

Oops.

I had a little accident yesterday.


What happened? When creating a matrix of zeros, I accidentally typed matrix() instead of Matrix().
What's the difference? About 4.8 terabytes versus less than one gigabyte. I was creating a document-term matrix of about 100,000 documents with a vocabulary of about 6,000,000 tokens. This is the thing with linguistic data: one little mistake is the difference between working on a MacBook Air with no fuss and something that would make a supercomputer choke. (Anyone want to get me a quote on hardware with 5 TB of RAM?)
What's the difference? 4.8 terabytes versus less than one GB. I was creating a document-term-matrix of about 100,000 documents with a vocabulary of about 6,000,000 tokens. This is the thing with linguistic data: one little mistake is the difference between working on a Macbook Air with no fuss and something that would make a super computer choke. (Anyone want to get me a quote on hardware with 5 TB of RAM?)