Thursday, July 24, 2014

An update on "A Comparison of Programming Languages in Economics"

A couple weeks ago, I beat up an NBER working paper, A Comparison of Programming Languages in Economics. My response was a bit harsh though not entirely unfounded. However, there were some points on which I was clearly wrong, as I'll explain below.

The post also led to a fruitful back and forth between the authors, particularly Jesus Fernandez-Villaverde, and me. (I gather that others responded to the earlier version as well.) The result is an updated and much-improved version of their working paper. It incorporates the usage of Rcpp and includes a note on vectorizing the problem. In the end, I think the new version does R justice in its advantages and limitations.

In my earlier post, I made three basic points:

  1. One must consider a programming language's preferred paradigm (functional programming, vector operations, etc.) in comparing languages. This affects the speed of different problems and the human burden of coding.
  2. R's paradigm is geared toward vector operations. If a vectorized approach exists, R will almost uniformly perform it much faster than a loop. (For loops aren't always bad in R on their own; nesting them together is.)
  3. R fails hard on problems that don't vectorize (or don't vectorize well). This makes functional knowledge of C/C++ a "must" for R coders, at least intermediate to advanced ones. Rcpp makes integrating C++ code in R incredibly easy.
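To make point 2 concrete, here is a quick illustrative benchmark (the numbers will depend on your machine, but the gap is always there): a loop that squares each element versus the equivalent one-line vectorized call.

```r
n <- 1e5
x <- runif(n)

# Loop version: square each element one at a time
loop_square <- function(x) {
  out <- numeric(length(x))  # preallocate the result
  for (j in seq_along(x)) {
    out[j] <- x[j]^2
  }
  out
}

# Vectorized version: one call, executed in C under the hood
vec_square <- function(x) x^2

system.time(loop_square(x))
system.time(vec_square(x))

identical(loop_square(x), vec_square(x))  # same result, very different speed
```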
In my email exchanges with the authors, I also raised an additional point. 
  4. Dynamically re-assigning values to objects in the middle of loops greatly impacts performance. 
An example of the above and its assumed-preferred alternative is below.

    for (j in 1:100) {
        x <- j + 1
    }

The assumed better alternative would be

    x <- rep(0, 100)

    for (j in 1:100) {
        x[j] <- j + 1
    }


The stuff I got wrong


Last things first: dynamic reassignment of a scalar is considerably faster than initializing a vector beforehand. Dynamic reassignment of a vector, on the other hand, is slower than initializing the vector beforehand. This was my mistake. 
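You can check this yourself with a quick, illustrative timing of three variants of the toy loop above: reassigning a scalar, filling a preallocated vector, and growing a vector one element at a time (the truly slow pattern).

```r
n <- 1e4

scalar_reassign <- function(n) {
  for (j in 1:n) x <- j + 1   # overwrite a scalar each iteration
  x
}

prealloc <- function(n) {
  x <- rep(0, n)              # initialize the full vector first
  for (j in 1:n) x[j] <- j + 1
  x
}

grow <- function(n) {
  x <- numeric(0)
  for (j in 1:n) x <- c(x, j + 1)  # grow the vector each iteration
  x
}

system.time(scalar_reassign(n))
system.time(prealloc(n))
system.time(grow(n))  # expect this one to be far slower
```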

Second, my knee-jerk reaction to seeing triple-nested for loops inside of a while loop was understandable for the general case but off the mark in this specific case. As the new copy of the working paper indicates, vectorization does not work well for this problem. This does, however, highlight one of R's limitations. If your problem does not vectorize and cannot be run in parallel, you're kind of screwed if you don't know another language.

The stuff that was worth listening to


For better or worse, C/C++ are a big part of optimizing R. When we say "R is optimized for vector operations," we really mean "R's vector operations, like matrix algebra, are actually written in C." As a result, knowing C/C++ (or maybe even Fortran) is a part of R. Fortunately, Rcpp makes this fairly painless and dispenses with much of the confusing programming overhead involved in C++ coding. 
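A minimal sketch of what that looks like, assuming Rcpp is installed (the function name and body here are my own illustration, not anything from the paper):

```r
library(Rcpp)

# Compile and expose a C++ function from inline source.
# cppFunction() handles compilation and wrapping for you.
cppFunction('
double sum_sq(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];
  }
  return total;
}')

sum_sq(c(1, 2, 3))  # callable like any other R function
```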

The problem that Drs Aruoba and Fernandez-Villaverde used for their speed test is a prime candidate for some Rcpp. It is also an example of what I mean by adopting a language's assumed paradigm. The "correct" R solution to this problem is to code it in C++, not R. Use R to reshape the data on the front end, call the C++ code that does the main analysis, and then use R to reshape the result, make your graphics, etc.

One last thing


I am still dubious of the utility of speed tests for most research applications. I've certainly run into times when speed matters, but those tend to be the exception rather than the rule. The authors, though, made two fair points in our emails: speed is easy to measure, while other criteria are not; and many modern macroeconomic problems can take weeks to estimate in C++ or similar, making speed a more mainstream concern.

R is also poorly represented in comparisons of coding speed, and I'm not talking about using Rcpp. In 90% or more of the cases I've seen where R is performing abysmally, it is because of user error. Specifically, the R code is written like C/C++ code, with lots of loops and "if" statements, rather than matrix operations and logical vectors. Writing R code like it's... well... meant for R is generally the best fix. 
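A typical case of that pattern (a made-up snippet, but one I see constantly): thresholding values with a loop and an "if", versus letting a logical vector do the indexing.

```r
x <- c(-2, 5, -1, 7, 0, 3)

# The C-style version: loop plus "if"
y_loop <- numeric(length(x))
for (j in seq_along(x)) {
  if (x[j] > 0) {
    y_loop[j] <- x[j]
  }
}

# The R-style version: a logical vector does the work
y_vec <- numeric(length(x))
y_vec[x > 0] <- x[x > 0]

identical(y_loop, y_vec)  # same answer, one idiom is native to R
```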

Forgive me for relying on anecdote, but I've written R code that works as well or better than some Python code. The issue was with how the code was written, not a property of the language. (I am sure the "best" Python code is faster than the "best" R code, but how often is it the case that we're all writing our "best" code?)

With those caveats: Drs Aruoba and Fernandez-Villaverde's new paper is not one I'd take much issue with, at least in its treatment of R. They present three clear and easy approaches to their problem in R: raw R code, compiled R code, and Rcpp; the result is plain to see. Unlike other comparisons I've seen, this paper shows you how fast R really is compared to other languages.

Tuesday, July 15, 2014

Recap: NLP toolkit (focus on R)

A quick synopsis of last week's DC2 presentation on NLP in Python and R: The talk was hosted jointly by Statistical Programming DC, Data Wranglers DC, and Natural Language Processing DC.

Charlie Greenbacker presented on NLP in Python. His code and write-up are here.

I presented on NLP in R. My code, slides, and example data are here.

If I had to sum up the big takeaways from the R bit...
  • Use R because you are focused on quantitative research. R has big advantages in the quant realm, but sharp edges in terms of memory usage and (sometimes) speed. If you are working on a general programming application, use a general-purpose language.
  • The key data structure is a document term matrix (DTM).
  • Use sparse representations of the DTM so you don't run out of memory.
  • Use linear algebra wherever possible. R likes linear algebra; its linear algebra functions (e.g. "%*%") are coded in C and are fast.
  • Parallelize wherever possible. You have many choices for easy parallelization in R. I like snowfall.
  • Remember, the DTM is a matrix. Once you have that, it's (mostly) just math from here on out. Have fun!
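To make the sparse-DTM and linear-algebra points concrete, here is a toy sketch using the Matrix package (the documents and vocabulary are made up for illustration):

```r
library(Matrix)

# A toy document-term matrix: 3 documents, 4 terms, stored sparsely
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),          # document indices
  j = c(1, 2, 2, 3, 1, 4),          # term indices
  x = c(2, 1, 3, 1, 1, 2),          # term counts
  dims = c(3, 4),
  dimnames = list(paste0("doc", 1:3),
                  c("data", "model", "speed", "vector"))
)

term_totals <- colSums(dtm)          # corpus-wide term frequencies

# Document-by-document term overlap via matrix algebra: DTM %*% t(DTM)
overlap <- dtm %*% t(dtm)
```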

Friday, July 11, 2014

On the lack of Friday links lately

The summer of 2014 is turning into one of the busiest I've yet had. Between work, courses, presentations, and friends getting married, I've been spending less time reading which means fewer links for the blog. :(

On deck over the next couple weeks though:

  • I'll post slides from Wednesday night's talk along with a little write-up.
  • Related to the talk, actually, I'll give my overdue review of the high-performance computing in R workshop.
  • I'll hopefully get a preview of what's going to be presented at JSM up here for all of you that love topic models as much as I do. (Also, I now have a co-author. Ish just got real.)
Stay tuned!

Example 47,385 of poor coding being blamed on R

I am beginning to think of speed tests of programming languages as being just about useless. An NBER working paper, A Comparison of Programming Languages in Economics, (un-gated version here) is only reinforcing the point.

The authors run a common macroeconometric model in a handful of languages and compare speeds. There is a twist, however. 

To make the comparison as unbiased as possible, we coded the same algorithm in each language (which could reflect more about our knowledge of each language than its objective virtues.)

In my mind, this is a poor choice and is itself more biased. Part of choosing a programming language is choosing your programming paradigm. A fairer comparison would be to adopt the dominant paradigm of each language and compare speeds that way. This choice essentially pits poorly-written R code against well-written C++ code. How is this helpful?

The authors also make the following claims:

Issues such as avoiding loops through vectorization, which could help Matlab or R, are less important in our case. With 17,820 entries in a vector, vectorization rarely helps much in comparison with standard loops.

We did not explore the possibility of mixing language programming such as Rcpp in R. While such alternatives are often useful (although cumbersome to implement), a detailed analysis falls beyond the scope of this paper.

First, I have found that the larger the vector, the more important vectorization is in R. (Edit 7/14/2014 - In fairness, I am not sure that the authors' procedure would vectorize well anyway.) Second, I do not buy the implication that coding in C++ on its own is somehow much less cumbersome than calling a single Rcpp function to import that C++ code as a function to be called from R.

Their R code, by the way, had an "if" statement in the middle of 3 nested "for" loops which were themselves nested in a "while" loop. Ummmm.... yeah.

Oh, yes. I do believe that R code written that way was "500 to 700 times slower than C++." Imagine that...

Wednesday, July 9, 2014

Today! NLP Toolkit: Python and R

Data Community DC (DC2) is holding a joint meetup with Statistical Programming DC, Data Wranglers DC, and Natural Language Processing DC. The topic: Natural Language Processing (NLP) tools. The first presentation is on Python; the second (presented by yours truly) is on R.

There is more on DC2's blog.

I am both excited and terrified. I'm excited because I love using R for NLP. I'm terrified because we've got 350 RSVPs.

Friday, June 20, 2014

Geekcitement

Pirates code in arrrrrrr (R).
Source: http://constructingkids.com/2013/08/23/coding-pirates-beta-test-invitation/


I'm really excited to be attending DC2's workshop tomorrow, High-performance Computing in R. I'll post a recap here after the class.

I've gotten quite good at R coding over the last couple years. I've posted before that those who would complain about R's speed are likely not putting enough thought into their coding. R is a vectorized language. So, study your matrix algebra, people! (Also, note that R is an interpreted language; it can only go so fast even at its best.)

There are, however, some things that you can't vectorize. Gibbs sampling, for example, is index-dependent: your sample at iteration j + 1 depends on the result of your sample at iteration j. For this, C, C++, and Fortran play nicely with R.
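The index dependence is easy to see in a toy chain (a stand-in illustration, not a real Gibbs sampler): each draw needs the previous one, so the loop can't be replaced by a single vectorized call.

```r
set.seed(42)

# A toy AR(1)-style chain: x[j + 1] depends on x[j],
# so each iteration must wait for the one before it.
n <- 1000
x <- numeric(n)
x[1] <- 0
for (j in 1:(n - 1)) {
  x[j + 1] <- 0.9 * x[j] + rnorm(1)
}
```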

There's also the issue of parallelization. I use the snowfall package regularly with great success. I am lately interested in playing with CUDA-enabled GPUs to get more cores. Not sure if this'll be covered in the course, but we'll see.
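For the curious, a minimal snowfall sketch, assuming the package is installed (the worker function here is just an illustration):

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 2)   # start a 2-worker cluster

# A toy embarrassingly-parallel job: one bootstrap mean per replicate
boot_mean <- function(i) mean(sample(1:100, 50, replace = TRUE))

sfExport("boot_mean")               # ship the function to the workers
results <- sfLapply(1:8, boot_mean)

sfStop()                            # always shut the cluster down
```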

Friday, May 30, 2014

Friday links: May 30, 2014

Image via Simply Statistics


Explanation vs prediction as the goal of statistical models - H/T Majid alDosari@msdtechcode

Talking about uncertainty in science to lay audiences (possible repost)

Whole lotta slides about R and finance - H/T Revolutions

Big data is a social construct I

Big data is a social construct II - source of the image above. Note that (even using real, non-whiteboard, data) we are on trend in terms of data size. We're above trend for data utilization.

What statistics teaches us about big data