Thursday, July 24, 2014

An update on "A Comparison of Programming Languages in Economics"

A couple weeks ago, I beat up an NBER working paper, A Comparison of Programming Languages in Economics. My response was a bit harsh though not entirely unfounded. However, there were some points on which I was clearly wrong, as I'll explain below.

The post also led to a fruitful back-and-forth between the authors, particularly Jesus Fernandez-Villaverde, and me. (I gather that others responded to the earlier version as well.) The result is an updated and much-improved version of their working paper. It incorporates the use of Rcpp and includes a note on vectorizing the problem. In the end, I think the new version does justice to both R's advantages and its limitations.

In my earlier post, I made three basic points:

  1. One must consider a programming language's preferred paradigm (functional programming, vector operations, etc.) in comparing languages. This affects the speed of different problems and the human burden of coding.
  2. R's paradigm is geared towards vector operations. If a vectorized approach exists, R will almost uniformly perform it much faster than the equivalent loop. (For loops aren't always bad in R; nesting them is.)
  3. R fails hard on problems that don't vectorize (or don't vectorize well). This makes functional knowledge of C/C++ a "must" for R coders, at least intermediate to advanced ones. Rcpp makes integrating C++ code in R incredibly easy.

In my email exchanges with the authors, I also raised an additional point.

  1. Dynamically re-assigning values to objects in the middle of loops greatly impacts performance.

An example of the above and its assumed-preferred alternative is below.

    for( j in 1:100){
        x <- j + 1
    }

The assumed better alternative would be

    x <- rep(0, 100)

    for( j in 1:100){
        x[ j ] <- j + 1
    }


The stuff I got wrong


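Last things first: dynamically reassigning a scalar inside a loop (the first example above) is considerably faster than filling a vector that was initialized beforehand (the second example). What is slow is dynamically reassigning a vector inside the loop, for example by growing it one element at a time; that is slower than initializing the vector beforehand. This was my mistake.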
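For anyone who wants to check this on their own machine, here is a minimal timing sketch. The function names and the loop length `n` are my own illustrative choices, not anything from the working paper, and the timings will vary by machine and R version.

```r
# Hypothetical timing sketch; n and the function names are illustrative only.
n <- 1e4

# (a) Overwrite a scalar on every iteration.
scalar_reassign <- function(n) {
  x <- 0
  for (j in seq_len(n)) x <- j + 1
  x
}

# (b) Fill a vector that was allocated before the loop.
preallocated <- function(n) {
  x <- numeric(n)
  for (j in seq_len(n)) x[j] <- j + 1
  x
}

# (c) Grow a vector one element at a time inside the loop.
grown <- function(n) {
  x <- numeric(0)
  for (j in seq_len(n)) x <- c(x, j + 1)
  x
}

# Compare elapsed times; increase n to make the differences easier to see.
system.time(scalar_reassign(n))
system.time(preallocated(n))
system.time(grown(n))
```

Versions (b) and (c) return the same vector, so only the cost of building it differs.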




Second, my knee-jerk reaction to seeing triple-nested for loops inside of a while loop was understandable for the general case but off the mark in this specific case. As the new copy of the working paper indicates, vectorization does not work well for this problem. This does, however, highlight one of R's limitations. If your problem does not vectorize and cannot be run in parallel, you're kind of screwed if you don't know another language.

The stuff that was worth listening to


For better or worse, C/C++ are a big part of optimizing R. When we say "R is optimized for vector operations," we really mean "R's vector operations, like matrix algebra, are actually written in C." As a result, knowing C/C++ (or maybe even FORTRAN) is a part of R. Fortunately, Rcpp makes this fairly painless and dispenses with much of the confusing programming overhead involved with C++ coding.

The problem that Drs Aruoba and Fernandez-Villaverde used for their speed test is a prime candidate for some Rcpp. This is also an example of what I mean by adopting a language's assumed paradigm. The "correct" R solution to this problem is to code it in C++, not R. Use R to reshape the data on the front end, call the C++ code that does the main analysis, and then use R to reshape the result, make your graphics etc.
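To make that workflow concrete, here is a minimal sketch. It is not the authors' model: the kernel is a made-up recursion in which each value depends on the previous one, and the name `ar1_cpp` and its arguments are hypothetical. It assumes the Rcpp package and a working C++ compiler are installed.

```r
# Assumes the Rcpp package and a C++ toolchain are available.
library(Rcpp)

# Hypothetical kernel: y[i] = rho * y[i - 1] + x[i], starting from y[0] = 0.
# Each step needs the previous result, so it is written as a plain C++ loop.
cppFunction('
NumericVector ar1_cpp(NumericVector x, double rho) {
  int n = x.size();
  NumericVector out(n);
  double prev = 0.0;
  for (int i = 0; i < n; ++i) {
    prev = rho * prev + x[i];
    out[i] = prev;
  }
  return out;
}')

x <- c(1, 2, 3)             # prepare the data in R
y <- ar1_cpp(x, rho = 0.5)  # do the main computation in C++
summary(y)                  # back in R for summaries, graphics, etc.
```

The compiled function is called like any other R function, so the R code on either side of it does not need to change.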

One last thing


I am still dubious of the utility of speed tests for most research applications. I've certainly run into times when speed matters, but those tend to be the exception rather than the rule. The authors did, however, make two points in our emails: speed is easy to measure, while other criteria are not; and many modern macroeconomic problems can take weeks to estimate in C++ or similar, making speed a more mainstream concern.

R is also poorly represented in speed comparisons generally, and I'm not talking about using Rcpp. In 90% or more of the cases I've seen where R is performing abysmally, it is because of user error. Specifically, the R code is written like C/C++ code, with lots of loops and "if" statements, rather than with matrix operations and logical vectors. Writing R code like it's... well... meant for R, is generally the best fix.
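A small illustration of the difference, as a sketch only: the task (setting negative values to zero) and the function names are invented for this example.

```r
set.seed(1)        # arbitrary seed, only so the example is reproducible
x <- rnorm(1e5)

# C-style: visit each element and branch on it.
clip_loop <- function(x) {
  for (i in seq_along(x)) {
    if (x[i] < 0) x[i] <- 0
  }
  x
}

# R-style: the logical vector x < 0 selects every negative element at once.
clip_vec <- function(x) {
  x[x < 0] <- 0
  x
}

# Both versions give the same answer; only the style (and the speed) differs.
stopifnot(identical(clip_loop(x), clip_vec(x)))

system.time(clip_loop(x))
system.time(clip_vec(x))
```

The second version is also shorter and, to my eye, easier to read.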

Forgive me for relying on anecdote, but I've written R code that works as well as or better than some Python code. The issue was with how the code was written, not a property of the language. (I am sure the "best" Python code is faster than the "best" R code, but how often is it the case that we're all writing our "best" code?)

With those caveats: Drs Aruoba and Fernandez-Villaverde's new paper is not one I'd take much issue with, at least in its treatment of R. They cover three clear and easy approaches to their problem in R: raw R code, compiled R code, and Rcpp. The results are plain to see. Unlike other comparisons I've seen, this paper shows you how fast R really is compared to other languages.

2 comments:

  1. (you just know that i'm going to pop up here!)

    interesting stuff.

    without devaluing the merits of R, pushing R on execution speed is futile. it's just not its thing and you end up with code that's got little tricks all over in an attempt to keep up speed which makes things difficult to follow.

    matrix and vector ops are not going away but neither is writing loops. it's simply much clearer to go ahead and write that in compiled Python, C/C++, Julia, or Fortran.

  2. I was wondering why you hadn't commented on the first post, actually. :)

    I have found that my fastest R code (with some exceptions) is also cleaner to read and follow. That said, R will never be a "fast" language.

    For further interest in advanced R programming though, check out this site: http://adv-r.had.co.nz/
