Monday, August 18, 2014

From JSM 2014: Steven Stigler's Seven Pillars of Statistics

The full list plus explanations can be found here.

In response to those that fall in the "more data means we don't have to worry about anything camp":

The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the "root n rule." The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.

I have noticed lately that when I tell people they might be better off with a well-collected sample, rather than trying to get "all the data" they look at me like I've lost my mind.

Then there's this:

Design: R. A. Fisher, in an address to the Indian Statistical Congress (1938) said "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." 
Of course, maybe actually I have lost my mind; I chose to be a statistician. :)

Tuesday, August 12, 2014

Pew canvasses experts on "AI, Robotics, and the Future of Jobs"

From the report:

Key themes: reasons to be hopeful
  1. Advances in technology may displace certain types of work, but historically they have been a net creator of jobs.
  2. We will adapt to these changes by inventing entirely new types of work, and by taking advantage of uniquely human capabilities.
  3. Technology will free us from day-to-day drudgery, and allow us to define our relationship with “work” in a more positive and socially beneficial way.
  4. Ultimately, we as a society control our own destiny through the choices we make.

Key themes: reasons to be concerned
  1. Impacts from automation have thus far impacted mostly blue-collar employment; the coming wave of innovation threatens to upend white-collar work as well.
  2. Certain highly-skilled workers will succeed wildly in this new environment—but far more may be displaced into lower paying service industry jobs at best, or permanent unemployment at worst.
  3. Our educational system is not adequately preparing us for work of the future, and our political and economic institutions are poorly equipped to handle these hard choices.

Read the full report here.

Thursday, July 24, 2014

An update on "A Comparison of Programming Languages in Economics"

A couple weeks ago, I beat up an NBER working paper, A Comparison of Programming Languages in Economics. My response was a bit harsh though not entirely unfounded. However, there were some points on which I was clearly wrong, as I'll explain below.

The post also lead to a fruitful back and forth between the authors, particularly Jesus Fernandez-Villaverde, and me. (I also gather that others responded to the earlier version as well.) The result is an updated and much-improved version of their working paper. It incorporates the usage of Rcpp and includes a note on vectorizing the problem. In the end, I think the new version does R justice in its advantages and limitations.

In my earlier post, I made three basic points:

  1. One must consider a programming language's preferred paradigm (functional programming, vector operations, etc.) in comparing languages. This affects the speed of different problems and the human burden of coding.
  2. R's paradigm is geared towards vector operations. If a vectorized approach exists, R will almost uniformly perform it much faster than in a loop. (While for loops aren't always bad in R. Nesting them together is.)
  3. R fails hard on problems that don't vectorize (or don't vectorize well). This makes functional knowledge of C/C++ a "must" for R coders, at least intermediate to advanced ones. Rcpp makes integrating C++ code in R incredibly easy.
In my email exchanges with the authors, I also raised an additional point. 
  1. Dynamically re-assigning values to objects in the middle of loops greatly impacts performance. 
An example of the above and its assumed-preferred alternative is below.

    for( j in 1:100){
        x <- j + 1

The assumed better alternative would be

    x <- rep(0, 100)

    for( j in 1:100){
        x[ j ] <- j + 1

The stuff I got wrong

Last thing's first: Dynamic reassignment of a scalar is considerably faster than initializing a vector beforehand. Dynamic reassignment of a vector is slower than initializing the vector beforehand. This is my mistake. 

Second, my knee-jerk reaction to seeing triple-nested for loops inside of a while loop was understandable for the general case but off the mark in this specific case. As the new copy of the working paper indicates, vectorization does not work well for this problem. This does, however, highlight one of R's limitations. If your problem does not vectorize and cannot be run in parallel, you're kind of screwed if you don't know another language.

The stuff that was worth listening to

For better or worse, C/C++ are a big part of optimizing R. When we say "R is optimized for vector operations," we really mean "R's vector operations, like matrix algebra, are actually written in C." As a result, knowing C/C++ (or maybe even FORTRAN) is a part of R. Fortunately Rcpp makes this fairly painless and dispatches with much of the confusing programming overhead involved with C++ coding. 

The problem that Drs Aruoba and Fernandez-Villaverde used for their speed test is a prime candidate for some Rcpp. This is also an example of what I mean by adopting a language's assumed paradigm. The "correct" R solution to this problem is to code it in C++, not R. Use R to reshape the data on the front end, call the C++ code that does the main analysis, and then use R to reshape the result, make your graphics etc.

One last thing

I am still dubious of the utility of speed tests for most research applications. I've certainly run into times when speed matters, but those tend to be the exception rather than the rule. Though the authors made two points in our emails: speed is easy to measure, while other criteria are not; and many modern macroeconomic problems can take weeks to estimate in C++ or similar, making speed a more mainstream concern.

R is also poorly-represented in its comparison of coding speed, and I'm not talking about using  Rcpp. In 90% or more of the cases I've seen where R is performing abysmally, it is because of user error. Specifically, the R code is written like C/C++ code, lots of loops and "if" statements, rather than matrix operations and the use of logical vectors. Writing R code like it's... well... meant for R, is generally the best fix. 

Forgive me for relying on anecdote, but I've written R code that works as well or better than some Python code. The issue was with how the code was written, not the property of the language. (I am sure the "best" Python code is faster than the "best" R code, but how often is it the case that we're all writing our "best" code?)

With those caveats: Drs Aruoba and Fernandez-Villaverde's new paper is not one I'd take much issue with, at least with its treatment of R. They give treatment to 3 clear and easy solutions to approaching their problem in R: raw R code, compiled R code, and Rcpp; the result is plain to see. Unlike other approaches I've seen, this paper shows you how fast R really is compared to other languages.