Monday, May 26, 2014

The magical parts of R that you'll miss in pandas

- Gautham

I am a fan of R for data analysis, but for various reasons I am learning to use its best-known competitor: the pandas package written for Python. pandas has achieved considerable clout in the loosely-defined data science community, and is reportedly replacing R in everyday use.

The switch has been quite unpleasant for me so far. As far as I can tell, the main advantages of pandas are:

  1. Speed and efficiency. It is apparently very fast, reportedly even faster than data.table (the R package you should read up on and use if your data.frame calculation is taking too long). Speed is one of the most important concerns for pandas's creator, Wes McKinney.
  2. Integration into larger software. Python is a great language to build software in. pandas is close to the best one could imagine for doing R-like work while within Python. Python is fun and clean for writing modular code. Being able to write the data analysis pipeline and the web server that dispenses the results both in the same language is a great boon.

So it is clear why data scientists are using it. R can't compete with Python's very nice module structure, or its ecosystem of packages for general-purpose programming.

Usually, if you go on the internet and ask about R vs pandas, the main advantage listed for R is the immense number of specialized packages for statistical computing and plotting. That is not at all what I have missed since switching.

It's only certain packages and certain behaviors. It's the magical parts of R that you will miss if you switch: the part where you write a short script with a few ddply, transform, subset and ggplot commands and you're already looking at a beautiful, informative plot of your data.

You might wonder why no other data analysis languages seem to feel like that.

It's because they can't.

All the uniquely sweet aspects of programming data analysis in R have to do with lazy evaluation, its approach to evaluating expressions in function arguments. Almost nothing in Python or MATLAB is allowed to be lazy about evaluating function arguments, so no matter how hard you work, you cannot truly reproduce those features. Instead, at best, you end up messing around with quotation marks everywhere.
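
To make the quotation-mark complaint concrete, here is a minimal sketch in plain Python. The subset helper below is hypothetical (it is not part of pandas or any other library) and only illustrates the general pattern: because a bare expression like price > 2 would be evaluated at the call site, the condition has to be smuggled in as a string and evaluated later against the data.

```python
# Hypothetical helper for illustration only; not a pandas or library API.
def subset(rows, condition):
    """Keep the rows (dicts) for which the quoted `condition` is true.

    The condition must arrive as a string: a bare `price > 2` would be
    evaluated at the call site, where no variable `price` exists.
    """
    code = compile(condition, "<condition>", "eval")
    # Each row's keys act as the variable names visible to the expression.
    return [row for row in rows if eval(code, {"__builtins__": {}}, row)]


rows = [
    {"fruit": "apple", "price": 1.5},
    {"fruit": "mango", "price": 3.0},
    {"fruit": "plum", "price": 2.5},
]

# Note the quotation marks around the condition.
expensive = subset(rows, "price > 2")
print([row["fruit"] for row in expensive])  # ['mango', 'plum']
```

In R, the equivalent call can take the condition unquoted, because the function receives the expression itself rather than its value.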

"When you call a function in MATLAB, MATLAB first evaluates all the inputs, and then passes these (possibly) computed values as the inputs." - Loren Shure (same thing in Python)
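
The eager behavior described in that quote is easy to observe in Python. The sketch below uses made-up function names (expensive, ignore_eagerly, ignore_lazily) and shows that an argument is computed before the call even when the function never uses it, unless the caller wraps it in a lambda by hand.

```python
evaluated = []


def expensive(label):
    # Record the call so we can see when evaluation actually happens.
    evaluated.append(label)
    return label.upper()


def ignore_eagerly(value):
    # `value` was already computed before this body started running.
    return "ignored"


def ignore_lazily(thunk):
    # `thunk` is a zero-argument callable. It is never called here,
    # so the expression wrapped inside it is never evaluated.
    return "ignored"


ignore_eagerly(expensive("eager"))        # expensive() runs anyway
ignore_lazily(lambda: expensive("lazy"))  # expensive() never runs

print(evaluated)  # ['eager']
```

The lambda is the caller's job here, which is the point: the function cannot decide on its own to delay or inspect its argument the way an R function can.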

"R has powerful tools for computing not only on values, but also on the actions that lead to those values. These tools are powerful and magical. If you’re coming from another programming language, they are one of its most surprising features" - Hadley Wickham


In fact, for the first few months I used R, I hesitated to use bare, unquoted expressions because they felt unfamiliar and scary. But after a while they became an irreplaceable part of my workflow. If you don't need your data analysis to fit into a larger piece of software, or to be blazing fast, stick to R's expressiveness. You won't find that magic anywhere else.

(For an excellent description of R's magic by R's greatest magician, see: http://adv-r.had.co.nz/Computing-on-the-language.html#nse)


1 comment:

  1. "MATLAB first evaluates all the inputs, and then passes these (possibly) computed values as the inputs." - Loren Shure (same thing in Python). That is not true. Python has built-in lazy evaluation using the functional programming paradigm as well as other libraries. Take a look at itertools and functools.
