
R from a Programmer's Point of View

by Dan Connolly

These are some notes on R from The 8th International R User Conference in Nashville, June 2012.

In a lot of ways, R is a little like JavaScript: Scheme with C-like syntax. It's a dynamic language, much like Python or Ruby, with a large standard library for math and statistics:

1 + 1
## [1] 2
## [1] 1

Vectors Everywhere

The first thing to get used to is that there are no scalars. The most primitive datatype is the vector:

sizes <- c(2, 4, 6, 8)
sizes * 3
## [1]  6 12 18 24
sin(pi / sizes)
## [1] 1.0000 0.7071 0.5000 0.3827
sizes + 1:4
## [1]  3  6  9 12
sizes + 0:1
## [1] 2 5 6 9
round(runif(3, min = 0, max = 5))
## [1] 1 3 3

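c() is for concatenate: it combines its arguments into one vector. 1:3 is short for seq(from=1, to=3). Note that adding a short vector (0:1) to a longer one recycles the short one: its elements are reused in order until the lengths match.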
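The two points above (no scalars, and short vectors wrapping around) can be checked directly. This is only a small sketch; it reuses the sizes vector from the example:

```r
sizes <- c(2, 4, 6, 8)

# A lone number is itself a vector of length one.
length(5)
is.vector(5)

# 0:1 is reused as 0, 1, 0, 1 to match the four elements of sizes,
# so these two expressions give the same result.
sizes + 0:1
sizes + rep(0:1, length.out = length(sizes))
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.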

Copy on write, no sharing

Assignment in R is more like PHP than Python; vectors get copied:

x <- c(1, 2, 3)
y <- x
x[2] = 9
x
## [1] 1 9 3
y
## [1] 1 2 3

Indexing starts at 1, as opposed to 0 as in C etc.
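The same copy semantics apply when a vector is passed to a function. A minimal sketch, where set.second is a made-up helper used only for illustration:

```r
set.second <- function(v) {
    v[2] <- 9   # changes the function's own copy, not the caller's vector
    v
}

x <- c(1, 2, 3)
y <- set.second(x)
x   # unchanged
y   # second element replaced
```

To keep the modified vector, the caller has to assign the function's return value, as y does here.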

Assignment, or “gets”

R has a slightly novel approach to the = vs == syntax:

a <- 2
a * a
## [1] 4

Using = in place of <- reportedly works almost everywhere, but it's frowned upon.

But that's a minor issue.
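One place where = and <- are not interchangeable is inside a function call: there, = names an argument, while <- performs an assignment in the caller. A small sketch, with demo.assign as a hypothetical wrapper so the effect stays local to one function:

```r
demo.assign <- function() {
    m1 <- mean(x = c(1, 2, 3))    # = names mean()'s argument x; no variable is created
    before <- exists("x", inherits = FALSE)
    m2 <- mean(x <- c(4, 5, 6))   # <- creates x in this function, then passes its value
    after <- exists("x", inherits = FALSE)
    c(before = before, after = after)
}

demo.assign()
```

The returned vector reports whether a local x existed before and after the second call.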

Argument evaluation, delayed

The stuff that turns your head sideways is lazy/delayed evaluation of arguments:

ignore.first <- function(a, b) {
    b * 2
}
ignore.first(1/0, 3)
## [1] 6
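Since 1/0 evaluates to Inf in R rather than signalling an error, a sharper check of delayed evaluation passes an argument that would fail if it were ever evaluated. This is only a sketch; use.first is a hypothetical counterpart that does touch its first argument:

```r
ignore.first <- function(a, b) {
    b * 2
}

# a is never used, so the stop() call is never evaluated.
ignore.first(stop("never evaluated"), 3)

use.first <- function(a, b) {
    a       # referring to a forces its promise
    b * 2
}

# Here the argument is evaluated, so the error surfaces.
result <- tryCatch(use.first(stop("now evaluated"), 3),
                   error = function(e) conditionMessage(e))
result
```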

Argument evaluation is not only delayed: each argument is passed as a so-called promise, which also carries the unevaluated expression. (The R manual calls them promises, but since they can't be broken, that's something of a misnomer.)

expression.parts <- function(e) {
    substitute(e)   # return the argument's expression without evaluating it
}
expression.parts(x + y * z)
## x + y * z

That looks pretty wonky from the perspective of most general-purpose computing languages, but it makes a lot of sense when doing statistical modelling and plotting, as we'll see below.
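As a rough illustration of why this is handy, a function can turn its argument's expression into a label or list the variables it mentions, without x, y or z ever being defined. The helper name describe is made up for this sketch:

```r
describe <- function(e) {
    expr <- substitute(e)    # the unevaluated expression behind the argument
    list(text = deparse(expr),     # the expression as a string
         vars = all.vars(expr))    # names of the variables it refers to
}

d <- describe(x + y * z)
d$text
d$vars
```

The deparse(substitute(...)) idiom is commonly used to derive default labels from the expressions a caller wrote.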

Working with data in dataframes

Modelling typically starts with some data. We'll synthesize it here, using the workhorse dataframe (much like a database table):

speeds <- runif(10, min = 25, max = 50)
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
stopping <- data.frame(speed = speeds, distance = (speeds^2 + erf(speeds)))
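A few ways to poke at the resulting dataframe, as a sketch. The seed value is arbitrary and only makes the synthetic data repeatable; the variable fast is introduced here for illustration:

```r
set.seed(2012)   # arbitrary seed, so the random speeds are repeatable
speeds <- runif(10, min = 25, max = 50)
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
stopping <- data.frame(speed = speeds, distance = speeds^2 + erf(speeds))

str(stopping)        # column names and types
head(stopping, 3)    # first three rows

# Rows are selected with a logical vector, columns with $ or by name.
fast <- stopping[stopping$speed > 40, ]
nrow(fast)
```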

Formulas in Interactive Plotting

Interactive visualization through plotting is one use of unevaluated arguments. We can plot stopping distance as a function of speed.

plot(distance ~ speed, stopping)

[Plot: distance against speed for the synthesized data]

The parabola becomes clearer if we zoom out to include the origin:

plot(distance ~ speed, stopping, xlim = c(0, max(stopping$speed)), 
    ylim = c(0, max(stopping$distance)))

[Plot: the same data with both axes extended to include the origin]
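The formula passed to plot is itself an ordinary R object that holds unevaluated symbols. A short sketch showing that it can be created and inspected even when no variables named distance or speed exist:

```r
f <- distance ~ speed

class(f)       # the object's class
all.vars(f)    # the variable names mentioned in the formula
f[[2]]         # left-hand side, as an unevaluated symbol
f[[3]]         # right-hand side
```

Functions such as plot look the names up later, in the dataframe given as the second argument.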

Formulas in linear models

Another use of unevaluated formulas is linear models; I don't yet understand them, but I gather they're a mainstay of analysis with R:

m <- lm(distance ~ 0 + speed^2, stopping)
coef(m)
## speed 
## 39.51 
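Note that the coefficient above is labelled speed, not speed^2: inside a formula, ^ is a formula operator rather than arithmetic, so speed^2 reduces to plain speed. To fit against the squared speed, the term can be wrapped in I(). A sketch under the same synthetic-data assumptions as above (the seed and the names m1 and m2 are arbitrary):

```r
set.seed(2012)   # arbitrary seed for repeatable synthetic data
speeds <- runif(10, min = 25, max = 50)
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
stopping <- data.frame(speed = speeds, distance = speeds^2 + erf(speeds))

m1 <- lm(distance ~ 0 + speed^2, stopping)      # ^ is a formula operator here
m2 <- lm(distance ~ 0 + I(speed^2), stopping)   # I() protects the arithmetic

names(coef(m1))
names(coef(m2))
coef(m2)   # close to 1, since distance was built as roughly speed^2
```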

The R Learning Cliff

“I’m going to assume you know what a generalized linear model is,” said Bill Venables in the R short course. Nowhere in the R world is there a definition of basic concepts such as linear model or standard deviation. The help for sd says:

This function computes the standard deviation of the values in x.

Gee, thanks. The reference to the var function looked promising, but nope. They just bottom out with:

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

In fairness, I suppose the Python docs for sort() don't spell out how to sort items in a list.
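For reference, the quantity sd computes can be spelled out in a line of R: the square root of the sum of squared deviations from the mean, divided by n - 1. A minimal sketch with made-up data; manual.sd is a name chosen here for illustration:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up sample

# Sample standard deviation, written out by hand (denominator n - 1).
manual.sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

all.equal(manual.sd, sd(x))
all.equal(manual.sd^2, var(x))
```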

So then it’s off to Wikipedia’s statistics materials; but I haven’t found a good connection between what I know (set theory, discrete mathematics, real analysis, topology) and the foundations of statistics. I understand some examples and special cases, but not many general definitions.

R help is “obtuse”

In the “crash course” tutorial, I learned I'm not the only one who doesn't find R's help very helpful: “frequently the complaint about R help is that the help is obtuse. Far too subtle for mere mortals. To get help with xyz, search for R xyz example”

R Development Tools

Bill Venables uses ESS, the R mode for Emacs, but says these days he’d recommend RStudio.

I used Emacs + ESS when developing rgate. Since I'd rather not infect the next generation with the emacs virus, I installed R for Eclipse in preparation for the conference. But I don't think I've used R for Eclipse since.

The talk by the RStudio guys (JJ Allaire, who developed ColdFusion etc.) was pretty cool. I'm using RStudio and the Markdown integration to draft this little article. knitr is cool, but this “literate programming” style seems somewhat inside-out to me. I prefer to generate documentation from the normal source code (.R, .py, .C), a la doxygen, doctest, Sphinx. But I'm giving it a try.

doctest for R? Almost…

In the crash course (slide: “Unit tests in R”) I learned about a convention for mixing runnable examples with package documentation (*.Rd), much like Python's doctest. Yay!

But… the conventions don't include checking that the output of the examples matches any expected results. Sigh. So close.

I suppose one could add something to make the examples fail if they don't produce the expected results. That's possibly useful, but not nearly as useful as an established community norm for doing so.
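One way to do that with base R alone is to put stopifnot() calls in the examples, since an example that raises an error counts as a failure when the examples are run. This is a sketch of the idea, not an established convention; double.it is a made-up function:

```r
double.it <- function(x) x * 2

# Lines like these could sit in the examples section of the function's .Rd file.
# Each one stops with an error if the result is not the expected one.
stopifnot(identical(double.it(c(1, 2)), c(2, 4)))
stopifnot(isTRUE(all.equal(double.it(0.1), 0.2)))
```

all.equal is used for the floating-point case because it compares with a tolerance rather than exactly.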

Wickham’s devtools package looks interesting. (Wickham is clearly a leading light… his ggplot2 was used everywhere.)

R performance, profiling

Performance was a theme of the conference (as well as reproducible research, which is another article altogether).

Norm Matloff gave a great invited talk (slides) on Parallel R, Revisited, summarizing major hardware trends, e.g.

Then he showed his “software alchemy” technique, which achieves super-linear speedup in some applications that seemed common/important.

Another talk by Justin Talbot went into more detail. He reminded us that clock speed topped out in ~2003 at ~3 GHz, but 2 to 4 cores is consumer technology today and he expects 8 to 16 cores on laptops in ~5 years. Memory isn’t getting much faster either. Since we have more cores, we’re memory-starved, so we have to do more with registers and caches.

He noted 3 distinct performance areas in R: scalar (interpreter), matrix, and vector. The conventional wisdom is that vector operations in R are as fast as C code, but he found that they were only as fast as poorly written C code (too many copies); 7% as fast as hand-tuned C code. So he was able to get 60x speed-up through a combination of better C code and multi-core use.
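The gap between the scalar (interpreter) and vector areas is easy to see for oneself. A rough sketch that computes the same sum of squares both ways; the function names and the vector length are arbitrary, and the timings will vary by machine:

```r
x <- runif(1e5)

# Scalar style: the interpreter handles one element per loop iteration.
loop.sum.sq <- function(v) {
    total <- 0
    for (i in seq_along(v)) {
        total <- total + v[i]^2
    }
    total
}

# Vector style: the squaring and summing happen inside built-in functions.
vec.sum.sq <- function(v) sum(v^2)

all.equal(loop.sum.sq(x), vec.sum.sq(x))   # same answer either way
system.time(loop.sum.sq(x))
system.time(vec.sum.sq(x))
```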

Tim Hesterberg from Google gave an invited talk including speeding up dataframes… reducing the number of copies from 8 to 3 in some common operations. He mentioned a few ways to find out what to optimize: tracemem, Rprofmem, and the --enable-memory-profiling build option.
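As a small sketch of the first of these, tracemem can show when a vector actually gets copied. It only works if R was built with memory profiling, so the calls are guarded with capabilities("profmem"); the variable traced is introduced here for that check:

```r
x <- c(1, 2, 3)
traced <- capabilities("profmem")   # TRUE if this build supports memory profiling

if (traced) tracemem(x)   # start reporting copies of x
y <- x                    # no copy yet: x and y refer to the same vector
y[2] <- 9                 # the copy happens here, and tracemem reports it
if (traced) untracemem(x)

x
y
```

On a build with memory profiling, a tracemem line is printed at the point of modification rather than at the assignment y <- x.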