HeronStatsPlugins: r-learning.Rmd

File r-learning.Rmd, 7.2 KB (added by dconnolly, 5 years ago)

notes on R and the R conference from a programmer's perspective (source Rmd)

Line 
1up to: [HeronStatsPlugins](http://informatics.kumc.edu/work/wiki/HeronStatsPlugins)
2
3R from a Programmer's Point of View
4===================================
5
6by Dan Connolly
7
8These are some notes on R from [The 8th International R User Conference](http://biostat.mc.vanderbilt.edu/wiki/Main/UseR-2012) in Nashville, June 2012.
9
10In a lot of ways, R is a little like JavaScript: scheme with C-like syntax. It's a dynamic language, much like python or ruby, with a large standard library for math and statistics:
11
12```{r}
131+1
14sin(pi/2)
15```
16
17Vectors Everywhere
18------------------
19
20The first thing to get used to is: there are no scalars. The most primitive datatype is vector:
21
22```{r}
23sizes <- c(2,4,6,8)
24sizes * 3
25sin(pi/sizes)
26sizes + 1:4
27sizes + 0:1
28set.seed(1234)
29round(runif(3, min=0, max=5))
30```
31
32c() is for concatenate. 1:3 is short for seq(from=1, to=3). Note that adding a short vector (0:1) to a long vector wraps the short one.
33
34## Copy on write, no sharing
35
36Assignment in R is more like php than python: vectors get copied:
37
38```{r}
39x <- c(1,2,3)
40y <- x
41x[2] = 9
42x
43y
44```
45
46Indexing starts at 1, as opposed to 0 as in C etc.
47
48## Assignment, or "gets"
49
50R has a slightly novel approach to the = vs == syntax:
51
52```{r}
53a <- 2
54a * a
55```
56
57Using = in place of <- reportedly works *almost* everywhere, but it's frowned upon.
58
59But that's a minor issue.
60
61Argument evaluation, delayed
62----------------------------
63
64The stuff that turns your head sideways is lazy/delayed evaluation of arguments:
65
66```{r}
67ignore.first <- function(a, b) { b * 2 }
68ignore.first(1/0, 3)
69```
70
71Argument evaluation is not only delayed  but the so-called promises (the R manual calls them promises, but since they can't be broken, that's something of a misnomer) include the expression:
72
73```{r}
74expression.parts <- function(e) {
75  substitute(e)
76}
77expression.parts(x + y * z)
78```
79
80That looks pretty wonky from the perspective of most general-purpose computing languages, but it makes a lot of sense when doing statistical modelling and plotting, as we'll see below.
81
82## Working with data in dataframes
83
84Modelling typically starts with some data. We'll synthesize it here, using the workhorse dataframe (much like a database table):
85
86```{r}
87speeds <- runif(10, min=25, max=50)
88erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
89stopping <- data.frame(speed=speeds, distance=(speeds ** 2 + erf(speeds)))
90```
91
92## Formulas in Interactive Plotting
93
94Interactive visualization through plotting is one use of unevaluated arguments. We can plot stopping distance as a function of speed.
95
96```{r}
97plot(distance ~ speed, stopping)
98```
99
100The parabola becomes more clear if we zoom out to include the origin:
101
102```{r}
103plot(distance ~ speed, stopping,
104      xlim=c(0, max(stopping$speed)),
105      ylim=c(0, max(stopping$distance))
106     )
107```
108
109## Formulas in linear models
110Another use of unevaluated formulas is linear models; I don't yet understand them, but I gather they're a mainstay of analysis with R:
111
112```{r}
113m <- lm(distance ~ 0 + speed ** 2, stopping)
114coef(m)
115```
116
117The R Learning Cliff
118--------------------
119
120“I’m going to assume you know what a generalized linear model is,” said Bill Venables in the R short course. Nowhere in the R world is there a definition of basic concepts such as linear model or standard deviation. The help for sd says:
121
122> This function computes the standard deviation of the values in x.
123
124Gee, thanks. The reference to the var function looked promising, but nope. They just bottom out with
125
126> ## References
127>
128> Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
129
130In fairness, I suppose the python docs for sort() don't spell out how to sort items in a list.
131
132So then it’s off to wikipedia’s statistics materials; but I haven’t found a good connection between what I know (set theory, discrete mathematics, real analysis, toplogy) and the foundations of statistics. I understand some examples and special cases, but not many general definitions.
133
134## R help is "obtuse"
135
136In the "crash course" tutorial, I learned I'm not the only one who doesn't find R's help very helpful: “frequently the complaint about R help is that the help is obtuse. Far too subtle for mere mortals. To get help with xyz, search for R xyz example”
137
138R Development Tools
139-------------------
140
141Bill Venables uses ESS, the R mode for Emacs, but says these days he’d recommend R Studio.
142
143I used Emacs + ESS when developing rgate. Since I'd rather not infect the next generation with the emacs virus, I installed R for Eclipse in preparation for the conference. But I don't think I've used R for Eclipse since.
144
145The talk by the Rstudio guys (JJ Alair, who developed ColdFusion etc.) was pretty cool. I'm using Rstudio and the MarkDown integration to draft this little article. [Knitter](http://yihui.name/knitr) is cool, but this "literate programming"" style seems somewhat inside-out, to me. I prefer to generate documentation from the normal source code. (.R, .py, .C) a la doxygen, doctest, sphynx. But I'm giving it a try.
146
147## doctest for R? Almost...
148
149In the crash course (slide: "Unit tests in R") I learned about a convention for mixing runnable examples with package documentation (*.Rd), much like python's doctest. Yay!
150
151But... the conventions don't include checking that the output of the examples matches any expected results. Sigh. So close.
152
153I suppose one could add something to make the examples fail if they don't produce the expected results. That's possibly useful, but not nearly as useful as an established community norm for doing so.
154
155Wickham’s devtools package looks interesting. (Wickham is clearly a leading light... his ggplot2 was used everywhere.)
156
157## R performance, profiling
158
159Performance was a theme of the conference (as well as reproducible research, which is another article altogether).
160
161Norm Matloff gave a great invited talk ([slides](http://heather.cs.ucdavis.edu/UseR2012.pdf)) on **Parallel R, Revisited**, summarizing major hardware trends, e.g.
162
163 - NVIDIA currently dominant in the GPU world. Intel likely to enter the market, with the obvious fragmentation risk.
164 - Knight’s Ferry “next year” for several years
165 - CUDA currently dominant
166 - OpenCL stalled?
167 - OpenACC - a la openmp
168
169Then he showed his "software alchemy" technique, which achieves super-linear speedup in some applications that seemed common/important.
170
171Another talk by **Justin Talbot** went into more detail. He reminded us that clock speed topped out in ~2003 at ~3Ghz, but 2 to 4 cores is consumer technology today and he expectes 8 to 16 cores on laptops in ~5 years. Memory isn’t getting much faster either. Since we have more cores, we’re memory-starved. Have to do more with registers and caches.
172
173He noted 3 distinct performance areas in R: scalar (interpreter), matrix, and vector. The conventional wisdom is that vector operations in R are as fast as C code, but he found that they were only as fast as _poorly written_ C code (too many copies); 7% as fast as hand-tuned C code. So he was able to get 60x speed-up through a combination of better C code and multi-core use.
174
175Tim Hesterberg from Google gave an invited talk including speeding up dataframes... reducing the number of copies from 8 to 3 in some common operations. He mentioned a few ways to find out what to optimize: tracemem, Rprofmem, --enable-memory-profiling