# HeronStatsPlugins: r-learning.Rmd

File r-learning.Rmd, 7.2 KB (added by dconnolly, 5 years ago) |
---|

Line | |
---|---|

1 | up to: [HeronStatsPlugins](http://informatics.kumc.edu/work/wiki/HeronStatsPlugins) |

2 | |

3 | R from a Programmer's Point of View |

4 | =================================== |

5 | |

6 | by Dan Connolly |

7 | |

8 | These are some notes on R from [The 8th International R User Conference](http://biostat.mc.vanderbilt.edu/wiki/Main/UseR-2012) in Nashville, June 2012. |

9 | |

10 | In a lot of ways, R is a little like JavaScript: scheme with C-like syntax. It's a dynamic language, much like python or ruby, with a large standard library for math and statistics: |

11 | |

12 | ```{r} |

13 | 1+1 |

14 | sin(pi/2) |

15 | ``` |

16 | |

17 | Vectors Everywhere |

18 | ------------------ |

19 | |

20 | The first thing to get used to is: there are no scalars. The most primitive datatype is vector: |

21 | |

22 | ```{r} |

23 | sizes <- c(2,4,6,8) |

24 | sizes * 3 |

25 | sin(pi/sizes) |

26 | sizes + 1:4 |

27 | sizes + 0:1 |

28 | set.seed(1234) |

29 | round(runif(3, min=0, max=5)) |

30 | ``` |

31 | |

32 | c() is for concatenate. 1:3 is short for seq(from=1, to=3). Note that adding a short vector (0:1) to a long vector wraps the short one. |

33 | |

34 | ## Copy on write, no sharing |

35 | |

36 | Assignment in R is more like php than python: vectors get copied: |

37 | |

38 | ```{r} |

39 | x <- c(1,2,3) |

40 | y <- x |

41 | x[2] = 9 |

42 | x |

43 | y |

44 | ``` |

45 | |

46 | Indexing starts at 1, as opposed to 0 as in C etc. |

47 | |

48 | ## Assignment, or "gets" |

49 | |

50 | R has a slightly novel approach to the = vs == syntax: |

51 | |

52 | ```{r} |

53 | a <- 2 |

54 | a * a |

55 | ``` |

56 | |

57 | Using = in place of <- reportedly works *almost* everywhere, but it's frowned upon. |

58 | |

59 | But that's a minor issue. |

60 | |

61 | Argument evaluation, delayed |

62 | ---------------------------- |

63 | |

64 | The stuff that turns your head sideways is lazy/delayed evaluation of arguments: |

65 | |

66 | ```{r} |

67 | ignore.first <- function(a, b) { b * 2 } |

68 | ignore.first(1/0, 3) |

69 | ``` |

70 | |

71 | Argument evaluation is not only delayed but the so-called promises (the R manual calls them promises, but since they can't be broken, that's something of a misnomer) include the expression: |

72 | |

73 | ```{r} |

74 | expression.parts <- function(e) { |

75 | substitute(e) |

76 | } |

77 | expression.parts(x + y * z) |

78 | ``` |

79 | |

80 | That looks pretty wonky from the perspective of most general-purpose computing languages, but it makes a lot of sense when doing statistical modelling and plotting, as we'll see below. |

81 | |

82 | ## Working with data in dataframes |

83 | |

84 | Modelling typically starts with some data. We'll synthesize it here, using the workhorse dataframe (much like a database table): |

85 | |

86 | ```{r} |

87 | speeds <- runif(10, min=25, max=50) |

88 | erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1 |

89 | stopping <- data.frame(speed=speeds, distance=(speeds ** 2 + erf(speeds))) |

90 | ``` |

91 | |

92 | ## Formulas in Interactive Plotting |

93 | |

94 | Interactive visualization through plotting is one use of unevaluated arguments. We can plot stopping distance as a function of speed. |

95 | |

96 | ```{r} |

97 | plot(distance ~ speed, stopping) |

98 | ``` |

99 | |

100 | The parabola becomes more clear if we zoom out to include the origin: |

101 | |

102 | ```{r} |

103 | plot(distance ~ speed, stopping, |

104 | xlim=c(0, max(stopping$speed)), |

105 | ylim=c(0, max(stopping$distance)) |

106 | ) |

107 | ``` |

108 | |

109 | ## Formulas in linear models |

110 | Another use of unevaluated formulas is linear models; I don't yet understand them, but I gather they're a mainstay of analysis with R: |

111 | |

112 | ```{r} |

113 | m <- lm(distance ~ 0 + speed ** 2, stopping) |

114 | coef(m) |

115 | ``` |

116 | |

117 | The R Learning Cliff |

118 | -------------------- |

119 | |

120 | âIâm going to assume you know what a generalized linear model is,â said Bill Venables in the R short course. Nowhere in the R world is there a definition of basic concepts such as linear model or standard deviation. The help for sd says: |

121 | |

122 | > This function computes the standard deviation of the values in x. |

123 | |

124 | Gee, thanks. The reference to the var function looked promising, but nope. They just bottom out with |

125 | |

126 | > ## References |

127 | > |

128 | > Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. |

129 | |

130 | In fairness, I suppose the python docs for sort() don't spell out how to sort items in a list. |

131 | |

132 | So then itâs off to wikipediaâs statistics materials; but I havenât found a good connection between what I know (set theory, discrete mathematics, real analysis, toplogy) and the foundations of statistics. I understand some examples and special cases, but not many general definitions. |

133 | |

134 | ## R help is "obtuse" |

135 | |

136 | In the "crash course" tutorial, I learned I'm not the only one who doesn't find R's help very helpful: âfrequently the complaint about R help is that the help is obtuse. Far too subtle for mere mortals. To get help with xyz, search for R xyz exampleâ |

137 | |

138 | R Development Tools |

139 | ------------------- |

140 | |

141 | Bill Venables uses ESS, the R mode for Emacs, but says these days heâd recommend R Studio. |

142 | |

143 | I used Emacs + ESS when developing rgate. Since I'd rather not infect the next generation with the emacs virus, I installed R for Eclipse in preparation for the conference. But I don't think I've used R for Eclipse since. |

144 | |

145 | The talk by the Rstudio guys (JJ Alair, who developed ColdFusion etc.) was pretty cool. I'm using Rstudio and the MarkDown integration to draft this little article. [Knitter](http://yihui.name/knitr) is cool, but this "literate programming"" style seems somewhat inside-out, to me. I prefer to generate documentation from the normal source code. (.R, .py, .C) a la doxygen, doctest, sphynx. But I'm giving it a try. |

146 | |

147 | ## doctest for R? Almost... |

148 | |

149 | In the crash course (slide: "Unit tests in R") I learned about a convention for mixing runnable examples with package documentation (*.Rd), much like python's doctest. Yay! |

150 | |

151 | But... the conventions don't include checking that the output of the examples matches any expected results. Sigh. So close. |

152 | |

153 | I suppose one could add something to make the examples fail if they don't produce the expected results. That's possibly useful, but not nearly as useful as an established community norm for doing so. |

154 | |

155 | Wickhamâs devtools package looks interesting. (Wickham is clearly a leading light... his ggplot2 was used everywhere.) |

156 | |

157 | ## R performance, profiling |

158 | |

159 | Performance was a theme of the conference (as well as reproducible research, which is another article altogether). |

160 | |

161 | Norm Matloff gave a great invited talk ([slides](http://heather.cs.ucdavis.edu/UseR2012.pdf)) on **Parallel R, Revisited**, summarizing major hardware trends, e.g. |

162 | |

163 | - NVIDIA currently dominant in the GPU world. Intel likely to enter the market, with the obvious fragmentation risk. |

164 | - Knightâs Ferry ânext yearâ for several years |

165 | - CUDA currently dominant |

166 | - OpenCL stalled? |

167 | - OpenACC - a la openmp |

168 | |

169 | Then he showed his "software alchemy" technique, which achieves super-linear speedup in some applications that seemed common/important. |

170 | |

171 | Another talk by **Justin Talbot** went into more detail. He reminded us that clock speed topped out in ~2003 at ~3Ghz, but 2 to 4 cores is consumer technology today and he expectes 8 to 16 cores on laptops in ~5 years. Memory isnât getting much faster either. Since we have more cores, weâre memory-starved. Have to do more with registers and caches. |

172 | |

173 | He noted 3 distinct performance areas in R: scalar (interpreter), matrix, and vector. The conventional wisdom is that vector operations in R are as fast as C code, but he found that they were only as fast as _poorly written_ C code (too many copies); 7% as fast as hand-tuned C code. So he was able to get 60x speed-up through a combination of better C code and multi-core use. |

174 | |

175 | Tim Hesterberg from Google gave an invited talk including speeding up dataframes... reducing the number of copies from 8 to 3 in some common operations. He mentioned a few ways to find out what to optimize: tracemem, Rprofmem, --enable-memory-profiling |