class: center, middle, inverse, title-slide # Control Flow ## ⚔
and task repetition ### Julia Piaskowski ### 2022-03-10 --- class: center, middle, inverse ## By popular demand, how to do repeat operations in R --- # Repeating operations across a vector * Reminder: a vector is an object with a length attribute composed of items each of the same class. It can have a name attribute (not required) * `ifelse(test, action-if-yes, action-if-no)` * The 'test' should be an R function that will return a TRUE or FALSE, e.g. `is.na()`, `is.numeric()` * Vector example ```r x <- 1:10 y <- ifelse(x < 5, NA, x) ``` * Vector example inside a data.frame ```r data(storms, package = "dplyr") storms$category_simple <- ifelse(storms$wind <= 50, "small", "big") ``` --- class: center, middle, inverse # [exercise](exercises/exercise-10.html) --- # Repeating operations across a data.frame * `apply()` a simple handy function to repeat things across a data.frame (or tibble, or matrix) * This operation is *vectorised*, meaning all processes proceed simultaneously * Across rows ```r storm_num <- select_if(storms, is.numeric) apply(storm_num, 1, median, na.rm = TRUE) ``` * Across columns ```r storm_num <- select_if(storms, is.numeric) apply(storm_num, 2, median, na.rm = TRUE) ``` --- class: center, middle, inverse # [exercise](exercises/exercise-11.html) --- # Special R functions for rows and columns * These functions are very, very fast * They are not forgiving of non-numeric data ```r rowSums() colSums() rowMeans() colMeans() ``` ```r library(dplyr) data("storms") storms %>% select_if(is.numeric) %>% colMeans(na.rm = TRUE) ``` ``` ## year month ## 2001.282317 8.784805 ## day hour ## 15.832617 9.116789 ## lat long ## 24.758040 -64.089316 ## wind pressure ## 53.637743 991.980521 ## tropicalstorm_force_diameter hurricane_force_diameter ## 145.253271 18.147664 ``` --- # Comments on `NA` values * I often get the question "how can I replace "NA" across my entire data object?" * `NA` is a reserved word in the R language referring to missing data. Often the best way to handle how missing values from your data are handled is to specify that when the file is read in. Check the documentation for your import function, e.g. `read.csv(..., na.string = "...")`. * To do a global replacement of `NA` with another value, use `tidyr::replace_na()` * To do a global replacement of another value with `NA`, this can be handled in the import, or use `dplyr::na_if()` --- # `lapply()` 😍😍😍😍😍 * Vectorised over *lists* * If you can express one aspect of your operation as a list, this can probably work for you! * Downside: everything comes back as a list * It uses very simple notation: ```r lapply(list, some_function) ``` * Here are [some examples](exercises/lapply-examples.html) --- class: center, middle, inverse # [exercise](exercises/exercise-12.html) --- # `lapply()` flotsam & jetsam * Dealing with lists can be challenging: they follow different rules; they typically require lots and lots of indexing to extract content. And you usually can't write a list straight to file like a data.frame. * `sapply()` is just like `lapply()`, except it tries to simplify to common R data objects - a matrix or array. This works if the return data is one row. * `dplyr::bind_rows()` can concatenate data.frames better than `rbind()`. * Getting things out of a list and into the desired format can be one of the most challenging aspects of working with `lapply()` (bonus: it makes you understand R data types really well!). * **purrr** to the rescue! --- # purrr **[purrr](https://purrr.tidyverse.org/)** does all of the hard work of iteration *plus* conversion of output to the object type you want! ```r library(purrr) mtcars %>% split(.$cyl) %>% map(~lm(mpg ~ wt, data = .x)) %>% map_dfr(~as.data.frame(t(as.matrix(coef(.))))) ``` ``` ## (Intercept) wt ## 1 39.57120 -5.647025 ## 2 28.40884 -2.780106 ## 3 23.86803 -2.192438 ``` .footnote[[purrr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/purrr.pdf)] --- class: middle, center, inverse # [exercise](exercises/exercise-13.html) --- # Base/Tidyverse equivalents |base functions | tidyverse equivalent | |--------------|-----------------------| | `lapply()` `sapply()` `vapply()`| **purrr** package | | `mapply()` | `purrr::pmap()` | | `tapply()` | `dplyr::group_by() %>% dplyr::summarise()` | | `replicate()` | `purrr:rerun()` | | `ifelse()` | `dplyr::case_when()` | --- # The oft-abused `for` A `for` loop: ```r for (i in thingy) { do_something() } ``` I see things like this frequently: ```r x <- LETTERS[1:10] for (i in x) print(i) # same as sapply(x, print) ``` Or worse: ```r for (i in item1) { for (j in item2) { here_we_go(...) } } ``` --- # Optimal use of `for` * Better usage of `for` is when you *require* the previous value(s) to proceed through the loop * pre-allocation of your vector/data.frame/list/etc will result in faster code ```r # vector pre-allocation f <- c(0, 1, rep(NA, 98)) f[1:10] ``` ``` ## [1] 0 1 NA NA NA NA NA NA NA NA ``` ```r # Fibonacci sequence for (i in 3:100) { f[i] = f[i - 1] + f[i - 2] } f[1:10] ``` ``` ## [1] 0 1 1 2 3 5 8 13 21 34 ``` --- # Traditional control flow * These are standard control variables that exist across many languages ```r if else for while next break ``` * Note that these are reserved words in the R language; you cannot use these words for any other purpose in the R language than what they are designed to do (no function masking is possible). * There ua not time to delve into these, but here is a [short introduction](exercises/while-if-break.html).