class: center, middle, inverse, title-slide # Reproducible Research ## ⚔
and Data Workflow ### Julia Piaskowski ### 2022-03-03 --- # What is data workflow? * Process of of data management and analysis from gathering raw data to eventual storage in a long-term repository * Coordinated procedures for: * Planning, organizing, and documenting research * Data preparation * Data analysis * Data backup and archive --- # Why do we need it? * It is essential for reproducibility. * It helps us actually get the right answers. * Reliable workflow results in efficient use of time. * Errors are inevitable. A good workflow can help you find them and correct them. * You probably have a workflow, but now, let's explicitly state what it is. --- # What is reproducible research? > ...reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility” - Reproducibility and Replicability in Science (2018) National Academies of Sciences .right[.ref[*[Reference](https://www.nationalacademies.org/our-work/reproducibility-and-replicability-in-science)*]] --- # Reason for good workflow practices Avoid things like this: *(or at least make it an easy fix)* <br> <img src="images/email_error_format.png" width="110%" /> *(actual email I received)* --- # Reason for good workflow practices Clarify how data were processed: <img src="images/colorful_spreadsheet_format.png" width="100%" /> --- # The "Replication Crisis" The inability to replicate a study's findings. <img src="images/Ioannidis2010_format.png" width="90%" /> .right[.ref[(2005, PLOS Medicine)]] **Causes:** bad statistics, incorrect interpretation, lab assays poorly described, lab assay results can't be replicated, *poor documentation of data processing* --- # A typical data analysis <img src="images/horst-eco-r4ds.png" width="6400" /> .right[.ref[*Artwork by [Allison Horst](https://www.allisonhorst.com/)*]] --- # A typical data analysis Time allocation: <img src="images/pie_chart.png" width="90%" /> *Data preparation is a <u>major</u> part of the analytic process.* --- # What happens during 'data preparation'? * Data entry * Data import * Calculations (growing degree days, Julian calendar days) * Data reformatting (e.g., wide to long) * Merging of data sets * Data transformations: summary of technical reps, consistency checks, calibration curves --- # Famous errors in analysis * Analysis of livestock predation by wolves by Washington State University professor Robert Weilgus, checked by the Washington Policy Center for Environmental policy and WSU Statistician Marc Evans. The results disagreed, eventually leading to [Prof Weilgus' resignation.](https://dailyevergreen.com/32158/news/wolf-researcher-resigns-in-settlement-with-wsu/) * In the Fall of 2020, the [UK accidentally omitted 16,000 COVID cases](https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england) from their official "database" (Microsoft Excel), drastically altering their official COVID counts. --- # Famous errors in analysis > We replicate Reinhart and Rogoff and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP (gross domestic product) growth among 20 advanced economies in the post-war period," --Thomas Herndon, Michael Ash, and Robert Pollin *Original study was cited by Rep. Paul Ryan in his 2013 budget and used to justify austerity measures* .right[.ref[*[Reference](http://peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf)*]] --- # Sign of workflow issues * Incorrect results with creative explanations * Delayed publication of papers to determine why results changed * Results that cannot be reproduced with the same data set – due to unrecorded changes made to the data * Results that cannot be reproduced with the same data set – due to unwieldly 1000-line script for data cleaning * Results for the wrong variable * Miscoded data that delay progress * Loss of early raw files that prevents any future error checking --- class: middle, center, inverse # What can we do about this? --- background-image: url(images/spiderman_2.jpg) background-position: top center class: center, bottom, inverse Can a person using README files, a data dictionary, code and/or other descriptors repeat the analytical processes AND obtain the same results for a given study?? --- # The reproducible research plan 1. Plan. 1. Archive data and methods. 1. Document processes. 1. Publish data and methods. ---------------- #### 3 Document the process * **The source is real:** Save computing scripts for all processing, analytical and reporting steps. These scripts are ~~as important as the~~ data. * Made a data dictionary for data sets that explains a data set's variables, their data types and expected values. * Prepare a "README' file to accompany a data set --- # Dealing with spreadsheets * Spreadsheets are a fact of life * Good for data storage, bad for data curation * Some [best practices](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989): <img src="images/spreadsheets_format.png" width="90%" /> --- ## What a reproducibility script should capture * The data file(s) imported * if your data is downloaded from elsewhere, a comprehensive description of how it was accessed * Add-on libraries used in data processing and analysis * If you use a homespun script from someone else for any part of the data processing or analysis, this script should be included * Data manipulations used to generate the final data products: * all variable transformations * data aggregation * filtering steps * file merges * Anything that contributes to the generation of the final data set that will be uploaded to a long-term data repository --- ## What a reproducibility script does not need to capture * Data explorations that aren't directly used to prepare data * histograms, range checks, cross tabulations, etc * Package installations (note that this is different from packages loaded and used) * Failed code (although you may want to save this for other reasons) * Analytical paths that were ultimately abandoned --- # Code itself is not always enough In a study that attempted to rerun R code from 2,899 studies, only 470 ran as expected *(data from [Chris Chen](https://dash.harvard.edu/bitstream/handle/1/38811561/CHEN-SENIORTHESIS-2018.pdf?sequence=3))* <img src="images/sankey.png" width="75%" /> --- # Tips for computational reproducibility * Restart your programming session often * Disable automatic saving of *.RData* * Use R Projects (or equivalent in other IDE's) * Capture R session information with `SessionInfo()` OR consider the **renv** package for capturing the computing environment * See the [Reproducible Research](https://cran.r-project.org/web/views/ReproducibleResearch.html) CRAN task view for more resources --- class: middle, center, inverse ## Future you will (probably) thank your past self