Your closest collaborator is you six months ago, but you don’t reply to emails.
Figure 13.1: Poor file management creates chaos! By xkcd (https://xkcd.com/1459). Shared under CC BY-NC 2.5
Have you ever returned to an old project folder to find a chaotic mess of files with names like analysis-FINAL
, analysis-FINAL-COPY
, and analysis-FINAL-COPY-v2
? Which file is actually the final version!? Or perhaps you’ve spent hours searching for a data file to send to your advisor, only to realize with horror that it was only stored on your old laptop – the one that experienced a catastrophic hard drive failure when you spilled coffee all over it one sleepy Sunday morning. These experiences may make you sympathetic to Karl Broman’s quip: good project management practices not only make it easier to share your research with others, they also make for a more efficient and less error prone workflow that will avoid giving your future-self a headache. This chapter is about the process of managing all of the products of your research workflow — methodological protocols, materials 191 We use the term “materials” here to cover a range of things another researcher might need in order to repeat your study, for example, stimuli, survey instruments, and code for computer-based experiments., data, and analysis scripts – in ways that maximize their value to you and to the broader research community.
When we talk about research products, we typically think of articles in academic journals, which have been scientists’ main method of communication since the scientific revolution in the 1600s.192 The world’s oldest scientific journal is the Philosophical Transactions of the Royal Society, first published in 1665. But articles only provide written summaries of research; they are not the original research products. In recent years, there have been widespread calls for increased sharing of research products, such as materials, data, and analysis code (Munafò et al., 2017). When shared appropriately, these other products can be at least as valuable as a summary article. Shared stimulus materials can be reused for new studies in creative ways; shared analysis scripts can allow for reproduction of reported results and become templates for new analyses; and shared data can enable new analyses or meta-analyses. Indeed, many funding agencies, and some journals, now require that research products be shared publicly, except when there are justified ethical or legal constraints, such as with sensitive medical data (B. A. Nosek et al., 2015).
There have been particularly intensive efforts to improve data sharing, which has been associated with benefits in terms of error detection (Hardwicke et al., 2021b), creative re-use that generates new discoveries (Voytek, 2016), increased citations (Piwowar & Vision, 2013), and detection of fraud (Simonsohn, 2013). According to surveys, researchers are usually willing to share data in principle (Houtkoop et al., 2018), but unfortunately, in practice, they often do not, even if you directly ask them (Hardwicke & Ioannidis, 2018)! Sometimes it is reported that data have been lost because they were stored on a misplaced or damaged computer or external drive, or team members with access to the data are no longer contactable (Tenopir et al., 2020).
As we have discussed in Chapter 3, even when data are shared, they are not always formatted in a way that they can be easily understood and re-used by other researchers, or even the original authors! This issue highlights the critical role of meta-data: information that documents the data (and other products) that you share, including README files, codebooks that document datasets themselves, licenses that provide legal restrictions on reuse, etc. We will discuss best-practices for meta-data throughout the chapter.
Sound project management practices and sharing of research projects are mutually reinforcing goals that bring benefits for both yourself, the broader research community, and scientific progress. One particularly important benefit of good project management practices is that they enable reproducibility. As we discussed in Chapter 3, computational reproducibility involves being able to trace the provenance of any reported analytic result in a research report back to its original source. That means being able to recreate the entire analytic chain from data collection to data files, though analytic specifications to the research results reported in text, tables, and figures. If data collection is documented appropriately, and if data are stored, organized, and shared, then the provenance of a particular result is relatively easy to verify. But once this chain is broken it can be hard to reconstruct [Hardwicke et al. (2018); Figure 13.2]. That’s why it’s critical to build good project management practices into your research workflow right from the start.
Figure 13.2: Illustration of the analytic chain from raw data through to research report.
In this chapter, you will learn how to manage your research project both efficiently and transparently. These goals create a virtuous cycle: if you organize your research products well, they are easier to share later, and if you assume that you will be sharing, you will be motivated to organize your work better! We begin by discussing some important principles of project management, including folder structure, file naming, organization, and version control. Then we zoom in specifically on data and discuss best practices for data sharing. We end by discussing the question of what research products to share and some of the potential ethical issues that might limit your ability to share in certain circumstances.193 This chapter – especially the last section – draws heavily on O. Klein et al. (2018), an article on research transparency that several of us contributed to.
A lot of project management problems can be avoided by following a very simple file organisation system. For those researchers that “grew up” managing their files locally on their own computers and emailing colleagues versions of data files and manuscripts with names like manuscript-FINAL-JS-rev1.xlsx
, a few aspects of this system may seem disconcerting. However, with a little practice, this new way of working will start to feel intuitive and have substantial benefits. Here are the principles:
fifo_manuscript.Rmd
or fifo_manuscript.docx
is the write-up of the “fifo” project as a journal manuscript./analysis/experiment1/eye_tracking_preprocessing.Rmd
is clearly the file that performs pre-processing for the analysis of eye-tracking data from Experiment 1.Keeping these principles in mind, we discuss best practices for project organization, version control, and file naming.
To the greatest extent possible, all files related to a project should be stored in the same project folder (with appropriate sub-folders), and on the same storage provider.195 There are cases where this is impractical due to the limitations of different software packages. For example, in many cases a team will manage its data and analysis code via github but decide to write collaboratively using google docs, overleaf, or another collaborative platform. (It can also be hard to ask all collaborators to use a version control system they are unfamiliar with.) In that case, the final paper should still be linked in some way to the project repository. The biggest issue that comes up in using a split workflow like this is the need to ensure reproducible written products, a process we cover in Chapter 14.
Figure 13.3 shows an example project stored on the Open Science Framework. The top level folder contains sub-folders for analyses, materials, raw and processed data (kept separately). It also contains the paper manuscript, and, critically, a README file in a text format that describes the project (as well as any other meta-data that the authors would like to be associated with the research products, for example a license, explained below).
Figure 13.3: Sample top level folder structure for a project. From Klein et al., 2018. Original visible on the Open Science Framework.
There’s no single established way to organize the sub-folders of a research project, but the broad categories of materials, data, analysis, and writing are typically present. In some projects – such as those involving multiple experiments or complex data types – you may have to adopt a more complex structure. In our projects, it’s not uncommon to find paths like /data/raw_data/exp1/demographics
. The key principle here is to create a hierarchical structure in which subfolders uniquely identify the part of the broader space of research products that are found inside them – that is, /data/raw_data/exp1
contains all the raw data from Experiment 1, and /data/raw_data/exp1/demographics
contains all the raw demographics data from that particular experiment.196 If you’re interested, a more extensive guide to folder organization is found in the online supplement to O. Klein et al. (2018).
Probably everyone who has ever collaborated electronically has experienced the frustration of editing a document, only to find out that you are editing the wrong version – perhaps some of the problems you are working on have already been corrected, or perhaps the section you are adding has already been written by someone else. A second source of frustration comes when you take a wrong turn in a project, perhaps by reorganizing a manuscript in a way that doesn’t work or refactoring code in a way that turns out to be short-sighted.
These two classes of problems are solved effectively by modern version control systems. Here we focus on the use of git, which is perhaps the most widely used version control system. Git is a great general solution for version control, but many people – including several of us – don’t love it for collaborative manuscript writing. We’ll introduce git and its principles here, while noting that online collaboration tools like Google Docs and Overleaf197 Overleaf is actually supported by git on the backend! can be easier for writing; we cover this topic in a bit more depth in Chapter 14.
Figure 13.4: Visualisation of Git version control showing a series of commits (circles) on three different branches: the main branch (green) and two others (blue and red). Branches can be created and then merged back into the main branch.
Git is a tool for creating and managing projects, which are called repositories. A Git repository is a directory whose revision history is tracked via a series of commits – snapshots of the state of the project. These commits can form a tree with different branches, as when two contributors to the project are working on two different parts simultaneously (Figure 13.4). These branches can later be merged either automatically or via manual intervention in the case of conflicting changes.
Commonly, Git repositories are hosted by an online service like Github to facilitate collaboration. With this workflow. a user makes changes to a local version of the repository on their own computer and pushes those changes to the online repository. Another user can then pull those changes from the online repository to their own local version. The online “origin” copy is always the definitive copy of the project and a record is kept of all changes. Appendix A provides a practical introduction to Git and Github, and there are a variety of good tutorials available online and in print (Blischak et al., 2016).
Collaboration using version control tools is designed to solve many of the problems we’ve been discussing:
Organizing a project repository for collaboration and hosting on a remote platform is an important first step towards sharing! Many of our projects (like this book) are actually “born open” in the sense that we do all of our work on a publicly hosted repository for everyone to see (Rouder, 2015). This philosophy of ‘working in the open’ encourages good organization practices from the beginning. It can feel uncomfortable at first, but this discomfort soon vanishes as you realize that no one is actively looking at your in-progress project.200 One concern that many people raise about sharing in-progress research openly is the possibility of “scooping” – that is, other researchers getting an idea or even data from the repository and writing a paper before you do. We have two responses to this concern. First, the empirical frequency of this sort of scooping is difficult to determine, but likely very low – we don’t know of any documented cases. Mostly, the problem is getting people to care about your experiment at all, not people caring so much that they would publish using your data or materials! In Gary King’s words, “The thing that matters the least is being scooped. The thing that matters the most is being ignored.” On the other hand, if you are in an area of research that you perceive to be competitive, or where there is some significant risk of this kind of shenanigans, it’s very easy to keep part, or all, of a repository, private among your collaborators until you are ready to share more widely. All of the benefits we described still accrue. For an appropriately organized and hosted project, often the only steps required to share materials, data, and code is to make the hosted repository public and link it to an archival storage platform like the Open Science Framework.
As Phil Karlton reportedly said, “There are only two hard things in Computer Science: cache invalidation and naming things.” What’s true for computer science is true for research in general.201 We won’t talk about cache invalidation; that’s a more technical problem in computer science that is beyond the scope of this book. Naming files is hard! Some very organized people survive on systems like info-r1-draft-2020-07-13-js.docx
- meaning, “the info project revision 1 draft of July 13th, 2020, with edits by JS.” But this kind of system needs a lot of rules and discipline, and it requires everyone in a project to buy in completely.
On the other hand, if you are naming a file in a hierarchically organized version control repository, the naming problem gets dramatically easier. All of a sudden, you have a context in which names make sense. data.csv
is a terrible name for a data file on its own. But the name is actually perfectly informative – in the context of a project repository with a README that states that there is only a single experiment, a repository structure such that the file lives in a folder called raw_data
, and a commit history that indicates the file’s commit date and author.
As this example shows, naming is hard out of context. So here’s our rule: name a file with what it contains. Don’t use the name to convey the context of who edited it, when, or where it should go in a project.
We’ve just discussed how to manage projects in general; in this section we zoom in on datasets specifically. Data are often the most valuable research product because they represent the evidence generated by our research. We maximize the value of the evidence when other scientists can reuse it for independent verification or generation of novel discoveries. Yet lots of research data are not reusable, even when they are shared. In Chapter 3, we discussed Hardwicke et al. (2018)’s study of analytic reproducibility. But before we were able to even try and reproduce the analytic results we found that only 64% of shared datasets were both complete and understandable.
How can you make sure that your data are managed so as to enable effective sharing? We make four primary recommendations. First, save your raw data! Second, document your data collection process. Third, organize your raw data for later analysis – we provide guidance on organization for both spreadsheets and for data retrieved from software platforms, like Qualtrics. Fourth and finally, document your data using a codebook or other appropriate metadata.
Raw data take many forms. For many of us, the raw data are those returned by the experimental software; for others, the raw data are videos of the experiment being carried out. Regardless of the form of these data, save them! They are often the only way to check issues in whatever processing pipeline brings these data from their initial state to the form you analyze. They also can be invaluable for addressing critiques or questions about your methods or results later in the process. If you need to correct something about your raw data, do not alter the original files. Make a copy, and make a note about how the copy differs from the original.202 Future you will thank present you for explaining why there are two copies of subject 19’s data.
Raw data are often not anonymized or anonymizable. Anonymizing them sometimes means altering them (e.g., in the case of downloaded logs from a service that might include IDs or IP addresses). Or in some cases, anonymization is difficult or impossible without significant effort and loss of some value from the data, e.g. for video data or MRI data (Bischoff-Grethe et al., 2007). Unless you have specific permission for broad distribution of these identifiable data, the raw data may then need to be stored in a different way. In these cases, we recommend saving your raw data in a separate repository with the appropriate permissions. For example, in the ManyBabies 1 study we described above, the public repository does not contain the raw data contributed by participating labs, which the team could not guarantee was anonymized; these data are instead stored in a private repository.203 The precise repository you use for this task is likely to vary by the kind of data that you’re trying to store and the local regulatory environment. For example, in the United States, to store de-anonymized data with certain fields requires a server that is certified for HIPAA (the relevant medical privacy law). Many – but by no means all – universities provide HIPAA-compliant cloud storage.
You can use your repository’s README to describe what is and is not shared. For example, a README might state that “We provide anonymized versions of the files originally downloaded from Qualtrics” or “Participants did not provide permission for public distribution of raw video recordings, which are retained on a secure university server.” Critically, if you still share the derived tabular data, it should still be possible to reproduce the analytic results in your paper, even if checking the provenance of those numbers from the raw data is not possible for every reader.
Figure 13.5: Example participant (top) and trial (bottom) level data from the ManyBabies (2020) case study.
One common practice is the use of participant identifiers to link specific experimental data – which, if they are responses on standardized measures, rarely pose a significant identifiability risk – to demographic data sheets that might include more sensitive and potentially identifiable data.204 A word about subject identifiers. These should be anonymous identifiers, like randomly generated numbers, that cannot be linked to participant identities (like data of birth) and are unique. You laugh, but one of us was in a lab where all the subject IDs were the date of test and the initials of the participant. These were neither unique nor anonymous. One common convention is to give your study a code-name and to number participants sequentially, so your first participant in a sequence of experiments on information processing might be INFO-1-01
. Depending on the nature of the analyses being reported, the experimental data can then be shared with limited risk. Then a selected set of demographic variables – for example, those that do not increase privacy risks but are necessary for particular analyses – can be distributed as a separate file and joined back into the data later.
In order to understand the meaning of the raw data, its helpful to share as much as possible about the context in which it was collected. This also helps communicate the experience that participants had in your experiment. Documentation of this experience can take many forms.
If the experimental experience was a web-based questionnaire, archiving this experience can be as simple as downloading the questionnaire source.205 If it’s in a proprietary format like a Qualtrics .QSF
file, a good practice is to convert it to a simple plain text format as well so it can be opened and re-used by folks who do not have access to Qualtrics (which may include future you!) On the other hand, for many more involved studies it can be more difficult to reconstruct what participants went through. This kind of situation is where video data can shine (Gilmore & Adolph, 2017). A video recording of a typical experimental session can provide a valuable tutorial for other experimenters – as well as good context for readers of your paper. This is doubly true if there is a substantial interactive element to your experimental experience, as is often the case for experiments with children. For example, the ManyBabies case study that we examined shared “walk through” videos of experimental sessions for many of the participating labs, creating a repository of standard experiences for infant development studies. If nothing else, a video of an experimental session can sometimes be a very nice archive of a particular context.206 Videos of experimental sessions also are great to show in a talk, provided you have permission from the participant.
Regardless of what other documentation you keep, it’s critical to create some record linking your data to the particular documentation you have. For a questionnaire study, for example, this documentation might be as simple as a README that says that the data in the raw_data
directory were collected on a particular date using the file named experiment1.qsf
. This kind of “connective tissue” linking data to materials can be very important when you return to a project with questions. If you spot a potential error in your data, you will want to be able to examine the precise version of the materials that you used to gather those data in order to identify the source of the problem.
Data come in many forms, but chances are that at some point during your project you will end up with a spreadsheet full of information. Well-organized spreadsheets cam mean the difference between project success and failure! A wonderful article by Broman & Woo (2018) gives a guide to spreadsheet organization that lays out the principles of good spreadsheet design. We highlight some of their principles here (with our own, opinionated ordering):
Figure 13.6: Examples of non-rectangular spreadsheet formats that are likely to cause problems in analysis. From Broman and Woo (2018).
Make it a rectangle207 Think of your data like a well-ordered plate of sushi, neatly packed together without any gaps.. Nearly all data analysis software, like SPSS, Stata, Jamovi and JASP (and many R packages), require data to be in a tabular format.208 Tabular data is a precursor to “tidy” data, which we describe in more detail in Appendix C. If you are used to analyzing data exclusively in a spreadsheet, this kind of tabular data isn’t quite as readable, but readable formatting gets in the way of almost any analysis you want to do. Figure 13.6 gives some examples of non-rectangular spreadsheets. All of these will cause any analytic package to choke because of inconsistencies in how rows and columns are used!
Choose good names for your variables. No one convention for name formatting is best, but it’s important to be consistent. We tend to follow the tidyverse style guide and use lowercase words separated by underscores (_
). It’s also helpful to give units where these are available, e.g., are reaction times in seconds or milliseconds. Table 13.1 gives some examples of good and bad variable names.
Table 13.1: Examples of good and bad variable names. Adapted from Broman and Woo (2018).
Good name | Good alternative | Avoid |
---|---|---|
subject_id | SubID | subject # |
sex | female | M/F |
rt_msec | reaction_time_ms | reaction time (millisec.) |
Most software packages have a standard value for missing data (e.g. NA
is what R uses). If you are writing dates, please be sure to use the “global standard” (ISO 8601), which is YYYY-MM-DD. Anything else can be misinterpreted easily. Dates in Excel deserve special mention as a source of terribleness. Excel has an unfortunate habit of interpreting information that has nothing to do with dates as dates, destroying the original content in the process.209 Excel’s issue with dates has caused unending horror in the genetics literature, where gene names are automatically converted to dates, sometimes without the researchers noticing (Ziemann et al., 2016). In fact, some gene names have had to be changed in order to avoid this issue!
Decoration isn’t data. Decorating your data with bold headings or highlighting may seem useful for humans, but it isn’t uniformly interpreted or even recognized by analysis software (e.g., reading an Excel spreadsheet into R will scrub all your beautiful highlighting and artistic fonts) so do not rely on it.
Save data in plain text files. The CSV (comma-delimited) file format is a common standard for data that is uniformly understood by most analysis software (it is an “interoperable” file format).210 Be aware of some interesting differences in how these files are output by European vs. American versions of Microsoft Excel! You might find semi-colons instead of commas in some datasets. The advantage of CSVs is that they are not proprietary to Microsoft or another tech company, can be inspected in a text editor, but be careful: they do not preserve Excel formulas or formatting!
Given the points above, we recommend that you avoid analyzing your data in Excel. If it is necessary to analyze your data in a spreadsheet program, we urge you to save the raw data as a separate CSV and then create distinct analysis spreadsheets so as to be sure to retain the raw data unaltered by your (or Excel’s) manipulations.
Many researchers do not create data by manually entering information into a spreadsheet. Instead they receive data as the output from a web platform, software package, or device. These tools typically provide researchers limited control over the format of the resulting tabular data export. Case in point is the survey platform Qualtrics, which provides data with not one but two header rows, complicating import into almost all analysis software!211 The R package qualtRics
can help with this.
That said, if your platform does allow you to control what comes out, you can try to use the principles of good tabular data design outlined above. For example, try to give your variables (e.g., questions in Qualtrics) sensible names!
Even the best-organized tabular data are not always easy to understand by other researchers, or even yourself, especially after some time has passed. For that reason, best practices for data sharing include making a codebook (also known as a data dictionary) that explicitly documents what each variable is. Figure 13.7 shows an example codebook for the trial-level data in the bottom of Figure 13.5. Each row represents one variable in the associated dataset. Codebooks often describe what type of variable a column is (e.g., numeric, string), and what values can appear in that column. A human-readable explanation is often given as well, providing providing units (e.g., “seconds”) and a translation of numeric codes (e.g., “test condition is coded as 1”) where relevant.
Figure 13.7: Codebook for trial-level data (see above) from the ManyBabies (2020) case study.
Creating a codebook need not require a lot of work. Almost any documentation is better than nothing! There are also several R packages that can automatically generate a codebook for you, for example codebook
, dataspice
, and dataMaid
(Arslan, 2019). Adding a codebook can substantially increase the reuse value of the data and prevent hours of frustration as future-you and others try to decode your variable names and assumptions.
All of the hard work you put into your experiments – not to mention the contributions of your participants – can be undermined by bad data and project management. As our accident reports and case study show, bad organizational practices can at a minimum cause huge headaches. Sometimes the consequences can be even worse. On the flip side, starting with a firm organizational foundation sets your experiment up for success. These practices also make it easier to share all of the products of your research, not just your findings. Such sharing is both useful for individual researchers and for the field as a whole.