Setting Up Data Project Folders (1/3)

Martin Frigaard
11/15/2017

I re-wrote and published these after reading David Robinson’s excellent post on varianceexplained.org. Check him out on Twitter and take his new tidyverse course on DataCamp.

File and folder organization are topics I was never explicitly taught, and I think it’s tragic. Organizing your project files can help you think through the different steps in your workflow (e.g. ./Project/Data/, ./Project/Code/, ./Project/Reports/). This can reduce the stress that tends to accompany searching for the latest version of a particular file (or even remembering where you left off on a given project). A little forethought and planning can help you locate and access files quickly, which means working faster and with more efficiency.

There are multiple resources for project organization–see Max Masnick’s post on folder organization, Jake Feala’s Python approach, and Eric Talevich’s very detailed description. James Scott Long has an excellent set of slides on workflow for Stata users.

If you prefer texts, I recommend Christopher Gandrud’s Reproducible Research with R and RStudio and Implementing Reproducible Research by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Another great resource is Bioinformatics Data Skills because it has an excellent chapter on using Git and the Unix shell.

I’m going to echo what all these resources say:

The method you use doesn’t matter as much as actually choosing, documenting, and then sticking with a particular method.

I single out documenting and sticking with a method because those were the two parts I found hardest to do regularly and consistently. The upfront investment in discipline and repetition pays off in the long run, though.

I loosely base my project folders around the basic method presented in The Elements of Data Analytic Style by Jeff Leek. I think this layout is simple and straightforward enough to get a project started. I eventually had to create additional sub-folders when I began running into issues with storage and file version names.

I’ve found that explicitly documenting the folder structure, how to deal with temporary files, naming conventions, etc. helps collaborators who may not be familiar with the way my projects are organized.

I’m going to use the PLoS article, “Contagion in Mass Killings and School Shootings”, published on July 2, 2015, to demonstrate how to set up a data project. PLoS has an open data policy, so I should be able to download the data, check the descriptive statistics (study sample sizes, group tabulations, etc.) and re-create at least 1 figure.

This post will focus on file management functions in R. Because brevity is always a concern, I won’t be trying to duplicate the entire analysis, address the research question, or evaluate the authors’ choice of statistical model.

A project’s folder structure might look like the following:

* README.md
* Data/
    - raw_data/
        + data_set_2014-09-23.csv
    - processed_data/
        + 103.0-data_chr.csv
        + 103.1-data_dates.csv
        + 103.2-data_dbl.csv
        + 103.3-data_fctr.csv
* Results/
    - exploratory_figures/
    - final_figures/
    - exploratory_tables/
    - final_tables/
* Code/
    - 001-data_download.R
    - 002-data_read.R
    - 003.0-wrngl_chr.R
    - 003.1-wrngl_dates.R
    - 003.2-wrngl_dbl.R
    - 003.3-wrngl_fctr.R
    - analysis_scripts/
        + 004.0-analysis_describe.R
        + 004.1-analysis_visualize.R
        + 004.2-analysis_model.R
* Text/
    - README.txt
    - final_products/
    - literature/
    - meta/

However you set up your project, you can use the file and directory functions in R to create the folders and download any files you’ll be using. This method keeps everything transparent and reproducible, and it’s easier to track if you leave the project and come back to it later.
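As a rough sketch of that idea, the folders from the layout above could be created in one pass. The `project_root` and `project_dirs` names are my own, and the example runs in a temporary directory so it can't disturb a real project.

```r
# Run in a temporary directory so nothing in a real project is touched;
# set project_root <- "." to build the folders inside the current project.
project_root <- file.path(tempdir(), "example_project")

# Sub-folders taken from the layout listed above
project_dirs <- c(
  "Data/raw_data",
  "Data/processed_data",
  "Results/exploratory_figures",
  "Results/final_figures",
  "Results/exploratory_tables",
  "Results/final_tables",
  "Code/analysis_scripts",
  "Text/final_products",
  "Text/literature",
  "Text/meta"
)

for (d in file.path(project_root, project_dirs)) {
  # recursive = TRUE also creates missing parent folders (e.g. Data/)
  # showWarnings = FALSE keeps re-runs quiet when the folder already exists
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# list the directories that now exist under the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
writeLines(created[created != ""])
```

Because the loop is safe to re-run, it can sit at the top of a setup script without any extra checks.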


I will start by loading the packages we will use for this tutorial and printing our session information. Loading tidyverse on its own attaches the core packages in this list, but I like recording each one in case anyone I share this with isn’t familiar with this workflow (though this is becoming less of an issue).

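# load tidyverse packages
tidy_pkg <- c(
    "dplyr", # data manipulation
    "forcats", # factors
    "ggplot2", # plotting
    "haven", # reading SPSS, Stata, and SAS files
    "lubridate", # dates and times
    "magrittr", # pipes %>%
    "purrr", # programming
    "readr", # reading data into R
    "readxl", # reading excel sheets into R
    "stringr", # dealing with strings
    "tibble", # creating fast and printable data frames
    "tidyr", # tidy data
    "dbplyr", # data base manipulation
    "tidyverse", # sort of everything above...
    "devtools") # development tools
# attach each package by name
inst_tidy <- lapply(tidy_pkg, library,
    character.only = TRUE)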

Then I print the session information. I use the devtools::session_info() function, but you can also use sessionInfo() from the utils package, which is loaded with base R. This will tell future researchers what packages and the R version I used for this project.

# sessionInfo()
devtools::session_info()
## Session info ------
## setting value
## version R version 3.4.2 (2017-09-28)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Los_Angeles
## date 2017-10-15
## Packages -----
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.1 2017-09-25 CRAN (R 3.4.2)
## base * 3.4.2 2017-10-04 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.0)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.2 2017-10-04 local
## datasets * 3.4.2 2017-10-04 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.0)
## dbplyr * 1.1.0 2017-06-27 CRAN (R 3.4.0)
## devtools * 1.13.3 2017-08-02 CRAN (R 3.4.1)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0)
## forcats * 0.2.0.9000 2017-06-29 Github (tidyverse/forcats@714063c)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.2)
## ggplot2 * 2.2.1.9000 2017-10-14 Github (tidyverse/ggplot2@ffb40f3)
## glue 1.1.1 2017-06-21 CRAN (R 3.4.0)
## graphics * 3.4.2 2017-10-04 local
## grDevices * 3.4.2 2017-10-04 local
## grid 3.4.2 2017-10-04 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven * 1.1.0 2017-07-09 CRAN (R 3.4.0)
## hms 0.3 2016-11-22 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.17 2017-08-10 CRAN (R 3.4.1)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
## lubridate * 1.6.0.9009 2017-06-29 Github (tidyverse/lubridate@9e5894a)
## magrittr * 1.5.0 2017-06-29 Github (tidyverse/magrittr@0a76de2)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.2 2017-10-04 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.1)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.2)
## parallel 3.4.2 2017-10-04 local
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
## purrr * 0.2.3 2017-08-02 CRAN (R 3.4.1)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
## readr * 1.1.1.9000 2017-06-29 Github (tidyverse/readr@3ea8199)
## readxl * 1.0.0.9000 2017-06-10 Github (tidyverse/readxl@a1c46a8)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
## rlang 0.1.2 2017-08-09 CRAN (R 3.4.1)
## rmarkdown 1.6.0.9005 2017-10-14 Github (rstudio/rmarkdown@aa96f64)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0.9000 2017-10-14 Github (hadley/scales@d767915)
## stats * 3.4.2 2017-10-04 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr * 1.2.0.9000 2017-06-29 Github (tidyverse/stringr@bf48d9d)
## tibble * 1.3.4 2017-08-22 CRAN (R 3.4.1)
## tidyr * 0.7.1 2017-09-01 CRAN (R 3.4.1)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
## tools 3.4.2 2017-10-04 local
## utils * 3.4.2 2017-10-04 local
## withr 2.0.0 2017-10-14 Github (jimhester/withr@d1f0957)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)

This is a lot of information to digest, but most of it is self-explanatory. The first section is about the R version I am using, my computer’s operating system, and the language, timezone, and date this file was created. The second section lists the packages loaded in this session (a * marks the ones that are attached), their versions, and where they were installed from.
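If the session information should travel with the project rather than live only in a rendered document, one option is to write it to a text file. This is a sketch only: the file name `session_info.txt` is my own choice, and a temporary folder stands in for wherever you would keep it in the project.

```r
# Hypothetical file name; a temporary folder is used here so the sketch is
# safe to run anywhere. In a project, a folder such as Text/meta might fit.
session_file <- file.path(tempdir(), "session_info.txt")

# capture.output() turns the printed session information into lines of text
session_lines <- capture.output(sessionInfo())
writeLines(session_lines, session_file)

# peek at the first few lines of the saved file
readLines(session_file, n = 3)
```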

Relative vs. absolute working directories

A relative working directory is a directory that someone else can open on their computer and all the file manipulation commands in R/RStudio will work. For example, the RStudio project for this tutorial lives in a project folder (RR_PLoS), and inside it, we will create additional folders for the files we create and download. If anyone opens the project file (.Rproj), their RStudio session will be set to the same working directory, and all our scripts and commands will make sense.

The absolute working directory is one that only makes sense on your computer. For example, you may have seen that I set up the RR_PLoS folder in my Dropbox (technically this tutorial is being written in /Users/martinfrigaard/Dropbox/RR_PLoS). If all of the scripts and files were created using this particular (and marginally efficient) file system, anyone coming into this project hoping to reproduce it would have to hunt down every directory and reset it to their file paths. You will not make friends doing this…

NOTE: Avoid including setwd() in your R scripts. This changes the working directory on the user’s computer and can turn the otherwise enjoyable experience of collaboration into a hellish game of whack-a-mole as colleagues try to find out where everything went (intermediate data sets, reports, tables, graphs, etc.). If you’ve ever dealt with this issue, you know why Hadley Wickham refers to the use of setwd() as antisocial.
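A small sketch of the alternative: build every path relative to the project root with file.path() and leave the working directory alone. The object names here are just for illustration.

```r
# Paths are written relative to the project root (the folder holding .Rproj)
raw_data_dir <- file.path("Data", "raw_data")
usa_today_path <- file.path(raw_data_dir, "usa_today_original.txt")
usa_today_path

# getwd() only reports the working directory; unlike setwd() it changes nothing
getwd()

# normalizePath() shows the absolute path this relative path resolves to
# on the current machine (mustWork = FALSE avoids an error if it is missing)
normalizePath(usa_today_path, mustWork = FALSE)
```

The relative path is identical on every collaborator's machine; only the normalizePath() result differs.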

Create an Rmarkdown document

The first thing I do is create the .Rmd file I’ll use throughout the tutorial. The function for doing this is file.create(). I’ll type the function into the console and, inside the parentheses, name this file "001-RepRes_and_PLoS_part_1.Rmd". I always precede script files with three digits because 1) this allows for accurate versioning and 2) I might not know how many scripts a project is going to take. I see the following in my console after pressing Enter:

## TRUE

Next, I will want to edit this file, and I can do this with file.edit().

file.edit(
    "001-RepRes_and_PLoS_part_1.Rmd"
    )
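One thing to watch for: file.create() empties a file that already exists. A hedged sketch of a guard, using a made-up file name in a temporary folder:

```r
# Hypothetical file in a temporary folder, so no real script is at risk
rmd_file <- file.path(tempdir(), "001-example_notes.Rmd")
writeLines("existing notes", rmd_file)

# Only create the file when it is missing; calling file.create() on an
# existing file would empty it
if (!file.exists(rmd_file)) {
  file.create(rmd_file)
}

# the earlier contents are still there
readLines(rmd_file)
```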

Accessing files in a directory

To access the files in the current directory, I can use the dir() function.

dir("./")
## "010-RepRes_and_PLoS_part_1.Rmd"

If I get really lost, I can add ".." to the dir() function to see what the parent directory looks like.
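To make that concrete, here is a sketch with a temporary stand-in folder (the folder and file names are invented for the example). It also shows the pattern argument, which narrows the listing to matching file names.

```r
# Temporary stand-in for a project folder (hypothetical names)
demo_root <- file.path(tempdir(), "dir_demo")
dir.create(file.path(demo_root, "Images"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "notes.Rmd"))

# contents of the folder itself
dir(demo_root)

# contents of its parent, the equivalent of dir("..") from inside demo_root
parent_contents <- dir(file.path(demo_root, ".."))
parent_contents

# only the files whose names end in .Rmd
rmd_files <- dir(demo_root, pattern = "\\.Rmd$")
rmd_files
```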

Common file manipulation functions

The file manipulation functions are listed below:

file.create()
file.copy()
file.rename()
file.remove()
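A quick sketch of the four functions working together, run in a temporary folder with made-up file names so it is safe to try:

```r
# Work in a temporary folder with made-up file names
demo_dir <- file.path(tempdir(), "file_demo")
dir.create(demo_dir, showWarnings = FALSE)

draft  <- file.path(demo_dir, "draft.txt")
backup <- file.path(demo_dir, "draft_backup.txt")
final  <- file.path(demo_dir, "final.txt")

file.create(draft)                          # make an empty file
file.copy(draft, backup, overwrite = TRUE)  # duplicate it
file.rename(draft, final)                   # rename (or move) the original
file.remove(backup)                         # delete the copy

# see which files are left in the folder
remaining <- dir(demo_dir)
remaining
```

Each call returns TRUE or FALSE, which is handy for checking that a step worked before moving on.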

I will also use the dir.create() function to create new folders. In fact, I will start by creating a folder for screenshot images to help with this tutorial.

dir.create("./Images")
# verify the directory
dir("./")
## "010-RepRes_and_PLoS_part_1.Rmd" 
## "Images" 

I am going to save the images below into this folder.

RStudio projects

I’ll start this project by opening an RStudio project file. I can do this by clicking on the Project icon in the upper right corner of the RStudio IDE. After clicking on the icon, I select Existing Directory, then choose a Project Working Directory in the folder I want to use. If the folder doesn’t exist yet, I use the New Directory icon on the first window prompt to create a folder for the project (see figure below).

[Image: RR_PLoS_existing_dir]

You can see I am creating a new folder in my Dropbox named RR_PLoS, for “Reproducible Research in PLoS”.

[Image: RR_PLoS_project_working_dir]

Click Create Project.

After I’ve opened a new project, I should be able to notice a few things about my RStudio environment. First, the directory for this project is listed across the top of the IDE. Second, the project icon in the upper right corner is now named RR_PLoS. Third, the top of the console lists the working directory for RStudio.

[Image: RR_PLoS_console]

There is also a list of the files in this directory on the Files tab (it should only list the new project file).

[Image: RR_PLoS_files]

As we can see, there isn’t much in our project folder. In this case, we want to create four new directories (folders) in the root or master directory using the dir.create() function. The names of these four folders will be: Code, Data, Text, and Products (or Results). You should be able to intuit the contents of each folder by its name, but if not, all of the code (downloading, cleaning, analysis, etc.) goes into the Code folder. The Data folder should contain the raw data in text form (either .csv or .tsv). Literature, grant proposals, protocols, etc. go into the Text folder. Finally, the Products or Results folder contains the deliverables for this project (manuscript, presentation, shiny app, etc.).

Creating the Code folder

I learned another handy trick from Jeff Leek that will check for an existing directory, and then create a folder if one isn’t there.

if (!file.exists("Code")) {
    dir.create("Code")
}

I can check the contents using the dir() function again, but add writeLines() to it using the pipe operator to make the printout cleaner.

dir("./") %>% 
    writeLines()
## 010-RepRes_and_PLoS_part_1.Rmd
## Code
## Images

This process can be repeated for any of the folders in our project, but since I know none of these currently exist, I just use the dir.create() function.
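Since the same check-then-create pattern comes up for every folder, it could be wrapped in a small helper. This is only a sketch: `ensure_dir()` is a name I made up, not a base R function, and it uses dir.exists() instead of file.exists().

```r
# ensure_dir() is a hypothetical helper name, not a base R function
ensure_dir <- function(path) {
  # dir.exists() is TRUE only for directories, whereas file.exists() is
  # also TRUE when a regular file happens to have that name
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(dir.exists(path))
}

# Demonstrated in a temporary folder; in the project the calls would be
# ensure_dir("Code"), ensure_dir("Text/literature"), and so on
demo_code_dir <- file.path(tempdir(), "ensure_demo", "Code")
ensure_dir(demo_code_dir)
ensure_dir(demo_code_dir)  # second call does nothing and raises no warning
```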

Creating the Text folder

The /Text folder should contain articles and materials I need to conduct whatever project I am working on. For example, I am going to navigate to the article on the PLoS website and assign the download link to an object. So first I will create the Text folder, and then the /Text/literature sub-folder. Finally, I will download the article, and check to see that it ended up in the right place.

# assign url to object
original_article_url <- "https://goo.gl/gjFNHY"
# create Text
if (!file.exists("Text")) {
    dir.create("Text")
}
# create Text/literature folder
if (!file.exists("Text/literature")) {
    dir.create("Text/literature")
}
# download file
download.file(
    url = original_article_url,
    destfile = "Text/literature/Contagion_in_Mass_Killings_and_School_Shootings.PDF",
    mode = "wb")

When I download the file, I also need to specify mode = "wb" so the .pdf is written as a binary file; without it the download can be corrupted on Windows. Now I can check to see if the file has been downloaded into the right spot.

 
# check 
dir("./Text/literature") %>% writeLines()
## Contagion_in_Mass_Killings_and_School_Shootings.PDF

Looks good!
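Re-running a script shouldn't have to fetch the same file again. One possible guard is sketched below; `download_once()` is a hypothetical helper name of mine, not part of base R, and the usage lines are commented out because they need an internet connection.

```r
# download_once() is a hypothetical helper name, not part of base R
download_once <- function(url, destfile, mode = "wb") {
  if (file.exists(destfile)) {
    message("Already downloaded: ", destfile)
    return(invisible(FALSE))
  }
  # download.file() returns 0 (invisibly) on success
  status <- download.file(url = url, destfile = destfile, mode = mode)
  invisible(status == 0)
}

# Usage with the article link from above (needs an internet connection,
# and the short link may no longer resolve):
# download_once(
#   "https://goo.gl/gjFNHY",
#   "Text/literature/Contagion_in_Mass_Killings_and_School_Shootings.PDF"
# )
```

The function returns FALSE when it skips the download, so a script can tell a fresh download from a cached one.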

Create the Data folder

A little forethought goes a long way with data processing. I should assume I’m downloading the raw data, so I really want to create two directories: one for /Data and another for /Data/raw_data. I always expect to do a fair amount of data manipulation, so I also create the /Data/processed_data directory.

# create data folder
if (!file.exists("Data")) {
    dir.create("Data")
}
# create a processed data folder
if (!file.exists("Data/processed_data")) {
    dir.create("Data/processed_data")
}
# create a raw data folder
if (!file.exists("Data/raw_data")) {
    dir.create("Data/raw_data")
}
# verify
dir("./Data")
## [1] "processed_data" "raw_data"

The URL for the data can be found in the paper under the Data section.

“There are currently no comprehensive federal repositories of data on mass killings and school shootings in the US, thus for these studies, we relied on data compiled by private organizations. Data are available at http://tinyurl.com/oql4lpy.”

These data are stored in a Google drive folder:

[Image: RR_PLoS_google_drive]

I can click on the down arrow in the upper left portion of the screen to download the zipped files into the current directory. Or I can save the link to the file (https://goo.gl/riwBcw) and download the files into the Data folder just like I did with the research article. Getting the download link is a little tricky (you might need to open it in Chrome).

# Identify the source of the dataset
data_path_url <- "http://tinyurl.com/oql4lpy"
# url to download the files
data_download_url <- "https://goo.gl/riwBcw"
# download files (mode = "wb" keeps the archive intact on Windows)
download.file(
    data_download_url,
    destfile = "Data/raw_data/data_raw_zip.tar.gz",
    mode = "wb")

Verify these data have been downloaded.

 
dir("./Data/raw_data/") %>% writeLines()
## data_raw_zip.tar.gz

I can see this file is stored as a .tar.gz. I can use the untar() function to extract the archive. Specify the file to extract and set exdir = to the destination folder.

untar("./Data/raw_data/data_raw_zip.tar.gz", 
    exdir = "./Data/raw_data/")

Verify this has been unzipped.

dir("./Data/raw_data/") %>% writeLines()
## data
## data_raw_zip.tar.gz
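It can also help to look inside an archive before extracting it. A small sketch, assuming the archive sits at the path used above: untar() with list = TRUE returns the file names without writing anything to disk.

```r
archive <- file.path("Data", "raw_data", "data_raw_zip.tar.gz")

if (file.exists(archive)) {
  # list = TRUE returns the file names in the archive without extracting them
  archive_contents <- untar(archive, list = TRUE)
  writeLines(archive_contents)
} else {
  message("Archive not found: ", archive)
}
```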

It looks like the files came in a folder titled data, so I should check inside this folder for the raw data files.

dir("./Data/raw_data/data")
## [1] "brady_mass_original.txt"
## [2] "brady_mass_shooting_to_jan_2013.pdf"
## [3] "brady_school_original.txt"
## [4] "brady_school_shootings_to_jan_2014.pdf"
## [5] "usa_today_original.txt"

Now I see the three data files I was expecting (in .txt format). There are also two .pdf files. I can investigate these later, but right now I’m going to move the data files out of the data folder into my raw_data folder.

# move the brady mass murder data
file.copy(
    "./Data/raw_data/data/brady_mass_original.txt",
    "./Data/raw_data/brady_mass_original.txt"
     )
## [1] TRUE
# move the brady_school_original.txt
file.copy(
    "./Data/raw_data/data/brady_school_original.txt", 
    "./Data/raw_data/brady_school_original.txt"
    )
## [1] TRUE
# move the usa original data
file.copy(
    "./Data/raw_data/data/usa_today_original.txt", 
    "./Data/raw_data/usa_today_original.txt")
## [1] TRUE

Verify the data files have been moved.

dir("./Data/raw_data/")
## "brady_mass_original.txt" 
## "brady_school_original.txt"
## "data" "data_raw_zip.tar.gz" 
## "usa_today_original.txt"
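The three file.copy() calls can also be collapsed into one, because file.copy() accepts a vector of files when the destination is an existing directory. The sketch below uses temporary stand-in folders and empty placeholder files that only share names with the real data.

```r
# Temporary stand-ins for Data/raw_data/data and Data/raw_data
from_dir <- file.path(tempdir(), "copy_demo", "data")
to_dir   <- file.path(tempdir(), "copy_demo", "raw_data")
dir.create(from_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(to_dir, recursive = TRUE, showWarnings = FALSE)

# empty placeholder files with the same names as the downloaded data
data_files <- c("brady_mass_original.txt",
                "brady_school_original.txt",
                "usa_today_original.txt")
file.create(file.path(from_dir, data_files))

# `from` can be a vector of files when `to` is an existing directory;
# one TRUE/FALSE is returned per file
copied <- file.copy(from = file.path(from_dir, data_files), to = to_dir)
copied

dir(to_dir)
```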

And I should move the .pdf files to the /Text folder.

# create Text/meta folder
if (!file.exists("Text/meta")) {
    dir.create("Text/meta")
}
# move the brady_mass_shooting_to_jan_2013.pdf
file.copy(
    "./Data/raw_data/data/brady_mass_shooting_to_jan_2013.pdf",
    "Text/meta/brady_mass_shooting_to_jan_2013.pdf"
    )
## [1] TRUE
# move the brady_school_shootings_to_jan_2014.pdf
file.copy(
    "./Data/raw_data/data/brady_school_shootings_to_jan_2014.pdf",
    "Text/meta/brady_school_shootings_to_jan_2014.pdf"
    )
## [1] TRUE

To keep everything nice and clean, I should also remove the files and the ./Data/raw_data/data folder.

file.remove("./Data/raw_data/data/brady_mass_original.txt")   
## [1] TRUE     
file.remove("./Data/raw_data/data/brady_mass_shooting_to_jan_2013.pdf")     
## [1] TRUE     
file.remove("./Data/raw_data/data/brady_school_original.txt")     
## [1] TRUE    
file.remove("./Data/raw_data/data/brady_school_shootings_to_jan_2014.pdf")    
## [1] TRUE    
file.remove("./Data/raw_data/data/usa_today_original.txt")     
## [1] TRUE    
file.remove("./Data/raw_data/data") 
## [1] TRUE     
# verify
dir("./Data/raw_data/")
## "brady_mass_original.txt" 
## "brady_school_original.txt"
## "data_raw_zip.tar.gz" 
## "usa_today_original.txt"
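Removing files one at a time works, but a folder and its contents can also be deleted in a single step with unlink(). A sketch with a temporary stand-in folder (the file names inside it are invented):

```r
# Temporary stand-in for the leftover Data/raw_data/data folder
leftover_dir <- file.path(tempdir(), "unlink_demo", "data")
dir.create(leftover_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(leftover_dir, c("a.txt", "b.pdf")))

# recursive = TRUE deletes the folder together with everything inside it
unlink(leftover_dir, recursive = TRUE)

# confirm the folder is gone
dir.exists(leftover_dir)
```

Since unlink() deletes without asking, it is worth double-checking the path before running it on real data.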

I will take a look at the master folder for this project with dir("./").

dir("./")
## "010-RepRes_and_PLoS_part_1.Rmd"
## "Code"
## "Data"
## "Images"
## "Text"

So I have the Code, Text, and Data folders. In my next post, I will clean and process the raw data in order to re-create some of the tables and figures from the article and put them into a Results folder.
