Reproduce Tables and Figures (3/3)

Martin Frigaard
11/15/2017

This is the third of three posts for setting up a data project. The first post dealt with creating project folders and downloading files and data from the Internet. The second post went over a bit of data wrangling. In this post I will attempt to recreate come of the table counts and figures in the paper titled, “Contagion in Mass Killings and School Shootings”.

Table counts and figures

Now that I am comfortable with the dates in the data set, I can start trying to re-create a few figures from the article. I’ll start with the top two graphs in Figure 1, and focus on the gray line (it represents the raw data).

The title and description are provided below:

Fig 1. Number of mass killings, school shootings, and mass shootings over time for the data samples used in these studies. For clarity of presentation, all data samples are shown over the same time frame. Overlaid is the fit of the full model of Eq 4 that also contains a contagion component (red). The green line indicates the estimated portion of the data due to contagion. The points along the x-axis for the first three samples indicate the date of events that had a number of people killed in the top 5th percentile for that sample.

This looks like a frequency polygon where Year is mapped to the x-axis is (in 5-year increments), and the y-axis is the number of incidents per month. I will start with the USA today data, but I want to take a glimpse() just to remind myself what these data look like.

usa_tdy_df %>% 
    glimpse()
## Observations: 233
## Variables: 12
## $ month "01", "12", "11", "11", "11", "10", "10", "10...
## $ day "16", "01", "23", "23", "07", "29", "28", "26...
## $ year 2014, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ town "Spanish_Fork", "Topeka", "Tulsa", "Parsons",...
## $ state "UT", "KS", "OK", "KS", "FL", "SC", "TX", "NY...
## $ method "Shooting", "Shooting", "Shooting", "Unknown_...
## $ type "Family_killing", "Family_killing", "Other", ...
## $ nkill 4, 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 12, 4, 4, 4,...
## $ suicide 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  ...
## $ nwound "0", "0", "0", "0", "0", "0", "0", "0", "0", ...
## $ date 2014-01-16, 2013-12-01, 2013-11-23, 2013-11-...

I will start with a simple frequency polygon with date on the x-axis. I’ll set the binwidth to 35 as a starting point.

usa_tdy_df %>% 
    ggplot(aes(x = date)) +
    geom_freqpoly(binwidth = 35)

I can see the dates across the x-axis, which is a good start. But this doesn’t look like the figures above. I need to create a new variable in order to add some faceting by the type of mass killing (shooting/not shooting). First I will stratify the number of incidents by the method of the mass shooting. The count() for the method variable is below:

usa_tdy_df %>% 
    count(method)
## # A tibble: 17 x 2
## method n
##
## 1 Blunt_force 8
## 2 Blunt_force_and_suffocation 1
## 3 Shooting 168
## 4 Shooting_and_blunt_force 1
## 5 Shooting_and_drowning 1
## 6 Shooting_and_Smoke_inhalation_and_burns 1
## 7 Shooting_and_Stabbing 4
## 8 Shooting_and_strangulation 1
## 9 Smoke_inhalation_and_burns 18
## 10 Stabbing 14
## 11 Stabbing_and_Blunt_Force 3
## 12 Stabbing_and_Smoke_Inhalation_and_Burns 3
## 13 Stabbing_and_strangulation 1
## 14 Strangulation 2
## 15 Strangulation_and_blunt_force 1
## 16 Suffocation_and_Stabbing_and_Shooting 1
## 17 Unknown_Other 5

There are multiple ways to create a new variable. I could use mutate() and if_else() to create my new variable binary variable, or mutate() and case_when(). But because I want to create a factor variable from these categories that has two levels, mass killings with firearms and mass killings without firearms, I will use the fct_collapse() function from forcats.

forcats::fct_collapse()

usa_tdy_df %
mutate(method_bin =
  fct_collapse(method,
    shooting = c("Shooting",
      "Shooting_and_blunt_force",
      "Shooting_and_drowning",
      "Shooting_and_Smoke_inhalation_and_burns",
      "Shooting_and_Stabbing",
      "Shooting_and_strangulation",
      "Suffocation_and_Stabbing_and_Shooting"),
  nonshooting = c("Blunt_force",
      "Blunt_force_and_suffocation",
      "Smoke_inhalation_and_burns",
      "Stabbing",
      "Stabbing_and_Blunt_Force",
      "Stabbing_and_Smoke_Inhalation_and_Burns",
      "Stabbing_and_strangulation",
      "Strangulation",
      "Strangulation_and_blunt_force",
      "Unknown_Other")))

Now I will get a count of my new variable.

usa_tdy_df %>% 
  count(method_bin)
## # A tibble: 2 x 2
##      method_bin   n
##     nonshooting  56
##        shooting 177

Aaaaaand full stop. Recall the descriptives table in the paper? It lists the sample size of USA Today observations with firearms at 176, and the number of observations USA Today without firearms at 56. Also, the USA Today total sample size is 232 observations. Why are my numbers off? Well, I should’ve to read the entire article before starting to clean and visualize the data.

Rewind

In the methods section, the authors note “Data on mass killings in the US between 2006 to December 2013 were obtained from a USA Today study that examined Federal Bureau of Investigation (FBI) data from the FBI Supplemental Homicide Reports…” The important takeaway here is the time-frame. I never ran a descriptive summary of the date variable in the usa_tdy_df data frame to figure out if the time-frame in the raw data matches the time-frame listed in the methods.

I will now:

usa_tdy_df %>%
  summarise(
    min = min(date),
    max = max(date),
    n = n())
## # A tibble: 1 x 3
##          min        max   n
##   2006-01-01 2014-01-16 233

And here we have it. I can see this data set goes into January of 2014. I need to filter this observation out of the data set to get the same descriptive statistics the authors report.

IMPORTANT NOTE: Always always always read the documentation for a data set before diving in. Fortunately, I didn’t spend an entire day doing unnecessary cleaning and processing tasks on data that wouldn’t be included in the tables/figures or analyses.

usa_tdy_df %
  filter(date % 
  summarise(
    min = min(date),
    max = max(date),
    n = n())
## # A tibble: 1 x 3
##          min        max   n
##   2006-01-01 2013-12-01 232

Now I can check the count of our method_bin

usa_tdy_df %>% 
    count(method_bin)
## # A tibble: 2 x 2
## method_bin n
##
## 1 nonshooting 56
## 2 shooting 176

Ok, I can proceed with the visualization. I will facet this the graphs by method_bin. I am also going to stick with the 35 binwidth.

usa_tdy_df %>%
    ggplot(aes(x = date)) +
    geom_freqpoly(binwidth = 35) +
    facet_grid(. ~ method_bin)

Hmm…In order to look like the originals, I want the shooting category on the left side. I can shift this by using a the fct_rev() function.

forcats::fct_rev()

usa_tdy_df %>%
  ggplot(aes(x = date)) +
  geom_freqpoly(binwidth = 35) +
   facet_grid(. ~ fct_rev(method_bin))

That’s better. The binwidth seems to be representing the distribution of shootings similar to the gray line in the top two graphs in figure 1. To make these graphs look more like the ones in the paper, I should adjust the x-axis to be free_x.

usa_tdy_df %>%
  ggplot(aes(x = date)) +
  geom_freqpoly(binwidth = 35) +
  facet_wrap(~ fct_rev(method_bin), 
    scales = "free")

The geom_freqpoly() isn’t really capturing the number of incidents per month. I’ll create a variable that counts the number of incidents per month and map this on the y-axis. First I need to convert the month variable to an integer. Then I will use the add_count() function from dplyr.

dplyr::add_count()

# convert month to integer
usa_tdy_df$month %
    add_count(month, year) %>%
    ungroup() %>%
    rename(month_count = n) # rename month_count
usa_tdy_df %>% glimpse()
## Observations: 232
## Variables: 14
## $ month 12, 11, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9, ...
## $ day "01", "23", "23", "07", "29", "28", "26", "26...
## $ year 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ town "Topeka", "Tulsa", "Parsons", "Jacksonville",...
## $ state "KS", "OK", "KS", "FL", "SC", "TX", "NY", "AZ...
## $ method "Shooting", "Shooting", "Unknown_Other", "Sho...
## $ type "Family_killing", "Other", "Family_killing", ...
## $ nkill 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 12, 4, 4, 4, 4,...
## $ suicide 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nwound "0", "0", "0", "0", "0", "0", "0", "0", "0", ...
## $ date 2013-12-01, 2013-11-23, 2013-11-23, 2013-11-...
## $ method_bin shooting, shooting, nonshooting, shooting, s...
## $ month_count 1, 2, 1, 2, 4, 4, 1, 4, 4, 3, 3, 3, 2, 2, 2, ...

The add_count() function adds a numerical count variable (n) for each combination of variables entered into the function. I also renamed this variable month_count. I should get a count of this new variable by the method_bin.

usa_tdy_df %>%
  group_by(method_bin) %>%
summarise(
   sum = sum(month_count),
   min = min(month_count),
   max = max(month_count))
## # A tibble: 2 x 4
##      method_bin sum min max
##     nonshooting 100   1   3
##        shooting 562   1   8

These summaries look close to those represented in the top row of Figure 1. I will create a new graph and see if it looks like a closer match.

usa_tdy_df %>%
  ggplot(aes(x = date, 
             y = month_count)) +
  geom_line() +
  facet_wrap(~ fct_rev(method_bin), 
  scales = "free_y")

I’m getting closer, but I still need to rescale the x-axis to fewer dates, add axis labels and a title. Before I complete these steps, I will assign the graph to an object.

RR_PLoS_fig_1.1 <-
    ggplot(aes(x = date, y = month_count)) +
    geom_line() +
    facet_wrap(~ fct_rev(method_bin), scales = "free") +
    scale_x_date(date_breaks = "5 years",
    date_labels = "%Y") +
    labs(y = "Number of incidents per month", 
    x = "Year")

Now I can add the title to the object (and include the doi to link it to the original paper)

print(RR_PLoS_fig_1.1 + 
    ggtitle(
    "Reproduction of Fig 1. from,\n
     'Contagion in Mass Killings and School Shootings'",
      subtitle = "doi:10.1371/journal.pone.0117259.g001")
      )

This looks close enough to the figure in the article.

The distributions are similar to the gray lines, and I’m confident from my tabulations I have the correct data. I’m going to put this figure in the figures sub-folder of my Results folder.

I’ll also want to export/save all the data files into the processed_data folder. I recommend using .csv files because they’re readable by most software programs.

# brady_mass_df
write_csv(
    as_data_frame(brady_mass_df),
     "./Data/processed_data/brady_mass_df.csv"
     )

# brady_schl_df
write_csv(
    as_data_frame(brady_schl_df),
     "./Data/processed_data/brady_schl_df.csv"
     )
# usa_tdy_df

write_csv(
    as_data_frame(usa_tdy_df), 
    "./Data/processed_data/usa_tdy_df.csv"
    )

Create the Results folder

I have to create these folders first.

# Results folder
if (!file.exists("Results")) {
    dir.create("Results")
}
# Results/figures folder
if (!file.exists("Results/figures")) {
    dir.create("Results/figures")
}
# verify 
dir("./")
## "010-RepRes_and_PLoS_part_1.Rmd"
## "Code" 
## "Data"
## "Images"
## "Results" 
## "Text"

# verify
dir("./Results")
## [1] "figures"

Now I want to save my graph object–I can do this with ggsave.

ggsave("Results/figures/RR_PLoS_fig_1.1.png")
## Saving 7 x 5 in image

Great! Now I want to document all these files in a README.md document.

Creating the README.md

I want to create a README.md file that describes what is in my project folder. I can do this with cat() function.

cat("# Reproducible Research with R and RStudio \n",
    file = "README.md")

The \n tells the text file to start a new line after the title. We will probably want to add more to the README.md file as our project moves along. Whenever we want to edit a file we can access it using file.edit().

Or we can continue to add information to the file using R commands. If I want to keep adding more information to the same file I need to use the append = TRUE argument.

cat("\n
    This folder contains the contents for the tutorial on accessing and\n
    downloading data in the PlosOne article titled,\n
    ['Contagion in Mass Killings and School Shootings'](doi:10.1371/journal.pone.0117259.g002) \n 
    \n
    Citation: Towers S, Gomez-Lievano A, Khan M, Mubayi A, 
    Castillo-Chavez C (2015) Contagion in Mass Killings and School Shootings. 
    PLoS ONE 10 (7): e0117259. doi:10.1371/journal.pone.0117259 \n

The folder/file structure is displayed below:\n
\n
* Data\n
 – raw_data\n
  - brady_mass_original.txt\n
  - brady_school_original.txt\n
  - data_raw_zip.tar.gz\n
  - usa_today_original.txt\n
 – processed_data\n
  - brady_mass_df.csv\n
  - brady_schl_df.csv\n
  - usa_tdy_df.csv\n
* Results\n
 – figures\n
  - RR_PLoS_fig_1.1.png\n
* Text\n
 - literature\n
  - Contagion_in_Mass_Killings_and_School_Shootings.PDF\n
 - meta\n
  - brady_mass_shooting_to_jan_2013.pdf\n
  - brady_school_shootings_to_jan_2014.pdf\n
* Images\n
 - RR_PLoS_console.png\n
 - RR_PLoS_existing_dir.png\n
 - RR_PLoS_figure_1.png\n
 - RR_PLoS_files.png\n
 - RR_PLoS_google_drive.png\n
 - RR_PLoS_project_working_dir.png\n
* Code\n
 – 001.1-RepRes_and_PLoS_part_1.Rmd\n
", file = "README.md", append = TRUE)

I still need to copy the current .Rmd to the Code folder. Click command/control + S. Then use file.copy() to save a version of this script to the Code folder. This should be one of the last things you do in order to get a recent version.

file.copy(
    "001.0-RepRes_and_PLoS_part_1.Rmd", 
    "./Code/001.1-RepRes_and_PLoS_part_1.Rmd"
    )
## TRUE

Now I can open the new version using file.edit().

# file.edit(
    "./Code/001.1-RepRes_and_PLoS_part_1.Rmd"
)

Now I am editing the new version 001.1-RepRes_and_PLoS_part_1.Rmd in the Code folder.

I can close and move the previous version over using file.copy again.

# file.copy(
    "../001.0-RepRes_and_PLoS_part_1.Rmd", 
    "./001.0-RepRes_and_PLoS_part_1.Rmd"
    )
## TRUE

And remove the previous version from the root directory.

file.remove(
    "../001.0-RepRes_and_PLoS_part_1.Rmd"
)
## TRUE
# verify
dir("..")
## "Code" 
## "Data"
## "Images"
## "Results" 
## "Text"
## "README.md"
## "RR_PLoS.Rproj"

Great! Now I have a complete tutorial in one folder and I can send this to Github.

Commit project to Github.

First I need to navigate to Github and create a new repository for this project (I am going to assume you have a Github account and username and know how to create a new repo.)

RR_PLoS_web_repo

I also want to create another project within the root folder titled, RR_PLoS_Github_repo by clicking on the Project icon in the upper right corner, selecting New Project, Version Control, and Git.

Then I’ll enter the following fields in the Clone Git Repository prompt.

RR_PLoS_clone_git_repo

Click on Create project.

This will create a new project and folder in our root project folder. It will also open this a new session with an empty environment pane. If I click on the Git tab, I will see the files below.

RR_PLoS_files

I want to select the Files tab and select a file in the RR_PLoS folder.

RR_PLoS_files_tab

Now I want to click on More and Copy To to move each folder into the RR_PLoS_Github_repo folder. When my Git tab looks like the original RR_PLoS folder, I can select these folders and click on Commit.

RR_PLoS_git_tab_with_files

RR_PLoS_commit

Now navigate to the URL for your repo to confirm everything was pushed correctly. My link is below.

https://github.com/mjfrigaard/RR_PLoS

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.