Reading and wrangling data (2/3)

Martin Frigaard
11/15/2017

In my last post I set up my project folders and downloaded the data sets for the article, “Contagion in Mass Killings and School Shootings”.

Reading data into RStudio

I am going to read the data into the RStudio environment. I recommend using read_csv for plain text files (this function is from readr is faster than the base read.csv).

brady_mass_df <- read_csv(
    "./Data/raw_data/brady_mass_original.txt"
    )
## Parsed with column specification: 
## cols( 
  ## lexclude = col_double(), 
  ## month = col_double(), 
  ## day = col_double(), 
  ## year = col_double(), 
  ## city = col_character(), 
  ## state = col_character(), 
  ## nwound = col_double(), 
  ## nkill = col_double(), 
  ## naff = col_double(), 
  ## suicide = col_double(), 
  ## killed_by_police = col_double(), 
  ## source_quoted = col_double(), 
  ## shots_fired_at_schoolgrounds = col_double(), 
  ## shots_fired_at_bar_nightclub_stripclub = col_double(), 
  ## shots_fired_at_home = col_double(), 
  ## gender = col_character() 
## )
## Warning: 36 parsing failures. 
## row col expected actual file 
## 16 -- 16 columns 17 columns './Data/raw_data/brady_mass_original.txt' 
## 17 -- 16 columns 17 columns './Data/raw_data/brady_mass_original.txt' 
## 18 -- 16 columns 17 columns './Data/raw_data/brady_mass_original.txt' 
## 19 -- 16 columns 17 columns './Data/raw_data/brady_mass_original.txt' 
## 20 -- 16 columns 17 columns './Data/raw_data/brady_mass_original.txt' 
## ... ... .......... .......... ......................................... 
## See problems(...) for more details.
 
brady_mass_df %>% 
  glimpse()
## Observations: 477
## Variables: 16
## $ lexclude 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month 1, 1, 1, 1, 12, 12, 11,...
## $ day 30, 22, 19, 13, 14, 11,...
## $ year 2013, 2013, 2013, 2013,...
## $ city "Phoenix", "Houston", "...
## $ state "AZ", "TX", "NM", "KY",...
## $ nwound 2, 4, 0, NA, 0, 1, 2, 1...
## $ nkill 1, 0, 5, NA, 26, 2, 2, ...
## $ naff 3, 4, 5, 3, 26, 3, 4, 3...
## $ suicide 1, 0, 0, 0, 0, 1, 1, 0,...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0,...
## $ source_quoted 1, 1, 1, 1, 1, 1, 1, 1,...
## $ shots_fired_at_schoolgrounds 0, 1, 0, 1, 1, 0, 0, 0,...
## $ shots_fired_at_bar_nightclub_stripclub 0, 0, 0, 0, 0, 0, 0, 0,...
## $ shots_fired_at_home 0, 0, 1, 0, 0, 0, 0, 0,...
## $ gender "1", "1", "1", "1", "1"...

NOTE: I’ll put _df on my data frames so I can identify them with pattern =.

The read_csv() function prints some information on the data that were just read into the RStudio working environment. The Parsed with column specification: section tells me how each column was recognized by RStudio. These are inside the cols() portion. The printed portion also tells me how many parsing failures there were in this dataset (36 parsing failures.) followed by a table that describes the columns that didn’t parse the way RStudio was expecting.

## row col expected actual file
## 16 -- 16 columns 17 columns './Data/brady_mass_original.txt'
## 17 -- 16 columns 17 columns './Data/brady_mass_original.txt'
## 18 -- 16 columns 17 columns './Data/brady_mass_original.txt'
## 19 -- 16 columns 17 columns './Data/brady_mass_original.txt'
## 20 -- 16 columns 17 columns './Data/brady_mass_original.txt'

The glimpse() function tells me this data set has 477 observations and 16 variables. The state variable lines up with names of actual states ("AZ", "TX", "NM", "KY") so I can be somewhat confident that these data have been read into R correctly.

Next read in the brady_school_original.txt file.

brady_schl_df % 
  glimpse()
## Observations: 221
## Variables: 1
## $ `nwound nkill source_quoted month day year naff suicide killed_by_police 
## town state lexclude` ...

Ugh–this can’t be right. The entire data set was read into 1 column/variable. The print-out doesn’t tell me much, so I need to look at these data. Preferably I can do this inside RStudio and not have to open the file in another program. I can do this with file.show().

file.show(
    "./Data/raw_data/brady_school_original.txt"
    )

This opens the R Information window and I can look (but not touch) the data. Can you spot a potential problem with these data? There are two spaces between the source_quoted and month variables. This is going to cause a problem when I read the data into RStudio using read_csv(). But fortunately, I can use the read.table() from base R.

The rest of these variables look like they are delimited by a single space " " so I will use the read.table() function from base R. This allows me to enter the sep = "" which means, “the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.

brady_schl_df % 
  glimpse()
## Observations: 221
## Variables: 12
## $ nwound 2, 1, 1, 3, 1, 2, 0, 4, 0, 0, 2, 0, 2, 0, 0, ...
## $ nkill 0, 0, 0, 0, 0, 1, 0, 6, 1, 0, 0, 1, 0, 1, 3, ...
## $ source_quoted 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month 1, 12, 12, 11, 11, 10, 10, 6, 4, 3, 1, 1, 1, ...
## $ day 14, 13, 4, 13, 2, 21, 15, 7, 19, 18, 31, 29, ...
## $ year 2014, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ naff 2, 1, 1, 3, 1, 3, 0, 10, 1, 0, 2, 1, 2, 1, 3,...
## $ suicide 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ town "Roswell", "Centennial", "Winter_Garden", "Pi...
## $ state "NM", "CO", "FL", "PA", "NC", "NV", "TX", "CA...
## $ lexclude 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...

That looks better. I will try to read in the final file usa_today_original.txt with read_csv()

usa_tdy_df % 
    glimpse()
## Observations: 233
## Variables: 11
## $ month "01", "12", "11", "11", "11", "10", "10", "10...
## $ day "16", "01", "23", "23", "07", "29", "28", "26...
## $ year 2014, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ town "Spanish_Fork", "Topeka", "Tulsa", "Parsons",...
## $ state "UT", "KS", "OK", "KS", "FL", "SC", "TX", "NY...
## $ method "Shooting", "Shooting", "Shooting", "Unknown_...
## $ type "Family_killing", "Family_killing", "Other", ...
## $ nkill 4, 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 12, 4, 4, 4,...
## $ suicide 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ nwound "0", "0", "0", "0", "0", "0", "0", "0", "0", ...

Whew! Now all three data frames are loaded into our workspace. I can check what else is in my working environment using ls().

ls() %>% 
  writeLines()
## brady_mass_df
## brady_schl_df
## data_download_url
## data_path_url
## inst_tidy
## original_article_url
## tidy_pkg
## usa_tdy_df

I can see two vectors for package installation (inst_tidy, tidy_pkg), three data frames (brady_mass_df, brady_schl_df, usa_tdy_df), and three urls (data_download_url, data_path_url, original_article_url). Time to move into data wrangling.

Data wrangling

I have covered data wrangling at length in a variety of other tutorials, but I did want to present an idea I found in the Essentials of Data Science text from Graham J. Williams. The chapter on wrangling presents a series of wrangling templates that can be used on multiple projects (available here.

I find the idea of having a generic data cleaning/preparation template appealing–a standard, boilerplate script that outlines common steps for converting messy, unstructured data into tidy data sets. The website–and entire book–is worth reading.

A possible objection to a data wrangling template might be that every dataset is messy in its own unique way, so standardization is not possible or worth the effort. I find this hypothetical objection unsatisfying because having tidy data as an end-goal makes uniformity possible. If we know we want to end up with a data frame that has 1 variable per column and 1 observation per row, then we can work backward from this goal to the ‘first contact’ we have with a data set. The steps between read_csv() and the final glimpse() or str() will undoubtedly vary, but I think there are probably more similarities than differences in data wrangling pipelines.

I’ll use a few examples from the templates to get the data from the article into shape to reproduce some of the table and figures.

Variables

The names() command tells us the variable names in the data frame

brady_mass_df %>%
     names()
## [1] "lexclude"
## [2] "month"
## [3] "day"
## [4] "year"
## [5] "city"
## [6] "state"
## [7] "nwound"
## [8] "nkill"
## [9] "naff"
## [10] "suicide"
## [11] "killed_by_police"
## [12] "source_quoted"
## [13] "shots_fired_at_schoolgrounds"
## [14] "shots_fired_at_bar_nightclub_stripclub"
## [15] "shots_fired_at_home"
## [16] "gender"

If you want a clean, list version, you can add writeLines() after names()

brady_mass_df %>% 
    names() %>% 
        writeLines()
## lexclude
## month
## day
## year
## city
## state
## nwound
## nkill
## naff
## suicide
## killed_by_police
## source_quoted
## shots_fired_at_schoolgrounds
## shots_fired_at_bar_nightclub_stripclub
## shots_fired_at_home
## gender

Inspect your data (glimpse(), str(), head(), tail())

The glimpse() function is like str(), but shows more of the data and no attributes.

usa_tdy_df %>%
    glimpse()
## Observations: 233
## Variables: 11
## $ month "01", "12", "11", "11", "11", "10", "10", "10...
## $ day "16", "01", "23", "23", "07", "29", "28", "26...
## $ year 2014, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ town "Spanish_Fork", "Topeka", "Tulsa", "Parsons",...
## $ state "UT", "KS", "OK", "KS", "FL", "SC", "TX", "NY...
## $ method "Shooting", "Shooting", "Shooting", "Unknown_...
## $ type "Family_killing", "Family_killing", "Other", ...
## $ nkill 4, 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 12, 4, 4, 4,...
## $ suicide 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  ...
## $ nwound "0", "0", "0", "0", "0", "0", "0", "0", "0", ...

Compare this to the str() function.

usa_tdy_df %>% 
    str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 233 obs. of 11 variables:
## $ month : chr "01" "12" "11" "11" ...
## $ day : chr "16" "01" "23" "23" ...
## $ year : num 2014 2013 2013 2013 2013 ...
## $ town : chr "Spanish_Fork" "Topeka" "Tulsa" "Parsons" ...
## $ state : chr "UT" "KS" "OK" "KS" ...
## $ method : chr "Shooting" "Shooting" "Shooting" "Unknown_Other" ...
## $ type : chr "Family_killing" "Family_killing" "Other" "Family_killing" ...
## $ nkill : num 4 4 4 4 4 5 5 5 4 4 ...
## $ suicide : num 1 0 0 0 0 1 0 0 1 0 ...
## $ killed_by_police: num 0 0 0 0 0 0 0 0 0 0 ...
## $ nwound : chr "0" "0" "0" "0" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 11
## .. ..$ month : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ day : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ town : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ state : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ method : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ type : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ nkill : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ suicide : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ killed_by_police: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ nwound : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"

As you can see, the str() function spits out a lot of information at the bottom (attr(*...), but considering I almost never need this information, I prefer using glimpse().

I can see the top of the data frame with head() and the bottom with tail() (specify a number if you only want to see a few)

brady_mass_df %>% 
    head(3)
## # A tibble: 3 x 16
## lexclude month day  year        city state nwound nkill naff suicide
##        0     1  30  2013     Phoenix    AZ      2     1    3       1
##        0     1  22  2013     Houston    TX      4     0    4       0
##        0     1  19  2013 Albuquerque    NM      0     5    5       0
## # ... with 6 more variables: killed_by_police , source_quoted ,
## # shots_fired_at_schoolgrounds ,
## # shots_fired_at_bar_nightclub_stripclub ,
## # shots_fired_at_home , gender
usa_tdy_df %>% 
    tail(3)
## # A tibble: 3 x 11
## month day year     town state   method            type nkill suicide
##    02  21 2006 Mesa        AZ Shooting            Other    5       0
##    01  30 2006 Goleta      CA Shooting   Public_Killing    7       1
##    01  01 2006 Richmond    VA Stabbing Robbery_Burglary    4       0
## # ... with 2 more variables: killed_by_police , nwound

Dimensions of the data frame

The dimensions of a data frame are helpful to know because they give me a sense of how ‘big’ a data set is. For example, the dimensions of usa_tdy_df data frame are:

usa_tdy_df %>% dim()
## [1] 233 11

That means there are 11 variables and 233 observations. This information is available in the glimpse() command, so I don’t use dim() too often.

usa_tdy_df %>% 
    glimpse()
## Observations: 233
## Variables: 11
## $ month "01", "12", "11", "11", "11", "10", "10", "10...
## $ day "16", "01", "23", "23", "07", "29", "28", "26...
## $ year 2014, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ town "Spanish_Fork", "Topeka", "Tulsa", "Parsons",...
## $ state "UT", "KS", "OK", "KS", "FL", "SC", "TX", ...
## $ method "Shooting", "Shooting", "Shooting", ...
## $ type "Family_killing", "Family_killing", "Other", ...
## $ nkill 4, 4, 4, 4, 4, 5, 5, 5, 4, 4, 4, 12, 4, 4,...
## $ suicide 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,  ...
## $ killed_by_police 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  ...
## $ nwound "0", "0", "0", "0", "0", "0", "0", "0", "0", ...

Now I am going to clean the date variables in these three data frames.


Cleaning dates

In the usa_tdy_df, the three variables corresponding to dates are month, day, and year. I want to combine these three into a single variable date.

lubridate::make_date()

I can do this with the make_date() function from the lubridate package. I should do a test run to make sure I have the syntax correct.

usa_tdy_df %$% 
    glimpse(month)
## chr [1:233] "01" "12" "11" "11" "11" ...
usa_tdy_df %$%  
    glimpse(day)
## chr [1:233] "16" "01" "23" "23" ...
usa_tdy_df %$% 
    glimpse(year)
## num [1:233] 2014 2013 2013 2013 ...

I will test and assign this to the data frame.

usa_tdy_df %>%
dplyr::select(year, month, day) %>%
  mutate(
    date = make_date(year, month, day)) %$%
  glimpse(date)
## Date[1:233], format: "2014-01-16" "2013-12-01" 
# assign
usa_tdy_df %
  mutate(
    date = make_date(year, month, day))

I can repeat this process for the dates in the other two data sets. In the next post I will verify some of the table counts and attempt to recreate the figures.

One thought on “Reading and wrangling data (2/3)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.