Wrangle multiple .csv files with a function in R

This little tutorial solves a problem I was having when working on the exploratory data analysis exercises in Doing Data Science by Cathy O’Neil and Rachel Schutt. I highly recommended picking up a copy for yourself.

On pages 37-38, we are instructed to do the following (paraphrasing here):

This folder contains 31 simulated days of ads shown and clicks recorded on the New York Times home page. Rows represent users, and the variables are: Age, Gender (0 = female, 1 = male), Impressions (number impressions), Clicks (number clicks), and a binary indicator for signed in or not Signed_in. We need to create two new variables: age_group, which contains six levels of Age (“<18”, “18-24”, “25-34”, “35-44”, “45-54”, “55-64”, and “65+”), and CTR or clickthrough-rate, calculated as the number of clicks / the number of impressions.

The EDA exercises in the book are intended for a single day, but what if I wanted to look at an entire month worth of data? Luckily I remembered learning a handy workflow for processing multiple data files in the third ggplot2 course from Datacamp by Rick Scavetta. If you have a situation where all your data files need similar wrangling/preparation before visualizing or modeling, you will find this helpful (I hope!).

Data files

I’ve moved the data from the Github repository to a local data folder (./data/).

Read data files

I will start by reading the first data set into RStudio using readr::read_csv() and then use dplyr::glimpse() to see what these data look like.

nyt1 <- readr::read_csv(file = "./data/nyt1.csv",
                col_names = TRUE)
## Parsed with column specification:
## cols(
##   Age = col_integer(),
##   Gender = col_integer(),
##   Impressions = col_integer(),
##   Clicks = col_integer(),
##   Signed_In = col_integer()
## )
nyt1 %>% dplyr::glimpse()
## Observations: 458,441
## Variables: 5
## $ Age          36, 73, 30, 49, 47, 47, 0, 46, 16, 52, 0, 21, 0, 5...
## $ Gender       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Impressions  3, 3, 3, 3, 11, 11, 7, 5, 3, 4, 8, 3, 4, 6, 5, 6, ...
## $ Clicks       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,...

I see all 5 variables in this data frame. I will create a pipeline that creates the two variables from the exercises and a third Female variable that makes the categories in Gender less ambiguous. When I am done, I look at the data with dplyr::glimpse() again.

nyt1 % 
    dplyr::mutate(
        age_group = case_when( # create age_group variable
                   Age % dplyr::glimpse()
## Observations: 458,441
## Variables: 8
## $ Age          36, 73, 30, 49, 47, 47, 0, 46, 16, 52, 0, 21, 0, 5...
## $ Gender       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Impressions  3, 3, 3, 3, 11, 11, 7, 5, 3, 4, 8, 3, 4, 6, 5, 6, ...
## $ Clicks       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,...
## $ age_group    "35-44", "65+", "25-34", "45-54", "45-54", "45-54"...
## $ CTR          0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.090...
## $ Female       "Male", "Female", "Male", "Female", "Female", "Mal...

Now I want to bundle the data reading and preparing commands as a function, clean_nyt.

clean_nyt % 
               dplyr::mutate(
                 age_group = case_when( # create age_group variable
                   Age < 18 ~ "= 18 &
                   Age = 25 & Age = 35 & 
                   Age = 45 & Age = 55 & 
                   Age = 65 ~ "65+"),          
                 CTR = Clicks/Impressions, # create CTR variable 
                 Female = case_when( # create new Female variable
                   Gender == 0 ~ "Male",              
                   Gender == 1 ~ "Female",             
                   TRUE ~ as.character(Gender))) 
} 

I will test clean_nyt() on a nyt2.csv

nyt2 <- clean_nyt("./data/nyt2.csv")
## Parsed with column specification:
## cols(
##   Age = col_integer(),
##   Gender = col_integer(),
##   Impressions = col_integer(),
##   Clicks = col_integer(),
##   Signed_In = col_integer()
## )
nyt2 %>% glimpse()
## Observations: 449,935
## Variables: 8
## $ Age          48, 0, 15, 0, 0, 0, 63, 0, 24, 16, 31, 0, 56, 52, ...
## $ Gender       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,...
## $ Impressions  3, 9, 4, 5, 7, 11, 3, 4, 2, 7, 5, 3, 5, 6, 2, 5, 4...
## $ Clicks       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,...
## $ age_group    "45-54", "<18", "<18", "<18", "<18", "<18", "55-64...
## $ CTR          0.0000, 0.1111, 0.0000, 0.0000, 0.1429, 0.0000, 0....
## $ Female       "Female", "Male", "Female", "Male", "Male", "Male"...

It looks like clean_nyt() is working!

Now I need to create a vector with the files in dir("./data"). I will call this nyt_files. Then I will paste0() the file path to the files and store this in the my_nyt_files vector.

nyt_files <- dir("./data")
my_nyt_files % head()
## [1] "./data/nyt1.csv"  "./data/nyt10.csv" "./data/nyt11.csv"
## [4] "./data/nyt12.csv" "./data/nyt13.csv" "./data/nyt14.csv"

Great. Now I can create a for loop to pass the files through and build a master data frame, my_nyt_data.

# Build my_nyt_data with a for loop
my_nyt_data <- NULL
for (file in my_nyt_files) { # for every file...
 temp <- clean_nyt(file)  # clean it with clean_nyt()
 temp$id <- sub(".csv", "", file) # add an id column (but remove .csv)
}
my_nyt_data % glimpse()
## Observations: 14,905,865
## Variables: 9
## $ Age          36, 73, 30, 49, 47, 47, 0, 46, 16, 52, 0, 21, 0, 5...
## $ Gender       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Impressions  3, 3, 3, 3, 11, 11, 7, 5, 3, 4, 8, 3, 4, 6, 5, 6, ...
## $ Clicks       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,...
## $ age_group    "35-44", "65+", "25-34", "45-54", "45-54", "45-54"...
## $ CTR          0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.090...
## $ Female       "Male", "Female", "Male", "Female", "Female", "Mal...
## $ id           "./data/nyt1", "./data/nyt1", "./data/nyt1", "./da...

That is a big data set–14,905,865 observations and 9 variables.

Clean up id

Now we just need to clean up the id variable a little with the stringr::str_replace() function and verify all 31 data sets are accounted for using dplyr::distinct() and base::nrow().

my_nyt_data % 
    dplyr::mutate(id = 
                      stringr::str_replace(id, 
                                pattern = "./data/" , 
                                replacement = "")) 
my_nyt_data %>% 
    dplyr::distinct(id) %>% 
    base::nrow()
## [1] 31

Check age_group

Ok we should check our new variables, age_group and Female. Let’s start with age_group using a combination of dplyr::count() and tidyr::spread().

my_nyt_data %>% dplyr::count(age_group, Age) %>% 
    tidyr::spread(age_group, n) %>% head()
## # A tibble: 6 x 8
##     Age   `<18` `18-24` `25-34` `35-44` `45-54` `55-64` `65+`
##                      
## 1     0 5613610      NA      NA      NA      NA      NA    NA
## 2     3       2      NA      NA      NA      NA      NA    NA
## 3     4       2      NA      NA      NA      NA      NA    NA
## 4     5      10      NA      NA      NA      NA      NA    NA
## 5     6      41      NA      NA      NA      NA      NA    NA
## 6     7     167      NA      NA      NA      NA      NA    NA

Yikes! There are 5613610 respondents with Age of 0. Let’s remove these using dplyr::filter() and re-check those zeros.

my_nyt_data %
    filter(Age != 0)
my_nyt_data %>% count(age_group, Age) %>% 
    spread(age_group, n) %>% head()
## # A tibble: 6 x 8
##     Age `<18` `18-24` `25-34` `35-44` `45-54` `55-64` `65+`
##                    
## 1     3     2      NA      NA      NA      NA      NA    NA
## 2     4     2      NA      NA      NA      NA      NA    NA
## 3     5    10      NA      NA      NA      NA      NA    NA
## 4     6    41      NA      NA      NA      NA      NA    NA
## 5     7   167      NA      NA      NA      NA      NA    NA
## 6     8   519      NA      NA      NA      NA      NA    NA

I should also check the top of the Age distribution with base::tail().

my_nyt_data %>% count(age_group, Age) %>% 
    spread(age_group, n) %>% tail()
## # A tibble: 6 x 8
##     Age `<18` `18-24` `25-34` `35-44` `45-54` `55-64` `65+`
##                    
## 1   108    NA      NA      NA      NA      NA      NA     7
## 2   109    NA      NA      NA      NA      NA      NA     1
## 3   111    NA      NA      NA      NA      NA      NA     3
## 4   112    NA      NA      NA      NA      NA      NA     1
## 5   113    NA      NA      NA      NA      NA      NA     1
## 6   115    NA      NA      NA      NA      NA      NA     1

115 is old…but possible. Ok I also want to add the dplyr::filter(Age != 0) to the clean_nyt() function.

# update function
clean_nyt %                      
                  dplyr::filter(Age != 0) %>% 
                    dplyr::mutate(
                  age_group = case_when( # create age_group variable
                          Age < 18 ~ "= 18 &
                    Age = 25 & Age = 35 &
                    Age = 45 & Age = 55 & 
                          Age = 65 ~ "65+"), 
                    CTR = Clicks/Impressions, # create CTR variable
                    Female = case_when( # create new Female variable
                                Gender == 0 ~ "Male", 
                                Gender == 1 ~ "Female",
                    TRUE ~ as.character(Gender)))
}

Let’s re-run the for loop and make sure we have nice, clean data in my_nyt_data.

# Build my_nyt_data with a for loop
my_nyt_data <- NULL
for (file in my_nyt_files) { # for every file...
 temp <- clean_nyt(file)  # clean it with clean_nyt()
 temp$id <- sub(".csv", "", file) # add an id column (but remove .csv)
    my_nyt_data % glimpse()
## Observations: 9,292,255
## Variables: 9
## $ Age          36, 73, 30, 49, 47, 47, 46, 16, 52, 21, 57, 31, 40...
## $ Gender       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,...
## $ Impressions  3, 3, 3, 3, 11, 11, 5, 3, 4, 3, 6, 5, 3, 5, 4, 4, ...
## $ Clicks       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ age_group    "35-44", "65+", "25-34", "45-54", "45-54", "45-54"...
## $ CTR          0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.090...
## $ Female       "Male", "Female", "Male", "Female", "Female", "Mal...
## $ id           "./data/nyt1", "./data/nyt1", "./data/nyt1", "./da...

I see a new sample size of 9,292,255–this is promising! I’ll re-check the age_group variable.

my_nyt_data %>% count(age_group, Age) %>% 
    spread(age_group, n)
## # A tibble: 111 x 8
##      Age `<18` `18-24` `25-34` `35-44` `45-54` `55-64` `65+`
##  *                  
##  1     3     2      NA      NA      NA      NA      NA    NA
##  2     4     2      NA      NA      NA      NA      NA    NA
##  3     5    10      NA      NA      NA      NA      NA    NA
##  4     6    41      NA      NA      NA      NA      NA    NA
##  5     7   167      NA      NA      NA      NA      NA    NA
##  6     8   519      NA      NA      NA      NA      NA    NA
##  7     9  1384      NA      NA      NA      NA      NA    NA
##  8    10  3592      NA      NA      NA      NA      NA    NA
##  9    11  8187      NA      NA      NA      NA      NA    NA
## 10    12 17054      NA      NA      NA      NA      NA    NA
## # ... with 101 more rows

That looks much better.

Check Female

Now check the new Female variable.

my_nyt_data %>% count(Gender, Female) %>% 
    spread(Female, n)
## # A tibble: 2 x 3
##   Gender  Female    Male
## *        
## 1      0      NA 4476582
## 2      1 4815673      NA

Ok–these categories are all present and accounted for. Let’s get a smaller sample of this data set to work with. I think 10% is enough. I’ll grab this with dplyr::sample_frac(), store it as nyt_data, and view the contents with dplyr::glimpse()

nyt_data % dplyr::glimpse()
## Observations: 929,226
## Variables: 9
## $ Age          23, 52, 63, 48, 63, 69, 55, 26, 66, 48, 53, 41, 60...
## $ Gender       1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1,...
## $ Impressions  7, 3, 1, 5, 3, 0, 5, 8, 4, 5, 4, 7, 7, 4, 7, 4, 7,...
## $ Clicks       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Signed_In    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ age_group    "18-24", "45-54", "55-64", "45-54", "55-64", "65+"...
## $ CTR          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, NaN, 0.000...
## $ Female       "Female", "Male", "Male", "Female", "Male", "Male"...
## $ id           "./data/nyt6", "./data/nyt6", "./data/nyt31", "./d...

Now to get a quick look at the distribution of CTR by age_group and Female in this sample.

nyt_data %>% filter(CTR != 0.0000) %>% 
    ggplot(aes(x = CTR, color = Female)) + 
                    geom_freqpoly(bins = 30) + 
                    facet_wrap(~ age_group, ncol = 3)

ggplot

ggsave("nyt_data_freqpoly_CTR.png", width = 8, height = 5, unit = "in", dpi = 480)

There you have it! 31 clean data sets and a visualization in under 250 lines!!!!


File creation

How was this file created?

  • File creation date: 2018-01-18
  • R version 3.4.3 (2017-11-30)
  • R version (short form): 3.4.3
  • tidyverse package version: 1.2.1
  • Additional session information
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       RStudio (1.1.402)           
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Los_Angeles         
##  date     2018-01-18
## Packages -----------------------------------------------------------------
##  package    * version    date       source                              
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                      
##  base       * 3.4.3      2017-12-07 local                               
##  bindr        0.1        2016-11-13 CRAN (R 3.4.0)                      
##  bindrcpp   * 0.2        2017-06-17 CRAN (R 3.4.0)                      
##  bitops     * 1.0-6      2013-08-17 CRAN (R 3.4.0)                      
##  broom        0.4.3      2017-11-20 CRAN (R 3.4.2)                      
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                      
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                      
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                      
##  compiler     3.4.3      2017-12-07 local                               
##  crayon       1.3.4      2017-09-16 CRAN (R 3.4.1)                      
##  datasets   * 3.4.3      2017-12-07 local                               
##  devtools     1.13.4     2017-11-09 CRAN (R 3.4.2)                      
##  digest       0.6.14     2018-01-14 CRAN (R 3.4.3)                      
##  dplyr      * 0.7.4      2017-09-28 CRAN (R 3.4.2)                      
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.0)                      
##  forcats    * 0.2.0.9000 2018-01-05 Github (tidyverse/forcats@fdde458)  
##  foreign      0.8-69     2017-06-22 CRAN (R 3.4.3)                      
##  ggdendro     0.1-20     2016-04-27 CRAN (R 3.4.0)                      
##  ggformula  * 0.6.1      2018-01-03 CRAN (R 3.4.3)                      
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.4.0)                      
##  glue         1.2.0      2017-10-29 CRAN (R 3.4.2)                      
##  graphics   * 3.4.3      2017-12-07 local                               
##  grDevices  * 3.4.3      2017-12-07 local                               
##  grid         3.4.3      2017-12-07 local                               
##  gridExtra    2.3        2017-09-09 CRAN (R 3.4.1)                      
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                      
##  haven        1.1.0.9000 2018-01-05 Github (tidyverse/haven@7f2b479)    
##  highr        0.6        2016-05-09 CRAN (R 3.4.0)                      
##  hms          0.4.0      2017-11-23 CRAN (R 3.4.2)                      
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                      
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                      
##  knitr      * 1.18       2017-12-27 CRAN (R 3.4.3)                      
##  labeling     0.3        2014-08-23 CRAN (R 3.4.0)                      
##  lattice    * 0.20-35    2017-03-25 CRAN (R 3.4.3)                      
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                      
##  lubridate    1.7.1      2017-11-03 CRAN (R 3.4.2)                      
##  magrittr   * 1.5        2014-11-22 CRAN (R 3.4.0)                      
##  MASS         7.3-48     2017-12-25 CRAN (R 3.4.3)                      
##  Matrix     * 1.2-12     2017-11-20 CRAN (R 3.4.3)                      
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                      
##  methods    * 3.4.3      2017-12-07 local                               
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                      
##  modelr       0.1.1      2017-07-24 CRAN (R 3.4.1)                      
##  mosaic     * 1.1.1      2017-11-28 CRAN (R 3.4.2)                      
##  mosaicCore   0.4.2      2017-11-28 CRAN (R 3.4.2)                      
##  mosaicData * 0.14.0     2016-06-17 CRAN (R 3.4.0)                      
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                      
##  nlme         3.1-131    2017-02-06 CRAN (R 3.4.3)                      
##  parallel     3.4.3      2017-12-07 local                               
##  pillar       1.1.0      2018-01-14 CRAN (R 3.4.3)                      
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.0)                      
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                      
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.1)                      
##  purrr      * 0.2.4      2017-10-18 CRAN (R 3.4.2)                      
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                      
##  Rcpp         0.12.14    2017-11-23 CRAN (R 3.4.3)                      
##  RCurl      * 1.95-4.10  2018-01-04 CRAN (R 3.4.3)                      
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                      
##  readxl       1.0.0.9000 2017-06-10 Github (tidyverse/readxl@a1c46a8)   
##  reshape2   * 1.4.3      2017-12-11 CRAN (R 3.4.2)                      
##  rlang        0.1.6      2017-12-21 CRAN (R 3.4.3)                      
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.1)                      
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                      
##  RWordPress * 0.2-3      2018-01-11 Github (duncantl/RWordPress@ce6d2d6)
##  scales       0.5.0      2017-08-24 CRAN (R 3.4.1)                      
##  splines      3.4.3      2017-12-07 local                               
##  stats      * 3.4.3      2017-12-07 local                               
##  stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                      
##  stringr    * 1.3.0      2017-12-24 Github (tidyverse/stringr@5376e12)  
##  tibble     * 1.4.1      2017-12-25 CRAN (R 3.4.3)                      
##  tidyr      * 0.7.2      2017-10-16 CRAN (R 3.4.2)                      
##  tidyselect   0.2.3      2017-11-06 CRAN (R 3.4.2)                      
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                      
##  tools        3.4.3      2017-12-07 local                               
##  utils      * 3.4.3      2017-12-07 local                               
##  withr        2.1.1      2017-12-19 CRAN (R 3.4.2)                      
##  XML          3.98-1.9   2017-06-19 CRAN (R 3.4.1)                      
##  xml2         1.1.1      2017-01-24 CRAN (R 3.4.0)                      
##  XMLRPC     * 0.3-0      2018-01-11 Github (duncantl/XMLRPC@4d20a7c)    
##  yaml         2.1.16     2017-12-12 CRAN (R 3.4.2)

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.