160 Rules from PLoS with googlesheets, stringr and tidytext

Today I found two great articles from PLoS Computational Biology. The first, Ten simple rules for biologists learning to program by Maureen A. Carey and Jason A. Papin, touches on a few excellent points I hadn’t read before.

  • Learn how to ask questions – (Rule 5)

“Once you understand the problem and have identified that there is no obvious (and publicly available) solution, ask for answers in programming communities. When asking, paraphrase the fundamental problem. Include error messages and enough information to reproduce the problem (include packages, versions, data or sample data, code, etc.). Present a brief summary of what was done, what was intended, how you interpret the problem, what troubleshooting steps were already taken, and whether you have searched other posts for the answer.”

  • Teach yourself – (Rule 9)

“You are not alone in your occasional frustration.”

Shout out to @drob‘s tweet!


The second (published on the same day), was Ten simple rules for drawing scientific comics by Jason E. McDermott, Matthew Partridge, and Yana Bromberg. I’ve never attempted to draw a scientific comic, but after reading their article, I felt like it might be something I could tackle (see rule #1).

The advice in these two articles sounded familiar, so I got to thinking about the overlap in wording across PLoS's Ten Simple Rules collection. I've been reading Text Mining with R – A Tidy Approach by Julia Silge and David Robinson and wanted to use some of the functions in the tidytext package, so I made a Googlesheet with 16 of my favorite articles from this series.

This tutorial will go over the following:

  1. Downloading the data from a Googlesheet with the googlesheets package
  2. Tidying the data (and un-tidying it) with tidyr
  3. Formatting a date variable with lubridate
  4. String manipulation with stringr
  5. Visualizations with ggplot2, wordcloud2, and wordcloud

Load the packages

library(wordcloud)
library(tidyverse)
library(googlesheets)
library(ggthemes)
library(tidytext)
library(magrittr)
library(scales)
library(wordcloud2)
library(webshot)
library(htmlwidgets)

1. Downloading the data from a Googlesheet

The googlesheets package from Jenny Bryan and Joanna Zhao allows anyone to download data from Google Sheets, Google's browser-based spreadsheet application.

The functions for this package are user-friendly because they all start with a gs_ prefix (for ‘googlesheet’).
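A quick sketch for seeing them all (base::ls() works here because the package is attached via the library() call above):

# list the exported functions that start with the gs_ prefix
ls("package:googlesheets", pattern = "^gs_") %>% head()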

The first function to get me started is googlesheets::gs_ls(). It lists the googlesheets I would see if I opened the Google Sheets home page in my browser.


Running gs_ls() takes me to an authentication page. After providing my authentication, I can take a look at the my_sheets data frame with dplyr::glimpse().

my_sheets <- gs_ls()
my_sheets %>% dplyr::glimpse()
## Observations: 221
## Variables: 10
## $ sheet_title  "DMU Projects", "Data Is Plural — Structured Archi...
## $ author       "jen.creasman", "jsvine", "martin", "katherineb1",...
## $ perm         "r", "r", "rw", "r", "rw", "r", "r", "rw", "rw", "...
## $ version      "old", "old", "old", "old", "old", "old", "old", "...
## $ updated      2018-01-11 19:16:09, 2018-01-10 13:09:16, 2018-01...
## $ sheet_key    "1OZKbogpJjpx-jlc4CnJ7CUfCgI_BsdyjPUe3wYwcxcQ", "1...
## $ ws_feed      "https://spreadsheets.google.com/feeds/worksheets/...
## $ alternate    "https://docs.google.com/a/newsandnumbers.org/spre...
## $ self         "https://spreadsheets.google.com/feeds/spreadsheet...
## $ alt_key      "1OZKbogpJjpx-jlc4CnJ7CUfCgI_BsdyjPUe3wYwcxcQ", "1...

The information in the my_sheets table is interesting to look through–it contains sheets I’ve created, the authors, the date they were updated, and other information that comes in handy when using the other functions in the googlesheets package.

FYI: Not all the sheets that have been shared with me are listed here, but this is a known issue. Read the documentation on gs_ls() in the reference manual for more information.

Register the googlesheet

I want to look at the 160PLoSSimpleRules sheet, so I can use the googlesheets::gs_title() function to assign this particular sheet to the Plos10GoogleSheet object.

Plos10GoogleSheet <- gs_title("160PLoSSimpleRules")
## Sheet successfully identified: "160PLoSSimpleRules"

I get the Sheet successfully identified: "160PLoSSimpleRules" message. I am going to use the base::typeof() function to see what kind of object Plos10GoogleSheet is.

Plos10GoogleSheet %>% typeof()
## [1] "list"
# how many items?
Plos10GoogleSheet %>% length()
## [1] 17
# 17 

So I can see that Plos10GoogleSheet is a list with 17 items.

Why isn’t Plos10GoogleSheet a data frame?

Lists are actually great receptacles for data of unequal length. We can run the base::summary() function on Plos10GoogleSheet to get a little more information.

Plos10GoogleSheet %>% summary()
##             Length Class   Mode     
## sheet_key    1     -none-  character
## sheet_title  1     -none-  character
## n_ws         1     -none-  numeric  
## ws_feed      1     -none-  character
## browser_url  1     -none-  character
## updated      1     POSIXct numeric  
## reg_date     1     POSIXct numeric  
## visibility   1     -none-  character
## lookup       1     -none-  logical  
## is_public    1     -none-  logical  
## author       1     -none-  character
## email        1     -none-  character
## perm         1     -none-  character
## version      1     -none-  character
## links        3     tbl_df  list     
## ws          12     tbl_df  list     
## alt_key      1     -none-  character

The summary output tells us that this list stores objects of varying lengths and modes, which makes it hard to fit into a data frame. Note that a tbl_df is also a list, but its components need to be equal-length vectors (a rectangle).
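As a minimal sketch (with made-up values) of the difference, a list tolerates unequal lengths while a data frame insists on a rectangle:

# a list happily mixes lengths and types
x <- list(sheet_key = "abc123", n_ws = 1, links = c("a", "b", "c"))
lengths(x)
## sheet_key      n_ws     links 
##         1         1         3
# a data frame needs equal-length columns; length-1 values get recycled
data.frame(sheet_key = "abc123", n_ws = 1, links = c("a", "b", "c"))
##   sheet_key n_ws links
## 1    abc123    1     a
## 2    abc123    1     b
## 3    abc123    1     c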

The 160 PLoS Simple Rules Google Sheet

The data are in the first worksheet of 160PLoSSimpleRules, so I need to tell R to locate that sheet and load its contents into a data frame. I can do this by using the googlesheets::gs_ws_ls() function to get the names of the worksheets in 160PLoSSimpleRules ("160PLoSSimpleRulesTidy" in this case).

(my_google_sheet <- gs_ws_ls(Plos10GoogleSheet))
## [1] "160PLoSSimpleRulesTidy"

Now I can use the googlesheets::gs_read() function to read my_google_sheet into a data frame called Plos10.

Plos10 <- Plos10GoogleSheet %>%
  gs_read(ws = my_google_sheet)
## Accessing worksheet titled '160PLoSSimpleRulesTidy'.
## 
Downloading: 4.5 kB
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   pubyear = col_character(),
##   title = col_character(),
##   rule_number = col_integer(),
##   rules = col_character(),
##   link = col_character()
## )

The shape of Plos10

What is the shape of this data frame?

Plos10 %>% glimpse()
## Observations: 160
## Variables: 6
## $ id           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,...
## $ pubyear      "27 Mar 2014", "27 Mar 2014", "27 Mar 2014", "27 M...
## $ title        "Ten Simple Rules for Effective Computational Rese...
## $ rule_number  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7...
## $ rules        "Look Before You Leap", "Develop a Prototype First...
## $ link         "https://doi.org/10.1371/journal.pcbi.1003506", "h...

What titles are included?

I use the base::unique() function here because I like the way it prints in the Rmarkdown document.

Plos10 %$% unique(title) %>% writeLines()
## Ten Simple Rules for Effective Computational Research
## Ten Simple Rules for Creating a Good Data Management Plan
## Ten Simple Rules for Taking Advantage of Git and GitHub
## Ten Simple Rules for Better Figures
## Ten Simple Rules for Effective Statistical Practice
## Ten Simple Rules To Commercialize Scientific Research
## Ten Simple Rules for Editing Wikipedia
## Ten Simple Rules for Approaching a New Job
## Ten Simple Rules for Doing Your Best Research, According to Hamming
## Ten Simple Rules for Reproducible Computational Research
## Ten simple rules for biologists learning to program
## Ten simple rules for drawing scientific comics
## Ten Simple Rules for Digital Data Storage
## Ten Simple Rules for Lifelong Learning, According to Hamming
## Ten Simple Rules for the Care and Feeding of Scientific Data
## Ten simple rules for responsible big data research

The titles in this data frame include:

  1. Ten Simple Rules For Effective Computational Research
  2. Ten Simple Rules For Creating A Good Data Management Plan
  3. Ten Simple Rules For Taking Advantage Of Git And Github
  4. Ten Simple Rules For Better Figures
  5. Ten Simple Rules For Effective Statistical Practice
  6. Ten Simple Rules To Commercialize Scientific Research
  7. Ten Simple Rules For Editing Wikipedia
  8. Ten Simple Rules For Approaching A New Job
  9. Ten Simple Rules For Doing Your Best Research, According To Hamming
  10. Ten Simple Rules For Reproducible Computational Research
  11. Ten Simple Rules For Biologists Learning To Program
  12. Ten Simple Rules For Drawing Scientific Comics
  13. Ten Simple Rules For Digital Data Storage
  14. Ten Simple Rules For Lifelong Learning, According To Hamming
  15. Ten Simple Rules For The Care And Feeding Of Scientific Data
  16. Ten Simple Rules For Responsible Big Data Research

It’s probably a good idea to save this as a .csv just in case something changes on the Googlesheet.

if (!file.exists("./Data/Raw")) {
    dir.create("./Data/Raw")
}
write_csv(as_data_frame(Plos10), "./Data/Raw/Plos10.csv")

Option 2: Downloading the data from GitHub

There is also a copy of the raw data as a .csv on GitHub. I can load this into R using the typical utils::download.file() and readr::read_csv() functions.

PLoSGitHub_file_url <- c("https://goo.gl/qoS1M1") # file url
PLoSGitHub_file_path <- c("./Data/Plos10GitHub.csv") # file path
# download file
download.file(url = PLoSGitHub_file_url, 
              destfile = PLoSGitHub_file_path)
Plos10GitHub <- read_csv(PLoSGitHub_file_path)
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   pubyear = col_character(),
##   title = col_character(),
##   rule_number = col_integer(),
##   rules = col_character(),
##   link = col_character()
## )
Plos10GitHub %>% glimpse()
## Observations: 160
## Variables: 6
## $ id           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,...
## $ pubyear      "27 Mar 2014", "27 Mar 2014", "27 Mar 2014", "27 M...
## $ title        "Ten Simple Rules for Effective Computational Rese...
## $ rule_number  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7...
## $ rules        "Look Before You Leap", "Develop a Prototype First...
## $ link         "https://doi.org/10.1371/journal.pcbi.1003506", "h...

2. Tidying the data

When I created this data table, I put it in a tidy format. But what if I wanted the rules listed across columns (instead of down the rows)? I can’t think of a reason I want the data arranged this way right now, but sometimes I find it fun to un-tidy data sets just to get a better handle on the tidyr functions. I’ll go over a couple options for rearranging these data into non-tidy formats.

Un-tidying the data

One option for moving the rules from rows in a single column to 10 separate columns is to set the names with magrittr::set_names(). But this requires that I know the expected number of columns in my new data frame beforehand. That isn't too hard with this example (though it would be with a larger data set). I will create a new vector of names (plos10_spread_names) and spread the rules apart by rule_number. One caution: set_names() assigns names purely by position, so the vector has to match the column order spread() produces; in the glimpse below, the title and pubyear labels ended up swapped because my vector didn't.

plos10_spread_names <- c("id", "title", "pubyear", "link",
                         paste0("rule_", 1:10))
Plos10GitHub %>% 
            spread(
                key = rule_number, 
                value = rules) %>% 
            set_names(plos10_spread_names) %>% 
            glimpse()
## Observations: 16
## Variables: 14
## $ id       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ title    "27 Mar 2014", "22 Oct 2015", "14 Jul 2016", "11 Sep 2...
## $ pubyear  "Ten Simple Rules for Effective Computational Research...
## $ link     "https://doi.org/10.1371/journal.pcbi.1003506", "https...
## $ rule_1   "Look Before You Leap", "Determine the Research Sponso...
## $ rule_2   "Develop a Prototype First", "Identify the Data to Be ...
## $ rule_3   "Make Your Code Understandable to Others (and Yourself...
## $ rule_4   "Don't Underestimate the Complexity of Your Task", "Ex...
## $ rule_5   "Understand the Mathematical, Numerical, and Computati...
## $ rule_6   "Use Pictures: They Really Are Worth a Thousand Words"...
## $ rule_7   "Version Control Everything", "Define the Project’s Da...
## $ rule_8   "Test Everything", "Describe How the Data Will Be Diss...
## $ rule_9   "Share Everything", "Assign Roles and Responsibilities...
## $ rule_10  "Keep Going!", "Prepare a Realistic Budget", "Use GitH...

Another option uses dplyr::mutate() and tidyr::unite(). The first creates a new variable rule_prefix to add rule_ to the rule_number id, then tidyr::unite() sticks these two columns together into rule_key. We can then pick up with the tidyr::spread() function, using the new rule_key as the key variable. Note that the rule columns now come out in lexical order (rule_1, rule_10, rule_2, …) because the keys are strings rather than numbers.

Plos10GitHub %>%
    mutate(rule_prefix = 'rule') %>%
    unite(col = rule_key, # new column
           rule_prefix, # the default separator is '_'
           rule_number) %>% 
    spread(key = rule_key, 
            value = rules) %>% glimpse()
## Observations: 16
## Variables: 14
## $ id       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ pubyear  "27 Mar 2014", "22 Oct 2015", "14 Jul 2016", "11 Sep 2...
## $ title    "Ten Simple Rules for Effective Computational Research...
## $ link     "https://doi.org/10.1371/journal.pcbi.1003506", "https...
## $ rule_1   "Look Before You Leap", "Determine the Research Sponso...
## $ rule_10  "Keep Going!", "Prepare a Realistic Budget", "Use GitH...
## $ rule_2   "Develop a Prototype First", "Identify the Data to Be ...
## $ rule_3   "Make Your Code Understandable to Others (and Yourself...
## $ rule_4   "Don't Underestimate the Complexity of Your Task", "Ex...
## $ rule_5   "Understand the Mathematical, Numerical, and Computati...
## $ rule_6   "Use Pictures: They Really Are Worth a Thousand Words"...
## $ rule_7   "Version Control Everything", "Define the Project’s Da...
## $ rule_8   "Test Everything", "Describe How the Data Will Be Diss...
## $ rule_9   "Share Everything", "Assign Roles and Responsibilities...

You can check to see if you did this correctly by downloading the Plos10Wide.csv from GitHub and comparing their dimensions and names.
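Here is a minimal sketch of that check; the object name Plos10Spread (for the spread result above) and the local file path are assumptions, so substitute your own:

# assumes the spread result was assigned to Plos10Spread and
# Plos10Wide.csv was downloaded to ./Data/
Plos10Wide <- readr::read_csv("./Data/Plos10Wide.csv")
identical(dim(Plos10Spread), dim(Plos10Wide))
setdiff(names(Plos10Spread), names(Plos10Wide)) # character(0) when names match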

3. Format publication year

The publication year in Plos10 is listed as chr for character. I need to convert this to the Date format and this looks like a prime candidate for the lubridate::dmy() function.

Plos10$pubyear %>% glimpse()
##  chr [1:160] "27 Mar 2014" "27 Mar 2014" "27 Mar 2014" "27 Mar 2014" ...
# listed as a chr
Plos10$pubyear <- lubridate::dmy(Plos10$pubyear)
Plos10$pubyear %>% glimpse()
##  Date[1:160], format: "2014-03-27" "2014-03-27" "2014-03-27" "2014-03-27" "2014-03-27" ...

4. Plot publications by year and title

Let's get a quick look at the publication year (pubyear) by title. I'll reorder the titles by publication year so the points follow a pattern of increasing year.

Plos10 %>% 
    ggplot(aes(x = pubyear, 
               y = reorder(title, pubyear))) +
    geom_point(alpha = 0.08, 
               size = 2.5, 
               color = "dodgerblue2") 

plot of chunk 01-geom_point_pub_date

Not bad…but there is a lot of redundant language in the titles. We will use stringr below to remove these characters.

5. Convert to lower case and remove redundant words

I want to clean up the title for each article because the repetitive language is essentially chart junk. The "Ten Simple Rules for" and "Ten Simple Rules To" strings can be removed from each title (but unfortunately they have inconsistent capitalization)…

No problem! I can use the stringr::str_to_title() and stringr::str_remove_all() to get the titles looking prettier.

Also note the alternative exposition pipe, magrittr's %$%, used with the head() function.

Plos10$title <- Plos10$title %>% 
    str_to_title(.) %>% 
    str_remove_all(., "Ten Simple Rules For ") %>% 
    str_remove_all(., "Ten Simple Rules To ")
Plos10 %$% head(title)
## [1] "Effective Computational Research" "Effective Computational Research"
## [3] "Effective Computational Research" "Effective Computational Research"
## [5] "Effective Computational Research" "Effective Computational Research"
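Since str_remove_all() treats its pattern as a regular expression, both prefixes can also be dropped in a single call after the title-case conversion. An equivalent sketch using alternation:

Plos10$title <- Plos10$title %>% 
    str_to_title() %>% 
    str_remove_all("Ten Simple Rules (For|To) ")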

Now I will try the plot again.

Plos10 %>% 
    ggplot(aes(x = pubyear, 
               y = reorder(title, pubyear))) +
    geom_point(alpha = 0.07, 
               size = 3.2,
               color = "deepskyblue") +
        theme(plot.background = element_rect(color = "gray100", 
                                             size = 1)) + 
        theme(panel.background = element_rect(fill = "azure")) +
        theme(panel.grid.minor = element_blank()) +
        theme(panel.grid.major = element_blank()) +
        theme(axis.line = element_line(color = "grey50")) + 
        xlab("Publication Date") + 
        ylab("Publication Title") +
        ggtitle("Publication Dates for 16 PLoS Ten Simple Rules Articles")    

plot of chunk 02-geom_point_short_title

Tidying the words

The tidytext::unnest_tokens() function separates the words from the rules in the Plos10 data frame. In the specific language of the Text Mining with R book,

“A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.”

# make a giant bag of words
Plos10Tokenized <- Plos10 %>%
        tidytext::unnest_tokens(words, 
                        rules)
Plos10Tokenized %>% glimpse()
## Observations: 1,012
## Variables: 6
## $ id           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pubyear      2014-03-27, 2014-03-27, 2014-03-27, 2014-03-27, 2...
## $ title        "Effective Computational Research", "Effective Com...
## $ rule_number  1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4,...
## $ link         "https://doi.org/10.1371/journal.pcbi.1003506", "h...
## $ words        "look", "before", "you", "leap", "develop", "a", "...

We can see that rules has been renamed to words and that the size of the data frame has grown from 160 observations in Plos10 to 1,012 in Plos10Tokenized.

Stop words are words that get removed because they are common and uninteresting for text analysis: words like "the", "is", "at", "which", and "on".

stop_words %>% 
                        tbl_df() %>% 
                        glimpse()
## Observations: 1,149
## Variables: 2
## $ word     "a", "a's", "able", "about", "above", "according", "ac...
## $ lexicon  "SMART", "SMART", "SMART", "SMART", "SMART", "SMART", ...

I am going to rename the two columns in tidytext::stop_words to make life a little easier for joining. There are three lexicons in the tidytext::stop_words data frame. I will use dplyr::filter() to keep only the "onix" word list described below:

“This stopword list is probably the most widely used stopword list. It covers a wide number of stopwords without getting too aggressive and including too many words which a user might search upon. This wordlist contains 429 words.” – from the Onix Text Retrieval Toolkit

# rename vars to make life easier
StopWords <- stop_words %>%
    dplyr::select(
        words = word,
        lex = lexicon) %>% 
    dplyr::filter(lex %in% "onix") 
StopWords %>% dplyr::glimpse()
## Observations: 404
## Variables: 2
## $ words  "a", "about", "above", "across", "after", "again", "agai...
## $ lex    "onix", "onix", "onix", "onix", "onix", "onix", "onix", ...

It looks like there are 404 words in the onix stop_words list, not the 429 advertised.
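We can double-check by counting the words each lexicon contributes (the three counts should sum to the 1,149 rows we saw above):

stop_words %>% dplyr::count(lexicon)
# expect three rows: SMART (571), onix (404), and snowball (174)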

Now we can use the dplyr::anti_join() function to remove the stop words from Plos10Tokenized.

# remove stop_words from Plos10Tokenized
Plos10TidyText <- Plos10Tokenized %>%
    anti_join(StopWords, by = "words") 

# have a look
Plos10TidyText %>% glimpse()
## Observations: 509
## Variables: 6
## $ id           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pubyear      2014-03-27, 2014-03-27, 2014-03-27, 2014-03-27, 2...
## $ title        "Effective Computational Research", "Effective Com...
## $ rule_number  1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5,...
## $ link         "https://doi.org/10.1371/journal.pcbi.1003506", "h...
## $ words        "look", "leap", "develop", "prototype", "code", "u...

This new Plos10TidyText data frame has 6 variables and 509 observations. I can use the dplyr::count() function to find the 5 most common words.

Plos10TidyText %>%
    count(words, sort = TRUE) %>% 
    head(5) %>% dplyr::glimpse()
## Observations: 5
## Variables: 2
## $ words  "data", "github", "analysis", "code", "learning"
## $ n      30, 6, 5, 5, 5

It looks like data is the most common word by far, followed by github, analysis, code and learning.

6. Creating word clouds

A word cloud is a graphical representation of word frequencies. Two options for creating word clouds in R are wordcloud and wordcloud2.

The wordcloud package

The wordcloud::wordcloud() function takes a vector of words plus a handful of display arguments: max.words is the upper limit on the number of words in the graph (the least frequent terms get dropped), min.freq is the lowest number of times a word needs to occur to end up in the cloud, scale is the ratio of the biggest to the smallest word, rot.per is the proportion of words that will be rotated, colors accepts most R color options (I used topo.colors(25)), and random.color is a logical argument that, when TRUE, assigns colors to words at random instead of by frequency.

wordcloud::wordcloud(Plos10TidyText$words, # vector of words
                    max.words = 25, # maximum number of words to be plotted
                                    # least frequent terms dropped
                    min.freq = 2, # words with frequency below min.freq 
                                 # will not be plotted
                    scale = c(6, 1),
                    # colors = brewer.pal(8, "Dark2"),
                    rot.per = 0.50,
                    colors = topo.colors(25),
                    random.color = TRUE)

plot of chunk 03-PLoS_wordcloud

The wordcloud2 package

Another option for word clouds comes from the wordcloud2 package developed by Chiffon Lang.

In order to use the wordcloud2::wordcloud2() function, we need a data frame with two columns: the first with the words from the PLoS rules in Plos10TidyText and the second with their frequencies (from dplyr::count()). We will create this object and call it Plos10TidyTextCount.

Plos10TidyTextCount <- Plos10TidyText %>% 
    count(words, sort = TRUE)
Plos10TidyTextCount %>% head() %>% dplyr::glimpse()
## Observations: 6
## Variables: 2
## $ words  "data", "github", "analysis", "code", "learning", "pract...
## $ n      30, 6, 5, 5, 5, 5

Now I can send this to wordcloud2::wordcloud2() and specify the color as "random-dark", the fontWeight as "bold", and the size as 1.5.

wordcloud2::wordcloud2(Plos10TidyTextCount, 
           color = "random-dark", 
           fontWeight = "bold",
           size = 1.5)

plot of chunk 03.1-PLoS_wordcloud2
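One quirk worth knowing: wordcloud2() returns an htmlwidget rather than a base graphic, which is why htmlwidgets and webshot were loaded at the top. A sketch for saving the cloud as a static image (the file names are arbitrary, and webshot needs PhantomJS installed via webshot::install_phantomjs()):

PlosCloud <- wordcloud2::wordcloud2(Plos10TidyTextCount, size = 1.5)
# save the widget to an HTML file
htmlwidgets::saveWidget(PlosCloud, "plos_wordcloud2.html", selfcontained = FALSE)
# screenshot the HTML into a PNG; delay gives the cloud time to render
webshot::webshot("plos_wordcloud2.html", "plos_wordcloud2.png", delay = 5)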

7. Plotting proportions by title

Why don't we see how the words in Drawing Scientific Comics occur in proportion to the other titles? We can do this by creating a data frame that contains 4 variables:

  1. the words created by tidytext::unnest_tokens()
  2. the title of the articles
  3. a proportion of words by title
  4. the Drawing Scientific Comics title with its own proportion column

The steps to create this are outlined below:

# expect a warning about rows with missing values being removed
Plos10TidyProp <- Plos10TidyText %>% 
        dplyr::count(title, words) %>%  # creates df with title, 
                                        # words, and n (444 x 3)
        dplyr::group_by(title) %>% # groups by title (444 x 3)
        dplyr::mutate(proportion = n / sum(n)) %>% # adds proportion of 
                                                 # words by title (444 x 4)
        dplyr::select(-n) %>% # removes the n column (444 x 3)
        dplyr::group_by(title, words) %>% # groups by title/words (444 x 3)
        tidyr::spread(title, proportion) %>% # spreads titles to columns
                                        # creates a df thats wide (374 x 17)
        # now gather up the titles across columns into a single row 
        # except the drawing scientific comics paper
        tidyr::gather(title, proportion, 
            `Approaching A New Job`:`The Care And Feeding Of Scientific Data`,
            -(`Drawing Scientific Comics`))
        # this gets a new df with 5,610 rows and 4 columns (great for viz!)
Plos10TidyProp %>% glimpse()
## Observations: 5,610
## Variables: 4
## $ words                        ...
## $ `Drawing Scientific Comics`  ...
## $ title                        ...
## $ proportion                   ...

Why this data frame? Well, I need to be able to graph the other titles according to Drawing Scientific Comics. Summarizing these two columns helped me understand what I wanted to graph.

I want a summary of the Drawing Scientific Comics variable, but first I need to remove the missing values with dplyr::filter() and then dplyr::ungroup().

Drawing_Scientific_Comics_summary <- Plos10TidyProp %>% 
    filter(!is.na(`Drawing Scientific Comics`)) %>% 
    ungroup() %>% 
    summarize(
        min_prop = min(`Drawing Scientific Comics`),
        max_prop = max(`Drawing Scientific Comics`),
        n_prop = n())
Drawing_Scientific_Comics_summary %>% dplyr::glimpse()
## Observations: 1
## Variables: 3
## $ min_prop  0.04545
## $ max_prop  0.1364
## $ n_prop    285

Now I want to see the min, max, median, and n of the proportion column, but this time I want to group it by the titles. There should be 15 rows in this data frame.

Plos10TidyProp_summarize_proportions <- Plos10TidyProp %>% 
    filter(!is.na(proportion)) %>% 
    ungroup() %>% 
    group_by(title) %>% 
    summarize(
        min_prop = min(proportion),
        median_prop = median(proportion),
        max_prop = max(proportion),
        n_prop = n())
Plos10TidyProp_summarize_proportions %>% dplyr::glimpse()
## Observations: 15
## Variables: 5
## $ title        "Appr...
## $ min_prop     0.043...
## $ median_prop  0.043...
## $ max_prop     0.130...
## $ n_prop       20, 2...

OK, but what if I want to add pubyear back to this data frame? This is a great job for dplyr::left_join(), joining by title. But first I should pare Plos10 down to just the columns I need so I don't duplicate any.

Plos10TitleYearTidy <- Plos10 %>% 
    dplyr::select(title,
                  pubyear)
# join to Plos10TidyProp
Plos10TidyProp <- Plos10TidyProp %>% 
    dplyr::left_join(Plos10TitleYearTidy, 
                     by = "title") 
Plos10TidyProp %>% glimpse()
## Observations: 182,138
## Variables: 5
## $ words                        ...
## $ `Drawing Scientific Comics`  ...
## $ title                        ...
## $ proportion                   ...
## $ pubyear                      ...

This creates a long data frame (182,138 observations) while adding only a single column, pubyear. The row count balloons because title is not unique in Plos10TitleYearTidy, so every row matches many times. I can check the min, max, and median of proportion again to see whether the underlying distribution has changed.

check_summaries_on_proportion <- Plos10TidyProp %>% 
    filter(!is.na(proportion)) %>% 
    ungroup() %>% 
    group_by(title) %>% 
    summarize(
        min_prop = min(proportion),
        median_prop = median(proportion),
        max_prop = max(proportion),
        n_prop = n())
check_summaries_on_proportion %>% dplyr::glimpse()
## Observations: 15
## Variables: 5
## $ title        "Appr...
## $ min_prop     0.043...
## $ median_prop  0.043...
## $ max_prop     0.130...
## $ n_prop       460, ...

The n_prop has increased from the last summary, but the proportion min and max haven't changed.

I am going to double-check the number of unique titles in this new Plos10TidyProp data frame (this time using dplyr::ungroup()).

Plos10TidyProp %>% 
    dplyr::ungroup() %$% 
    base::unique(title) 
##  [1] "Approaching A New Job"                         
##  [2] "Better Figures"                                
##  [3] "Biologists Learning To Program"                
##  [4] "Commercialize Scientific Research"             
##  [5] "Creating A Good Data Management Plan"          
##  [6] "Digital Data Storage"                          
##  [7] "Doing Your Best Research, According To Hamming"
##  [8] "Editing Wikipedia"                             
##  [9] "Effective Computational Research"              
## [10] "Effective Statistical Practice"                
## [11] "Lifelong Learning, According To Hamming"       
## [12] "Reproducible Computational Research"           
## [13] "Responsible Big Data Research"                 
## [14] "Taking Advantage Of Git And Github"            
## [15] "The Care And Feeding Of Scientific Data"

Still at 15!
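An equivalent, more compact check (a sketch) uses dplyr::n_distinct():

Plos10TidyProp %>% 
    dplyr::ungroup() %>% 
    dplyr::summarize(n_titles = dplyr::n_distinct(title)) # should return 15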

Now before I can create a plot that shows the overlap between the words in the ten rules for Drawing Scientific Comics and any of the other 15 articles, I should create a diff_prop variable that computes the difference between the Drawing Scientific Comics and proportion variables. I can do this with dplyr's filter(), mutate(), group_by(), and summarize() to look at the median (middle) value and total (n).

Plos10TidyProp %>% 
    dplyr::filter(!is.na(`Drawing Scientific Comics`) & 
           !is.na(proportion)) %>% 
    dplyr::mutate(
        diff_prop = `Drawing Scientific Comics` - proportion) %>% 
    dplyr::group_by(title) %>% 
    dplyr::summarize(
        middle_diff_prop = median(diff_prop),
        n_prop = n()) %>% dplyr::glimpse()
## Observations: 4
## Variables: 3
## $ title             ...
## $ middle_diff_prop  ...
## $ n_prop            ...

It looks like there is only overlap in words for Drawing Scientific Comics and four other titles:

  1. Biologists Learning To Program
  2. Effective Statistical Practice
  3. Lifelong Learning, According To Hamming
  4. Responsible Big Data Research

I will assign this new variable to Plos10TidyProp.

Plos10TidyProp <- Plos10TidyProp %>% 
    dplyr::filter(!is.na(`Drawing Scientific Comics`) & 
           !is.na(proportion)) %>% 
    dplyr::mutate(
        diff_prop = `Drawing Scientific Comics` - proportion)
Plos10TidyProp %>% glimpse()
## Observations: 187
## Variables: 6
## $ words                        ...
## $ `Drawing Scientific Comics`  ...
## $ title                        ...
## $ proportion                   ...
## $ pubyear                      ...
## $ diff_prop                    ...

Now we are ready to build a plot with proportion on the x axis and the Drawing Scientific Comics proportions on the y, and then map the diff_prop variable to color. The geoms we will be using are geom_abline() (a reference line for gauging the relative frequency of words in each set of rules), geom_jitter() (to add some random noise, which makes it easier to see the words around the line), and geom_text() (to put the words themselves on the plot, ideally near the jittered points).

Plos10TidyProp %>% 
    dplyr::filter(!is.na(diff_prop)) %>% 
    ggplot(aes(x = proportion, 
                y = `Drawing Scientific Comics`,
                color = diff_prop)) + 
    # add the abline
    geom_abline(color = "coral3", 
                lty = 2,
                slope = 1) +
    # add a jitter
    geom_jitter(alpha = 0.2,
                size = 1.1,
                width = 0.01,
                height = 0.01) +
    # add the text 
    geom_text(aes(label = words,
                color = diff_prop),
                check_overlap = TRUE,
                size = 4.5,
                nudge_x = 0.0,
                nudge_y = 0.0) +
    # add a color gradient
    scale_color_gradient(limits = c(0.0, 0.01),
                low = 'royalblue3',
                high = 'red2') +
    # adjust the scales for % 
    scale_x_continuous(labels = scales::percent, 
                limits = c(0, 0.10)) + 
    scale_y_continuous(labels = scales::percent, 
                limits = c(0, 0.15)) + 
    # remove legend position
    theme(legend.position = "none") +
    # facet by the titles with 2 x 2
    facet_wrap(~ title, ncol = 2) + 
    # add labs for x and y
    labs(x = "Proximity to line == Words in agreement",
                y = "Drawing Scientific Comics",
                title = "Comparing Word Frequency in PLoS 10 Simple Rules")

plot of chunk 04-abline_prop_wordsXtitle

How can I interpret this graph?

According to Silge & Robinson in Text Mining in R,

“Words that are close to the line in these plots have similar frequencies in both sets of…(articles)”

and

“Words that are far from the line are words that are found more in one set of (articles) than another.”

I've added the red and blue coloring to show the low and high thresholds with scale_color_gradient(limits = c(0.0, 0.01)). In Drawing Scientific Comics and Effective Statistical Practice, simple occurs with high frequency, and in Drawing Scientific Comics and Biologists Learning To Program, perfect occurs with low frequency.

Should I track down the rules that have these common words and see what they are? Yes.

8. BONUS! tidyr::extract()

Above I used the stringr package to convert the titles to 'title case' and remove the redundant language. That package can handle the most common string manipulations. But the tidyr package has a neat extract() function that comes in handy when you need to find regular expression matches and return them in a data frame.

Find rules with common words

So I have the rules vector in the Plos10 data frame, and I want to look in the Drawing Scientific Comics and Effective Statistical Practice for rules that have the word simple. Then I want to look in Drawing Scientific Comics and Biologists Learning To Program for rules that have the word perfect.

When I think about the steps in terms of the objects (or nouns), it looks like this:

  1. Look in Columns for “title” and “rules”
  2. Match on Rows that have “Drawing Scientific Comics” or “Effective Statistical Practice”
  3. Find Values with “simple” and “perfect”
  4. …?

I know I ultimately need to return two Columns (rules and title) with four Values (two containing simple and two containing perfect). When I map these objects to functions (or Verbs), I get the following:

  1. Look in Columns: dplyr::select()
  2. Match on Rows: dplyr::filter()
  3. Find Values: ???

I knew I needed to return the rules column, but I wanted it in a data frame (preferably with the title, too). Then I remembered tidyr::extract(), a neat function described here in R for Data Science that takes data in a tibble, matches a regular expression against a column, and then puts the parenthesized groups into newly created columns.

So now I had a complete pipeline to map my steps to functions.

  1. Look in Columns: dplyr::select()
  2. Match on Rows: dplyr::filter()
  3. Find Values: tidyr::extract(), return Columns: into = " "
  4. Return Values: dplyr::filter(!is.na(" "))

I remembered this function because it’s actually in the section on stringr, not tidyr.
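Here is a toy sketch of extract() on a made-up tibble, just to show the mechanics: the first parenthesized group in regex lands in the into column, and rows with no match get NA.

tibble::tibble(rules = c("Keep it Simple", "Test Everything")) %>% 
    tidyr::extract(rules, 
        into = "match", 
        regex = "(Simple|simple)", 
        remove = FALSE)
# row 1: match = "Simple"; row 2: match = NA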

Now I can use this pipeline to identify the two common simple rules:

Plos10 %>%
    dplyr::select(title,
        rules) %>% 
    dplyr::filter(title %in% "Drawing Scientific Comics" |
        title %in% "Effective Statistical Practice") %>% 
    tidyr::extract(rules, 
        into = "simple_string", 
        regex = "(Simple|simple)", 
        remove = FALSE) %>% 
    dplyr::filter(!is.na(simple_string)) %$% 
    base::unique(rules)
## [1] "Keep it Simple"         
## [2] "Comics should be simple"

I can repeat the same process but instead of extracting the simple rules, I will extract the perfect rules.

Plos10 %>%
    dplyr::select(title,
        pubyear,
        rules) %>% 
    dplyr::filter(title %in% "Drawing Scientific Comics" |
        title %in% "Biologists Learning To Program") %>% 
    tidyr::extract(rules, 
        into = "perfect_string", 
        regex = "(Perfect|perfect)", 
        remove = FALSE) %>% 
    dplyr::filter(!is.na(perfect_string)) %$% 
    base::unique(rules)
## [1] "Practice makes perfect"    
## [2] "Make it right, not perfect"

I think it’s safe to say the two rules from Effective Statistical Practice and Drawing Scientific Comics are pretty similar (or attempt to make a similar point): ‘Simplicity is a virtue.’

The rules from Biologists Learning To Program and Drawing Scientific Comics are less analogous. If the goal is to 'make it right, not perfect' (as noted in Drawing Scientific Comics), then should I 'practice making it right' (and not focus on perfection)? Obviously practice plays an important role in acquiring any skill, but I think there's something to be said for emphasizing getting things done right over getting them perfect.

Additional resources

Check out these resources for more on the topics covered here:

  1. R for Data Science
  2. Text Mining with R
  3. Tidyverse.org
  4. knitr
  5. wordcloud2
  6. wordcloud

 
