rstudio::conf() day 1 (resource dump)

I just finished my first day at rstudio::conf() and wanted to provide a list of references that came up throughout the day.

Books on teaching

These texts came up in the tidyverse train-the-trainer section today with Greg Wilson (check out his website here).

I mentioned Steven Pinker’s book The Sense of Style in the discussion on the curse of knowledge, and Git for Humans by David Demaree is great if you are ever sadistic enough to try and teach people how to use Git.

Visualization guide

This guide on selecting a chart was shown during the Cognitive Craft portion of the lecture. Another guide to check out is Visualizing the Uncertainty in Data by Nathan Yau.

Courses and workshops

Blogs to follow

I’m sure there are more I am forgetting, but I will add them in a future post.

Scraping wikipedia tables

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting it’s contents into a regular expression, then using that regular expression to match strings in a different table.

An EHR medications table

I was able to find an example medications table from data.world. The file is titled, hoyt/Medications/Prescriptions in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends 
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)

Load the EHR data below:

EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")  
EHRMedTable %>% glimpse(78)

Find a list of antibiotic medications on wikipedia

I googled “list of antibiotics” and found this wikipedia page. I’m only interested in the column titled, “Generic Name”, but I will download the entire table.

list_of_antibiotics

The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of wiki_html object is below.

wiki_html %>% str()
List of 2
$ node:
$ doc :
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html node, so I will use a function from the rvest package. The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables % rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
List of 3
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

relevant_tables % str()
 List of 2
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket sub-setting to explore this object and learn about it’s contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60)

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name "Aminoglycosides", "…
$ Brand names "Aminoglycosides", "…
$ Common uses[3] "Aminoglycosides", "…
$ Possible side effects[3] "Aminoglycosides", "…
$ Mechanism of action "Aminoglycosides", "…

Looks like this is the right table! I will assign it to a data frame

titled, “WikiAntiB” and check the base::names().

WikiAntiB % names()
 [1] "Generic name"             "Brand names"
[3] "Common uses[3]" "Possible side effects[3]"
[5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I’m wanting to split the Generic name column and take the first word (antibiotic) in the table.

But before I do that, I am going to give both tables snake_case variable names and reduce the EHRMedTable table to only id and med and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>% 
dplyr::select(
generic_name = Generic name,
brand_name = Brand names,
common_uses = Common uses[3],
poss_side_effects = Possible side effects[3],
mech_action = Mechanism of action)
WikiAntiB %>% dplyr::glimpse(60)

Observations: 176
Variables: 5
$ generic_name "Aminoglycosides", "Amikacin", …
$ brand_name "Aminoglycosides", "Amikin", "G…
$ common_uses "Aminoglycosides", "Infections …
$ poss_side_effects "Aminoglycosides", "Hearing los…
$ mech_action "Aminoglycosides", "Binding to …

Clean ERHMeds.

EHRMeds <- EHRMedTable %>%
dplyr::select(
id,
med = rxddrug)
remove missing
EHRMeds <- EHRMeds %>% filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)

Observations: 12,957
Variables: 2
$ id 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,…
$ med "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION…

Prepping strings to be used as regex

The information generic_name isn’t quite ready to do regex pattern match on. For example, the first five lines in the med column in EHRMeds look like this:

EHRMeds$med %>% dplyr::glimpse(80) 
chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" …

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds %     dplyr::mutate(med = stringr::str_to_lower(med)) EHRMeds$med %>% dplyr::glimpse(80) 

chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" …

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.

WikiAntiBGenName <- WikiAntiB %>% 
dplyr::select(generic_name) %>%
# select this column to split
tidyr::separate(col = generic_name,
# put medication in gen_name
into = c("gen_name", "etc"),
# separate them on the whitespace
sep = " ",
# but keep the original variable
remove = FALSE) %>%
# then take new gen_name (first med)
dplyr::select(gen_name) %>%
# get the distinct values
dplyr::distinct(gen_name) %>%
dplyr::mutate(gen_name =
# convert to lowercase
str_to_lower(gen_name),
# remove (bs)
gen_name = str_remove_all(gen_name,
pattern = "\(bs\)"),
# replace "/" w/ underscore
gen_name = str_replace_all(gen_name,
pattern = "\/",
replacement = "_"),
# replace "-" w/ underscore
gen_name = str_replace_all(gen_name,
pattern = "-",
replacement = "_")) %>%
# split the new column again, this time into 2 gen_names
# on the underscore we put there ^
tidyr::separate(col = gen_name,
into = c("gen_name1", "gen_name2"),
sep = "_",
remove = FALSE) %>%
# now get all gen_name meds into one column
tidyr::gather(key = gen_name_key,
value = gen_name_value,
gen_name1:gen_name2) %>%
# remove missing
dplyr::filter(!is.na(gen_name_value)) %>%
# and rename
dplyr::select(generic_name = gen_name_value)

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60) 

Observations: 161
Variables: 1
$ generic_name "aminoglycosides", "amikacin", "ge…

Check out the unique values of generic_name with base::unique(), utils::head(), and base::writeLines() because these cause R to print the output to the RStudio Notebooks in a useful way.

WikiAntiBGenName$generic_name %>%
base::unique() %>%
base::writeLines()
unique_generic_name_writeLines

Note the entry between cefazolin and cefalexin is empty–remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
dplyr::filter(generic_name != "")

Now I can put this into a vector so it can be converted into a regular expression. The stringrr::str_c() function and regular expression symbols (+, ? , |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRmeds in this case), and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect().

check rows so I know I'm not fooling myself
EHRMeds %>%
dplyr::distinct(med) %>%
base::nrow() # [1] 701
EEHRMeds %>%
dplyr::filter(stringr::str_detect(string = med,
pattern = antibiotic_med)) %>%
dplyr::distinct(med) %>%
base::nrow() # [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
dplyr::filter(stringr::str_detect(med,
antibiotic_med)) %>%
dplyr::select(id,
antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>% base::unique() 

[1] "rifaximin"
[2] "amoxicillin"
[3] "hydrocortisone; neomycin; polymyxin b otic"
[4] "trimethoprim"
[5] "cefdinir"
[6] "clindamycin"
[7] "azithromycin"
[8] "sulfamethoxazole"
[9] "ethambutol"
[10] "pyrazinamide"
[11] "minocycline"
[12] "sulfamethoxazole; trimethoprim"
[13] "cefixime"
[14] "polymyxin b; trimethoprim ophthalmic"
[15] "dexamethasone; tobramycin ophthalmic"
[16] "cefuroxime"
[17] "doxycycline"
[18] "amoxicillin; clavulanate"
[19] "erythromycin topical"
[20] "ciprofloxacin"
[21] "vancomycin"
[22] "penicillin v potassium"
[23] "silver sulfadiazine topical"
[24] "penicillin"
[25] "moxifloxacin ophthalmic"
[26] "gatifloxacin ophthalmic"
[27] "metronidazole"
[28] "ciprofloxacin; dexamethasone otic"
[29] "erythromycin ophthalmic"
[30] "gentamicin ophthalmic"
[31] "azithromycin ophthalmic"
[32] "tetracycline"
[33] "ofloxacin ophthalmic"
[34] "ciprofloxacin ophthalmic"
[35] "dexamethasone; neomycin; polymyxin b ophthalmic"
[36] "chloramphenicol"
[37] "mupirocin topical"
[38] "isoniazid"
[39] "levofloxacin"
[40] "nitrofurantoin"
[41] "moxifloxacin"
[42] "benzoyl peroxide; clindamycin topical"
[43] "sulfacetamide sodium ophthalmic"
[44] "neomycin; polymyxin b sulfate topical"
[45] "sulfasalazine"
[46] "metronidazole topical"
[47] "clarithromycin"
[48] "cefprozil"
[49] "clindamycin topical"
[50] "polymyxin b sulfate"
[51] "ofloxacin otic"
[52] "tobramycin ophthalmic"
[53] "dapsone topical"

Join tables back together

If I want to join the antibiotic medication used by the patient (identified with id) I can join this back to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
Observations: 18,704
Variables: 9
$ id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ seqn 62161, 62161, 62162, 62163, 62164, 62…
$ rxduse 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ rxddrug "FLUOXETINE", "METHYLPHENIDATE", NA, …
$ rxddrgid "d00236", "d00900", NA, NA, "d00181",…
$ rxqseen 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N…
$ rxddays 5840, 5840, NA, NA, 9125, NA, 547, NA…
$ rxdcount 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N…
$ antib_med NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

And now I can see the counts of the top ten antibiotic medications in EHRMedTable

EHRMedTable %>%
dplyr::filter(!is.na(antib_med)) %>%
dplyr::count(antib_med) %>%
dplyr::distinct(id, .keep_all = TRUE) %>%
dplyr::arrange(desc(n)) %>%
utils::head(10)
# A tibble: 10 x 2
antib_med n

1 amoxicillin 126
2 azithromycin 40
3 sulfamethoxazole; trimethoprim 33
4 doxycycline 26
5 amoxicillin; clavulanate 23
6 minocycline 16
7 cefdinir 14
8 ciprofloxacin 13
9 penicillin v potassium 13
10 nitrofurantoin 11

BONUS! Using stringr::str_detect() within a dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top ten table print-out above, you might want a variable that indicates there are more than 1 medications listed in the antib_med column. Well fortunately dplyr::case_when() works well with stingr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

test -----
EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_)) %>%
dplyr::count(antib_2meds, antib_med) %>%
tidyr::spread(antib_2meds, n) %>%
head(10)
# A tibble: 10 x 4
antib_med 1 antibiotic med.2 antibiotic me… <NA>

1 amoxicillin 126 NA NA
2 amoxicillin; clavulanate NA 23 NA
3 azithromycin 40 NA NA
4 azithromycin ophthalmic 2 NA NA
5 benzoyl peroxide; clindamy… NA 4 NA
6 cefdinir 14 NA NA
7 cefixime 2 NA NA
8 cefprozil 1 NA NA
9 cefuroxime 3 NA NA
10 chloramphenicol 1 NA NA
assign -----
EHRMedTable <- EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could be used if you had lists stored in other files, too (Excel files, googlesheets, etc.)

Be sure to export the data files!

if (!file.exists("Data/processed_data")) {
dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
"EHRMedTable",
as.character(Sys.Date()),
".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text ” Teaching Statistics: A Bag of Tricks ” from Andrew Gelman and Deborah Nolan.

The following quote is from a section titled, “learn how to learn new technologies”,

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added efficiently because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here

Tools & Resources for Learning Stata

The Resources for learning Stata page has most of the sites I describe below. Unfortunately, their list is also riddled with link rot, and many of the resources use ancient versions/commands. I’ve only included the active sites that I’ve actually used for analysis/projects.

*I’ll continue updating as I find more resources. 

My first post on this blog was about tools and resources for learning R, but I felt I should also include a list of Stata sites. 

UNC Carolina Population Center

This tutorial is centered around survey research and provides a helpful orientation to basic Stata commands. Also includes sample datasets.

INDRE at UCLA

The Institute for Digital Research and Education at University of California, Los Angeles, has a repository of your basic examples and tips. The Stata starter kit isn’t a bad place to start.

Germán Rodríguez from Princeton University

This tutorial is another crash course for anyone wanting a general understanding of Stata. He also covers some intermediate/advanced topics like macros and looping.

University of Wisconson-Madison

If there were an award for Stata window screenshots, this tutorial would win by a longshot. Unfortunately, the images are from Stata 10 or 11 (PC). Still very comprehensive with quite an extensive graphics section.

Vanderbilt Biostats Class Lecture Notes

These slides and examples are great if you’re looking for information on a particular statistical model. They have a great Multiple Poisson Regression model and an excellent overview of more complicated techniques like Neural Nets and Classification and Regression Trees.

Quick Q&A on Stata from Indiana Univeristy

The IU Knowledge Base has about 20 frequently asked questions on Stata use that I have referred to more than once. Some are general, others more esoteric. But their explanations are concise, and that’s a major plus in my book.

The Stata blog (and youtube channel)

The Stata blog has some useful information (see this example on effect sizes), as does their YouTube channel. I found the video lengths to be manageable and usually cover the topics well. The reference material from Stata Press is usually overkill but has some example datasets to work through.

Geocenter (best for last!)

I stumbled across a fantastic Stata tutorial on GitHub. The lessons include slides and homework problems (yay homework!) with solutions. I recommend this for all Stata users because it shows how using version control (like GitHub) can be implemented with the Stata .do files.

They’ve also created Stata cheat sheets that bear a striking resemblance to the RStudio cheat sheets. Well played.