I mentioned Steven Pinker’s book The Sense of Style in the discussion on the curse of knowledge, and Git for Humans by David Demaree is great if you are ever sadistic enough to try to teach people how to use Git.
I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.
This tutorial will walk you through scraping a table on Wikipedia, converting its contents into a regular expression, and then using that regular expression to match strings in a different table.
An EHR medications table
I was able to find an example medications table on data.world. The file is titled hoyt/Medications/Prescriptions, in the “LibreHealth Educational EHR Project (LEEP)” project.
First load the packages:
# load tidyverse and friends
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)
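Before exploring anything, I need two objects: the EHR medications table from data.world (EHRMedTable) and the scraped Wikipedia HTML tables (wiki_html_tables). The sketch below shows roughly how I build them; the file path, the Wikipedia URL, and the "table" selector are assumptions, so point them at your own download and the page you actually want to scrape.

# a minimal sketch -- the file path and URL below are assumptions
# import the data.world prescriptions export
EHRMedTable <- readr::read_csv("Data/Prescriptions.csv")

# read the Wikipedia page and grab every <table> node on it
wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")
wiki_html_tables <- wiki_html %>% rvest::html_nodes("table")

# what did we get back?
wiki_html_tables %>% str()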
List of 3
 $ :List of 2
  ..$ node:<externalptr>
  ..$ doc :<externalptr>
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr>
  ..$ doc :<externalptr>
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr>
  ..$ doc :<externalptr>
  ..- attr(*, "class")= chr "xml_node"
 - attr(*, "class")= chr "xml_nodeset"
This is a list of three lists, each of them xml_nodes.
Use grep to find relevant tables
In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with subsetting to extract the relevant_tables from wiki_html_tables.
Get the relevant tables from the xml_nodeset in wiki_html_tables.
# NOTE: the grep pattern is an assumption -- any text unique to the
# antibiotic tables (like the "Generic name" header) works here
relevant_tables <- wiki_html_tables[base::grep("Generic name", wiki_html_tables)]
relevant_tables %>% str()
List of 2
 $ :List of 2
  ..$ node:<externalptr>
  ..$ doc :<externalptr>
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr>
  ..$ doc :<externalptr>
  ..- attr(*, "class")= chr "xml_node"
 - attr(*, "class")= chr "xml_nodeset"
This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket subsetting to explore this object and learn about its contents. I will start with the first position ([[1]]) and set fill = TRUE. I also use dplyr::glimpse(60) so the output fits in the console.
rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
  dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name          <chr> "Aminoglycosides", "…
$ Brand names           <chr> "Aminoglycosides", "…
$ Common uses           <chr> "Aminoglycosides", "…
$ Possible side effects <chr> "Aminoglycosides", "…
$ Mechanism of action   <chr> "Aminoglycosides", "…
Looks like this is the right table! I will assign it to a data frame titled WikiAntiB and check the base::names().
WikiAntiB <- rvest::html_table(relevant_tables[[1]], fill = TRUE)
WikiAntiB %>% names()
 "Generic name" "Brand names"  "Common uses" "Possible side effects"  "Mechanism of action"
The WikiAntiB table has all the antibiotics in the Wikipedia table. I want to split the Generic name column and take the first word (the antibiotic) from each entry.
But before I do that, I am going to give both tables snake_case variable names, reduce the EHRMedTable table to only id and med, and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds (a sketch of that step follows the WikiAntiB code below).
WikiAntiB <- WikiAntiB %>%
  dplyr::select(
    generic_name = `Generic name`,
    brand_name = `Brand names`,
    common_uses = `Common uses`,
    poss_side_effects = `Possible side effects`,
    mech_action = `Mechanism of action`)
WikiAntiB %>% dplyr::glimpse(60)
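The EHRMeds step looks roughly like the sketch below. The raw column names in EHRMedTable are assumptions here, so run base::names(EHRMedTable) first and substitute the real ones; I also lowercase med so it lines up with the lowercased pattern built later.

# reduce EHRMedTable to an id and a med column and drop missing meds
# NOTE: "ID" and "MEDICATION" are assumed raw column names
EHRMeds <- EHRMedTable %>%
  dplyr::select(id = ID,
                med = MEDICATION) %>%
  # lowercase so it matches the lowercased antibiotic names
  dplyr::mutate(med = stringr::str_to_lower(med)) %>%
  # remove the missing meds
  dplyr::filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)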
Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.
WikiAntiBGenName <- WikiAntiB %>%
  dplyr::select(generic_name) %>%
  # select this column to split
  tidyr::separate(col = generic_name,
                  # put medication in gen_name
                  into = c("gen_name", "etc"),
                  # separate them on the whitespace
                  sep = " ",
                  # but keep the original variable
                  remove = FALSE) %>%
  # then take new gen_name (first med)
  dplyr::select(gen_name) %>%
  # get the distinct values
  dplyr::distinct(gen_name) %>%
  dplyr::mutate(gen_name =
                  # convert to lowercase
                  str_to_lower(gen_name),
                # remove (bs)
                gen_name = str_remove_all(gen_name,
                                          pattern = "\\(bs\\)"),
                # replace "/" w/ underscore
                gen_name = str_replace_all(gen_name,
                                           pattern = "/",
                                           replacement = "_"),
                # replace "-" w/ underscore
                gen_name = str_replace_all(gen_name,
                                           pattern = "-",
                                           replacement = "_")) %>%
  # split the new column again, this time into 2 gen_names
  # on the underscore we put there ^
  tidyr::separate(col = gen_name,
                  into = c("gen_name1", "gen_name2"),
                  sep = "_",
                  remove = FALSE) %>%
  # now get all gen_name meds into one column
  tidyr::gather(key = gen_name_key,
                value = gen_name_value,
                gen_name1:gen_name2) %>%
  # remove missing
  dplyr::filter(!is.na(gen_name_value)) %>%
  # and rename
  dplyr::select(generic_name = gen_name_value)
Now I can put this into a vector so it can be converted into a regular expression. The stringr::str_c() function and regular expression symbols (+, ?, |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.
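Here is a sketch of that step: pull the distinct generic names out of WikiAntiBGenName and collapse them into a single alternation pattern with |, stored as antibiotic_med (the object the filter below expects).

# collapse the distinct generic names into one regular expression,
# e.g. "amoxicillin|azithromycin|cefdinir|..."
antibiotic_med <- WikiAntiBGenName %>%
  dplyr::distinct(generic_name) %>%
  dplyr::pull(generic_name) %>%
  stringr::str_c(collapse = "|")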
Searching for string patterns with stringr::str_detect()
The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRMeds in this case) and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.
First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect()) (a sketch of that match follows the row check below).
# check rows so I know I'm not fooling myself
EHRMeds %>%
  dplyr::distinct(med) %>%
  base::nrow()
# [1] 701
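The match itself looks roughly like the sketch below. EHRAntiBMeds is an assumed object name for the matched rows; antib_med is just med renamed so the result reads as an antibiotics table.

# keep only the meds that match the antibiotic pattern
# NOTE: EHRAntiBMeds is an assumed object name
EHRAntiBMeds <- EHRMeds %>%
  dplyr::filter(stringr::str_detect(string = med,
                                    pattern = antibiotic_med)) %>%
  dplyr::rename(antib_med = med)

# how many distinct antibiotic meds did the pattern catch?
EHRAntiBMeds %>%
  dplyr::distinct(antib_med) %>%
  base::nrow()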
BONUS! Using stringr::str_detect() within a dplyr::case_when()
If you noticed the sulfamethoxazole; trimethoprim entry in the top-ten table print-out above, you might want a variable that indicates when more than one medication is listed in the antib_med column. Fortunately, dplyr::case_when() works well with stringr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.
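A sketch of that step, carrying over the EHRAntiBMeds object from the sketch above and using the semicolon separator (as in sulfamethoxazole; trimethoprim) as the signal that two medications are listed; str_detect() returns a logical, which is exactly what case_when() wants on the left-hand side.

# flag entries that list more than one medication (they contain a ";")
EHRAntiBMeds <- EHRAntiBMeds %>%
  dplyr::mutate(antib_2meds = dplyr::case_when(
    stringr::str_detect(antib_med, pattern = ";") ~ "2 antibiotic meds",
    !stringr::str_detect(antib_med, pattern = ";") ~ "1 antibiotic med"))

# quick check: count each antibiotic by the new flag and spread it wide
EHRAntiBMeds %>%
  dplyr::count(antib_med, antib_2meds) %>%
  tidyr::spread(key = antib_2meds, value = n) %>%
  utils::head(10)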
# A tibble: 10 x 4
   antib_med                   `1 antibiotic med` `2 antibiotic me…` `<NA>`
 1 amoxicillin                                126                 NA     NA
 2 amoxicillin; clavulanate                    NA                 23     NA
 3 azithromycin                                40                 NA     NA
 4 azithromycin ophthalmic                      2                 NA     NA
 5 benzoyl peroxide; clindamy…                 NA                  4     NA
 6 cefdinir                                    14                 NA     NA
 7 cefixime                                     2                 NA     NA
 8 cefprozil                                    1                 NA     NA
 9 cefuroxime                                   3                 NA     NA
10 chloramphenicol                              1                 NA     NA
The following quote is from a section titled “learn how to learn new technologies”:
In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.
I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added efficiently because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.
I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.
The Resources for learning Stata page has most of the sites I describe below. Unfortunately, their list is also riddled with link rot, and many of the resources use ancient versions/commands. I’ve only included the active sites that I’ve actually used for analysis/projects.
I’ll continue updating this list as I find more resources.
My first post on this blog was about tools and resources for learning R, but I felt I should also include a list of Stata sites.
If there were an award for Stata window screenshots, this tutorial would win by a long shot. Unfortunately, the images are from Stata 10 or 11 (PC). It is still very comprehensive, with quite an extensive graphics section.
These slides and examples are great if you’re looking for information on a particular statistical model. They have a great Multiple Poisson Regression model and an excellent overview of more complicated techniques like Neural Nets and Classification and Regression Trees.
The IU Knowledge Base has about 20 frequently asked questions on Stata use that I have referred to more than once. Some are general, others more esoteric. But their explanations are concise, and that’s a major plus in my book.
The Stata blog (and YouTube channel)
The Stata blog has some useful information (see this example on effect sizes), as does their YouTube channel. The videos are a manageable length and usually cover the topics well. The reference material from Stata Press is usually overkill, but it has some example datasets to work through.
I stumbled across a fantastic Stata tutorial on GitHub. The lessons include slides and homework problems (yay homework!) with solutions. I recommend this for all Stata users because it shows how version control (like GitHub) can be used with Stata .do files.
They’ve also created Stata cheat sheets that bear a striking resemblance to the RStudio cheat sheets. Well played.