Scraping Wikipedia tables

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting its contents into a regular expression, then using that regular expression to match strings in a different table.

An EHR medications table

I was able to find an example medications table from data.world. The file is titled hoyt/Medications/Prescriptions in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends 
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)

Load the EHR data below:

EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")  
EHRMedTable %>% glimpse(78)

Find a list of antibiotic medications on Wikipedia

I googled “list of antibiotics” and found this Wikipedia page. I’m only interested in the column titled “Generic name”, but I will download the entire table.


The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of wiki_html object is below.

wiki_html %>% str()
List of 2
$ node:
$ doc :
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html nodes containing tables, so I will use a function from the rvest package with the css argument set to "table". Once again I check the structure of the output object.

wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
List of 3
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.
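As a quick sketch with made-up strings, base::grep() returns the positions of the elements that match a pattern, and those positions can be used with bracket sub-setting:

```r
# toy vector (made-up strings, not from the data)
fruits <- c("apple pie", "banana", "apple tart")
# grep() returns the positions of the matching elements
base::grep("apple", fruits)
# [1] 1 3
# those positions can subset the original vector
fruits[base::grep("apple", fruits)]
# [1] "apple pie"  "apple tart"
```

The same pattern-then-subset move works on the xml_nodeset below.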

Get the relevant tables from the xml_nodeset in wiki_html_tables.

relevant_tables <- wiki_html_tables[base::grep("Generic name", wiki_html_tables)]
relevant_tables %>% str()
 List of 2
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket sub-setting to explore this object and learn about its contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60) to view the result.

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name             <chr> "Aminoglycosides", "…
$ Brand names              <chr> "Aminoglycosides", "…
$ Common uses[3]           <chr> "Aminoglycosides", "…
$ Possible side effects[3] <chr> "Aminoglycosides", "…
$ Mechanism of action      <chr> "Aminoglycosides", "…

Looks like this is the right table! I will assign it to a data frame titled "WikiAntiB" and check the base::names().

WikiAntiB <- rvest::html_table(relevant_tables[[1]], fill = TRUE)
WikiAntiB %>% names()
[1] "Generic name"             "Brand names"
[3] "Common uses[3]"           "Possible side effects[3]"
[5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I want to split the Generic name column and take the first word (the antibiotic) from each entry.

But before I do that, I am going to give both tables snake_case variable names and reduce the EHRMedTable table to only id and med and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>% 
    dplyr::select(
        generic_name = `Generic name`,
        brand_name = `Brand names`,
        common_uses = `Common uses[3]`,
        poss_side_effects = `Possible side effects[3]`,
        mech_action = `Mechanism of action`)
WikiAntiB %>% dplyr::glimpse(60)

Observations: 176
Variables: 5
$ generic_name      <chr> "Aminoglycosides", "Amikacin", …
$ brand_name        <chr> "Aminoglycosides", "Amikin", "G…
$ common_uses       <chr> "Aminoglycosides", "Infections …
$ poss_side_effects <chr> "Aminoglycosides", "Hearing los…
$ mech_action       <chr> "Aminoglycosides", "Binding to …

Clean EHRMeds.

EHRMeds <- EHRMedTable %>%
    dplyr::select(
        id,
        med = rxddrug)
# remove missing
EHRMeds <- EHRMeds %>% dplyr::filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)

Observations: 12,957
Variables: 2
$ id  <dbl> 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,…
$ med <chr> "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION…

Prepping strings to be used as regex

The information in generic_name isn’t quite ready to be used for regex pattern matching, and neither is the med column in EHRMeds. For example, the first few values in med look like this:

EHRMeds$med %>% dplyr::glimpse(80) 
chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" …

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds <- EHRMeds %>%
    dplyr::mutate(med = stringr::str_to_lower(med))
EHRMeds$med %>% dplyr::glimpse(80)

chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" …

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.
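Before the full pipeline, here is a toy sketch (with made-up generic names, not the scraped data) of the tidyr::separate() step that does the heavy lifting:

```r
# made-up generic names; fill = "right" pads entries with no second word
tibble::tibble(generic_name = c("Amikacin (Bs)", "Ertapenem")) %>%
    tidyr::separate(col = generic_name,
                    into = c("gen_name", "etc"),
                    sep = " ",
                    remove = FALSE,
                    fill = "right")
# gen_name holds the first word; etc holds the remainder (NA if none)
```

The real pipeline below applies the same split, then cleans and reshapes the result.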

WikiAntiBGenName <- WikiAntiB %>% 
    dplyr::select(generic_name) %>%
    # select this column to split
    tidyr::separate(col = generic_name,
                    # put medication in gen_name
                    into = c("gen_name", "etc"),
                    # separate them on the whitespace
                    sep = " ",
                    # but keep the original variable
                    remove = FALSE) %>%
    # then take new gen_name (first med)
    dplyr::select(gen_name) %>%
    # get the distinct values
    dplyr::distinct(gen_name) %>%
    dplyr::mutate(gen_name =
                      # convert to lowercase
                      str_to_lower(gen_name),
                  # remove (bs)
                  gen_name = str_remove_all(gen_name,
                                            pattern = "\\(bs\\)"),
                  # replace "/" w/ underscore
                  gen_name = str_replace_all(gen_name,
                                             pattern = "/",
                                             replacement = "_"),
                  # replace "-" w/ underscore
                  gen_name = str_replace_all(gen_name,
                                             pattern = "-",
                                             replacement = "_")) %>%
    # split the new column again, this time into 2 gen_names
    # on the underscore we put there ^
    tidyr::separate(col = gen_name,
                    into = c("gen_name1", "gen_name2"),
                    sep = "_",
                    remove = FALSE) %>%
    # now get all gen_name meds into one column
    tidyr::gather(key = gen_name_key,
                  value = gen_name_value,
                  gen_name1:gen_name2) %>%
    # remove missing
    dplyr::filter(!is.na(gen_name_value)) %>%
    # and rename
    dplyr::select(generic_name = gen_name_value)

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60) 

Observations: 161
Variables: 1
$ generic_name <chr> "aminoglycosides", "amikacin", "ge…

Check out the unique values of generic_name with base::unique() and base::writeLines(), because these cause R to print the output in RStudio Notebooks in a useful way.

WikiAntiBGenName$generic_name %>%
    base::unique() %>%
    base::writeLines()

Note the entry between cefazolin and cefalexin is empty. Remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
    dplyr::filter(generic_name != "")

Now I can put this into a vector so it can be converted into a regular expression. The stringr::str_c() function and regular expression symbols (+, ?, |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
# collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")
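To see the shape of the resulting pattern, here is a minimal sketch with two made-up medication names (the real antibiotic_med vector is much longer):

```r
# collapse a short, made-up vector the same way as above
meds <- c("amoxicillin", "doxycycline")
stringr::str_c(meds, collapse = "+?|") %>%
    base::paste0("+?")
# [1] "amoxicillin+?|doxycycline+?"
```

Each generic name becomes one alternative in a long "name1+?|name2+?|…" pattern, so a match on any single antibiotic counts as a match.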

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRMeds in this case) and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.
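Here is a toy sketch of that behavior, using hypothetical medication names and a hand-written two-drug pattern:

```r
# toy data frame with hypothetical meds (not the EHR data)
meds <- tibble::tibble(med = c("amoxicillin", "ibuprofen", "doxycycline hyclate"))
# str_detect() returns TRUE/FALSE per row; filter() keeps the TRUEs
meds %>%
    dplyr::filter(stringr::str_detect(string = med,
                                      pattern = "amoxicillin+?|doxycycline+?"))
# keeps the amoxicillin and doxycycline rows; ibuprofen is dropped
```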

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect()).

# check rows so I know I'm not fooling myself
EHRMeds %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 701
EHRMeds %>%
    dplyr::filter(stringr::str_detect(string = med,
                                      pattern = antibiotic_med)) %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

# now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
    dplyr::filter(stringr::str_detect(med,
                                      antibiotic_med)) %>%
    dplyr::select(id,
                  antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>% base::unique() 

[1] "rifaximin"
[2] "amoxicillin"
[3] "hydrocortisone; neomycin; polymyxin b otic"
[4] "trimethoprim"
[5] "cefdinir"
[6] "clindamycin"
[7] "azithromycin"
[8] "sulfamethoxazole"
[9] "ethambutol"
[10] "pyrazinamide"
[11] "minocycline"
[12] "sulfamethoxazole; trimethoprim"
[13] "cefixime"
[14] "polymyxin b; trimethoprim ophthalmic"
[15] "dexamethasone; tobramycin ophthalmic"
[16] "cefuroxime"
[17] "doxycycline"
[18] "amoxicillin; clavulanate"
[19] "erythromycin topical"
[20] "ciprofloxacin"
[21] "vancomycin"
[22] "penicillin v potassium"
[23] "silver sulfadiazine topical"
[24] "penicillin"
[25] "moxifloxacin ophthalmic"
[26] "gatifloxacin ophthalmic"
[27] "metronidazole"
[28] "ciprofloxacin; dexamethasone otic"
[29] "erythromycin ophthalmic"
[30] "gentamicin ophthalmic"
[31] "azithromycin ophthalmic"
[32] "tetracycline"
[33] "ofloxacin ophthalmic"
[34] "ciprofloxacin ophthalmic"
[35] "dexamethasone; neomycin; polymyxin b ophthalmic"
[36] "chloramphenicol"
[37] "mupirocin topical"
[38] "isoniazid"
[39] "levofloxacin"
[40] "nitrofurantoin"
[41] "moxifloxacin"
[42] "benzoyl peroxide; clindamycin topical"
[43] "sulfacetamide sodium ophthalmic"
[44] "neomycin; polymyxin b sulfate topical"
[45] "sulfasalazine"
[46] "metronidazole topical"
[47] "clarithromycin"
[48] "cefprozil"
[49] "clindamycin topical"
[50] "polymyxin b sulfate"
[51] "ofloxacin otic"
[52] "tobramycin ophthalmic"
[53] "dapsone topical"

Join tables back together

If I want to join the antibiotic medication used by each patient (identified with id) back to the full data, I can join EHRAntiBMeds to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
    dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
Observations: 18,704
Variables: 9
$ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ seqn      <dbl> 62161, 62161, 62162, 62163, 62164, 62…
$ rxduse    <dbl> 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ rxddrug   <chr> "FLUOXETINE", "METHYLPHENIDATE", NA, …
$ rxddrgid  <chr> "d00236", "d00900", NA, NA, "d00181",…
$ rxqseen   <dbl> 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N…
$ rxddays   <dbl> 5840, 5840, NA, NA, 9125, NA, 547, NA…
$ rxdcount  <dbl> 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N…
$ antib_med <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

And now I can see the counts of the top ten antibiotic medications in EHRMedTable.

EHRMedTable %>%
    dplyr::filter(!is.na(antib_med)) %>%
    dplyr::count(antib_med) %>%
    dplyr::arrange(desc(n)) %>%
    utils::head(10)
# A tibble: 10 x 2
   antib_med                          n
   <chr>                          <int>
1 amoxicillin 126
2 azithromycin 40
3 sulfamethoxazole; trimethoprim 33
4 doxycycline 26
5 amoxicillin; clavulanate 23
6 minocycline 16
7 cefdinir 14
8 ciprofloxacin 13
9 penicillin v potassium 13
10 nitrofurantoin 11

BONUS! Using stringr::str_detect() within a dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top ten table print-out above, you might want a variable that indicates there is more than one medication listed in the antib_med column. Fortunately, dplyr::case_when() works well with stringr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

# test -----
EHRMedTable %>%
    dplyr::mutate(antib_2meds = dplyr::case_when(
        stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
        !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
        TRUE ~ NA_character_)) %>%
    dplyr::count(antib_2meds, antib_med) %>%
    tidyr::spread(antib_2meds, n) %>%
    head(10)
# A tibble: 10 x 4
   antib_med                   `1 antibiotic me… `2 antibiotic me… `<NA>`
   <chr>                                   <int>             <int>  <int>
1 amoxicillin 126 NA NA
2 amoxicillin; clavulanate NA 23 NA
3 azithromycin 40 NA NA
4 azithromycin ophthalmic 2 NA NA
5 benzoyl peroxide; clindamy… NA 4 NA
6 cefdinir 14 NA NA
7 cefixime 2 NA NA
8 cefprozil 1 NA NA
9 cefuroxime 3 NA NA
10 chloramphenicol 1 NA NA
# assign -----
EHRMedTable <- EHRMedTable %>%
    dplyr::mutate(antib_2meds = dplyr::case_when(
        stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
        !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
        TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could be used if you had lists stored in other files, too (Excel files, googlesheets, etc.).

Be sure to export the data files!

if (!file.exists("Data/processed_data")) {
    dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
                              "EHRMedTable",
                              as.character(Sys.Date()),
                              ".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text “Teaching Statistics: A Bag of Tricks” by Andrew Gelman and Deborah Nolan.

The following quote is from a section titled “learn how to learn new technologies”:

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added ‘efficiently’ because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here
