Scraping Wikipedia tables

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting its contents into a regular expression, then using that regular expression to match strings in a different table.

An EHR medications table

I was able to find an example medications table from data.world. The file is titled hoyt/Medications/Prescriptions in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends 
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)

Load the EHR data below:

EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")  
EHRMedTable %>% glimpse(78)

Find a list of antibiotic medications on Wikipedia

I googled “list of antibiotics” and found this Wikipedia page. I’m only interested in the column titled “Generic name”, but I will download the entire table.


The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of the wiki_html object is below.

wiki_html %>% str()
List of 2
$ node:
$ doc :
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html node, so I will use a function from the rvest package. The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
List of 3
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

# subset the nodes whose html mentions "Generic name"
relevant_tables <- wiki_html_tables[base::grep("Generic name", wiki_html_tables)]
relevant_tables %>% str()
 List of 2
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket sub-setting to explore this object and learn about its contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60) to view the result.

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name             <chr> "Aminoglycosides", "…
$ Brand names              <chr> "Aminoglycosides", "…
$ Common uses[3]           <chr> "Aminoglycosides", "…
$ Possible side effects[3] <chr> "Aminoglycosides", "…
$ Mechanism of action      <chr> "Aminoglycosides", "…

Looks like this is the right table! I will assign it to a data frame titled “WikiAntiB” and check the base::names().

WikiAntiB <- rvest::html_table(relevant_tables[[1]], fill = TRUE)
WikiAntiB %>% base::names()
 [1] "Generic name"             "Brand names"
[3] "Common uses[3]" "Possible side effects[3]"
[5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I want to split the Generic name column and take the first word (the antibiotic) from each entry.

But before I do that, I am going to give both tables snake_case variable names and reduce the EHRMedTable table to only id and med and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>% 
    dplyr::select(
        generic_name = `Generic name`,
        brand_name = `Brand names`,
        common_uses = `Common uses[3]`,
        poss_side_effects = `Possible side effects[3]`,
        mech_action = `Mechanism of action`)
WikiAntiB %>% dplyr::glimpse(60)

Observations: 176
Variables: 5
$ generic_name      <chr> "Aminoglycosides", "Amikacin", …
$ brand_name        <chr> "Aminoglycosides", "Amikin", "G…
$ common_uses       <chr> "Aminoglycosides", "Infections …
$ poss_side_effects <chr> "Aminoglycosides", "Hearing los…
$ mech_action       <chr> "Aminoglycosides", "Binding to …

Clean EHRMeds.

EHRMeds <- EHRMedTable %>%
    dplyr::select(
        id,
        med = rxddrug)
# remove missing
EHRMeds <- EHRMeds %>% dplyr::filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)

Observations: 12,957
Variables: 2
$ id 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,…
$ med "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION…

Prepping strings to be used as regex

The information in generic_name isn’t quite ready for regex pattern matching. For example, the first few entries in the med column in EHRMeds look like this:

EHRMeds$med %>% dplyr::glimpse(80) 
chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" …

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds <- EHRMeds %>%
    dplyr::mutate(med = stringr::str_to_lower(med))
EHRMeds$med %>% dplyr::glimpse(80)

chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" …

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.

WikiAntiBGenName <- WikiAntiB %>% 
    dplyr::select(generic_name) %>%
    # select this column to split
    tidyr::separate(col = generic_name,
                    # put medication in gen_name
                    into = c("gen_name", "etc"),
                    # separate them on the whitespace
                    sep = " ",
                    # but keep the original variable
                    remove = FALSE) %>%
    # then take new gen_name (first med)
    dplyr::select(gen_name) %>%
    # get the distinct values
    dplyr::distinct(gen_name) %>%
    dplyr::mutate(gen_name =
                      # convert to lowercase
                      str_to_lower(gen_name),
                  # remove (bs)
                  gen_name = str_remove_all(gen_name,
                                            pattern = "\\(bs\\)"),
                  # replace "/" w/ underscore
                  gen_name = str_replace_all(gen_name,
                                             pattern = "/",
                                             replacement = "_"),
                  # replace "-" w/ underscore
                  gen_name = str_replace_all(gen_name,
                                             pattern = "-",
                                             replacement = "_")) %>%
    # split the new column again, this time into 2 gen_names
    # on the underscore we put there ^
    tidyr::separate(col = gen_name,
                    into = c("gen_name1", "gen_name2"),
                    sep = "_",
                    remove = FALSE) %>%
    # now get all gen_name meds into one column
    tidyr::gather(key = gen_name_key,
                  value = gen_name_value,
                  gen_name1:gen_name2) %>%
    # remove missing
    dplyr::filter(!is.na(gen_name_value)) %>%
    # and rename
    dplyr::select(generic_name = gen_name_value)

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60) 

Observations: 161
Variables: 1
$ generic_name <chr> "aminoglycosides", "amikacin", "ge…

Check out the unique values of generic_name with base::unique(), utils::head(), and base::writeLines(), because these print the output in the RStudio Notebook in a useful way.

WikiAntiBGenName$generic_name %>%
base::unique() %>%
base::writeLines()

Note the entry between cefazolin and cefalexin is empty–remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
dplyr::filter(generic_name != "")

Now I can put this into a vector so it can be converted into a regular expression. The stringr::str_c() function and regular expression symbols (+, ?, |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
# collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")
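To make the collapse concrete, here is a small sketch with two made-up generic names (toy_meds and toy_pattern are hypothetical objects, not part of the pipeline above):

toy_meds <- c("amoxicillin", "doxycycline")
toy_pattern <- stringr::str_c(toy_meds, collapse = "+?|")
toy_pattern <- base::paste0(toy_pattern, "+?")
toy_pattern
[1] "amoxicillin+?|doxycycline+?"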

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRMeds in this case), and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.
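As a quick illustration (the med strings and the tiny two-drug pattern below are made up for the example), stringr::str_detect() returns a logical vector with one element per input string:

stringr::str_detect(string = c("amoxicillin 500 mg", "ibuprofen 200 mg"),
                    pattern = "amoxicillin+?|doxycycline+?")
[1]  TRUE FALSE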

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter() and stringr::str_detect().

# check rows so I know I'm not fooling myself
EHRMeds %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 701
EHRMeds %>%
    dplyr::filter(stringr::str_detect(string = med,
                                      pattern = antibiotic_med)) %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

# now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
    dplyr::filter(stringr::str_detect(med,
                                      antibiotic_med)) %>%
    dplyr::select(id,
                  antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>% base::unique() 

[1] "rifaximin"
[2] "amoxicillin"
[3] "hydrocortisone; neomycin; polymyxin b otic"
[4] "trimethoprim"
[5] "cefdinir"
[6] "clindamycin"
[7] "azithromycin"
[8] "sulfamethoxazole"
[9] "ethambutol"
[10] "pyrazinamide"
[11] "minocycline"
[12] "sulfamethoxazole; trimethoprim"
[13] "cefixime"
[14] "polymyxin b; trimethoprim ophthalmic"
[15] "dexamethasone; tobramycin ophthalmic"
[16] "cefuroxime"
[17] "doxycycline"
[18] "amoxicillin; clavulanate"
[19] "erythromycin topical"
[20] "ciprofloxacin"
[21] "vancomycin"
[22] "penicillin v potassium"
[23] "silver sulfadiazine topical"
[24] "penicillin"
[25] "moxifloxacin ophthalmic"
[26] "gatifloxacin ophthalmic"
[27] "metronidazole"
[28] "ciprofloxacin; dexamethasone otic"
[29] "erythromycin ophthalmic"
[30] "gentamicin ophthalmic"
[31] "azithromycin ophthalmic"
[32] "tetracycline"
[33] "ofloxacin ophthalmic"
[34] "ciprofloxacin ophthalmic"
[35] "dexamethasone; neomycin; polymyxin b ophthalmic"
[36] "chloramphenicol"
[37] "mupirocin topical"
[38] "isoniazid"
[39] "levofloxacin"
[40] "nitrofurantoin"
[41] "moxifloxacin"
[42] "benzoyl peroxide; clindamycin topical"
[43] "sulfacetamide sodium ophthalmic"
[44] "neomycin; polymyxin b sulfate topical"
[45] "sulfasalazine"
[46] "metronidazole topical"
[47] "clarithromycin"
[48] "cefprozil"
[49] "clindamycin topical"
[50] "polymyxin b sulfate"
[51] "ofloxacin otic"
[52] "tobramycin ophthalmic"
[53] "dapsone topical"

Join tables back together

If I want to attach the antibiotic medication used by each patient (identified with id), I can join EHRAntiBMeds back to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
Observations: 18,704
Variables: 9
$ id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ seqn 62161, 62161, 62162, 62163, 62164, 62…
$ rxduse 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ rxddrug "FLUOXETINE", "METHYLPHENIDATE", NA, …
$ rxddrgid "d00236", "d00900", NA, NA, "d00181",…
$ rxqseen 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N…
$ rxddays 5840, 5840, NA, NA, 9125, NA, 547, NA…
$ rxdcount 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N…
$ antib_med NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

And now I can see the counts of the top ten antibiotic medications in EHRMedTable.

EHRMedTable %>%
    dplyr::filter(!is.na(antib_med)) %>%
    dplyr::count(antib_med) %>%
    dplyr::arrange(desc(n)) %>%
    utils::head(10)
# A tibble: 10 x 2
antib_med n

1 amoxicillin 126
2 azithromycin 40
3 sulfamethoxazole; trimethoprim 33
4 doxycycline 26
5 amoxicillin; clavulanate 23
6 minocycline 16
7 cefdinir 14
8 ciprofloxacin 13
9 penicillin v potassium 13
10 nitrofurantoin 11

BONUS! Using stringr::str_detect() within a dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top ten table print-out above, you might want a variable that indicates when more than one medication is listed in the antib_med column. Fortunately, dplyr::case_when() works well with stringr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

# test -----
EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_)) %>%
dplyr::count(antib_2meds, antib_med) %>%
tidyr::spread(antib_2meds, n) %>%
head(10)
# A tibble: 10 x 4
   antib_med                   `1 antibiotic meds` `2 antibiotic meds` `<NA>`
 1 amoxicillin                                 126                  NA     NA
 2 amoxicillin; clavulanate                     NA                  23     NA
 3 azithromycin                                 40                  NA     NA
 4 azithromycin ophthalmic                       2                  NA     NA
 5 benzoyl peroxide; clindamy…                  NA                   4     NA
 6 cefdinir                                     14                  NA     NA
 7 cefixime                                      2                  NA     NA
 8 cefprozil                                     1                  NA     NA
 9 cefuroxime                                    3                  NA     NA
10 chloramphenicol                               1                  NA     NA
# assign -----
EHRMedTable <- EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could be used if you had lists stored in other files, too (Excel files, Google Sheets, etc.).
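For example, if your drug list lived in a spreadsheet instead of a Wikipedia table, a minimal sketch might look like this (the file antibiotics.xlsx and its generic_name column are hypothetical stand-ins for your own list):

library(tidyverse)
library(readxl)
# hypothetical spreadsheet with one generic antibiotic name per row
antibiotics <- readxl::read_excel("Data/raw_data/antibiotics.xlsx")
# collapse the names into a single regular expression, as above
antibiotic_med <- antibiotics$generic_name %>%
    stringr::str_to_lower() %>%
    base::unique() %>%
    stringr::str_c(collapse = "+?|") %>%
    base::paste0("+?")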

Be sure to export the data files!

if (!file.exists("Data/processed_data")) {
dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
"EHRMedTable",
as.character(Sys.Date()),
".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text “Teaching Statistics: A Bag of Tricks” by Andrew Gelman and Deborah Nolan.

The following quote is from a section titled “learn how to learn new technologies”:

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added efficiently because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here

Ty Cobb and integrity in publishing

I recently purchased and read Ty Cobb: A Terrible Beauty by Charles Leerhsen. I decided to get this book after watching Leerhsen’s lecture at Hillsdale College. I’d always thought of Ty Cobb as the racist curmudgeon portrayed by Tommy Lee Jones in the 1994 film Cobb. Even before seeing this film, Ty Cobb’s reputation for being rotten was pervasive–when referring to him in Field of Dreams, Shoeless Joe Jackson states, “No one liked that son of a bitch.”

Unfortunately for Cobb (and anyone interested in the truth), these portrayals of the baseball great’s life and character are highly fictionalized. Most of the popular opinions of Ty Cobb come from two biographies: Charles C. Alexander’s Ty Cobb, and Al Stump’s Cobb. These authors construct a narrative that depicts Cobb as a drunk, belligerent bully who used to sharpen his cleats and scream racial epithets at his hired help.

Leerhsen does a fantastic job addressing how these stories are more likely to be based on fiction than facts, pushed by the authors to increase their book sales. After all, a baseball star who is a racist jerk will elicit a (well deserved) sense of outrage and disgust, thereby attracting more attention.

From the epilogue,

This Cobb was someone they could shake their head at, denounce, and feel superior to. Spinning stories in a way that made him look immoral was a convenient way to say, “I am not a racist because I reject this man who is.” Cultures change as values change, wars are waged and the harvest waxes and wanes, but a villain who inspires self-congratulation makes for one hell of a tenacious myth.

The tragedy of Ty Cobb’s narrative is the insightful baseball and general life lessons the man had to offer. Leerhsen distills Cobb’s philosophy on baseball into two words: pay attention. Cobb would spend endless hours mentally rehearsing the game, taking notes, and thinking up possible scenarios and plays. He also paid attention to the minds of his opponents.

Another example from Leerhsen,

After [Cobb] noticed how upset the good-hearted Big Train got when he beaned batters, Cobb stood in against him as he did against nobody else, hunching over the plate and sticking his head into the strike zone. He could have gotten killed; instead, very often, he got walked.

Anyone who has read Moneyball and knows the importance of walks and on-base percentage sees the genius of Ty Cobb at work here.


The takeaway lesson I have from this book isn’t actually from the book. It came from a woman who stood and gave praise during the Q&A portion of Leerhsen’s lecture:

…you’ve written a cautionary tale that in a complicated political season has a lesson for us…what happened to Ty Cobb could not happen today because everybody knows everything, but it does happen. So thank you for your courage in writing a book that reminds us that we don’t know everything until we really know somebody and that everything that we think we know we should re-examine several times with a clear conscience and our own integrity before we make those judgments. Thank you very much…

The last portion of this statement resonated with me while reading an article from FiveThirtyEight, “We Used Broadband Data We Shouldn’t Have — Here’s What Went Wrong”.

The reason this article is important,

We should have been more careful in how we used the data to help guide where to report out our stories on inadequate internet, and we were reminded of an important lesson: that just because a data set comes from reputable institutions doesn’t necessarily mean it’s reliable.

An article like this takes courage. In the era of ‘Fake News’ and ‘alternative facts’, it’s refreshing to see this kind of honesty from a media source that relies so much on evidence-based reporting. I imagine an article like this must’ve been painful to write, but I respect the authors more after reading it. That’s when I thought of the comment from Leerhsen’s lecture, and realized how important it is to value integrity.

Wikipedia defines integrity in ethics as, “the honesty and truthfulness or accuracy of one’s actions.” I tend to think of it as, “doing what you know is right even when no one is looking.”

Literate Programming & Dynamic Document Options in Stata

TL;DR

I’ve spent the last few months attempting to incorporate different literate programming and reproducible research options with the Stata statistical software. This post provides a quick overview of my goal, a brief “how-to” on each option, and my thoughts on realistically introducing them in a workflow.


What is literate programming?

Literate programming is a term coined by Donald E. Knuth in 1984. The general idea is to combine human readable text with machine-readable commands into the same document, manual, or website.  At the beginning of his paper, Knuth writes,

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Although Knuth is describing the process of writing computer programs, his concept applies to any scenario where a series of commands are used to get a computer to perform a particular set of functions.

The data analysis process includes a successive set of commands used to manipulate the rawest form of the data, transform it into summaries or visualizations, and then model it for predictions and inferences. Each step in the process is based on the preceding step, so tracking the entire method in a systematic way is essential for remaining organized and being able to reproduce your work.

The yardstick I was using to evaluate each Stata literate programming option was how well each method provided a relatively seamless transition between the Stata commands and the human-readable explanations.

A Typical Stata Workflow

Below is an example workflow in Stata.

(Image: an example Stata workflow)

While working in Stata, commands get put into a .do file, and they are used to create the output in either  .gph (image) or .smcl (text) files. The text results will then get written up in a .docx, and the tables and figures will be created using .xlsx. These documents are then sent off to a scientific journal somewhere, where the finished paper will most likely end up as a .pdf or .html on the web.

So to recap, the general process is:

.do >> .gph >> .smcl >> .docx >> .xlsx >> .pdf >> .html 

This workflow involves seven different file types to create a single finished manuscript (that will usually contain only text and images).

Why should I care?

If you’ve read this far, the answer should be obvious: the process outlined above is inefficient. The analysis process is split across Stata, MS Word, MS Excel, Adobe/Preview, and whatever web browser you’re using. This makes working in Stata tedious.

Solution: Digital Notebooks

I recently came across a white paper that discusses the benefits of using Notebooks, and I’ve summarized the main points below:

  • “…notebooks speed up the process of trying out data models and frameworks and testing hypotheses, enabling everyone to work quickly, iteratively and collaboratively”
  • “…can be used to perform an analysis, generate visualizations, and add media and annotations within a single document.”
  • “…can be used to annotate the analyses for future reference and keep track of each step in the data discovery process so that the notebook becomes a source of project communication and documentation, as well as a learning resource for other data scientists.”

Although the paper is referring specifically to the Jupyter Notebooks, RStudio recently introduced the R Notebooks. Both methods combine similar sections for markdown formatted text, data analysis commands, output tables and/or figures, and other relevant portions of the results. These digital notebooks closely resemble paper laboratory notebooks (see below).

(Slideshow: pages from a paper laboratory notebook)

As you can see from this example, some of the text and calculations have been handwritten, while others have been calculated outside of the notebook, printed, and then pasted back inside the lab notebook. I commend the authors for their transparency, but this doesn’t seem like the most efficient method of keeping track of your work.

Does Stata have an equivalent option?

Sort of. Below I review my experience using three Stata options that collectively provide similar abilities to the notebooks provided by Python and R.


#1 markdoc

markdoc is a package for creating dynamic documents in Stata. It is very similar to the knitr package in R. The package was written by E. F. Haghish and is available on Github. To run markdoc, you’ll need to install Pandoc (which requires typesetting software for TeX/LaTeX–I used MikTeX on my PC and Homebrew on my Mac).

After installing Pandoc and MikTeX, you’ll also need to download and install wkhtmltopdf.

Installing markdoc

You can install markdoc with the ssc command

ssc install markdoc

You should see:

checking markdoc consistency and verifying not already installed...
installing into c:\ado\plus\...
installation complete.

You’ll also need to install weaver and statax

ssc install weaver
ssc install statax


markdoc also has a handy dialog box you can install with the following command:

db markdoc

The Output Files

Haghish provides example .do files for getting started with markdoc. I recommend working through each of them, but it shouldn’t be too difficult if you’re used to commenting in your .do files or writing in markdown. The .docx and .pdf output files are clean, organized, and formatted.

(Images: markdoc .docx and .pdf output)

markdoc is ideal for producing high-quality documents directly from the Stata IDE. After you understand the markdoc syntax, you will be able to perform the majority of your work in the .do file. The only downside I encountered in markdoc was a somewhat buggy installation–it worked better for me on the Mac. But the package is incredibly well maintained by the author, and I was able to find answers to my questions on his Github page eventually.


#2 weave

Germán Rodríguez at Princeton created weave, and I consider it a markdoc-light Stata package.

Installing weave

Installation is easy. Just type the following command into the Stata IDE.

net from http://data.princeton.edu/wws509/stata

And there’s an example .do file on his website.

weave essentially uses markdown/html tags for inline images and headers that are written directly into your .do files. The results are inserted into the output as plain text, so there is no need to tweak their formatting. When you’re finished with your analyses, you just type the following commands directly into the IDE.

weave using sample

The beginning of the .do file contains a command that logs everything to a .usl file. The .usl file is then ‘weaved’ to create an .html output, which will automatically open in your web browser.

The Output Files

You can just print the .html file to a .pdf like you would any web page. Chrome seems to create the best-looking .pdfs.

(Image: weave .html output printed to .pdf)

*TIP: use minimal lines on your .do file to create cleaner looking output. I’ve created a detailed example here.

I use weave whenever I’m using Stata on my Mac. It’s easy to use, quick to format, and only requires me to have Stata open with a .do file. I’ll use markdoc if I am creating a more professional-looking report, but the bugginess of markdoc doesn’t make it very user-friendly.


#3 ipystata

The Jupyter Notebooks (previously IPython Notebooks) can be configured to work with Stata commands. Unfortunately, the package works best with Windows/PC.  The setup isn’t too complicated but has a few steps that can trip you up.

Download Anaconda from Continuum

You can download the most recent version of Anaconda from Continuum. This will include the following applications:

(Image: applications included with Anaconda)

The only application I will be covering in this post is the Anaconda Prompt.

Changing the Jupyter Notebook working directory

The first thing you will want to do is set up your Jupyter Notebook in an appropriate working directory. You can do this by right-clicking on the Anaconda Prompt and running it as an administrator (I’ve moved the application to the taskbar).


When the prompt is displayed (it should say Administrator), copy and paste the path of the folder you want the Jupyter Notebook to open in. In the Anaconda Prompt, type

cd C:\Users\Google Drive\...\ipystata\notebooks

This will change your working directory. After you’re in the correct working directory, start up the Jupyter Notebooks by typing the following command in your Anaconda Prompt,

jupyter notebook

This should open a new tab in your default web browser.


You can open a new notebook using the tab on the far right of the screen by selecting “New” >> “Python [default]”.

Registering Stata

You will need to open a Command Prompt window as an administrator by right-clicking on the application and selecting “Run as administrator” (*you can search for this application in the Windows search bar by typing “cmd”).


From here you need to navigate to your Stata application in your Program Files (usually in the C:\ drive).


Copy and paste the file path and enter it into the Command Prompt window, preceded by cd:

cd C:\Program Files (x86)\Stata14


From this location, register the Stata application by typing the name of the .exe file followed by a space and /Register.


*No news is good news on this command. 

Installing ipystata and pandas

Now go back to your Jupyter Notebook and import pandas. pandas is an open-source data analysis package for Python. Read about it here.

In the first line of your notebook type:

import pandas as pd

To install ipystata, you’ll need to open a Windows PowerShell window (as an administrator) and enter the following command:

pip install ipystata


After the package has been installed, enter the following command in your Jupyter Notebook:

import ipystata

To test if it worked, type a simple display command preceded by %%stata. The output should look like this:

(Image: output of a display command run with %%stata in a Jupyter Notebook)

Using the %%stata Magic Commands

Now that you are up and running, the Jupyter Notebook basically replaces your .do file. You will just need to precede the Stata commands with a line containing %%stata.

Start by loading a native dataset

%%stata
sysuse auto, clear

You can get a quick overview of these data by using codebook, compact or describe, short

(Image: output of describe, short)

Including Graphs in the Output

To include graphs in your output, simply include the -gr command on the same line as your  %%stata command.

(Figure 1: matrix graph)

(Figure 2: scatter plot)

Sharing Your Output Online

In my opinion, the best part of using Jupyter Notebooks is the ability to share your work online. You can publish your notebook using the cloud+arrow icon on the toolbar (register your account first).


In fact, this notebook and a complete example of the ipystata package is available online. I think this feature makes the Jupyter Notebook the best option for literate programming and reproducible research in Stata. The complicated setup is definitely worth the time investment because you’ll be able to have an ongoing stream of commands, formatted text summaries, and output all in one place.

 

 

 

Tools & Resources for Learning Stata

The Resources for learning Stata page has most of the sites I describe below. Unfortunately, their list is also riddled with link rot, and many of the resources use ancient versions/commands. I’ve only included the active sites that I’ve actually used for analysis/projects.

*I’ll continue updating as I find more resources. 

My first post on this blog was about tools and resources for learning R, but I felt I should also include a list of Stata sites. 

UNC Carolina Population Center

This tutorial is centered around survey research and provides a helpful orientation to basic Stata commands. Also includes sample datasets.

IDRE at UCLA

The Institute for Digital Research and Education at the University of California, Los Angeles, has a repository of basic examples and tips. The Stata starter kit isn’t a bad place to start.

Germán Rodríguez from Princeton University

This tutorial is another crash course for anyone wanting a general understanding of Stata. He also covers some intermediate/advanced topics like macros and looping.

University of Wisconsin-Madison

If there were an award for Stata window screenshots, this tutorial would win by a long shot. Unfortunately, the images are from Stata 10 or 11 (PC). Still, it is very comprehensive, with quite an extensive graphics section.

Vanderbilt Biostats Class Lecture Notes

These slides and examples are great if you’re looking for information on a particular statistical model. They have a great Multiple Poisson Regression model and an excellent overview of more complicated techniques like Neural Nets and Classification and Regression Trees.

Quick Q&A on Stata from Indiana University

The IU Knowledge Base has about 20 frequently asked questions on Stata use that I have referred to more than once. Some are general, others more esoteric. But their explanations are concise, and that’s a major plus in my book.

The Stata blog (and youtube channel)

The Stata blog has some useful information (see this example on effect sizes), as does their YouTube channel. I found the video lengths manageable, and they usually cover the topics well. The reference material from Stata Press is usually overkill but has some example datasets to work through.

Geocenter (best for last!)

I stumbled across a fantastic Stata tutorial on GitHub. The lessons include slides and homework problems (yay homework!) with solutions. I recommend this for all Stata users because it shows how version control (like GitHub) can be used with Stata .do files.

They’ve also created Stata cheat sheets that bear a striking resemblance to the RStudio cheat sheets. Well played.

Getting Started in RStudio Notebooks

This is the first draft of a post that was featured on storybench.org. Please check out that version if you run into issues here.

 

R is a powerful statistical programming language for manipulating, graphing, and modeling data. One of the major positive aspects of R is that it’s open-source (free!). But “free” does not mean “easy.” Learning to program in R is time-consuming (and occasionally frustrating), but fortunately, the number of helpful tools and packages is constantly increasing.

Enter RStudio

RStudio is an integrated development environment (IDE) that streamlines the R programming workflow into an easy to read layout. RStudio also includes useful tools (referred to as packages) for data manipulation, cleaning, restructuring, visualizations, report writing, and publishing to the web.

Just like R, it’s free. RStudio recently released R Markdown Notebooks, a nice integration of code, plain text, and results that can be exported into PDF, .docx, or HTML format.


Getting started

Start out by installing R and RStudio (you’ll need the preview version found here)

*If you need help installing R or RStudio, feel free to use this Google doc installation guide.

The IDE environment has four panes (seen below).
(Image: the four RStudio panes)

As you can see from the image above, the upper left pane (where I’m writing this tutorial) is the editor. The pane to the right (where it says “Environment is empty“), will show the working dataset. The lower left pane is called the console, which runs the R code. And the pane in the bottom right will display my results.

Opening a New R Notebook

To get started, click on “File” > “New File” > “R Notebook”. R Notebooks automatically start off with a title and some sample code. To see how the analysis is woven into the HTML, click on the small “play” button:

(Image: the play button)

Save the file (“File” > “Save”) and then click on “Preview” at the top of the pane.

(Image: the Preview button)

I don’t want to spoil the suspense, so I won’t put a screenshot of what you’ll see. Just know that R Notebooks does a great job of combining markdown text, R code, and results in a clean, crisp, easy-to-share finished product.

R syntax – numbers & text

You can use RStudio as a simple calculator. Type 2 + 2 directly into the console and press enter. You should see this:

2 + 2
[1] 4

 

You’re probably hoping to use RStudio for something slightly more advanced than simple arithmetic. Fortunately, R can calculate and store multiple values in variables to reference later. This is done with the <- assignment operator:

x <- 2 + 2

The <- is similar to the = sign. In fact, the = sign does the same thing, but the typical convention in R is the <-. To see the contents of x, enter it into the console and press enter.

x
[1] 4

You can also perform mathematical operations with variables. Store 4 + 4 in a variable called y and add it to the variable x:

y <- 4 + 4
y + x
[1] 12

R recognizes both numbers and text, or “string” characters. Text can also be stored in variables using the <- symbol and quotation marks.

a <- "human"
b <- "error"

Text strings are stored differently than numerical data in R. The commands used to manipulate strings are also slightly different.

If you want to combine two strings, use the paste function

paste(a,b)
[1] "human error"

Objects & Data Structures in R

R is an object-oriented programming language, which means it recognizes objects according to their structure and type. The most common objects in R are atomic vectors and lists.

Atomic vectors 1.1 – numeric & integer vectors

Numeric vectors (also called double) include “real” numbers with decimal places, while integers are whole numbers. To create numeric vectors, use the c() function, which stands for concatenate (a fancy term for combining).

Below is an example of a numeric vector of odd numbers less than 10:

odd_vect <- c(1.3, 3.3, 5.5, 7.7, 9.9)

This statement is saying, “combine these five numbers into a vector and call it odd_vect.”

If I want to create an integer (or whole number) vector, I need to follow each number with an L.

The assignment operator also works in the other direction–something I didn’t learn until recently. Use it to create another vector named even_vect of even integers less than or equal to 10.

c(2L, 4L, 6L, 8L, 10L) -> even_vect
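To confirm that even_vect really is stored as an integer vector, check it with typeof():

typeof(even_vect)
[1] "integer"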

The c() function works for combining separate vectors, too. Combine these two vectors into a new vector called ten_vect and print the contents:

ten_vect <- c(odd_vect, even_vect)

ten_vect

[1]  1.3  3.3  5.5  7.7  9.9  2.0  4.0  6.0  8.0 10.0

The final numeric vector (ten_vect) has combined both the odd and even values into a single vector.

Atomic vectors 1.2 – logical vectors

Logical vectors return two possible values, TRUE or FALSE. We can use logic to interrogate vectors in order to discover their type.

For example, we can use is.numeric to figure out if the ten_vect vector we created ended up being numeric or integer.

is.numeric(ten_vect)

[1] TRUE

Why did the combination of a numeric and an integer vector end up being numeric? This is referred to as coercion. When a less flexible data type (integer) is combined with a more flexible data type (numeric), the less flexible elements are coerced into the more flexible type.
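You can see the coercion with typeof(), using the vectors created above:

typeof(odd_vect)
[1] "double"
typeof(even_vect)
[1] "integer"
typeof(ten_vect)
[1] "double"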

Atomic vectors 1.3 – character vectors

In R, character vectors contain text strings. We can use character vectors to construct a sentence using a combination of c() and the <- operator.

We will start with a preposition

prep_vect <- c("In")

then include a noun

noun_vect <- c("the Brothers Karamazov,")

throw in a subject,

sub_vect <- c("Dmitri")

sprinkle in a verb,

verb_vect <- c("kills")

and finish with an object

obj_vect <- c("his father")

Sentence construction can be a great way to learn how vector objects are structured in R. Atomic vectors are always flat, so you can nest them all…

sent_vect <- c("In",c("the Brothers Karamazov,",c("Dmitri",c("kills",c("his father")))))

sent_vect

[1] "In"                      "the Brothers Karamazov," "Dmitri"                 
[4] "kills"                   "his father"

Or enter them directly:
c("In","the Brothers Karamazov", "Dmitri", "kills", "his father"

[1] "In"                      "the Brothers Karamazov," "Dmitri"                 
[4] "kills"                   "his father"

Both return the same result.

Finally, we can combine each part of the sentence together using paste:

sent_vect <- paste(prep_vect, noun_vect, sub_vect, verb_vect, obj_vect)

sent_vect

[1] "In the Brothers Karamazov, Dmitri kills his father"

Lists

Unlike vectors–which only contain elements of a single type–lists can contain elements of different types.

We will create a list that includes an integer vector (even_vect), a logical vector (TRUE, FALSE), a full sentence (sent_vect), and a numeric vector (odd_vect), and we will call it my_list:

my_list <- list(even_vect, c(TRUE, FALSE), c(sent_vect), c(odd_vect))

We will look at the structure of our list using str():

str(my_list)

List of 4
 $ : int [1:5] 2 4 6 8 10
 $ : logi [1:2] TRUE FALSE
 $ : chr "In the Brothers Karamazov, Dmitri kills his father"
 $ : num [1:5] 1.3 3.3 5.5 7.7 9.9

Lists are recursive–they can contain other lists.

lists_on <- list(list(list(list())))

str(lists_on)

List of 1
 $ :List of 1
 ..$ :List of 1
 .. ..$ : list()

This feature separates lists from the atomic vectors described above.
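As a quick sketch of how those elements sit inside a list, you can pull them out individually with double brackets (using my_list from above):

my_list[[2]]
[1]  TRUE FALSE
my_list[[4]]
[1] 1.3 3.3 5.5 7.7 9.9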

So there you have it! This how-to should give you some basics in R programming. You can save it as an HTML, PDF, or .docx file for future reference.