Quickly import and export data from R with datapasta

EDIT: As usual, Mara Averick was way ahead of me on these two packages 🙂 She wrote a great example (with much cleaner .gifs), so be sure to check out her post too!

https://maraaverick.rbind.io/2018/10/reprex-with-datapasta/
follow her on twitter too! -> https://twitter.com/dataandme

The datapasta package is great for importing and exporting data. It’s copy + paste ability transforms just about any rectangular data you can drag your mouse over into data.frame‘s or tibble‘s.

Using datapasta

Check out the data in this table on male and female height and weights. I can highlight it all, click cmd + c (or ctrl + c),

Now I can head back to RStudio and enter the following in a fresh .R script.

library(datapasta)
library(tidyverse)
datapasta::tribble_paste()

You should see something like this:

ahhh pure bliss!

Pretty slick, huh? Unfortunately, when I try to run this code, I get the following error:

Error in list2(…) : object 'NANA' not found

It looks like the NANA values are throwing the tribble_paste() function off. No worries, datapasta also has a df_paste() function too!

Can this get any better?

Now I can also go through and edit the data.frame() function a bit to remove the pesky NA columns and values.

What if my dataset is really big?

Maybe you have a huge dataset you want to import into R, but you’re not sure if datapasta can handle it? All it takes is a little adjustment on the datapasta::dp_set_max_rows() function.

For example, if I wanted to copy + paste this table into an RStudio session, I could enter datapasta::dp_set_max_rows(num_rows = 15000) in my .R script just above the tribble_paste() function.

datapasta::dp_set_max_rows(num_rows = 15000)
datapasta::tribble_paste()
it’s slow, but it gets there…

As you can see, tribble_paste() parsed this table into a tibble::tribble() function (and it was over 14000 rows!).

But wait…there’s more!

What if I found a problem with the table I just pasta ‘d into R (or another data set in R)? I don’t know about you, but I’m constantly needing to ask questions or share code on Stackoverflow or RStudio Community.

Well, the handy-dandy dpasta() function helps me create excellent reproducible examples.

Let’s assume I needed to share a sample from the height and weight data I just imported (WordHtWt).

  • First I’ll add some meaningful names to the columns in WordHtWt (using magrittr::set_names()),
  • Then take a small sample with help from dplyr (it’s a good idea to always use the smallest possible data frame to re-create the problem),
  • And…
# new names
world_height_weight_names <- c("country", "male_avg_ht_m",
"male_avg_wt_kg", "male_bmi",
"female_avg_ht_m", "female_avg_wt_kg",
"female_bmi")
# clean and set names
WordHtWt %>%
# set some better names
magrittr::set_names(world_height_weight_names) %>%
# get a sample for reprex
dplyr::sample_frac(size = 0.10) %>%
# PASTA!!!
dpasta()
Voila!

Now I have a nice bit of code I can post (and hopefully get my questions answered).

Friends and alternatives to datapasta

datapasta plays well with the reprex package. If you aren’t sure what reprex does, you should watch the webinar from Jenny Bryan (the package author). If you are looking for the base R alternative to dpasta(), there’s dput(), but the output is not as clean (and it doesn’t have a direct analog to the _paste() functions).


# clean and set names
WordHtWt %>%
# set some better names
magrittr::set_names(world_height_weight_names) %>%
# get a sample for reprex
dplyr::sample_frac(size = 0.10) %>%
# try this with dput()
dput()

structure(list(country = c("Taiwan", "Burma", "Kazakhstan", "Bolivia",
"Belgium", "Mali", "Mauritius", "Laos", "Burundi", "France",
"Mexico", "Nigeria", "Turkey"), male_avg_ht_m = c("1.73 m", "1.65 m",
"1.72 m", "1.67 m", "1.81 m", "1.72 m", "1.71 m", "1.60 m", "1.68 m",
"1.79 m", "1.68 m", "1.67 m", "1.74 m"), male_avg_wt_kg = c("74.8 kg",
"60.4 kg", "77.8 kg", "70.6 kg", "87.8 kg", "67.7 kg", "71.9 kg",
"57.9 kg", "61.5 kg", "83.3 kg", "77.6 kg", "63.0 kg", "82.4 kg"
), male_bmi = c(25, 22.2, 26.3, 25.3, 26.8, 22.9, 24.6, 22.6,
21.8, 26, 27.5, 22.6, 27.2), female_avg_ht_m = c("1.60 m", "1.54 m",
"1.60 m", "1.53 m", "1.65 m", "1.61 m", "1.57 m", "1.51 m", "1.55 m",
"1.65 m", "1.56 m", "1.58 m", "1.60 m"), female_avg_wt_kg = c("60.7 kg",
"54.5 kg", "68.1 kg", "64.8 kg", "70.0 kg", "59.9 kg", "64.1 kg",
"52.4 kg", "51.7 kg", "66.4 kg", "69.4 kg", "59.9 kg", "73.7 kg"
), female_bmi = c(23.7, 23, 26.6, 27.7, 25.7, 23.1, 26, 23, 21.5,
24.4, 28.5, 24, 28.8)), row.names = c(NA, -13L), class = "data.frame")

Additional resources

Check out the vignette for datapasta here, the tidyverse packages, and an excellent description of how to write a reproducible example from Advanced R by Hadley Wickham. Be sure to thank the datapasta author Miles McBain for all the future headaches he just saved you from.

Querying MySQL from RStudio

In the previous post I demonstrated how to install and use MySQL from the command line. In this tutorial I will show you how to set up and query a relational database management system (RDBMS) from RStudio.

The Lahman database

These queries are done using the Lahman baseball data set available here. I chose these data because 1) they are open to anyone and can be run on your local machine, and 2) they are available as a .sql file and .csv files for loading into any RDBMS software.

Database packages in RStudio

For this tutorial I will be using a local instance of MySQL with help from the DBI, dplyr, and RMySQL packages.

library(DBI) 
library(RMySQL)
library(RMariaDB)
library(dplyr)
library(rstudioapi)
library(ggthemes)

Change MySQL settings

Before we can connect to a database with RStudio, we need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption.
  3. Then click Start MySQL Sever

Connect to MySQL

A DBI connection is built below to my local instance of MySQL.

LahmanDBIMySQL <- DBI::dbConnect(MySQL(),
dbname = "lahman2016",
host = "localhost",
user = "root",
rstudioapi::askForPassword(prompt = "database password")) LahmanDBIMySQL

As I can see, this is a <MySQLConnection:0,1>. This is the object we will use to access the lahman2016 database.

Querying a database from RStudio

We can get a list of tables in the lahman2016 database by using the DBI::dbListTables() function.

DBI::dbListTables(LahmanDBIMySQL)
[1] "AllstarFull" "Appearances" "AwardsManagers"
[4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost" "CollegePlaying"
[10] "Fielding" "FieldingOF" "FieldingOFsplit"
[13] "FieldingPost" "Griffeys" "HallOfFame"
[16] "HomeGames" "Managers" "ManagersHalf"
[19] "Master" "Parks" "Pitching"
[22] "PitchingPost" "Salaries" "Schools"
[25] "SeriesPost" "Teams" "TeamsFranchises"
[28] "TeamsHalf"

This is similar to the SHOW TABLES; command in MySQL. In the previous tutorial, I built a table of statistics on Ken Griffey Junior and Senior in the Lahman database. The Griffeys table contains following fields:

DBI::dbListFields(LahmanDBIMySQL, "Griffeys") 
## [1] "playerID" "birthYear" "nameFirst" "nameLast" "weight"
## [6] "height" "bats" "throws" "retroID" "bbrefID"
## [11] "teamID" "lgID" "yearID" "G" "AB"
## [16] "R" "H" "HR" "RBI" "BB"
## [21] "SO"

Queries can be passed directly to the DBI::dbGetQuery() function, like the SELECT statement below.

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT * FROM Griffeys;") %>%
dplyr::glimpse(78)

Observations: 45
Variables: 21
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ retroID "grifk001", "grifk001", "grifk001", "grifk001", "grifk001…
$ bbrefID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…

I realize I don’t have all the columns I want from the Batting and Fielding tables, so I will query lahman2016 again to collect these data.

Aliasing columns in MySQL tables

There are three columns in the Batting table I want to add to Griffeys: SF, 2B, and 3B. Two of these names violate naming rules in R.

DBI::dbListFields(LahmanDBIMySQL, "Batting") 
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "AB" "R" "H" "2B" "3B" "HR"
[13] "RBI" "SB" "CS" "BB" "SO" "IBB"
[19] "HBP" "SH" "SF" "GIDP"

The 2B and 3B columns contain the number of doubles and triples for each player, and SF is the number of sacrifice flies. Unfortunately, R won’t allow columns to have names that start with a number. I can write a query to include these columns, but use an alias to create R-friendly column names.

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;") %>% glimpse(78)

Observations: 45
Variables: 22
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…
$ dubs 5, 9, 15, 28, 35, 33, 27, 28, 21, 23, 21, 20, 28, 7, 15, …
$ trips 1, 5, 9, 9, 8, 8, 4, 10, 6, 2, 3, 1, 4, 0, 3, 1, 0, 0, 3,…
$ sac_flies "0", "0", "3", "3", "2", "3", "3", "5", "4", "3", "2", "4…

Storing SQL in character vectors

I can also assign the query above to a character vector (batting_query) and pass the vector to the DBI::dbGetQuery() function. I will assign these new variables to the Griffeys table.

batting_query <- "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;"
Griffeys <- DBI::dbGetQuery(LahmanDBIMySQL, batting_query)

Now that we see the SQL works and these two tables are joined, we can calculate a few new statistics in R.

Calculating On-base plus slugging (OPS)

A players on-base plus slugging (OPS) is a measure of their overall batting performance. This is the sum of the player’s on-base percentage and slugging average. The steps to calculate this statistic are below.

GriffsOPSbyYear <- Griffeys %>% 
dplyr::mutate(
# slugging percentage
slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
# convert sac_flies to numeric
sac_flies = as.numeric(sac_flies),
# On Base Percentage
ob_perc = (H + BB) / (H + AB + BB + sac_flies),
# On-base plus slugging
obslug_perc = slug_perc + ob_perc,
# create a nameID
nameID =
case_when(
birthYear == 1950 ~ "Ken Griffey Sr",
birthYear == 1969 ~ "Ken Griffey Jr"),
# convert to a factor
nameID = factor(nameID),
# make KGSr the reference
nameID = relevel(nameID, ref = "Ken Griffey Sr"))

I created a few intermediate calculations before creating obslug_perc, and I also created a nameID so the two players can be easily graphed. Below is the on-base plus slugging (OPS) statistic for both Griffeys over their career.

GriffsOPSbyYear %>% 
ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
group = nameID, color = teamID)) +
geom_line() +
ggplot2::facet_grid(. ~ nameID, scales = "free") +
ggplot2::labs(
title = "The Griffey's career on-base plus slugging (OPS)"
) + ggthemes::theme_fivethirtyeight()

The Lahman package in R

There is also a Lahman package in R. I will use this below to demonstrate some of the similarities between dplyr and MySQL syntax.

First I create an object for the Lahman::Master, Lahman::Fielding, and Lahman::Batting tables.

Master <- Lahman::Master
Batting <- Lahman::Batting
Fielding <- Lahman::Fielding

Then the code below creates the same tables and graph using dplyr commands.

GriffeyOPSPlot <- Master %>% 
dplyr::select(playerID,
birthYear,
nameLast) %>%
# find Griffeys
dplyr::filter(nameLast %in% "Griffey") %>%
# join to batting table
dplyr::inner_join(x = .,
y = Batting,
by = "playerID") %>%
# select/rename relevant columns
dplyr::select(teamID,
nameLast,
playerID,
birthYear,
yearID,
G,
AB,
R,
H,
HR,
dubs = X2B,
trips = X3B,
RBI,
BB,
SO,
sac_flies = SF) %>%
dplyr::mutate(
# slugging percentage
slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
# convert sac_flies to numeric
sac_flies = as.numeric(sac_flies),
# On Base Percentage
ob_perc = (H + BB) / (H + AB + BB + sac_flies),
# On-base plus slugging
obslug_perc = slug_perc + ob_perc,
# create a nameID
nameID =
case_when(
birthYear == 1950 ~ "Ken Griffey Sr",
birthYear == 1969 ~ "Ken Griffey Jr"),
# convert to a factor
nameID = factor(nameID),
# make KGSr the reference
nameID = relevel(nameID, ref = "Ken Griffey Sr")) %>%
# create plot of new OPS
ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
group = nameID, color = teamID)) +
geom_line() +
ggplot2::facet_grid(. ~ nameID, scales = "free") +
ggplot2::labs(
title = "The Griffey's career on-base plus slugging (OPS)"
) +
ggthemes::theme_fivethirtyeight()
ggplot2::ggsave(filename = "GriffeyOPSPlot.png", width = 7, height = 5, units = "in")

End

UFC 226 brought out the big guns (again…)

Why did the UFC let Brock Lesnar challenge Daniel Cormier for the belt?

This weekend I watched UFC 226, Miocic vs. Cormier. This was a title match with Miocic defending the heavyweight belt for the third time. Cormier landed a hard right hook in the final minute of the first round that dropped Miocic to the mat, and after a few ground-and-pound head shots, Cormier was given the win by KO (punches).

Before Joe Rogan could finish his fight recap interview, Cormier grabbed the microphone and called out Brock Lesnar (who was conveniently standing just outside the octagon). Lesnar entered the ring, said some rather disrespectful things about Miocic, then called out Cormier, alluding to an upcoming title fight between the 39-year-old Cormier and for WWE star.

I found this spectacle to be surreal, and couldn’t believe Joe Rogan let someone take his microphone. To say the least, it was an unprofessional display of sportsmanship and made me feel like I was watching more of a stage play than an athletic event.

Why would Dana White allow Brock Lesnar back into the UFC?

This question was bouncing around in my head most of the night, and I decided to dig into the numbers to see if I could understand what happened.

The Sports Daily data

After some snooping. I found a data set of pay-per-view sales (the actual metric is ‘Buy Rate’). These data come from the post titled, ” All-Time UFC PPV Sales Data,” and they contain data up until 225 (right before the fight in question).

The code chunk below downloads the table of data from the website, extracts the table, and checks the shape of this new data frame.

PPVUFC_url <- "http://thesportsdaily.com/2018/02/16/all-time-ufc-ppv-sales-data-fox11/"
PPVUFC_extraction <- PPVUFC_url %>%
     read_html() %>%
     html_nodes("table")
# check the structure of the new PPVUFC_extraction object
# PPVUFC_extraction %>% str()
# extract the html table
PpvUfcRaw <- rvest::html_table(PPVUFC_extraction[[1]]) 
# check the shape of the raw data 
# PpvUfcRaw %>% dplyr::glimpse(78)
# Now I need to clean these data up a bit by making the following changes:
# 1. rename the variables
# 2. remove the first row of data (they are the column names)
# 3. remove an empty row of data (row 2)
PayPerViewUFC <- PpvUfcRaw %>%
  dplyr::rename(event = X1,
                date = X2,
                main_event = X3,
                buy_rate = X4)
PayPerViewUFC <- PayPerViewUFC %>%
  filter(event != "Event" & event != "")
# check the dataShape
# dataShape() is a function I wrote that combines a little bit of
# dplyr::glimpse(), utils::head(), utils::tail(), and base::class()
dataShape <- function(df) {
    obs <- nrow(df)
    vars <- ncol(df)
    class <- paste0(class(df), collapse = "; ")
    first_var <- base::names(df) %>% head(1)
    last_var <- base::names(df) %>% tail(1)
    group <- is_grouped_df(df)
    heads_tails <- tibble::as_tibble(.env$ht(df))
    cat("Observations: ", obs, "\n", sep = "")
    cat("Variables: ", vars, "\n", sep = "")
    cat("Class(es): ", class, "\n", sep = " ")     
    cat("First/last variable: ", first_var, "/", last_var, "\n", sep = "")
    cat("Grouped: ", group, "\n", sep = "")
    cat("Top 5 & bottom 5 observations:", "\n", sep = "") 
    heads_tails 
} 
PayPerViewUFC %>% dataShape()
## Observations: 219
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date      main_event                         buy_rate
##  *                                       
##  1 UFC 1   Nov 12/93 Tournament                         86,000
##  2 UFC 2   Mar 11/94 Tournament                         300,000
##  3 UFC 3   Sept 9/94 Tournament                         90,000
##  4 UFC 4   Dec16/94  Tournament                         120,000
##  5 UFC 5   Apr 7/95  Royce Gracie vs Ken Shamrock       260,000
##  6 UFC 221 Feb 11/18 Yoel Romero vs Luke Rockhold       N/A
##  7 UFC 222 Mar 3/18  Cris Cyborg vs Yana Kunitskaya     N/A
##  8 UFC 223 Apr 7/18  Khabib Nurmagomedov vs Al Iaquinta N/A
##  9 UFC 224 May 12/18 Amanda Nunes vs Raquel Pennington  N/A
## 10 UFC 225 Jun 9/18  Robert Whittaker vs Yoel Romero    250,000

Export the raw data and the processed data file.

# writeLines(fs::dir_ls("data"))
write_csv(as_data_frame(PpvUfcRaw), "data/PpvUfcRaw.csv")
write_csv(as_data_frame(PayPerViewUFC), "data/PayPerViewUFC.csv")

Unique Events

These should be the UFC event (all the way back to the beginning). I want to see if this is unique (1 per row). The best to way to be sure of this is with base::identical()

base::identical(x = nrow(dplyr::distinct(PayPerViewUFC, event)),
          y = nrow(PayPerViewUFC))
## [1] TRUE

That’s helpful information–now I know I don’t have duplicate identification numbers for each UFC event.

The dates for each event

The next column in the data set is date, and these are all given as Month(abbreviation) DD/YY. I can quickly clean these data up using the lubridate::mdy() functions.

# check the format
PayPerViewUFC$date %>% glimpse(78)
##  chr [1:219] "Nov 12/93" "Mar 11/94" "Sept 9/94" "Dec16/94" "Apr 7/95" ...
# pick the function and parse date
PayPerViewUFC$date <- lubridate::mdy(PayPerViewUFC$date) 
# check the new date PayPerViewUFC$date %>% glimpse(78)
##  Date[1:219], format: "1993-11-12" "1994-03-11" "1994-09-09" "1994-12-16" "1995-04-07" ...

The Main Event column

These are the titles for each main event in the UFC. We can see a quick count of how many are listed as Tournament, how many rematches there are (those with a Main Event that showed up more than once), and how many events are listed only once.

PayPerViewUFC %>%
  dplyr::count(main_event, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    main_event                            n
##                                 
##  1 Tournament                           11
##  2 Chuck Liddell vs Randy Couture        3
##  3 Anderson Silva vs Chael Sonnen        2
##  4 Andrei Arlovski vs Tim Sylvia         2
##  5 Chuck Liddell vs Tito Ortiz           2
##  6 Frankie Edgar vs Benson Henderson     2
##  7 Frankie Edgar vs Gray Maynard         2
##  8 George St-Pierre vs Matt Serra        2
##  9 Johny Hendricks vs Robbie Lawler      2
## 10 Jose Aldo vs Chad Mendes              2

The ‘Buy Rates’ column

This column is the pay per view buy rate, but as noted in the original post,

No need for much of an introduction here – this is a list of the sales totals for every UFC pay-per-view since Day 1. Now, since the UFC is a private company and doesn’t release sales info, all this is based on estimates, usually released by the inimitable Dave Meltzer based on info from PPV providers (and listed on the Wikipedia pages for the events – UFC 178 & UFC 179 from Tapology). And during the Dark Ages of the sport, when it was pretty much banned everywhere, no PPV info is available.

These can be formatted correctly by removing the comma (stringr::str_remove_all()) and then converting to numeric (base::as.numeric()).

PayPerViewUFC <- PayPerViewUFC %>%
  mutate(buy_rate = stringr::str_remove_all(string = buy_rate, pattern = ","),
         buy_rate = base::as.numeric(buy_rate))
PayPerViewUFC %>% dataShape()
## Observations: 219
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date       main_event                         buy_rate
##  *                                          
##  1 UFC 1   1993-11-12 Tournament                            86000
##  2 UFC 2   1994-03-11 Tournament                           300000
##  3 UFC 3   1994-09-09 Tournament                            90000
##  4 UFC 4   1994-12-16 Tournament                           120000
##  5 UFC 5   1995-04-07 Royce Gracie vs Ken Shamrock         260000
##  6 UFC 221 2018-02-11 Yoel Romero vs Luke Rockhold             NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Yana Kunitskaya           NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomedov vs Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs Raquel Pennington        NA
## 10 UFC 225 2018-06-09 Robert Whittaker vs Yoel Romero      250000

Remove Tournaments to focus on events with fighters

I’m going to filter out the Main Events that were merely listed as Tournament.

UFCFighterEvents <- PayPerViewUFC %>%
  dplyr::filter(main_event != "Tournament")
UFCFighterEvents %>% dataShape()
## Observations: 208
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date       main_event                         buy_rate
##  *                                          
##  1 UFC 5   1995-04-07 Royce Gracie vs Ken Shamrock         260000
##  2 UFC 6   1995-07-14 Ken Shamrock vs Dan Severn           240000
##  3 UFC 7   1995-09-08 Ken Shamrock vs Oleg Taktarov        190000
##  4 UFC 8   1996-02-16 Ken Shamrock vs Kimo Leopoldo        300000
##  5 UFC 9   1996-05-17 Ken Shamrock vs Dan Severn           141000
##  6 UFC 221 2018-02-11 Yoel Romero vs Luke Rockhold             NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Yana Kunitskaya           NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomedov vs Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs Raquel Pennington        NA
## 10 UFC 225 2018-06-09 Robert Whittaker vs Yoel Romero      250000

Separate the fighters into their own columns

Now I want to separate the names of both fighters in the main_event column into two different columns fighter_1 and fighter_2.

UFCFighterEvents <- UFCFighterEvents %>%
    tidyr::separate(
        col = main_event,
        into = c("fighter_1",
                 "fighter_2"),
        sep = " vs ",
        remove = FALSE)
UFCFighterEvents %>% dataShape()
## Observations: 208
## Variables: 6
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 6
##    event   date       main_event         fighter_1    fighter_2   buy_rate
##  *                                         
##  1 UFC 5   1995-04-07 Royce Gracie vs K… Royce Gracie Ken Shamro…   260000
##  2 UFC 6   1995-07-14 Ken Shamrock vs D… Ken Shamrock Dan Severn    240000
##  3 UFC 7   1995-09-08 Ken Shamrock vs O… Ken Shamrock Oleg Takta…   190000
##  4 UFC 8   1996-02-16 Ken Shamrock vs K… Ken Shamrock Kimo Leopo…   300000
##  5 UFC 9   1996-05-17 Ken Shamrock vs D… Ken Shamrock Dan Severn    141000
##  6 UFC 221 2018-02-11 Yoel Romero vs Lu… Yoel Romero  Luke Rockh…       NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Ya… Cris Cyborg  Yana Kunit…       NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomed… Khabib Nurm… Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs R… Amanda Nunes Raquel Pen…       NA
## 10 UFC 225 2018-06-09 Robert Whittaker … Robert Whit… Yoel Romero   250000

Pay-per-view purchases over time

Now I can visualize these data by plotting date on the x-axis and buy_rate on the y-axis.

UFCFighterEvents %>%
  ggplot(aes(x = date,
             y = buy_rate)) +
    geom_point(alpha = 0.5,
               size = 1.5) +
          ggplot2::labs(x = "Date",
                        y = "Pay-Per-View Buy Rate",
                        title = "UFC pay-per-view viewership") +
   ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

This shows a pretty clear rise of UFC pay-per-view purchases over time since 2001. I want to get rid of some additional zeros in the buy_rate variable by creating buy_rate_mil (which is done by dividing each number by 1,000,000).

I also want to check the top 50 UCF fights and see who the main attractions were because I’m curious about what fighters bring the most viewership (and when this tends to happen). I am going to switch to a line plot because I will be looking at fewer points and I’d like to see the trends (or changes) more clearly.

# Create buy_rate_mil --------
UFCFighterEvents <- UFCFighterEvents %>%
  dplyr::mutate(buy_rate_mil = buy_rate/1000000)
UFCFighterEventsTop50 <- UFCFighterEvents %>%
  dplyr::arrange(desc(buy_rate)) %>%
  utils::head(50)
UFCFighterEventsTop50 %>%
    ggplot(aes(x = date, y = buy_rate_mil)) +
    geom_line() +
      theme_ipsum() +
        ggplot2::labs(x = "Date",
                      y = "Pay-Per-View Buy Rate (Millions)",
                      title = "Top 50 UFC pay-per-view events") +
   ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

This graph shows the top ‘most viewed’ 50 UFC events. I added the theme_ipsum_rc() from the hrbrthemes package. Read more about it here..

This looks like I’m actually looking at three distinct peaks (one in mid 2009, the others in early and late 2016). Any idea who was fighting in these events?

Write a function to summarize the buy_rate and buy_rate_mil variables

I want some descriptive statistics on the two PPV metrics in this data set, so I am going to write a function that will give me a quick numerical summary of each variable. Learn more about writing functions in the awesome “Programming with dplyr” vignette.

numvarSum <- function(df, expr) {
  expr <- enquo(expr) # turns expr into a quosure
  summarise(df,
          n = sum((!is.na(!!expr))), # non-missing
          na = sum((is.na(!!expr))), # missing
          mean = mean(!!expr, na.rm = TRUE), # unquotes mean()
          median = median(!!expr, na.rm = TRUE), # unquotes median()
          sd = sd(!!expr, na.rm = TRUE), # unquotes sd()
          variance = var(!!expr, na.rm = TRUE), # unquotes var()
          min = min(!!expr, na.rm = TRUE), # unquotes min()
          max = max(!!expr, na.rm = TRUE), # unquotes max()
          se = sd/sqrt(n)) # standard error
}
UFCFighterEvents %>%
  numvarSum(buy_rate)
##     n na   mean median     sd  variance   min     max    se
## 1 184 24 437701 350000 317497 1.008e+11 35000 1650000 23406
UFCFighterEvents %>%
  numvarSum(buy_rate_mil)
##     n na   mean median     sd variance   min  max      se
## 1 184 24 0.4377   0.35 0.3175   0.1008 0.035 1.65 0.02341

How many events had more than 1 million views?

The line graph above shows the buying rate trends of the top 50 UFC events over time. However, I want some more details on the events themselves. I will start by creating a factor variable (buy_rate_fct) that sorts the buy_rate_mil variable into five levels. The numerical summaries above can help me make sure the new variable is coded correctly.

UFCFighterEvents <- UFCFighterEvents %>% 
 dplyr::mutate(buy_rate_fct = case_when(
buy_rate_mil < 1.00 ~ "less than 1.00 million ppvs",
buy_rate_mil >= 1.00 & buy_rate_mil < 1.25 ~ "1.00-1.25 million ppvs",
buy_rate_mil >= 1.25 & buy_rate_mil < 1.50 ~ "1.25-1.50 million ppvs",
buy_rate_mil >= 1.50 & buy_rate_mil < 1.75 ~ "1.50-1.75 million ppvs",
buy_rate_mil >= 1.75 & buy_rate_mil < 2.00 ~ "1.75-2.00 million ppvs"),
# convert to factor
buy_rate_fct = factor(buy_rate_fct,
                     levels = c("1.75-2.00 million ppvs",
                                "1.50-1.75 million ppvs",
                                "1.25-1.50 million ppvs",
                                "1.00-1.25 million ppvs",
                            "less than 1.00 million ppvs")))
# check the buy_rate_fct
knitr::kable(
UFCFighterEvents %>% dplyr::count(buy_rate_fct))
buy_rate_fct n
1.50-1.75 million ppvs 3
1.25-1.50 million ppvs 1
1.00-1.25 million ppvs 11
less than 1.00 million ppvs 169
NA 24

Now I can map the adjusted buy rate (buy_rate_mil) on the y-axis and the categorical buy rate variable (buy_rate_fct) to the color aesthetic to see more details of the distribution.

UFCFighterEvents %>%
  dplyr::filter(!is.na(buy_rate_fct)) %>%
    ggplot(aes(x = date,
               y = buy_rate_mil,
               group = buy_rate_fct)) +
    geom_point(aes(color = buy_rate_fct), size = 1.5) +
    ggplot2::xlab("Date") +
    ggplot2::ylab("UFC PPV Sales (in millions)") +
    ggplot2::ggtitle(label = "Pay-per-view UFC Events",
                     subtitle = "Every UFC pay-per-view since UFC 1") +
    ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ") +
      scale_color_ipsum() +
      scale_fill_ipsum() +
      theme_ipsum_rc()

Why not just assign a different color to every level of buy_rate_mil? Too many different colors would make it hard to track the differences in the PPV buy rate (I would need to keep re-checking the legend to figure out what I was seeing).

Which fighters attract the most viewership?

We can see from looking at the head() of the data in UFCFighterEventsTop50 that Conor McGregor is in four of the top 5 fights.

knitr::kable(
UFCFighterEventsTop50 %>%
  dplyr::select(main_event,
                fighter_1,
                fighter_2,
                buy_rate,
                buy_rate_mil) %>%
  head(5))
main_event fighter_1 fighter_2 buy_rate buy_rate_mil
Nate Diaz vs Conor McGregor Nate Diaz Conor McGregor 1650000 1.65
Brock Lesnar vs Frank Mir Brock Lesnar Frank Mir 1600000 1.60
Conor McGregor vs Nate Diaz Conor McGregor Nate Diaz 1500000 1.50
Eddie Alvarez vs Conor McGregor Eddie Alvarez Conor McGregor 1300000 1.30
Jose Aldo vs Conor McGregor Jose Aldo Conor McGregor 1200000 1.20

Tidy the fighters

I want to create a tidy data frame that gathers up the fighters in from the main_event column (now in the fighter_1 and fighter_2 columns). This means I want a single column (fighter_val) that lists all the fighters, and another column (fighter_key), that tells me whether they were fighter_1 or fighter_2. I also reorganize the data frame with some of dplyrs handy select() helper functions, and sort the data so we can see what this new arrangement looks like.

TidyUFCFighters <- UFCFighterEvents %>%
  tidyr::gather(key = fighter_key,
                value = fighter_val,
                fighter_1:fighter_2) %>%
  dplyr::select(event,
                date,
        dplyr::contains("buy"),
        dplyr::contains("fight"),
        main_event) %>%
  dplyr::arrange(event)
TidyUFCFighters %>% dataShape()
## Observations: 416
## Variables: 8
## Class(es):  data.frame
## First/last variable: event/main_event
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 8
##    event   date       buy_rate buy_rate_mil buy_rate_fct       fighter_key
##  *                                   
##  1 UFC 100 2009-07-11  1600000        1.6   1.50-1.75 million… fighter_1
##  2 UFC 100 2009-07-11  1600000        1.6   1.50-1.75 million… fighter_2
##  3 UFC 101 2009-08-08   850000        0.85  less than 1.00 mi… fighter_1
##  4 UFC 101 2009-08-08   850000        0.85  less than 1.00 mi… fighter_2
##  5 UFC 102 2009-08-29   435000        0.435 less than 1.00 mi… fighter_1
##  6 UFC 99  2009-06-13   360000        0.36  less than 1.00 mi… fighter_2
##  7 UFC Br… 1998-10-16       NA       NA                    fighter_1
##  8 UFC Br… 1998-10-16       NA       NA                    fighter_2
##  9 UFC Ja… 1997-12-21       NA       NA                    fighter_1
## 10 UFC Ja… 1997-12-21       NA       NA                    fighter_2
## # ... with 2 more variables: fighter_val , main_event 

This is exactly what I should expect–each event listed twice–once for each fighter.

Who are the top five (most occuring fighters)?

I can now use the dplyr::count() function to determine which fighter has the most main event occurrences.

knitr::kable(
TidyUFCFighters %>%
  dplyr::count(fighter_val, sort = TRUE) %>%
  head(5))
fighter_val n
Randy Couture 17
Anderson Silva 16
Tito Ortiz 15
Chuck Liddell 11
Jon Jones 11

And now I can see that when it comes to PPV purchases, Randy Couture is the most popular main event fighter in the UFC (followed by Anderson Silva and Tito Ortiz).

Which fighter draws the most PPV purchases?

Now I want to know which fighter draws the most PPV purchases in the UCF. This question is a little trickier than simply asking ‘what events have the highest PPV buy rate?’, because each event has two fighters and it isn’t always clear who is drawing the crowd. One way to get at this is by seeing how many of the most purchased events featured fighters who were also in many events.

I can do this in the following steps:

1. remove the missing PPV purchase data,

2. count the number of UFC events per fighter,

3. sort the data by the number of UFC events per fighter, then by the PPV purchase rate,

4. limit this to only the events with PPV purchase rates over 1 million.

The wrangling steps below use the same functions as above to create TopPPVFighters.

TopPPVFighters <- TidyUFCFighters %>%
  # remove missing buy rates
  filter(!is.na(buy_rate_mil)) %>%
  # count the number of occurances per fighter
  dplyr::count(fighter_val) %>%
  # rename n to fghtr_events
  dplyr::rename(fghtr_events = n) %>%
  # join back to original tidy data set
  dplyr::left_join(., TidyUFCFighters, by = "fighter_val") %>%
 # arrange by descending number of occurances and PPV buy rates
  dplyr::arrange(desc(fghtr_events, buy_rate_mil)) %>%
  # limit to events over 1 million
  dplyr::filter(buy_rate_mil > 1.00) %>%
  # only the distinct fighter_val
  dplyr::distinct(main_event, .keep_all = TRUE)
TopPPVFighters %>% dataShape()
## Observations: 14
## Variables: 9
## Class(es):  tbl_df; tbl; data.frame
## First/last variable: fighter_val/main_event
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 9
##    fighter_val    fghtr_events event   date       buy_rate buy_rate_mil
##                                         
##  1 Anderson Silva           16 UFC 168 2013-12-28  1025000         1.02
##  2 Randy Couture            14 UFC 91  2008-11-15  1010000         1.01
##  3 Chuck Liddell            11 UFC 66  2006-12-30  1050000         1.05
##  4 Rashad Evans              9 UFC 114 2010-05-29  1050000         1.05
##  5 Jose Aldo                 7 UFC 194 2015-12-12  1200000         1.2
##  6 Brock Lesnar              5 UFC 116 2010-07-03  1160000         1.16
##  7 Conor McGregor            5 UFC 196 2016-03-05  1500000         1.5
##  8 Conor McGregor            5 UFC 202 2016-08-20  1650000         1.65
##  9 Conor McGregor            5 UFC 205 2016-11-12  1300000         1.3
## 10 Amanda Nunes              3 UFC 200 2016-07-09  1200000         1.2
## # ... with 3 more variables: buy_rate_fct , fighter_key ,
## #   main_event 

Now I can create a plot that uses the TopPPVFighters data frame, and maps the name of the fighters to the point on the graph. But before I do this, I want to make a few adjustments to the fivethirtyeight theme from ggthemes. As you can see from the code below, I make a few minor changes to display the axis titles and adjust the fonts. Read more about theme_foundation() and ggthemes.

# check the colors for fivethirtyeight theme
# ggthemes_data$fivethirtyeight
 # dkgray medgray ltgray red blue green 
# "#3C3C3C" "#D2D2D2" "#F0F0F0" "#FF2700" "#008FD5" "#77AB43"
theme_fivethirtyeightv1.1 <- function(base_size = 11, base_family = "sans") {
 (theme_foundation(base_size = base_size, base_family = base_family) + 
     theme(line = element_line(color = "black"), 
             rect = element_rect(
                 fill = "#F0F0F0", 
                 linetype = 0, 
                 color = NA), 
       text = element_text(color = "#3C3C3C"), 
       axis.text = element_text(), 
       axis.ticks = element_blank(),
       axis.line = element_blank(),
       legend.background = element_rect(
                fill = "#F0F0F0",
                color = "#D2D2D2",
                size = 1), 
       legend.position = "bottom",
       legend.direction = "horizontal",
         panel.grid = element_line(color = NULL), 
         panel.grid.major = element_line(color = "#D2D2D2"), 
         panel.grid.minor = element_blank(), 
         plot.title = element_text(family = "mono",
                     hjust = 0, 
                     size = rel(1.5), 
                     face = "bold"), 
       plot.margin = unit(c(1, 1, 1, 1), "lines"), 
       strip.background = element_rect()))
}
TopPPVFighters %>% 
 ggplot(aes(x = date, 
            y = buy_rate_mil, 
            label = fighter_val)) + 
 ggplot2::geom_point(aes(size = fghtr_events,
                         color = fighter_val),
                         alpha = 0.7,
                         show.legend = TRUE) +
 ggplot2::guides(color = FALSE) +
 ggplot2::scale_size_continuous(name = "number of events",
                               breaks = c(3, 6, 9, 12, 15),
                               labels = c("3", "6", "9", 
                                        "12", "15")) +
 ggrepel::geom_text_repel(direction = "both",
                          hjust = 0.5,
                          vjust = 0.5,
                          segment.size = 0.5,
                          color = "black",
                          size = 2.5) +
 ggthemes::scale_color_calc() +
 theme_fivethirtyeightv1.1() +
            ggplot2::xlab("Date") +
            ggplot2::ylab("UFC PPV Sales (in millions)") +
            ggplot2::ggtitle(label = "UFC Events by Number of Appearances", 
            subtitle = "larger point = more appearances") + 
            ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

The thing to notice about this graph is that Conor McGregor and Brock Lesnar are the only two fighters above the 1.5 million PPV purchase mark. It remains unknown if McGregor will return to the UFC, so it’s important to consider how many of the big names in UFC are retired or have moved onto different franchises:

After looking at PPV purchase numbers and the recent exodus of talent (voluntary or otherwise), it makes sense that Saturday’s fight felt more like a staged drama event than a typical bout. The UFC has to compete with an increasing number of fighting organizations (Viacom’s Bellator, and Rizin Fighting Federation–formerly Pride–out of Tokyo, and nothing attracts fight fans like a narrative.

In the next post I will look into the MMA attendance data to see if the same big names also sell the most tickets.

devtools::session_info()
##  setting  value
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  tz       America/Los_Angeles
##  date     2018-07-09

Scraping wikipedia tables

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting it’s contents into a regular expression, then using that regular expression to match strings in a different table.

An EHR medications table

I was able to find an example medications table from data.world. The file is titled, hoyt/Medications/Prescriptions in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends 
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)

Load the EHR data below:

EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")  
EHRMedTable %>% glimpse(78)

Find a list of antibiotic medications on wikipedia

I googled “list of antibiotics” and found this wikipedia page. I’m only interested in the column titled, “Generic Name”, but I will download the entire table.

list_of_antibiotics

The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of wiki_html object is below.

wiki_html %>% str()
List of 2
$ node:
$ doc :
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html node, so I will use a function from the rvest package. The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables % rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
List of 3
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

relevant_tables % str()
 List of 2
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
$ :List of 2
..$ node:
..$ doc :
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket sub-setting to explore this object and learn about it’s contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60)

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name "Aminoglycosides", "…
$ Brand names "Aminoglycosides", "…
$ Common uses[3] "Aminoglycosides", "…
$ Possible side effects[3] "Aminoglycosides", "…
$ Mechanism of action "Aminoglycosides", "…

Looks like this is the right table! I will assign it to a data frame

titled, “WikiAntiB” and check the base::names().

WikiAntiB % names()
 [1] "Generic name"             "Brand names"
[3] "Common uses[3]" "Possible side effects[3]"
[5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I’m wanting to split the Generic name column and take the first word (antibiotic) in the table.

But before I do that, I am going to give both tables snake_case variable names and reduce the EHRMedTable table to only id and med and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>% 
dplyr::select(
generic_name = Generic name,
brand_name = Brand names,
common_uses = Common uses[3],
poss_side_effects = Possible side effects[3],
mech_action = Mechanism of action)
WikiAntiB %>% dplyr::glimpse(60)

Observations: 176
Variables: 5
$ generic_name "Aminoglycosides", "Amikacin", …
$ brand_name "Aminoglycosides", "Amikin", "G…
$ common_uses "Aminoglycosides", "Infections …
$ poss_side_effects "Aminoglycosides", "Hearing los…
$ mech_action "Aminoglycosides", "Binding to …

Clean ERHMeds.

EHRMeds <- EHRMedTable %>%
dplyr::select(
id,
med = rxddrug)
remove missing
EHRMeds <- EHRMeds %>% filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)

Observations: 12,957
Variables: 2
$ id 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,…
$ med "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION…

Prepping strings to be used as regex

The information generic_name isn’t quite ready to do regex pattern match on. For example, the first five lines in the med column in EHRMeds look like this:

EHRMeds$med %>% dplyr::glimpse(80) 
chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" …

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds %     dplyr::mutate(med = stringr::str_to_lower(med)) EHRMeds$med %>% dplyr::glimpse(80) 

chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" …

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.

WikiAntiBGenName <- WikiAntiB %>% 
dplyr::select(generic_name) %>%
# select this column to split
tidyr::separate(col = generic_name,
# put medication in gen_name
into = c("gen_name", "etc"),
# separate them on the whitespace
sep = " ",
# but keep the original variable
remove = FALSE) %>%
# then take new gen_name (first med)
dplyr::select(gen_name) %>%
# get the distinct values
dplyr::distinct(gen_name) %>%
dplyr::mutate(gen_name =
# convert to lowercase
str_to_lower(gen_name),
# remove (bs)
gen_name = str_remove_all(gen_name,
pattern = "\(bs\)"),
# replace "/" w/ underscore
gen_name = str_replace_all(gen_name,
pattern = "\/",
replacement = "_"),
# replace "-" w/ underscore
gen_name = str_replace_all(gen_name,
pattern = "-",
replacement = "_")) %>%
# split the new column again, this time into 2 gen_names
# on the underscore we put there ^
tidyr::separate(col = gen_name,
into = c("gen_name1", "gen_name2"),
sep = "_",
remove = FALSE) %>%
# now get all gen_name meds into one column
tidyr::gather(key = gen_name_key,
value = gen_name_value,
gen_name1:gen_name2) %>%
# remove missing
dplyr::filter(!is.na(gen_name_value)) %>%
# and rename
dplyr::select(generic_name = gen_name_value)

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60) 

Observations: 161
Variables: 1
$ generic_name "aminoglycosides", "amikacin", "ge…

Check out the unique values of generic_name with base::unique(), utils::head(), and base::writeLines() because these cause R to print the output to the RStudio Notebooks in a useful way.

WikiAntiBGenName$generic_name %>%
base::unique() %>%
base::writeLines()
unique_generic_name_writeLines

Note the entry between cefazolin and cefalexin is empty–remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
dplyr::filter(generic_name != "")

Now I can put this into a vector so it can be converted into a regular expression. The stringrr::str_c() function and regular expression symbols (+, ? , |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRmeds in this case), and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect().

check rows so I know I'm not fooling myself
EHRMeds %>%
dplyr::distinct(med) %>%
base::nrow() # [1] 701
EEHRMeds %>%
dplyr::filter(stringr::str_detect(string = med,
pattern = antibiotic_med)) %>%
dplyr::distinct(med) %>%
base::nrow() # [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
dplyr::filter(stringr::str_detect(med,
antibiotic_med)) %>%
dplyr::select(id,
antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>% base::unique() 

[1] "rifaximin"
[2] "amoxicillin"
[3] "hydrocortisone; neomycin; polymyxin b otic"
[4] "trimethoprim"
[5] "cefdinir"
[6] "clindamycin"
[7] "azithromycin"
[8] "sulfamethoxazole"
[9] "ethambutol"
[10] "pyrazinamide"
[11] "minocycline"
[12] "sulfamethoxazole; trimethoprim"
[13] "cefixime"
[14] "polymyxin b; trimethoprim ophthalmic"
[15] "dexamethasone; tobramycin ophthalmic"
[16] "cefuroxime"
[17] "doxycycline"
[18] "amoxicillin; clavulanate"
[19] "erythromycin topical"
[20] "ciprofloxacin"
[21] "vancomycin"
[22] "penicillin v potassium"
[23] "silver sulfadiazine topical"
[24] "penicillin"
[25] "moxifloxacin ophthalmic"
[26] "gatifloxacin ophthalmic"
[27] "metronidazole"
[28] "ciprofloxacin; dexamethasone otic"
[29] "erythromycin ophthalmic"
[30] "gentamicin ophthalmic"
[31] "azithromycin ophthalmic"
[32] "tetracycline"
[33] "ofloxacin ophthalmic"
[34] "ciprofloxacin ophthalmic"
[35] "dexamethasone; neomycin; polymyxin b ophthalmic"
[36] "chloramphenicol"
[37] "mupirocin topical"
[38] "isoniazid"
[39] "levofloxacin"
[40] "nitrofurantoin"
[41] "moxifloxacin"
[42] "benzoyl peroxide; clindamycin topical"
[43] "sulfacetamide sodium ophthalmic"
[44] "neomycin; polymyxin b sulfate topical"
[45] "sulfasalazine"
[46] "metronidazole topical"
[47] "clarithromycin"
[48] "cefprozil"
[49] "clindamycin topical"
[50] "polymyxin b sulfate"
[51] "ofloxacin otic"
[52] "tobramycin ophthalmic"
[53] "dapsone topical"

Join tables back together

If I want to join the antibiotic medication used by the patient (identified with id) I can join this back to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
Observations: 18,704
Variables: 9
$ id 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ seqn 62161, 62161, 62162, 62163, 62164, 62…
$ rxduse 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ rxddrug "FLUOXETINE", "METHYLPHENIDATE", NA, …
$ rxddrgid "d00236", "d00900", NA, NA, "d00181",…
$ rxqseen 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N…
$ rxddays 5840, 5840, NA, NA, 9125, NA, 547, NA…
$ rxdcount 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N…
$ antib_med NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

And now I can see the counts of the top ten antibiotic medications in EHRMedTable

EHRMedTable %>%
dplyr::filter(!is.na(antib_med)) %>%
dplyr::count(antib_med) %>%
dplyr::distinct(id, .keep_all = TRUE) %>%
dplyr::arrange(desc(n)) %>%
utils::head(10)
# A tibble: 10 x 2
antib_med n

1 amoxicillin 126
2 azithromycin 40
3 sulfamethoxazole; trimethoprim 33
4 doxycycline 26
5 amoxicillin; clavulanate 23
6 minocycline 16
7 cefdinir 14
8 ciprofloxacin 13
9 penicillin v potassium 13
10 nitrofurantoin 11

BONUS! Using stringr::str_detect() within a dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top ten table print-out above, you might want a variable that indicates there are more than 1 medications listed in the antib_med column. Well fortunately dplyr::case_when() works well with stingr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

test -----
EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_)) %>%
dplyr::count(antib_2meds, antib_med) %>%
tidyr::spread(antib_2meds, n) %>%
head(10)
# A tibble: 10 x 4
antib_med 1 antibiotic med.2 antibiotic me… <NA>

1 amoxicillin 126 NA NA
2 amoxicillin; clavulanate NA 23 NA
3 azithromycin 40 NA NA
4 azithromycin ophthalmic 2 NA NA
5 benzoyl peroxide; clindamy… NA 4 NA
6 cefdinir 14 NA NA
7 cefixime 2 NA NA
8 cefprozil 1 NA NA
9 cefuroxime 3 NA NA
10 chloramphenicol 1 NA NA
assign -----
EHRMedTable <- EHRMedTable %>%
dplyr::mutate(antib_2meds = dplyr::case_when(
stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
!stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could be used if you had lists stored in other files, too (Excel files, googlesheets, etc.)

Be sure to export the data files!

if (!file.exists("Data/processed_data")) {
dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
"EHRMedTable",
as.character(Sys.Date()),
".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text ” Teaching Statistics: A Bag of Tricks ” from Andrew Gelman and Deborah Nolan.

The following quote is from a section titled, “learn how to learn new technologies”,

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added efficiently because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here

Getting Started in RStudio Notebooks

This is the first draft of a post that was featured on storybench.org. Please check out that version if you run into issues here.

 

R is a powerful statistical programming language for manipulating, graphing, and modeling data. One of the major positive aspects of R is that it’s open-source (free!). But “free” does not mean “easy.” Learning to program in R is time-consuming (and occasionally frustrating), but fortunately, the number of helpful tools and packages is constantly increasing.

Enter RStudio

RStudio is an integrated development environment (IDE) that streamlines the R programming workflow into an easy to read layout. RStudio also includes useful tools (referred to as packages) for data manipulation, cleaning, restructuring, visualizations, report writing, and publishing to the web.

Just like R, it’s free. RStudio recently released R Markdown Notebooks, a nice integration of code, plain text, and results that can be exported into PDF, .docx, or HTML format.


Getting started

Start out by installing R and RStudio (you’ll need the preview version found here)

*If you need help installing R or RStudio, feel free to use this Google doc installation guide.

The IDE environment has four panes (seen below),
RStudio_setup

As you can see from the image above, the upper left pane (where I’m writing this tutorial) is the editor. The pane to the right (where it says “Environment is empty“), will show the working dataset. The lower left pane is called the console, which runs the R code. And the pane in the bottom right will display my results.

Opening a New R Notebook

To get started, click on “File” > “New File” > “R Notebook”. R Notebooks automatically start off with a title and some sample code. To see how the analysis is weaved into the Html click on the small “play” button:

play button

Save the file (“File” > “Save”) and then click on “Preview” at the top of the pane.

preview button.png

I don’t want to spoil the suspense, so I won’t put a screenshot of what you’ll see. Just know that R Notebooks does a great job of combining markdown text, R code, and results in a clean, crisp, easy-to-share finished product.

R syntax – numbers & text

You can use RStudio as a simple calculator. Type 2 + 2 directly into the console and press enter. You should see this:

2 + 2
[1] 4

 

You’re probably hoping to use RStudio for something slightly more advanced than simple arithmetic. Fortunately, R can calculate and store multiple values in variables to reference later. This is done with the <- assignment operator:

x <- 2 + 2

The <- is similar to the = sign. In fact, the = sign does the same thing, but the typical convention in R is the <-. To see the contents of x , enter it into the console and press enter.

x
[1] 4

You can also perform mathematical operations with variables. Store 4 + 4 in a variable called y and add it the variable x

y <- 4 + 4
y + x
[1] 12

R identifies numbers and text, or “string” characters. Text can also be stored into variables using the <- symbol and quotations.

a <- "human"
b <- "error"

Text strings are stored differently than numerical data in R. The commands used to manipulate strings are also slightly different.

If you want to combine two strings, use the paste function

paste(a,b)
[1] "human error"

Objects & Data Structures in R

R is an object oriented programming language, which means it recognizes objects according to their structure and type. The most common objects in R are atomic vectors and lists.

Atomic Vectors 1.1numerical & integer vectors

Numeric vectors (also called double) include “real” numbers with decimal places, while integers are whole numbers. To create numerical vectors, use the command c() which stands for concatenating (a fancy term for combining).

Below is an example of a numeric vector of odd numbers less than 10:

odd_vect <- c(1.3, 3.3, 5.5, 7.7, 9.9)

This statement is saying, “combine these five numbers into a vector and call it odd_vect

If I wanted to create an integer (or whole number) vector, I need to follow each number with an L

The assignment operator also works in the other direction–something I didn’t learn until recently. Use it to create another numerical vector named even_vect of even integers less than or equal to 10.

c(2L, 4L, 6L, 8L, 10L) -> even_vect

The c() function works for combining separate numerical vectors, too.  Add these two variables together into a new vector called ten_vect and print the contents:

ten_vect <- c(odd_vect, even_vect)

ten_vect

[1] 1.3 3.5 5.1 7.7 9.1 2.0 4.0 6.0 8.0 10.0

The final numeric vector (ten_vect) has combined both the odd and even values into a single vector.

Atomic vectors 1.2 – logical & character vectors

Logical vectors return two possible values, TRUE or FALSE. We can use logic to interrogate vectors in order to discover their type.

For example, we can use is.numeric to figure out if the ten_vect vector we created ended up being numeric or integer.

is.numeric(ten_vect)

[1] TRUE

Why did the combination of a numerical and integer vector end up being numeric? This is referred to as coercion. When a less flexible data type (numeric) is combined with a more flexible data type (integer), the more flexible element is coerced into the less flexible type.

Atomic vector 1.3 – Character vectors

In R, character vectors contain text strings. We can use character vectors to construct a sentence using a combination of c() and <- functions.
We will start with a preposition
prep_vect <- c("In")

then include a noun

noun_vect <- c("the Brothers Karamazov,")

throw in a subject,

sub_vect <- c("Dmitri")

sprinkle in a verb,

verb_vect <- c("kills")

and finish with an object

obj_vect <- c("his father")

Sentence construction can be a great way to learn how vector objects are structured in R. Atomic vectors are always flat, so you can nest them all…

sent_vect <- c("In",c("the Brothers Karamazov,",c("Dmitri",c("kills",c("his father")))))

sent_vect

[1] "In"                      "the Brothers Karamazov," "Dmitri"                 
[4] "kills"                   "his father"

Or enter them directly:
c("In","the Brothers Karamazov", "Dmitri", "kills", "his father"

[1] "In"                      "the Brothers Karamazov," "Dmitri"                 
[4] "kills"                   "his father"

Both return the same result.

Finally, we can combine each part of the sentence together using paste:

sent_vect <- paste(prep_vect, noun_vect, sub_vect, verb_vect, obj_vect)

[1] "In the Brothers Karamazov, Dmitri kills his father"

Lists

Unlike vectors–which only contain elements of a single type–lists can contain elements of different types.

We will create a list that includes an integer vector (even_vect) a logical vector (TRUE,FALSE), a full sentence ( sent_vect ), a numerical vector (odd_vect), and we will call it, my_list

my_list <- list(even_vect, c(TRUE, FALSE), c(sent_vect), c(odd_vect))

We will look at the structure of our list using str

str(my_list)

List of 4
 $ : int [1:5] 2 4 6 8 10
 $ : logi [1:2] TRUE FALSE
 $ : chr "In the Brothers Karamazov, Dmitri kills his father"
 $ : num [1:5] 1.3 3.3 5.5 7.7 9.9

Lists are recursive–they can contain other lists.

lists_on <- list(list(list(list())))

str(lists_on)

List of 1
 $ :List of 1
 ..$ :List of 1
 .. ..$ : list()

This feature separates Lists from the Atomic vectors described above.

So there you have it! This how-to should give you some basics in R programming. You can save it as HTML, pdf, or Docx file for future reference.