Migrating from a Microsoft Access Database to MySQL with MS ACCDB Viewer

Motivation

The new dbplyr is out on CRAN today, and I’ve wanted to update my MySQL Lahman baseball database with 2017 data, but it’s only available as .csv files and an MS Access database. I could import the .csv files, but I thought it would be useful to show how I convert an MS Access database into MySQL using MS ACCDB Viewer. This application has a free trial (the paid version is under $20). I rarely recommend purchasing software, but if you use macOS and find yourself needing to get into an MS Access database, this tool comes in handy.

Getting started

This post assumes:

  1. You are running macOS (this tutorial was built using version 10.14.2)
  2. You have MySQL installed on your machine (see previous post to get everything set up)
  3. You have RStudio installed (using version 1.2.1206)
  4. You have R installed (using version 3.5.1)

Packages

These are the packages you’ll need.

# database connection packages
library(RMySQL)
library(RMariaDB)
# data manipulation packages
library(dplyr)
library(tidyverse)
library(Lahman)
library(magrittr)
# RStudio tools (password prompts)
library(rstudioapi)

Project folder setup

I created an RStudio project file inside a mysql-mdb-accdb folder (mysql-mdb-accdb.Rproj). If you don’t know how RStudio’s project files work, see this post for more information.

Inside this project folder, I created another sub-folder titled databases. I wanted to separate these files by type, so I created a new folder for each version of the 2017 data.

databases/csv
databases/ms-access
databases/sql
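
If you’d rather script this folder setup, here is a minimal sketch in base R (assuming the working directory is the project root):

# create the databases sub-folders (paths relative to the project root)
db_dirs <- c("databases/csv", "databases/ms-access", "databases/sql")
for (d in db_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}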

After downloading the data, the full folder tree is available below.

databases
├── csv
│   └── baseballdatabank-master
│       ├── README.txt
│       ├── core
│       │   ├── AllstarFull.csv
│       │   ├── Appearances.csv
│       │   ├── AwardsManagers.csv
│       │   ├── AwardsPlayers.csv
│       │   ├── AwardsShareManagers.csv
│       │   ├── AwardsSharePlayers.csv
│       │   ├── Batting.csv
│       │   ├── BattingPost.csv
│       │   ├── CollegePlaying.csv
│       │   ├── Fielding.csv
│       │   ├── FieldingOF.csv
│       │   ├── FieldingOFsplit.csv
│       │   ├── FieldingPost.csv
│       │   ├── HallOfFame.csv
│       │   ├── HomeGames.csv
│       │   ├── Managers.csv
│       │   ├── ManagersHalf.csv
│       │   ├── Parks.csv
│       │   ├── People.csv
│       │   ├── Pitching.csv
│       │   ├── PitchingPost.csv
│       │   ├── Salaries.csv
│       │   ├── Schools.csv
│       │   ├── SeriesPost.csv
│       │   ├── Teams.csv
│       │   ├── TeamsFranchises.csv
│       │   ├── TeamsHalf.csv
│       │   └── readme2014.txt
│       └── upstream
│           └── Teams.csv
└── ms-access
    └── lahman2017
        ├── lahman2017.mdb
        └── readme2017.txt

MS Access to MySQL using MS ACCDB Viewer

  1. Download the MS ACCDB Viewer app and the Lahman 2017 MS Access database and .csv files. I’ve created the sub-folders above for the two types of Lahman database files.
  2. Open the MS Access database in the MS ACCDB Viewer app and follow the steps below.
  3. After the database exports, you will see a new lahman2017.sql file in your databases/sql folder. This file is too large to open in RStudio (it’s 46.4 MB), so I’ll open it in a text editor (I use Sublime Text, but many others will work).

What should I expect to see?

If you haven’t done a lot of work in SQL, you will have no way of knowing whether this SQL export is ready to be loaded into MySQL (I didn’t know what to expect until I did a lot of reading and trial and error). However, I can look at the 2016 MySQL database from the Lahman website to see what to expect. Below are the first 50 lines of each SQL file (2016 on the left, 2017 on the right).

The important differences in the first ~45 lines:

  1. There is no command for creating the lahman2017 database (i.e. CREATE DATABASE)
  2. After creating the database, there is no command to use the lahman2017 database
  3. The SET NAMES argument should be changed to SET NAMES utf8mb4 (this will include characters like emojis)
  4. All the tables and data have clear headers in the previous .sql file, and it’s worthwhile to add these to this version too (though it’s not required)

To address these, I added the following header to the top of my lahman2017.sql file:
/*

This MySQL dump came from the Lahman 2017 mdb file available here:
http://www.seanlahman.com/baseball-archive/statistics/

The following was added:
  - A CREATE DATABASE statement
  - A USE database statement
  - Changed SET NAMES 'UTF8'; to SET NAMES utf8mb4;
  - Added the 'Table structure' and 'Records' titles (optional)

*/
CREATE DATABASE IF NOT EXISTS lahman2017;
USE lahman2017;
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

The first 50 lines of my lahman2017.sql file now look like this:

RStudio’s Terminal pane

In previous posts, I mention opening and using Terminal. RStudio also has its own Terminal pane (see below).

This is accessible under Tools > Terminal > New Terminal.

In a new Terminal pane, enter the following commands.

$ cd databases/sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH

This returns what looks like a long and convoluted list of file paths (this is normal).

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/….

Now I enter the following:

mysql -u root -p 

Enter your password and you should see the mysql> prompt. I want to check the databases in MySQL before adding the new lahman2017 data, and I can do this with SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| lahman2016         |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

Because I moved into the sql folder, I can execute the lahman2017.sql file by entering source lahman2017.sql into the MySQL prompt.

mysql> source lahman2017.sql

This will return a long output telling you things like Query OK and Records: 11370 Duplicates: 0 Warnings: 0. When it’s finished, check the new tables with SHOW TABLES;

You should see the 27 tables in the new database (AllstarFull through TeamsHalf).

Great! Now I have the new database set up in MySQL. Next, I’ll connect to it from RStudio and start querying.

Changing MySQL settings

Before connecting to MySQL from RStudio, you might need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption
  3. Then click Start MySQL Server

The dplyr::src_mysql() function

In the previous post, I set up my connection to RStudio using the DBI package.

This time I will set up my connection with the dplyr package (part of the tidyverse). The function that does this is dplyr::src_mysql(), and I include arguments for dbname, host, user, and rstudioapi::askForPassword("Database password"), which prompts a fill-in box for my password.

LahmandplyrMySQL <- dplyr::src_mysql(
  dbname = "lahman2017",
  host = "localhost",
  user = "root",
  password = rstudioapi::askForPassword("Database password")
)

Print the LahmandplyrMySQL object to the console.

LahmandplyrMySQL
src: mysql 8.0.13 [root@localhost:/lahman2017]
tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers,
AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost,
CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost,
HallOfFame, HomeGames, Managers, ManagersHalf, Parks, People, Pitching,
PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises,
TeamsHalf

As you can see, this prints out much more information than the DBI connection in my previous post. The LahmandplyrMySQL object tells me the source of the connection and the tables in the database.
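
If you only need the table names as a character vector, dplyr’s src_tbls() function works on this connection, too:

# list the tables in the lahman2017 database
dplyr::src_tbls(LahmandplyrMySQL)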

ALWAYS read the README file (readme2017.txt)

The directions are right there in the title. Important stuff is in these files (like changes to the tables in the database). This section is an example of such information:

One notable change: The name of the table that contains biographical information for players has been changed from “Master” to “People” to better reflect its contents.

This information saved me from some potential headaches (my previous queries on the Master table would now error).
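
For example, pointing dplyr::tbl() at the old table name now fails, while the new one works; a quick sketch:

# the old Master table no longer exists in lahman2017
# dplyr::tbl(LahmandplyrMySQL, "Master") # errors: the table doesn't exist
dplyr::tbl(LahmandplyrMySQL, "People")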

In the next section, I’ll go over the new changes in dbplyr using this connection. The dplyr::tbl() function takes a src argument, which is the connection to the database (LahmandplyrMySQL). The next argument is sql(), which takes a character string (inside " ") of SQL commands. I like to capitalize the MySQL syntax, refer to tables in SQL with TitleCase, and keep the columns in lowercase.

API changes

1) Calls of the form dplyr::foo() are now evaluated in the database, rather than locally.

I am assuming this means that dplyr functions used in a query are now evaluated in the database, rather than locally in R. For example:

I also pass everything to dplyr::glimpse(78) so it prints nicely in the window.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view object
  dplyr::glimpse(78)

Observations: ??
Variables: 24
Database: mysql 8.0.13 [root@localhost:/lahman2017]
$ playerID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…
$ birthYear 1964, 1987, 1943, 1963, 1971, 1942, 1981, 1934, 1939, …
$ birthMonth 10, 11, 7, 9, 12, 8, 12, 2, 8, 9, 2, 8, 6, 9, 7, 1, 9,…
$ birthDay 26, 13, 15, 21, 28, 9, 27, 5, 5, 8, 16, 17, 2, 15, 15,…
$ birthCountry "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ birthState "IL", "NY", "CA", "CA", "HI", "AL", "CO", "AL", "AL", …
$ birthCity "Chicago", "Staten Island", "Long Beach", "Escondido",…
$ deathYear NA, NA, 2016, NA, NA, 2001, NA, NA, 1984, NA, NA, NA, …
$ deathMonth NA, NA, 2, NA, NA, 1, NA, NA, 8, NA, NA, NA, NA, NA, N…
$ deathDay NA, NA, 15, NA, NA, 22, NA, NA, 16, NA, NA, NA, NA, NA…
$ deathCountry NA, NA, "USA", NA, NA, "USA", NA, NA, "USA", NA, NA, N…
$ deathState NA, NA, "CA", NA, NA, "NY", NA, NA, "GA", NA, NA, NA, …
$ deathCity NA, NA, "Long Beach", NA, NA, "New York", NA, NA, "Atl…
$ nameFirst "Steve", "Tim", "Dave", "Troy", "Benny", "Tommie", "Da…
$ nameLast "Adkins", "Adleman", "Adlesh", "Afenir", "Agbayani", "…
$ nameGiven "Steven Thomas", "Timothy Max", "David George", "Micha…
$ weight 210, 225, 187, 185, 225, 195, 215, 180, 190, 190, 200,…
$ height 78, 77, 72, 76, 72, 71, 75, 72, 75, 75, 78, 74, 71, 75…
$ bats "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ throws "L", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ debut "1990-09-12", "2016-05-01", "1963-05-12", "1987-09-14"…
$ finalGame "1990-10-03", "2017-09-19", "1968-09-28", "1992-07-10"…
$ retroID "adkis001", "adlet001", "adled101", "afent001", "agbab…
$ bbrefID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…

2) The vars argument to tbl_sql() has been formally deprecated.

The vars argument was used to supply a list of columns in the tbl. The previous documentation is below:

If known, the names of the variables in the tbl. This is relatively expensive to determine automatically, so is cached throughout dplyr. However, you should usually be able to leave this blank and it will be determined from the context.

3) src and tbl objects now include a class generated from the class of the underlying connection object. This makes it possible for dplyr backends to implement different behaviour at the dplyr level, when needed.

I assume this is visible in the output of str() and class() for the previous query.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view structure
  utils::str()


List of 2
$ src:List of 2
..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
.. .. ..@ Id: int [1:2] 0 0
..$ disco:<environment>
..- attr(*, "class")= chr [1:4] "src_MySQLConnection" "src_dbi" "src_sql" "src"
$ ops:List of 4
..$ name: chr "filter"
..$ x :List of 2
.. ..$ x : 'sql' chr "SELECT * FROM People"
.. ..$ vars: chr [1:24] "playerID" "birthYear" "birthMonth" "birthDay" …
.. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
..$ dots:List of 1
.. ..$ : language ~bats == "R" & birthYear >= 1900
.. .. ..- attr(*, ".Environment")=<environment>
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
- attr(*, "class")= chr [1:5] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

The class

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view the class
  base::class()

[1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
[4] "tbl_lazy" "tbl"

SQL translation

I will skip over “The new as.integer64(x) is translated to CAST(x AS BIGINT)”.

cummean() now translates to SQL AVG() as opposed to MEAN().

cummean() is an aggregate window function. It works by keeping track of 1) the running sum of the values, and 2) the running count of values (the i-th position). There is a great example of how this works on StackOverflow here.

I created an example data frame below to show how this works.

# get a sample
x <- base::sample(1:10)
# create a data frame of the intermediate steps and cummean
base::data.frame(sample = x,
                 cumsum_x = cumsum(x),
                 seq_along_x = seq_along(x),
                 cummean = cumsum(x) / seq_along(x))
   sample cumsum_x seq_along_x  cummean
1       5        5           1 5.000000
2       3        8           2 4.000000
3       4       12           3 4.000000
4       2       14           4 3.500000
5       1       15           5 3.000000
6       7       22           6 3.666667
7       8       30           7 4.285714
8       9       39           8 4.875000
9      10       49           9 5.444444
10      6       55          10 5.500000

The cumsum_x is a running total of sample (5 + 3 = 8, 5 + 3 + 4 = 12, etc.), and seq_along_x is the count of the values. When I get the ratio of cumsum_x / seq_along_x, I get the cummean column (5 / 1 = 5, 8 / 2 = 4.0, etc.)
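
dplyr’s cummean() function wraps this same calculation, which you can confirm on the sample above:

# cummean() matches the running sum divided by the running count
all.equal(dplyr::cummean(x), cumsum(x) / seq_along(x))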

This is the cumulative mean of RBIs in the 2017 Batting data table.

# get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::mutate(cummean_rbi = cummean(RBI)) %>%
  dplyr::select(playerID, cummean_rbi) %>%
  utils::head(5)

# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(RBI)
  playerID  cummean_rbi
  <chr>           <dbl>
1 wilsoha01        191 
2 gehrilo01        188.
3 greenha01        186 
4 foxxji01         183.
5 gehrilo01        182.
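
To see the translation this change refers to, you can ask dplyr to render the SQL instead of executing it (the exact output depends on your dbplyr version):

# show the generated SQL for the cummean() query instead of running it
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::mutate(cummean_rbi = cummean(RBI)) %>%
  dplyr::show_query()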

Compare this to the mean RBI per player.

# get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::group_by(playerID) %>%
  dplyr::summarize(mean_rbi = mean(RBI, na.rm = TRUE)) %>%
  dplyr::select(playerID, mean_rbi) %>%
  dplyr::arrange(desc(mean_rbi)) %>%
  utils::head(5)
# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(mean_rbi)
  playerID  mean_rbi
  <chr>        <dbl>
1 dimagjo01     118.
2 gehrilo01     117.
3 pujolal01     113.
4 cabremi01     108.
5 belleal01     103.

Just for kicks, the AVG() function in MySQL is implemented like this:

mysql> SELECT 
playerID,
AVG(RBI) AS avg_rbi
FROM
Batting
GROUP BY
playerID
ORDER BY
avg_rbi DESC
LIMIT 5;
+-----------+----------+
| playerID | avg_rbi |
+-----------+----------+
| dimagjo01 | 118.2308 |
| gehrilo01 | 117.3529 |
| pujolal01 | 112.8235 |
| cabremi01 | 107.5333 |
| belleal01 | 103.2500 |
+-----------+----------+
5 rows in set (0.07 sec)

Closing

Moving from one relational database management system to another can slow down your workflow, so having (inexpensive) tools to quickly transform a database into something you can work with is a time-saver. I don’t like leaving the RStudio IDE (unless it’s to my browser), so the quicker I can get my data source into this environment, the quicker I can get to work on it.

Next post will probably be on PostgreSQL.

Setting up PostgreSQL on macOS

This tutorial was written in RMarkdown in RStudio (version 1.2.1206, R version 3.5.1), using mac OS High Sierra (10.13.6).

Why install PostgreSQL?

PostgreSQL is an open-source relational database management system (RDBMS). You can install PostgreSQL on Linux, Unix, Windows, or Mac. Users can write and execute stored procedures and functions from other programming languages (like R and Python). Multiple data types can also be stored (and even defined) in PostgreSQL, including JSON and other nonrelational data.

Read more about PostgreSQL here.

PostgreSQL application

Download the application from this website and follow the installation instructions.

The syntax documentation for PostgreSQL is here.

First steps

After downloading and installing the PostgreSQL application, open a new Terminal window. In the Terminal, enter the following:

$ sudo mkdir -p /etc/paths.d &&
echo /Applications/Postgres.app/Contents/Versions/latest/bin | sudo tee /etc/paths.d/postgresapp

You’ll be asked to enter the password you use to install programs on your computer. Type it in and hit enter.

Install PostgreSQL using Homebrew

You can also install PostgreSQL using homebrew. If you don’t have homebrew installed, learn how to do so here.

In a fresh terminal window, enter the following:

$ brew install postgresql
==> Installing postgresql
==> Downloading https://homebrew.bintray.com/bottles/postgresql-11.1.mojave.bottle.tar.gz
################################################################## 100.0%
==> Pouring postgresql-11.1.mojave.bottle.tar.gz
==> Caveats
To migrate existing data from a previous major version of PostgreSQL run:
brew postgresql-upgrade-database
To have launchd start postgresql now and restart at login:
brew services start postgresql
Or, if you don't want/need a background service you can just run:
pg_ctl -D /usr/local/var/postgres start
==> Summary
🍺 /usr/local/Cellar/postgresql/11.1: 3,548 files, 40.3M

Launching PostgreSQL from Terminal

Read the documentation for a more in-depth explanation.

In a fresh Terminal session (If you are using RStudio’s Terminal pane, select Tools > Terminal > New Terminal), enter the following.

psql postgres

This will bring up the psql command line.

psql (10.5)
Type "help" for help.

postgres=#

The =# is where we will be entering our postgres commands.

Check users, roles and attributes

We can access the users, their roles and attributes in PostgreSQL with the following:

postgres=# \du

This will return a table with the following information:

yourname=# \du
                                       List of roles
 Role name      |                         Attributes                         | Member of
----------------+------------------------------------------------------------+-----------
 yourname       | Superuser, Create role, Create DB                          | {}
 postgres       | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Add a new user

The postgres=# prompt indicates I am currently a Superuser. To create a new user named martymcfly, I enter CREATE USER with a WITH PASSWORD clause:

=# CREATE USER martymcfly WITH PASSWORD 'strongpassword123';
-- CREATE ROLE

PostgreSQL typically returns the command that was previously executed. Because CREATE USER is another spelling of CREATE ROLE, PostgreSQL returns CREATE ROLE in the command prompt.

I will check to make sure this is true with \du again.

=# \du
                                       List of roles
 Role name      |                         Attributes                         | Member of
----------------+------------------------------------------------------------+-----------
 yourname       | Superuser, Create role, Create DB                          | {}
 martymcfly     |                                                            | {}
 postgres       | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Grant attributes to new user

As you can see, martymcfly has no roles or attributes. I want this user to be able to create a new database. I can do this with the following:

=# ALTER ROLE martymcfly CREATEDB;

And I can verify this with the \du command again.

=# \du
                                       List of roles
 Role name      |                         Attributes                         | Member of
----------------+------------------------------------------------------------+-----------
 martinfrigaard | Superuser, Create role, Create DB                          | {}
 martymcfly     | Create DB                                                  | {}
 postgres       | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Creating a database

I’m going to switch over to the new user I just created by exiting PostgreSQL (\q) and signing back in as martymcfly:

=# \q

Now sign in as the new user:

psql postgres -U martymcfly

This brings up a new PostgreSQL prompt.

postgres=>

The => tells me I am no longer the Superuser, but I know that I have given martymcfly the relevant permissions to create a new database.

Below are the steps needed to 1) create a new database (I’m creating a Lahman baseball database in PostgreSQL), 2) grant privileges to the user martymcfly on the new lahman2017 database, 3) list all the current databases in PostgreSQL, and 4) connect to the lahman2017 database.

First create lahman2017

CREATE DATABASE lahman2017;
-- CREATE DATABASE

Then give martymcfly permissions

 GRANT ALL PRIVILEGES ON DATABASE lahman2017 TO martymcfly;
-- GRANT

Now list the databases

\list

This returns:

                                      List of databases
      Name      |     Owner      | Encoding |   Collate   |    Ctype    |    Access privileges
----------------+----------------+----------+-------------+-------------+---------------------------
 lahman2017     | martymcfly     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/martymcfly           +
                |                |          |             |             | martymcfly=CTc/martymcfly
 martinfrigaard | martinfrigaard | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres       | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0      | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres              +
                |                |          |             |             | postgres=CTc/postgres
 template1      | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres              +
                |                |          |             |             | postgres=CTc/postgres
(5 rows)

Finally, we connect to the new database

\connect lahman2017
psql (11.1, server 11.0)
You are now connected to database "lahman2017" as user "martymcfly".

Great! Next time we will build the schemas and tables in the lahman2017 database.
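
When it’s time to query this database from R, the same DBI pattern from the MySQL posts should carry over; a minimal sketch, assuming the RPostgres package is installed:

# connect to the new lahman2017 database from R
# (assumes the RPostgres and rstudioapi packages are installed)
LahmanPostgreSQL <- DBI::dbConnect(RPostgres::Postgres(),
                                   dbname = "lahman2017",
                                   host = "localhost",
                                   user = "martymcfly",
                                   password = rstudioapi::askForPassword("Database password"))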

Querying MySQL from RStudio

In the previous post I demonstrated how to install and use MySQL from the command line. In this tutorial I will show you how to set up and query a relational database management system (RDBMS) from RStudio.

The Lahman database

These queries are done using the Lahman baseball data set available here. I chose these data because 1) they are open to anyone and can be run on your local machine, and 2) they are available as a .sql file and .csv files for loading into any RDBMS software.

Database packages in RStudio

For this tutorial I will be using a local instance of MySQL with help from the DBI, dplyr, and RMySQL packages.

library(DBI) 
library(RMySQL)
library(RMariaDB)
library(dplyr)
library(rstudioapi)
library(ggthemes)

Change MySQL settings

Before we can connect to a database with RStudio, we need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption.
  3. Then click Start MySQL Server

Connect to MySQL

A DBI connection is built below to my local instance of MySQL.

LahmanDBIMySQL <- DBI::dbConnect(MySQL(),
                                 dbname = "lahman2016",
                                 host = "localhost",
                                 user = "root",
                                 password = rstudioapi::askForPassword(prompt = "database password"))
LahmanDBIMySQL

As you can see, this is a <MySQLConnection:0,1>. This is the object we will use to access the lahman2016 database.
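
A quick way to verify the connection works is to send a trivial query:

# confirm the connection by asking MySQL for its version
DBI::dbGetQuery(LahmanDBIMySQL, "SELECT VERSION();")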

Querying a database from RStudio

We can get a list of tables in the lahman2016 database by using the DBI::dbListTables() function.

DBI::dbListTables(LahmanDBIMySQL)
 [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
 [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
 [7] "Batting"             "BattingPost"         "CollegePlaying"     
[10] "Fielding"            "FieldingOF"          "FieldingOFsplit"    
[13] "FieldingPost"        "Griffeys"            "HallOfFame"         
[16] "HomeGames"           "Managers"            "ManagersHalf"       
[19] "Master"              "Parks"               "Pitching"           
[22] "PitchingPost"        "Salaries"            "Schools"            
[25] "SeriesPost"          "Teams"               "TeamsFranchises"    
[28] "TeamsHalf"

This is similar to the SHOW TABLES; command in MySQL. In the previous tutorial, I built a table of statistics on Ken Griffey Junior and Senior in the Lahman database. The Griffeys table contains the following fields:

DBI::dbListFields(LahmanDBIMySQL, "Griffeys")
##  [1] "playerID"  "birthYear" "nameFirst" "nameLast"  "weight"   
##  [6] "height"    "bats"      "throws"    "retroID"   "bbrefID"  
## [11] "teamID"    "lgID"      "yearID"    "G"         "AB"       
## [16] "R"         "H"         "HR"        "RBI"       "BB"       
## [21] "SO"

Queries can be passed directly to the DBI::dbGetQuery() function, like the SELECT statement below.

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT * FROM Griffeys;") %>%
  dplyr::glimpse(78)

Observations: 45
Variables: 21
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ retroID "grifk001", "grifk001", "grifk001", "grifk001", "grifk001…
$ bbrefID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…

I realize I don’t have all the columns I want from the Batting and Fielding tables, so I will query lahman2016 again to collect these data.

Aliasing columns in MySQL tables

There are three columns in the Batting table I want to add to Griffeys: SF, 2B, and 3B. Two of these names violate naming rules in R.

DBI::dbListFields(LahmanDBIMySQL, "Batting")
 [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
 [7] "AB"       "R"        "H"        "2B"       "3B"       "HR"      
[13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
[19] "HBP"      "SH"       "SF"       "GIDP"

The 2B and 3B columns contain the number of doubles and triples for each player, and SF is the number of sacrifice flies. Unfortunately, R won’t allow syntactic column names that start with a number. I can write a query that includes these columns but uses aliases to create R-friendly column names.
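
Without the aliases, the columns come back with their SQL names and can only be reached with backticks in R; a short sketch of the problem:

# without aliases, the returned columns are named `2B` and `3B`
DBI::dbGetQuery(LahmanDBIMySQL,
                "SELECT playerID, `2B`, `3B` FROM Batting LIMIT 5;") %>%
  dplyr::rename(dubs = `2B`, trips = `3B`)

With aliases, the full query below comes back with R-friendly names instead: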

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT 
                  grf.playerID,
                  grf.birthYear,
                  grf.nameFirst,
                  grf.nameLast,
                  grf.weight,
                  grf.height,
                  grf.bats,
                  grf.throws,
                  grf.lgID,
                  grf.yearID,
                  grf.teamID,
                  grf.G,
                  grf.AB,
                  grf.R,
                  grf.H,
                  grf.HR,
                  grf.RBI,
                  grf.BB,
                  grf.SO,
                  bat.2B AS dubs,
                  bat.3B AS trips,
                  bat.SF AS sac_flies
                FROM Griffeys AS grf
                INNER JOIN Batting AS bat
                  ON grf.playerID = bat.playerID
                  AND grf.yearID = bat.yearID
                  AND grf.teamID = bat.teamID;") %>%
  glimpse(78)

Observations: 45
Variables: 22
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…
$ dubs 5, 9, 15, 28, 35, 33, 27, 28, 21, 23, 21, 20, 28, 7, 15, …
$ trips 1, 5, 9, 9, 8, 8, 4, 10, 6, 2, 3, 1, 4, 0, 3, 1, 0, 0, 3,…
$ sac_flies "0", "0", "3", "3", "2", "3", "3", "5", "4", "3", "2", "4…

Storing SQL in character vectors

I can also assign the query above to a character vector (batting_query) and pass the vector to the DBI::dbGetQuery() function. I will assign these new variables to the Griffeys table.

batting_query <- "SELECT 
    grf.playerID,
    grf.birthYear,
    grf.nameFirst,
    grf.nameLast,
    grf.weight,
    grf.height,
    grf.bats,
    grf.throws,
    grf.lgID,
    grf.yearID,
    grf.teamID,
    grf.G,
    grf.AB,
    grf.R,
    grf.H,
    grf.HR,
    grf.RBI,
    grf.BB,
    grf.SO,
    bat.2B AS dubs,
    bat.3B AS trips,
    bat.SF AS sac_flies
  FROM Griffeys AS grf
  INNER JOIN Batting AS bat
    ON grf.playerID = bat.playerID
    AND grf.yearID = bat.yearID
    AND grf.teamID = bat.teamID;"
Griffeys <- DBI::dbGetQuery(LahmanDBIMySQL, batting_query)

Now that we see the SQL works and these two tables are joined, we can calculate a few new statistics in R.

Calculating On-base plus slugging (OPS)

A player’s on-base plus slugging (OPS) is a measure of their overall batting performance: the sum of the player’s on-base percentage and slugging average. The steps to calculate this statistic are below.

GriffsOPSbyYear <- Griffeys %>% 
  dplyr::mutate(
    # slugging percentage
    slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
    # convert sac_flies to numeric
    sac_flies = as.numeric(sac_flies),
    # on-base percentage
    ob_perc = (H + BB) / (H + AB + BB + sac_flies),
    # on-base plus slugging
    obslug_perc = slug_perc + ob_perc,
    # create a nameID
    nameID = case_when(
      birthYear == 1950 ~ "Ken Griffey Sr",
      birthYear == 1969 ~ "Ken Griffey Jr"),
    # convert to a factor
    nameID = factor(nameID),
    # make KGSr the reference
    nameID = relevel(nameID, ref = "Ken Griffey Sr"))

I created a few intermediate calculations before creating obslug_perc, and I also created a nameID so the two players can be easily graphed. Below is the on-base plus slugging (OPS) statistic for both Griffeys over their career.

GriffsOPSbyYear %>% 
  ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
                      group = nameID, color = teamID)) +
  geom_line() +
  ggplot2::facet_grid(. ~ nameID, scales = "free") +
  ggplot2::labs(
    title = "The Griffeys' career on-base plus slugging (OPS)") +
  ggthemes::theme_fivethirtyeight()

The Lahman package in R

There is also a Lahman package in R. I will use this below to demonstrate some of the similarities between dplyr and MySQL syntax.

First I create an object for the Lahman::Master, Lahman::Fielding, and Lahman::Batting tables.

Master <- Lahman::Master
Batting <- Lahman::Batting
Fielding <- Lahman::Fielding

Then the code below creates the same tables and graph using dplyr commands.

GriffeyOPSPlot <- Master %>% 
  dplyr::select(playerID,
                birthYear,
                nameLast) %>%
  # find Griffeys
  dplyr::filter(nameLast %in% "Griffey") %>%
  # join to batting table
  dplyr::inner_join(x = .,
                    y = Batting,
                    by = "playerID") %>%
  # select/rename relevant columns
  dplyr::select(teamID,
                nameLast,
                playerID,
                birthYear,
                yearID,
                G,
                AB,
                R,
                H,
                HR,
                dubs = X2B,
                trips = X3B,
                RBI,
                BB,
                SO,
                sac_flies = SF) %>%
  dplyr::mutate(
    # slugging percentage
    slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
    # convert sac_flies to numeric
    sac_flies = as.numeric(sac_flies),
    # on-base percentage
    ob_perc = (H + BB) / (H + AB + BB + sac_flies),
    # on-base plus slugging
    obslug_perc = slug_perc + ob_perc,
    # create a nameID
    nameID = case_when(
      birthYear == 1950 ~ "Ken Griffey Sr",
      birthYear == 1969 ~ "Ken Griffey Jr"),
    # convert to a factor
    nameID = factor(nameID),
    # make KGSr the reference
    nameID = relevel(nameID, ref = "Ken Griffey Sr")) %>%
  # create plot of new OPS
  ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
                      group = nameID, color = teamID)) +
  geom_line() +
  ggplot2::facet_grid(. ~ nameID, scales = "free") +
  ggplot2::labs(
    title = "The Griffeys' career on-base plus slugging (OPS)") +
  ggthemes::theme_fivethirtyeight()
ggplot2::ggsave(filename = "GriffeyOPSPlot.png", plot = GriffeyOPSPlot,
                width = 7, height = 5, units = "in")

End

MySQL from the command line (macOS)

Setting up MySQL from the command line

This quick tutorial covers how to set up and query a MySQL database from the command line (Terminal) on macOS Sierra.

What is MySQL?

The SQL in MySQL stands for structured query language. There are half a dozen flavors of SQL, and MySQL is one of the most common. The My comes from the name of co-founder Michael Widenius’s daughter (fun fact: another flavor of SQL, MariaDB, is named after his younger daughter).

MySQL is an open-source relational database management system. Read more about MySQL on Wikipedia, or check out the reference manual here.

Download MySQL

Download and install the community edition of MySQL. You will be asked to create an account, but you can opt out and just click on “No thanks, just start my download.”

After downloading the dmg, you will be guided through the installation steps. On the Configuration options, I chose Use Strong Password Encryption

and on the next window I entered a password and checked the box for Start MySQL Server once the installation is complete

Or use brew install mysql if you have homebrew installed.

After the install finishes, you should see the MySQL icon in the System Preferences:

MySQL workbench (optional)

Download and install the workbench if you want to use an IDE for querying MySQL (I prefer DataGrip). You should read this documentation on the Workbench.

Install database drivers (using homebrew)

In a future post, I will be using RStudio to query a database using the RMySQL and RMariaDB packages. Follow these instructions here for installing the database drivers on your Mac.

These commands are entered into Terminal.

# Install the unixODBC library
brew install unixodbc
# SQL Server ODBC Drivers (Free TDS)
brew install freetds --with-unixodbc
# PostgreSQL ODBC ODBC Drivers
brew install psqlodbc
# SQLite ODBC Drivers
brew install sqliteodbc

Installing the RMySQL package connector

I need to install the connectors for MySQL and MariaDB using brew install mysql-connector-c in Terminal.

$ brew install mysql-connector-c

After updating Homebrew, the connector is installed.

Install the RMariaDB package connector

I will also install the mariadb-connector-c connector.

$ brew install mariadb-connector-c

Now we can launch MySQL.

Launching MySQL locally

The commands below are entered directly into Terminal.

  1. To find the path for the local MySQL db:
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH
  2. To start up mysql, enter the following:
$ mysql -u root -p 
  3. You will be prompted for the password you used to set up MySQL. Enter it into the Terminal, and you should see this:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 8.0.13 MySQL Community Server - GPL

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

The MySQL command line is below:

mysql>

After installing MySQL community edition, you can choose to either run commands from the terminal or within a .sql script in the workbench. Below I demonstrate using MySQL from the command line.

Using MySQL commands in Terminal

To see the users and passwords, enter the following commands into the Terminal. The authentication_string column identifies the passwords (but they are encrypted).

NOTE: a semi-colon is needed at the end of each MySQL command.

SELECT 
  User, authentication_string 
FROM 
  mysql.user;

+------------------+------------------------------------------------------------------------+
| User             | authentication_string                                                  |
+------------------+------------------------------------------------------------------------+
| mysql.infoschema | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| mysql.session    | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| mysql.sys        | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| root             | *D932DC725A9210F3B4C903D69F88EDC3AD447A06                              |
+------------------+------------------------------------------------------------------------+
4 rows in set (0.00 sec)

The MySQL commands are working! Let’s build a database!

Building a MySQL database

The lahman2016 database is freely available full of information on baseball players and teams from 1871 through 2016. You can download it here.

After downloading the zipped database into a local data folder, I find the following files.

├── data
│   ├── lahman2016-sql
│   │   ├── lahman2016.sql
│   │   └── readme2016.txt

The readme2016.txt file tells me more about the tables in the database.

In a new Terminal window, I navigate to the lahman2016-sql folder and enter the following commands.

$ cd lahman2016-sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH
$ mysql -u root -p

After entering my password into the Terminal, I need to run the lahman2016.sql file. I can do this using the source command in the MySQL prompt.

mysql> source lahman2016.sql

This should show a lot of activity and output, but eventually bring back the MySQL prompt. I can check the databases using SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| lahman2016         |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

Querying a database directly from Terminal

Now that I’ve set up the lahman2016 database, I can use it to demonstrate a few MySQL commands. I will start by selecting this database.

mysql> USE lahman2016;

This prompts the following message.

Database changed

Now I can take a look at the tables in the lahman2016 database.

mysql> SHOW TABLES;
+----------------------+
| Tables_in_lahman2016 |
+----------------------+
| AllstarFull          |
| Appearances          |
| AwardsManagers       |
| AwardsPlayers        |
| AwardsShareManagers  |
| AwardsSharePlayers   |
| Batting              |
| BattingPost          |
| CollegePlaying       |
| Fielding             |
| FieldingOF           |
| FieldingOFsplit      |
| FieldingPost         |
| HallOfFame           |
| HomeGames            |
| Managers             |
| ManagersHalf         |
| Master               |
| Parks                |
| Pitching             |
| PitchingPost         |
| Salaries             |
| Schools              |
| SeriesPost           |
| Teams                |
| TeamsFranchises      |
| TeamsHalf            |
+----------------------+
27 rows in set (0.04 sec)

Adding a primary key to Master table

When I look at the Master table, I see that it was made without a primary key.

SHOW CREATE TABLE Master;
| Master | CREATE TABLE `Master` (
  `playerID` varchar(255) DEFAULT NULL,
  `birthYear` int(11) DEFAULT NULL,
  `birthMonth` int(11) DEFAULT NULL,
  `birthDay` int(11) DEFAULT NULL,
  `birthCountry` varchar(255) DEFAULT NULL,
  `birthState` varchar(255) DEFAULT NULL,
  `birthCity` varchar(255) DEFAULT NULL,
  `deathYear` varchar(255) DEFAULT NULL,
  `deathMonth` varchar(255) DEFAULT NULL,
  `deathDay` varchar(255) DEFAULT NULL,
  `deathCountry` varchar(255) DEFAULT NULL,
  `deathState` varchar(255) DEFAULT NULL,
  `deathCity` varchar(255) DEFAULT NULL,
  `nameFirst` varchar(255) DEFAULT NULL,
  `nameLast` varchar(255) DEFAULT NULL,
  `nameGiven` varchar(255) DEFAULT NULL,
  `weight` int(11) DEFAULT NULL,
  `height` int(11) DEFAULT NULL,
  `bats` varchar(255) DEFAULT NULL,
  `throws` varchar(255) DEFAULT NULL,
  `debut` varchar(255) DEFAULT NULL,
  `finalGame` varchar(255) DEFAULT NULL,
  `retroID` varchar(255) DEFAULT NULL,
  `bbrefID` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

If I want to update the table so the primary key is playerID, I can do this with the commands below.

mysql> ALTER TABLE `lahman2016`.`Master` 
       CHANGE COLUMN `playerID` `playerID` varchar(255) NOT NULL ,
         ADD PRIMARY KEY (`playerID`),
         ADD UNIQUE INDEX `playerID_UNIQUE` (`playerID` ASC) VISIBLE;

Then the Master table returns this when the SHOW CREATE TABLE command is entered into the prompt.

SHOW CREATE TABLE Master;
| Master | CREATE TABLE `Master` (
  `playerID` varchar(255) NOT NULL,
  `birthYear` int(11) DEFAULT NULL,
  `birthMonth` int(11) DEFAULT NULL,
  `birthDay` int(11) DEFAULT NULL,
  `birthCountry` varchar(255) DEFAULT NULL,
  `birthState` varchar(255) DEFAULT NULL,
  `birthCity` varchar(255) DEFAULT NULL,
  `deathYear` varchar(255) DEFAULT NULL,
  `deathMonth` varchar(255) DEFAULT NULL,
  `deathDay` varchar(255) DEFAULT NULL,
  `deathCountry` varchar(255) DEFAULT NULL,
  `deathState` varchar(255) DEFAULT NULL,
  `deathCity` varchar(255) DEFAULT NULL,
  `nameFirst` varchar(255) DEFAULT NULL,
  `nameLast` varchar(255) DEFAULT NULL,
  `nameGiven` varchar(255) DEFAULT NULL,
  `weight` int(11) DEFAULT NULL,
  `height` int(11) DEFAULT NULL,
  `bats` varchar(255) DEFAULT NULL,
  `throws` varchar(255) DEFAULT NULL,
  `debut` varchar(255) DEFAULT NULL,
  `finalGame` varchar(255) DEFAULT NULL,
  `retroID` varchar(255) DEFAULT NULL,
  `bbrefID` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`playerID`),
  UNIQUE KEY `playerID_UNIQUE` (`playerID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

Querying a database using .sql scripts

Another way to execute MySQL commands is by creating a .sql file and running these files either from the command line or using the source command from the MySQL prompt (like I did above).

This method is preferred because it is more reproducible (and it is easier to keep track of your work).

For example, if I want to collect data on Ken Griffey Junior and Senior, I can use the MySQL prompt to build up the query.

  1. Get the names of the columns in both tables.
mysql> SHOW COLUMNS FROM Batting;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| playerID | varchar(255) | YES  |     | NULL    |       |
| yearID   | int(11)      | YES  |     | NULL    |       |
| stint    | int(11)      | YES  |     | NULL    |       |
| teamID   | varchar(255) | YES  |     | NULL    |       |
| lgID     | varchar(255) | YES  |     | NULL    |       |
| G        | int(11)      | YES  |     | NULL    |       |
| AB       | int(11)      | YES  |     | NULL    |       |
| R        | int(11)      | YES  |     | NULL    |       |
| H        | int(11)      | YES  |     | NULL    |       |
| 2B       | int(11)      | YES  |     | NULL    |       |
| 3B       | int(11)      | YES  |     | NULL    |       |
| HR       | int(11)      | YES  |     | NULL    |       |
| RBI      | int(11)      | YES  |     | NULL    |       |
| SB       | int(11)      | YES  |     | NULL    |       |
| CS       | int(11)      | YES  |     | NULL    |       |
| BB       | int(11)      | YES  |     | NULL    |       |
| SO       | int(11)      | YES  |     | NULL    |       |
| IBB      | varchar(255) | YES  |     | NULL    |       |
| HBP      | varchar(255) | YES  |     | NULL    |       |
| SH       | varchar(255) | YES  |     | NULL    |       |
| SF       | varchar(255) | YES  |     | NULL    |       |
| GIDP     | varchar(255) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
22 rows in set (0.01 sec)

mysql> SHOW COLUMNS FROM Master;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| playerID     | varchar(255) | NO   | PRI | NULL    |       |
| birthYear    | int(11)      | YES  |     | NULL    |       |
| birthMonth   | int(11)      | YES  |     | NULL    |       |
| birthDay     | int(11)      | YES  |     | NULL    |       |
| birthCountry | varchar(255) | YES  |     | NULL    |       |
| birthState   | varchar(255) | YES  |     | NULL    |       |
| birthCity    | varchar(255) | YES  |     | NULL    |       |
| deathYear    | varchar(255) | YES  |     | NULL    |       |
| deathMonth   | varchar(255) | YES  |     | NULL    |       |
| deathDay     | varchar(255) | YES  |     | NULL    |       |
| deathCountry | varchar(255) | YES  |     | NULL    |       |
| deathState   | varchar(255) | YES  |     | NULL    |       |
| deathCity    | varchar(255) | YES  |     | NULL    |       |
| nameFirst    | varchar(255) | YES  |     | NULL    |       |
| nameLast     | varchar(255) | YES  |     | NULL    |       |
| nameGiven    | varchar(255) | YES  |     | NULL    |       |
| weight       | int(11)      | YES  |     | NULL    |       |
| height       | int(11)      | YES  |     | NULL    |       |
| bats         | varchar(255) | YES  |     | NULL    |       |
| throws       | varchar(255) | YES  |     | NULL    |       |
| debut        | varchar(255) | YES  |     | NULL    |       |
| finalGame    | varchar(255) | YES  |     | NULL    |       |
| retroID      | varchar(255) | YES  |     | NULL    |       |
| bbrefID      | varchar(255) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
24 rows in set (0.00 sec)
  2. Decide which columns I want using a SELECT and INNER JOIN query.
SELECT 
  -- Master columns
  mas.playerID, 
  mas.birthYear, 
  mas.nameFirst, 
  mas.nameLast, 
  mas.weight, 
  mas.height, 
  mas.bats, 
  mas.throws, 
  mas.retroID, 
  mas.bbrefID,
  -- Batting columns
  bat.teamID,
  bat.lgID,
  bat.yearID,
  bat.G,
  bat.AB,
  bat.R,
  bat.H,
  bat.HR,
  bat.RBI,
  bat.BB,
  bat.SO
FROM 
    Master AS mas
      INNER JOIN 
    Batting AS bat ON mas.playerID = bat.playerID
  WHERE nameLast = "Griffey";
  3. Write commands to create a new table (Griffeys) to insert the Griffey data into. This is where the previous column information comes in handy.
CREATE TABLE Griffeys (
    playerID varchar(255) NOT NULL,
    birthYear int(11) NOT NULL,
    nameFirst VARCHAR(255) NOT NULL,
    nameLast VARCHAR(255) NOT NULL,
    weight int(11),
    height int(11),
    bats varchar(255),
    throws varchar(255),
    retroID varchar(255),
    bbrefID varchar(255),
    teamID varchar(255),   
    lgID varchar(255),
    yearID int(11),
    G int(11),
    AB int(11),
    R int(11),
    H int(11),
    HR int(11),
    RBI int(11),
    BB int(11),
    SO int(11)
);
  4. Combine the two commands into a .sql file in the lahman2016-sql folder:

data/lahman2016-sql/griffey-table.sql

The script should contain the following:

/*
***************************************************************************
-- This is code to create: Griffeys table
-- Authored by and feedback to mjfrigaard@gmail.com
-- MIT License
-- Version: 1.0
***************************************************************************
*/

/*
 select database
*/

USE lahman2016;

/*
 create new table
*/

CREATE TABLE Griffeys (
    playerID varchar(255) NOT NULL,
    birthYear int(11) NOT NULL,
    nameFirst VARCHAR(255) NOT NULL,
    nameLast VARCHAR(255) NOT NULL,
    weight int(11),
    height int(11),
    bats varchar(255),
    throws varchar(255),
    retroID varchar(255),
    bbrefID varchar(255),
    teamID varchar(255),   
    lgID varchar(255),
    yearID int(11),
    G int(11),
    AB int(11),
    R int(11),
    H int(11),
    HR int(11),
    RBI int(11),
    BB int(11),
    SO int(11)
);

/*
 insert the select query into the new columns
*/

INSERT INTO Griffeys (
  playerID, 
  birthYear, 
  nameFirst, 
  nameLast, 
  weight, 
  height, 
  bats, 
  throws, 
  retroID, 
  bbrefID,
  teamID,
  lgID,
  yearID,
  G,
  AB,
  R,
  H,
  HR,
  RBI,
  BB,
  SO
)
SELECT 
  -- Master columns
  mas.playerID, 
  mas.birthYear, 
  mas.nameFirst, 
  mas.nameLast, 
  mas.weight, 
  mas.height, 
  mas.bats, 
  mas.throws, 
  mas.retroID, 
  mas.bbrefID,
  -- Batting columns
  bat.teamID,
  bat.lgID,
  bat.yearID,
  bat.G,
  bat.AB,
  bat.R,
  bat.H,
  bat.HR,
  bat.RBI,
  bat.BB,
  bat.SO
  
FROM 
    Master AS mas
      INNER JOIN 
    Batting AS bat ON mas.playerID = bat.playerID
  WHERE nameLast = "Griffey";
  
-- End file
  5. Use source from the MySQL prompt to execute the file.
mysql> source griffey-table.sql

Now I can check this new table using DESCRIBE.

mysql> DESCRIBE Griffeys;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| playerID  | varchar(255) | NO   |     | NULL    |       |
| birthYear | int(11)      | NO   |     | NULL    |       |
| nameFirst | varchar(255) | NO   |     | NULL    |       |
| nameLast  | varchar(255) | NO   |     | NULL    |       |
| weight    | int(11)      | YES  |     | NULL    |       |
| height    | int(11)      | YES  |     | NULL    |       |
| bats      | varchar(255) | YES  |     | NULL    |       |
| throws    | varchar(255) | YES  |     | NULL    |       |
| retroID   | varchar(255) | YES  |     | NULL    |       |
| bbrefID   | varchar(255) | YES  |     | NULL    |       |
| teamID    | varchar(255) | YES  |     | NULL    |       |
| lgID      | varchar(255) | YES  |     | NULL    |       |
| yearID    | int(11)      | YES  |     | NULL    |       |
| G         | int(11)      | YES  |     | NULL    |       |
| AB        | int(11)      | YES  |     | NULL    |       |
| R         | int(11)      | YES  |     | NULL    |       |
| H         | int(11)      | YES  |     | NULL    |       |
| HR        | int(11)      | YES  |     | NULL    |       |
| RBI       | int(11)      | YES  |     | NULL    |       |
| BB        | int(11)      | YES  |     | NULL    |       |
| SO        | int(11)      | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+
21 rows in set (0.00 sec)

Or SELECT *

mysql> SELECT * FROM Griffeys;
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
| playerID  | birthYear | nameFirst | nameLast | weight | height | bats | throws | retroID  | bbrefID   | teamID | lgID | yearID | G    | AB   | R    | H    | HR   | RBI  | BB   | SO   |
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1973 |   25 |   86 |   19 |   33 |    3 |   14 |    6 |   10 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1974 |   88 |  227 |   24 |   57 |    2 |   19 |   27 |   43 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1975 |  132 |  463 |   95 |  141 |    4 |   46 |   67 |   67 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1976 |  148 |  562 |  111 |  189 |    6 |   74 |   62 |   65 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1977 |  154 |  585 |  117 |  186 |   12 |   57 |   69 |   84 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1978 |  158 |  614 |   90 |  177 |   10 |   63 |   54 |   70 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1979 |   95 |  380 |   62 |  120 |    8 |   32 |   36 |   39 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1980 |  146 |  544 |   89 |  160 |   13 |   85 |   62 |   77 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1981 |  101 |  396 |   65 |  123 |    2 |   34 |   39 |   42 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1982 |  127 |  484 |   70 |  134 |   12 |   54 |   39 |   58 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1983 |  118 |  458 |   60 |  140 |   11 |   46 |   34 |   45 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1984 |  120 |  399 |   44 |  109 |    7 |   56 |   29 |   32 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1985 |  127 |  438 |   68 |  120 |   10 |   69 |   41 |   51 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1986 |   59 |  198 |   33 |   60 |    9 |   26 |   15 |   24 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1986 |   80 |  292 |   36 |   90 |   12 |   32 |   20 |   43 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1987 |  122 |  399 |   65 |  114 |   14 |   64 |   46 |   54 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1988 |   69 |  193 |   21 |   48 |    2 |   19 |   17 |   26 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1988 |   25 |   50 |    5 |   14 |    2 |    4 |    2 |    5 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1989 |  106 |  236 |   26 |   62 |    8 |   30 |   29 |   42 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1989 |  127 |  455 |   61 |  120 |   16 |   61 |   44 |   83 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1990 |   46 |   63 |    6 |   13 |    1 |    8 |    2 |    5 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | SEA    | AL   |   1990 |   21 |   77 |   13 |   29 |    3 |   18 |   10 |    3 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1990 |  155 |  597 |   91 |  179 |   22 |   80 |   63 |   81 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | SEA    | AL   |   1991 |   30 |   85 |   10 |   24 |    1 |    9 |   13 |   13 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1991 |  154 |  548 |   76 |  179 |   22 |  100 |   71 |   82 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1992 |  142 |  565 |   83 |  174 |   27 |  103 |   44 |   67 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1993 |  156 |  582 |  113 |  180 |   45 |  109 |   96 |   91 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1994 |  111 |  433 |   94 |  140 |   40 |   90 |   56 |   73 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1995 |   72 |  260 |   52 |   67 |   17 |   42 |   52 |   53 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1996 |  140 |  545 |  125 |  165 |   49 |  140 |   78 |  104 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1997 |  157 |  608 |  125 |  185 |   56 |  147 |   76 |  121 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1998 |  161 |  633 |  120 |  180 |   56 |  146 |   76 |  121 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1999 |  160 |  606 |  123 |  173 |   48 |  134 |   91 |  108 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2000 |  145 |  520 |  100 |  141 |   40 |  118 |   94 |  117 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2001 |  111 |  364 |   57 |  104 |   22 |   65 |   44 |   72 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2002 |   70 |  197 |   17 |   52 |    8 |   23 |   28 |   39 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2003 |   53 |  166 |   34 |   41 |   13 |   26 |   27 |   44 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2004 |   83 |  300 |   49 |   76 |   20 |   60 |   44 |   67 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2005 |  128 |  491 |   85 |  148 |   35 |   92 |   54 |   93 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2006 |  109 |  428 |   62 |  108 |   27 |   72 |   39 |   78 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2007 |  144 |  528 |   78 |  146 |   30 |   93 |   85 |   99 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2008 |  102 |  359 |   51 |   88 |   15 |   53 |   61 |   64 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CHA    | AL   |   2008 |   41 |  131 |   16 |   34 |    3 |   18 |   17 |   25 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   2009 |  117 |  387 |   44 |   83 |   19 |   57 |   63 |   80 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   2010 |   33 |   98 |    6 |   18 |    0 |    7 |    9 |   17 |
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
45 rows in set (0.00 sec)

Exit MySQL

The command to exit MySQL is–you guessed it–exit.

mysql> exit
bye

Scraping Wikipedia tables

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting its contents into a regular expression, then using that regular expression to match strings in a different table.

An EHR medications table

I was able to find an example medications table on data.world. The file is titled “hoyt/Medications/Prescriptions” in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends 
library(tidyverse)
library(magrittr)
library(xml2)
library(rvest)

Load the EHR data below:

EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")  
EHRMedTable %>% glimpse(78)

Find a list of antibiotic medications on Wikipedia

I googled “list of antibiotics” and found this Wikipedia page. I’m only interested in the column titled “Generic name”, but I will download the entire table.

[Image: the “List of antibiotics” table on Wikipedia]

The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of wiki_html object is below.

wiki_html %>% str()
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html node, so I will use a function from the rvest package. The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
List of 3
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 - attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with subsetting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

# the search pattern ("Generic name") is assumed from the table header
relevant_tables <- wiki_html_tables[base::grep("Generic name", wiki_html_tables)]
relevant_tables %>% str()
List of 2
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 - attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket subsetting to explore this object and learn about its contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60) to get a compact view of the result.

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
Observations: 168
Variables: 5
$ Generic name             <chr> "Aminoglycosides", "…
$ Brand names              <chr> "Aminoglycosides", "…
$ Common uses[3]           <chr> "Aminoglycosides", "…
$ Possible side effects[3] <chr> "Aminoglycosides", "…
$ Mechanism of action      <chr> "Aminoglycosides", "…

Looks like this is the right table! I will assign it to a data frame titled WikiAntiB and check the base::names().

WikiAntiB <- rvest::html_table(relevant_tables[[1]], fill = TRUE)
WikiAntiB %>% names()
 [1] "Generic name"             "Brand names"             
 [3] "Common uses[3]"           "Possible side effects[3]"
 [5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I want to split the Generic name column and take the first word (the antibiotic) from each entry.

But before I do that, I am going to give both tables snake_case variable names, reduce EHRMedTable to only id and med, and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>% 
  dplyr::select(
    # back-ticks are needed because these names contain spaces/brackets
    generic_name = `Generic name`,
    brand_name = `Brand names`,
    common_uses = `Common uses[3]`,
    poss_side_effects = `Possible side effects[3]`,
    mech_action = `Mechanism of action`)
WikiAntiB %>% dplyr::glimpse(60)

Observations: 176
Variables: 5
$ generic_name      <chr> "Aminoglycosides", "Amikacin", …
$ brand_name        <chr> "Aminoglycosides", "Amikin", "G…
$ common_uses       <chr> "Aminoglycosides", "Infections …
$ poss_side_effects <chr> "Aminoglycosides", "Hearing los…
$ mech_action       <chr> "Aminoglycosides", "Binding to …

Clean EHRMeds.

EHRMeds <- EHRMedTable %>%
  dplyr::select(id,
                med = rxddrug)
# remove missing
EHRMeds <- EHRMeds %>% dplyr::filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)

Observations: 12,957
Variables: 2
$ id  <int> 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,…
$ med <chr> "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION…

Prepping strings to be used as regex

The information in generic_name isn’t quite ready for regex pattern matching. For example, the first few values in the med column in EHRMeds look like this:

EHRMeds$med %>% dplyr::glimpse(80) 
chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" …

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds <- EHRMeds %>% 
  dplyr::mutate(med = stringr::str_to_lower(med))
EHRMeds$med %>% dplyr::glimpse(80)

chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" …

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.

WikiAntiBGenName <- WikiAntiB %>% 
  # select this column to split
  dplyr::select(generic_name) %>%
  tidyr::separate(col = generic_name,
                  # put medication in gen_name
                  into = c("gen_name", "etc"),
                  # separate them on the whitespace
                  sep = " ",
                  # but keep the original variable
                  remove = FALSE) %>%
  # then take the new gen_name (first med)
  dplyr::select(gen_name) %>%
  # get the distinct values
  dplyr::distinct(gen_name) %>%
  # convert to lowercase
  dplyr::mutate(gen_name = stringr::str_to_lower(gen_name),
                # remove "(bs)" (note the doubled backslashes R needs)
                gen_name = stringr::str_remove_all(gen_name,
                                                   pattern = "\\(bs\\)"),
                # replace "/" with an underscore
                gen_name = stringr::str_replace_all(gen_name,
                                                    pattern = "/",
                                                    replacement = "_"),
                # replace "-" with an underscore
                gen_name = stringr::str_replace_all(gen_name,
                                                    pattern = "-",
                                                    replacement = "_")) %>%
  # split the new column again, this time into 2 gen_names
  # on the underscore we put there ^
  tidyr::separate(col = gen_name,
                  into = c("gen_name1", "gen_name2"),
                  sep = "_",
                  remove = FALSE) %>%
  # now get all gen_name meds into one column
  tidyr::gather(key = gen_name_key,
                value = gen_name_value,
                gen_name1:gen_name2) %>%
  # remove missing
  dplyr::filter(!is.na(gen_name_value)) %>%
  # and rename
  dplyr::select(generic_name = gen_name_value)

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60) 

Observations: 161
Variables: 1
$ generic_name <chr> "aminoglycosides", "amikacin", "ge…

Check out the unique values of generic_name with base::unique() and base::writeLines(), because these print the output in RStudio Notebooks in a useful way.

WikiAntiBGenName$generic_name %>%
  base::unique() %>%
  base::writeLines()
[Output: the unique generic_name values printed with writeLines()]

Note the entry between cefazolin and cefalexin is empty–remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
  dplyr::filter(generic_name != "")

Now I can put this into a vector and convert it into a regular expression. The stringr::str_c() function and the regular expression symbols (+, ?, |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.
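As a quick sketch of what the next chunk does, here is how a toy two-element vector (hypothetical values, not the real medication list) collapses into a single pattern:

# toy example: collapse a character vector into one regex pattern
meds <- c("ampicillin", "cefdinir")
pattern <- stringr::str_c(meds, collapse = "+?|")
pattern
#> [1] "ampicillin+?|cefdinir"
# paste0() then appends a lazy quantifier to the final alternative
base::paste0(pattern, "+?")
#> [1] "ampicillin+?|cefdinir+?"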

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
# collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRMeds in this case) and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it returns the rows that match the pattern.

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect()).

# check rows so I know I'm not fooling myself
EHRMeds %>%
  dplyr::distinct(med) %>%
  base::nrow() # [1] 701
EHRMeds %>%
  dplyr::filter(stringr::str_detect(string = med,
                                    pattern = antibiotic_med)) %>%
  dplyr::distinct(med) %>%
  base::nrow() # [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

# now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
  dplyr::filter(stringr::str_detect(med,
                                    antibiotic_med)) %>%
  dplyr::select(id,
                antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>% base::unique() 

[1] "rifaximin"
[2] "amoxicillin"
[3] "hydrocortisone; neomycin; polymyxin b otic"
[4] "trimethoprim"
[5] "cefdinir"
[6] "clindamycin"
[7] "azithromycin"
[8] "sulfamethoxazole"
[9] "ethambutol"
[10] "pyrazinamide"
[11] "minocycline"
[12] "sulfamethoxazole; trimethoprim"
[13] "cefixime"
[14] "polymyxin b; trimethoprim ophthalmic"
[15] "dexamethasone; tobramycin ophthalmic"
[16] "cefuroxime"
[17] "doxycycline"
[18] "amoxicillin; clavulanate"
[19] "erythromycin topical"
[20] "ciprofloxacin"
[21] "vancomycin"
[22] "penicillin v potassium"
[23] "silver sulfadiazine topical"
[24] "penicillin"
[25] "moxifloxacin ophthalmic"
[26] "gatifloxacin ophthalmic"
[27] "metronidazole"
[28] "ciprofloxacin; dexamethasone otic"
[29] "erythromycin ophthalmic"
[30] "gentamicin ophthalmic"
[31] "azithromycin ophthalmic"
[32] "tetracycline"
[33] "ofloxacin ophthalmic"
[34] "ciprofloxacin ophthalmic"
[35] "dexamethasone; neomycin; polymyxin b ophthalmic"
[36] "chloramphenicol"
[37] "mupirocin topical"
[38] "isoniazid"
[39] "levofloxacin"
[40] "nitrofurantoin"
[41] "moxifloxacin"
[42] "benzoyl peroxide; clindamycin topical"
[43] "sulfacetamide sodium ophthalmic"
[44] "neomycin; polymyxin b sulfate topical"
[45] "sulfasalazine"
[46] "metronidazole topical"
[47] "clarithromycin"
[48] "cefprozil"
[49] "clindamycin topical"
[50] "polymyxin b sulfate"
[51] "ofloxacin otic"
[52] "tobramycin ophthalmic"
[53] "dapsone topical"

Join tables back together

If I want to attach the antibiotic medication used by each patient (identified with id), I can join EHRAntiBMeds back to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
  dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
Observations: 18,704
Variables: 9
$ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
$ seqn      <dbl> 62161, 62161, 62162, 62163, 62164, 62…
$ rxduse    <dbl> 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ rxddrug   <chr> "FLUOXETINE", "METHYLPHENIDATE", NA, …
$ rxddrgid  <chr> "d00236", "d00900", NA, NA, "d00181",…
$ rxqseen   <dbl> 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N…
$ rxddays   <dbl> 5840, 5840, NA, NA, 9125, NA, 547, NA…
$ rxdcount  <dbl> 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N…
$ antib_med <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

And now I can see the counts of the top ten antibiotic medications in EHRMedTable.

EHRMedTable %>%
  dplyr::filter(!is.na(antib_med)) %>%
  # count() already returns one row per antib_med
  dplyr::count(antib_med) %>%
  dplyr::arrange(desc(n)) %>%
  utils::head(10)
# A tibble: 10 x 2
   antib_med                          n
   <chr>                          <int>
 1 amoxicillin                      126
 2 azithromycin                      40
 3 sulfamethoxazole; trimethoprim    33
 4 doxycycline                       26
 5 amoxicillin; clavulanate          23
 6 minocycline                       16
 7 cefdinir                          14
 8 ciprofloxacin                     13
 9 penicillin v potassium            13
10 nitrofurantoin                    11

BONUS! Using stringr::str_detect() within a dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top-ten print-out above, you might want a variable that indicates there is more than one medication listed in the antib_med column. Fortunately, dplyr::case_when() works well with stringr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

# test -----
EHRMedTable %>%
  dplyr::mutate(antib_2meds = dplyr::case_when(
    stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
    !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
    TRUE ~ NA_character_)) %>%
  dplyr::count(antib_2meds, antib_med) %>%
  tidyr::spread(antib_2meds, n) %>%
  head(10)
# A tibble: 10 x 4
   antib_med                   `1 antibiotic meds` `2 antibiotic meds` `<NA>`
   <chr>                                     <int>               <int>  <int>
 1 amoxicillin                                 126                  NA     NA
 2 amoxicillin; clavulanate                     NA                  23     NA
 3 azithromycin                                 40                  NA     NA
 4 azithromycin ophthalmic                       2                  NA     NA
 5 benzoyl peroxide; clindamy…                  NA                   4     NA
 6 cefdinir                                     14                  NA     NA
 7 cefixime                                      2                  NA     NA
 8 cefprozil                                     1                  NA     NA
 9 cefuroxime                                    3                  NA     NA
10 chloramphenicol                              1                  NA     NA
# assign -----
EHRMedTable <- EHRMedTable %>%
  dplyr::mutate(antib_2meds = dplyr::case_when(
    stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
    !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
    TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could also be used if you had lists stored in other files (Excel files, googlesheets, etc.).
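For example, if the medication list lived in an Excel file instead of a Wikipedia table, a sketch like this could feed the same cleaning steps (the file path and sheet name are assumptions for illustration):

# hypothetical: read a medication list from an Excel file
# (the path and sheet name below are made up)
library(readxl)
antib_list <- readxl::read_excel("Data/raw_data/antibiotics.xlsx",
                                 sheet = "generic_names")
antib_list %>% dplyr::glimpse(60)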

Be sure to export the data files!

if (!file.exists("Data/processed_data")) {
  dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
                              "EHRMedTable",
                              as.character(Sys.Date()),
                              ".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text “Teaching Statistics: A Bag of Tricks” by Andrew Gelman and Deborah Nolan.

The following quote is from a section titled “learn how to learn new technologies”:

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogy, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added ‘efficiently’ because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here.

Literate Programming & Dynamic Document Options in Stata

TL;DR

I’ve spent the last few months attempting to incorporate different literate programming and reproducible research options into my work with the Stata statistical software. This post provides a quick overview of my goal, a brief “how-to” for each option, and my thoughts on realistically introducing them into a workflow.


What is literate programming?

Literate programming is a term coined by Donald E. Knuth in 1984. The general idea is to combine human-readable text with machine-readable commands in the same document, manual, or website. At the beginning of his paper, Knuth writes,

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Although Knuth is describing the process of writing computer programs, his concept applies to any scenario where a series of commands are used to get a computer to perform a particular set of functions.

The data analysis process includes a successive set of commands used to manipulate the rawest form of the data, transform it into summaries or visualizations, and then model it for predictions and inferences. Each step in the process is based on the preceding step, so tracking the entire method in a systematic way is essential for remaining organized and being able to reproduce your work.

The yardstick I was using to evaluate each Stata literate programming option was how well each method provided a relatively seamless transition between the Stata commands and the human-readable explanations.

A Typical Stata Workflow

Below is an example workflow in Stata.

[Image: a typical Stata workflow]

While working in Stata, commands go into a .do file and are used to create output as either .gph (image) or .smcl (text) files. The text results then get written up in a .docx file, and the tables and figures are built in .xlsx. These documents are then sent off to a scientific journal somewhere, where the finished paper will most likely end up as a .pdf or .html on the web.

So to recap, the general process is:

.do >> .gph >> .smcl >> .docx >> .xlsx >> .pdf >> .html 

This workflow involves seven different file types to create a single finished manuscript (which will usually contain only text and images).

Why should I care?

If you’ve read this far, the answer should be obvious: the process outlined above is inefficient. The analysis process is split across Stata, MS Word, MS Excel, Adobe/Preview, and whatever web browser you’re using. This makes working in Stata tedious.

Solution: Digital Notebooks

I recently came across a white paper that discusses the benefits of using Notebooks, and I’ve summarized the main points below:

  • “…notebooks speed up the process of trying out data models and frameworks and testing hypotheses, enabling everyone to work quickly, iteratively and collaboratively”
  • “…can be used to perform an analysis, generate visualizations, and add media and annotations within a single document.”
  • “…can be used to annotate the analyses for future reference and keep track of each step in the data discovery process so that the notebook becomes a source of project communication and documentation, as well as a learning resource for other data scientists.”

Although the paper refers specifically to Jupyter Notebooks, RStudio recently introduced R Notebooks. Both combine similar sections of markdown-formatted text, data analysis commands, output tables and/or figures, and other relevant portions of the results. These digital notebooks closely resemble paper laboratory notebooks (see below).

[Slideshow: pages from a paper laboratory notebook]

As you can see from this example, some of the text and calculations have been handwritten, while others have been calculated outside of the notebook, printed, and then pasted back into the lab notebook. I commend the authors for their transparency, but this doesn’t seem like the most efficient way of keeping track of your work.

Does Stata have an equivalent option?

Sort of. Below I review my experience using three Stata options that collectively provide similar abilities to the notebooks provided by Python and R.


#1 markdoc

markdoc is a package for creating dynamic documents in Stata. It is very similar to the knitr package in R. The package was written by E. F. Haghish and is available on GitHub. To run markdoc, you’ll need to install Pandoc (which requires a type-setting software for TeX/LaTeX–I used MiKTeX on my PC and Homebrew on my Mac).

After installing Pandoc and MiKTeX, you’ll also need to download and install wkhtmltopdf.

Installing markdoc

You can install markdoc with the ssc command

ssc install markdoc

You should see:

checking markdoc consistency and verifying not already installed...
installing into c:\ado\plus\...
installation complete.

You’ll also need to install weaver and statax:

ssc install weaver
ssc install statax


markdoc also has a handy dialog box you can install with the following command:

db markdoc

The Output Files

Haghish provides example .do files for getting started with markdoc. I recommend working through each of them, but it shouldn’t be too difficult if you’re used to commenting your .do files or writing in markdown; a rough sketch is below. The .docx and .pdf output files are clean, organized, and formatted.
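Here is a sketch of what a markdoc-ready .do file can look like, based on the package’s documented /*** … ***/ markers for markdown text (the analysis lines are placeholders):

/***
# Auto data summary

The Stata output below summarizes price and mileage
for the shipped auto dataset.
***/

sysuse auto, clear
summarize price mpg

Running markdoc on a file like this should weave the markdown text and the Stata output into a single document.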

[Image: markdoc .docx output]

[Image: markdoc .pdf output]

markdoc is ideal for producing high-quality documents directly from the Stata IDE. After you understand the markdoc syntax, you will be able to perform the majority of your work in the .do file. The only downside I encountered with markdoc was a somewhat buggy installation–it worked better for me on the Mac. But the package is incredibly well maintained by the author, and I was eventually able to find answers to my questions on his GitHub page.


#2 weave

Germán Rodríguez at Princeton created weave, and I consider it a markdoc-light Stata package.

Installing weave

Installation is easy. Just type the following command into the Stata IDE.

net from http://data.princeton.edu/wws509/stata

And there’s an example .do file on his website.

weave essentially uses markdown/HTML tags for inline images and headers that are written directly into your .do files. The results are inserted into the output as plain text, so there is no need to tweak their formatting. When you’re finished with your analyses, you just type the following command directly into the IDE.

weave using sample

The beginning of the .do file contains a command that logs everything to a .usl file. The .usl file is then ‘weaved’ to create an .html output, which will automatically open in your web browser.

The Output Files

You can just print the .html file to a .pdf like you would any web page. Chrome seems to create the best-looking .pdfs.

[Image: weave .pdf output]

*TIP: use minimal lines in your .do file to create cleaner-looking output. I’ve created a detailed example here.

I use weave whenever I’m using Stata on my Mac. It’s easy to use, quick to format, and only requires me to have Stata open with a .do file. I’ll use markdoc if I am creating a more professional-looking report, but the bugginess of markdoc doesn’t make it very user-friendly.


#3 ipystata

Jupyter Notebooks (previously IPython Notebooks) can be configured to work with Stata commands. Unfortunately, the package works best on Windows. The setup isn’t too complicated, but it has a few steps that can trip you up.

Download Anaconda from Continuum

You can download the most recent version of Anaconda from Continuum. This will include the following applications:

[Image: the applications included with Anaconda]

The only application I will be covering in this post is the Anaconda Prompt.

Changing the Jupyter Notebook working directory

The first thing you will want to do is set up your Jupyter Notebook in an appropriate working directory. You can do this by right-clicking on the Anaconda Prompt and running it as an administrator (I’ve moved the application to the taskbar).

[Image: the Anaconda Prompt pinned to the taskbar]

When the prompt is displayed (it should say Administrator),

[Image: the Anaconda Prompt running as Administrator]

copy and paste the path of the folder you want the Jupyter Notebook to open in. In the Anaconda Prompt, type

cd C:\Users\Google Drive\...\ipystata\notebooks

This will change your working directory. Once you’re in the correct working directory, start Jupyter Notebook by typing the following command in your Anaconda Prompt:

jupyter notebook

This should open a new tab in your default web browser.

[Image: the Jupyter Notebook home tab in the browser]

You can open a new notebook using the tab on the far right of the screen by selecting “New” >> “Python [default]”.

Registering Stata

You will need to open a Command Prompt window as an administrator by right-clicking on the application and selecting “Run as administrator” (you can find this application by typing “cmd” into the Windows search bar).

[Image: running Command Prompt as administrator]

From here you need to navigate to your Stata application in your Program Files (usually on the C:\ drive).

[Image: the Stata application folder in Program Files]

Copy and paste the file path into the Command Prompt window, preceded by cd:

cd C:\Program Files (x86)\Stata14

[Image: the Command Prompt in the Stata directory]

From this location, register the Stata application by typing the name of the .exe file followed by a space and /Register.

[Image: registering the Stata executable]

*No news is good news on this command. 
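For example, if your Stata executable were named StataSE-64.exe (the filename is an assumption; check your installation folder), the command would be:

StataSE-64.exe /Register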

Installing ipystata and pandas

Now go back to your Jupyter Notebook and import pandas, an open-source data analysis package for Python that ships with Anaconda. Read about it here.

In the first line of your notebook type:

import pandas as pd

To install ipystata, you’ll need to open a Windows PowerShell window (as an administrator) and enter the following command:

pip install ipystata

[Image: installing ipystata in Windows PowerShell]

After the package has been installed, enter the following command in your Jupyter Notebook:

import ipystata

To test that it worked, type a simple display command preceded by the %%stata magic. The output should look like this:
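A minimal test cell might look like this sketch (the message text is arbitrary):

%%stata
display "Hello from Stata"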

[Image: the %%stata display output in a Jupyter Notebook]

Using the %%stata Magic Commands

Now that you are up and running, the Jupyter Notebook basically replaces your .do file. You just need to precede your Stata commands with a line containing the %%stata magic.

Start by loading a dataset that ships with Stata:

%%stata
sysuse auto, clear

You can get a quick overview of these data by using codebook, compact or describe, short.
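In a notebook cell, that is just another magic-prefixed block; a sketch:

%%stata
describe, short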

[Image: output of describe, short]

Including Graphs in the Output

To include graphs in your output, simply include the -gr flag on the same line as your %%stata command.
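For instance, a cell like this sketch should return the graph inline (graph matrix and the variable names are assumptions based on the auto dataset and the figure captions below):

%%stata -gr
graph matrix price mpg weight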

[Figure 1: matrix graph]

[Figure 2: scatter plot]

Sharing Your Output Online

In my opinion, the best part of using Jupyter Notebooks is the ability to share your work online. You can publish your notebook using the cloud+arrow icon on the toolbar (register your account first).

[Image: publishing the notebook with the cloud+arrow toolbar icon]

In fact, this notebook and a complete example of the ipystata package are available online. I think this feature makes Jupyter Notebook the best option for literate programming and reproducible research in Stata. The complicated setup is definitely worth the time investment because you’ll be able to keep an ongoing stream of commands, formatted text summaries, and output all in one place.