rstudio::conf() day 1 (resource dump)

I just finished my first day at rstudio::conf() and wanted to provide a list of references that came up throughout the day.

Books on teaching

These texts came up in the tidyverse train-the-trainer section today with Greg Wilson (check out his website here).

I mentioned Steven Pinker’s book The Sense of Style in the discussion on the curse of knowledge, and Git for Humans by David Demaree is great if you are ever sadistic enough to try and teach people how to use Git.

Visualization guide

This guide on selecting a chart was shown during the Cognitive Craft portion of the lecture. Another guide to check out is Visualizing the Uncertainty in Data by Nathan Yau.

Courses and workshops

Blogs to follow

I’m sure there are more I am forgetting, but I will add them in a future post.

Quickly import and export data from R with datapasta

EDIT: As usual, Mara Averick was way ahead of me on these two packages 🙂 She wrote a great example (with much cleaner .gifs), so be sure to check out her post too!

https://maraaverick.rbind.io/2018/10/reprex-with-datapasta/
Follow her on Twitter too: https://twitter.com/dataandme

The datapasta package is great for importing and exporting data. Its copy + paste abilities transform just about any rectangular data you can drag your mouse over into data.frames or tibbles.

Using datapasta

Check out the data in this table on male and female heights and weights. I can highlight it all and press cmd + c (or ctrl + c) to copy it to the clipboard.

Now I can head back to RStudio and enter the following in a fresh .R script.

library(datapasta)
library(tidyverse)
datapasta::tribble_paste()

You should see something like this:

ahhh pure bliss!

Pretty slick, huh? Unfortunately, when I try to run this code, I get the following error:

Error in list2(…) : object 'NANA' not found

It looks like the NANA values are throwing off the tribble_paste() function. No worries: datapasta has a df_paste() function too!
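As a sketch (assuming the same height and weight table is still on the clipboard), the fix is just swapping the paste function:

```r
# df_paste() writes the clipboard contents as a data.frame() call instead of
# a tribble() call, and it handles the "NANA" values without erroring
library(datapasta)
datapasta::df_paste()
```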

Can this get any better?

Now I can also go through and edit the data.frame() function a bit to remove the pesky NA columns and values.

What if my dataset is really big?

Maybe you have a huge dataset you want to import into R, but you’re not sure if datapasta can handle it? All it takes is a little adjustment to the datapasta::dp_set_max_rows() function.

For example, if I wanted to copy + paste this table into an RStudio session, I could enter datapasta::dp_set_max_rows(num_rows = 15000) in my .R script just above the tribble_paste() function.

datapasta::dp_set_max_rows(num_rows = 15000)
datapasta::tribble_paste()
it’s slow, but it gets there…

As you can see, tribble_paste() parsed this table into a tibble::tribble() call (and it was over 14,000 rows!).

But wait…there’s more!

What if I found a problem with the table I just pasta’d into R (or with another dataset)? I don’t know about you, but I’m constantly needing to ask questions or share code on Stack Overflow or RStudio Community.

Well, the handy-dandy dpasta() function helps me create excellent reproducible examples.

Let’s assume I needed to share a sample from the height and weight data I just imported (WordHtWt).

  • First I’ll add some meaningful names to the columns in WordHtWt (using magrittr::set_names()),
  • Then take a small sample with help from dplyr (it’s a good idea to always use the smallest possible data frame to re-create the problem),
  • And…
# new names
world_height_weight_names <- c("country", "male_avg_ht_m",
                               "male_avg_wt_kg", "male_bmi",
                               "female_avg_ht_m", "female_avg_wt_kg",
                               "female_bmi")

# clean and set names
WordHtWt %>%
  # set some better names
  magrittr::set_names(world_height_weight_names) %>%
  # get a sample for reprex
  dplyr::sample_frac(size = 0.10) %>%
  # PASTA!!!
  dpasta()
Voila!

Now I have a nice bit of code I can post (and hopefully get my questions answered).

Friends and alternatives to datapasta

datapasta plays well with the reprex package. If you aren’t sure what reprex does, watch the webinar from Jenny Bryan (the package author). If you are looking for a base R alternative to dpasta(), there’s dput(), but the output is not as clean (and it doesn’t have a direct analog to the _paste() functions).


# clean and set names
WordHtWt %>%
  # set some better names
  magrittr::set_names(world_height_weight_names) %>%
  # get a sample for reprex
  dplyr::sample_frac(size = 0.10) %>%
  # try this with dput()
  dput()

structure(list(country = c("Taiwan", "Burma", "Kazakhstan", "Bolivia",
"Belgium", "Mali", "Mauritius", "Laos", "Burundi", "France",
"Mexico", "Nigeria", "Turkey"), male_avg_ht_m = c("1.73 m", "1.65 m",
"1.72 m", "1.67 m", "1.81 m", "1.72 m", "1.71 m", "1.60 m", "1.68 m",
"1.79 m", "1.68 m", "1.67 m", "1.74 m"), male_avg_wt_kg = c("74.8 kg",
"60.4 kg", "77.8 kg", "70.6 kg", "87.8 kg", "67.7 kg", "71.9 kg",
"57.9 kg", "61.5 kg", "83.3 kg", "77.6 kg", "63.0 kg", "82.4 kg"
), male_bmi = c(25, 22.2, 26.3, 25.3, 26.8, 22.9, 24.6, 22.6,
21.8, 26, 27.5, 22.6, 27.2), female_avg_ht_m = c("1.60 m", "1.54 m",
"1.60 m", "1.53 m", "1.65 m", "1.61 m", "1.57 m", "1.51 m", "1.55 m",
"1.65 m", "1.56 m", "1.58 m", "1.60 m"), female_avg_wt_kg = c("60.7 kg",
"54.5 kg", "68.1 kg", "64.8 kg", "70.0 kg", "59.9 kg", "64.1 kg",
"52.4 kg", "51.7 kg", "66.4 kg", "69.4 kg", "59.9 kg", "73.7 kg"
), female_bmi = c(23.7, 23, 26.6, 27.7, 25.7, 23.1, 26, 23, 21.5,
24.4, 28.5, 24, 28.8)), row.names = c(NA, -13L), class = "data.frame")

Additional resources

Check out the vignette for datapasta here, the tidyverse packages, and an excellent description of how to write a reproducible example from Advanced R by Hadley Wickham. Be sure to thank the datapasta author, Miles McBain, for all the future headaches he just saved you from.

Migrating from a Microsoft Access Database to MySQL with MS ACCDB Viewer

Motivation

The new dbplyr is out on CRAN today, and I’ve wanted to update my MySQL Lahman baseball database with 2017 data, but it’s only available as .csv files and an MS Access database. I could import the .csv files, but I thought it would be useful to show how I convert an MS Access database into MySQL using MS ACCDB Viewer. This application has a free trial (the paid version is under $20). I rarely recommend purchasing software, but if you use macOS and find yourself needing to get into an MS Access database, this tool comes in handy.

Getting started

This post assumes:

  1. You are running macOS (this tutorial was built using version 10.14.2)
  2. You have MySQL installed on your machine (see previous post to get everything set up)
  3. You have RStudio installed (using version 1.2.1206)
  4. You have R installed (using version 3.5.1)

Packages

These are the packages you’ll need.

library(RMySQL)
library(RMariaDB)
library(dplyr)
library(tidyverse)
library(Lahman)
library(magrittr)
library(rstudioapi)

Project folder setup

I created an RStudio project file inside a mysql-mdb-accdb folder (mysql-mdb-accdb.Rproj). If you don’t know how RStudio’s project files work, see this post for more information.

Inside this project folder, I created another sub-folder titled databases. I wanted to separate these files by type, so I created a new folder for each version of the 2017 data.

databases/csv
databases/ms-access
databases/sql

After downloading the data, the full folder tree looks like this:

databases
├── csv
│   └── baseballdatabank-master
│       ├── README.txt
│       ├── core
│       │   ├── AllstarFull.csv
│       │   ├── Appearances.csv
│       │   ├── AwardsManagers.csv
│       │   ├── AwardsPlayers.csv
│       │   ├── AwardsShareManagers.csv
│       │   ├── AwardsSharePlayers.csv
│       │   ├── Batting.csv
│       │   ├── BattingPost.csv
│       │   ├── CollegePlaying.csv
│       │   ├── Fielding.csv
│       │   ├── FieldingOF.csv
│       │   ├── FieldingOFsplit.csv
│       │   ├── FieldingPost.csv
│       │   ├── HallOfFame.csv
│       │   ├── HomeGames.csv
│       │   ├── Managers.csv
│       │   ├── ManagersHalf.csv
│       │   ├── Parks.csv
│       │   ├── People.csv
│       │   ├── Pitching.csv
│       │   ├── PitchingPost.csv
│       │   ├── Salaries.csv
│       │   ├── Schools.csv
│       │   ├── SeriesPost.csv
│       │   ├── Teams.csv
│       │   ├── TeamsFranchises.csv
│       │   ├── TeamsHalf.csv
│       │   └── readme2014.txt
│       └── upstream
│           └── Teams.csv
└── ms-access
    └── lahman2017
        ├── lahman2017.mdb
        └── readme2017.txt

MS Access to MySQL using MS ACCDB Viewer

  1. Download the MS ACCDB Viewer app and the Lahman 2017 MS Access database and .csv files. I’ve created the following sub-folders for the two types of Lahman database files.
  2. Open the MS Access database in the MS ACCDB Viewer app and follow the steps below.

  3. After the database exports, you will see a new lahman2017.sql file in your databases/sql folder. This file is too large to open in RStudio (it’s 46.4 MB), so I’ll open it in a text editor (I use Sublime Text, but many others will work).

What should I expect to see?

If you haven’t done a lot of work in SQL, you will have no way of knowing if this SQL export is ready to be loaded into MySQL (I didn’t know what to expect until I did a lot of reading and trial and error). However, I can look at the 2016 MySQL database from the Lahman website to see what to expect. Below are the first 50 lines of each SQL file (2016 on the left, 2017 on the right).

The important differences in the first ~45 lines:

  1. There is no command for creating the lahman2017 database (i.e. CREATE DATABASE)
  2. After creating the database, there is no command to use the lahman2017 database
  3. The SET NAMES argument should be changed to SET NAMES utf8mb4 (this will include characters like emojis)
  4. All the tables and data have clear headers in the previous .sql file, and it’s worthwhile to add these to this version too (though it’s not required)
/*

This MySQL dump came from the Lahman 2017 mdb file available here:
http://www.seanlahman.com/baseball-archive/statistics/
The following was added:
A CREATE DATABASE statement
A USE database statement
Change the SET NAMES 'UTF8'; to SET NAMES utf8mb4;
Add the 'Table structure' and 'Records' titles (optional)

*/
CREATE DATABASE IF NOT EXISTS lahman2017;
USE lahman2017;
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

The first 50 lines of my lahman2017.sql file now look like this:

RStudio’s Terminal pane

In previous posts, I mentioned opening and using Terminal. RStudio also has its own Terminal pane (see below).

This is accessible under Tools > Terminal > New Terminal.

In a new Terminal pane, enter the following commands.

$ cd databases/sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH

This returns what looks like a long and convoluted file path (this is normal).

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/….

Now I enter the following:

mysql -u root -p 

Enter your password and you should see the mysql> prompt. I want to check the databases in MySQL before adding the new lahman2017 data, and I can do this with SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| lahman2016 |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)

Because I moved into the sql folder, I can execute the lahman2017.sql file by entering source lahman2017.sql into the MySQL prompt.

mysql> source lahman2017.sql

This will return a long output telling you things like Query OK and Records: 11370 Duplicates: 0 Warnings: 0. When it’s finished, check the new tables with SHOW TABLES;

You should see this.

Great! Now I have the new database set up in MySQL. I am going to connect to RStudio and start querying.

Changing MySQL settings

Before connecting to MySQL from RStudio, you might need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption
  3. Then click Start MySQL Server

The dplyr::src_mysql() function

In the previous post, I set up my connection to RStudio using the DBI package.

This time I will set up my connection with the dplyr package (part of the tidyverse). The function that does this is dplyr::src_mysql(), and I include arguments for dbname, host, user, and password = rstudioapi::askForPassword("Database password"), which prompts a fill-in box for my password.

LahmandplyrMySQL <- dplyr::src_mysql(dbname = "lahman2017",
                                     host = "localhost",
                                     user = "root",
                                     password = rstudioapi::askForPassword("Database password"))

Print the LahmandplyrMySQL object to the console.

LahmandplyrMySQL
src: mysql 8.0.13 [root@localhost:/lahman2017]
tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers,
AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost,
CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost,
HallOfFame, HomeGames, Managers, ManagersHalf, Parks, People, Pitching,
PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises,
TeamsHalf

As you can see, this prints out much more information than the DBI connection in my previous post. The LahmandplyrMySQL object tells me the source of the connection and the tables in the database.

ALWAYS read the README file (readme2017.txt)

The directions are right there in the title. Important stuff is in these files (like changes to the tables in the database). This section is an example of such information:

One notable change: The name of the table that contains biographical information for players has been changed from “Master” to “People” to better reflect its contents.

This information saved me from some potential headaches (my previous query on the Master table would give an error).
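As a sketch (assuming the LahmandplyrMySQL connection from above), updating an old query is just a matter of swapping the table name:

```r
# the old 2016-era query now errors, because Master no longer exists:
# dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM Master"))

# the updated query against the renamed table:
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People"))
```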

In the next section, I’ll go over the new changes in dbplyr using the dplyr connection. The dplyr::tbl() function takes the src argument, which is the connection to the database (LahmandplyrMySQL). The next argument is sql(), which takes a character string (inside " ") of SQL commands. I like to capitalize the MySQL syntax, refer to tables in SQL with TitleCase, and write the columns in lowercase.

API changes

1) Calls of the form dplyr::foo() are now evaluated in the database, rather than locally.

I am assuming this means that queries written with dplyr verbs (instead of MySQL) are evaluated in the database, such as the following. I also pass everything to dplyr::glimpse(78) so it prints nicely to the window.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view object
  dplyr::glimpse(78)

Observations: ??
Variables: 24
Database: mysql 8.0.13 [root@localhost:/lahman2017]
$ playerID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…
$ birthYear 1964, 1987, 1943, 1963, 1971, 1942, 1981, 1934, 1939, …
$ birthMonth 10, 11, 7, 9, 12, 8, 12, 2, 8, 9, 2, 8, 6, 9, 7, 1, 9,…
$ birthDay 26, 13, 15, 21, 28, 9, 27, 5, 5, 8, 16, 17, 2, 15, 15,…
$ birthCountry "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ birthState "IL", "NY", "CA", "CA", "HI", "AL", "CO", "AL", "AL", …
$ birthCity "Chicago", "Staten Island", "Long Beach", "Escondido",…
$ deathYear NA, NA, 2016, NA, NA, 2001, NA, NA, 1984, NA, NA, NA, …
$ deathMonth NA, NA, 2, NA, NA, 1, NA, NA, 8, NA, NA, NA, NA, NA, N…
$ deathDay NA, NA, 15, NA, NA, 22, NA, NA, 16, NA, NA, NA, NA, NA…
$ deathCountry NA, NA, "USA", NA, NA, "USA", NA, NA, "USA", NA, NA, N…
$ deathState NA, NA, "CA", NA, NA, "NY", NA, NA, "GA", NA, NA, NA, …
$ deathCity NA, NA, "Long Beach", NA, NA, "New York", NA, NA, "Atl…
$ nameFirst "Steve", "Tim", "Dave", "Troy", "Benny", "Tommie", "Da…
$ nameLast "Adkins", "Adleman", "Adlesh", "Afenir", "Agbayani", "…
$ nameGiven "Steven Thomas", "Timothy Max", "David George", "Micha…
$ weight 210, 225, 187, 185, 225, 195, 215, 180, 190, 190, 200,…
$ height 78, 77, 72, 76, 72, 71, 75, 72, 75, 75, 78, 74, 71, 75…
$ bats "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ throws "L", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ debut "1990-09-12", "2016-05-01", "1963-05-12", "1987-09-14"…
$ finalGame "1990-10-03", "2017-09-19", "1968-09-28", "1992-07-10"…
$ retroID "adkis001", "adlet001", "adled101", "afent001", "agbab…
$ bbrefID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…

2) The vars argument to tbl_sql() has been formally deprecated.

The vars argument was used to supply a list of columns in the tbl. The previous documentation reads:

If known, the names of the variables in the tbl. This is relatively expensive to determine automatically, so is cached throughout dplyr. However, you should usually be able to leave this blank and it will be determined from the context.

3) src and tbl objects now include a class generated from the class of the underlying connection object. This makes it possible for dplyr backends to implement different behaviour at the dplyr level, when needed.

I assume this is visible in the str() and class() output of the previous query.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view structure
  utils::str()


List of 2
$ src:List of 2
..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
.. .. ..@ Id: int [1:2] 0 0
..$ disco:
..- attr(*, "class")= chr [1:4] "src_MySQLConnection" "src_dbi" "src_sql" "src"
$ ops:List of 4
..$ name: chr "filter"
..$ x :List of 2
.. ..$ x : 'sql' chr "SELECT * FROM People"
.. ..$ vars: chr [1:24] "playerID" "birthYear" "birthMonth" "birthDay" …
.. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
..$ dots:List of 1
.. ..$ : language ~bats == "R" & birthYear >= 1900
.. .. ..- attr(*, ".Environment")=
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
- attr(*, "class")= chr [1:5] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" …

The class

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view the class
  base::class()

[1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
[4] "tbl_lazy" "tbl"

SQL translation

I will skip over “The new as.integer64(x) is translated to CAST(x AS BIGINT)”.

cummean() now translates to SQL AVG() as opposed to MEAN().

The cummean() function is an aggregate window function. It works by keeping track of 1) the running sum of the values, and 2) the index (or count) of the values. There is a great example of how this works on StackOverflow here.

I created an example data frame below to show how this works.

# get a sample
x <- base::sample(1:10)
# create a data frame of the intermediate steps and cummean
base::data.frame(sample = x,
                 cumsum_x = cumsum(x),
                 seq_along_x = seq_along(x),
                 cummean = cumsum(x) / seq_along(x))
 sample cumsum_x seq_along_x  cummean
      5        5           1 5.000000
      3        8           2 4.000000
      4       12           3 4.000000
      2       14           4 3.500000
      1       15           5 3.000000
      7       22           6 3.666667
      8       30           7 4.285714
      9       39           8 4.875000
     10       49           9 5.444444
      6       55          10 5.500000

The cumsum_x column is a running total of sample (5 + 3 = 8, 5 + 3 + 4 = 12, etc.), and seq_along_x is the count of the values. When I take the ratio cumsum_x / seq_along_x, I get the cummean column (5 / 1 = 5, 8 / 2 = 4.0, etc.).
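You can verify this identity directly in R (a quick sketch; dplyr ships a cummean() function that computes exactly this ratio):

```r
library(dplyr)
x <- c(5, 3, 4, 2, 1)
# the running-sum / running-count ratio matches dplyr's cummean()
all.equal(cumsum(x) / seq_along(x), cummean(x))
# [1] TRUE
```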

This is the cumulative mean of RBIs in the 2017 Batting data table.

# get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::mutate(cummean_rbi = cummean(RBI)) %>%
  dplyr::select(playerID, cummean_rbi) %>%
  utils::head(5)

# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(RBI)
playerID cummean_rbi

1 wilsoha01 191
2 gehrilo01 188.
3 greenha01 186
4 foxxji01 183.
5 gehrilo01 182.

Compare this to the mean RBI per player.

# get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::group_by(playerID) %>%
  dplyr::summarize(mean_rbi = mean(RBI, na.rm = TRUE)) %>%
  dplyr::select(playerID, mean_rbi) %>%
  dplyr::arrange(desc(mean_rbi)) %>%
  utils::head(5)
# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(mean_rbi)
playerID mean_rbi
1 dimagjo01 118.
2 gehrilo01 117.
3 pujolal01 113.
4 cabremi01 108.
5 belleal01 103.

Just for kicks, the AVG() function in MySQL is implemented like this:

mysql> SELECT 
playerID,
AVG(RBI) AS avg_rbi
FROM
Batting
GROUP BY
playerID
ORDER BY
avg_rbi DESC
LIMIT 5;
+-----------+----------+
| playerID | avg_rbi |
+-----------+----------+
| dimagjo01 | 118.2308 |
| gehrilo01 | 117.3529 |
| pujolal01 | 112.8235 |
| cabremi01 | 107.5333 |
| belleal01 | 103.2500 |
+-----------+----------+
5 rows in set (0.07 sec)

Closing

Moving from one relational database management system to another can slow down your workflow, so having (inexpensive) tools to quickly transform a database into something you can work with is a time-saver. I don’t like leaving the RStudio IDE (unless it’s to my browser), so the quicker I can get my data source into this environment, the quicker I can get to work on it.

Next post will probably be on PostgreSQL.

Setting up PostgreSQL on macOS

This tutorial was written in RMarkdown in RStudio (version 1.2.1206, R version 3.5.1), using mac OS High Sierra (10.13.6).

Why install PostgreSQL?

PostgreSQL is an open-source relational database management system (RDBMS). You can install PostgreSQL on Linux, Unix, Windows, or Mac. Users can write and execute stored procedures and functions from other programming languages (like R and Python). Multiple data types can also be stored (and even defined) in PostgreSQL, including JSON and other nonrelational data.

Read more about PostgreSQL here.

PostgreSQL application

Download the application from this website and follow the installation instructions.

The syntax documentation for PostgreSQL is here.

First steps

After downloading and installing the PostgreSQL application, open a new Terminal window. In the Terminal, enter the following:

$ sudo mkdir -p /etc/paths.d &&
  echo /Applications/Postgres.app/Contents/Versions/latest/bin | sudo tee /etc/paths.d/postgresapp

You’ll be asked to enter whatever password you need to install programs on your computer. Type it in and hit enter.

Install PostgreSQL using Homebrew

You can also install PostgreSQL using homebrew. If you don’t have homebrew installed, learn how to do so here.

In a fresh terminal window, enter the following:

$ brew install postgresql
==> Installing postgresql
==> Downloading https://homebrew.bintray.com/bottles/postgresql-11.1.mojave.bottle.tar.gz
################################################################## 100.0%
==> Pouring postgresql-11.1.mojave.bottle.tar.gz
==> Caveats
To migrate existing data from a previous major version of PostgreSQL run:
brew postgresql-upgrade-database
To have launchd start postgresql now and restart at login:
brew services start postgresql
Or, if you don't want/need a background service you can just run:
pg_ctl -D /usr/local/var/postgres start
==> Summary
🍺 /usr/local/Cellar/postgresql/11.1: 3,548 files, 40.3M

Launching PostgreSQL from Terminal

Read the documentation for a more in-depth explanation.

In a fresh Terminal session (If you are using RStudio’s Terminal pane, select Tools > Terminal > New Terminal), enter the following.

psql postgres

This will bring up the psql command line.

psql (10.5)
Type "help" for help.

postgres=#

The =# is where we will be entering our postgres commands.

Check users, roles and attributes

We can access the users, their roles and attributes in PostgreSQL with the following:

postgres=# \du

This will return a table with the following information:

yourname=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
yourname | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Add a new user

The postgres=# prompt indicates I am currently a Superuser. To create a new user named martymcfly, I use CREATE USER with a WITH PASSWORD clause:

=# CREATE USER martymcfly WITH PASSWORD 'strongpassword123';
-- CREATE ROLE

PostgreSQL typically returns the command that was previously executed. Because CREATE USER is another spelling of CREATE ROLE, PostgreSQL returns CREATE ROLE in the command prompt.

I will check to make sure this is true with \du again.

=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
yourname | Superuser, Create role, Create DB | {}
martymcfly | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Grant attributes to new user

As you can see, martymcfly has no roles or attributes. I want this user to be able to create a new database, which I can grant with the following:

=# ALTER ROLE martymcfly CREATEDB;

And I can verify this with the \du command again.

=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
martinfrigaard | Superuser, Create role, Create DB | {}
martymcfly | Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Creating a database

I’m going to switch over to the new user I just created by exiting PostgreSQL and signing back in as martymcfly:

=# \q

Now sign in as a new user

psql postgres -U martymcfly

This brings up a new PostgreSQL prompt.

postgres=>

The => tells me I am no longer the Superuser, but I know that I have given martymcfly the relevant permissions to create a new database.

Below are the steps needed to 1) create a new database (I’m creating a Lahman baseball database in PostgreSQL), 2) grant privileges to the user martymcfly on the new lahman2017 database, 3) list all the current databases in PostgreSQL, and 4) connect to the lahman2017 database.

First create lahman2017

CREATE DATABASE lahman2017;
-- CREATE DATABASE

Then give martymcfly permissions

 GRANT ALL PRIVILEGES ON DATABASE lahman2017 TO martymcfly;
-- GRANT

Now list the databases

\list

This returns:

                             List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges |
----------------+----------------+----------+-------------+-------------+--------------------------
lahman2017 | martymcfly | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =Tc/martymcfly +
| | | | | martymcfly=CTc/martymcfly
martinfrigaard | martinfrigaard | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)

Finally, we connect to the new database

\connect lahman2017
psql (11.1, server 11.0)
You are now connected to database "lahman2017" as user "martymcfly".

Great! Next time we will build the schemas and tables in the lahman2017 database.

Querying MySQL from RStudio

In the previous post I demonstrated how to install and use MySQL from the command line. In this tutorial I will show you how to set up and query a relational database management system (RDBMS) from RStudio.

The Lahman database

These queries are done using the Lahman baseball data set available here. I chose these data because 1) they are open to anyone and can be run on your local machine, and 2) they are available as a .sql file and .csv files for loading into any RDBMS software.

Database packages in RStudio

For this tutorial I will be using a local instance of MySQL with help from the DBI, dplyr, and RMySQL packages.

library(DBI) 
library(RMySQL)
library(RMariaDB)
library(dplyr)
library(rstudioapi)
library(ggthemes)

Change MySQL settings

Before we can connect to a database with RStudio, we need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption.
  3. Then click Start MySQL Sever

Connect to MySQL

A DBI connection is built below to my local instance of MySQL.

LahmanDBIMySQL <- DBI::dbConnect(MySQL(),
                                 dbname = "lahman2016",
                                 host = "localhost",
                                 user = "root",
                                 password = rstudioapi::askForPassword(prompt = "database password"))
LahmanDBIMySQL

As you can see, this is a <MySQLConnection:0,1>. This is the object we will use to access the lahman2016 database.

Querying a database from RStudio

We can get a list of tables in the lahman2016 database by using the DBI::dbListTables() function.

DBI::dbListTables(LahmanDBIMySQL)
[1] "AllstarFull" "Appearances" "AwardsManagers"
[4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost" "CollegePlaying"
[10] "Fielding" "FieldingOF" "FieldingOFsplit"
[13] "FieldingPost" "Griffeys" "HallOfFame"
[16] "HomeGames" "Managers" "ManagersHalf"
[19] "Master" "Parks" "Pitching"
[22] "PitchingPost" "Salaries" "Schools"
[25] "SeriesPost" "Teams" "TeamsFranchises"
[28] "TeamsHalf"

This is similar to the SHOW TABLES; command in MySQL. In the previous tutorial, I built a table of statistics on Ken Griffey Junior and Senior in the Lahman database. The Griffeys table contains the following fields:

DBI::dbListFields(LahmanDBIMySQL, "Griffeys") 
## [1] "playerID" "birthYear" "nameFirst" "nameLast" "weight"
## [6] "height" "bats" "throws" "retroID" "bbrefID"
## [11] "teamID" "lgID" "yearID" "G" "AB"
## [16] "R" "H" "HR" "RBI" "BB"
## [21] "SO"

Queries can be passed directly to the DBI::dbGetQuery() function, like the SELECT statement below.

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT * FROM Griffeys;") %>%
  dplyr::glimpse(78)

Observations: 45
Variables: 21
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ retroID "grifk001", "grifk001", "grifk001", "grifk001", "grifk001…
$ bbrefID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…

I realize I don’t have all the columns I want from the Batting and Fielding tables, so I will query lahman2016 again to collect these data.

Aliasing columns in MySQL tables

There are three columns in the Batting table I want to add to Griffeys: SF, 2B, and 3B. Two of these names violate naming rules in R.

DBI::dbListFields(LahmanDBIMySQL, "Batting") 
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "AB" "R" "H" "2B" "3B" "HR"
[13] "RBI" "SB" "CS" "BB" "SO" "IBB"
[19] "HBP" "SH" "SF" "GIDP"

The 2B and 3B columns contain the number of doubles and triples for each player, and SF is the number of sacrifice flies. Unfortunately, syntactic names in R can’t start with a number. I can write a query that includes these columns but uses aliases to create R-friendly column names.
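As an aside, non-syntactic names are still usable in R via backticks; here is a small sketch with a hypothetical doubles data frame:

```r
# check.names = FALSE keeps the non-syntactic column name;
# backticks are then required to access it
doubles <- data.frame(`2B` = c(5, 9, 15), check.names = FALSE)
doubles$`2B`
```

Aliasing in the SQL query simply avoids the backticks entirely.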

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;") %>% glimpse(78)

Observations: 45
Variables: 22
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…
$ dubs 5, 9, 15, 28, 35, 33, 27, 28, 21, 23, 21, 20, 28, 7, 15, …
$ trips 1, 5, 9, 9, 8, 8, 4, 10, 6, 2, 3, 1, 4, 0, 3, 1, 0, 0, 3,…
$ sac_flies "0", "0", "3", "3", "2", "3", "3", "5", "4", "3", "2", "4…
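As an aside, base R's automatic name repair shows exactly what happens to those columns without an alias: make.names() prepends an X to any name that starts with a digit, which is why the Lahman R package exposes doubles and triples as X2B and X3B. A quick sketch:

```r
# make.names() turns invalid names into syntactic ones;
# names starting with a digit get an "X" prepended
make.names(c("2B", "3B", "SF"))
#> [1] "X2B" "X3B" "SF"

# backticks are the only way to use the raw names directly
df <- data.frame(`2B` = 5, check.names = FALSE)
df$`2B`
#> [1] 5
```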

Storing SQL in character vectors

I can also assign the query above to a character vector (batting_query) and pass that vector to the DBI::dbGetQuery() function. I will overwrite the Griffeys table with the result so it includes the new columns.

batting_query <- "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;"
Griffeys <- DBI::dbGetQuery(LahmanDBIMySQL, batting_query)
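Keeping the SQL in a character vector also makes it easy to parameterize queries safely. Instead of pasting values into the string, DBI::sqlInterpolate() fills ?placeholders with properly quoted values. Here DBI::ANSI() stands in for a live connection, just as a sketch:

```r
library(DBI)

param_query <- "SELECT * FROM Batting WHERE playerID = ?id AND yearID = ?yr"

# values are quoted according to the connection's SQL dialect,
# which guards against SQL injection
DBI::sqlInterpolate(DBI::ANSI(), param_query, id = "griffke01", yr = 1975)
#> <SQL> SELECT * FROM Batting WHERE playerID = 'griffke01' AND yearID = 1975
```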

Now that the SQL works and the two tables are joined, we can calculate a few new statistics in R.

Calculating On-base plus slugging (OPS)

A player's on-base plus slugging (OPS) is a measure of their overall batting performance: the sum of the player's on-base percentage and slugging percentage. The steps to calculate this statistic are below.

GriffsOPSbyYear <- Griffeys %>% 
dplyr::mutate(
# slugging percentage
slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
# convert sac_flies to numeric
sac_flies = as.numeric(sac_flies),
# on-base percentage (AB already includes hits; HBP omitted because it wasn't selected)
ob_perc = (H + BB) / (AB + BB + sac_flies),
# On-base plus slugging
obslug_perc = slug_perc + ob_perc,
# create a nameID
nameID =
case_when(
birthYear == 1950 ~ "Ken Griffey Sr",
birthYear == 1969 ~ "Ken Griffey Jr"),
# convert to a factor
nameID = factor(nameID),
# make KGSr the reference
nameID = relevel(nameID, ref = "Ken Griffey Sr"))
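As a quick sanity check on the formulas, here is the arithmetic for Griffey Sr.'s 1973 rookie line from the glimpse above (H = 33, doubles = 5, triples = 1, HR = 3, AB = 86, BB = 6, SF = 0), using the standard definitions (HBP is left out because it wasn't pulled from the database):

```r
H <- 33; dubs <- 5; trips <- 1; HR <- 3; AB <- 86; BB <- 6; SF <- 0

singles     <- H - dubs - trips - HR              # 24
total_bases <- singles + 2*dubs + 3*trips + 4*HR  # 49
slug_perc   <- total_bases / AB                   # 49 / 86
ob_perc     <- (H + BB) / (AB + BB + SF)          # 39 / 92
ops         <- slug_perc + ob_perc

round(c(slug_perc, ob_perc, ops), 3)
#> [1] 0.570 0.424 0.994
```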

I created a few intermediate calculations before creating obslug_perc, and I also created a nameID so the two players can be graphed side by side. Below is the on-base plus slugging (OPS) statistic for both Griffeys over their careers.

GriffsOPSbyYear %>% 
ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
group = nameID, color = teamID)) +
geom_line() +
ggplot2::facet_grid(. ~ nameID, scales = "free") +
ggplot2::labs(
title = "The Griffeys' career on-base plus slugging (OPS)"
) + ggthemes::theme_fivethirtyeight()

The Lahman package in R

There is also a Lahman package in R. I will use this below to demonstrate some of the similarities between dplyr and MySQL syntax.

First I create an object for the Lahman::Master, Lahman::Fielding, and Lahman::Batting tables.

Master <- Lahman::Master
Batting <- Lahman::Batting
Fielding <- Lahman::Fielding

Then the code below builds the same table and graph using dplyr commands.

GriffeyOPSPlot <- Master %>% 
dplyr::select(playerID,
birthYear,
nameLast) %>%
# find Griffeys
dplyr::filter(nameLast %in% "Griffey") %>%
# join to batting table
dplyr::inner_join(x = .,
y = Batting,
by = "playerID") %>%
# select/rename relevant columns
dplyr::select(teamID,
nameLast,
playerID,
birthYear,
yearID,
G,
AB,
R,
H,
HR,
dubs = X2B,
trips = X3B,
RBI,
BB,
SO,
sac_flies = SF) %>%
dplyr::mutate(
# slugging percentage
slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
# convert sac_flies to numeric
sac_flies = as.numeric(sac_flies),
# on-base percentage (AB already includes hits; HBP omitted for consistency with the SQL version)
ob_perc = (H + BB) / (AB + BB + sac_flies),
# On-base plus slugging
obslug_perc = slug_perc + ob_perc,
# create a nameID
nameID =
case_when(
birthYear == 1950 ~ "Ken Griffey Sr",
birthYear == 1969 ~ "Ken Griffey Jr"),
# convert to a factor
nameID = factor(nameID),
# make KGSr the reference
nameID = relevel(nameID, ref = "Ken Griffey Sr")) %>%
# create plot of new OPS
ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
group = nameID, color = teamID)) +
geom_line() +
ggplot2::facet_grid(. ~ nameID, scales = "free") +
ggplot2::labs(
title = "The Griffeys' career on-base plus slugging (OPS)"
) +
ggthemes::theme_fivethirtyeight()
ggplot2::ggsave(filename = "GriffeyOPSPlot.png", plot = GriffeyOPSPlot, width = 7, height = 5, units = "in")
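The parallel between dplyr and SQL is more than cosmetic: with the dbplyr backend, a dplyr pipeline on a database table is translated into SQL for you. A minimal sketch using an in-memory SQLite database (assumes the RSQLite and dbplyr packages are installed):

```r
library(dplyr)

# load the Lahman Batting table into a throwaway SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "Batting", Lahman::Batting)

# the pipeline is lazy: show_query() prints the generated SELECT
# statement instead of running it
tbl(con, "Batting") %>%
  dplyr::filter(playerID == "griffke01") %>%
  dplyr::select(yearID, teamID, HR) %>%
  dplyr::show_query()

DBI::dbDisconnect(con)
```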

End