rstudio::conf() day 1 (resource dump)

I just finished my first day at rstudio::conf() and wanted to provide a list of references that came up throughout the day.

Books on teaching

These texts came up in the tidyverse train-the-trainer section today with Greg Wilson (check out his website here).

I mentioned Steven Pinker’s book The Sense of Style in the discussion on the curse of knowledge, and Git for Humans by David Demaree is great if you are ever sadistic enough to try and teach people how to use Git.

Visualization guide

This guide on selecting a chart was shown during the Cognitive Craft portion of the lecture. Another guide to check out is Visualizing the Uncertainty in Data by Nathan Yau.

Courses and workshops

Blogs to follow

I’m sure there are more I am forgetting, but I will add them in a future post.

Quickly import and export data from R with datapasta

EDIT: As usual, Mara Averick was way ahead of me on these two packages 🙂 She wrote a great example (with much cleaner .gifs), so be sure to check out her post too!

https://maraaverick.rbind.io/2018/10/reprex-with-datapasta/
follow her on twitter too! -> https://twitter.com/dataandme

The datapasta package is great for importing and exporting data. Its copy + paste ability transforms just about any rectangular data you can drag your mouse over into data.frames or tibbles.

Using datapasta

Check out the data in this table on male and female heights and weights. I can highlight it all, press cmd + c (or ctrl + c), and copy it to my clipboard.

Now I can head back to RStudio and enter the following in a fresh .R script.

library(datapasta)
library(tidyverse)
datapasta::tribble_paste()

You should see something like this:

ahhh pure bliss!

Pretty slick, huh? Unfortunately, when I try to run this code, I get the following error:

Error in list2(…) : object 'NANA' not found

It looks like the NANA values are throwing the tribble_paste() function off. No worries, datapasta has a df_paste() function too!
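In case you're curious, here's a quick sketch of that call (assuming the same height and weight table is still sitting on the clipboard):

library(datapasta)
# paste the clipboard contents as a base data.frame() definition
datapasta::df_paste()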

Can this get any better?

Now I can also go through and edit the data.frame() function a bit to remove the pesky NA columns and values.
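Here's a minimal sketch of that cleanup, assuming the pasted data.frame() output gets assigned to a (hypothetical) name like htwt:

# htwt is a hypothetical name for the pasted data.frame() output
# drop the columns that are entirely NA, then the rows that are entirely NA
htwt <- htwt[, colSums(is.na(htwt)) < nrow(htwt)]
htwt <- htwt[rowSums(is.na(htwt)) < ncol(htwt), ]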

What if my dataset is really big?

Maybe you have a huge dataset you want to import into R, but you’re not sure if datapasta can handle it? All it takes is a little adjustment with the datapasta::dp_set_max_rows() function.

For example, if I wanted to copy + paste this table into an RStudio session, I could enter datapasta::dp_set_max_rows(num_rows = 15000) in my .R script just above the tribble_paste() function.

datapasta::dp_set_max_rows(num_rows = 15000)
datapasta::tribble_paste()
it’s slow, but it gets there…

As you can see, tribble_paste() parsed this table into a tibble::tribble() function (and it was over 14000 rows!).

But wait…there’s more!

What if I found a problem with the table I just pasta’d into R (or another data set in R)? I don’t know about you, but I’m constantly needing to ask questions or share code on Stackoverflow or RStudio Community.

Well, the handy-dandy dpasta() function helps me create excellent reproducible examples.

Let’s assume I needed to share a sample from the height and weight data I just imported (WordHtWt).

  • First I’ll add some meaningful names to the columns in WordHtWt (using magrittr::set_names()),
  • Then take a small sample with help from dplyr (it’s a good idea to always use the smallest possible data frame to re-create the problem),
  • And…
# new names
world_height_weight_names <- c("country", "male_avg_ht_m",
                               "male_avg_wt_kg", "male_bmi",
                               "female_avg_ht_m", "female_avg_wt_kg",
                               "female_bmi")
# clean and set names
WordHtWt %>%
  # set some better names
  magrittr::set_names(world_height_weight_names) %>%
  # get a sample for reprex
  dplyr::sample_frac(size = 0.10) %>%
  # PASTA!!!
  dpasta()
Voila!

Now I have a nice bit of code I can post (and hopefully get my questions answered).

Friends and alternatives to datapasta

datapasta plays well with the reprex package. If you aren’t sure what reprex does, you should watch the webinar from Jenny Bryan (the package author). If you are looking for the base R alternative to dpasta(), there’s dput(), but the output is not as clean (and it doesn’t have a direct analog to the _paste() functions).


# clean and set names
WordHtWt %>%
  # set some better names
  magrittr::set_names(world_height_weight_names) %>%
  # get a sample for reprex
  dplyr::sample_frac(size = 0.10) %>%
  # try this with dput()
  dput()

structure(list(country = c("Taiwan", "Burma", "Kazakhstan", "Bolivia",
"Belgium", "Mali", "Mauritius", "Laos", "Burundi", "France",
"Mexico", "Nigeria", "Turkey"), male_avg_ht_m = c("1.73 m", "1.65 m",
"1.72 m", "1.67 m", "1.81 m", "1.72 m", "1.71 m", "1.60 m", "1.68 m",
"1.79 m", "1.68 m", "1.67 m", "1.74 m"), male_avg_wt_kg = c("74.8 kg",
"60.4 kg", "77.8 kg", "70.6 kg", "87.8 kg", "67.7 kg", "71.9 kg",
"57.9 kg", "61.5 kg", "83.3 kg", "77.6 kg", "63.0 kg", "82.4 kg"
), male_bmi = c(25, 22.2, 26.3, 25.3, 26.8, 22.9, 24.6, 22.6,
21.8, 26, 27.5, 22.6, 27.2), female_avg_ht_m = c("1.60 m", "1.54 m",
"1.60 m", "1.53 m", "1.65 m", "1.61 m", "1.57 m", "1.51 m", "1.55 m",
"1.65 m", "1.56 m", "1.58 m", "1.60 m"), female_avg_wt_kg = c("60.7 kg",
"54.5 kg", "68.1 kg", "64.8 kg", "70.0 kg", "59.9 kg", "64.1 kg",
"52.4 kg", "51.7 kg", "66.4 kg", "69.4 kg", "59.9 kg", "73.7 kg"
), female_bmi = c(23.7, 23, 26.6, 27.7, 25.7, 23.1, 26, 23, 21.5,
24.4, 28.5, 24, 28.8)), row.names = c(NA, -13L), class = "data.frame")
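For completeness, here's a minimal sketch of how reprex fits into this workflow (assuming the code you want to share has already been copied to the clipboard):

library(reprex)
# render the code on the clipboard into a formatted reprex,
# ready to paste into Stackoverflow or RStudio Community
reprex::reprex()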

Additional resources

Check out the vignette for datapasta here, the tidyverse packages, and an excellent description of how to write a reproducible example from Advanced R by Hadley Wickham. Be sure to thank the datapasta author Miles McBain for all the future headaches he just saved you from.

Migrating from a Microsoft Access Database to MySQL with MS ACCDB Viewer

Motivation

The new dbplyr is out on CRAN today, and I’ve wanted to update my MySQL Lahman baseball database with 2017 data, but it’s only available as .csv files and an MS Access database. I could import the .csv files, but thought it would be useful to show how I convert an MS Access database into MySQL using MS ACCDB Viewer. This application has a free trial (the paid version is under $20). I rarely recommend purchasing software, but if you use macOS and find yourself needing to get into an MS Access database, this tool comes in handy.

Getting started

This post assumes:

  1. You are running macOS (this tutorial was built using version 10.14.2)
  2. You have MySQL installed on your machine (see previous post to get everything set up)
  3. You have RStudio installed (using version 1.2.1206)
  4. You have R installed (using version 3.5.1)

Packages

These are the packages you’ll need.

library(RMySQL)
library(RMariaDB)
library(dplyr)
library(tidyverse)
library(Lahman)
library(magrittr)
library(rstudioapi)

Project folder setup

I created an RStudio project file inside a mysql-mdb-accdb folder (mysql-mdb-accdb.Rproj). If you don’t know how RStudio’s project files work, see this post for more information.

Inside this project folder, I created another sub-folder titled databases. I wanted to separate these files by type, so I created a new folder for each version of the 2017 data.

databases/csv
databases/ms-access
databases/sql

After downloading the data, my full folder tree looks like this:

databases
├── csv
│   └── baseballdatabank-master
│       ├── README.txt
│       ├── core
│       │   ├── AllstarFull.csv
│       │   ├── Appearances.csv
│       │   ├── AwardsManagers.csv
│       │   ├── AwardsPlayers.csv
│       │   ├── AwardsShareManagers.csv
│       │   ├── AwardsSharePlayers.csv
│       │   ├── Batting.csv
│       │   ├── BattingPost.csv
│       │   ├── CollegePlaying.csv
│       │   ├── Fielding.csv
│       │   ├── FieldingOF.csv
│       │   ├── FieldingOFsplit.csv
│       │   ├── FieldingPost.csv
│       │   ├── HallOfFame.csv
│       │   ├── HomeGames.csv
│       │   ├── Managers.csv
│       │   ├── ManagersHalf.csv
│       │   ├── Parks.csv
│       │   ├── People.csv
│       │   ├── Pitching.csv
│       │   ├── PitchingPost.csv
│       │   ├── Salaries.csv
│       │   ├── Schools.csv
│       │   ├── SeriesPost.csv
│       │   ├── Teams.csv
│       │   ├── TeamsFranchises.csv
│       │   ├── TeamsHalf.csv
│       │   └── readme2014.txt
│       └── upstream
│           └── Teams.csv
└── ms-access
    └── lahman2017
        ├── lahman2017.mdb
        └── readme2017.txt

MS Access to MySQL using MS ACCDB Viewer

  1. Download the MS ACCDB Viewer app and the Lahman 2017 MS Access database and .csv files. I’ve created the following sub-folders for the two types of Lahman database files.
  2. Open the MS Access database in the MS ACCDB Viewer app and follow the steps below.
  3. After the database exports, you will see a new lahman2017.sql file in your databases/sql folder. This file is too large to open in RStudio (it’s 46.4 MB), so I’ll open it in a text editor (I use Sublime Text, but many others will work).

What should I expect to see?

If you haven’t done a lot of work in SQL, you will have no way of knowing if this SQL export is ready to be loaded into MySQL (I didn’t know what to expect until I did a lot of reading and trial and error). However, I can look at the 2016 MySQL database from the Lahman website to see what to expect. Below are the first 50 lines of each SQL file (2016 on the left, 2017 on the right).

The important differences in the first ~45 lines:

  1. There is no command for creating the lahman2017 database (i.e. CREATE DATABASE)
  2. After creating the database, there is no command to use the lahman2017 database
  3. The SET NAMES argument should be changed to SET NAMES utf8mb4 (this will include characters like emojis)
  4. All the tables and data have clear headers in the previous .sql file, and it’s worthwhile to add these to this version too (though it’s not required)
/*

This MySQL dump came from the Lahman 2017 mdb file available here:
http://www.seanlahman.com/baseball-archive/statistics/
The following was added:
A CREATE DATABASE statement
A USE database statement
Change the SET NAMES 'UTF8'; to SET NAMES utf8mb4;
Add the 'Table structure' and 'Records' titles (optional)

*/
CREATE DATABASE IF NOT EXISTS lahman2017;
USE lahman2017;
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

The first 50 lines of my lahman2017.sql file now look like this:

RStudio’s Terminal pane

In previous posts, I’ve mentioned opening and using Terminal. RStudio also has its own Terminal pane (see below).

This is accessible under Tools > Terminal > New Terminal.

In a new Terminal pane, enter the following commands.

$ cd databases/sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH

This returns what looks like a long and convoluted file path (this is normal).

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/….

Now I enter the following:

mysql -u root -p 

Enter your password and you should see the mysql> prompt. I want to check the databases in MySQL before adding the new lahman2017 data, and I can do this with SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| lahman2016 |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)

Because I moved into the sql folder, I can execute the lahman2017.sql file by entering source lahman2017.sql into the MySQL prompt.

mysql> source lahman2017.sql

This will return a long output telling you things like Query OK and Records: 11370 Duplicates: 0 Warnings: 0. When it’s finished, check the new tables with SHOW TABLES;

You should see this.

Great! Now I have the new database set up in MySQL. I am going to connect to RStudio and start querying.

Changing MySQL settings

Before connecting to MySQL from RStudio, you might need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption
  3. Then click Start MySQL Server

The dplyr::src_mysql() function

In the previous post, I set up my connection from RStudio using the DBI package.

This time I will set up my connection with the dplyr package (part of the tidyverse). The function that does this is dplyr::src_mysql(), and I include arguments for dbname, host, user, and password = rstudioapi::askForPassword("Database password"), which prompts a fill-in box for my password.

LahmandplyrMySQL <- dplyr::src_mysql(dbname = "lahman2017", 
                                     host = "localhost",
                                     user = "root",
                                     password = rstudioapi::askForPassword("Database password"))

Print the LahmandplyrMySQL object to the console.

LahmandplyrMySQL
src: mysql 8.0.13 [root@localhost:/lahman2017]
tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers,
AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost,
CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost,
HallOfFame, HomeGames, Managers, ManagersHalf, Parks, People, Pitching,
PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises,
TeamsHalf

As you can see, this prints out much more information than the DBI connection in my previous post. The LahmandplyrMySQL object tells me the source of the connection and the tables in the database.

ALWAYS read the README file (readme2017.txt)

The directions are right there in the title. Important stuff is in these files (like changes to the tables in the database). This section is an example of such information:

One notable change: The name of the table that contains biographical information for players has been changed from “Master” to “People” to better reflect its contents.

This information saved me from some potential headaches (my previous query on the Master table would give an error).

In the next section, I’ll go over the changes in the new dbplyr release using this connection. The dplyr::tbl() function takes a src argument, which is the connection to the database (LahmandplyrMySQL). The next argument is sql(), a character string (inside " ") containing SQL commands. I like to capitalize the MySQL syntax, refer to tables in SQL with TitleCase, and keep the columns in lowercase.
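In other words, the basic pattern looks something like this (a minimal sketch; the People table and its columns come from the lahman2017 database I just loaded):

# a connection (src) plus a SQL string wrapped in sql()
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT playerID, birthYear FROM People")) %>%
  utils::head(5)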

API changes

1) Calls of the form dplyr::foo() are now evaluated in the database, rather than locally.

I am assuming this means that dplyr functions used in a query (instead of MySQL) are now evaluated in the database, such as the following:

I also pass everything to dplyr::glimpse(78) so it prints nicely in the console.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view object
  dplyr::glimpse(78)

Observations: ??
Variables: 24
Database: mysql 8.0.13 [root@localhost:/lahman2017]
$ playerID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…
$ birthYear 1964, 1987, 1943, 1963, 1971, 1942, 1981, 1934, 1939, …
$ birthMonth 10, 11, 7, 9, 12, 8, 12, 2, 8, 9, 2, 8, 6, 9, 7, 1, 9,…
$ birthDay 26, 13, 15, 21, 28, 9, 27, 5, 5, 8, 16, 17, 2, 15, 15,…
$ birthCountry "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ birthState "IL", "NY", "CA", "CA", "HI", "AL", "CO", "AL", "AL", …
$ birthCity "Chicago", "Staten Island", "Long Beach", "Escondido",…
$ deathYear NA, NA, 2016, NA, NA, 2001, NA, NA, 1984, NA, NA, NA, …
$ deathMonth NA, NA, 2, NA, NA, 1, NA, NA, 8, NA, NA, NA, NA, NA, N…
$ deathDay NA, NA, 15, NA, NA, 22, NA, NA, 16, NA, NA, NA, NA, NA…
$ deathCountry NA, NA, "USA", NA, NA, "USA", NA, NA, "USA", NA, NA, N…
$ deathState NA, NA, "CA", NA, NA, "NY", NA, NA, "GA", NA, NA, NA, …
$ deathCity NA, NA, "Long Beach", NA, NA, "New York", NA, NA, "Atl…
$ nameFirst "Steve", "Tim", "Dave", "Troy", "Benny", "Tommie", "Da…
$ nameLast "Adkins", "Adleman", "Adlesh", "Afenir", "Agbayani", "…
$ nameGiven "Steven Thomas", "Timothy Max", "David George", "Micha…
$ weight 210, 225, 187, 185, 225, 195, 215, 180, 190, 190, 200,…
$ height 78, 77, 72, 76, 72, 71, 75, 72, 75, 75, 78, 74, 71, 75…
$ bats "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ throws "L", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ debut "1990-09-12", "2016-05-01", "1963-05-12", "1987-09-14"…
$ finalGame "1990-10-03", "2017-09-19", "1968-09-28", "1992-07-10"…
$ retroID "adkis001", "adlet001", "adled101", "afent001", "agbab…
$ bbrefID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…

2) The vars argument to tbl_sql() has been formally deprecated.

The vars argument was used to supply a list of columns in the tbl. The previous documentation language is below:

If known, the names of the variables in the tbl. This is relatively expensive to determine automatically, so is cached throughout dplyr. However, you should usually be able to leave this blank and it will be determined from the context.

3) src and tbl objects now include a class generated from the class of the underlying connection object. This makes it possible for dplyr backends to implement different behaviour at the dplyr level, when needed.

I assume this is visible in the output of str() and class() for the previous query.

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view structure
  utils::str()


List of 2
$ src:List of 2
..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
.. .. ..@ Id: int [1:2] 0 0
..$ disco:
..- attr(*, "class")= chr [1:4] "src_MySQLConnection" "src_dbi" "src_sql" "src"
$ ops:List of 4
..$ name: chr "filter"
..$ x :List of 2
.. ..$ x : 'sql' chr "SELECT * FROM People"
.. ..$ vars: chr [1:24] "playerID" "birthYear" "birthMonth" "birthDay" …
.. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
..$ dots:List of 1
.. ..$ : language ~bats == "R" & birthYear >= 1900
.. .. ..- attr(*, ".Environment")=
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
- attr(*, "class")= chr [1:5] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" …

The class

# start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
  # switch to dplyr
  dplyr::filter(bats == "R" & birthYear >= 1900) %>%
  # view class
  base::class()

[1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
[4] "tbl_lazy" "tbl"

SQL translation

I will skip over “The new as.integer64(x) is translated to CAST(x AS BIGINT)”.

cummean() now translates to SQL AVG() as opposed to MEAN().

The cummean() function is an aggregate window function. It works by keeping track of 1) the running sum of the values, and 2) the running count of the values. There is a great example of how this works on StackOverflow here.

I created an example data frame below to show how this works.

# get a sample
x <- base::sample(1:10)
# create a data frame of the intermediate steps and cummean
base::data.frame(sample = x,
                 cumsum_x = cumsum(x),
                 seq_along_x = seq_along(x),
                 cummean = cumsum(x) / seq_along(x))

sample cumsum_x seq_along_x  cummean
     5        5           1 5.000000
     3        8           2 4.000000
     4       12           3 4.000000
     2       14           4 3.500000
     1       15           5 3.000000
     7       22           6 3.666667
     8       30           7 4.285714
     9       39           8 4.875000
    10       49           9 5.444444
     6       55          10 5.500000

The cumsum_x is a running total of sample (5 + 3 = 8, 5 + 3 + 4 = 12, etc.), and seq_along_x is the count of the values. When I get the ratio of cumsum_x / seq_along_x, I get the cummean column (5 / 1 = 5, 8 / 2 = 4.0, etc.)
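As a quick sanity check, dplyr's cummean() should agree with this manual ratio (a small sketch reusing the x from above):

# cummean(x) is the cumulative sum divided by the cumulative count
all.equal(dplyr::cummean(x), cumsum(x) / seq_along(x))
# should return TRUE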

This is the cumulative mean of RBIs in the 2017 Batting data table.

# get the Batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::mutate(cummean_rbi = cummean(RBI)) %>%
  dplyr::select(playerID, cummean_rbi) %>%
  utils::head(5)

# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(RBI)
playerID cummean_rbi

1 wilsoha01 191
2 gehrilo01 188.
3 greenha01 186
4 foxxji01 183.
5 gehrilo01 182.

Compare this to the mean RBI per player.

# get the Batting table
dplyr::tbl(src = LahmandplyrMySQL,
           sql("SELECT playerID, RBI FROM Batting")) %>%
  dplyr::arrange(desc(RBI)) %>%
  dplyr::group_by(playerID) %>%
  dplyr::summarize(mean_rbi = mean(RBI, na.rm = TRUE)) %>%
  dplyr::select(playerID, mean_rbi) %>%
  dplyr::arrange(desc(mean_rbi)) %>%
  utils::head(5)
# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(mean_rbi)
playerID mean_rbi
1 dimagjo01 118.
2 gehrilo01 117.
3 pujolal01 113.
4 cabremi01 108.
5 belleal01 103.

Just for kicks, the AVG() function in MySQL is implemented like this:

mysql> SELECT 
playerID,
AVG(RBI) AS avg_rbi
FROM
Batting
GROUP BY
playerID
ORDER BY
avg_rbi DESC
LIMIT 5;
+-----------+----------+
| playerID | avg_rbi |
+-----------+----------+
| dimagjo01 | 118.2308 |
| gehrilo01 | 117.3529 |
| pujolal01 | 112.8235 |
| cabremi01 | 107.5333 |
| belleal01 | 103.2500 |
+-----------+----------+
5 rows in set (0.07 sec)

Closing

Moving from one relational database management system to another can slow down your workflow, so having (inexpensive) tools to quickly transform a database into something you can work with is a time-saver. I don’t like leaving the RStudio IDE (unless it’s to my browser), so the quicker I can get my data source into this environment, the quicker I can get to work on it.

Next post will probably be on PostgreSQL.

Setting up PostgreSQL on macOS

This tutorial was written in RMarkdown in RStudio (version 1.2.1206, R version 3.5.1), using mac OS High Sierra (10.13.6).

Why install PostgreSQL?

PostgreSQL is an open-source relational database management system (RDBMS). You can install PostgreSQL on Linux, Unix, Windows, or Mac. Users can write and execute stored procedures and functions from other programming languages (like R and Python). Multiple data types can also be stored (and even defined) in PostgreSQL, including JSON and other nonrelational data.

Read more about PostgreSQL here.

PostgreSQL application

Download the application from this website and follow the installation instructions.

The syntax documentation for PostgreSQL is here.

First steps

After downloading and installing the PostgreSQL application, open a new Terminal window. In the Terminal, enter the following:

$ sudo mkdir -p /etc/paths.d &&
  echo /Applications/Postgres.app/Contents/Versions/latest/bin | sudo tee /etc/paths.d/postgresapp

You’ll be asked to enter whatever password you need to install programs on your computer. Type it in and hit enter.

Install PostgreSQL using Homebrew

You can also install PostgreSQL using homebrew. If you don’t have homebrew installed, learn how to do so here.

In a fresh terminal window, enter the following:

$ brew install postgresql
==> Installing postgresql
==> Downloading https://homebrew.bintray.com/bottles/postgresql-11.1.mojave.bottle.tar.gz
################################################################## 100.0%
==> Pouring postgresql-11.1.mojave.bottle.tar.gz
==> Caveats
To migrate existing data from a previous major version of PostgreSQL run:
brew postgresql-upgrade-database
To have launchd start postgresql now and restart at login:
brew services start postgresql
Or, if you don't want/need a background service you can just run:
pg_ctl -D /usr/local/var/postgres start
==> Summary
🍺 /usr/local/Cellar/postgresql/11.1: 3,548 files, 40.3M

Launching PostgreSQL from Terminal

Read the documentation for a more in-depth explanation.

In a fresh Terminal session (If you are using RStudio’s Terminal pane, select Tools > Terminal > New Terminal), enter the following.

psql postgres

This will bring up the psql command line.

psql (10.5)
Type "help" for help.

postgres=#

The =# is where we will be entering our postgres commands.

Check users, roles and attributes

We can access the users, their roles and attributes in PostgreSQL with the following:

postgres=# \du

This will return a table with the following information:

yourname=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
yourname | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Add a new user

The postgres=# prompt indicates I am currently a Superuser. To create a new user named martymcfly, I enter CREATE USER with a WITH PASSWORD clause:

=# CREATE USER martymcfly WITH PASSWORD 'strongpassword123';
-- CREATE ROLE

PostgreSQL typically echoes the command that was just executed. Because CREATE USER is another spelling of CREATE ROLE, PostgreSQL returns CREATE ROLE at the prompt.

I will check to make sure this is true with \du again.

=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
yourname | Superuser, Create role, Create DB | {}
martymcfly | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Grant attributes to new user

As you can see, martymcfly has no attributes or role memberships yet. I want this user to be able to create a new database. I can do this with the following:

=# ALTER ROLE martymcfly CREATEDB;

And I can verify this with the \du command again.

=# \du
List of roles
Role name | Attributes | Member of
----------------+------------------------------------------------------------+-----------
martinfrigaard | Superuser, Create role, Create DB | {}
martymcfly | Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Creating a database

I’m going to switch over to the new user I just created by exiting PostgreSQL and signing back in as martymcfly:

=# \q --exit 

Now sign in as a new user

psql postgres -U martymcfly

This brings up a new PostgreSQL prompt.

postgres=>

The => tells me I am no longer the Superuser, but I know that I have given martymcfly the relevant permissions to create a new database.

Below are the steps needed to 1) create a new database (I’m creating a Lahman baseball database in PostgreSQL), 2) grant privileges to the user martymcfly on the new lahman2017 database, 3) list all the current databases in PostgreSQL, and 4) connect to the lahman2017 database.

First create lahman2017

CREATE DATABASE lahman2017;
-- CREATE DATABASE

Then give martymcfly permissions

 GRANT ALL PRIVILEGES ON DATABASE lahman2017 TO martymcfly;
-- GRANT

Now list the databases

\list

This returns:

                             List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges |
----------------+----------------+----------+-------------+-------------+--------------------------
lahman2017 | martymcfly | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =Tc/martymcfly +
| | | | | martymcfly=CTc/martymcfly
martinfrigaard | martinfrigaard | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(5 rows)

Finally, we connect to the new database

\connect lahman2017
psql (11.1, server 11.0)
You are now connected to database "lahman2017" as user "martymcfly".

Great! Next time we will build the schemas and tables in the lahman2017 database.
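Connecting to this database from R is a topic for another post, but here is a minimal sketch of what that connection might look like with DBI and the RPostgres package (the package choice is my assumption; it isn't covered above):

library(DBI)
library(RPostgres)
# connect to the new lahman2017 database as martymcfly (sketch)
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = "lahman2017",
                      host = "localhost",
                      port = 5432,
                      user = "martymcfly",
                      password = rstudioapi::askForPassword("Database password"))
# list the (currently empty) set of tables, then disconnect
DBI::dbListTables(con)
DBI::dbDisconnect(con)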

Querying MySQL from RStudio

In the previous post I demonstrated how to install and use MySQL from the command line. In this tutorial I will show you how to set up and query a relational database management system (RDBMS) from RStudio.

The Lahman database

These queries are done using the Lahman baseball data set available here. I chose these data because 1) they are open to anyone and can be run on your local machine, and 2) they are available as a .sql file and .csv files for loading into any RDBMS software.

Database packages in RStudio

For this tutorial I will be using a local instance of MySQL with help from the DBI, dplyr, and RMySQL packages.

library(DBI) 
library(RMySQL)
library(RMariaDB)
library(dplyr)
library(rstudioapi)
library(ggthemes)

Change MySQL settings

Before we can connect to a database with RStudio, we need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption.
  3. Then click Start MySQL Server

Connect to MySQL

A DBI connection is built below to my local instance of MySQL.

LahmanDBIMySQL <- DBI::dbConnect(MySQL(),
                                 dbname = "lahman2016",
                                 host = "localhost",
                                 user = "root",
                                 password = rstudioapi::askForPassword(prompt = "database password"))
LahmanDBIMySQL

As you can see, this is a <MySQLConnection:0,1>. This is the object we will use to access the lahman2016 database.

Querying a database from RStudio

We can get a list of tables in the lahman2016 database by using the DBI::dbListTables() function.

DBI::dbListTables(LahmanDBIMySQL)
[1] "AllstarFull" "Appearances" "AwardsManagers"
[4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost" "CollegePlaying"
[10] "Fielding" "FieldingOF" "FieldingOFsplit"
[13] "FieldingPost" "Griffeys" "HallOfFame"
[16] "HomeGames" "Managers" "ManagersHalf"
[19] "Master" "Parks" "Pitching"
[22] "PitchingPost" "Salaries" "Schools"
[25] "SeriesPost" "Teams" "TeamsFranchises"
[28] "TeamsHalf"

This is similar to the SHOW TABLES; command in MySQL. In the previous tutorial, I built a table of statistics on Ken Griffey Junior and Senior in the Lahman database. The Griffeys table contains the following fields:

DBI::dbListFields(LahmanDBIMySQL, "Griffeys") 
## [1] "playerID" "birthYear" "nameFirst" "nameLast" "weight"
## [6] "height" "bats" "throws" "retroID" "bbrefID"
## [11] "teamID" "lgID" "yearID" "G" "AB"
## [16] "R" "H" "HR" "RBI" "BB"
## [21] "SO"

Queries can be passed directly to the DBI::dbGetQuery() function, like the SELECT statement below.

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT * FROM Griffeys;") %>%
dplyr::glimpse(78)

Observations: 45
Variables: 21
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ retroID "grifk001", "grifk001", "grifk001", "grifk001", "grifk001…
$ bbrefID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…

I realize I don’t have all the columns I want from the Batting and Fielding tables, so I will query lahman2016 again to collect these data.

Aliasing columns in MySQL tables

There are three columns in the Batting table I want to add to Griffeys: SF, 2B, and 3B. Two of these names violate naming rules in R.

DBI::dbListFields(LahmanDBIMySQL, "Batting") 
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "AB" "R" "H" "2B" "3B" "HR"
[13] "RBI" "SB" "CS" "BB" "SO" "IBB"
[19] "HBP" "SH" "SF" "GIDP"

The 2B and 3B columns contain the number of doubles and triples for each player, and SF is the number of sacrifice flies. Unfortunately, R doesn’t allow syntactic column names that start with a number (they have to be wrapped in backticks). I can write a query to include these columns, but use an alias to create R-friendly column names.
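To see why the aliases help, here's a small sketch of the problem on the R side (the tiny data frame below is made up for illustration):

# column names that start with a number are non-syntactic in R
bat <- data.frame(`2B` = c(5, 9), `3B` = c(1, 5), check.names = FALSE)
bat$`2B`    # works, but only with backticks
# bat$2B    # without backticks this is a syntax error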

DBI::dbGetQuery(LahmanDBIMySQL, "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;") %>% glimpse(78)

Observations: 45
Variables: 22
$ playerID "griffke01", "griffke01", "griffke01", "griffke01", "grif…
$ birthYear 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
$ nameFirst "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "Ken", "…
$ nameLast "Griffey", "Griffey", "Griffey", "Griffey", "Griffey", "G…
$ weight 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 19…
$ height 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 71, 7…
$ bats "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ throws "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L…
$ lgID "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "AL…
$ yearID 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 198…
$ teamID "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "…
$ G 25, 88, 132, 148, 154, 158, 95, 146, 101, 127, 118, 120, …
$ AB 86, 227, 463, 562, 585, 614, 380, 544, 396, 484, 458, 399…
$ R 19, 24, 95, 111, 117, 90, 62, 89, 65, 70, 60, 44, 68, 33,…
$ H 33, 57, 141, 189, 186, 177, 120, 160, 123, 134, 140, 109,…
$ HR 3, 2, 4, 6, 12, 10, 8, 13, 2, 12, 11, 7, 10, 9, 12, 14, 2…
$ RBI 14, 19, 46, 74, 57, 63, 32, 85, 34, 54, 46, 56, 69, 26, 3…
$ BB 6, 27, 67, 62, 69, 54, 36, 62, 39, 39, 34, 29, 41, 15, 20…
$ SO 10, 43, 67, 65, 84, 70, 39, 77, 42, 58, 45, 32, 51, 24, 4…
$ dubs 5, 9, 15, 28, 35, 33, 27, 28, 21, 23, 21, 20, 28, 7, 15, …
$ trips 1, 5, 9, 9, 8, 8, 4, 10, 6, 2, 3, 1, 4, 0, 3, 1, 0, 0, 3,…
$ sac_flies "0", "0", "3", "3", "2", "3", "3", "5", "4", "3", "2", "4…

Storing SQL in character vectors

I can also assign the query above to a character vector (batting_query) and pass the vector to the DBI::dbGetQuery() function. I will assign these new variables to the Griffeys table.

batting_query <- "SELECT 
grf.playerID,
grf.birthYear,
grf.nameFirst,
grf.nameLast,
grf.weight,
grf.height,
grf.bats,
grf.throws,
grf.lgID,
grf.yearID,
grf.teamID,
grf.G,
grf.AB,
grf.R,
grf.H,
grf.HR,
grf.RBI,
grf.BB,
grf.SO,
bat.2B AS dubs,
bat.3B AS trips,
bat.SF AS sac_flies
FROM Griffeys AS grf
INNER JOIN Batting AS bat
ON grf.playerID = bat.playerID
AND grf.yearID = bat.yearID
AND grf.teamID = bat.teamID;"
Griffeys <- DBI::dbGetQuery(LahmanDBIMySQL, batting_query)

Now that we see the SQL works and these two tables are joined, we can calculate a few new statistics in R.

Calculating On-base plus slugging (OPS)

A player’s on-base plus slugging (OPS) is a measure of their overall batting performance. This is the sum of the player’s on-base percentage and slugging average. The steps to calculate this statistic are below.

GriffsOPSbyYear <- Griffeys %>% 
  dplyr::mutate(
    # slugging percentage
    slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
    # convert sac_flies to numeric
    sac_flies = as.numeric(sac_flies),
    # On Base Percentage
    ob_perc = (H + BB) / (H + AB + BB + sac_flies),
    # On-base plus slugging
    obslug_perc = slug_perc + ob_perc,
    # create a nameID
    nameID = case_when(
      birthYear == 1950 ~ "Ken Griffey Sr",
      birthYear == 1969 ~ "Ken Griffey Jr"),
    # convert to a factor
    nameID = factor(nameID),
    # make KGSr the reference
    nameID = relevel(nameID, ref = "Ken Griffey Sr"))

I created a few intermediate calculations before creating obslug_perc, and I also created a nameID so the two players can be easily graphed. Below is the on-base plus slugging (OPS) statistic for both Griffeys over their career.

GriffsOPSbyYear %>% 
  ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
                      group = nameID, color = teamID)) +
  geom_line() +
  ggplot2::facet_grid(. ~ nameID, scales = "free") +
  ggplot2::labs(
    title = "The Griffey's career on-base plus slugging (OPS)") + 
  ggthemes::theme_fivethirtyeight()

The Lahman package in R

There is also a Lahman package in R. I will use this below to demonstrate some of the similarities between dplyr and MySQL syntax.

First I create an object for the Lahman::Master, Lahman::Fielding, and Lahman::Batting tables.

Master <- Lahman::Master
Batting <- Lahman::Batting
Fielding <- Lahman::Fielding

Then the code below creates the same tables and graph using dplyr commands.

GriffeyOPSPlot <- Master %>% 
  dplyr::select(playerID,
                birthYear,
                nameLast) %>%
  # find Griffeys
  dplyr::filter(nameLast %in% "Griffey") %>%
  # join to batting table
  dplyr::inner_join(x = .,
                    y = Batting,
                    by = "playerID") %>%
  # select/rename relevant columns
  dplyr::select(teamID,
                nameLast,
                playerID,
                birthYear,
                yearID,
                G,
                AB,
                R,
                H,
                HR,
                dubs = X2B,
                trips = X3B,
                RBI,
                BB,
                SO,
                sac_flies = SF) %>%
  dplyr::mutate(
    # slugging percentage
    slug_perc = (H - dubs - trips - HR + 2 * dubs + 3 * trips + 4 * HR) / AB,
    # convert sac_flies to numeric
    sac_flies = as.numeric(sac_flies),
    # On Base Percentage
    ob_perc = (H + BB) / (H + AB + BB + sac_flies),
    # On-base plus slugging
    obslug_perc = slug_perc + ob_perc,
    # create a nameID
    nameID = case_when(
      birthYear == 1950 ~ "Ken Griffey Sr",
      birthYear == 1969 ~ "Ken Griffey Jr"),
    # convert to a factor
    nameID = factor(nameID),
    # make KGSr the reference
    nameID = relevel(nameID, ref = "Ken Griffey Sr")) %>%
  # create plot of new OPS
  ggplot2::ggplot(aes(x = yearID, y = obslug_perc,
                      group = nameID, color = teamID)) +
  geom_line() +
  ggplot2::facet_grid(. ~ nameID, scales = "free") +
  ggplot2::labs(
    title = "The Griffey's career on-base plus slugging (OPS)") +
  ggthemes::theme_fivethirtyeight()
# save the plot (specify the plot object explicitly)
ggplot2::ggsave(filename = "GriffeyOPSPlot.png", plot = GriffeyOPSPlot,
                width = 7, height = 5, units = "in")

End

MySQL from the command line (macOS)

Setting up MySQL from the command line

This quick tutorial covers how to set up and query a MySQL database from the command line (Terminal) on macOS Sierra.

What is MySQL?

The SQL in MySQL stands for structured query language. There are half a dozen flavors of SQL, and MySQL is one of the most common. The My comes from the name of co-founder Michael Widenius’s daughter (fun fact: another flavor of SQL, MariaDB, is named after his younger daughter).

MySQL is an open-source relational database management system. Read more about MySQL on Wikipedia. Or check out the reference manual here.

Download MySQL

Download and install the community edition of MySQL. You will be asked to create an account, but you can opt out and just click on “No thanks, just start my download.”

After downloading the dmg, you will be guided through the installation steps. On the Configuration options, I chose Use Strong Password Encryption,

and on the next window I entered a password and checked the box for Start MySQL Server once the installation is complete.

Or use brew install mysql if you have homebrew installed.

After the install finishes, you should see the MySQL icon in the System Preferences:

MySQL workbench (optional)

Download and install the workbench if you want to use an IDE for querying MySQL (I prefer DataGrip). You should read this documentation on the Workbench.

Install database drivers (using homebrew)

In a future post, I will be using RStudio to query a database using the RMySQL and RMariaDB packages. Follow these instructions here for installing the database drivers on your Mac.

These commands are entered into Terminal.

# Install the unixODBC library
brew install unixodbc
# SQL Server ODBC Drivers (Free TDS)
brew install freetds --with-unixodbc
# PostgreSQL ODBC ODBC Drivers
brew install psqlodbc
# SQLite ODBC Drivers
brew install sqliteodbc

Installing the RMySQL package connector

I need to install the connectors for MySQL and MariaDB using brew install mysql-connector-c in Terminal.

$ brew install mysql-connector-c

After updating Homebrew, the connector is installed.

Install the RMariaDB package connector

I will also install the mariadb-connector-c connector.

$ brew install mariadb-connector-c

Now we can launch MySQL.

Launching MySQL locally

The commands below are entered directly into Terminal.

  1. To add the local MySQL installation to your PATH, enter:
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH
  2. To start up MySQL, enter the following:
$ mysql -u root -p 
  3. You will be prompted for the password you used to set up MySQL. Enter it into the Terminal and you should see this:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 8.0.13 MySQL Community Server - GPL

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

The MySQL command line is below:

mysql>

After installing MySQL community edition, you can choose to either run commands from the terminal or within a .sql script in the workbench. Below I demonstrate using MySQL from the command line.

Using MySQL commands in Terminal

To see the users and their passwords, enter the following command at the mysql> prompt. The authentication_string column will identify the passwords (but they are encrypted).

NOTE: a semi-colon is needed at the end of each MySQL command.

SELECT 
  User, authentication_string 
FROM 
  mysql.user;

+------------------+------------------------------------------------------------------------+
| User             | authentication_string                                                  |
+------------------+------------------------------------------------------------------------+
| mysql.infoschema | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| mysql.session    | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| mysql.sys        | $A$005$THISISACOMBINATIONOFINVALIDSALTANDPASSWORDTHATMUSTNEVERBRBEUSED |
| root             | *D932DC725A9210F3B4C903D69F88EDC3AD447A06                              |
+------------------+------------------------------------------------------------------------+
4 rows in set (0.00 sec)

The MySQL commands are working! Let’s build a database!

Building a MySQL database

The lahman2016 database is freely available and full of information on baseball players and teams from 1871 through 2016. You can download it here.

After downloading the zipped database into a local data folder, I find the following files.

├── data
│   ├── lahman2016-sql
│   │   ├── lahman2016.sql
│   │   └── readme2016.txt

The readme2016.txt file tells me more about the tables in the database.

In a new Terminal window, I navigate to the lahman2016-sql folder and enter the following commands.

$ cd lahman2016-sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH
$ mysql -u root -p

After entering my password into the Terminal, I need to run the lahman2016.sql file. I can do this using the source command in the MySQL prompt.

mysql> source lahman2016.sql

This should show a lot of activity and output, but eventually bring back the MySQL prompt. I can check the databases using SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| lahman2016         |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

Querying a database directly from Terminal

Now that I’ve set up the lahman2016 database, I can use it to demonstrate a few MySQL commands. I will start by selecting this database.

mysql> USE lahman2016;

This prompts the following message.

Database changed

Now I can take a look at the tables in the lahman2016 database.

mysql> SHOW TABLES;
+----------------------+
| Tables_in_lahman2016 |
+----------------------+
| AllstarFull          |
| Appearances          |
| AwardsManagers       |
| AwardsPlayers        |
| AwardsShareManagers  |
| AwardsSharePlayers   |
| Batting              |
| BattingPost          |
| CollegePlaying       |
| Fielding             |
| FieldingOF           |
| FieldingOFsplit      |
| FieldingPost         |
| HallOfFame           |
| HomeGames            |
| Managers             |
| ManagersHalf         |
| Master               |
| Parks                |
| Pitching             |
| PitchingPost         |
| Salaries             |
| Schools              |
| SeriesPost           |
| Teams                |
| TeamsFranchises      |
| TeamsHalf            |
+----------------------+
27 rows in set (0.04 sec)

Adding a primary key to the Master table

When I look at the Master table, I see that it was made without a primary key.

SHOW CREATE TABLE Master;
| Master | CREATE TABLE `Master` (
  `playerID` varchar(255) DEFAULT NULL,
  `birthYear` int(11) DEFAULT NULL,
  `birthMonth` int(11) DEFAULT NULL,
  `birthDay` int(11) DEFAULT NULL,
  `birthCountry` varchar(255) DEFAULT NULL,
  `birthState` varchar(255) DEFAULT NULL,
  `birthCity` varchar(255) DEFAULT NULL,
  `deathYear` varchar(255) DEFAULT NULL,
  `deathMonth` varchar(255) DEFAULT NULL,
  `deathDay` varchar(255) DEFAULT NULL,
  `deathCountry` varchar(255) DEFAULT NULL,
  `deathState` varchar(255) DEFAULT NULL,
  `deathCity` varchar(255) DEFAULT NULL,
  `nameFirst` varchar(255) DEFAULT NULL,
  `nameLast` varchar(255) DEFAULT NULL,
  `nameGiven` varchar(255) DEFAULT NULL,
  `weight` int(11) DEFAULT NULL,
  `height` int(11) DEFAULT NULL,
  `bats` varchar(255) DEFAULT NULL,
  `throws` varchar(255) DEFAULT NULL,
  `debut` varchar(255) DEFAULT NULL,
  `finalGame` varchar(255) DEFAULT NULL,
  `retroID` varchar(255) DEFAULT NULL,
  `bbrefID` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

If I want to update the table so the primary key is playerID, I can do this with the commands below.

mysql> ALTER TABLE `lahman2016`.`Master` 
       CHANGE COLUMN `playerID` `playerID` varchar(255) NOT NULL ,
         ADD PRIMARY KEY (`playerID`),
         ADD UNIQUE INDEX `playerID_UNIQUE` (`playerID` ASC) VISIBLE;

Then the Master table returns this when the SHOW CREATE TABLE command is entered into the prompt.

SHOW CREATE TABLE Master;
| Master | CREATE TABLE `Master` (
  `playerID` varchar(255) NOT NULL,
  `birthYear` int(11) DEFAULT NULL,
  `birthMonth` int(11) DEFAULT NULL,
  `birthDay` int(11) DEFAULT NULL,
  `birthCountry` varchar(255) DEFAULT NULL,
  `birthState` varchar(255) DEFAULT NULL,
  `birthCity` varchar(255) DEFAULT NULL,
  `deathYear` varchar(255) DEFAULT NULL,
  `deathMonth` varchar(255) DEFAULT NULL,
  `deathDay` varchar(255) DEFAULT NULL,
  `deathCountry` varchar(255) DEFAULT NULL,
  `deathState` varchar(255) DEFAULT NULL,
  `deathCity` varchar(255) DEFAULT NULL,
  `nameFirst` varchar(255) DEFAULT NULL,
  `nameLast` varchar(255) DEFAULT NULL,
  `nameGiven` varchar(255) DEFAULT NULL,
  `weight` int(11) DEFAULT NULL,
  `height` int(11) DEFAULT NULL,
  `bats` varchar(255) DEFAULT NULL,
  `throws` varchar(255) DEFAULT NULL,
  `debut` varchar(255) DEFAULT NULL,
  `finalGame` varchar(255) DEFAULT NULL,
  `retroID` varchar(255) DEFAULT NULL,
  `bbrefID` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`playerID`),
  UNIQUE KEY `playerID_UNIQUE` (`playerID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

Querying a database using .sql scripts

Another way to execute MySQL commands is by creating a .sql file and running these files either from the command line or using the source command from the MySQL prompt (like I did above).

This method is preferred because it is more reproducible (and it is easier to keep track of your work).

For example, if I want to collect data on Ken Griffey Junior and Senior, I can use the MySQL prompt to build up the query.

  1. Get the names of the columns in both tables.
mysql> SHOW COLUMNS FROM Batting;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| playerID | varchar(255) | YES  |     | NULL    |       |
| yearID   | int(11)      | YES  |     | NULL    |       |
| stint    | int(11)      | YES  |     | NULL    |       |
| teamID   | varchar(255) | YES  |     | NULL    |       |
| lgID     | varchar(255) | YES  |     | NULL    |       |
| G        | int(11)      | YES  |     | NULL    |       |
| AB       | int(11)      | YES  |     | NULL    |       |
| R        | int(11)      | YES  |     | NULL    |       |
| H        | int(11)      | YES  |     | NULL    |       |
| 2B       | int(11)      | YES  |     | NULL    |       |
| 3B       | int(11)      | YES  |     | NULL    |       |
| HR       | int(11)      | YES  |     | NULL    |       |
| RBI      | int(11)      | YES  |     | NULL    |       |
| SB       | int(11)      | YES  |     | NULL    |       |
| CS       | int(11)      | YES  |     | NULL    |       |
| BB       | int(11)      | YES  |     | NULL    |       |
| SO       | int(11)      | YES  |     | NULL    |       |
| IBB      | varchar(255) | YES  |     | NULL    |       |
| HBP      | varchar(255) | YES  |     | NULL    |       |
| SH       | varchar(255) | YES  |     | NULL    |       |
| SF       | varchar(255) | YES  |     | NULL    |       |
| GIDP     | varchar(255) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
22 rows in set (0.01 sec)

mysql> SHOW COLUMNS FROM Master;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| playerID     | varchar(255) | NO   | PRI | NULL    |       |
| birthYear    | int(11)      | YES  |     | NULL    |       |
| birthMonth   | int(11)      | YES  |     | NULL    |       |
| birthDay     | int(11)      | YES  |     | NULL    |       |
| birthCountry | varchar(255) | YES  |     | NULL    |       |
| birthState   | varchar(255) | YES  |     | NULL    |       |
| birthCity    | varchar(255) | YES  |     | NULL    |       |
| deathYear    | varchar(255) | YES  |     | NULL    |       |
| deathMonth   | varchar(255) | YES  |     | NULL    |       |
| deathDay     | varchar(255) | YES  |     | NULL    |       |
| deathCountry | varchar(255) | YES  |     | NULL    |       |
| deathState   | varchar(255) | YES  |     | NULL    |       |
| deathCity    | varchar(255) | YES  |     | NULL    |       |
| nameFirst    | varchar(255) | YES  |     | NULL    |       |
| nameLast     | varchar(255) | YES  |     | NULL    |       |
| nameGiven    | varchar(255) | YES  |     | NULL    |       |
| weight       | int(11)      | YES  |     | NULL    |       |
| height       | int(11)      | YES  |     | NULL    |       |
| bats         | varchar(255) | YES  |     | NULL    |       |
| throws       | varchar(255) | YES  |     | NULL    |       |
| debut        | varchar(255) | YES  |     | NULL    |       |
| finalGame    | varchar(255) | YES  |     | NULL    |       |
| retroID      | varchar(255) | YES  |     | NULL    |       |
| bbrefID      | varchar(255) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
24 rows in set (0.00 sec)
  2. Decide which columns I want using a SELECT and INNER JOIN query.
SELECT 
  -- Master columns
  mas.playerID, 
  mas.birthYear, 
  mas.nameFirst, 
  mas.nameLast, 
  mas.weight, 
  mas.height, 
  mas.bats, 
  mas.throws, 
  mas.retroID, 
  mas.bbrefID,
  -- Batting columns
  bat.teamID,
  bat.lgID,
  bat.yearID,
  bat.G,
  bat.AB,
  bat.R,
  bat.H,
  bat.HR,
  bat.RBI,
  bat.BB,
  bat.SO
FROM 
    Master AS mas
      INNER JOIN 
    Batting AS bat ON mas.playerID = bat.playerID
  WHERE nameLast = "Griffey";
  3. Write commands to create a new table (Griffeys) to insert the Griffey data into. This is where the previous column information comes in handy.
CREATE TABLE Griffeys (
    playerID varchar(255) NOT NULL,
    birthYear int(11) NOT NULL,
    nameFirst VARCHAR(255) NOT NULL,
    nameLast VARCHAR(255) NOT NULL,
    weight int(11),
    height int(11),
    bats varchar(255),
    throws varchar(255),
    retroID varchar(255),
    bbrefID varchar(255),
    teamID varchar(255),   
    lgID varchar(255),
    yearID int(11),
    G int(11),
    AB int(11),
    R int(11),
    H int(11),
    HR int(11),
    RBI int(11),
    BB int(11),
    SO int(11)
);
  4. Combine the two commands into a .sql file in the lahman2016-sql folder:

data/lahman2016-sql/griffey-table.sql

The script should contain the following:

/*
***************************************************************************
-- This is code to create: Griffeys table
-- Authored by and feedback to mjfrigaard@gmail.com
-- MIT License
-- Version: 1.0
***************************************************************************
*/

/*
 select database
*/

USE lahman2016;

/*
 create new table
*/

CREATE TABLE Griffeys (
    playerID varchar(255) NOT NULL,
    birthYear int(11) NOT NULL,
    nameFirst VARCHAR(255) NOT NULL,
    nameLast VARCHAR(255) NOT NULL,
    weight int(11),
    height int(11),
    bats varchar(255),
    throws varchar(255),
    retroID varchar(255),
    bbrefID varchar(255),
    teamID varchar(255),   
    lgID varchar(255),
    yearID int(11),
    G int(11),
    AB int(11),
    R int(11),
    H int(11),
    HR int(11),
    RBI int(11),
    BB int(11),
    SO int(11)
);

/*
 insert the select query into the new columns
*/

INSERT INTO Griffeys (
  playerID, 
  birthYear, 
  nameFirst, 
  nameLast, 
  weight, 
  height, 
  bats, 
  throws, 
  retroID, 
  bbrefID,
  teamID,
  lgID,
  yearID,
  G,
  AB,
  R,
  H,
  HR,
  RBI,
  BB,
  SO
)
SELECT 
  -- Master columns
  mas.playerID, 
  mas.birthYear, 
  mas.nameFirst, 
  mas.nameLast, 
  mas.weight, 
  mas.height, 
  mas.bats, 
  mas.throws, 
  mas.retroID, 
  mas.bbrefID,
  -- Batting columns
  bat.teamID,
  bat.lgID,
  bat.yearID,
  bat.G,
  bat.AB,
  bat.R,
  bat.H,
  bat.HR,
  bat.RBI,
  bat.BB,
  bat.SO
  
FROM 
    Master AS mas
      INNER JOIN 
    Batting AS bat ON mas.playerID = bat.playerID
  WHERE nameLast = "Griffey";
  
-- End file
  5. Use source from the MySQL prompt to execute the file.
mysql> source griffey-table.sql

Now I can check this new table using DESCRIBE.

mysql> DESCRIBE Griffeys;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| playerID  | varchar(255) | NO   |     | NULL    |       |
| birthYear | int(11)      | NO   |     | NULL    |       |
| nameFirst | varchar(255) | NO   |     | NULL    |       |
| nameLast  | varchar(255) | NO   |     | NULL    |       |
| weight    | int(11)      | YES  |     | NULL    |       |
| height    | int(11)      | YES  |     | NULL    |       |
| bats      | varchar(255) | YES  |     | NULL    |       |
| throws    | varchar(255) | YES  |     | NULL    |       |
| retroID   | varchar(255) | YES  |     | NULL    |       |
| bbrefID   | varchar(255) | YES  |     | NULL    |       |
| teamID    | varchar(255) | YES  |     | NULL    |       |
| lgID      | varchar(255) | YES  |     | NULL    |       |
| yearID    | int(11)      | YES  |     | NULL    |       |
| G         | int(11)      | YES  |     | NULL    |       |
| AB        | int(11)      | YES  |     | NULL    |       |
| R         | int(11)      | YES  |     | NULL    |       |
| H         | int(11)      | YES  |     | NULL    |       |
| HR        | int(11)      | YES  |     | NULL    |       |
| RBI       | int(11)      | YES  |     | NULL    |       |
| BB        | int(11)      | YES  |     | NULL    |       |
| SO        | int(11)      | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+
21 rows in set (0.00 sec)

Or use SELECT *:

mysql> SELECT * FROM Griffeys;
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
| playerID  | birthYear | nameFirst | nameLast | weight | height | bats | throws | retroID  | bbrefID   | teamID | lgID | yearID | G    | AB   | R    | H    | HR   | RBI  | BB   | SO   |
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1973 |   25 |   86 |   19 |   33 |    3 |   14 |    6 |   10 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1974 |   88 |  227 |   24 |   57 |    2 |   19 |   27 |   43 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1975 |  132 |  463 |   95 |  141 |    4 |   46 |   67 |   67 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1976 |  148 |  562 |  111 |  189 |    6 |   74 |   62 |   65 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1977 |  154 |  585 |  117 |  186 |   12 |   57 |   69 |   84 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1978 |  158 |  614 |   90 |  177 |   10 |   63 |   54 |   70 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1979 |   95 |  380 |   62 |  120 |    8 |   32 |   36 |   39 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1980 |  146 |  544 |   89 |  160 |   13 |   85 |   62 |   77 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1981 |  101 |  396 |   65 |  123 |    2 |   34 |   39 |   42 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1982 |  127 |  484 |   70 |  134 |   12 |   54 |   39 |   58 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1983 |  118 |  458 |   60 |  140 |   11 |   46 |   34 |   45 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1984 |  120 |  399 |   44 |  109 |    7 |   56 |   29 |   32 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1985 |  127 |  438 |   68 |  120 |   10 |   69 |   41 |   51 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | NYA    | AL   |   1986 |   59 |  198 |   33 |   60 |    9 |   26 |   15 |   24 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1986 |   80 |  292 |   36 |   90 |   12 |   32 |   20 |   43 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1987 |  122 |  399 |   65 |  114 |   14 |   64 |   46 |   54 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | ATL    | NL   |   1988 |   69 |  193 |   21 |   48 |    2 |   19 |   17 |   26 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1988 |   25 |   50 |    5 |   14 |    2 |    4 |    2 |    5 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1989 |  106 |  236 |   26 |   62 |    8 |   30 |   29 |   42 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1989 |  127 |  455 |   61 |  120 |   16 |   61 |   44 |   83 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | CIN    | NL   |   1990 |   46 |   63 |    6 |   13 |    1 |    8 |    2 |    5 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | SEA    | AL   |   1990 |   21 |   77 |   13 |   29 |    3 |   18 |   10 |    3 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1990 |  155 |  597 |   91 |  179 |   22 |   80 |   63 |   81 |
| griffke01 |      1950 | Ken       | Griffey  |    190 |     71 | L    | L      | grifk001 | griffke01 | SEA    | AL   |   1991 |   30 |   85 |   10 |   24 |    1 |    9 |   13 |   13 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1991 |  154 |  548 |   76 |  179 |   22 |  100 |   71 |   82 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1992 |  142 |  565 |   83 |  174 |   27 |  103 |   44 |   67 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1993 |  156 |  582 |  113 |  180 |   45 |  109 |   96 |   91 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1994 |  111 |  433 |   94 |  140 |   40 |   90 |   56 |   73 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1995 |   72 |  260 |   52 |   67 |   17 |   42 |   52 |   53 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1996 |  140 |  545 |  125 |  165 |   49 |  140 |   78 |  104 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1997 |  157 |  608 |  125 |  185 |   56 |  147 |   76 |  121 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1998 |  161 |  633 |  120 |  180 |   56 |  146 |   76 |  121 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   1999 |  160 |  606 |  123 |  173 |   48 |  134 |   91 |  108 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2000 |  145 |  520 |  100 |  141 |   40 |  118 |   94 |  117 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2001 |  111 |  364 |   57 |  104 |   22 |   65 |   44 |   72 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2002 |   70 |  197 |   17 |   52 |    8 |   23 |   28 |   39 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2003 |   53 |  166 |   34 |   41 |   13 |   26 |   27 |   44 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2004 |   83 |  300 |   49 |   76 |   20 |   60 |   44 |   67 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2005 |  128 |  491 |   85 |  148 |   35 |   92 |   54 |   93 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2006 |  109 |  428 |   62 |  108 |   27 |   72 |   39 |   78 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2007 |  144 |  528 |   78 |  146 |   30 |   93 |   85 |   99 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CIN    | NL   |   2008 |  102 |  359 |   51 |   88 |   15 |   53 |   61 |   64 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | CHA    | AL   |   2008 |   41 |  131 |   16 |   34 |    3 |   18 |   17 |   25 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   2009 |  117 |  387 |   44 |   83 |   19 |   57 |   63 |   80 |
| griffke02 |      1969 | Ken       | Griffey  |    195 |     75 | L    | L      | grifk002 | griffke02 | SEA    | AL   |   2010 |   33 |   98 |    6 |   18 |    0 |    7 |    9 |   17 |
+-----------+-----------+-----------+----------+--------+--------+------+--------+----------+-----------+--------+------+--------+------+------+------+------+------+------+------+------+
45 rows in set (0.00 sec)

Exit MySQL

The command to exit MySQL is–you guessed it–exit.

mysql> exit
bye

UFC 226 brought out the big guns (again…)

Why did the UFC let Brock Lesnar challenge Daniel Cormier for the belt?

This weekend I watched UFC 226, Miocic vs. Cormier. This was a title match with Miocic defending the heavyweight belt for the third time. Cormier landed a hard right hook in the final minute of the first round that dropped Miocic to the mat, and after a few ground-and-pound head shots, Cormier was given the win by KO (punches).

Before Joe Rogan could finish his fight recap interview, Cormier grabbed the microphone and called out Brock Lesnar (who was conveniently standing just outside the octagon). Lesnar entered the ring, said some rather disrespectful things about Miocic, then called out Cormier, alluding to an upcoming title fight between the 39-year-old Cormier and the former WWE star.

I found this spectacle surreal, and I couldn’t believe Joe Rogan let someone take his microphone. To say the least, it was an unprofessional display, and it made me feel like I was watching a stage play rather than an athletic event.

Why would Dana White allow Brock Lesnar back into the UFC?

This question was bouncing around in my head most of the night, and I decided to dig into the numbers to see if I could understand what happened.

The Sports Daily data

After some snooping, I found a data set of pay-per-view sales (the actual metric is ‘Buy Rate’). These data come from the post titled “All-Time UFC PPV Sales Data,” and they contain data up until UFC 225 (right before the fight in question).

The code chunk below downloads the table of data from the website, extracts the table, and checks the shape of this new data frame.

# packages used throughout this post
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2, etc.
library(rvest)      # read_html(), html_nodes(), html_table()
library(hrbrthemes) # theme_ipsum(), theme_ipsum_rc(), scale_color_ipsum()
library(ggthemes)   # theme_foundation(), scale_color_calc()
PPVUFC_url <- "http://thesportsdaily.com/2018/02/16/all-time-ufc-ppv-sales-data-fox11/"
PPVUFC_extraction <- PPVUFC_url %>%
     read_html() %>%
     html_nodes("table")
# check the structure of the new PPVUFC_extraction object
# PPVUFC_extraction %>% str()
# extract the html table
PpvUfcRaw <- rvest::html_table(PPVUFC_extraction[[1]]) 
# check the shape of the raw data 
# PpvUfcRaw %>% dplyr::glimpse(78)
# Now I need to clean these data up a bit by making the following changes:
# 1. rename the variables
# 2. remove the first row of data (they are the column names)
# 3. remove an empty row of data (row 2)
PayPerViewUFC <- PpvUfcRaw %>%
  dplyr::rename(event = X1,
                date = X2,
                main_event = X3,
                buy_rate = X4)
PayPerViewUFC <- PayPerViewUFC %>%
  filter(event != "Event" & event != "")
# check the dataShape
# dataShape() is a function I wrote that combines a little bit of
# dplyr::glimpse(), utils::head(), utils::tail(), and base::class()
dataShape <- function(df) {
    obs <- nrow(df)
    vars <- ncol(df)
    class <- paste0(class(df), collapse = "; ")
    first_var <- base::names(df) %>% head(1)
    last_var <- base::names(df) %>% tail(1)
    group <- dplyr::is_grouped_df(df)
    # stack the top 5 and bottom 5 rows into a single tibble
    heads_tails <- dplyr::bind_rows(utils::head(df, 5), utils::tail(df, 5)) %>%
        tibble::as_tibble()
    cat("Observations: ", obs, "\n", sep = "")
    cat("Variables: ", vars, "\n", sep = "")
    cat("Class(es): ", class, "\n", sep = " ")     
    cat("First/last variable: ", first_var, "/", last_var, "\n", sep = "")
    cat("Grouped: ", group, "\n", sep = "")
    cat("Top 5 & bottom 5 observations:", "\n", sep = "") 
    heads_tails 
} 
PayPerViewUFC %>% dataShape()
## Observations: 219
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date      main_event                         buy_rate
##  * <chr>   <chr>     <chr>                              <chr>
##  1 UFC 1   Nov 12/93 Tournament                         86,000
##  2 UFC 2   Mar 11/94 Tournament                         300,000
##  3 UFC 3   Sept 9/94 Tournament                         90,000
##  4 UFC 4   Dec16/94  Tournament                         120,000
##  5 UFC 5   Apr 7/95  Royce Gracie vs Ken Shamrock       260,000
##  6 UFC 221 Feb 11/18 Yoel Romero vs Luke Rockhold       N/A
##  7 UFC 222 Mar 3/18  Cris Cyborg vs Yana Kunitskaya     N/A
##  8 UFC 223 Apr 7/18  Khabib Nurmagomedov vs Al Iaquinta N/A
##  9 UFC 224 May 12/18 Amanda Nunes vs Raquel Pennington  N/A
## 10 UFC 225 Jun 9/18  Robert Whittaker vs Yoel Romero    250,000

Export the raw and the processed data files.

# writeLines(fs::dir_ls("data"))
write_csv(as_data_frame(PpvUfcRaw), "data/PpvUfcRaw.csv")
write_csv(as_data_frame(PayPerViewUFC), "data/PayPerViewUFC.csv")

Unique Events

These should be the UFC events (all the way back to the beginning). I want to see if each event is unique (1 per row). The best way to be sure of this is with base::identical()

base::identical(x = nrow(dplyr::distinct(PayPerViewUFC, event)),
          y = nrow(PayPerViewUFC))
## [1] TRUE

That’s helpful information–now I know I don’t have duplicate identification numbers for each UFC event.
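
Another way to double-check this (a minimal sketch, using the same data frame) is to count the events and look for anything that appears more than once:

# any duplicated event would show up here with n > 1
# (this should return zero rows, matching the identical() check above)
PayPerViewUFC %>%
  dplyr::count(event, sort = TRUE) %>%
  dplyr::filter(n > 1)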

The dates for each event

The next column in the data set is date, and these are all given as Month(abbreviation) DD/YY. I can quickly clean these data up using the lubridate::mdy() function.

# check the format
PayPerViewUFC$date %>% glimpse(78)
##  chr [1:219] "Nov 12/93" "Mar 11/94" "Sept 9/94" "Dec16/94" "Apr 7/95" ...
# pick the function and parse date
PayPerViewUFC$date <- lubridate::mdy(PayPerViewUFC$date) 
# check the new date
PayPerViewUFC$date %>% glimpse(78)
##  Date[1:219], format: "1993-11-12" "1994-03-11" "1994-09-09" "1994-12-16" "1995-04-07" ...

The Main Event column

These are the titles for each main event in the UFC. We can see a quick count of how many are listed as Tournament, how many rematches there are (those with a Main Event that showed up more than once), and how many events are listed only once.

PayPerViewUFC %>%
  dplyr::count(main_event, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    main_event                            n
##    <chr>                              <int>
##  1 Tournament                           11
##  2 Chuck Liddell vs Randy Couture        3
##  3 Anderson Silva vs Chael Sonnen        2
##  4 Andrei Arlovski vs Tim Sylvia         2
##  5 Chuck Liddell vs Tito Ortiz           2
##  6 Frankie Edgar vs Benson Henderson     2
##  7 Frankie Edgar vs Gray Maynard         2
##  8 George St-Pierre vs Matt Serra        2
##  9 Johny Hendricks vs Robbie Lawler      2
## 10 Jose Aldo vs Chad Mendes              2

The ‘Buy Rates’ column

This column is the pay-per-view buy rate, but as noted in the original post,

No need for much of an introduction here – this is a list of the sales totals for every UFC pay-per-view since Day 1. Now, since the UFC is a private company and doesn’t release sales info, all this is based on estimates, usually released by the inimitable Dave Meltzer based on info from PPV providers (and listed on the Wikipedia pages for the events – UFC 178 & UFC 179 from Tapology). And during the Dark Ages of the sport, when it was pretty much banned everywhere, no PPV info is available.

These can be formatted correctly by removing the comma (stringr::str_remove_all()) and then converting to numeric (base::as.numeric()).

PayPerViewUFC <- PayPerViewUFC %>%
  mutate(buy_rate = stringr::str_remove_all(string = buy_rate, pattern = ","),
         buy_rate = base::as.numeric(buy_rate))
PayPerViewUFC %>% dataShape()
## Observations: 219
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date       main_event                         buy_rate
##  * <chr>   <date>     <chr>                                 <dbl>
##  1 UFC 1   1993-11-12 Tournament                            86000
##  2 UFC 2   1994-03-11 Tournament                           300000
##  3 UFC 3   1994-09-09 Tournament                            90000
##  4 UFC 4   1994-12-16 Tournament                           120000
##  5 UFC 5   1995-04-07 Royce Gracie vs Ken Shamrock         260000
##  6 UFC 221 2018-02-11 Yoel Romero vs Luke Rockhold             NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Yana Kunitskaya           NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomedov vs Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs Raquel Pennington        NA
## 10 UFC 225 2018-06-09 Robert Whittaker vs Yoel Romero      250000

Remove Tournaments to focus on events with fighters

I’m going to filter out the Main Events that were merely listed as Tournament.

UFCFighterEvents <- PayPerViewUFC %>%
  dplyr::filter(main_event != "Tournament")
UFCFighterEvents %>% dataShape()
## Observations: 208
## Variables: 4
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 4
##    event   date       main_event                         buy_rate
##  * <chr>   <date>     <chr>                                 <dbl>
##  1 UFC 5   1995-04-07 Royce Gracie vs Ken Shamrock         260000
##  2 UFC 6   1995-07-14 Ken Shamrock vs Dan Severn           240000
##  3 UFC 7   1995-09-08 Ken Shamrock vs Oleg Taktarov        190000
##  4 UFC 8   1996-02-16 Ken Shamrock vs Kimo Leopoldo        300000
##  5 UFC 9   1996-05-17 Ken Shamrock vs Dan Severn           141000
##  6 UFC 221 2018-02-11 Yoel Romero vs Luke Rockhold             NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Yana Kunitskaya           NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomedov vs Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs Raquel Pennington        NA
## 10 UFC 225 2018-06-09 Robert Whittaker vs Yoel Romero      250000

Separate the fighters into their own columns

Now I want to separate the names of both fighters in the main_event column into two different columns fighter_1 and fighter_2.

UFCFighterEvents <- UFCFighterEvents %>%
    tidyr::separate(
        col = main_event,
        into = c("fighter_1",
                 "fighter_2"),
        sep = " vs ",
        remove = FALSE)
UFCFighterEvents %>% dataShape()
## Observations: 208
## Variables: 6
## Class(es):  data.frame
## First/last variable: event/buy_rate
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 6
##    event   date       main_event         fighter_1    fighter_2   buy_rate
##  * <chr>   <date>     <chr>              <chr>        <chr>          <dbl>
##  1 UFC 5   1995-04-07 Royce Gracie vs K… Royce Gracie Ken Shamro…   260000
##  2 UFC 6   1995-07-14 Ken Shamrock vs D… Ken Shamrock Dan Severn    240000
##  3 UFC 7   1995-09-08 Ken Shamrock vs O… Ken Shamrock Oleg Takta…   190000
##  4 UFC 8   1996-02-16 Ken Shamrock vs K… Ken Shamrock Kimo Leopo…   300000
##  5 UFC 9   1996-05-17 Ken Shamrock vs D… Ken Shamrock Dan Severn    141000
##  6 UFC 221 2018-02-11 Yoel Romero vs Lu… Yoel Romero  Luke Rockh…       NA
##  7 UFC 222 2018-03-03 Cris Cyborg vs Ya… Cris Cyborg  Yana Kunit…       NA
##  8 UFC 223 2018-04-07 Khabib Nurmagomed… Khabib Nurm… Al Iaquinta       NA
##  9 UFC 224 2018-05-12 Amanda Nunes vs R… Amanda Nunes Raquel Pen…       NA
## 10 UFC 225 2018-06-09 Robert Whittaker … Robert Whit… Yoel Romero   250000

Pay-per-view purchases over time

Now I can visualize these data by plotting date on the x-axis and buy_rate on the y-axis.

UFCFighterEvents %>%
  ggplot(aes(x = date,
             y = buy_rate)) +
    geom_point(alpha = 0.5,
               size = 1.5) +
          ggplot2::labs(x = "Date",
                        y = "Pay-Per-View Buy Rate",
                        title = "UFC pay-per-view viewership") +
   ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

This shows a pretty clear rise in UFC pay-per-view purchases since 2001. I want to get rid of some additional zeros in the buy_rate variable by creating buy_rate_mil (each buy rate divided by 1,000,000).

I also want to check the top 50 UFC fights and see who the main attractions were, because I’m curious about which fighters bring the most viewership (and when this tends to happen). I am going to switch to a line plot because I will be looking at fewer points and I’d like to see the trends (or changes) more clearly.

# Create buy_rate_mil --------
UFCFighterEvents <- UFCFighterEvents %>%
  dplyr::mutate(buy_rate_mil = buy_rate/1000000)
UFCFighterEventsTop50 <- UFCFighterEvents %>%
  dplyr::arrange(desc(buy_rate)) %>%
  utils::head(50)
UFCFighterEventsTop50 %>%
    ggplot(aes(x = date, y = buy_rate_mil)) +
    geom_line() +
      theme_ipsum() +
        ggplot2::labs(x = "Date",
                      y = "Pay-Per-View Buy Rate (Millions)",
                      title = "Top 50 UFC pay-per-view events") +
   ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

This graph shows the 50 ‘most viewed’ UFC events. I added theme_ipsum() from the hrbrthemes package. Read more about it here.

It looks like there are actually three distinct peaks (one in mid-2009, the others in early and late 2016). Any idea who was fighting in these events?
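
One quick way to find out (a minimal sketch that reuses the UFCFighterEventsTop50 data frame from above) is to peek at the top-50 events that fall in those peak years:

# top-50 events from the peak years (2009 and 2016), highest buy rates first
UFCFighterEventsTop50 %>%
  dplyr::filter(lubridate::year(date) %in% c(2009, 2016)) %>%
  dplyr::arrange(desc(buy_rate_mil)) %>%
  dplyr::select(event, date, main_event, buy_rate_mil) %>%
  head(10)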

Write a function to summarize the buy_rate and buy_rate_mil variables

I want some descriptive statistics on the two PPV metrics in this data set, so I am going to write a function that will give me a quick numerical summary of each variable. Learn more about writing functions in the awesome “Programming with dplyr” vignette.

numvarSum <- function(df, expr) {
  expr <- enquo(expr) # turns expr into a quosure
  summarise(df,
          n = sum((!is.na(!!expr))), # non-missing
          na = sum((is.na(!!expr))), # missing
          mean = mean(!!expr, na.rm = TRUE), # unquotes mean()
          median = median(!!expr, na.rm = TRUE), # unquotes median()
          sd = sd(!!expr, na.rm = TRUE), # unquotes sd()
          variance = var(!!expr, na.rm = TRUE), # unquotes var()
          min = min(!!expr, na.rm = TRUE), # unquotes min()
          max = max(!!expr, na.rm = TRUE), # unquotes max()
          se = sd/sqrt(n)) # standard error
}
UFCFighterEvents %>%
  numvarSum(buy_rate)
##     n na   mean median     sd  variance   min     max    se
## 1 184 24 437701 350000 317497 1.008e+11 35000 1650000 23406
UFCFighterEvents %>%
  numvarSum(buy_rate_mil)
##     n na   mean median     sd variance   min  max      se
## 1 184 24 0.4377   0.35 0.3175   0.1008 0.035 1.65 0.02341

How many events had more than 1 million buys?

The line graph above shows the buy rate trends of the top 50 UFC events over time. However, I want some more details on the events themselves. I will start by creating a factor variable (buy_rate_fct) that sorts the buy_rate_mil variable into five levels. The numerical summaries above can help me make sure the new variable is coded correctly.

UFCFighterEvents <- UFCFighterEvents %>% 
 dplyr::mutate(buy_rate_fct = case_when(
buy_rate_mil < 1.00 ~ "less than 1.00 million ppvs",
buy_rate_mil >= 1.00 & buy_rate_mil < 1.25 ~ "1.00-1.25 million ppvs",
buy_rate_mil >= 1.25 & buy_rate_mil < 1.50 ~ "1.25-1.50 million ppvs",
buy_rate_mil >= 1.50 & buy_rate_mil < 1.75 ~ "1.50-1.75 million ppvs",
buy_rate_mil >= 1.75 & buy_rate_mil < 2.00 ~ "1.75-2.00 million ppvs"),
# convert to factor
buy_rate_fct = factor(buy_rate_fct,
                     levels = c("1.75-2.00 million ppvs",
                                "1.50-1.75 million ppvs",
                                "1.25-1.50 million ppvs",
                                "1.00-1.25 million ppvs",
                            "less than 1.00 million ppvs")))
# check the buy_rate_fct
knitr::kable(
UFCFighterEvents %>% dplyr::count(buy_rate_fct))
|buy_rate_fct                |   n|
|:---------------------------|---:|
|1.50-1.75 million ppvs      |   3|
|1.25-1.50 million ppvs      |   1|
|1.00-1.25 million ppvs      |  11|
|less than 1.00 million ppvs | 169|
|NA                          |  24|

Now I can map the adjusted buy rate (buy_rate_mil) to the y-axis and the categorical buy rate variable (buy_rate_fct) to the color aesthetic to see more details of the distribution.

UFCFighterEvents %>%
  dplyr::filter(!is.na(buy_rate_fct)) %>%
    ggplot(aes(x = date,
               y = buy_rate_mil,
               group = buy_rate_fct)) +
    geom_point(aes(color = buy_rate_fct), size = 1.5) +
    ggplot2::xlab("Date") +
    ggplot2::ylab("UFC PPV Sales (in millions)") +
    ggplot2::ggtitle(label = "Pay-per-view UFC Events",
                     subtitle = "Every UFC pay-per-view since UFC 1") +
    ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ") +
      scale_color_ipsum() +
      scale_fill_ipsum() +
      theme_ipsum_rc()

Why not just assign a different color to every level of buy_rate_mil? Too many different colors would make it hard to track the differences in the PPV buy rate (I would need to keep re-checking the legend to figure out what I was seeing).
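
Just to show what that would look like, here is a minimal sketch of the rejected approach, mapping every distinct buy_rate_mil value to its own discrete color:

# one color per distinct buy rate value: the legend swamps the plot
# and is nearly impossible to track
UFCFighterEvents %>%
  dplyr::filter(!is.na(buy_rate_mil)) %>%
    ggplot(aes(x = date,
               y = buy_rate_mil,
               color = factor(buy_rate_mil))) +
    geom_point(size = 1.5) +
    ggplot2::labs(x = "Date",
                  y = "UFC PPV Sales (in millions)")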

Which fighters attract the most viewership?

We can see from looking at the head() of the data in UFCFighterEventsTop50 that Conor McGregor is in four of the top 5 fights.

knitr::kable(
UFCFighterEventsTop50 %>%
  dplyr::select(main_event,
                fighter_1,
                fighter_2,
                buy_rate,
                buy_rate_mil) %>%
  head(5))
|main_event                      |fighter_1      |fighter_2      | buy_rate| buy_rate_mil|
|:-------------------------------|:--------------|:--------------|--------:|------------:|
|Nate Diaz vs Conor McGregor     |Nate Diaz      |Conor McGregor |  1650000|         1.65|
|Brock Lesnar vs Frank Mir       |Brock Lesnar   |Frank Mir      |  1600000|         1.60|
|Conor McGregor vs Nate Diaz     |Conor McGregor |Nate Diaz      |  1500000|         1.50|
|Eddie Alvarez vs Conor McGregor |Eddie Alvarez  |Conor McGregor |  1300000|         1.30|
|Jose Aldo vs Conor McGregor     |Jose Aldo      |Conor McGregor |  1200000|         1.20|

Tidy the fighters

I want to create a tidy data frame that gathers the fighters from the main_event column (now in the fighter_1 and fighter_2 columns). This means I want a single column (fighter_val) that lists all the fighters, and another column (fighter_key) that tells me whether they were fighter_1 or fighter_2. I also reorganize the data frame with some of dplyr’s handy select() helper functions, and sort the data so we can see what this new arrangement looks like.

TidyUFCFighters <- UFCFighterEvents %>%
  tidyr::gather(key = fighter_key,
                value = fighter_val,
                fighter_1:fighter_2) %>%
  dplyr::select(event,
                date,
        dplyr::contains("buy"),
        dplyr::contains("fight"),
        main_event) %>%
  dplyr::arrange(event)
TidyUFCFighters %>% dataShape()
## Observations: 416
## Variables: 8
## Class(es):  data.frame
## First/last variable: event/main_event
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 8
##    event   date       buy_rate buy_rate_mil buy_rate_fct       fighter_key
##  * <chr>   <date>        <dbl>        <dbl> <fct>              <chr>
##  1 UFC 100 2009-07-11  1600000        1.6   1.50-1.75 million… fighter_1
##  2 UFC 100 2009-07-11  1600000        1.6   1.50-1.75 million… fighter_2
##  3 UFC 101 2009-08-08   850000        0.85  less than 1.00 mi… fighter_1
##  4 UFC 101 2009-08-08   850000        0.85  less than 1.00 mi… fighter_2
##  5 UFC 102 2009-08-29   435000        0.435 less than 1.00 mi… fighter_1
##  6 UFC 99  2009-06-13   360000        0.36  less than 1.00 mi… fighter_2
##  7 UFC Br… 1998-10-16       NA       NA    <NA>               fighter_1
##  8 UFC Br… 1998-10-16       NA       NA    <NA>               fighter_2
##  9 UFC Ja… 1997-12-21       NA       NA    <NA>               fighter_1
## 10 UFC Ja… 1997-12-21       NA       NA    <NA>               fighter_2
## # ... with 2 more variables: fighter_val <chr>, main_event <chr>

This is exactly what I should expect: each event is listed twice, once for each fighter.
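
A quick sanity check (a minimal sketch based on the row counts above): gathering the two fighter columns should exactly double the number of rows.

# 2 x 208 events = 416 rows in the tidy data; this should return TRUE
nrow(TidyUFCFighters) == 2 * nrow(UFCFighterEvents)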

Who are the top five (most frequently occurring) fighters?

I can now use the dplyr::count() function to determine which fighter has the most main event occurrences.

knitr::kable(
TidyUFCFighters %>%
  dplyr::count(fighter_val, sort = TRUE) %>%
  head(5))
|fighter_val    |  n|
|:--------------|--:|
|Randy Couture  | 17|
|Anderson Silva | 16|
|Tito Ortiz     | 15|
|Chuck Liddell  | 11|
|Jon Jones      | 11|

And now I can see that Randy Couture has headlined the most UFC main events (followed by Anderson Silva and Tito Ortiz).

Which fighter draws the most PPV purchases?

Now I want to know which fighter draws the most PPV purchases in the UFC. This question is a little trickier than simply asking ‘which events have the highest PPV buy rate?’, because each event has two fighters and it isn’t always clear who is drawing the crowd. One way to get at this is by seeing how many of the most purchased events featured fighters who were also in many events.

I can do this in the following steps:

1. remove the missing PPV purchase data,

2. count the number of UFC events per fighter,

3. sort the data by the number of UFC events per fighter, then by the PPV purchase rate,

4. limit this to only the events with PPV purchase rates over 1 million.

The wrangling steps below use the same functions as above to create TopPPVFighters.

TopPPVFighters <- TidyUFCFighters %>%
  # remove missing buy rates
  filter(!is.na(buy_rate_mil)) %>%
  # count the number of occurrences per fighter
  dplyr::count(fighter_val) %>%
  # rename n to fghtr_events
  dplyr::rename(fghtr_events = n) %>%
  # join back to original tidy data set
  dplyr::left_join(., TidyUFCFighters, by = "fighter_val") %>%
 # arrange by descending number of occurrences and PPV buy rates
  dplyr::arrange(desc(fghtr_events), desc(buy_rate_mil)) %>%
  # limit to events over 1 million
  dplyr::filter(buy_rate_mil > 1.00) %>%
  # keep one row per distinct main_event
  dplyr::distinct(main_event, .keep_all = TRUE)
TopPPVFighters %>% dataShape()
## Observations: 14
## Variables: 9
## Class(es):  tbl_df; tbl; data.frame
## First/last variable: fighter_val/main_event
## Grouped: FALSE
## Top 5 & bottom 5 observations:
## # A tibble: 10 x 9
##    fighter_val    fghtr_events event   date       buy_rate buy_rate_mil
##    <chr>                 <int> <chr>   <date>        <dbl>        <dbl>
##  1 Anderson Silva           16 UFC 168 2013-12-28  1025000         1.02
##  2 Randy Couture            14 UFC 91  2008-11-15  1010000         1.01
##  3 Chuck Liddell            11 UFC 66  2006-12-30  1050000         1.05
##  4 Rashad Evans              9 UFC 114 2010-05-29  1050000         1.05
##  5 Jose Aldo                 7 UFC 194 2015-12-12  1200000         1.2
##  6 Brock Lesnar              5 UFC 116 2010-07-03  1160000         1.16
##  7 Conor McGregor            5 UFC 196 2016-03-05  1500000         1.5
##  8 Conor McGregor            5 UFC 202 2016-08-20  1650000         1.65
##  9 Conor McGregor            5 UFC 205 2016-11-12  1300000         1.3
## 10 Amanda Nunes              3 UFC 200 2016-07-09  1200000         1.2
## # ... with 3 more variables: buy_rate_fct <fct>, fighter_key <chr>,
## #   main_event <chr>

Now I can create a plot that uses the TopPPVFighters data frame and maps each fighter’s name to a label on the graph. But before I do this, I want to make a few adjustments to the fivethirtyeight theme from ggthemes. As you can see from the code below, I make a few minor changes to display the axis titles and adjust the fonts. Read more about theme_foundation() and ggthemes.

# check the colors for fivethirtyeight theme
# ggthemes_data$fivethirtyeight
 # dkgray medgray ltgray red blue green 
# "#3C3C3C" "#D2D2D2" "#F0F0F0" "#FF2700" "#008FD5" "#77AB43"
theme_fivethirtyeightv1.1 <- function(base_size = 11, base_family = "sans") {
 (theme_foundation(base_size = base_size, base_family = base_family) + 
     theme(line = element_line(color = "black"), 
             rect = element_rect(
                 fill = "#F0F0F0", 
                 linetype = 0, 
                 color = NA), 
       text = element_text(color = "#3C3C3C"), 
       axis.text = element_text(), 
       axis.ticks = element_blank(),
       axis.line = element_blank(),
       legend.background = element_rect(
                fill = "#F0F0F0",
                color = "#D2D2D2",
                size = 1), 
       legend.position = "bottom",
       legend.direction = "horizontal",
         panel.grid = element_line(color = NULL), 
         panel.grid.major = element_line(color = "#D2D2D2"), 
         panel.grid.minor = element_blank(), 
         plot.title = element_text(family = "mono",
                     hjust = 0, 
                     size = rel(1.5), 
                     face = "bold"), 
       plot.margin = grid::unit(c(1, 1, 1, 1), "lines"), 
       strip.background = element_rect()))
}
TopPPVFighters %>% 
 ggplot(aes(x = date, 
            y = buy_rate_mil, 
            label = fighter_val)) + 
 ggplot2::geom_point(aes(size = fghtr_events,
                         color = fighter_val),
                         alpha = 0.7,
                         show.legend = TRUE) +
 ggplot2::guides(color = FALSE) +
 ggplot2::scale_size_continuous(name = "number of events",
                               breaks = c(3, 6, 9, 12, 15),
                               labels = c("3", "6", "9", 
                                        "12", "15")) +
 ggrepel::geom_text_repel(direction = "both",
                          hjust = 0.5,
                          vjust = 0.5,
                          segment.size = 0.5,
                          color = "black",
                          size = 2.5) +
 ggthemes::scale_color_calc() +
 theme_fivethirtyeightv1.1() +
            ggplot2::xlab("Date") +
            ggplot2::ylab("UFC PPV Sales (in millions)") +
            ggplot2::ggtitle(label = "UFC Events by Number of Appearances", 
            subtitle = "larger point = more appearances") + 
            ggplot2::labs(caption = "data source: https://goo.gl/UWhwEZ")

The thing to notice about this graph is that Conor McGregor and Brock Lesnar are the only two fighters above the 1.5 million PPV purchase mark. It remains unknown if McGregor will return to the UFC, so it’s important to consider how many of the big names in the UFC are retired or have moved on to different franchises.

After looking at PPV purchase numbers and the recent exodus of talent (voluntary or otherwise), it makes sense that Saturday’s fight felt more like a staged drama event than a typical bout. The UFC has to compete with an increasing number of fighting organizations (Viacom’s Bellator, and the Rizin Fighting Federation, formerly Pride, out of Tokyo), and nothing attracts fight fans like a narrative.

In the next post I will look into the MMA attendance data to see if the same big names also sell the most tickets.

devtools::session_info()
##  setting  value
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  tz       America/Los_Angeles
##  date     2018-07-09