Migrating from a Microsoft Access Database to MySQL with MS ACCDB Viewer

Motivation

The new dbplyr is out on CRAN today, and I’ve wanted to update my MySQL Lahman baseball database with 2017 data, but it’s only available in .csv files and MS Access versions. I could import the .csv files, but thought it would be useful to show how I convert an MS Access database into MySQL using MS ACCDB Viewer, This application has a free trial (the paid version is under $20). I rarely recommend purchasing software, but if you use macOS and find yourself needing to get into an MS Access database, this tool comes in handy.

Getting started

This post assumes:

  1. You are running macOS (this tutorial was built using version 10.14.2)
  2. You have MySQL installed on your machine (see previous post to get everything set up)
  3. You have RStudio installed (using version 1.2.1206)
  4. You have R installed (using version 3.5.1)

Packages

These are the packages you’ll need.

library(RMySQL)
library(RMariaDB)
library(dplyr)
library(tidyverse)
library(Lahman)
library(magrittr)
library(rstudioapi)

Project folder setup

I created an RStudio project file inside a mysql-mdb-accdb folder (mysql-mdb-accdb.Rproj). If you don’t know how RStudio’s project files work, see this post for more information.

Inside this project folder, I created another sub-folder titled databases. I wanted to separate these files by type, so I created a new folder for each version of the 2017 data.

databases/csv
databases/ms-access
databases/sql

After downloading the data, the full folder tree is available below.

databases
├── csv
│ └── baseballdatabank-master
│ ├── README.txt
│ ├── core
│ │ ├── AllstarFull.csv
│ │ ├── Appearances.csv
│ │ ├── AwardsManagers.csv
│ │ ├── AwardsPlayers.csv
│ │ ├── AwardsShareManagers.csv
│ │ ├── AwardsSharePlayers.csv
│ │ ├── Batting.csv
│ │ ├── BattingPost.csv
│ │ ├── CollegePlaying.csv
│ │ ├── Fielding.csv
│ │ ├── FieldingOF.csv
│ │ ├── FieldingOFsplit.csv
│ │ ├── FieldingPost.csv
│ │ ├── HallOfFame.csv
│ │ ├── HomeGames.csv
│ │ ├── Managers.csv
│ │ ├── ManagersHalf.csv
│ │ ├── Parks.csv
│ │ ├── People.csv
│ │ ├── Pitching.csv
│ │ ├── PitchingPost.csv
│ │ ├── Salaries.csv
│ │ ├── Schools.csv
│ │ ├── SeriesPost.csv
│ │ ├── Teams.csv
│ │ ├── TeamsFranchises.csv
│ │ ├── TeamsHalf.csv
│ │ └── readme2014.txt
│ └── upstream
│ └── Teams.csv
└── ms-access
└── lahman2017
├── lahman2017.mdb
└── readme2017.txt

MS Access to MySQL using MS ACCDB Viewer

  1. Download the MS ACCDB Viewer app and the Lahman 2017 MS Access database and .csv files. I’ve created the following sub-folders for the two types of Lahman database files.
  2. Open the MS Access database in the MS ACCDB Viewer app and follow the steps below.

3. After the database exports, you will see a new lahman2017.sql file in your databases/sql. This file is too large to open in RStudio (it’s 46.4 MB), so I’ll open it in a text editor (I use Sublime Text, but many others will work).

What should I expect to see?

If you haven’t done a lot of work in SQL, you will have no way of knowing if this SQL export is ready to be loaded into MySQL (I didn’t know what to expect until I did a lot of reading and trial and error). However, I can look at the 2016 MySQL database from the Lahman website to see what to expect. Below are the first 50 lines each SQL file (2016 on the left, 2017 on the right).

The important differences in the first ~45 lines:

  1. There is no command for creating the lahman2017 database (i.e. CREATE DATABASE)
  2. After creating the database, there is no command to use the lahman2017 database
  3. The SET NAMES argument should be changed to SET NAMES utf8mb4 (this will include characters like emojis)
  4. All the tables and data have clear headers in the previous .sql file, and it’s worthwhile to add these to this version too (though it’s not required)
/*

This MySQL dump came from the Lahman 2017 mdb file available here:
http://www.seanlahman.com/baseball-archive/statistics/
The following was added:
A CREATE DATABASE statement
A USE database statement
Change the SET NAMES 'UTF8'; to SET NAMES utf8mb4;
Add the 'Table structure' and 'Records' titles (optional)

*/
CREATE DATABASE IF NOT EXISTS lahman2017;
USE lahman2017;
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

The first 50 lines of my lahman2017.sql file now look like this:

RStudio’s Terminal pane

In previous posts, I mention opening and using Terminal. RStudio also has it’s own Terminal pane (see below).

This is accessible under Tools > Terminal > New Terminal.

In a new Terminal pane, enter the following commands.

$ cd databases/sql
$ export PATH=$PATH:/usr/local/mysql/bin
$ echo $PATH

This returns what looks like a long and convoluted file path (this is normal).

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/….

Now I enter the following:

mysql -u root -p 

Enter your password and you should see the mysql> prompt. I want to check the databases in MySQL before adding the new lahman2017 data, and I can do this with SHOW DATABASES;.

mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| lahman2016 |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)

Because I moved into the sql folder, I can execute the lahman2017.sql file by entering source lahman2017.sql into the MySQL prompt.

mysql> source lahman2017.sql

This will return a long output telling you things like Query OK and Records: 11370 Duplicates: 0 Warnings: 0. When it’s finished, check the new tables with SHOW TABLES;

You should see this.

Great! Now I have the new database set up in MySQL. I am going to connect to RStudio and start querying.

Changing MySQL settings

Before connecting to MySQL from RStudio, you might need to change a few settings for MySQL in the System Preferences.

  1. Select MySQL >> Initialize Database
  2. Now enter a new password for the root user and select Use Legacy Password Encryption
  3. Then click Start MySQL Sever

The dplyr::src_mysql() function

In the previous post, I set up my connection to RStudio using the DBI package.

This time I will set up my connection with the dplyr package (part of the tidyverse). The function that does this is dplyr::src_mysql(), and I include arguments for dbname, host, user, and rstudio::askForPassword("Database password") which prompts a fill-in box for my password.

LahmandplyrMySQL <- dplyr::src_mysql(dbname = "lahman2017", 
host = "localhost",
user = "root",
# password = rstudioapi::askForPassword("Database password")
)

Print the LahmandplyrMySQL object to the console.

LahmandplyrMySQL
src: mysql 8.0.13 [root@localhost:/lahman2017]
tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers,
AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost,
CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost,
HallOfFame, HomeGames, Managers, ManagersHalf, Parks, People, Pitching,
PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises,
TeamsHalf

As I can see, this prints out much more information than the DBI connection in my previous post. The LahmandplyrMySQL object tells me the source of the connection and the tables in the database.

ALWAYS read the README file (readme2017.txt)

The directions are right there in the title. Important stuff is in these files (like changes to the tables in the database). This section is an example of such information:

One notable change: The name of the table that contains biographical information for players has been changed from “Master” to “People” top better reflect its contents.

This information saved me from some potential headache’s (my previous query on the Master table would give an error).

In the next section, I’ll go over the new changes in dbplyr with the dbplyr connection. The dplyr::tbl() function takes the src argument which is the connection to the database (LahmandplyrMySQL). The next argument is sql(), and this is a character string (inside " ") which SQL commands. I like to capitalize the MySQL syntax, refer to tables in SQL with TitleCase, and the columns in lowercase.

API changes

1) Calls of the form dplyr::foo() are now evaluated in the database, rather than locally.

I am assuming this means that using dplyr instead of MySQL in queries is run in the database. Such as the following:

I also pass everything to the dplyr::glimpse(78) so it prints nicely to the window.

start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
# switch to dplyr
dplyr::filter(bats == "R" & birthYear >= 1900) %>%
# view object
dplyr::glimpse(78)

Observations: ??
Variables: 24
Database: mysql 8.0.13 [root@localhost:/lahman2017]
$ playerID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…
$ birthYear 1964, 1987, 1943, 1963, 1971, 1942, 1981, 1934, 1939, …
$ birthMonth 10, 11, 7, 9, 12, 8, 12, 2, 8, 9, 2, 8, 6, 9, 7, 1, 9,…
$ birthDay 26, 13, 15, 21, 28, 9, 27, 5, 5, 8, 16, 17, 2, 15, 15,…
$ birthCountry "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ birthState "IL", "NY", "CA", "CA", "HI", "AL", "CO", "AL", "AL", …
$ birthCity "Chicago", "Staten Island", "Long Beach", "Escondido",…
$ deathYear NA, NA, 2016, NA, NA, 2001, NA, NA, 1984, NA, NA, NA, …
$ deathMonth NA, NA, 2, NA, NA, 1, NA, NA, 8, NA, NA, NA, NA, NA, N…
$ deathDay NA, NA, 15, NA, NA, 22, NA, NA, 16, NA, NA, NA, NA, NA…
$ deathCountry NA, NA, "USA", NA, NA, "USA", NA, NA, "USA", NA, NA, N…
$ deathState NA, NA, "CA", NA, NA, "NY", NA, NA, "GA", NA, NA, NA, …
$ deathCity NA, NA, "Long Beach", NA, NA, "New York", NA, NA, "Atl…
$ nameFirst "Steve", "Tim", "Dave", "Troy", "Benny", "Tommie", "Da…
$ nameLast "Adkins", "Adleman", "Adlesh", "Afenir", "Agbayani", "…
$ nameGiven "Steven Thomas", "Timothy Max", "David George", "Micha…
$ weight 210, 225, 187, 185, 225, 195, 215, 180, 190, 190, 200,…
$ height 78, 77, 72, 76, 72, 71, 75, 72, 75, 75, 78, 74, 71, 75…
$ bats "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ throws "L", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ debut "1990-09-12", "2016-05-01", "1963-05-12", "1987-09-14"…
$ finalGame "1990-10-03", "2017-09-19", "1968-09-28", "1992-07-10"…
$ retroID "adkis001", "adlet001", "adled101", "afent001", "agbab…
$ bbrefID "adkinst01", "adlemti01", "adlesda01", "afenitr01", "a…

2) The vars argument to tbl_sql() has been formally deprecated.

The vars argument used to be used to supply a list of columns in the tbl. The previous language is below:

If known, the names of the variables in the tbl. This is relatively expensive to determine automatically, so is cached throughout dplyr. However, you should usually be able to leave this blank and it will be determined from the context.

3) src and tbl objects now include a class generated from the class of the underlying connection object. This makes it possible for dplyr backends to implement different behaviour at the dplyr level, when needed.

I assume this is visible in the str() and class() using the previous query.

start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
# switch to dplyr
dplyr::filter(bats == "R" & birthYear >= 1900) %>%
# view structure
utils::str()


List of 2
$ src:List of 2
..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
.. .. ..@ Id: int [1:2] 0 0
..$ disco:
..- attr(*, "class")= chr [1:4] "src_MySQLConnection" "src_dbi" "src_sql" "src"
$ ops:List of 4
..$ name: chr "filter"
..$ x :List of 2
.. ..$ x : 'sql' chr "SELECT * FROM People"
.. ..$ vars: chr [1:24] "playerID" "birthYear" "birthMonth" "birthDay" …
.. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
..$ dots:List of 1
.. ..$ : language ~bats == "R" & birthYear >= 1900
.. .. ..- attr(*, ".Environment")=
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
- attr(*, "class")= chr [1:5] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" …

The class

start with SQL
dplyr::tbl(src = LahmandplyrMySQL, sql("SELECT * FROM People")) %>%
# switch to dplyr
dplyr::filter(bats == "R" & birthYear >= 1900) %>%
# view structure
base::class()

[1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
[4] "tbl_lazy" "tbl"

SQL translation

I will skip over The new as.integer64(x) is translated to CAST(x AS BIGINT).

cummean() now translates to SQL AVG() as opposed to MEAN().

The cummean() is an aggregate window function. It works by keeping track of 1) the running sum of a values, and 2) the ith (or count) of the values. There is a great example of how this works on StackOverflow here.

I created an example data frame below to show how this works.

get a sample
x <- base::sample(1:10)
create a data frame of the intermediate steps and cummean
base::data.frame(sample= x,
                cumsum_c = cumsum(x),
seq_along_x = seq_along(x),
cummean = cumsum(x) / seq_along(x))
sample cumsum_x seq_along_x cummean
5 5 1 5.000000
3 8 2 4.000000
4 12 3 4.000000
2 14 4 3.500000
1 15 5 3.000000
7 22 6 3.666667
8 30 7 4.285714
9 39 8 4.875000
10 49 9 5.444444
6 55 10 5.500000

The cumsum_x is a running total of sample (5 + 3 = 8, 5 + 3 + 4 = 12, etc.), and seq_along_x is the count of the values. When I get the ratio of cumsum_x / seq_along_x, I get the cummean column (5 / 1 = 5, 8 / 2 = 4.0, etc.)

This is the cumulative mean of RBIs in the 2017 Batting data table.

get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
sql("SELECT playerID, RBI FROM Batting")) %>%
dplyr::arrange(desc(RBI)) %>%
dplyr::mutate(cummean_rbi = cummean(RBI)) %>%
dplyr::select(playerID, cummean_rbi) %>%
utils::head(5)

# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(RBI)
playerID cummean_rbi

1 wilsoha01 191
2 gehrilo01 188.
3 greenha01 186
4 foxxji01 183.
5 gehrilo01 182.

Compare this to the mean RBI per player.

get the batting table
dplyr::tbl(src = LahmandplyrMySQL,
sql("SELECT playerID, RBI FROM Batting")) %>%
dplyr::arrange(desc(RBI)) %>%
dplyr::group_by(playerID) %>%
dplyr::summarize(mean_rbi = mean(RBI, na.rm = TRUE)) %>%
dplyr::select(playerID, mean_rbi) %>%
dplyr::arrange(desc(mean_rbi)) %>%
utils::head(5)
# Source: lazy query [?? x 2]
# Database: mysql 8.0.13 [root@localhost:/lahman2017]
# Ordered by: desc(mean_rbi)
playerID mean_rbi
1 dimagjo01 118.
2 gehrilo01 117.
3 pujolal01 113.
4 cabremi01 108.
5 belleal01 103.

Just for kicks, the AVG() function in MySQL is implemented like this:

mysql> SELECT 
playerID,
AVG(RBI) AS avg_rbi
FROM
Batting
GROUP BY
playerID
ORDER BY
avg_rbi DESC
LIMIT 5;
+-----------+----------+
| playerID | avg_rbi |
+-----------+----------+
| dimagjo01 | 118.2308 |
| gehrilo01 | 117.3529 |
| pujolal01 | 112.8235 |
| cabremi01 | 107.5333 |
| belleal01 | 103.2500 |
+-----------+----------+
5 rows in set (0.07 sec)

Closing

Moving form one relational database management system to another can slow down your workflow, so having (inexpensive) tools to quickly transform a database into something you can work with is a time-saver. I don’t like leaving the RStudio IDE (unless it’s to my browser), so the quicker I can get my data source into this environment, the quicker I can get to work on it.

Next post will probably be on PostgreSQL.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.