Literate Programming & Dynamic Document Options in Stata

TL;DR

I’ve spent the last few months attempting to incorporate different literate programming and reproducible research options with the Stata statistical software. This post provides a quick overview of my goal, a brief “how-to” on each option, and my thoughts on realistically introducing them in a workflow.


What is literate programming?

Literate programming is a term coined by Donald E. Knuth in 1984. The general idea is to combine human-readable text and machine-readable commands in the same document, manual, or website. At the beginning of his paper, Knuth writes,

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Although Knuth is describing the process of writing computer programs, his concept applies to any scenario where a series of commands is used to get a computer to perform a particular set of functions.

The data analysis process includes a successive set of commands used to manipulate the rawest form of the data, transform it into summaries or visualizations, and then model it for predictions and inferences. Each step in the process is based on the preceding step, so tracking the entire method in a systematic way is essential for remaining organized and being able to reproduce your work.

The yardstick I used to evaluate each Stata literate programming option was how seamlessly it let me move between the Stata commands and the human-readable explanations.

A Typical Stata Workflow

Below is an example workflow in Stata.

stata_workflow

While working in Stata, commands go into a .do file, which creates output as either .gph (image) or .smcl (text) files. The text results then get written up in a .docx, and the tables and figures are created using .xlsx. These documents are then sent off to a scientific journal, where the finished paper will most likely end up as a .pdf or .html on the web.
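To make the first two steps concrete, here is a minimal sketch of a .do file that writes its text output to a .smcl log and saves a figure as a .gph file. The file names (analysis.smcl, price_mpg.gph) and the choice of variables are hypothetical, picked only for illustration; the auto dataset ships with Stata.

```stata
* analysis.do: a hypothetical do-file illustrating the workflow described here
capture log close                      // close any log left open by an earlier run
log using analysis.smcl, replace smcl  // text output goes to a .smcl log

sysuse auto, clear                     // example dataset that ships with Stata
summarize price mpg weight             // descriptive statistics
regress price mpg weight               // a simple linear model

scatter price mpg                      // draw a figure
graph save price_mpg.gph, replace      // store it in Stata's native graph format

log close
```

Everything after this point (the write-up, the tables, the final .pdf) happens outside Stata, which is where the extra file types come from.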

So to recap, the general process is:

.do >> .gph >> .smcl >> .docx >> .xlsx >> .pdf >> .html 

This workflow involves seven different file types to create a single finished manuscript (that will usually contain only text and images).

Why should I care?

If you’ve read this far, the answer should be obvious: the process outlined above is inefficient. The analysis process is split across Stata, MS Word, MS Excel, Adobe/Preview, and whatever web browser you’re using. This makes working in Stata tedious.

Solution: Digital Notebooks

I recently came across a white paper that discusses the benefits of using Notebooks, and I’ve summarized the main points below:

  • “…notebooks speed up the process of trying out data models and frameworks and testing hypotheses, enabling everyone to work quickly, iteratively and collaboratively”
  • “…can be used to perform an analysis, generate visualizations, and add media and annotations within a single document.”
  • “…can be used to annotate the analyses for future reference and keep track of each step in the data discovery process so that the notebook becomes a source of project communication and documentation, as well as a learning resource for other data scientists.”

Although the paper refers specifically to Jupyter Notebooks, RStudio recently introduced R Notebooks. Both combine sections for markdown-formatted text, data analysis commands, output tables and/or figures, and other relevant portions of the results. These digital notebooks closely resemble paper laboratory notebooks (see below).

[Slideshow: pages from a paper laboratory notebook]

As you can see from this example, some of the text and calculations have been handwritten, while others have been calculated outside of the notebook, printed, and then pasted back inside the lab notebook. I commend the authors for their transparency, but this doesn’t seem like the most efficient method of keeping track of your work.

Does Stata have an equivalent option?

Sort of. Below I review my experience using three Stata options that collectively provide similar abilities to the notebooks provided by Python and R.


#1 markdoc

markdoc is a package for creating dynamic documents in Stata. It is very similar to the knitr package in R. The package was written by E. F. Haghish and is available on GitHub. To run markdoc, you’ll need to install Pandoc, which in turn needs a TeX/LaTeX distribution to produce PDFs (I used MiKTeX on my PC and Homebrew on my Mac).

After installing Pandoc and MiKTeX, you’ll also need to download and install wkhtmltopdf.

Installing markdoc

You can install markdoc with the ssc command:

ssc install markdoc

You should see:

checking markdoc consistency and verifying not already installed...
installing into c:\ado\plus\...
installation complete.

You’ll also need to install weaver and statax:

ssc install weaver
ssc install statax


markdoc also has a handy dialog box, which you can open with the following command:

db markdoc

The Output Files

Haghish provides example .do files for getting started with markdoc. I recommend working through each of them, but it shouldn’t be too difficult if you’re used to commenting in your .do files or writing in markdown. The .docx and .pdf output files are clean, organized, and formatted.

markdoc_docx

markdoc_pdf

markdoc is ideal for producing high-quality documents directly from the Stata IDE. Once you understand the markdoc syntax, you can do the majority of your work in the .do file. The only downside I encountered was a somewhat buggy installation (it worked better for me on the Mac). But the package is incredibly well maintained by the author, and I was eventually able to find answers to my questions on his GitHub page.
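For a sense of what that syntax looks like, here is a minimal sketch of a markdoc-style .do file. It assumes the /*** ... ***/ comment-block convention for Markdown text and the export() option as I understand them from the package's examples; the file name example and the analysis itself are hypothetical, and option names may differ across markdoc versions, so check help markdoc before relying on it.

```stata
* example.do: hypothetical file name, used only for this sketch
quietly log using example, replace smcl

/***
Fuel economy and price
======================

This paragraph is written in Markdown. The text inside this special
comment block is meant to appear as prose in the final document,
while the commands below appear together with their output.
***/

sysuse auto, clear
regress price mpg weight

quietly log close

* Assumed call: convert the log into a Word document
markdoc example, export(docx) replace
```

Changing export(docx) to export(pdf) should produce the .pdf version instead, provided the external tools mentioned earlier are installed.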


#2 weave

Germán Rodríguez at Princeton created weave, and I consider it a markdoc-light Stata package.

Installing weave

Installation is easy. Just type the following command into the Stata IDE, then follow the package links that Stata displays to install weave.

net from http://data.princeton.edu/wws509/stata

And there’s an example .do file on his website.

weave essentially uses Markdown/HTML tags for inline images and headers, written directly into your .do files. The results are inserted into the output as plain text, so there is no need to tweak their formatting. When you’re finished with your analyses, you just type the following command directly into the IDE.

weave using sample

The beginning of the .do file contains a command that logs everything to a .usl file. The .usl file is then ‘weaved’ to create an .html output, which will automatically open in your web browser.

The Output Files

You can just print the .html file to a .pdf like you would any web page. Chrome seems to create the best-looking .pdfs.

weave_output_pdf

*TIP: use minimal lines in your .do file to create cleaner-looking output. I’ve created a detailed example here.

I use weave whenever I’m using Stata on my Mac. It’s easy to use, quick to format, and only requires me to have Stata open with a .do file. I’ll use markdoc if I am creating a more professional-looking report, but its bugginess makes it less user-friendly.


#3 ipystata

Jupyter Notebooks (previously IPython Notebooks) can be configured to work with Stata commands. Unfortunately, the package works best with Windows/PC. The setup isn’t too complicated but has a few steps that can trip you up.

Download Anaconda from Continuum

You can download the most recent version of Anaconda from Continuum. This will include the following applications:

anaconda_package

The only application I will be covering in this post is the Anaconda Prompt.

Changing the Jupyter Notebook working directory

The first thing you will want to do is set up your Jupyter Notebook in an appropriate working directory. You can do this by right-clicking on the Anaconda Prompt and running it as an administrator (I’ve moved the application to the taskbar).

anaconda_prompt

When the prompt is displayed, its title bar should say Administrator:

anaconda_prompt_admin

Copy the path of the folder you want the Jupyter Notebook to open in. In the Anaconda Prompt, type cd followed by that path:

cd C:\Users\Google Drive\...\ipystata\notebooks

This will change your working directory. After you’re in the correct working directory, start up Jupyter Notebook by typing the following command in your Anaconda Prompt:

jupyter notebook

This should open a new tab in your default web browser.

jupyter

You can open a new notebook using the tab on the far right of the screen by selecting “New” >> “Python [default]”.

Registering Stata

You will need to open a Command Prompt window as an administrator by right-clicking on the application and selecting “Run as administrator” (*you can search for this application in the Windows search bar by typing “cmd”).

admin_cmd

From here, navigate to your Stata application folder in Program Files (usually on the C:\ drive).

stata_path

Copy the folder path and enter it into the Command Prompt window, preceded by cd:

cd C:\Program Files (x86)\Stata14

admin_path_stata

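From this location, register the Stata application by typing the name of the .exe file followed by a space and /Register (for example, StataSE-64.exe /Register for 64-bit Stata/SE; the file name depends on your Stata flavor).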

register_stata

*No news is good news with this command: if it succeeds, nothing is printed.

Installing ipystata and pandas

Now go back to your Jupyter Notebook and load pandas, an open-source data analysis package for Python that ships with Anaconda. Read about it here.

In the first line of your notebook type:

import pandas as pd

To install ipystata, you’ll need to open a Windows PowerShell window (as an administrator) and enter the following command:

pip install ipystata

install_ipystata

After the package has been installed, enter the following command in your Jupyter Notebook:

import ipystata

To test if it worked, type a simple display command preceded by a line containing the %%stata magic. The output should look like this:
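A test cell along these lines should do; the message text is just a placeholder of mine, and the cell assumes import ipystata has already been run in an earlier cell of the same notebook.

```stata
%%stata
display "Hello from Stata"
```

If Stata was registered correctly, the cell's output area should echo the string back.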

jupyter_ipystata

Using the %%stata Magic Commands

Now that you are up and running, the Jupyter Notebook basically replaces your .do file. You will just need to precede the Stata commands with a line containing the %%stata magic.

Start by loading a native dataset

%%stata
sysuse auto, clear

You can get a quick overview of these data by using either “codebook, compact” or “describe, short”.
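Put together in a single cell, that overview could look like the sketch below. I reload the dataset inside the cell so the example does not depend on what earlier cells have run.

```stata
%%stata
sysuse auto, clear
describe, short
codebook, compact
```

describe, short reports only the number of observations and variables, while codebook, compact prints one summary line per variable.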

stata_describe

Including Graphs in the Output

To include graphs in your output, simply add the -gr option on the same line as your %%stata command.

figure_1

matrix graph

figure_2

scatter plot
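As a rough sketch of how figures like the two above could be produced, each cell puts -gr on the %%stata line and then issues an ordinary graph command. The variables chosen here are my own illustration and are not necessarily the ones in the screenshots.

```stata
%%stata -gr
sysuse auto, clear
graph matrix price mpg weight length
```

Swapping the last line for scatter price mpg in a new cell gives a scatter plot instead of a matrix graph.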

Sharing Your Output Online

In my opinion, the best part of using Jupyter Notebooks is the ability to share your work online. You can publish your notebook using the cloud+arrow icon on the toolbar (register your account first).

sharing

In fact, this notebook and a complete example of the ipystata package are available online. I think this feature makes the Jupyter Notebook the best option for literate programming and reproducible research in Stata. The complicated setup is definitely worth the time investment because you’ll be able to have an ongoing stream of commands, formatted text summaries, and output all in one place.