I’ve spent the last few months attempting to incorporate different literate programming and reproducible research options with the Stata statistical software. This post provides a quick overview of my goal, a brief “how-to” on each option, and my thoughts on realistically introducing them in a workflow.
What is literate programming?
Literate programming is a term coined by Donald E. Knuth in 1984. The general idea is to combine human-readable text with machine-readable commands in the same document, manual, or website. At the beginning of his paper, Knuth writes,
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
Although Knuth is describing the process of writing computer programs, his concept applies to any scenario where a series of commands is used to get a computer to perform a particular set of functions.
The data analysis process includes a successive set of commands used to manipulate the rawest form of the data, transform it into summaries or visualizations, and then model it for predictions and inferences. Each step in the process is based on the preceding step, so tracking the entire method in a systematic way is essential for remaining organized and being able to reproduce your work.
The yardstick I was using to evaluate each Stata literate programming option was how well each method provided a relatively seamless transition between the Stata commands and the human-readable explanations.
A Typical Stata Workflow
Below is an example workflow in Stata.
While working in Stata, commands get put into a .do file, and they are used to create the output in either .gph (image) or .smcl (text) files. The text results will then get written up in a .docx, and the tables and figures will be created using .xlsx. These documents are then sent off to a scientific journal somewhere, where the finished paper will most likely end up as a .pdf or .html on the web.
So to recap, the general process is:
.do >> .gph >> .smcl >> .docx >> .xlsx >> .pdf >> .html
This workflow involves seven different file types to create a single finished manuscript (that will usually contain only text and images).
Why should I care?
If you’ve read this far, the answer should be obvious: the process outlined above is inefficient. The analysis process is split across Stata, MS Word, MS Excel, Adobe/Preview, and whatever web browser you’re using. This makes working in Stata tedious.
Solution: Digital Notebooks
I recently came across a white paper that discusses the benefits of using Notebooks, and I’ve summarized the main points below:
- “…notebooks speed up the process of trying out data models and frameworks and testing hypotheses, enabling everyone to work quickly, iteratively and collaboratively”
- “…can be used to perform an analysis, generate visualizations, and add media and annotations within a single document.”
- “…can be used to annotate the analyses for future reference and keep track of each step in the data discovery process so that the notebook becomes a source of project communication and documentation, as well as a learning resource for other data scientists.”
Although the paper is referring specifically to the Jupyter Notebooks, RStudio recently introduced the R Notebooks. Both methods combine similar sections for markdown formatted text, data analysis commands, output tables and/or figures, and other relevant portions of the results. These digital notebooks closely resemble paper laboratory notebooks (see below).
As you can see from this example, some of the text and calculations have been handwritten, while others have been calculated outside of the notebook, printed, and then pasted back inside the lab notebook. I commend the authors for their transparency, but this doesn’t seem like the most efficient method of keeping track of your work.
Does Stata have an equivalent option?
Sort of. Below I review my experience using three Stata options that collectively provide similar abilities to the notebooks provided by Python and R.
markdoc is a package for creating dynamic documents in Stata. It is very similar to the knitr package in R. The package was written by E. F. Haghish and is available on GitHub. To run markdoc, you’ll need to install Pandoc (which requires a typesetting system for TeX/LaTeX; I used MiKTeX on my PC and Homebrew on my Mac).
After installing Pandoc and MiKTeX, you’ll also need to download and install wkhtmltopdf.
You can install markdoc with the following command:
ssc install markdoc
You should see:
checking markdoc consistency and verifying not already installed...
installing into c:\ado\plus\...
installation complete.
You’ll also need to install weaver and statax:
ssc install weaver
ssc install statax
markdoc also has a handy dialog box you can open with the following command:
db markdoc
The Output Files
Haghish provides example .do files for getting started with markdoc. I recommend working through each of them, but it shouldn’t be too difficult if you’re used to commenting in your .do files or writing in markdown. The .docx and .pdf output files are clean, organized, and formatted.
markdoc is ideal for producing high-quality documents directly from the Stata IDE. After you understand the markdoc syntax, you will be able to perform the majority of your work in the .do file. The only downside I encountered in markdoc was a somewhat buggy installation; it worked better for me on the Mac. But the package is incredibly well maintained by the author, and I was eventually able to find answers to my questions on his GitHub page.
Germán Rodríguez at Princeton created weave, and I consider it a markdoc-light Stata package.
Installation is easy. Just type the following command into the Stata IDE.
net from http://data.princeton.edu/wws509/stata
And there’s an example .do file on his website.
weave uses markdown/HTML tags, written directly into your .do file, for inline images and headers. The results are inserted into the output as plain text, so there is no need to tweak their formatting. When you’re finished with your analyses, you just type the following command directly into the IDE.
weave using sample
The beginning of the .do file contains a command that logs everything to a .usl file. The .usl file is then ‘weaved’ to create an .html output, which will automatically open in your web browser.
The Output Files
You can just print the .html file to a .pdf like you would any web page. Chrome seems to create the best-looking .pdfs.
*TIP: use as few lines as possible in your .do file to create cleaner-looking output. I’ve created a detailed example here.
I use weave whenever I’m using Stata on my Mac. It’s easy to use, quick to format, and only requires me to have Stata open with a .do file. I’ll use markdoc if I am creating a more professional-looking report, but the bugginess of markdoc doesn’t make it very user-friendly.
The Jupyter Notebooks (previously IPython Notebooks) can be configured to work with Stata commands. Unfortunately, the package works best with Windows/PC. The setup isn’t too complicated but has a few steps that can trip you up.
Download Anaconda from Continuum
You can download the most recent version of Anaconda from Continuum. The installation includes several applications.
The only application I will be covering in this post is the Anaconda Prompt.
Changing the Jupyter Notebook working directory
The first thing you will want to do is set up your Jupyter Notebook in an appropriate working directory. You can do this by right-clicking on the Anaconda Prompt and running it as an administrator (I’ve moved the application to the taskbar).
When the prompt is displayed (it should say Administrator), copy and paste the path of the folder you want the Jupyter Notebook to open in. In the Anaconda Prompt, type cd followed by that path:
cd C:\Users\Google Drive\...\ipystata\notebooks
This will change your working directory. After you’re in the correct working directory, start up the Jupyter Notebooks by typing the following command in your Anaconda Prompt:
jupyter notebook
This should open a new tab in your default web browser.
You can open a new notebook using the tab on the far right of the screen by selecting, “New” >> “Python [default]”
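A quick way to confirm that the notebook opened in the folder you intended is to check the working directory from the first cell. The following is a minimal sketch using only the Python standard library; the helper name notebook_dir_matches is hypothetical and not part of Jupyter.

```python
from pathlib import Path


def notebook_dir_matches(expected):
    """Return True if the current working directory is the `expected` folder."""
    # resolve() normalizes both paths so "." and an absolute path compare equal.
    return Path.cwd().resolve() == Path(expected).expanduser().resolve()


# Show where the notebook (or script) is currently running.
print(Path.cwd())
```

Calling notebook_dir_matches with the path you passed to cd should return True; if it returns False, the notebook was started from a different folder.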
You will need to open a Command Prompt window as an administrator by right-clicking on the application and selecting “Run as administrator” (*you can search for this application in the Windows search bar by typing “cmd”).
From here, you need to navigate to your Stata application in your Program Files (usually on the C:\ drive).
Copy and paste the folder path and enter it into the Command Prompt window, preceded by cd:
cd C:\Program Files (x86)\Stata14
From this location, register the Stata application by typing the name of the .exe file followed by a space and /Register.
*No news is good news on this command.
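If you are not sure which .exe name to type in the step above, a small sketch like the one below can list the candidates in the Stata folder. It only prints the text of the command and does not run anything. The helper name list_executables is hypothetical, and the folder is the one from the example above, so adjust it to match your own installation.

```python
from pathlib import Path


def list_executables(folder):
    """Return the sorted names of .exe files found directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        # Folder does not exist (or is not a folder): nothing to list.
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".exe"
    )


# Assumed install location, taken from the cd example above.
stata_dir = r"C:\Program Files (x86)\Stata14"
for name in list_executables(stata_dir):
    # Print the command to type in the Command Prompt; nothing is executed here.
    print(f"{name} /Register")
```

If nothing is printed, the folder path is probably wrong for your machine.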
Installing ipystata and pandas
Now go back to your Jupyter Notebook and import pandas. Pandas is an open-source data analysis package for Python. Read about it here.
In the first line of your notebook type:
import pandas as pd
To install ipystata, you’ll need to open a Windows PowerShell window (as an administrator) and enter the following command:
pip install ipystata
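Because pip and the notebook can point at different Python environments, it can help to confirm from a notebook cell that the packages are actually visible there. The sketch below uses only the standard library; missing_packages is a hypothetical helper name, not something provided by ipystata.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level package `names` that cannot be found."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["pandas", "ipystata"])
if missing:
    print("Not found in this environment:", ", ".join(missing))
else:
    print("pandas and ipystata are both available.")
```

If ipystata is reported as not found, the pip command most likely ran against a different Python installation than the one behind the notebook.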
After the package has been installed, enter the following command in your Jupyter Notebook:
import ipystata
To test if it worked, type a simple display command preceded by %%stata. The output should look like this:
Using the %%stata Magic Commands
Now that you are up and running, the Jupyter Notebook basically replaces your .do file. You will just need to precede the Stata commands with a line containing the %%stata magic.
Start by loading a native dataset
%%stata
sysuse auto, clear
You can get a quick overview of these data by using
codebook, compact or
Including Graphs in the Output
To include graphs in your output, simply include the -gr option on the same line as your %%stata magic.
Sharing Your Output Online
In my opinion, the best part of using Jupyter Notebooks is the ability to share your work online. You can publish your notebook using the cloud+arrow icon on the toolbar (register your account first).
In fact, this notebook and a complete example of the ipystata package are available online. I think this feature makes the Jupyter Notebook the best option for literate programming and reproducible research in Stata. The complicated setup is definitely worth the time investment because you’ll be able to have an ongoing stream of commands, formatted text summaries, and output all in one place.