December 06, 2024
I failed to reproduce my own results from a decade ago
My former PhD advisor emailed me recently regarding a grad student interested in using some experimental data we had collected a decade ago to validate some simulations. The repo from the experiment was still up on GitHub, so I assumed that would be easy, but I was wrong.
I had left instructions in the README for getting started and regenerating the figures:
So I gave it a try. These days I have Mambaforge installed on my machine instead of Anaconda, but the ecosystems are largely compatible with each other, so I figured there was a decent chance this would work. The `pip install` commands worked fine, so I tried running `python plot.py`, but that actually doesn't do anything without some additional options (bad documentation, me-from-the-past!) After running the correct command, this was the result: it failed because the interface to the `scipy.stats.t.interval` function had changed since the version I was using back then. This isn't necessarily surprising after 10+ years, but it puts us at a crossroads. We would either need to adapt the code for newer dependencies or attempt to reproduce the environment in which it ran before.

But let's take a step back and ask why we would want to do this at all. I can think of two reasons:
- Reproducing the results can help ensure there are no mistakes, or that the outputs (figures) genuinely reflect the inputs (data) and processes (code). There is a chance that I updated the code at some point but never reran `plot.py`, making the figures out-of-date.
- We want to produce a slight variant of one or more figures, adding the results from a simulation for the purposes of validation.
These can be categorized as reproducibility and reusability, respectively, and the grad student’s request was clearly concerned with the latter. However, I wanted to explore both.
Attempting to reproduce the old environment
Before we start, let’s pick a figure from the original paper to focus on reproducing. For this I chose one that’s relatively complex. It shows out-of-plane mean velocity in a turbine wake as contours and in-plane mean velocity as vector arrows, and includes an outline of the turbine’s projected area. It uses LaTeX for the axis labels, and is set to match the true aspect ratio of the measurement plane. Here’s what it looked like published:
I left myself some incomplete instructions in that README for reproducing the old environment: install a version of Anaconda that uses Python 3.5 and `pip install` two additional packages. Again, bad documentation, me-from-the-past! Given that I already have `conda` installed, I figured I could generate a new environment with Python 3.5 and take it from there.

No luck, however. Python 3.5 is not available in the `conda-forge` or `main` channels any longer, at least not for my MacBook's ARM processor.

The next logical option was Docker. There are some old Anaconda Docker images up on Docker Hub, so I thought maybe I could use one of those. The versions don't correspond to Python versions, however, so I had to search through the Anaconda release notes to find which version I wanted. Anaconda 2.4.0, released on November 2, 2015, was the first version to come with Python 3.5, and that version is available on Docker Hub, so I tried to spin up an interactive container:
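Something along these lines, that is (a sketch: the image repository matches the `continuumio/anaconda3` name used later in this post, and the tag is assumed to track the Anaconda version number):

```bash
# Start an interactive shell in the old Anaconda image (tag assumed).
docker run -it continuumio/anaconda3:2.4.0 /bin/bash
```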
To my surprise, this failed. Apparently the Docker image format changed at some point, and images in the old format are no longer supported. So I searched for the newest Anaconda version that shipped Python 3.5, which was Anaconda 4.2.0, and luckily that image was in the correct format.
At this point I needed to create a new image derived from that one, with the additional dependencies installed with `pip`. So I created a new Docker environment for the project and added a build stage to a fresh DVC pipeline with Calkit (a tool I've been working on inspired by situations like these):

```bash
calkit new docker-env \
    --name main \
    --image rvat-re-dep \
    --from continuumio/anaconda3:4.2.0 \
    --description "A custom Python 3.5 environment" \
    --stage build-docker
```
In this case, the automatically-generated `Dockerfile` didn't yet have everything we needed, but it was a start.

Simply adding the `pip install` instructions from the README resulted in SSL verification and dependency resolution errors. After some Googling and trial-and-error, I was able to get things installed in the image by adding this command:

```dockerfile
RUN pip install \
    --trusted-host pypi.org \
    --trusted-host pypi.python.org \
    --trusted-host files.pythonhosted.org \
    --no-cache-dir \
    progressbar33 \
    seaborn==0.7.0 \
    pxl==0.0.9 \
    --no-deps
```
Note that I had to pin the `seaborn` version, since `pxl` would install a newer version of `seaborn`, which would install a newer version of `pandas`, which would fail, hence the `--no-deps` option.

I also had to set the default Matplotlib backend with:

```dockerfile
ENV MPLBACKEND=Agg
```

since PyQt4 was apparently missing and Matplotlib was trying to import it by default.
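Putting those pieces together, the `Dockerfile` ends up looking roughly like the sketch below. It's reconstructed from the snippets above rather than copied from the repo, so treat the exact layout as an assumption:

```dockerfile
# Sketch of the reproduction image, assembled from the pieces described above.
FROM continuumio/anaconda3:4.2.0

# Extra packages from the old README, with seaborn pinned and dependency
# resolution disabled to avoid pulling in an incompatible pandas.
RUN pip install \
    --trusted-host pypi.org \
    --trusted-host pypi.python.org \
    --trusted-host files.pythonhosted.org \
    --no-cache-dir \
    progressbar33 \
    seaborn==0.7.0 \
    pxl==0.0.9 \
    --no-deps

# Use a non-interactive Matplotlib backend, since PyQt4 isn't available.
ENV MPLBACKEND=Agg
```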
After running the pipeline to build the Docker image, I ran the plotting script in that environment with:
```bash
calkit runenv -n main -- python plot.py all_meancontquiv --save
```
Let’s take a look at the newly-created figure and compare with the original published version:
If you look closely you’ll notice the font for the tick labels is slightly different from that in the original. We could go further and try to build the necessary font into the Docker image, but I’m going to call this a win for now. It took some hunting and finagling, but we reproduced the figure.
But what about reusability?
Looking back at the project repo's README, we can see I said absolutely nothing about how one could reuse the materials in a different project. To be fair, at the time the main purpose of open sourcing these materials was to open source the materials. Even that is still somewhat uncommon for research projects, and I do think it's a good goal in and of itself. However, if we want to ensure our work produces the largest possible impact, we should do a little “product management” and spend some time thinking about how others can derive value from any of the materials, not just the paper we wrote.
I actually used this dataset in a later paper validating some CFD simulations, the repo for which is also on GitHub. Looking in there, the value that this project’s repo provided was:
- CSV files containing aggregated or reduced data.
- A Python package that made it possible to:
  - recreate the Matplotlib figures so we didn't need to copy/paste them,
  - instantiate a `WakeMap` Python class that computed various gradients to aid in computing wake recovery quantities to compare against simulation,
  - inspect the code that generated complex figures so it could be adapted for plotting the CFD results.
Items 2.1 and 2.2 were actually not that easy to do, since the Python package was not installable. In the follow-on project I had to add the folder to `sys.path` to import the package, and since it used relative paths, I had to make the new code change directories in order to load the correct data. These are both not too difficult to fix, though. First, we can make the Python package installable by adding a `pyproject.toml` file. Then we can switch to using absolute paths so the data loading and plotting functions can be called from outside.

Updating the code for the newer dependencies was not too difficult based on the tracebacks. After getting things running in a more modern Python 3.12 environment, I exported a "lock" file describing all versions used the last time it successfully ran. This is much more descriptive than "Anaconda with Python 3.5."
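For reference, a minimal `pyproject.toml` along these lines is enough to make a plain Python package `pip`-installable. This is a sketch: the project name, package directory, and dependency list are placeholders rather than the actual contents of the repo.

```toml
# Minimal packaging sketch; names and dependencies are placeholders.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "rvat-re-dep-processing"  # placeholder project name
version = "0.1.0"
dependencies = ["numpy", "pandas", "matplotlib", "scipy"]  # assumed, not exhaustive

[tool.setuptools]
packages = ["mypackage"]  # placeholder for the actual package directory
```

As for the dependency updates themselves, they were small interface changes of the kind that caused the original failure. For example, newer SciPy releases renamed the first argument of `scipy.stats.t.interval` from `alpha` to `confidence`, so the fix looks roughly like this (a self-contained illustration, not the actual code from `plot.py`):

```python
import numpy as np
from scipy import stats

# Illustrative only: a 95% confidence interval for a sample mean.
sample = np.random.default_rng(0).normal(size=100)
ci_low, ci_high = stats.t.interval(
    confidence=0.95,        # older SciPy versions called this argument `alpha`
    df=sample.size - 1,
    loc=sample.mean(),
    scale=stats.sem(sample),
)
print(ci_low, ci_high)
```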
Finally, I wanted to add some documentation explaining how to reuse this code and data. I ended up adding two sections to the README: one for reproducing the results and one for reusing the results. I also created an example project reusing this dataset by including it as a Git submodule, which you can also view up on GitHub. Doing this is a good way to put yourself in the users’ shoes and can reveal stumbling blocks. It also gives users a template to copy if they would find that helpful.
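For anyone who hasn't used Git submodules before, pulling a dataset repo into another project boils down to a couple of commands like the following (the URL and destination path here are placeholders):

```bash
# Add the dataset repo as a submodule of the new project (placeholder URL/path).
git submodule add https://github.com/example-user/example-dataset data/experiment
git commit -m "Add experimental dataset as a submodule"
```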
It’s important to note here that it’s impossible to predict how others might derive value from these materials, and that’s okay. Take some educated guesses, put it out there, and see what happens. Maybe you’ll want to iterate later, like I’ve done here. That’s much better than not sharing at all.
Conclusions
There are a few takeaways from this exercise. First off, reproducibility is hard, even with all of the code and data available. Software and hardware continue to evolve, and just because the code “runs on my machine” today, doesn’t mean it will a few years (or decades) down the road. On the other hand, maybe reproducibility should have a practical expiration date anyway, since it’s mostly useful around the time of publication to help avoid mistakes and potentially speed up peer review.
Another important but obvious point is that documentation is crucial. Simply providing the code and data without documentation is better than nothing, and many papers don’t even go that far, but we really should go further. Every project should fully describe the steps to reproduce the outputs, and there should be as few steps as possible. This can be beneficial while the project is in progress as well, especially if we have collaborators.
Lastly, reproducibility is not the same thing as reusability. Researchers should do some product management and attempt to maximize the value they can deliver. The “product” of a given research project could be a simple formula for hand calculations, but these days the valuable products will likely include datasets and software.
Publishing a paper with an “in-house” code may be good for your career (for now anyway,) but if your discoveries are useless without a computer program to calculate predictions, the effort others will need to expend to get value from your work will be unnecessarily high, and therefore some potential impact will be unrealized. “It’s not documented well enough” is not a valid excuse either. Like with reproducibility, even if we haven’t molded our research products into the most user-friendly form possible, we should still share all of the relevant materials, so long as it’s not harmful to someone else to do so.
November 14, 2024
Microsoft Word and Excel have no place in a reproducible research workflow... right?
Everyone knows that when you want to get serious about reproducibility you need to stop using Microsoft Word and Excel and become a computer hacker, right? There is some truth to that: working with simpler, open source, less interactive tools is typically better for producing permanent artifacts and enabling others to reproduce them, but it's not mandatory.
On the other hand, it’s starting to become more and more common, and will hopefully someday be mandatory to share all code and data when submitting a manuscript to a journal, so that others can reproduce the results. This is a good thing for science overall, but also good for individual researchers, even though it may seem like more work.
Besides the fact that you’ll probably get more citations, which should not necessarily be a goal in and of itself given recent controversies around citation coercion, working reproducibly will keep you more organized and focused, and will allow you to produce higher quality work more quickly. I would also be willing to bet that if reviewers can reproduce your work, your paper will get through the review process faster, shortening the time-to-impact.
Here I want to show that it's possible to get started working reproducibly without becoming a de facto software engineer, and that it's okay to use whatever tools you prefer so long as you follow the relevant principles. Inspired by the article Ten Simple Rules for Reproducible Computational Research, we're going to focus on just two:
- Keep all files in version control. This means a real version control system. Adding your initials and a number to the filename is kind of like a version control system, but it's messy and error-prone. It should be easy to tell if a file has been modified since the last time it was saved. When you make a change you should have to describe that change, and that record should exist in the log forever. When all files are in a true version-controlled repository, it's like using "track changes" for an entire folder, and it doesn't require any discipline to avoid corrupting the history, e.g., by changing a version after it has had its filename suffix added.
- Generate permanent artifacts with a pipeline. Instead of a bunch of manual steps, pointing and clicking, we should be able to repeatedly perform the same single action over and over and get the same output. A good pipeline will allow us to know if our output artifacts, e.g., figures, derived datasets, papers, have become out-of-date and no longer reflect their input data or processing procedures, after which we can run the pipeline and get them up-to-date. It also means we only need to focus on building that pipeline and running it. We don’t need to memorize what steps to perform in what order—just run the pipeline.
We're going to create our workflow with Calkit, so if you want to follow along, make sure it's installed per these instructions (you may want to add `--upgrade` to the `pip install` command if you have an older version installed; v0.3.3 was used here.) We'll also need to ensure we have generated and stored a token in our local config.

In order to follow rule number 1, we are going to treat our project's repository, or "repo," as the one place to store everything. Any file that has anything to do with our work on this project goes in the repo. This will save us time later because there will be no question about where to look for stuff, because the answer is: in the repo.
This repo will use Git for text-based files and DVC for binary files, e.g., our Excel spreadsheets and Word documents. Don’t worry though, we’re not actually going to interact with Git and DVC directly. I know this is a major sticking point for some people, and I totally sympathize. Learning Git is a daunting task. However, here all the Git/DVC stuff will be done for us behind the scenes.
We can start off by creating a Git (and GitHub) repo for our project up on calkit.io.
Next, we'll do the only command line thing in this whole process and spin up a local Calkit server. This will allow us to connect to the web app and enable us to modify the project on our local machine. To start the server, open up a terminal or Miniforge command prompt and run:
```bash
calkit local-server
```
If we navigate to our project page on calkit.io, then go to the local machine page, we see that the repo has never been cloned to our computer, so let’s click the “Clone” button.
By default, it will be cloned somewhere like `C:/Users/your-name/calkit/the-project-name`, which you can see in the status. We can also see that our repo is "clean," i.e., there are no untracked or modified files in there, and that our local copy is synced with both the Git and DVC remotes, meaning everything is backed up and we have the latest version. We'll strive to keep it that way.

Now that we have our repository cloned locally, let's create our dataset. We are going to do this by adding some rows to an Excel spreadsheet and saving it in our project repo as `data.xlsx`.

Back on the Calkit local machine page, if we refresh the status we see that the `data.xlsx` spreadsheet is showing up as an untracked file in the project repo. Let's add it to the repo by clicking the "Add" button.

After adding and committing, Calkit is going to automatically push to the remotes so everything stays backed up, and again we'll see that our repo is clean and in sync.
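For the curious, what the "Add" button and automatic sync do is conceptually equivalent to a handful of Git and DVC commands. This is a general sketch of that workflow, not necessarily the exact commands Calkit runs internally:

```bash
# Conceptual CLI equivalent of adding and backing up a binary file.
dvc add data.xlsx                 # track the spreadsheet with DVC
git add data.xlsx.dvc .gitignore  # track the small DVC pointer file with Git
git commit -m "Add dataset"
dvc push                          # back the data up to the DVC remote
git push                          # back the repo up to GitHub
```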
Now let's use Excel to create a figure. If we create a chart inside the spreadsheet and save it, we see up on the local machine page that we have a changed file. Let's commit that change by clicking the "Commit" button, using a commit message like "Add chart to spreadsheet".
At this point our data is in version control so we’ll know if it ever changes. Now it’s time for rule number 2: Generate important artifacts with a pipeline. At the moment our pipeline is empty, so let’s create a stage that extracts our chart from Excel into an image and denotes it as a figure in the project. On the web interface we see there’s a button to create a new stage, and in there are some optional stage templates. If we select “Figure from Excel”, there will be a few fields to fill out:
- The name of the stage. We'll use `extract-chart`, but you can call it whatever you like.
- The Excel file path relative to the repo (`data.xlsx`).
- The desired output file path for our image. We'll use `figures/chart.png`, but again, you can choose whatever makes sense to you.
- The title and description of our figure.
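Under the hood, this becomes a stage in the project's DVC pipeline (`dvc.yaml`), with the spreadsheet as a dependency and the image as an output. Roughly, it looks like the sketch below; the actual command is generated by the template, so the `cmd` here is just a placeholder:

```yaml
# Sketch of the generated pipeline stage; cmd is a placeholder.
stages:
  extract-chart:
    cmd: python scripts/excel_chart_to_png.py data.xlsx figures/chart.png
    deps:
      - data.xlsx
    outs:
      - figures/chart.png
```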
After saving the stage and refreshing the status we’ll see that the pipeline is out-of-date, which makes sense. We added a stage but haven’t yet run the pipeline. So let’s do that by clicking the “Run” button.
After the pipeline has been run we can see there are some uncommitted changes in the repo, so let’s commit them with a message that makes sense, e.g., “Extract figure from data.xlsx”. We should again be in our happy state, with a clean repo synced with the cloud, and a pipeline that’s up-to-date.
To wrap things up, we're going to use this figure in a paper, written using Microsoft Word. So, find a journal with a Microsoft Word (`.docx`) submission template, download that, and save it in the repo. In this case, I saved the IOP Journal of Physics template generically as `paper.docx`, since in the context of this project, it doesn't need a special name, unless of course `paper.docx` would somehow be ambiguous. We can then follow the same process we followed with `data.xlsx` to add and commit the untracked `paper.docx` file to the repo.

Now let's open up the Word document and insert our PNG image exported from the pipeline. Be sure to use the "link to file" or "insert and link" option, so Word doesn't duplicate the image data inside the document. This will allow us to update the image externally and not need to reimport into Word.
Again, when we refresh we'll see that `paper.docx` has uncommitted changes, so let's commit them with a message like "Add figure to paper".

Now let's complete our pipeline by adding a stage to convert our Word document to PDF, so that can be the main artifact we share with the outside world. There's a stage template for that on the website, so follow the same stage generation steps we used to extract the figure, but this time select the "Word document to PDF" template, fill out the Word document file path and the output PDF path, add `figures/chart.png` to the list of input dependencies, and select "publication" as our artifact type. Fill in the title and description of the publication as well. Adding `figures/chart.png` to the input dependencies will cause our PDF generation stage to be rerun if that or `paper.docx` changes. Otherwise, it will not run, keeping pipeline runs as fast as possible.
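In pipeline terms, the new stage sits in the same `dvc.yaml` alongside `extract-chart`, with both the Word document and the chart image as dependencies. Again a sketch, with a placeholder `cmd`:

```yaml
# Added under the same `stages:` key as extract-chart; cmd is a placeholder.
  build-paper-pdf:
    cmd: python scripts/docx_to_pdf.py paper.docx paper.pdf
    deps:
      - paper.docx
      - figures/chart.png  # rerun the PDF export whenever the chart changes
    outs:
      - paper.pdf
```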
paper.pdf
we can see that our figure is there just like we expected.But hold on a second you might say. Why did we go through all this trouble just to create a PDF with an Excel chart in it? This would have been only a few steps to do manually! That would be a valid point if this were a one-off project and nothing about it would ever change. However, for a research project, there will almost certainly be multiple iterations (see again the PhD Comics cartoon above,) and if we need to do each manual step each iteration, it’s going to get costly time-wise, and we could potentially forget which steps need to be taken based on what files were changed. We may end up submitting our paper with a chart that doesn’t reflect the most up-to-date data, which would mean the chart in the paper could not be reproduced by a reader. Imagine if you had multiple datasets, multiple steps of complex data analysis, a dozen figures, and some slides to go along with your paper. Keeping track of all that would consume valuable mental energy that could be better spent on interpreting and communicating the results!
To close the loop and show the value of using version control and a pipeline, let's go and add a few rows to our dataset, which will in turn change our chart in Excel. If we save the file and look at the status, we can see that this file is different, and that our pipeline is again out-of-date, meaning that our primary output (the PDF of the paper) no longer reflects our input data.
Now with one click we can rerun the pipeline, which is going to update both our figure PNG file and the paper PDF in one shot. We can then create a commit message explaining that we added to the dataset. These messages can be audited later to see when and why something changed, which can come in handy if all of a sudden things aren’t looking right. Having the files in version control also means we can go check out an old version if we made a mistake.
Well, we did it. We created a reproducible workflow using Microsoft Word and Excel, and we didn’t need to learn Git or DVC or become a command line wizard. Now we can iterate on our data, analysis, figures, and writing, and all we need to do to get them all up-to-date and backed up is to run the pipeline and commit any changes. Now we can share our project and others can reproduce the outputs (so long as they have a Microsoft Office license, but that’s a topic for another post.) Everything is still in Git and DVC, so our more command line-oriented colleagues can work in that way if they like using the same project repo. To achieve this, all we had to do was follow the two most important rules:
- All files go in version control.
- Artifacts need to be generated by the pipeline.
If you’d like you can browse through this project up on calkit.io. Also, feel free to shoot me an email if you’d like help setting up something similar for your project.