November 14, 2024
Microsoft Word and Excel have no place in a reproducible research workflow... right?
Everyone knows that when you want to get serious about reproducibility you need to stop using Microsoft Word and Excel and become a computer hacker, right? There is some truth to that: working with simpler, open-source, less interactive tools is typically better for producing permanent artifacts and enabling others to reproduce them. But it's not mandatory.
On the other hand, sharing all code and data when submitting a manuscript to a journal, so that others can reproduce the results, is becoming more and more common, and will hopefully someday be mandatory. This is a good thing for science overall, but it's also good for individual researchers, even though it may seem like more work.
Besides the fact that you'll probably get more citations (which should not necessarily be a goal in and of itself, given recent controversies around citation coercion), working reproducibly will keep you more organized and focused, and will allow you to produce higher-quality work more quickly. I would also be willing to bet that if reviewers can reproduce your work, your paper will get through the review process faster, shortening the time-to-impact.
Here I want to show that it's possible to get started working reproducibly without becoming a de facto software engineer, and that it's okay to use whatever tools you prefer so long as you follow the relevant principles. Inspired by the article Ten Simple Rules for Computational Research, we're going to focus on just two:
- Keep all files in version control. This means a real version control system. Adding your initials and a number to the filename is kind of like a version control system, but it's messy and error-prone. It should be easy to tell if a file has been modified since the last time it was saved. When you make a change you should have to describe that change, and that record should exist in the log forever. When all files are in a true version-controlled repository, it's like using "track changes" for an entire folder, and it doesn't require any discipline to avoid corrupting the history, e.g., by changing a version after it has had its filename suffix added.
- Generate permanent artifacts with a pipeline. Instead of a bunch of manual steps, pointing and clicking, we should be able to perform the same single action repeatedly and get the same output. A good pipeline will tell us if our output artifacts, e.g., figures, derived datasets, papers, have become out-of-date and no longer reflect their input data or processing procedures, after which we can run the pipeline to bring them up-to-date. It also means we only need to focus on building and running that pipeline. We don't need to memorize what steps to perform in what order; we just run the pipeline (see the sketch after this list for the core idea).
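To make rule 2 concrete: a pipeline is essentially a recipe that knows when its outputs are stale. Tools like DVC (which Calkit uses under the hood) implement this rigorously with content hashes, but the core idea fits in a few lines of Python. The sketch below is purely illustrative, with hypothetical file names:

import os

def is_stale(output, inputs):
    """Report whether an output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(inp) > out_mtime for inp in inputs)

# Hypothetical stage: a chart image generated from a spreadsheet
if is_stale("figures/chart.png", ["data.xlsx"]):
    print("Out-of-date; rerun the pipeline")

A real pipeline runner also records how each output is produced, so bringing everything up-to-date is a single action rather than a judgment call.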
We’re going to create our workflow with Calkit, so if you want to follow along, make sure it’s installed per these instructions (you may want to add
--upgrade
to thepip install
command if you have an older version installed; v0.3.3 was used here.) We’ll also need to ensure we have generated and stored a token in our local config.In order to follow rule number 1, we are going to treat our project’s repository, or “repo,” as the one place to store everything. Any file that has anything to do with our work on this project goes in the repo. This will save us time later because there will be no question about where to look for stuff, because the answer is: in the repo.
This repo will use Git for text-based files and DVC for binary files, e.g., our Excel spreadsheets and Word documents. Don’t worry though, we’re not actually going to interact with Git and DVC directly. I know this is a major sticking point for some people, and I totally sympathize. Learning Git is a daunting task. However, here all the Git/DVC stuff will be done for us behind the scenes.
We can start off by creating a Git (and GitHub) repo for our project up on calkit.io.
Next, we’ll do the only command line thing in this whole process and spin up a local Calkit server. This will allow us connect to the web app and enable us to modify the project on our local machine. To start the server, open up a terminal or Miniforge command prompt and run:
calkit local-server
If we navigate to our project page on calkit.io, then go to the local machine page, we see that the repo has never been cloned to our computer, so let’s click the “Clone” button.
By default, it will be cloned somewhere like C:/Users/your-name/calkit/the-project-name, which you can see in the status. We can also see that our repo is "clean," i.e., there are no untracked or modified files in there, and that our local copy is synced with both the Git and DVC remotes, meaning everything is backed up and we have the latest version. We'll strive to keep it that way.

Now that we have our repository cloned locally, let's create our dataset. We are going to do this by adding some rows to an Excel spreadsheet and saving it in our project repo as data.xlsx.

Back on the Calkit local machine page, if we refresh the status, we see that the data.xlsx spreadsheet is showing up as an untracked file in the project repo. Let's add it to the repo by clicking the "Add" button. After adding and committing, Calkit will automatically push to the remotes so everything stays backed up, and again we'll see that our repo is clean and in sync.
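As an aside, if you'd rather generate the sample data programmatically than type it into Excel, a few lines of Python with openpyxl will produce an equivalent file (the column names and values here are made up):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["x", "y"])  # hypothetical column headers
for x in range(10):
    ws.append([x, x**2])
wb.save("data.xlsx")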
Now let’s use Excel to create a figure. If we go in and create a chart inside and save the spreadsheet, we see up on the local machine page that we have a changed file. Let’s commit that change by clicking the “Commit” button and let’s use a commit message like “Add chart to spreadsheet”.
At this point our data is in version control, so we'll know if it ever changes. Now it's time for rule number 2: generate important artifacts with a pipeline. At the moment our pipeline is empty, so let's create a stage that extracts our chart from Excel into an image and denotes it as a figure in the project. On the web interface there's a button to create a new stage, which offers some optional stage templates. If we select "Figure from Excel", there will be a few fields to fill out (a sketch of what such a stage boils down to follows the list):
- The name of the stage. We'll use extract-chart, but you can call it whatever you like.
- The Excel file path relative to the repo (data.xlsx).
- The desired output file path for our image. We'll use figures/chart.png, but again, you can choose whatever makes sense to you.
- The title and description of our figure.
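Under the hood, a stage like this boils down to a single command that reads data.xlsx and writes figures/chart.png. As a rough stand-in (not necessarily how Calkit's template extracts the native Excel chart), you could re-plot the data with pandas and matplotlib:

import os

import pandas as pd

os.makedirs("figures", exist_ok=True)
df = pd.read_excel("data.xlsx")  # uses openpyxl to read .xlsx files
# Assume a simple two-column sheet: first column x, second column y
ax = df.plot(x=df.columns[0], y=df.columns[1])
ax.figure.savefig("figures/chart.png", dpi=200)

The important part isn't the specific command; it's that the figure is produced by a repeatable step the pipeline can rerun whenever data.xlsx changes.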
After saving the stage and refreshing the status we’ll see that the pipeline is out-of-date, which makes sense. We added a stage but haven’t yet run the pipeline. So let’s do that by clicking the “Run” button.
After the pipeline has been run we can see there are some uncommitted changes in the repo, so let’s commit them with a message that makes sense, e.g., “Extract figure from data.xlsx”. We should again be in our happy state, with a clean repo synced with the cloud, and a pipeline that’s up-to-date.
To wrap things up, we're going to use this figure in a paper written in Microsoft Word. So, find a journal with a Microsoft Word (.docx) submission template, download it, and save it in the repo. In this case, I saved the IOP Journal of Physics template generically as paper.docx, since in the context of this project it doesn't need a special name, unless of course paper.docx would somehow be ambiguous. We can then follow the same process we followed with data.xlsx to add and commit the untracked paper.docx file to the repo.

Now let's open up the Word document and insert our PNG image exported from the pipeline. Be sure to use the "link to file" or "insert and link" option, so Word doesn't duplicate the image data inside the document. This will allow us to update the image externally without needing to re-import it into Word.
Again, when we refresh we'll see that paper.docx has uncommitted changes, so let's commit them with a message like "Add figure to paper".

Now let's complete our pipeline by adding a stage to convert our Word document to PDF, so that can be the main artifact we share with the outside world. There's a stage template for that on the website, so follow the same stage generation steps we used to extract the figure, but this time select the "Word document to PDF" template, filling out the Word document file path and the output PDF path, adding figures/chart.png to the list of input dependencies, and selecting "publication" as the artifact type. Fill in the title and description of the publication as well. Adding figures/chart.png to the input dependencies will cause our PDF generation stage to be rerun if either it or paper.docx changes. Otherwise, the stage will be skipped, so running the pipeline stays as fast as possible.
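If you're curious what the "Word document to PDF" template amounts to, one possible implementation (not necessarily what Calkit actually runs) is a small script that calls LibreOffice's headless converter, assuming soffice is on your PATH:

import subprocess

# Convert paper.docx to paper.pdf in the current directory
subprocess.run(
    ["soffice", "--headless", "--convert-to", "pdf", "--outdir", ".", "paper.docx"],
    check=True,
)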
Again the pipeline will show that it's out-of-date, so let's run and commit again, using a message like "Export paper to PDF". If we open up paper.pdf, we can see that our figure is there, just like we expected.

But hold on a second, you might say: why did we go through all this trouble just to create a PDF with an Excel chart in it? This would have taken only a few steps to do manually! That would be a valid point if this were a one-off project and nothing about it would ever change. However, for a research project there will almost certainly be multiple iterations, and if we need to repeat each manual step every iteration, it's going to get costly time-wise, and we could forget which steps need to be taken based on which files were changed. We may end up submitting our paper with a chart that doesn't reflect the most up-to-date data, which would mean the chart in the paper could not be reproduced by a reader. Imagine if you had multiple datasets, multiple steps of complex data analysis, a dozen figures, and some slides to go along with your paper. Keeping track of all that would consume valuable mental energy that could be better spent on interpreting and communicating the results!
To close the loop and show the value of using version control and a pipeline, let's go add a few rows to our dataset, which will in turn change our chart in Excel. If we save the file and look at the status, we can see that the file has changed and that our pipeline is again out-of-date, meaning that our primary output (the PDF of the paper) no longer reflects our input data.
Now with one click we can rerun the pipeline, which is going to update both our figure PNG file and the paper PDF in one shot. We can then create a commit message explaining that we added to the dataset. These messages can be audited later to see when and why something changed, which can come in handy if all of a sudden things aren’t looking right. Having the files in version control also means we can go check out an old version if we made a mistake.
Well, we did it. We created a reproducible workflow using Microsoft Word and Excel, and we didn't need to learn Git or DVC or become a command line wizard. Now we can iterate on our data, analysis, figures, and writing, and all we need to do to get everything up-to-date and backed up is run the pipeline and commit any changes. We can share our project and others can reproduce the outputs (so long as they have a Microsoft Office license, but that's a topic for another post). Everything is still in Git and DVC, so our more command-line-oriented colleagues can work that way if they like, using the same project repo. To achieve this, all we had to do was follow the two most important rules:
- All files go in version control.
- Artifacts need to be generated by the pipeline.
If you’d like you can browse through this project up on calkit.io. Also, feel free to shoot me an email if you’d like help setting up something similar for your project.
October 23, 2024
Reproducible OpenFOAM simulations with Calkit
Have you ever been here before? You’ve done a bunch of work to get a simulation to run, created some figures, and submitted a paper to a journal. A month or two later you get the reviews back and you’re asked by the dreaded Reviewer 2 to make some minor modifications to a figure. There’s one small problem, however: You don’t remember how that figure was created, or you’ve upgraded your laptop and now the script won’t run. Maybe you were able to clone the correct Git repo with the code, but you don’t remember where the data is supposed to be stored. In other words, your project is not reproducible.
Here we are going to show how to make a research project reproducible with Calkit, a tool I’ve been working on that ties together and simplifies a few lower-level tools to help with reproducibility:
- Git
- GitHub
- DVC (Data Version Control)
- Docker
- Cloud storage
We’re going to use OpenFOAM and Python to generate our outputs since these both can be pretty tricky to get setup correctly, never mind setup correctly twice!
Getting set up
You'll need to have Git installed and a GitHub account, as well as Python (Mambaforge recommended) and Docker. After that, install the Calkit Python package with:
pip install calkit-python
Creating and cloning the project repo
Head over to calkit.io and log in with your GitHub account. Click the button to create a new project. Let’s title ours “RANS boundary layer validation”, since for our example, we’re going to create a project that attempts to answer the question:
What RANS model works best for a simple boundary layer?
We’ll keep this private for now (though in general it’s good to work openly.) Note that creating a project on Calkit also creates the project Git repo on GitHub.
We’re going to need a token to interact with the Calkit API, so head over to your user settings, generate one for use with the API, and copy it to your clipboard.
Then we can set that token in our Calkit configuration with:
calkit config set token YOUR_TOKEN_HERE
Next, clone the repo to your local machine with (filling in your username):
calkit clone https://github.com/YOUR_USERNAME/rans-boundary-layer-validation.git
Note you can modify the URL above to use SSH if that’s how you interact with GitHub.
calkit clone is a simple wrapper around git clone that sets up the necessary configuration to use the Calkit Cloud as a DVC remote, the place where we're going to push our data, while our code goes to GitHub.

Getting some validation data
We want to validate these RANS models, so we'll need some data for comparison. It just so happens that there is already a boundary layer direct numerical simulation (DNS) dataset on Calkit, downloaded from the Johns Hopkins Turbulence Databases (JHTDB), so we can simply import that (after cd-ing into the project directory) with:

calkit import dataset \
    petebachant/boundary-layer-turbulence-modeling/data/jhtdb-transitional-bl/time-ave-profiles.h5 \
    data/jhtdb-profiles.h5
If we run calkit status, we can see that there is one commit that has been made but not pushed to origin/main (on GitHub), so we can make sure everything is backed up there with calkit push. calkit status and calkit push behave very similarly to git status and git push. In fact, those commands are run alongside some additional DVC commands, the importance of which we will see shortly.

We can now see our imported dataset as part of the project datasets on the Calkit website. We can also see the file is present locally, but ignored by Git, since it's managed by DVC. Because the dataset was imported, it does not take up any of this project's storage space, but it will be present when the repo is cloned.
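Since the file is managed by DVC rather than Git, it still sits in our working tree like any ordinary file, so we can inspect it locally. For example, with h5py we can print the name of every group and dataset it contains:

import h5py

with h5py.File("data/jhtdb-profiles.h5", "r") as f:
    f.visit(print)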
Creating a reproducible OpenFOAM environment with Docker
If you’ve never worked with Docker, it can sound a bit daunting, but Calkit has some tooling to make it a bit easier. Basically, Docker is going to let us create isolated reproducible environments in which to run commands and will keep track of which environments belong to this project in the
calkit.yaml
file.Let’s create an OpenFOAM-based Docker environment and build stage with:
calkit new docker-env \
    --name foam \
    --image openfoam-2406-foampy \
    --stage build-docker \
    --from microfluidica/openfoam:2406 \
    --add-layer mambaforge \
    --add-layer foampy \
    --description "OpenFOAM v2406 with foamPy."
This command will create the necessary Dockerfile, define the environment in our project metadata, and add a stage to our DVC pipeline that builds the image before any other commands are run.
If we run calkit status again, we see that there's another commit to be pushed to GitHub, and that our pipeline is showing some changed dependencies and outputs. This is a signal that it is out-of-date and should be executed with calkit run. When we do that, we see some output from Docker as an image is built, and calkit status then shows we have a new dvc.lock file staged and an untracked Dockerfile-lock.json file.

We'll need to add that Dockerfile-lock.json file to our repo and then commit and push to the cloud so it is backed up and accessible to our collaborators. We can do this with the calkit save command:
command:calkit save dvc.lock Dockerfile-lock.json -m "Run pipeline to build Docker image"
which does the add, commit, and push steps all in one, deciding which files to store in DVC versus Git and pushing to the respective locations, to save time and cognitive overhead. However, if desired, you can of course run those steps individually for full control.

Finally, let's check that we can run something in the environment, e.g., print the help output of blockMesh:
:calkit runenv -- blockMesh -help
Now we’re good to go. We didn’t need to install OpenFOAM, and neither will our collaborators. We’re now ready to start setting up the cases.
Adding the simulation runs to the pipeline
We can run things interactively to make sure they work, but it's not a good idea to rely on interactive or ad hoc processes to produce a permanent artifact. Any time you get to a point where you do want to save something permanent, the pipeline should be updated to produce that artifact, so let's add some simulation runs to ours.
We want to run the simulation to validate a few different turbulence models:
- Laminar (no turbulence model)
- $k$–$\epsilon$
- $k$–$\omega$
I’ve setup this project to use foamPy to create and run variants of a case with a “templatized”
turbulenceProperties
file via a scriptrun.py
, which we’re going to run in our Docker environment.To simulate the same case for multiple turbulence models, we’re going to create a “foreach” DVC stage to run our script over a sequence of values. When setup properly, DVC will be smart enough to cache the results and not rerun simulations when they don’t need to be rerun. We can create this stage with:
calkit new foreach-stage \
    --cmd "calkit runenv python run.py --turbulence-model {var} -f" \
    --name run-sim \
    --dep system \
    --dep constant/transportProperties \
    --dep run.py \
    --dep Dockerfile-lock.json \
    --out "cases/{var}/postProcessing" \
    "laminar" "k-epsilon" "k-omega"
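For reference, the --cmd above invokes run.py once per turbulence model. The actual script in this project uses foamPy, but a bare-bones, standard-library-only sketch of the same idea might look like this (the template file name and solver are placeholders):

import argparse
import pathlib
import string
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--turbulence-model", default="laminar")
parser.add_argument("-f", "--force", action="store_true")
args = parser.parse_args()

# Create a case variant with this model's turbulenceProperties
# (copying the rest of the case files and honoring --force is omitted here)
case = pathlib.Path("cases") / args.turbulence_model
(case / "constant").mkdir(parents=True, exist_ok=True)
template = string.Template(pathlib.Path("turbulenceProperties.template").read_text())
(case / "constant" / "turbulenceProperties").write_text(
    template.substitute(model=args.turbulence_model)
)
# Run the solver in the case directory (simpleFoam as a placeholder)
subprocess.run(["simpleFoam", "-case", str(case)], check=True)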
Another call to calkit status shows our pipeline needs to be run, which makes sense, so let's give it another calkit run. You'll note that at the very start, our Docker image build stage is not rerun, thanks to DVC tracking and caching the inputs and outputs.

The output of calkit status now shows something we haven't seen yet: we have some new data produced as part of those simulation runs. We can do another calkit save to ensure everything is committed and pushed:
to ensure everything is committed and pushed:calkit save -am "Run simulations"
In this case, the outputs of the simulations were pushed up to the DVC remote in the Calkit Cloud.
We are defining an output for each simulation as its postProcessing folder, which we will cache and push to the cloud for backup, so others (including our future selves) can pull down the results and work with them without needing to rerun the simulations. We are also defining dependencies for the simulations. This means that if anything in the system folder, the run.py script, constant/transportProperties, or our Dockerfile changes, DVC will know it needs to rerun the simulations. Conversely, if those haven't changed and we already have results cached, there's no need to rerun. This is nice because, to produce our outputs, we basically only need to remember one command and keep running it: calkit run, a wrapper around dvc repro that parses some additional metadata to define certain special objects, e.g., datasets or figures.

Creating a figure to visualize our results
We want to compare the OpenFOAM results to the DNS data, for which we can plot the mean streamwise velocity profiles, for example. We can create a new figure with its own pipeline stage with:
calkit new figure \
    figures/mean-velocity-profiles.png \
    --title "Mean velocity profiles" \
    --description "Mean velocity profiles from DNS and RANS models." \
    --stage plot-mean-velocity-profiles \
    --cmd "calkit runenv python scripts/plot-mean-velocity-profiles.py" \
    --dep scripts/plot-mean-velocity-profiles.py \
    --dep data/jhtdb-profiles.h5 \
    --deps-from-stage-outs run-sim
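The plotting script itself is ordinary matplotlib. A much-simplified sketch follows; the HDF5 dataset names and the postProcessing file path are hypothetical, and the real script is more involved:

import os

import h5py
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
# DNS reference profiles (dataset names made up for illustration)
with h5py.File("data/jhtdb-profiles.h5", "r") as f:
    ax.plot(f["y"][:], f["u_mean"][:], "k-", label="DNS")
# RANS results sampled from each case's postProcessing directory
for model in ["laminar", "k-epsilon", "k-omega"]:
    fpath = f"cases/{model}/postProcessing/sample/0/profile_U.xy"  # hypothetical
    y, u = np.loadtxt(fpath, usecols=(0, 1), unpack=True)
    ax.plot(y, u, label=model)
ax.set_xlabel("$y$")
ax.set_ylabel("$U$")
ax.legend()
os.makedirs("figures", exist_ok=True)
fig.savefig("figures/mean-velocity-profiles.png", dpi=200)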
The --deps-from-stage-outs option is going to automatically create dependencies based on the outputs of our run-sim stage, saving us the trouble of typing out all of those directories manually.

Another call to calkit status shows we need to run the pipeline again, and another call to calkit run followed by calkit save -m "Run pipeline to create figure" will create this figure and push it to the repo. The figure is now viewable as its own object up on the website.

Solving the problem
Now let’s show the value of making our project reproducible, addressing the problem we laid out in the introduction, assuming Reviewer 2’s request was something like:
The legend labels should be updated to use mathematical symbols.
We’re going to pretend we were forced to start from scratch, so let’s delete the local copy of our project repo and clone it again using same
calkit clone
command we ran above.After moving into the freshly cloned repo, you’ll notice our imported dataset, the
cases/*/postProcessing
directories, and our figure PNG file were all downloaded, which would not have happened with a simplegit clone
, since those files are kept out of Git.Running
calkit status
andcalkit run
again shows that what we’ve cloned is fully up-to-date.Next we edit our plotting script to make the relevant changes. Then we execute
calkit run
. Again, notice how the simulations were not rerun thanks to the DVC cache.If we run
calkit status
we see there are some differences, so we runcalkit save -am "Change x-axis label for reviewer 2"
to get those saved and backed up. If we go visit the project on the Calkit website, we see our figure is up-to-date and has the requested changes in the legend. Reviewer 2 will be so happy 😀Conclusions and next steps
We created a project that runs OpenFOAM simulations reproducibly, produces a figure comparing against an imported dataset, and ensures these are kept in version control and backed up to the cloud. This is of course a simplified example for demonstration, but you could imagine expanding the pipeline to include more operations, such as:
- Running a mesh independence study.
- Building a LaTeX document as a report or paper.
Since DVC pipelines can run arbitrary commands, you're not locked into a specific language or framework. You could run shell scripts, MATLAB programs, etc., all in a single pipeline.
You can view this project up on Calkit and GitHub if you’d like to get started doing something similar. Also, please comment below if you’re using a different tool or framework to make your own work reproducible.