Keeping mathematical software working and available

Computer programs are increasingly used in mathematical research. They perform computations in algebraic geometry which would be too laborious for people to execute by hand (think Gröbner bases), search through vast discrete spaces for (counter)examples (SAT solvers) or verify properties (like the emptiness of a polyhedron using LP solvers). Sometimes computations are even the centerpiece of a paper.

As a PhD student, I developed a framework called CInet::Tools to do discrete and polyhedral computations with conditional independence structures. I have used it in half a dozen papers by now, but its installation is quite complex and error-prone (as I’ve been told). It is written in Perl but ships C and C++ libraries and external programs (SAT and LP solvers), which it compiles (statically!) and uses for the heavy lifting.

This is bad because it makes it hard for other researchers to reproduce or verify my data, build on top of it, or design their own computations in this area of mathematics. It limits the reach of my research, and for a long time I have had the vision of packaging this software as a container image (see container technology on Wikipedia). In this article I will talk about Containerfiles and podman. podman is a tool which mimics the more widely known Docker but requires neither a daemon nor root privileges to operate. And since image formats are an open standard now, Containerfile is, as far as I know, just a more vendor-neutral term for Dockerfile: the syntax is the same, and Docker images can be used verbatim with podman.
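As a small illustration of this compatibility, the following commands pull an unmodified image from Docker Hub and run it with podman, mirroring their docker counterparts one-to-one:

    # Pull an unmodified Docker image from Docker Hub ...
    podman pull docker.io/library/debian:12
    # ... and start an interactive shell in it.
    podman run --rm -it docker.io/library/debian:12 /bin/bash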

The next few sections lay out my reasons why this practice should be adopted for mathematical research data. Aside from encapsulating computations, making computing environments shareable and keeping my real computer clean of experimental software, containers have further benefits. These relate to mathematical research data as a time-bound responsibility: I want to publish data correctly right now so that problems do not arise years later.

Availability of software

With a ready-made image, I can easily provide my software to others. No more complicated and time-consuming installation processes (after getting containers to run initially). And it is not just my own software: any software bundle can be shared with precisely the same configuration and system environment among collaborators or the whole group. This minimizes incompatibilities while working on joint projects.
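In practice, sharing can be as simple as the following sketch; the registry and image name are hypothetical placeholders for wherever the image is actually hosted:

    # A collaborator fetches the shared image and runs the software inside it;
    # registry and image name are made-up placeholders.
    podman pull registry.example.org/cinet/cinet-tools:1.0
    podman run --rm -it registry.example.org/cinet/cinet-tools:1.0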

I also consider my future self a collaborator. The system in the container is blissfully unaware of all the updates my real computer has to go through. Years later I can still boot up an old image and the software it contains will work exactly as before.

However, the container technology on which podman is based is Linux-specific. To use it on Windows or macOS, according to the documentation, a virtual machine is required, which might deter some collaborators.

Documentation and repeatability of computations

Up until now, whenever a project produced research data which supported results but could not be verified independently, I would rent a server in the cloud for a day and repeat the computations there. This allowed me to pinpoint software versions, collect dependencies, configurations and installation instructions, and record the actual order of computations, so that I could clearly document the entire process. I would then publish the data together with this documentation.

The Containerfile documents the installation procedure in a way which humans can read and reason about (if they are trained in shell scripting and administrating Linux systems) and which computers can repeat automatically and consistently. The combination of the two makes containers more pleasant and less error-prone than hand-written installation manuals, although writing one does require a certain amount of (shell) scripting experience.
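To give an impression, here is a minimal sketch of such a Containerfile. The base image, the package list and the installation command are assumptions for illustration, not the actual recipe for CInet::Tools:

    # Start from a fixed, well-known base system.
    FROM docker.io/library/debian:12

    # Record every system dependency explicitly in one place.
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            build-essential cpanminus libgmp-dev && \
        rm -rf /var/lib/apt/lists/*

    # Copy the source tree into the image and install it with cpanm.
    COPY . /opt/cinet-tools
    RUN cpanm --notest /opt/cinet-tools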

I personally enjoy the built-in checkpointing functionality of container engines: every step in the Containerfile produces a new “layer” of the image. Layers are cached, and a new layer is only created once its step finishes successfully. This means that if my script runs fine up to step 40 out of 50 but a typo then causes step 41 to fail, I can fix the typo and restart the procedure. The build automatically reuses the cached layers up to step 40 and continues from there, with no remains of the failed step 41 from the previous run. This way I can be sure that, even if I make mistakes and change things along the way, the resulting image is as if everything had worked right from the start and run in one single pass.
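This also suggests structuring the Containerfile so that expensive computations get their own steps. A hypothetical continuation of the sketch above (the script path is made up):

    # Each RUN instruction becomes its own cached layer, so the expensive
    # computation is not repeated when a later step fails and must be fixed.
    RUN mkdir /data
    RUN /opt/cinet-tools/scripts/enumerate-structures.pl > /data/results.txt
    RUN gzip --keep /data/results.txt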

The Containerfile itself is research data as well. Starting from a shared but rudimentary base, it can be developed until all the required computations for a project are part of the image specification. At this point it is complete and becomes an asset which should be published alongside the paper. Now everyone can repeat the computation with no more than a podman build.
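Concretely, that build step looks as follows; the tag name is an arbitrary example:

    # Build the image from the Containerfile in the current directory
    # and give it a tag for later reference.
    podman build -t cinet-project -f Containerfile .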

Auditing of research data across time

The “everyone” above includes myself in two years! What if a colleague finds a counterexample to a theorem whose proof rests on a computation? If I keep the image on which the computation was performed, the situation can be investigated with confidence. This in particular gives me incredible peace of mind!

Suppose you do a quick computation one morning and it leads to a conjecture. You investigate it and find that it is wrong in a way which is surprising given your initial computation. You redo the computation (more likely, some variant of the original computation) two weeks later and find that the original result you based the conjecture on was wrong. But where is the error? Did you make an error just now or two weeks ago? Is the problem even in your code, or in a dependency which was updated in the meantime? To some extent, this is normal in experimental mathematics, but it should not happen with published results. And if it does happen, having the Containerfile with the exact procedure and the old image with the exact system state allows you to hunt down the source of the disagreement. This way, you can know who is wrong, why, and whether a corrigendum needs to be published.

Archival of research data and system environment

The image that comes out of a Containerfile is a “living” archive of all the software which powered the computation, including its dependencies, their exact versions and their source code. It is often tedious to collect all of this information (unless your software is entirely written in Julia with the assistance of its package manager). It may not even be clear to what extent you have to collect it: is it enough to say which version of soplex I used, or do I need the gmp version and the libc version as well? So instead of collecting this information up front, I archive the image, which preserves its source so that it can be consulted when required.
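Archiving the image itself is a one-liner; the image and file names below are examples:

    # Write the image to a tar archive for long-term storage ...
    podman save -o cinet-project.tar localhost/cinet-project
    # ... and restore it later, possibly on a different machine.
    podman load -i cinet-project.tar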

This archive is “living” because I can log into the system stored in the image at any time and do more computations in the same environment, even years later. In the meantime, I will probably have updated my own computer, replaced it, or changed it in many other ways. Such changes are undesirable on systems that store or handle mathematical research data, and with containers I do not have to worry about them: I keep the sensitive parts in a container and archive the image.
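“Logging in” simply means starting an interactive container from the archived image, assuming the image ships a shell:

    # Open a shell in the archived environment; --rm removes the container
    # on exit and leaves the image itself untouched.
    podman run --rm -it localhost/cinet-project /bin/bash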

However, one caveat has to be mentioned here: with podman, the container runs on the host system’s kernel, so the virtualization is only partial. While this is good for performance, it means that the image’s system is not completely independent of the host. On the other hand, changing the behavior of userspace programs is considered a bug in the Linux kernel developer community, so one can assume a good amount of stability. Still, the kernel does permit occasional API and ABI changes, so the situation is not ideal.

To be on the safe side, I would always install a kernel in the image while building it. That kernel will most likely match the kernel you are currently running well enough that results recorded in the container will be consistent. In the far future, if the archived image can no longer be run under your current kernel, it is possible to extract the container’s filesystem (including the old kernel) and to create a qemu-compatible image out of it. This can then be run under full emulation if the need should arise. See this StackOverflow post for the steps of converting a container image to a qemu image.
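A rough sketch of that conversion, under the assumption that the image ships its own kernel with the necessary drivers built in; the names are placeholders and the referenced post has the details:

    # Export the container's root filesystem to a tar archive.
    podman create --name archived localhost/cinet-project
    podman export archived -o rootfs.tar
    # Turn it into a disk image, e.g. with virt-make-fs from libguestfs.
    virt-make-fs --format=qcow2 --size=+1G rootfs.tar cinet-project.qcow2
    # Boot under full emulation, using the kernel extracted from rootfs.tar.
    qemu-system-x86_64 -m 2048 -drive file=cinet-project.qcow2,format=qcow2 \
        -kernel vmlinuz -append "root=/dev/sda rw"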

Smoke-testing of mathematical software

Some previous sections operated under the assumption that something was wrong in the previously published data. But there are benefits in the other direction as well: if the computation is believed to be correct, then creating new images from the same old Containerfile should always yield the same results. First, the program should still compile, install and run. Second, it should produce the same output. In this way, the production of mathematical research data can contribute test cases for mathematical software.
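A hedged sketch of such a check, with made-up names for the run command and the reference output:

    # Rebuild from scratch, ignoring cached layers, so that current
    # versions of all dependencies are pulled in.
    podman build --no-cache -t cinet-check .
    # Re-run the computation and compare against the published output.
    podman run --rm cinet-check /opt/cinet-tools/scripts/enumerate-structures.pl > new-results.txt
    diff published-results.txt new-results.txt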

A dream of mine is to run a “spring cleaning” on a repository like MathRepo. Imagine every project on there had an accompanying Containerfile. One could run through them once a year and see if the programs still run and produce consistent results with whatever new versions of their dependencies are downloaded. This could highlight recently introduced incompatibilities between software projects and, in any case, draw attention to mathematical research data which is no longer reproducible with the steps provided.
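As a sketch, assume a checkout where every project directory contains its Containerfile, a run script and the reference output recorded at publication time; all of these names are hypothetical:

    # Yearly sweep over all projects: rebuild each image from scratch,
    # re-run the computation and compare against the recorded output.
    for project in projects/*/; do
        podman build --no-cache -t smoke-test "$project" \
            || { echo "BUILD FAILS: $project"; continue; }
        podman run --rm smoke-test ./run-computation > current-output.txt
        diff "${project}reference-output.txt" current-output.txt \
            || echo "RESULTS CHANGED: $project"
    done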

Training young mathematicians

For containers to catch on, some computer literacy is required. I personally have been running Arch Linux systems (with and without X servers) since I was in 10th grade, and I find myself setting up servers on an irregular but frequent basis. Consequently, I had no trouble getting into podman: it took me one afternoon to get comfortable with the mode of operation and to put my first image with CInet::Tools together.

I realize that the task of writing a Containerfile would be overwhelming for the majority of mathematicians as they have probably never set up a Linux server. However, our mathematical software runs on computers and our data is digital. In order to manage all of this better, mathematicians have to master these technologies or enlist the help of system administrators who understand their needs and care about mathematics. Whichever is easier.