Docker versus Nix: The quest for true reproducibility

By [email protected], 07.02.2026

When conducting performance benchmarks, the goal is an apples-to-apples comparison. Docker, widely regarded as one of the most brilliant inventions of the last 15 years, offers a level of reproducibility that lets engineers share applications running on a test bed. Consequently, containers have become a practical standard for replicating tests.

However, significant caveats remain. Containers share the host's kernel, so an application packaged for testing still depends on the host operating system and on whatever kernel updates it has received. This phenomenon, often termed “configuration drift,” means that runtimes are not always reproducible to the extent that accurate comparative benchmarks require. This is particularly relevant when analyzing complex applications like ScyllaDB or Cassandra. While a ScyllaDB package for Nix is not yet available, examining Cassandra through the lens of Nix offers insight into how benchmarks could be made more rigorously reproducible than current practice allows.

Reusable vs. reproducible

While many industry professionals rely on Docker for its portability, a distinction must be made between a container that is simply reusable and one that is truly reproducible. Michael Stahnke said that the standard Docker build process often introduces non-deterministic elements that compromise rigorous benchmarking.

“I don’t think I would call Docker reproducible to start with. It’s reusable in that you can reuse the image and build multiple containers from it. But reproducing that image is actually pretty difficult to do. Most images have lines in their Dockerfile that do things like apt-get upgrade or yum upgrade or the equivalent,” Stahnke said.

“That’s non-deterministic because if you did it two weeks ago versus doing it today, you’re going to end up with different bits in the image. There are ways you could pin everything, but even if you pin something, like for example with apt, you pin the leaf node of the tree, but you don’t pin every dependency underneath it. So if you install a Python application, it still may upgrade libc underneath that, and so you don’t have guaranteed reproducibility unless you’re working really, really hard to pin every package.”
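The leaf-versus-tree problem Stahnke describes can be illustrated with a small sketch. The package names, versions, and the two index snapshots below are invented for illustration and are not real apt metadata, and the resolver is a deliberately naive stand-in for a package manager.

```python
# Toy model: pinning a leaf package does not pin what sits underneath it.
# All package names, versions, and snapshots here are hypothetical.

# Each snapshot maps a package to (newest available version, dependencies).
INDEX_TWO_WEEKS_AGO = {
    "myapp": ("1.4.0", ["python3"]),
    "python3": ("3.12.1", ["libc"]),
    "libc": ("2.38", []),
}
INDEX_TODAY = {
    "myapp": ("1.4.0", ["python3"]),
    "python3": ("3.12.3", ["libc"]),
    "libc": ("2.39", []),
}


def resolve(package, index, pins):
    """Return {name: version} for a package and everything beneath it."""
    resolved = {}
    stack = [package]
    while stack:
        name = stack.pop()
        if name in resolved:
            continue
        newest, deps = index[name]
        # A pin wins; anything unpinned floats to the newest version.
        resolved[name] = pins.get(name, newest)
        stack.extend(deps)
    return resolved


leaf_pin = {"myapp": "1.4.0"}
old_build = resolve("myapp", INDEX_TWO_WEEKS_AGO, leaf_pin)
new_build = resolve("myapp", INDEX_TODAY, leaf_pin)

# The leaf is identical, yet the two builds differ underneath it.
drifted = sorted(n for n in old_build if old_build[n] != new_build[n])
print("drifted packages:", drifted)

# Pinning the whole closure removes the drift.
full_pin = dict(old_build)
repinned = resolve("myapp", INDEX_TODAY, full_pin)
print("identical after full pin:", repinned == old_build)
```

Only the second call, which pins every package in the closure, reproduces the earlier build. That is the "working really, really hard" case from the quote.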

Stahnke said that engineers might try to work around this for benchmarks by creating a single, monolithic container for all tests.

“From a benchmarking standpoint, what you may be able to do is have one container that has all the things you would ever benchmark in it. That way, you’re using the exact same image for every set of benchmarks,” Stahnke said. “You just turn off the database that you’re not using, or something like that. I’m not sure that that’s apples-to-apples, but most benchmarks, if you actually peer into them, are not that great from that perspective.”

Architect the ‘clean room’

Where NixOS improves on Docker containers is largely in how packages are built. Usability and reproducibility have been defaults in Nix since its inception, which represents a fundamental shift in design philosophy compared to standard containerization. Stahnke said that true reproducibility requires a strict, isolated build environment that keeps external influences, such as timestamps or internet access, from altering the final artifact.

“When you architect a system with reproducibility design from the ground up, you have to make some pretty different choices. For example, packages that are built using Nix are built in a clean room environment or a sandbox,” Stahnke told The New Stack. “That means no network connections, no access to files outside of the exact build environment, and that means that you basically have a pure artifact that is built and reproducible because there’s no reaching out for side effects or things that can’t be guaranteed, such as reaching out to a language package manager provider or reaching out to the internet to go grab information or documentation or whatever. Like the cryptographic hash of the inputs to your build is guaranteed to be reproducible.”
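The idea of identifying a build by a cryptographic hash of its inputs can be sketched in a few lines. This is a simplified illustration, not Nix's actual hashing scheme: the function name `input_hash`, the record layout, and the sample inputs are all assumptions made for the example.

```python
import hashlib
import json


def input_hash(source: bytes, build_script: str, dependency_hashes: list[str]) -> str:
    """Derive a build identity from the declared inputs only (hypothetical scheme)."""
    record = {
        "source": hashlib.sha256(source).hexdigest(),
        "build_script": build_script,
        # Sort so that declaration order does not change the identity.
        "dependencies": sorted(dependency_hashes),
    }
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


libc = input_hash(b"libc sources", "make && make install", [])
app_a = input_hash(b"app sources", "make install", [libc])
app_b = input_hash(b"app sources", "make install", [libc])

# Same declared inputs give the same identity, whenever and wherever this runs.
print(app_a == app_b)

# Changing a dependency changes the identity of everything built on top of it.
patched_libc = input_hash(b"libc sources, patched", "make && make install", [])
app_c = input_hash(b"app sources", "make install", [patched_libc])
print(app_a == app_c)
```

Nothing outside the declared inputs (no clock, no network, no stray files) feeds into the hash, which is the property the sandbox exists to protect.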

Stahnke said that this strict control extends to metadata, ensuring that even time-based variables do not cause drift.

“Building further on that point, even things like timestamps are set appropriately so that they’re always set at the epoch. That way you have guaranteed reproducibility because something with a different timestamp would have different hash sums and such,” Stahnke said. “About a decade ago, I believe Debian got pretty into doing reproducibility, and they ended up copying a lot of the facilities that Nix had had underneath. Even the Debian reproducibility effort, I think it was a special interest group and didn’t really take off and permeate throughout the entire Debian ecosystem.”
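The effect of timestamps on hash sums is easy to demonstrate with Python's standard `tarfile` module. The sketch below is an illustration of the principle only, not how Nix packs its outputs; the file names and contents are made up.

```python
import hashlib
import io
import tarfile


def build_archive(files: dict[str, bytes], mtime: int) -> bytes:
    """Pack files into an uncompressed tar with explicit, fixed metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in sorted(files):  # stable member order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime
            info.uid = info.gid = 0  # no builder-specific ownership
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical build output.
files = {"bin/app": b"binary contents", "share/doc/README": b"hello\n"}

# Identical contents, built a day apart with real timestamps.
monday = digest(build_archive(files, mtime=1_700_000_000))
tuesday = digest(build_archive(files, mtime=1_700_086_400))
print("wall-clock builds match:", monday == tuesday)

# Identical contents with every timestamp set to the epoch.
epoch_a = digest(build_archive(files, mtime=0))
epoch_b = digest(build_archive(files, mtime=0))
print("epoch builds match:", epoch_a == epoch_b)
```

The file contents never change between the four archives; only the recorded modification time does, and that alone is enough to change the digest.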

Stahnke said that common Docker practices, specifically using mutable tags, often cause issues for operators.

“As far as Docker goes, there’s another concept that ends up often hurting operators: using latest as the version of something that you’re using. So if you have an AWS ELB with nodes in the backend, and you spin up a new node and it just pulls the image that you’re using at latest, is that the same version that’s running on the other nodes in that ELB cluster? Those are things that are difficult to know,” Stahnke said. “Some of that isn’t necessarily a problem with Docker specifically, but more of the implementation and the common patterns of using things like a latest tag, which is fully mutable and moves when you have new images available.”
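The difference between a mutable tag and an immutable content digest can be modeled with a toy registry. `ToyRegistry` is a hypothetical in-memory class written for this example and does not reflect the Docker Registry API; the image contents are placeholder bytes.

```python
import hashlib


class ToyRegistry:
    """Hypothetical in-memory registry; not the Docker Registry API."""

    def __init__(self):
        self._blobs = {}  # digest -> image bytes (never changes)
        self._tags = {}   # tag -> digest (can be repointed at any time)

    def push(self, tag: str, image: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(image).hexdigest()
        self._blobs[digest] = image
        self._tags[tag] = digest  # moving a tag silently overwrites it
        return digest

    def pull(self, reference: str) -> bytes:
        # A reference is either a tag or a digest.
        digest = self._tags.get(reference, reference)
        return self._blobs[digest]


registry = ToyRegistry()

v1 = registry.push("latest", b"app build from two weeks ago")
node_a = registry.pull("latest")  # node that has been running for a while

v2 = registry.push("latest", b"app build from today")
node_b = registry.pull("latest")  # node that was just spun up

print("nodes run the same image:", node_a == node_b)
print("digest still resolves to the old image:", registry.pull(v1) == node_a)
```

Both nodes asked for the same tag and received different images. Pulling by digest is the only request in this model whose answer cannot change over time.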

Bridge the usability gap with Flox

Despite its technical superiority in reproducibility, Nix has historically been viewed as having a steep learning curve. However, new layers running on top of Nix, such as Flox, are increasing accessibility. These tools allow commercial environments to leverage the principles of Nix without the academic friction, facilitating a transition from “works on my machine” to “works on your machine” regardless of the underlying operating system. Stahnke said that while Nix is powerful, its academic roots require an additional layer of tooling to make it viable for fast-paced commercial environments.

“I think tooling built on principles and processes of Nix is great. Nix was clearly an academic setup when it was built, and it has some things that I would say are rough around the edges for everyday users, particularly in fast-learning and in commercial environments where the tech isn’t there for the sake of tech; it’s there to enable the business to have different outcomes,” Stahnke said. “Nix isn’t super easy to use, and there’s a lot of variability and there’s a lot of power in there. So you have to figure out what workflows are common to what users need, what parts and principles of the Nix ecosystem are important, and then what can you build on top of it to make it easier to use, easier to understand, easier to read about, easier to share with your colleagues, easier to teach?”

For Flox specifically, Stahnke said that the tool uses environment primitives to ensure that software behaves identically across different hardware architectures.

“For Flox specifically, that came down to: We have a primitive called the Flox environment and that includes all of your packages, it includes service management instructions, it includes environment variables and configuration and all that. You can just do a flox activate. That’s calculated to work across platforms, so if I’m on Linux x86 but my colleague is on a Mac with an Apple silicon chipset, every time I make a change in my environment it’s calculating what the changes are for those other targets as well,” Stahnke told The New Stack.

“Therefore, when my colleague pulls this environment onto their laptop they run a simple command: flox activate and they have the exact same bits that I do. The exact same software versions from the exact same builds from the exact same input obviously built to the target that they’re running on but those are the calculations we perform under the hood for every environmental transaction. The primary goal there is to move from ‘works on my machine’ to ‘works on your machine.’”

Stahnke said that this reproducibility extends beyond development, offering significant advantages in production environments by eliminating the need for redundant testing.

“This also means that there are facilities that allow this to work better in production once you’re past the development lifecycle. In production, you can say, ‘Okay, we develop on Macs but we run on Linux.’ Alright, we’ve already calculated all that transaction,” Stahnke said. “If you’ve already run tests as part of your build, those tests don’t need to be re-run because it’s mathematically provable that you are testing the same artifact or you’ve already tested the exact same artifact previously, so you don’t have to re-run tests like that.”
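The reasoning behind skipping repeated tests can be sketched as a cache keyed by the artifact's hash: if the bits are identical, an earlier result still applies. This is an assumed, minimal model and not how Flox or Nix implement it; `run_suite` is a hypothetical stand-in for a real test suite.

```python
import hashlib

tested = {}  # artifact digest -> recorded test result
runs = []    # one entry each time the suite really executes


def run_suite(artifact: bytes) -> bool:
    """Stand-in for an expensive test suite (hypothetical pass/fail rule)."""
    runs.append(hashlib.sha256(artifact).hexdigest())
    return artifact.startswith(b"good")


def verify(artifact: bytes) -> bool:
    """Run the suite only for artifacts whose exact bits are unseen."""
    key = hashlib.sha256(artifact).hexdigest()
    if key not in tested:
        tested[key] = run_suite(artifact)
    return tested[key]


build = b"good build, revision 1"

at_build_time = verify(build)   # suite runs here
at_deploy_time = verify(build)  # identical bits, so the result is reused
print("suite executions:", len(runs))

changed = verify(b"good build, revision 2")  # different bits, suite runs again
print("suite executions:", len(runs))
```

The shortcut is only sound when the key covers everything that could influence the result, which is why it depends on the reproducible builds described earlier.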

Stahnke said that the inherent bookkeeping of the Nix ecosystem provides automatic software provenance, removing the need for external scanning tools.

“Also, because we understand all the inputs and outputs coming into building a package or into building a Flox environment, it means you can get software build materials by default. There’s no add-on that you have to run afterwards, there’s no scan you have to do or special tooling. It’s just part of the ecosystem,” Stahnke said. “The bookkeeping is part of what makes Nix Nix, and that’s part of what makes Flox Flox. So you get the bookkeeping and complete software provenance, software build materials and understanding of what’s inside of every package.”
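When every build records its inputs, a list of what went into a package falls out of a simple graph walk. The records and package names below are invented, and the function is a sketch of the bookkeeping idea rather than the output format of Nix or Flox.

```python
# Hypothetical build records: each output lists its direct inputs.
BUILD_RECORDS = {
    "myapp-1.4.0": ["python3-3.12.3", "openssl-3.2.1"],
    "python3-3.12.3": ["libc-2.39", "openssl-3.2.1"],
    "openssl-3.2.1": ["libc-2.39"],
    "libc-2.39": [],
}


def bill_of_materials(root, records):
    """Return every package reachable from root, including root itself."""
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dependency in records[name]:
            visit(dependency)

    visit(root)
    return sorted(seen)


materials = bill_of_materials("myapp-1.4.0", BUILD_RECORDS)
for item in materials:
    print(item)
```

No scanner inspects the finished artifact here. The list is read directly from records that the build had to produce anyway.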

The post Docker versus Nix: The quest for true reproducibility appeared first on The New Stack.
