Freexian Collaborators: How files are stored by Debusine (by Stefano Rivera)
Debusine is a tool designed for Debian developers and Operating
System developers in general. This post describes how Debusine stores
and manages files.
Debusine has been designed to run a network of “workers” that can
perform various “tasks” that consume and produce “artifacts”.
The artifact itself is a collection of files structured into an
ontology of artifact types.
This generic architecture should be suited to many sorts of build & CI
problems.
We have implemented artifacts to support building a Debian-like
distribution, but the foundations of Debusine aim to be more general
than that.
For example, a package build task takes a debian:source-package as
input and produces some debian:binary-packages and a
debian:package-build-log as output.
This generalized approach is quite different to traditional Debian APT
archive implementations, which typically require the archive contents
to be present on the filesystem.
Traditionally, most Debian distribution management tasks happen within
bespoke applications that cannot share much common infrastructure.
File Stores
Debusine’s files themselves are stored by the File
Store layer.
There can be multiple file stores configured, with different policies.
Local storage is useful as the initial destination for uploads to
Debusine, but it has to be backed up manually and might not scale to
sufficiently large volumes of data.
Remote storage such as S3 is also available.
It is possible to serve a file from any store, with policies for which
one to prefer for downloads and uploads.
Administrators can set policies for which file stores to use at the
scope level, as well as policies for populating and draining stores
of files.
Artifacts
As mentioned above, files are collected into Artifacts. They combine:
- a set of files with names (including potentially parent directories)
- a category, e.g. debian:source-package
- key-value data in a schema specified by the category and stored as a
JSON-encoded dictionary.
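As a rough illustration of those three parts, here is a minimal Python sketch. It is not Debusine's actual model: the class name `Artifact`, its methods, and the `name` and `version` data keys are hypothetical, and files are referenced by SHA-256 digest only to match the content addressing described in this post.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Toy model of an artifact (hypothetical, not Debusine's class)."""

    # Category that determines the schema of `data`.
    category: str
    # Path inside the artifact (may include parent directories)
    # mapped to the SHA-256 hex digest of the file contents.
    files: dict[str, str] = field(default_factory=dict)
    # Category-specific key-value data.
    data: dict[str, object] = field(default_factory=dict)

    def add_file(self, path: str, content: bytes) -> None:
        """Record a named file by the digest of its contents."""
        self.files[path] = hashlib.sha256(content).hexdigest()

    def data_as_json(self) -> str:
        """Serialise the key-value data as a JSON-encoded dictionary."""
        return json.dumps(self.data, sort_keys=True)


artifact = Artifact(
    category="debian:source-package",
    data={"name": "hello", "version": "2.10-3"},  # hypothetical keys
)
artifact.add_file("hello_2.10-3.dsc", b"placeholder .dsc contents\n")
artifact.add_file("extra/notes.txt", b"placeholder notes\n")
```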
Within the stores, files are content-addressed: a file with a given
SHA-256 digest is only stored once in any given store, and may be
retrieved by that digest.
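To make the idea of content addressing concrete, the following is a minimal in-memory sketch in Python. The class `ContentAddressedStore` and its methods are invented for illustration and say nothing about how Debusine's file stores are actually implemented.

```python
import hashlib


class ContentAddressedStore:
    """In-memory toy store keyed by SHA-256 digest (illustration only)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content and return its SHA-256 hex digest."""
        digest = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so a second put()
        # of the same bytes does not store another copy.
        self._blobs.setdefault(digest, content)
        return digest

    def get(self, digest: str) -> bytes:
        """Retrieve content by digest; raises KeyError if absent."""
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentAddressedStore()
first = store.put(b"example file contents\n")
second = store.put(b"example file contents\n")
```

Both calls to `put()` return the same digest, and the store holds a single copy of the contents.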
When a new artifact is created, its files are uploaded to Debusine as
needed.
Some of the files may already be present in the Debusine instance.
In that case, if the file is already part of the artifact’s workspace,
then the client will not need to re-upload the file.
If not, the client must upload it again, so that users cannot obtain
unauthorized access to existing file contents in another private
workspace or multi-tenant scope.
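The rule can be sketched as a small Python function. This is only an illustration under assumed names: `needs_upload` and the `files_by_workspace` mapping are hypothetical and do not reflect Debusine's API. The point is that knowing a digest is not enough to skip the upload; the file must already be visible in the same workspace.

```python
def needs_upload(
    digest: str,
    workspace: str,
    files_by_workspace: dict[str, set[str]],
) -> bool:
    """Return True when the client has to send the file contents.

    The upload is skipped only if the digest is already known in the
    artifact's own workspace. A match in any other workspace is ignored,
    so a client cannot gain access to a file just by claiming its digest.
    """
    return digest not in files_by_workspace.get(workspace, set())


# Hypothetical state: one file known in "team-a", nothing in "team-b".
known = {"team-a": {"digest-1"}, "team-b": set()}
```

With this state, `needs_upload("digest-1", "team-a", known)` is false, while the same digest in `team-b` still requires an upload even though the instance already stores the contents.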
Because content-addressing makes storing duplicates cheap, it’s
common to have artifacts whose files overlap.
For example, a debian:upload will contain some of the same files as
the related debian:source-package, as well as the .changes file.
Looking at the debusine.debian.net instance that we run, we can see
content-addressing savings of 629 GiB across our (currently) 2 TiB
file store.
This is somewhat inflated by the Debian Archive import, which made no
attempt to share artifacts between suites.
But it still shows reasonable real-world savings.
APT Repository Representation
Unlike a traditional Debian APT repository management tool, Debusine
does not store source and binary packages directly in the “pool” of an
APT repository on disk on the Debusine server.
Instead we abstract the repository into a debian:suite
collection within the Debusine database.
The collection contains the artifacts that make up the APT repository.
To ensure that the repository can be safely represented as a valid URL
structure (or as files on disk), the suite collection maintains an index
of the pool filenames of its artifacts.
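One plausible use of such an index is to reject two different files that would land on the same pool path. The Python sketch below assumes that purpose; `pool_path` follows the conventional Debian pool layout (first letter of the source package name, or the first four characters for `lib*` packages), while the `PoolIndex` class is hypothetical and not Debusine's implementation.

```python
def pool_path(component: str, source: str, filename: str) -> str:
    """Build a pool path using the conventional Debian pool layout."""
    prefix = source[:4] if source.startswith("lib") else source[:1]
    return f"pool/{component}/{prefix}/{source}/{filename}"


class PoolIndex:
    """Toy index mapping pool paths to content digests (hypothetical)."""

    def __init__(self) -> None:
        self._digest_by_path: dict[str, str] = {}

    def add(self, path: str, digest: str) -> None:
        """Register a file, refusing a different file at the same path."""
        existing = self._digest_by_path.get(path)
        if existing is not None and existing != digest:
            raise ValueError(f"{path} already maps to different contents")
        self._digest_by_path[path] = digest


index = PoolIndex()
path = pool_path("main", "hello", "hello_2.10-3.dsc")
index.add(path, "digest-1")
index.add(path, "digest-1")  # same contents again: accepted
```

Adding `path` a third time with a different digest would raise `ValueError`, which keeps the repository representable as one file per URL.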
Suite collections can combine into a debian:archive collection that
shares a common file pool.
Debusine collections can keep a historical record of when things were
added and removed. This, combined with the database-backed,
collection-driven repository representation, makes it very easy to
provide APT-consumable snapshot views of every point in a repository’s
history.
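The idea behind snapshots can be sketched in a few lines of Python: if every collection item records when it was added and removed, the contents at any moment are just a filter over those timestamps. The names `CollectionItem`, `added_at`, `removed_at`, and `snapshot`, and the package versions used, are assumptions made for this example rather than Debusine's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CollectionItem:
    """Toy collection entry with its history (hypothetical fields)."""

    name: str
    added_at: datetime
    removed_at: Optional[datetime] = None  # None: still present


def snapshot(items: list[CollectionItem], at: datetime) -> list[str]:
    """Return the sorted names of items present in the collection at `at`."""
    return sorted(
        item.name
        for item in items
        if item.added_at <= at
        and (item.removed_at is None or at < item.removed_at)
    )


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


items = [
    CollectionItem("hello_2.10-2", utc(2024, 1, 1), utc(2024, 3, 1)),
    CollectionItem("hello_2.10-3", utc(2024, 3, 1)),
    CollectionItem("dash_0.5.12-6", utc(2024, 2, 1)),
]
mid_february = snapshot(items, utc(2024, 2, 15))
```

Here `mid_february` contains `dash_0.5.12-6` and `hello_2.10-2`, because the newer `hello` upload had not yet replaced the older one.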
Expiry
While a published distribution probably wants to keep the full history
of all its package builds, we don’t need to retain all of the output of
all QA tasks that were run. Artifacts can have an expiration delay
or inherit one from their workspace.
Once this delay has expired, artifacts which are not being held in any
collection are eligible to be automatically cleaned up.
The results of QA work done in a workspace that has automatic artifact
expiry, and that are not published to an APT suite, will therefore
expire automatically and safely.
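The expiry rule described in this section can be written as a small predicate. This Python sketch rests on assumptions: the function and parameter names are hypothetical, and it assumes the delay is counted from the artifact's creation time, which the text does not state.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_eligible_for_cleanup(
    created_at: datetime,
    now: datetime,
    artifact_delay: Optional[timedelta],
    workspace_delay: Optional[timedelta],
    held_in_collection: bool,
) -> bool:
    """Decide whether an artifact may be cleaned up automatically."""
    # An artifact's own delay wins; otherwise inherit from the workspace.
    delay = artifact_delay if artifact_delay is not None else workspace_delay
    if delay is None:
        return False  # no expiry configured anywhere
    if held_in_collection:
        return False  # e.g. published in an APT suite
    # Assumption: the delay is measured from the creation time.
    return now >= created_at + delay


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
qa_result = is_eligible_for_cleanup(
    created, now, None, timedelta(days=30), held_in_collection=False
)
```

In this example `qa_result` is true: the artifact inherits a 30-day delay from its workspace, is 60 days old, and is not held in any collection.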
Daily Vacuum
A daily vacuum task handles all of the periodic maintenance for file
stores.
It does some cleanup of working areas, a scan for unreferenced & missing
files, and enforces file store policies.
The policy work could be copying files for backup or moving files
between stores to keep them within size limits (e.g. from a local upload
store into a general cloud store).
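Two of these maintenance steps are easy to sketch in Python: the scan is a pair of set differences, and a size limit can be enforced by picking files to move until the store fits. The function names are hypothetical, and draining the largest files first is an arbitrary choice for this example, not a description of Debusine's policy.

```python
def scan(stored: set[str], referenced: set[str]) -> tuple[set[str], set[str]]:
    """Return (unreferenced, missing) digests for a store.

    Unreferenced files are in the store but no longer used by anything;
    missing files are expected by the database but absent from the store.
    """
    return stored - referenced, referenced - stored


def files_to_drain(sizes: dict[str, int], max_bytes: int) -> list[str]:
    """Pick digests to move elsewhere until the store fits in max_bytes."""
    total = sum(sizes.values())
    drained: list[str] = []
    # Largest first, purely as an illustrative ordering.
    for digest, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if total <= max_bytes:
            break
        drained.append(digest)
        total -= size
    return drained


unreferenced, missing = scan({"a", "b", "c"}, {"b", "c", "d"})
to_move = files_to_drain({"a": 50, "b": 30, "c": 20}, max_bytes=60)
```

In this example file `a` is unreferenced, file `d` is missing, and moving the 50-byte file `a` alone brings the store from 100 bytes down to the 60-byte limit.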
In Conclusion
Debusine provides abstractions for low-level file storage and object
collections.
This allows storage to be scalable beyond a single filesystem and highly
available.
Using content-addressed storage minimizes data duplication within a
Debusine instance.
For Debian distributions, storing the archive metadata entirely in a
database made providing built-in snapshot support easy in Debusine.
