In Merkely, we use Artifact Binary Provenance as the foundation for our audit trails. Artifact Binary Provenance is a fancy term, but the idea behind it is really quite simple. All it means is that we can identify the software we have running in production. Let’s take a closer look 👀
How should we identify software?
There are lots of ways to identify software. In our industry we’ve tried different approaches to version numbers, such as semantic versioning and release names. These are human-centered approaches that involve applying a name to a specific piece of software.
This approach is called version labeling.
The downside of this approach is that it is fallible. Any label can be applied to any software package, so it’s easy to see how mistakes can be made. For example, the version number could be incorrectly bumped, or errors in copying and distributing software could cause a misapplication of identity.
Version labeling also creates a security threat. A malicious actor could label their software in a way that makes a system believe it is running qualified software when it is in fact running compromised software.
For compliance and security reasons we need a more reliable approach.
Content Addressable Storage
In high-security environments we need a tamper-proof identity scheme. In plain terms, if the software changes we want it to have a different identity.
Luckily, this is a solved problem in computer science. The solution is Content Addressable Storage.
How this works is really simple. Instead of using a label to define software identity, you use the cryptographic hash of the software itself.
This means that if a single byte in the software changes it will have a different identity.
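As a minimal sketch of the idea, the snippet below derives an identity from content alone. It assumes SHA-256 as the hash, and the function name `artifact_identity` and the sample bytes are made up for illustration.

```python
import hashlib


def artifact_identity(content: bytes) -> str:
    """Return the SHA-256 hex digest of the artifact's bytes."""
    return hashlib.sha256(content).hexdigest()


# Two illustrative "artifacts" that differ by a single byte (o -> 0).
original = b"print('hello, production')\n"
tampered = b"print('hello, producti0n')\n"

original_id = artifact_identity(original)
tampered_id = artifact_identity(tampered)

print(original_id)
print(tampered_id)
print("same identity:", original_id == tampered_id)
```

The two digests share nothing recognisable even though the inputs differ by one byte. For a real artifact on disk you would feed the file's bytes to the same hash, typically in chunks.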
Can’t I just use the git commit SHA to identify the software?
Git commits define a content-addressable snapshot of the source code (and its history). If you are distributing the source repo as your artifact this could be a valid method of identity.
However, in most cases software is not distributed as source but rather as binaries (typically through compilation, packaging, or Docker images). This translation process is often non-reproducible or nondeterministic, so there is no hard trace from source to binary. In other words, a binary package could be labelled with a source commit that it was not actually built from.
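One common source of this nondeterminism is a timestamp embedded by the packaging step. The sketch below uses gzip as a stand-in for a real build or packaging tool, and the source bytes and timestamps are invented for illustration. It packages identical source twice and compares the resulting digests.

```python
import gzip
import hashlib

# Stands in for the contents of one fixed source commit.
source = b"def main():\n    return 42\n"

# "Build" the same source twice; only the embedded timestamp differs.
build_a = gzip.compress(source, mtime=1_600_000_000)
build_b = gzip.compress(source, mtime=1_600_000_060)

digest_a = hashlib.sha256(build_a).hexdigest()
digest_b = hashlib.sha256(build_b).hexdigest()

print("same source:", gzip.decompress(build_a) == gzip.decompress(build_b))
print("same binary:", digest_a == digest_b)
```

Both packages come from the same source, yet they are different binaries with different digests, so the source commit alone cannot tell you which one is running.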
For this reason we use a Secure Hash Algorithm (SHA) digest of the binary itself, rather than of the source, as its identity.
Storing the provenance
Now that we have a method for identifying software, wouldn’t it be great if we could look this up on demand from our DevOps tools?
A compliance System of Record provides a secure database in which to store claims about an artifact’s identity (we have a strong opinion on what that should be 😇). When we create a binary in our secure CI build process we store the identity information in a journal.
As each binary progresses through the value stream you can record evidence against it such as:
- Source commit
- Build URL
- Test results
- Security analysis
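To make the evidence list above concrete, here is a rough sketch of a journal keyed by the binary's digest. The `Journal` class is a hypothetical in-memory stand-in, not Merkely's API, and the commit, URL, and result values are invented.

```python
import hashlib
from typing import Optional


class Journal:
    """Hypothetical in-memory stand-in for a compliance System of Record."""

    def __init__(self) -> None:
        # Maps SHA-256 digest -> {evidence type: value}.
        self._entries = {}

    def declare(self, binary: bytes) -> str:
        """Register a freshly built binary and return its digest."""
        digest = hashlib.sha256(binary).hexdigest()
        self._entries.setdefault(digest, {})
        return digest

    def record(self, digest: str, evidence_type: str, value: str) -> None:
        """Attach one piece of evidence to an already declared binary."""
        if digest not in self._entries:
            raise KeyError(f"unknown artifact {digest}")
        self._entries[digest][evidence_type] = value

    def lookup(self, digest: str) -> Optional[dict]:
        """Return the evidence for a digest, or None if it was never declared."""
        return self._entries.get(digest)


journal = Journal()
digest = journal.declare(b"example build output")

# All values below are made up for illustration.
journal.record(digest, "source_commit", "0123abc")
journal.record(digest, "build_url", "https://ci.example.com/builds/42")
journal.record(digest, "test_results", "passed")
journal.record(digest, "security_analysis", "no known vulnerabilities")

print(journal.lookup(digest))
```

Because the key is the digest rather than a label, evidence can only ever attach to the exact bytes that were built.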
And the information is as easy to look up as it is to store. Our deployment processes can perform risk controls to ensure deployments are based on known approved binaries and verified processes. This is why we believe Artifact Binary Provenance is the basis for any compliance-based DevOps approach. It makes it impossible to qualify one piece of software and deploy another.
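A deployment-time risk control of this kind could look roughly like the sketch below. The function names and the approved set are assumptions for illustration; in practice the set of approved digests would come from the System of Record.

```python
import hashlib


def sha256_of(binary: bytes) -> str:
    """Return the SHA-256 hex digest used as the binary's identity."""
    return hashlib.sha256(binary).hexdigest()


def may_deploy(candidate: bytes, approved_digests: set) -> bool:
    """Allow deployment only if the candidate's digest was approved."""
    return sha256_of(candidate) in approved_digests


# Illustrative binaries: one went through qualification, one did not.
qualified = b"release build that passed every control"
swapped = b"different build carrying the same version label"

approved_digests = {sha256_of(qualified)}

print(may_deploy(qualified, approved_digests))  # True
print(may_deploy(swapped, approved_digests))    # False
```

The check never looks at a version label, so relabeling a different binary does not get it through the gate.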
What about the humans?
Does this mean SemVer is dead? That you shouldn’t use git SHAs to identify your software? Not at all!
These are very useful ways for humans to navigate identity through version control and CI systems. However, since they are fallible, we still need the primary key of identity to be the content-addressable hash, with the labels linked to it. Labels are for humans and SHAs are for machines.
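One way to picture that link is a lookup from label to digest, where the digest stays the primary key. This is only a sketch: the function names, the version string, and the binaries are hypothetical.

```python
import hashlib

# Human-friendly label -> SHA-256 digest of the binary it was given to.
label_to_digest = {}


def publish(label: str, binary: bytes) -> str:
    """Link a human label to the digest of the binary and return the digest."""
    digest = hashlib.sha256(binary).hexdigest()
    label_to_digest[label] = digest
    return digest


def matches_label(label: str, binary: bytes) -> bool:
    """Check that a binary really is the one the label was linked to."""
    return label_to_digest.get(label) == hashlib.sha256(binary).hexdigest()


release = b"binary built for release 1.4.2"
impostor = b"some other binary claiming to be 1.4.2"

publish("1.4.2", release)

print(matches_label("1.4.2", release))   # True
print(matches_label("1.4.2", impostor))  # False
```

Humans keep asking for "1.4.2", while the machine resolves that label to a digest and verifies the bytes against it.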