Trusting Caches

jakehamilton · May 12, 2024, 6:34am

Prior art exists in an early form as Trustix (nix-community/trustix: distributed trust and reproducibility tracking for binary caches, maintainer @adisbladis).

Is there any interest from @committee_security on this topic? We will have to tackle binary caching at some point, and doing so safely, reliably, and scalably will be important. In particular, Trustix would enable additional caches to be run and safely used by community members (e.g. p2p caching).

srtcd424 · May 12, 2024, 7:54am

It would be great to confirm what the current status of the project is, if nothing else. It seems to have been stalled for a while, but it might still be in a useful state even if the authors don’t consider it feature complete.

wamserma · May 12, 2024, 3:30pm

Unfortunately Trustix went dormant pretty soon after its announcement.

It might indeed be interesting to pick it up in the context of Aux, but more as a long term goal.

From what I understand, Trustix can be useful to collect community-provided builds and create a “community cache”. Aside from that, Auxolotl should have a project-controlled cache, similar to cache.nixos.org. On the way there we should try to make use of the lessons learned from cache.nixos.org:

On the user-facing side we need a low-latency, high-bandwidth distribution mechanism. It should do some caching but be treated as stateless: effectively a CDN, and using a commercial provider like Fastly is probably the most budget-friendly way to get this.
Behind that we need (load-balanced) read-only instances of the actual cache that serve the CDN but can also be contacted directly. Attic is a good candidate. As with the CDN, the focus for these hosts is on bandwidth cost.
Everything behind the cache should reside on a non-public network, e.g. self-hosted ZeroTier. This provides some resistance against DoS attacks and allows shifting nodes around physical hosts without too much need for reconfiguration. On this network we need an Attic instance to write to the cache and some machines providing storage for Attic (e.g. with Garage.io). The cache is fed from a trusted CI pipeline.
The storage backend (the lukewarm cache, as opposed to the hot CDN cache) should focus on integrity, storage cost and avoidance of vendor lock-in (getting the data should be easy and cheap but doesn’t have to be fast - no S3 bucket here).

Overall we should also keep an eye on the popularity of store paths to adjust what is pre-built for the cache and what is left for the users (or a Trustix network) to build.
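
As a rough illustration of the popularity idea, here is a minimal sketch that counts how often each store path is requested and picks the most requested ones as pre-build candidates. The request list, the shortened store path names and the function name `popular_store_paths` are all made up for the example; a real setup would read this from CDN or cache access logs.

```python
from collections import Counter

def popular_store_paths(requested_paths, top_n):
    """Return the top_n most requested store paths with their request counts."""
    return Counter(requested_paths).most_common(top_n)

# Hypothetical request log: one entry per lookup seen by the cache.
# Real store paths carry a full hash prefix; these are shortened placeholders.
requests = [
    "/nix/store/aaa-hello-2.12",
    "/nix/store/bbb-firefox-125.0",
    "/nix/store/aaa-hello-2.12",
    "/nix/store/ccc-rare-tool-0.1",
    "/nix/store/aaa-hello-2.12",
    "/nix/store/bbb-firefox-125.0",
]

# Pre-build the two most requested paths; leave the rest to users.
prebuild = [path for path, _count in popular_store_paths(requests, top_n=2)]
print(prebuild)
```

Anything outside the top N would then be left to users (or a community network) to build.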

Beyond that I wish for more flexibility in substituter handling (in Lix): marking substituters as “not permanently online”, defining public keys along with substituters, and maybe some simple PKI support to decouple signing keys for store paths from identities and to make key rotation for substituters easy.
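
To make the wish list a bit more concrete, here is a minimal sketch of what such a substituter description could look like as a data model. This is not an existing Lix or Nix interface: the `Substituter` class, its fields and both helper functions are hypothetical, and the check only compares key names (assuming signatures of the form `name:base64sig`); it does no cryptographic verification.

```python
from dataclasses import dataclass, field

@dataclass
class Substituter:
    url: str
    # Key name -> public key, defined together with the substituter (hypothetical layout).
    public_keys: dict = field(default_factory=dict)
    # False marks a substituter that is "not permanently online".
    permanently_online: bool = True

def select_substituters(substituters, reachable_urls):
    """Split substituters into usable ones and unreachable ones that were expected to be online."""
    usable, missing = [], []
    for sub in substituters:
        if sub.url in reachable_urls:
            usable.append(sub)
        elif sub.permanently_online:
            missing.append(sub)  # worth a warning; optional ones are skipped silently
    return usable, missing

def signer_is_known(sub, signature):
    """Check that the key name in a 'name:base64sig' signature is defined for this substituter."""
    key_name, _, _ = signature.partition(":")
    return key_name in sub.public_keys

# Placeholder URLs and keys, for illustration only.
official = Substituter("https://cache.example.org", {"cache.example.org-1": "<base64 public key>"})
laptop = Substituter("http://laptop.lan:5000", {"laptop-1": "<base64 public key>"}, permanently_online=False)

usable, missing = select_substituters([official, laptop], reachable_urls={"https://cache.example.org"})
print([s.url for s in usable], [s.url for s in missing])
```

With keys attached to the substituter entry, rotating a key would mean adding a second entry to `public_keys` and later dropping the old one, without touching a separate global key list.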

dfh · May 12, 2024, 5:35pm

I personally have interest and I’m happy to take this question to the 2024-05-16 COMSEC meeting to figure out the group’s interest.

From a security principles perspective I can already say that the way trust is defined is a top-down approach. For the scenario of caching this is the first set of questions that come to my mind (sorted top-to-bottom):

What trust guarantees are we aiming for?
e.g. how much vetting does a binary “require” to be allowed on our cache?
What risks are we aiming to mitigate?
What constraints do we need to work around? e.g. storage, compute, bandwidth
How & where are we building software? e.g. a dedicated build farm owned by Auxolotl vs community-provided build machines with diverse ownership
Caching style: Centralized but redundant vs p2p - some possibilities have been described in Binary Cache thoughts

In top-down design processes, technology decisions are made rather late, because the non-technical goals need to be understood well enough to qualify or disqualify a piece of software.

dfh · May 12, 2024, 10:28pm

OK, my brain didn’t wanna let go of this, so here’s a new proposal for how to look at this issue.

We should consider widening the scope, aka making the problem bigger to construct it from the big picture to the details.

I feel the big picture could be named “Build and Distribution Pipeline” - let me explain.

Now that the new roadmap has made a decision about source-code storage (Forgejo), the broader issue as I see it is that we need to figure out how we get code from Forgejo compiled and distributed to the users of Auxolotl.

In my view (spoiler: I specialize in infrastructure security) this is an engineering issue that tries to connect one source-code repo at the top of a pyramid with “unlimited” (as in more than we might know or can reliably count) consumers at the bottom of the pyramid, who expect binaries in a specific format that easily plugs into their flake configuration.

Between the top and bottom of the pyramid seems to be a question mark and a few conversations (1, 2, 3, 4, 5, 6, 7, 8, 9) have talked about aspects of what that question mark could look like.

I wanna introduce the idea of “constraints” with the hope that it helps us identify solutions. The constraints I have identified so far (please feel free to add more):

1. Compute - for building packages from source
2. Storage - to store binaries
3. Bandwidth - to deliver copies of binaries
4. Trust - how/when is the output of the building process considered “correct enough”
   IMO part (but not limited to) of the original question regarding Trustix
5. Complexity - What services can we currently manage given our headcount

Constraints 1-3 (compute, storage, bandwidth) can IMO be summarized into one category, financials, which allows comparing different solutions in terms of cost.

Those constraints can be applied to the different mandatory pieces of this pipeline:

A. The forge
Constraints Storage and Trust apply
B. The build farm
Constraints Compute and Trust apply
C. The cache
Constraints Bandwidth and possibly Trust apply

The complexity and finance constraints apply to each piece individually as well as to the sum.
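
To show how the financial constraints could be compared per piece and in sum, here is a minimal sketch. The two candidate setups and every number in them are invented placeholders (arbitrary cost units per month), not estimates for any real provider, and `cost_summary` is a hypothetical helper name.

```python
# The three constraints that fold into "financials".
FINANCIAL = {"compute", "storage", "bandwidth"}

def cost_summary(setup):
    """Return (cost per pipeline piece, total cost) for one candidate setup.

    setup maps piece -> {constraint: monthly cost}; non-financial entries are ignored.
    """
    per_piece = {
        piece: sum(cost for name, cost in costs.items() if name in FINANCIAL)
        for piece, costs in setup.items()
    }
    return per_piece, sum(per_piece.values())

# Placeholder numbers only, to show the comparison.
centralized = {
    "forge": {"storage": 5},
    "build farm": {"compute": 40},
    "cache": {"bandwidth": 20},
}
volunteer_builders = {
    "forge": {"storage": 5},
    "build farm": {"compute": 0},  # compute donated by community machines
    "cache": {"bandwidth": 20},
}

for name, setup in [("centralized", centralized), ("volunteer builders", volunteer_builders)]:
    per_piece, total = cost_summary(setup)
    print(name, per_piece, total)
```

Trust and complexity are deliberately left out of the sum, since they do not reduce to a cost figure in the same way.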

The limit for each of the constraints should be defined as explicitly as possible.
E.g. for finances this might take the form of “0”, as in for free.

I expect complexity and trust to be a bit harder to define though, maybe this needs a combination of community feedback and risk assessment.
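
One possible way to make a trust limit explicit is a quorum rule: an output counts as “correct enough” once a minimum number of independent builders report the same output hash. The sketch below only illustrates that rule under this assumption; it is not Trustix's actual interface, and the builder names, hashes and the function `trusted_output` are hypothetical.

```python
from collections import Counter

def trusted_output(reports, quorum):
    """Return the output hash that at least `quorum` builders agree on, else None.

    reports maps builder name -> output hash for a single derivation.
    If several different hashes reach the quorum, nothing is trusted.
    """
    counts = Counter(reports.values())
    winners = [output_hash for output_hash, votes in counts.items() if votes >= quorum]
    return winners[0] if len(winners) == 1 else None

# Hypothetical reports from three independent builders.
reports = {
    "builder-a": "sha256:1111",
    "builder-b": "sha256:1111",
    "builder-c": "sha256:2222",
}

print(trusted_output(reports, quorum=2))  # two builders agree
print(trusted_output(reports, quorum=3))  # no hash has three votes
```

The quorum value is then the knob that community feedback and risk assessment would have to set.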

Should we look at the issue this way, my prediction is that we might wanna start off with a rather simple setup (assuming we don’t need to compile & store a repo of the size of nixpkgs) that is more centralized but easy to maintain.

Future versions of the build + cache system might be built on more distributed approaches (tiered caches, volunteer compile farms + Trustix, p2p/torrent-based cache downloads).

Personally I would put the primary focus on a version 1 that best serves the current/next step of our roadmap.

This proposal definitely has some rough edges and I might change a few more of my thoughts, but I wanted to throw it out there to see if the general perspective resonates?