An attempt at using IPFS as a binary cache

Related to the discussion about a binary cache, I went off for a few days and tried to find out how to use IPFS as a binary cache. It was nearly successful, but ultimately failed due to a bug in IPFS itself.

For posterity, the project is nix-subsubd.

Details

nix-subsubd (nss) is a proxy for nix binary cache substituters.

At the moment, the only target for the binary cache is IPFS, but that can be expanded later.

Status

This idea is currently dead in the water due to a bug in IPFS that prevents retrieving content from IPFS using just the content's SHA256 hash.

How it works

Substituters allow nix to retrieve prebuilt packages
and insert them into the store. The most used one is https://cache.nixos.org, which is a
centralized cache of build artifacts.
Build artifacts are stored as compressed (xz) NAR (Nix ARchive) files.

nss acts as a proxy between nix and an HTTP binary cache. The flow is simple:

  1. nix requests information about the NARs (narinfo) from the binary cache
    1. nss forwards that request to the binary cache
    2. the binary cache responds
    3. nss stores the information in an SQLite db, with the pertinent information
      being the SHA256 hashsum of the compressed NAR file (.nar.xz); an example
      narinfo is shown below
  2. nix requests the .nar.xz
    1. nss tries to retrieve the file from IPFS using that hash (this is the step that currently fails)
    2. if the file isn't on IPFS, nss forwards the request to the binary cache (still to be implemented)
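
For reference, a narinfo is a small key/value text file served by the cache at <store-path-hash>.narinfo. The field names below are the real ones, but every value is a made-up placeholder for illustration; FileHash is the SHA256 of the .nar.xz that nss keys its lookups on.

$ curl https://cache.nixos.org/<store-path-hash>.narinfo
StorePath: /nix/store/<store-path-hash>-hello-2.12.1
URL: nar/<file-hash>.nar.xz
Compression: xz
FileHash: sha256:<file-hash>
FileSize: <file-size-in-bytes>
NarHash: sha256:<nar-hash>
NarSize: <nar-size-in-bytes>
References: <store-path-hash>-hello-2.12.1 <other-hash>-glibc-2.38
Sig: cache.nixos.org-1:<signature>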

If anybody knows Go and would like to debug IPFS to fix the bug, or can point me to an alternative that:

  • is distributed
  • is self-hosted
  • is preferably non-crypto, but that's not a hard requirement
  • MOST IMPORTANTLY, is content addressable with SHA256 (it's what cache.nixos.org uses for their .nar.xz)

please do comment :slight_smile:

12 Likes

What do download speeds feel like with IPFS? I have pretty bad internet down under. For many small downloads, can it really meet speeds like a torrent might?

I got max 2MiB/s on an 11MiB/s line while getting the Project Apollo Archive and the XKCD archives. Not fantastic.

The WebUI unfortunately doesn't indicate what is being downloaded, from whom, nor how fast, the way torrent clients do.

But, unless I'm mistaken, torrents aren't a viable tech for this proxy, as I don't think it's possible to take the SHA256 hashsum of a file and find a corresponding torrent. I don't know whether it's possible to find a corresponding host using the DHT though :thinking: I'm not sure what it stores. It's either a file-hash-to-host mapping or a file-chunk-to-host mapping; the latter is more likely, since from what I remember the torrent stores the hashes of the chunks, but I might be wrong.

I'd be very glad to be wrong, actually, because that could mean somebody could write a program that queries the DHT for hosts and uses the bittorrent swarm to download said file, thus repurposing torrents as a content addressable filesystem.

Curse you @marshmallow, I was supposed to be working on another project, but now this has piqued my interest :joy:

1 Like

Lol! I was just interested, not saying we should use torrents for a binary cache :3

Who knows! It just might work.

I just checked and the DHT is an infohash to peer mapping (peer = a client/server that implements the bittorrent protocol). Just out of curiosity, the question is now in the libtorrent forum. Somebody might know how…
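
In other words, the DHT key is the infohash, which is the SHA-1 of the torrent's bencoded info dict rather than a hash of the payload bytes, so a .nar.xz's SHA256 can't be used to look up a swarm directly. If you have a .torrent lying around you can see this for yourself (the filename here is hypothetical, and this assumes transmission-cli is installed):

$ transmission-show stdenv.nar.xz.torrent | grep -i hash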

Hah, I was trying hard not to get distracted by 'proper' p2p stuff yet, but thanks very much for looking at this :slight_smile:

As someone pointed out on matrix the other night, we don't necessarily need to aim for fully decentralized yet; a centralized tracker and/or KV store mapping nix-friendly hashes to whatever the underlying tech uses might not be the end of the world. You might even be able to do something mad with a shared git repo for the mappings, at least initially…

Slow download speeds are a reasonable concern, but for devs and 'power users' a local hydra instance or similar could probably be abused to effectively prefetch things one might want to use…?

4 Likes

I would like to report that the download speed is now a whopping 8MiB/s! I guess it needed some time to ramp up and find peers.

2 Likes

:thinking: that does seem like a good intermediate step. At the moment, I have to focus on another project with friends that I promised I'd do. So if somebody's comfortable with rust and would like to create a merge request (or fork the project, it's GPLv3) to map SHA256 of .nar.xz files to content addresses or torrents, it'd be very much appreciated!

That's very similar to what I was thinking. The goal was for users to run a service on the side that would populate IPFS (or whatever other service there is), similar to this:

{ config, ... }:
{
  services.p2pcache = {
    enable = true;
    # stuff from the nix store one would like to "donate" to the network
    # naming things is hard
    donations = config.environment.systemPackages;
  };
}

Then, once the service starts, it does whatever's necessary to add the content to the P2P service, e.g. ipfs add everything in donations.
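
A rough sketch of what that sidecar service might actually run, assuming for illustration that we donate the closure of the current system profile (the path and flags are assumptions, not a worked-out design):

# walk the closure and add every store path to IPFS
for p in $(nix-store --query --requisites /run/current-system); do
  ipfs add --recursive --quieter "$p"
done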

3 Likes

That is exactly why I'm worried: the nix store is many small files, and the transfers may never get the time to ramp up.

4 Likes

I had a poke around with bittorrent today, and successfully created a manky shell script to generate torrent files for all the nars in a given closure from my s3 storage. Unfortunately I failed at the next hurdle: I couldn't get transmission to reliably create a simple swarm between two containers on the same host!

Will pick up another day and try a different combination of tracker and torrent client. Can't believe this was the step I got stuck at!
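
For anyone following along, the torrent-generation part presumably looks something like the sketch below (the tracker URL, the local cache layout, and the use of transmission-create are all assumptions):

# one .torrent per .nar.xz in a closure, given a local mirror of the binary
# cache (narinfos plus nar/ directory) synced down from s3
for p in $(nix-store --query --requisites /nix/store/9wnvhjyxjykwn5y06xc9a2h8rs5fbfia-stdenv-linux); do
  h=$(basename "$p" | cut -d- -f1)                 # store path hash
  nar=$(sed -n 's/^URL: //p' "cache/$h.narinfo")   # e.g. nar/<file-hash>.nar.xz
  transmission-create -o "torrents/$(basename "$nar").torrent" \
    -t udp://tracker.example.org:6969/announce "cache/$nar"
done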

4 Likes

So I think it worked:

# git clone https://github.com/auxolotl/core.git
# cd core
# nix-shell -I nixpkgs=. -p '(import ./default.nix {}).pkgs.stdenv'
these 48 paths will be fetched (73.10 MiB download, 342.45 MiB unpacked):
[..]
[nix-shell:~/core]#  exit
# nix-store --gc
[..]
# nix --extra-experimental-features nix-command copy --from file:///root/Downloads/nix /nix/store/9wnvhjyxjykwn5y06xc9a2h8rs5fbfia-stdenv-linux
# nix-shell -I nixpkgs=. -p '(import ./default.nix {}).pkgs.stdenv'
these 3 paths will be fetched (1.69 MiB download, 10.87 MiB unpacked):
[..]

(/root/Downloads was where qbittorrent happened to dump the NARs etc.)

3 Likes

Mad, late night, drug addled thought: if we're contemplating storing store path → nar file mappings in git, could we use radicle? It doesn't seem to cope with large repos yet, but that shouldn't be a problem, at least initially.

This would give us more decentralisation, and more interestingly radicle seems to support a quorum system for accepting changes: Radicle Protocol Guide

This way updates could be checked by multiple builders and only accepted once a certain number agree?

4 Likes

I like the idea of using radicle. My only concern right now is packaging it. It's flake-only from what I can tell, and for non-flakers like me it wasn't possible to add it to my config (probably a lack of knowledge on my part).

The quorum system does look very interesting for reproducible builds. Nice find.

In order to use torrents, these options come to mind:

Direct streaming using substituter proxy

  • nix calls the proxy
  • the proxy starts downloading the torrent (via libtorrent) sequentially
  • the proxy streams the torrent's content back as the response

The advantage here is that nothing is stored on disk by the proxy and memory consumption stays low. The downside is that libtorrent doesn't have up-to-date rust bindings (it does have python, java, nodejs, and golang bindings though). It could be worked around if a torrent client exists that can output a torrent's content to stdout.
Additionally, I'm not sure how fast sequential downloads are :thinking: Maybe chunks from different nodes are buffered in memory and served sequentially?

Download then stream using proxy

  • nix calls the proxy
  • the proxy downloads the torrent to the filesystem (can be tmpfs or elsewhere)
  • the proxy streams the downloaded file

The advantage is probably simplicity? Call an external binary to download the torrent, wait until the file exists, serve it, and maybe clean up afterwards.
The disadvantage would be (to me) an external dependency and, depending on how it's done, waiting for the download to complete before serving it. The total time to serve a request would thus be download_time + serve_time.
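
A minimal sketch of this second option, assuming aria2 as the external binary and a hypothetical $MAGNET obtained from some SHA256-to-torrent mapping (which is the open problem):

# download the .nar.xz into a temp dir, then hand it back; the proxy would
# stream the resulting file as its HTTP response to nix
aria2c --seed-time=0 --dir=/tmp/p2pcache "$MAGNET"
cat "/tmp/p2pcache/$FILEHASH.nar.xz"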

Pre-eval config

By evaluating nix ahead of time and collecting the store paths, one could download all the content, then use @srtcd424's findings to copy them to the store. Once it's time for nix to "build", it should detect what is already in the store and go on its merry way to generate the rest of the system configuration (or whatever it is one evaluated).

I don't know how to get nix to spit out all the packages it will have to download to the store. Maybe there's some kind of dry-run mode? But if that's solved, maybe this is the simplest option for somebody who can figure out the first step of nix eval :thinking:
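
On the dry-run question: nix-build does have a --dry-run flag that prints the "will be fetched" list without building anything, which might be enough here (untested against the aux core tree, so treat this as a guess):

$ nix-build -I nixpkgs=. -E '(import ./default.nix {}).pkgs.stdenv' --dry-run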

Thanks for looking into this @srtcd424! Gives me hope that there won't be any need to fumble about with nix's source code and try to get a PR merged into the repo or maintain a patched version thereof. It's all mostly (completely?) out of band.

2 Likes

This is next on my list to look at - a simple "subscription" shell script / daemon / cron job that takes a list of closures we're always interested in, e.g. stdenv, minimal, <your favourite language toolchain here>, etc., and grabs them in the middle of the night. That should be straightforward but also enough to be useful. And then that proof of concept might inspire people with more brains than me to make something more realtime work.

1 Like
$ nix --extra-experimental-features nix-command derivation show -f . stdenv | jq -r 'first(.[]).outputs.out.path'
/nix/store/9wnvhjyxjykwn5y06xc9a2h8rs5fbfia-stdenv-linux

seems to be the magic to get a store path for a closure you can pass to nix copy, from an aux core checkout in the current directory. Progress!

Edit: though of course we need the whole closure, don't we? Hmm…

Which is possibly a little non-trivial. If the whole closure we want is already available, we can chase through the narinfo References: lines; if it's not, I guess we would have to settle for the build inputs, which are obviously going to be a superset of what we need. Suboptimal, but probably good enough for the moment.

Anyway, I have some disgusting python to do the narinfo chasing. Next step is to wire it up to aria2 or something…
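
(In case it helps anyone reproduce it, the References: chasing can also be done with a small recursive shell function; this is just a sketch with an assumed cache URL, output directory, and starting hash, not the python mentioned above.)

# recursively fetch narinfos, following their References: lines
CACHE=https://cache.nixos.org
mkdir -p narinfo
chase() {
  [ -e "narinfo/$1.narinfo" ] && return
  curl -fsS "$CACHE/$1.narinfo" -o "narinfo/$1.narinfo" || return
  for ref in $(sed -n 's/^References: //p' "narinfo/$1.narinfo"); do
    chase "${ref%%-*}"    # keep only the store path hash before the first dash
  done
}
chase 9wnvhjyxjykwn5y06xc9a2h8rs5fbfia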

Edit: (as I'm not allowed to keep replying to myself, >sigh<)
I think I've got a Minimum Viable Lash-up: GitHub - srd424/nibbit: Preload nix stores from bittorrent

Needs lots of work and other bits and pieces, and there are probably other priorities, but it shows something really basic and simple (albeit not 'on demand') can be done.

1 Like

I've been following along with the cache discussions, and one aspect I feel we need to answer for any p2p-based cache is: how can we guarantee data availability?

E.g. when using IPFS, a node providing a hash can just silently drop offline, possibly making that hash unavailable if it hasn't also been stored by another node, which IIRC only happens if the hash was requested via a second node. From my understanding that's why ipfs-cluster exists.
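
(For concreteness: pinning is what keeps a block on a node regardless of demand, and ipfs-cluster replicates pins across a set of cooperating nodes. The CID below is just a placeholder.)

ipfs pin add <cid-of-the-nar-xz>
ipfs-cluster-ctl pin add <cid-of-the-nar-xz>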

2 Likes

Yeah, that would be a big issue if something like IPFS was being used as the primary binary cache solution. I think I've stumbled across paid IPFS "pinning" solutions, and presumably one could also self-host something similar.

Using something P2P to take some of the bandwidth load off a conventional binary cache might still be useful though.

I think I've said before that my current focus is on helping small groups of trusted devs, and maybe some early-adopter type hobbyist users, in the early stages of a project. In that case, a cache miss isn't a disaster, as those sorts of users are likely to be more comfortable building from source.

1 Like

I love IPFS, so I'm glad to see this topic, and glad that someone is spearheading this early on.

I haven't read everything here; I just want to briefly say: I'd like IPFS as a *fallback*. Central servers can be really fast.

So long as the IPFS hash is stored somewhere for derivations, we can do a lot with external tools. Fetching the normal cache URL and the IPFS URL, then just using whichever resolves first, would let the system be fast and reliable.
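
A crude illustration of that race, with hypothetical $FILEHASH and $CID values:

# fetch from the central cache and from IPFS in parallel, keep whichever wins
curl -fsSo http.nar.xz "https://cache.nixos.org/nar/$FILEHASH.nar.xz" &
ipfs get -o ipfs.nar.xz "/ipfs/$CID" &
wait -n                  # bash: returns as soon as the first job finishes
kill %1 %2 2>/dev/null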

2 Likes