SIG Repos: How should they work?

srxl · May 4, 2024, 11:49am

Yesterday, I threw together a working POC for the SIG repo structure discussed in this thread. It works, but we’ve also recently seen this PR come through, proposing a different structure with a different workflow (see this message in particular). I think there are some good arguments made here, and this sort of structure is worth considering.

I think we need to take some time to properly hash out how we’ll be structuring SIG repos, and how they’ll be consumed. We’ve had some ideas, and some POCs, but I’m not sure we can get much further without committing to a structure that we all agree on. In particular, I think we need answers to:

What belongs in Core, and what belongs in SIG repos?
How can SIG repos depend on each other?
Should SIG repos be standalone, or will they need a top-level repo to tie them all together?
How do we manage keeping SIG repos up to date with each other?

I’d like to get responses from as much of the community as possible on this one, since the decisions made here are going to make up the bulk of what contributing to and consuming Auxolotl repos will look like. Particularly, I’m interested in input from the Steering Committee (@jakehamilton @isabel), as I think this sort of topic falls in their wheelhouse (correct me if I’m wrong).

liketechnik · May 4, 2024, 12:12pm

I fully agree with the 4 questions you raise, but I think there’s an underlying question to discuss too: What do we want to achieve with the SIG repo structure?

I think this has two perspectives, the user and the maintenance perspective; for which I interpret the motivations as the following (in no particular order):

user: group language specific packages (i.e., packages that are relevant for a particular language only) into different repositories, so that users have the choice to only pull in the parts of aux packages they need
maintainer: group the whole language support into different repositories for organizational purposes, so that each SIG is able to maintain their part of aux packages as they see fit (e.g.: changing the format of package organisation)

isabel · May 4, 2024, 12:43pm

Most of my work up until now has been on maintaining the soft forks rather than actually focusing on the future of aux, but here is as good a place as any to get started.

My first thought is that @getchoo raised some extremely interesting points. So, to get right into the first point:

circular dependencies for little reason

top-level depending on this repo and vice versa doesn’t make much sense and adds unnecessary complication

anything we want to re-use across repos should be in core – as it’s described

This is something I can agree on, and it is important because of how many packages depend on others. Take openssl for example: that’s needed by at least a few hundred. Furthermore, we seem to have added top-level as an additional level of abstraction that was never really needed; core could handle that for us.

this repo should be able to be used individually

And I agree with this even more. If a user needs only JavaScript packages, they should be able to pull down only the JavaScript packages.

in contrast, with the setup here:

PR package b to core, wait for merge
PR to python with package a, along with the lockfile update to core (same point as above. lockfiles will need to always be updated somewhere given this isn’t a monorepo)
this is simpler to understand from a contributor perspective, much less work on maintainers, and leaves less room for error between each repo’s lockfile.

This is also very important. I feel a big part of aux was to make life easier for contributors, and this makes sense to me. Removing top-level and having fewer moving parts will inevitably help reduce the burden on maintainers.

So now to address your questions.

Core, in my opinion, should house stdenv, lib, and packages that are not easily separated into categories. What do I mean by “not easily separated into categories”? I feel a good example is vikunja, currently broken into two packages: vikunja, the backend, and vikunja-fe, the front end. These are two programs that go hand in hand and don’t use any specific builder, e.g. buildGoModule. Another one might be openssl; as I mentioned before, several packages depend on that.

I think they should be standalone. Building on my prior answer, we only need to have each repo “import” the core repo such that they have access to shared resources such as the aforementioned openssl, stdenv and lib.

My best idea here is having CI update them all at regular intervals, with the updates merged into master automatically. In my mind these workflows would trickle down from core to the lower repos, e.g. javascript, and would mean that each repo depends on the same version of core. The biggest concern here would be: what if one were to fail? My answer is that this is why we are going to need a staging workflow. I plan to create a post on this later today, hopefully, to address some of the pain points we noticed in Our first unstable release. Hopefully, this means that by the time we ship our unstable releases and maybe more we have matching hashes on each of our repos.
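
To make the trickle-down idea concrete, here is a minimal Python sketch of how a CI job might order those updates, assuming every SIG repo imports only core. The repo names and the `update_waves` helper are made up for illustration; nothing like this exists in Aux yet.

```python
from graphlib import TopologicalSorter

# Hypothetical repo graph: each repo maps to the set of repos it imports.
REPO_DEPS = {
    "core": set(),
    "javascript": {"core"},
    "python": {"core"},
    "go": {"core"},
}


def update_waves(deps):
    """Group repos into waves; a repo only depends on repos in earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(update_waves(REPO_DEPS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this hypothetical graph, core lands in the first wave and the three language repos in the second, so they would all pin the same core revision.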

srxl · May 4, 2024, 1:21pm

I’m a little on the fence with this one. It definitely makes sense from a maintainer perspective, I’m on board there - but from an end user one I’m not entirely convinced. As far as the user is concerned, this is a pretty big departure from how Nixpkgs works.

Previously, users had to set up one channel, or one flake input with Nixpkgs, and everything would be available - the full set of packages, NixOS modules, builders. What we’re suggesting now is that users need to pick and choose what they want to include in their projects. While this reduces the size of repos we ship to the user, it also means the user has to be consciously aware of what repos they need to consume. This means users now need to ask themselves “Where is the thing I want?” - because the answer isn’t “Nixpkgs” anymore, it varies depending on what they want.

I’ll use something like a NixOS module for GDM as an example. Where might a user find this? Could it be in the GNOME repo? Is it in Core? Maybe it’s in some Xorg or generic Desktop repo (granted, SIGs that don’t exist yet, but could reasonably later on down the line IMO)? This is an entirely new class of question users will need to be prepared to answer, and especially for new users, I think this approach raises some non-negligible friction. Good documentation could help to alleviate this by making it very clear where things are, and making it easy to find things, but it is still extra overhead that does not exist in Nixpkgs.

I think there may actually be a bigger problem here - circular dependencies between SIG repos don’t seem out of the realm of possibility (for example, maybe Javascript and Python both introduce a dependency on each other - Node brings in Python for node-gyp, Python brings in node for its Jupyter Notebook tooling, maybe? Unsure if that’s how it plays out in practice). If both depend on Core (which they almost certainly will), who gets updated first? Python would update core, then Javascript would update Python, and Python would update Javascript, and… The flow doesn’t really terminate in this case. top-level solved this by handling dependency resolution for everyone - Core updates, top-level gets updated, and passes the new core to everyone.
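
A toy Python model of that update flow, under the simplifying assumption that every lockfile bump is itself a new commit that downstream repos then need to pin. The `settle` function and the repo graphs are hypothetical and only meant to show the shape of the problem.

```python
def settle(deps, bumped, max_rounds=10):
    """Bump lockfiles until every pin matches the pinned repo's latest revision.

    Returns the number of update rounds needed, or None if the repos are
    still chasing each other when max_rounds is used up.
    """
    rev = {repo: 0 for repo in deps}
    pins = {repo: {dep: 0 for dep in ds} for repo, ds in deps.items()}
    rev[bumped] += 1  # a new commit lands in `bumped`
    for round_no in range(max_rounds):
        stale = [r for r, ds in deps.items() if any(pins[r][d] != rev[d] for d in ds)]
        if not stale:
            return round_no
        for repo in stale:
            pins[repo] = {dep: rev[dep] for dep in deps[repo]}
            rev[repo] += 1  # updating a lockfile is itself a new commit
    return None


# Hypothetical graphs: one where SIG repos only import core, one with a loop.
ACYCLIC = {"core": [], "python": ["core"], "javascript": ["core"]}
CYCLIC = {
    "core": [],
    "python": ["core", "javascript"],
    "javascript": ["core", "python"],
}

if __name__ == "__main__":
    print(settle(ACYCLIC, "core"))
    print(settle(CYCLIC, "core"))
```

In the acyclic graph one round of bumps is enough. In the cyclic one, Python and Javascript keep invalidating each other's pins and the function gives up and returns `None`.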

isabel · May 4, 2024, 2:37pm

Not entirely, since they could still import core to get all the packages we provide. Or at least that’s how I interpret it.

Seems like a GNOME repo thing since GDM is owned by GNOME.

This is true but for the large part it should be clear where each and every package is by “default” in my opinion.

This raises an excellent point that I thought of myself too after creating my original post. But the only answer I can think of leads us right back to the start of having a monorepo. I figure this is solved by top-level since they would all be evaluated there. But I feel this is still an imperfect solution. I would be happy to listen to other people’s thoughts on this.

VlinkZ · May 4, 2024, 3:38pm

I also commented this on the linked pr, but I think an important consideration is the following:

Python package A is in SIG Python repo
Node package B is in SIG Javascript repo
Someone decides to package an application C that depends on A and B.

What should be done? Where should the new package be stored? If we don’t allow any cross-dependence over SIG repos, then all three packages would need to move to another repo, be that Core or Top-level.
Edit: Or we could move package C to top-level, but that would essentially make top-level a monorepo with a few packages without cross-dependence being spread across other repos

imadnyc · May 5, 2024, 12:25am

Can I ask what the benefit of end users being able to pull in different SIG repos is? I feel like the only argument is one of space, since you don’t need to clone the entire monorepo, but seeing as how Nix is already super space hungry, saving a few mb can’t matter that much, right? You have the same level of granular control as getting the different repos, no? I was envisioning the SIG repos as just a tool for maintainers to more rapidly iterate.

I’m not sure about circular imports and how to handle that, but as for @VlinkZ’s comment about C depending on A and B, I think we should make it so the creator of C decides which repo, A’s or B’s, they put it in. And if they, say, put it in A’s repo, it’s A’s responsibility to check for bugs and determine if it needs to be pushed to B. The issue I see with this though is that things could get stale being pushed from one group to another.

isabel · May 5, 2024, 12:29am

The biggest reason I can think of is eval time. Pulling only Go for example will certainly speed up eval time.
Another reason is “how much you need” there is not much point tracking a whole bunch of packages you really don’t need to track.
Ideally, it reduces the maintainer burden: each set of packages is in its own location and, in theory, easier to find.

imadnyc · May 5, 2024, 12:38am

My knee-jerk reaction is to say that eval time can’t be that much better (like enough to matter over the course of a day), but I also have pretty decent hardware. I have an old t430 to test this on older hardware, but I don’t know how to go about that given that the repos don’t exist yet.

I agree with the how much you need part, but in my experience it’s never been much of an issue because I just pull what I need, though I’m pretty sure everyone else will have different experiences.

As for maintainer burden, isn’t it the same? I assume we’re going to set up a search for packages like nix does eventually.

isabel · May 5, 2024, 12:41am

From what I have been told the eval time for legacyPackages is horrible.

I think it’s less, since packages are grouped by clear distinctions. And yes, I most certainly want a search.nixos.org replica.

imadnyc · May 5, 2024, 12:44am

Ah, in that case that makes sense.

I promise I’m not trying to sea-lion, but genuinely don’t understand. I don’t get how grouping the packages helps maintainers, isn’t it net neutral? Like the user has to find it regardless and not interact with the maintainers at all?

isabel · May 5, 2024, 12:57am

Well in theory we can manage permissions around certain repos and files better.

I don’t want to get too technical here, but here we go. Grouping things leads to “doors” in your mind, and each time you use a “door” the stronger it gets. So each time a maintainer creates or updates a package, they will have built up the strength of that given “door” and will work there. In this case the “doors” are analogous to repos and subsections of each repo. So maintainers will get faster and faster each time they create a PR, and at a better rate than with the way nixpkgs does it. But, you might be thinking, doesn’t nixpkgs have the “by-name” structure? The answer is that those “doors” are weaker because they are based on looser groupings, partly because not all languages have the same alphabet. And the “by-name” structure does not lean into the logical groupings that work better with this train of thought, because their correlation is low, particularly by comparison to the SIG structure.

Well ideally users will have almost 0 interaction. More experienced users will hopefully know exactly what repos they will need. Otherwise they can all just use the top-level repo.

getchoo · May 5, 2024, 3:20am

they don’t, assuming any decent documentation. most of the time, a regular user making a home manager or nixos configuration will just want to pull in top-level similar to nixpkgs now

this is more aimed towards developers and power users. a good example here would be flakes not meant for end users, but to be consumed by other flakes (i.e., libraries such as pre-commit-hooks, flake-parts, snowfall, etc.). only pulling in core for example would help guarantee any api stability with aux/nixpkgs at a much lower cost compared to the current method of pulling in all of nixpkgs. likewise, individual packages from flakes could be used and binary caches could be taken advantage of without worrying about yet another 40mb+ nixpkgs tarball being downloaded just for it

space is a big one; i would say this is an understatement, though. nixpkgs itself has gone from 30mb to 40mb within the last year or so, and seems to only be getting exponentially larger. this in turn forces users to be very careful about how many versions of nixpkgs end up in their flake, as even just two extra versions cause nixpkgs to take up 120mb+ alone. this is only going to get worse with time
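
The rough arithmetic behind that 120mb figure, as a tiny Python sketch. It assumes every copy weighs about the 40mb quoted above; the function name is made up and real tarball sizes vary between revisions.

```python
def nixpkgs_footprint_mb(extra_versions, tarball_mb=40):
    """Total size of the nixpkgs copies in one flake: the main one plus extras."""
    return (1 + extra_versions) * tarball_mb


if __name__ == "__main__":
    # One main nixpkgs input plus two extra versions pulled in by other flakes.
    print(nixpkgs_footprint_mb(2))
```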

also as isabel said, evaluation time is a concern. i experience issues with evaluating nearly anything on a fresh nixpkgs clone on the only macbook i have access to test darwin packages. smaller repos remedy this problem greatly. this is also not to mention full top level evaluations for tools like nixpkgs-review that can easily get killed by an oom daemon

srxl · May 5, 2024, 3:38am

Assuming we don’t drop top-level, then yeah, I think this makes sense to me. Most users would depend on top-level and just have everything ready to go for them, and anyone who really wants the modularity can bring in individual modules if they want. That sounds like an ideal situation to me.

Nixpkgs is a very high traffic repo. Lots of commits, lots of issues, lots of pull requests. The latter two especially are things that require triaging to ensure they get sent to the right team. And when your repo is as high traffic as Nixpkgs is, you’re inevitably going to miss some of it. This leads to issues and PRs getting lost amongst what is basically a vast amount of noise to maintainers. Most issues and PRs are going to be ones that individual maintainers don’t care about - they’re mostly concerned with issues that affect the parts they maintain. Having the separate repos allows maintainers to break away from all that noise, and manage their own issue report log, their own PR queue, and their own committing/development process without having to worry about tripping over everything else.

VlinkZ · May 5, 2024, 3:42am

After reading over @getchoo’s proposal, I’m pretty convinced that a tiered approach would work better than a circular dependency mess.

@getchoo’s diagram for reference
What do people here think of this?

My next question is, how do we want to organize packages within each SIG repo? So far I’ve been testing out something similar to by-name such that packages are imported automatically based on directory placement, but also allowing for multiple package implementations. Any preferences? Prefer the by-name approach, grouped by category similarly to most of nixpkgs, or something else?

srxl · May 5, 2024, 3:58am

I think this is up to SIGs to answer for themselves - as long as flake outputs are standardized, I don’t see too much of an issue with SIGs deciding what the best layout for their requirements is.

My concern is that I don’t think circular dependencies are avoidable. There’s probably going to be quite a fair few repos in this setup, and I’m not convinced we can expect them all to fall out in a strictly acyclical dependency graph. I think this does represent an ideal structure though, and it’s something I’d be happy to base things off of, but I’d like to have a plan in place for what we do if/when we introduce a circle.
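
One possible starting point for such a plan is a CI check that at least reports a circle as soon as one is introduced. A minimal Python sketch, assuming the repo-to-repo inputs can be dumped into a plain mapping; `find_circle` and both graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_circle(deps):
    """Return one dependency circle as a list of repos, or None if acyclic."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as err:
        # graphlib stores the detected cycle as the second exception argument;
        # its first and last entries are the same node.
        return err.args[1]
    return None


# Hypothetical graphs: a strictly tiered layout, and one with a loop in it.
TIERED = {"core": [], "python": ["core"], "javascript": ["core"]}
LOOPED = {
    "core": [],
    "python": ["core", "javascript"],
    "javascript": ["core", "python"],
}

if __name__ == "__main__":
    print(find_circle(TIERED))
    print(find_circle(LOOPED))
```

This only detects the circle; deciding what to do about it (merge the repos, move a package down to core, and so on) is still the open question.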

imadnyc · May 5, 2024, 4:01am

Yeah, sorry, I phrased it wrong, I meant how does importing the split repos help maintainers. I’m all for the split repos.

imadnyc · May 5, 2024, 4:29am

One thing that could happen is we could have overlays within each SIG and we have automation tools to backport it through to other SIGs.

Here are the two sigs, and this is the circular import. Correct me if I’m wrong, but only the last two elements of a circular chain can be outdated, since if App D in the JS camp relied on C which relied on B, which you have access over, then you would have to update B yourself anyway.

Python SIG has C but B is a blocker since it’s outdated, so it asks the CI bot to create an overlay. The overlay contains B’s derivation, but also metadata about where it came from. Python SIG updates it and pushes C no problem.

Once JS’s maintainers are ready, it notifies the CI bot, which consumes the overlay, pulling into JS, and edits C’s dependencies to link back to JS’s SIG.
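
A rough Python sketch of the bookkeeping this would need, just to pin down what “metadata about where it came from” could mean. Everything here (`OverlayEntry`, `create_overlay`, `consume_overlay`, the version strings) is hypothetical, and a real overlay would carry a derivation rather than a version string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlayEntry:
    package: str      # the package being carried, e.g. "B"
    version: str      # the updated version the overlay provides
    origin_repo: str  # the SIG repo that owns the package
    held_by: str      # the SIG repo temporarily carrying the update


def create_overlay(repos, package, new_version, origin_repo, held_by):
    """Carry an updated copy of `package` in `held_by`, leaving the origin as is."""
    repos[held_by][package] = new_version
    return OverlayEntry(package, new_version, origin_repo, held_by)


def consume_overlay(repos, entry):
    """Move the overlaid version into the origin repo and drop the temporary copy."""
    repos[entry.origin_repo][entry.package] = entry.version
    del repos[entry.held_by][entry.package]


if __name__ == "__main__":
    repos = {"javascript": {"B": "1.0"}, "python": {"C": "0.1"}}
    entry = create_overlay(repos, "B", "2.0", origin_repo="javascript", held_by="python")
    print(repos)  # python now carries its own updated B
    consume_overlay(repos, entry)
    print(repos)  # B lives only in javascript again, at the new version
```

The `origin_repo` field is what would let the bot know where to send the update back to, and what C’s dependency should point at afterwards.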

A couple issues I’m seeing already

How are JS’s maintainers going to know that it’s ready and not double the work?
- We could make it a simple tag or something that newcomers could do? Ostensibly it’s already checked by the python SIG’s maintainers, so there shouldn’t be too much checking to do afterwards.
What about chains of multiple sigs?
- We could have the bot propagate overlays through multiple? I feel like that could get messy quick though.

srxl · May 5, 2024, 4:53am

Correct me if I’m wrong, but this solution is intended to resolve circular deps at the derivation level, right? I think this is too low a level for this structure - loops in derivation inputs necessitate a loop at the flake input level, and I think if the loop can be solved there, then we don’t really have issues at the derivation level. Overlays stop being a concern since each derivation only needs to worry about grabbing derivations from the flake input.
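
To illustrate the two levels, here is a small Python sketch that collapses package-level dependencies into flake-input-level edges between repos. The package placement reuses the node-gyp/Jupyter guess from earlier in the thread and is purely illustrative; `repo_edges` is a made-up helper.

```python
def repo_edges(package_repo, package_deps):
    """Collapse package-level dependencies into (from_repo, to_repo) edges."""
    edges = set()
    for package, deps in package_deps.items():
        for dep in deps:
            src, dst = package_repo[package], package_repo[dep]
            if src != dst:  # dependencies inside one repo need no flake input
                edges.add((src, dst))
    return edges


# Hypothetical placement of packages in SIG repos.
PACKAGE_REPO = {
    "node-gyp": "javascript",
    "nodejs": "javascript",
    "python3": "python",
    "jupyter": "python",
}

# No package depends on itself, directly or indirectly...
PACKAGE_DEPS = {
    "node-gyp": ["python3"],
    "jupyter": ["nodejs"],
}

if __name__ == "__main__":
    # ...yet the repos end up needing each other as flake inputs.
    print(sorted(repo_edges(PACKAGE_REPO, PACKAGE_DEPS)))
```

In this example there is no loop between derivations at all, but the two repos still import each other, which is why the loop has to be dealt with at the flake input level.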

imadnyc · May 5, 2024, 5:14am

One really slow solution (if I’m understanding correctly) is to expose every package as a flake, and each input is itself a flake from a package from a different repo. This way, the top level of each SIG stays clear, and the loops are now all package (derivation) level. So you could still import the top level of a SIG which would pull packages from other sigs, but not have to pull everything and essentially create a monorepo. This would probably KILL eval time and storage though.