Bugless #76
k1-crash
Description
We would like to get off of Mr. k0's wild ride.
As a way to do this, we are investigating spinning up a new Kubernetes cluster, k1.
k1 would start its life as a 'crash cluster': one that might (and probably should) run some slice of production workloads, but would otherwise serve as a testing ground for cluster rollouts. As k1 matures, it would shed its 'crash' meaning and take on all the k0 jobs; meanwhile, old capacity might be used to spin up a k2-crash.
We have at least three spare nodes (in the dcr03s19 chassis) that could hold this cluster. But we'll probably start out with it either on one node directly, or on one node acting as a hypervisor, while we get things started (so as not to waste resources).
Here's a shortlist of design choices we should consider for k1(-crash):
- Run Ceph directly on metal. This would finally resolve issue b.hswaw.net/6. This could be done with mons running on arbitrary NixOS nodes (either k0 nodes or k1 nodes or ???) and OSDs next to dcr01s22/24.
- Reconsider metallb/calico. They're good individually, but together they're a mess.
- Figure out if we can also build some way to run a simulacrum of a cluster locally on a developer machine for testing.