Bugless #28

k0: fix rook getting stuck in discover due to python2.7/ulimit badness

Added by q3k about 3 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: hscloud

Description

rook-discover pods in rook-ceph-systems run `ceph-volume inventory`, which gets stuck in subprocess.Popen because of a very high ulimit in the container:

[root@rook-discover-9b8r6 /]# ulimit -n
1073741816

This looks similar to https://github.com/coreos/fedora-coreos-tracker/issues/329, but we already set the containerd limit, and that seems to have helped in the past for some other containers (iirc some in ceph-waw3, osd-prepares I think?). It might be getting triggered here because this particular container is privileged (and likely has CAP_SYS_RESOURCE), and something bumps ulimit -n up to fs.nr_open. But I'm not sure.
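For illustration, here is a minimal Python sketch of one way to sidestep the hang from inside a process: lower the soft RLIMIT_NOFILE before spawning children. The likely mechanism (an assumption, not verified against this container) is that Python 2.7's subprocess.Popen, when closing inherited descriptors, walks the whole descriptor range up to the limit, which takes a very long time at around 2^30. The helper name `clamp_nofile` and the cap of 65536 are hypothetical; this is not how rook or ceph-volume handle it.

```python
import resource


def clamp_nofile(soft_cap=65536):
    """Lower the soft RLIMIT_NOFILE to at most soft_cap.

    Returns (old_soft, new_soft). soft_cap is an arbitrary, assumed value.
    Lowering a soft limit needs no privileges; the hard limit is untouched.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # The new soft limit must never exceed the hard limit.
    new_soft = soft_cap
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)

    # Nothing to do if the current soft limit is already low enough.
    if soft != resource.RLIM_INFINITY and soft <= new_soft:
        return soft, soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return soft, new_soft
```

Child processes inherit the lowered limit, so calling `clamp_nofile()` before any `subprocess` call would keep the descriptor range small for them as well. It does not explain what raises the limit in the first place.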

I've temporarily bypassed this by setting fs.nr_open to a lower limit on affected nodes and manually restarting the discovery pods, but this needs a proper fix.
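To check whether a node or container is still affected after the workaround, something like the following sketch could compare the soft file-descriptor limit against fs.nr_open. The function names and the threshold of 1048576 are assumptions picked for illustration; only `/proc/sys/fs/nr_open` and the `resource` module are standard.

```python
import resource

NR_OPEN_PATH = "/proc/sys/fs/nr_open"  # standard Linux procfs location


def parse_nr_open(text):
    """Parse the contents of /proc/sys/fs/nr_open into an int."""
    return int(text.strip())


def is_suspicious(soft, threshold=1048576):
    """Return True if a soft nofile limit is unlimited or above threshold.

    threshold is an assumed cut-off, not a value taken from this issue.
    """
    return soft == resource.RLIM_INFINITY or soft > threshold


def nofile_report(path=NR_OPEN_PATH):
    """Collect fs.nr_open and this process's RLIMIT_NOFILE into a dict."""
    with open(path) as f:
        nr_open = parse_nr_open(f.read())
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "nr_open": nr_open,
        "soft": soft,
        "hard": hard,
        "suspicious": is_suspicious(soft),
    }
```

Running `nofile_report()` inside a rook-discover pod before and after lowering fs.nr_open should show whether the soft limit actually follows the sysctl, which would support or rule out the CAP_SYS_RESOURCE theory above.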

#1

Updated by q3k almost 2 years ago

  • Category set to hscloud
