When PVE 8.4 docker broke — runc was the real culprit

One dist-upgrade across 11 LXCs killed vaultwarden, nextcloud, keycloak — every docker container in the homelab. I assumed docker was at fault. The actual culprit was runc, dragged along as a transitive bump.

All 11 LXCs failed at once after the patch#

The Proxmox 8.4 host had 11 unprivileged LXCs running docker workloads, all overdue for security patches. I ran apt dist-upgrade -y inside each one — about 800 packages updated per container, including docker-ce 28.1.1 → 28.5.2 and containerd.io 1.7 → 2.2.3 as major jumps.

After the patch every container reported the same error:

1
Error response from daemon: failed to create task for container:
2
  failed to create shim task: OCI runtime create failed:
3
  runc create failed: unable to start container process:
4
  error during container init:
5
  open sysctl net.ipv4.ip_unprivileged_port_start file:
6
  reopen fd 8: permission denied

vaultwarden, nextcloud, keycloak, every portainer instance — all dead with the same message.

The cause was runc 1.3’s sysctl reset, not docker#

The first thirty minutes I stared at docker daemon logs. Got nowhere. Re-reading the OCI message itself, the relevant line was “sysctl … permission denied” — failure happens below docker, in the runtime layer.

Tracking the dependency closure, containerd.io 2.x bundles runc 1.3. The 1.1 → 1.3 jump implements OCI runtime spec 1.2.1, which includes a behavior change: at container init, runc resets net.ipv4.ip_unprivileged_port_start to the default value.

On a normal Linux host this is harmless. But PVE 8.4’s unprivileged LXC mounts /proc/sys read-only as a security policy. New runc tries to write → denied → init fails → the entire container dies.

In short: docker is fine. The runc bumped along with it cannot perform its new init step inside an unprivileged LXC with read-only sysfs.

Emergency fix — six packages downgraded and held#

I tried downgrading just docker-ce to 28.1 first. Same error. docker 28 still requires containerd.io 2.x, which bundles runc 1.3. Both have to come down together.

1
docker-ce                  5:28.1.1-1~ubuntu.24.04~noble
2
docker-ce-cli              5:28.1.1-1~ubuntu.24.04~noble
3
docker-ce-rootless-extras  5:28.1.1-1~ubuntu.24.04~noble
4
containerd.io              1.7.18-1     # bundles runc 1.1.13
5
docker-buildx-plugin       0.14.x
6
docker-compose-plugin      2.27.x

Six packages held with apt-mark hold. Containers came back up. Total time from patch to recovery: one hour thirty.

PVE 9 makes the same combination work#

Holding packages is a tax that surfaces every patch cycle. The real fix had to live on the host side. PVE 9.1.9 had been stable for a month at this point.

pve8to9 flagged one FAIL — an aging ceph cluster. Looking at it: 0 OSDs, 0 pools, 0 objects. Someone had set up mon/mgr ages ago and forgotten about it. pveceph mon destroy + pveceph purge cleared it. 0 FAIL after that.

vzdump backed up all 12 guests (11 LXC + 1 VM, 17GB zstd), then dist-upgrade. 800+ packages, host reboot, kernel 7.0.0-3-pve up.

Here’s where it got interesting. On PVE 9 the docker holds can be released. Same unprivileged LXCs, but PVE 9’s LXC 4.x userns now permits sysctl writes that PVE 8.4 forbade. docker 29.4.2 + containerd.io 2.2.3 + runc 1.3.3 — all 11 containers up, no holds, no errors. The regression resolved itself with the host major upgrade.

Lesson — suspect the bundle, not the headline package#

What looked like a docker regression was actually a three-way mismatch: a host policy (PVE 8.4 read-only sysfs), a spec evolution (OCI 1.2.1), and a transitive dependency (runc inside containerd.io). None of the three has a bug. They just don’t combine.

This kind of regression tends to come from two sources:

Major bumps inside transitive dependencies. apt dist-upgrade pulls the dependency tree forward, not just the packages you installed by name. Watching docker tells you nothing about what containerd.io and runc are doing under it.
Container runtime + host isolation policy. OCI spec changes that work fine on a regular Linux host can need new privileges in isolated environments. Unprivileged LXC, gVisor, Kata — anywhere the kernel-side namespace policy diverges from a stock host.

The biggest takeaway is that postponing major upgrades isn’t the safe choice it looks like. A year of postponed upgrades comes back as a year of regressions in one afternoon. Patching frequently makes regressions easier to localize because there are fewer simultaneous moving parts. The 90 minutes here was tuition for that lesson.