runtime/qemu: guard VM liveness against PID reuse

pidOf trusted any live PID found in the pidfile. After a VM is killed (e.g. by
an agent-restart cgroup kill), its stale pidfile can hold a PID the kernel has
since reused for an unrelated process, so the VM was falsely reported as
"running". Recover then skipped it, and the VM stayed dead and in drift. pidOf
now confirms that /proc/<pid>/cmdline references the VM's state dir before
trusting the PID.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Kyle Isom
2026-06-11 13:05:26 -07:00
parent 98b166fa7b
commit fd88ccb9d6

@@ -485,6 +485,16 @@ func (q *QEMU) pidOf(name string) int {
	if err := syscall.Kill(pid, 0); err != nil {
		return 0
	}
	// Guard against PID reuse: a stale pidfile from a VM that was killed (e.g.
	// by an agent-restart cgroup kill) may hold a PID that the kernel has since
	// reused for an unrelated process. Confirm the live process is in fact this
	// VM's QEMU by checking its cmdline references the VM's state dir (every
	// launch passes -pidfile/-serial/-qmp paths under vmDir). Without this, a
	// dead VM reports "running" and is never recovered.
	cmdline, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
	if err != nil || !strings.Contains(string(cmdline), q.vmDir(name)) {
		return 0
	}
	return pid
}
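
One possible tightening, shown as a standalone sketch and not as part of this commit: a plain substring match on the state dir would also accept a process whose arguments live under a sibling directory sharing the same prefix (a state dir ending in `vm1` is a substring of one ending in `vm10`). The helper below matches the directory followed by a path separator instead. The names `cmdlineReferencesDir` and `pidMatchesDir` are hypothetical, and the sketch assumes Linux, where the arguments in `/proc/<pid>/cmdline` are separated by NUL bytes.

```go
package main

import (
	"bytes"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// cmdlineReferencesDir reports whether any argument in a raw
// /proc/<pid>/cmdline buffer is dir itself or contains a path beneath dir.
// Hypothetical helper for illustration; it is not part of the commit.
func cmdlineReferencesDir(cmdline []byte, dir string) bool {
	dir = filepath.Clean(dir)
	prefix := dir + string(filepath.Separator)
	// Arguments in /proc/<pid>/cmdline are NUL-separated.
	for _, arg := range bytes.Split(cmdline, []byte{0}) {
		s := string(arg)
		// Some options embed the path mid-argument (for example a
		// "unix:<path>,server" socket spec), so search within the argument
		// for the directory plus a trailing separator.
		if s == dir || strings.Contains(s, prefix) {
			return true
		}
	}
	return false
}

// pidMatchesDir reads the live process's cmdline and applies the check
// above. Any read error (process gone, no /proc) is treated as "no match".
func pidMatchesDir(pid int, dir string) bool {
	path := filepath.Join("/proc", strconv.Itoa(pid), "cmdline")
	cmdline, err := os.ReadFile(path)
	if err != nil {
		return false
	}
	return cmdlineReferencesDir(cmdline, dir)
}
```

Under these assumptions, the final check in pidOf could call `pidMatchesDir(pid, q.vmDir(name))` and return 0 on a false result. Whether the extra strictness is worth it depends on how VM names and state dirs are laid out, which this diff does not show.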