agent: recover down components on startup when no boot sequence

A unikernel VM has no runtime restart policy. If it exits, including when an agent restart's cgroup kill takes it down, nothing restarts it and it sits in drift.

Recover() already handles this case, including for unikernels via runtimeFor, but it only ran inside RunBootSequence, which is gated on a [boot] sequence that worker nodes don't define.

The agent now also runs Recover once in the background on startup when there is no boot sequence, so desired=running components (VMs especially) come back after an agent or host restart without delaying registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 12:58:50 -07:00
parent 4a55972455
commit 98b166fa7b
1 changed file with 15 additions and 0 deletions
--- a/internal/agent/agent.go
+++ b/internal/agent/agent.go
@@ -11,6 +11,7 @@ import (
 	"os/signal"
 	"path/filepath"
 	"syscall"
+	"time"

 	mcpv1 "git.wntrmute.dev/mc/mcp/gen/mcp/v1"
 	"git.wntrmute.dev/mc/mcp/internal/auth"
@@ -155,6 +156,20 @@ func Run(cfg *config.AgentConfig, version string) error {
 			logger.Error("boot sequence failed", "err", err)
 			// Continue starting the gRPC server — partial boot is better than no agent.
 		}
+	} else {
+		// No ordered boot sequence: still reconcile once, in the background, so
+		// desired=running components are brought back up after an agent or host
+		// restart without delaying registration. This matters most for
+		// unikernel VMs: unlike podman containers (which have a restart policy),
+		// a VM that exits — including when an agent restart's cgroup kill takes
+		// it down — has nothing to restart it.
+		go func() {
+			recCtx, recCancel := context.WithTimeout(context.Background(), 10*time.Minute)
+			defer recCancel()
+			if err := a.Recover(recCtx); err != nil {
+				logger.Error("startup recover failed", "err", err)
+			}
+		}()
 	}

 	// Start heartbeat client (registers with master and sends heartbeats).