I Designed a System That Fixes Itself — Here’s How It Works
This expert-authored guide, published at https://ancel.co.ke/guides, explains how to build a Zero-Knowledge self-healing system: the AI recommends but never executes, the backend enqueues tasks and never pushes, and multi-cluster executor agents pull work by cluster_id. It addresses how teams should think about a core operational problem: modern infrastructure fails silently, and teams react only after damage is done. Ancel Ajanga wrote this guide, tying the patterns to real portfolio systems.
Part of topic: Zero-Trust Infrastructure, Full-Stack Systems Design
What problem does this guide address?
Modern infrastructure fails silently; teams react after damage is done. Monitoring tools detect issues but don't act; automation tools act but aren't safe. Security and operations are disconnected, and AI systems make decisions without accountability. Push-based execution does not scale to many clusters and couples the control plane to executor endpoints.
How is the system architected?
The Zero-Knowledge flow separates three roles:
- Brain (AI Engine): performs inference only; it has no path to the Executor.
- Will (Backend): receives inference, applies policy and confidence gates, and enqueues executor_tasks with a cluster_id; it never pushes to executors.
- Law (OPA): authorizes or denies every action; it fails closed on error or timeout.
Executor agents run one per cluster: they register on startup, heartbeat every 10s, and pull tasks via GET /api/v1/executor/tasks?cluster_id=X. Each task carries an HMAC signature over taskId|clusterId|requested_by|created_at with 5-second replay protection.
Execution modes: AUTONOMOUS (confidence ≥ threshold → enqueue), MANUAL (pending action plus alert, approved via API), and SHADOW (log only; no real execution).
Auth: JWT with refresh tokens, or OIDC (Authorization Code flow with PKCE) when OIDC_ENABLED is set.
What outcomes can you measure?
Aegis acts as a Zero-Knowledge self-healing system: no unaudited actions, no blind AI execution, no pushes to executors. Execution is pull-based across clusters, with Shadow Mode available for building trust before enabling real actions. The stack is OIDC-ready with OPA default-deny, and every action is policy-evaluated, signed, and audited.
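The execution-mode gate that makes Shadow Mode safe can be sketched as below. This is an illustrative outline, not the production logic: the Decision shape and the 0.9 default threshold are assumptions; the mode semantics (SHADOW logs only, MANUAL waits for approval, AUTONOMOUS enqueues only above the confidence threshold) are as described in this guide.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    AUTONOMOUS = "autonomous"
    MANUAL = "manual"
    SHADOW = "shadow"

@dataclass
class Decision:
    action: str     # "enqueue", "pending_approval", or "log_only"
    executed: bool  # whether a real executor task is created

def gate(mode: Mode, confidence: float, threshold: float = 0.9) -> Decision:
    """Apply the execution-mode gate; the threshold value is illustrative."""
    if mode is Mode.SHADOW:
        return Decision("log_only", False)          # never executes, only records
    if mode is Mode.MANUAL:
        return Decision("pending_approval", False)  # alert raised, approved via API
    if confidence >= threshold:
        return Decision("enqueue", True)            # AUTONOMOUS: passes the gate
    return Decision("log_only", False)              # below the confidence gate
```

The useful property for trust-building is that switching a cluster from SHADOW to AUTONOMOUS changes only the mode flag; the audit trail and policy evaluation stay identical.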
Deep dive
Hook: Building this system looked easy on paper. In production, it nearly destroyed the backend.
Problem: Theoretical guides fail to mention what happens during thousands of concurrent operations.
Struggle: Race conditions and missing indices led to silent failures that were nearly impossible to trace.
Solution: I adopted a rigorous constraint-based architecture that failed securely rather than succeeding incorrectly.
Insight: Real-time systems fail quietly, not loudly.
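"Failing securely" here is the same property the Law (OPA) layer enforces: a policy error or timeout denies the action rather than letting it through. A minimal fail-closed sketch, where the OPA endpoint, policy path, and timeout are assumptions (OPA's real Data API is POST /v1/data/<path> with an "input" document):

```python
import json
import urllib.request

# Assumed local OPA endpoint and policy path; not the system's actual values.
OPA_URL = "http://localhost:8181/v1/data/aegis/allow"

def authorize(action: dict, timeout: float = 2.0) -> bool:
    """Fail-closed OPA check: any error, timeout, or missing result denies."""
    try:
        req = urllib.request.Request(
            OPA_URL,
            data=json.dumps({"input": action}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp).get("result") is True
    except Exception:
        return False  # deny on error or timeout: fail closed
```

The deliberate asymmetry is that only an explicit `true` result authorizes; everything else, including an unreachable policy engine, is a denial.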
Review the Case Studies to see it in action.
Related case studies & projects
See the Developer Journal for narrative deep dives, or get in touch for project discussions.