The Architecture of Resilience: Moving Beyond "It Works on My Machine"

I am Ancel Ajanga (Duke), a Staff Software Engineer and Lead Systems Architect based in Nairobi. My engineering ethos is simple: building software isn't about stringing APIs together. It's about designing systems that can survive failure, scale under extreme pressure, and maintain strict transactional integrity.

The Genesis of a Systems Engineer

When I first began coding in 2021, my focus was entirely on the surface layer. I was building interactive UIs, stitching together simple database queries, and relying heavily on the default configurations of tools like React and Express. If a feature worked in staging, I shipped it.

However, as I began tackling enterprise-level problems, the fragility of standard MVC applications became terrifyingly apparent. A dropped database connection during a financial transaction didn't just crash a server—it corrupted ledgers. A sudden spike in user registration didn't just slow down a UI—it overwhelmed the thread pool, causing cascading failures across loosely coupled microservices.

This realization marked my transition from a traditional "Fullstack Developer" to a Systems Architect. I stopped asking "How do I build this feature?" and started asking "How does this feature fail, and how can the system heal itself when it does?"

Defining the "Blind Trust" Problem

One of the most profound challenges in modern distributed architecture is what I call the "Blind Trust" problem. In standard RESTful microservices, Service A sends a payload to Service B, receives a 200 OK, and blindly trusts that Service B has successfully persisted the data.

What happens if Service B's database connection is severed immediately after it sends the 200 OK? What if the message broker drops the event?

I tackled this aggressively when architecting NestFi, a decentralized financial engineering platform. In financial systems, a dropped record isn't a UI bug; it's lost capital. To solve this, I moved away from standard request-response cycles and implemented a strict, immutable append-only ledger pattern.

  • Every mutation is assigned a unique, time-ordered UUID v7.
  • Write operations are placed on an event queue (Kafka/BullMQ) and processed idempotently.
  • The system employs the Outbox Pattern, which ensures that an event is never dispatched unless the local database transaction has successfully committed.

This ensures that even if a node disappears mid-request, the committed system state remains consistent and reconcilable.
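The outbox and idempotent-consumer ideas above can be sketched in memory. This is a minimal illustration under stated assumptions, not NestFi's actual code: every name here (InMemoryDb, recordMutation, relayOutbox, IdempotentConsumer) is hypothetical, the "transaction" is simulated with staged arrays rather than a real database, and event IDs are supplied by the caller instead of being generated as UUID v7 values.

```typescript
// Hypothetical types for illustration only.
type LedgerEntry = { id: string; account: string; amountCents: number };
type OutboxEvent = { id: string; type: string; payload: LedgerEntry; dispatched: boolean };

class InMemoryDb {
  readonly ledger: LedgerEntry[] = [];
  readonly outbox: OutboxEvent[] = [];

  // Simulates one local transaction: work runs against staged arrays, and
  // nothing becomes visible unless the work finishes without throwing.
  transaction(work: (ledger: LedgerEntry[], outbox: OutboxEvent[]) => void): void {
    const stagedLedger: LedgerEntry[] = [];
    const stagedOutbox: OutboxEvent[] = [];
    work(stagedLedger, stagedOutbox); // a throw here discards both staged arrays
    this.ledger.push(...stagedLedger);
    this.outbox.push(...stagedOutbox);
  }
}

// Appends the ledger entry and its outbox event in the same transaction.
function recordMutation(db: InMemoryDb, entry: LedgerEntry): void {
  db.transaction((ledger, outbox) => {
    ledger.push(entry);
    // Stand-in for any failure that can occur mid-transaction.
    if (!Number.isInteger(entry.amountCents)) {
      throw new Error("amount must be a whole number of cents");
    }
    outbox.push({ id: entry.id, type: "ledger.appended", payload: entry, dispatched: false });
  });
}

// Relay: publishes only committed outbox rows. It is safe to re-run after a
// crash, because a row is marked dispatched only after publish returns.
function relayOutbox(db: InMemoryDb, publish: (event: OutboxEvent) => void): number {
  let sent = 0;
  for (const event of db.outbox) {
    if (event.dispatched) continue;
    publish(event);
    event.dispatched = true;
    sent++;
  }
  return sent;
}

// Consumer: deduplicates by event ID, so a redelivered event is a no-op.
class IdempotentConsumer {
  private readonly seen = new Set<string>();
  readonly applied: LedgerEntry[] = [];

  handle(event: OutboxEvent): boolean {
    if (this.seen.has(event.id)) return false;
    this.seen.add(event.id);
    this.applied.push(event.payload);
    return true;
  }
}
```

Because the relay marks a row as dispatched only after publishing, a crash between the two steps leads to a redelivery rather than a lost event, which is why the consumer has to be idempotent.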

Self-Healing Infrastructure

Resilience isn't just about data; it's about network topology. When I was building Aegis, a high-performance machine learning operational gateway, the primary threat vector wasn't bad data; it was bad traffic. Model inference is notoriously expensive. A standard DDoS attack or even an accidental thundering herd of legitimate users can exhaust GPU compute in seconds.

My approach to Aegis was to build infrastructure that acts as a fortress. I used Redis Enterprise not just as a cache, but as a strict rate-limiting barricade implementing the Token Bucket algorithm.
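The token bucket idea can be sketched in a few lines. This is a minimal in-memory illustration, not the Aegis implementation: the class name TokenBucket, its parameters, and the injected clock are assumptions made for the example. In a Redis-backed limiter the same refill-and-consume step would need to run atomically on the server, for example inside a Lua script.

```typescript
// Hypothetical in-memory token bucket. The caller passes the current time in
// milliseconds so the behaviour is deterministic and easy to test.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and deducts `cost` tokens if the request is allowed.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
      // Refill in proportion to elapsed time, never above capacity.
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens < cost) return false; // reject quickly, do no further work
    this.tokens -= cost;
    return true;
  }
}
```

The capacity bounds the burst size and the refill rate bounds sustained throughput, so an expensive inference call could be given a higher cost than a cheap one.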

"A system is only as strong as its ability to reject bad work quickly."

Before a request ever hits the inference engine, Aegis evaluates the JWT against OPA (Open Policy Agent) rules, checks the Redis quota, and inspects the payload signature. If the system detects strain, it automatically trips its circuit breakers, returning cached historical inferences or graceful degradation messages rather than allowing the core database to lock up.
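A circuit breaker with a cached fallback, as described above, might look roughly like the following. It is a simplified synchronous sketch under assumptions: the CircuitBreaker class, its failure threshold, and its cooldown are hypothetical names and parameters, real inference calls would be asynchronous, and "strain" is approximated here by consecutive failures of the primary call.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: opens after `failureThreshold` consecutive failures,
// then allows a single trial call once `cooldownMs` has elapsed.
class CircuitBreaker<T> {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  call(nowMs: number, primary: () => T, fallback: () => T): T {
    if (this.state === "open") {
      // While open, skip the primary entirely and degrade gracefully.
      if (nowMs - this.openedAtMs < this.cooldownMs) return fallback();
      this.state = "half-open"; // cooldown elapsed: permit one trial call
    }
    try {
      const result = primary();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAtMs = nowMs;
      }
      return fallback();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

While the breaker is open, the expensive backend is never invoked, which is what gives it time to recover.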

Data Sovereignty and Zero-Knowledge Contexts

As distributed systems grow, so does the attack surface. In my work on Inkly, the requirement was absolute data sovereignty. Users needed collaborative spaces where even the server administrators could not read the underlying data.

This required going beyond standard transport-level TLS encryption to a strict Zero-Knowledge (ZK) architecture. I engineered a lifecycle where cryptographic keys are generated and held exclusively on the client. Data payloads are encrypted in the browser using AES-GCM before ever touching a network socket. The backend (Node.js/PostgreSQL) simply acts as a blind relay, storing encrypted blobs and validating multi-tenant access via cryptographically signed intents.
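The encrypt-before-send flow can be illustrated with a short sketch. It makes several assumptions: Node's built-in crypto module stands in for the browser's Web Crypto API so the example runs synchronously outside a browser, the names encryptOnClient, decryptOnClient, and BlindRelay are hypothetical, and key generation, key sharing between collaborators, and signed intents are left out entirely.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// What the server is allowed to see: nonce, ciphertext, and auth tag only.
type EncryptedBlob = { iv: Buffer; ciphertext: Buffer; authTag: Buffer };

// Runs on the client. `key` is a 32-byte AES-256 key that never leaves it.
function encryptOnClient(key: Buffer, plaintext: string): EncryptedBlob {
  const iv = randomBytes(12); // 96-bit nonce; must never repeat for the same key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

// Runs on the client. Throws if the blob was tampered with or the key is wrong.
function decryptOnClient(key: Buffer, blob: EncryptedBlob): string {
  const decipher = createDecipheriv("aes-256-gcm", key, blob.iv);
  decipher.setAuthTag(blob.authTag);
  return Buffer.concat([decipher.update(blob.ciphertext), decipher.final()]).toString("utf8");
}

// Hypothetical stand-in for the backend: it stores opaque blobs and holds no
// key, so it has nothing it could use to read them.
class BlindRelay {
  private readonly blobs = new Map<string, EncryptedBlob>();

  store(documentId: string, blob: EncryptedBlob): void {
    this.blobs.set(documentId, blob);
  }

  fetch(documentId: string): EncryptedBlob | undefined {
    return this.blobs.get(documentId);
  }
}
```

AES-GCM is authenticated encryption, so a client detects any modification made to a stored blob at decryption time instead of silently reading corrupted data.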

Closing Thoughts: The Future is Distributed

The era of the "simple web app" is dying. Today's applications are highly concurrent, globally distributed ecosystems that demand rigorous engineering at every layer, from responsive, hardware-accelerated UIs down to idempotent writes across database clusters.

My goal as an architect is to continue designing these systems. Whether it's optimizing a React render cycle to maintain 60fps under heavy load, or sharding a PostgreSQL database to handle tens of thousands of concurrent writes, my focus remains singular: building software that survives.