Kovanex Cluster Architecture

The Problem

Traditional HA requires a clustered database (etcd, CockroachDB, Consul+Raft). This adds operational complexity, latency, and a new failure domain. Kovanex takes a different approach: each node keeps its own database, and a gossip protocol shares only what matters — health scores and data locality.

Core Idea: Health Matrix

Every node emits a vector of metrics (CPU, memory, disk, queue depth, runner slots, DB latency). Sampled over fixed time windows, these vectors stack into a matrix. A convolution of the matrix produces a single health score (0.0–1.0), which becomes the weight of the node's vertex in the cluster graph.

# Health matrix for a node (5-second windows)

               t0     t1     t2     t3     t4
cpu_pct     [ 45.2,  48.1,  52.3,  49.7,  47.0 ]
mem_pct     [ 62.0,  62.5,  63.1,  62.8,  62.3 ]
db_lat_ms   [  2.1,   2.3,   2.5,   2.2,   2.0 ]
queue_depth [    3,     5,     8,     4,     2 ]
runner_free [    2,     1,     0,     1,     2 ]

# Convolution → health_score = 0.74 (healthy)
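The convolution step can be sketched as a weighted dot product over time per metric. The normalization bounds, metric weights, and temporal kernel below are illustrative assumptions for this sketch, not the real Kovanex coefficients:

```python
# Illustrative assumptions: per-metric (worst, best) bounds, metric weights,
# and a temporal kernel that weighs recent windows more heavily.
BOUNDS = {
    "cpu_pct":     (100.0, 0.0),
    "mem_pct":     (100.0, 0.0),
    "db_lat_ms":   (50.0, 0.0),
    "queue_depth": (20.0, 0.0),
    "runner_free": (0.0, 4.0),
}
WEIGHTS = {"cpu_pct": 0.25, "mem_pct": 0.20, "db_lat_ms": 0.20,
           "queue_depth": 0.15, "runner_free": 0.20}
KERNEL = [0.10, 0.15, 0.20, 0.25, 0.30]  # five 5-second windows, newest last

def health_score(matrix):
    """Collapse the metric matrix into one 0.0-1.0 score."""
    score = 0.0
    for name, samples in matrix.items():
        worst, best = BOUNDS[name]
        for sample, k in zip(samples, KERNEL):
            norm = (sample - worst) / (best - worst)   # 1.0 = healthy
            norm = min(1.0, max(0.0, norm))
            score += WEIGHTS[name] * k * norm          # temporal convolution
    return round(score, 2)

matrix = {
    "cpu_pct":     [45.2, 48.1, 52.3, 49.7, 47.0],
    "mem_pct":     [62.0, 62.5, 63.1, 62.8, 62.3],
    "db_lat_ms":   [2.1, 2.3, 2.5, 2.2, 2.0],
    "queue_depth": [3, 5, 8, 4, 2],
    "runner_free": [2, 1, 0, 1, 2],
}
health_score(matrix)
```

With these made-up coefficients the example matrix scores about 0.57 rather than the 0.74 shown above; the real kernel and weights would be tuned operationally.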

Cluster Graph

Every node sees the entire system as a weighted graph. AI agents and the scheduler use this graph to make placement decisions.

[Server A: 0.87] <——> [Server B: 0.34]      <——> [Server C: 0.91]
  labels: prod          labels: staging            labels: dev
  repos: 12             repos: 12 (replica)        repos: 3
  runners: 3/4          runners: 0/2               runners: 2/2
        |                     |                          |
[Runner 1: 0.95]      [Runner 2: 0.12]            [DB: 0.78]

How It Works

Gossip Protocol

A SWIM-based gossip protocol (the same family as HashiCorp Serf). Every 5 seconds each node picks 3 random peers and exchanges state. Full convergence takes O(log N) rounds: 100 nodes converge in ~35 seconds.

No heartbeat → suspected (30s) → dead (60s) → tasks rescheduled.

Scheduling

Tasks are placed on nodes by a scoring function:

score = health × capacity × locality × label_match

Locality: 1.0 if node owns repo, 0.7 if replica exists, 0.3 if remote clone needed. Four strategies: best-fit, locality, round-robin, spread.
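A best-fit placement pass might look like the sketch below. The locality weights are from the text; the node and task field names are assumptions, since the spec only defines the four factors:

```python
# Locality weights per the spec; node/task shapes are illustrative.
LOCALITY = {"owner": 1.0, "replica": 0.7, "remote": 0.3}

def placement_score(node, task):
    """score = health * capacity * locality * label_match"""
    capacity = node["free_slots"] / node["total_slots"] if node["total_slots"] else 0.0
    label_match = 1.0 if task["label"] in node["labels"] else 0.0
    return node["health"] * capacity * LOCALITY[node["repo_locality"]] * label_match

nodes = [
    {"id": "A", "health": 0.87, "free_slots": 3, "total_slots": 4,
     "repo_locality": "owner", "labels": {"prod"}},
    {"id": "B", "health": 0.34, "free_slots": 2, "total_slots": 2,
     "repo_locality": "replica", "labels": {"prod", "staging"}},
]
task = {"label": "prod"}
best = max(nodes, key=lambda n: placement_score(n, task))   # best-fit strategy
```

The other three strategies reuse the same score: locality sorts by the locality factor first, round-robin ignores the score, and spread prefers the least-loaded candidates among those above a threshold.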

Git as State Machine

A task is a git branch. The branch contains the prompt, context, scripts, and code changes. If a node dies mid-task, another node pulls the branch, reads the last commit, and continues.

No separate state store needed — git IS the state.
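One way a resuming node could read progress out of the branch is via key-value trailers in the last commit message. This convention is purely hypothetical; the document only says the node reads the last commit:

```python
# Hypothetical convention: the last commit on a task branch carries
# "key: value" trailers describing progress, e.g. task, step, status.
def parse_task_state(commit_message):
    """Extract key: value trailers from a task branch's last commit message."""
    state = {}
    for line in commit_message.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state

last_commit = "task: implement-auth\nstep: 3/7\nstatus: running"
state = parse_task_state(last_commit)
# A recovering node pulls the branch, reads this state, and resumes at step 3.
```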

Replication

Per-repo, three levels:

  • None — dev sandbox, not critical
  • Async — mirror after commit (default)
  • Sync — mirror before ack to client (critical)
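The three levels differ only in where mirroring sits relative to the client ack. A minimal sketch, with the mirror and queue operations injected as callables (their real signatures are not specified):

```python
from enum import Enum

class Replication(Enum):
    NONE = "none"    # dev sandbox, not critical
    ASYNC = "async"  # mirror after commit (default)
    SYNC = "sync"    # mirror before ack to client (critical)

def commit_and_ack(repo, level, mirror, enqueue):
    """Sketch of how each level changes the ack path."""
    if level is Replication.SYNC:
        mirror(repo)          # block until the mirror confirms, then ack
        return "ack"
    if level is Replication.ASYNC:
        enqueue(repo)         # schedule mirroring, ack immediately
    return "ack"              # NONE: no mirroring at all

mirrored, queued = [], []
commit_and_ack("api", Replication.SYNC, mirrored.append, queued.append)
commit_and_ack("web", Replication.ASYNC, mirrored.append, queued.append)
commit_and_ack("sandbox", Replication.NONE, mirrored.append, queued.append)
```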

Auto-Scaling

The cluster continuously monitors pressure:

runner_pressure = active_jobs / total_slots
queue_pressure = pending_tasks / available_runners

# Asymmetric: scale up fast, scale down slow
Scale UP: pressure > 0.8 sustained for 2 minutes
Scale DOWN: pressure < 0.2 sustained for 10 minutes
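The asymmetric rule amounts to hysteresis with two sustain windows. A sketch, assuming pressure is sampled as (timestamp, value) pairs; the sample format is an assumption:

```python
SCALE_UP_THRESHOLD, SCALE_UP_SUSTAIN = 0.8, 120       # 2 minutes
SCALE_DOWN_THRESHOLD, SCALE_DOWN_SUSTAIN = 0.2, 600   # 10 minutes

def runner_pressure(active_jobs, total_slots):
    return active_jobs / total_slots if total_slots else 1.0

def scale_decision(samples, now):
    """samples: ordered (timestamp, pressure) pairs. Asymmetric hysteresis."""
    def sustained(pred, window):
        if not samples or samples[0][0] > now - window:
            return False                  # history does not cover the window
        recent = [p for t, p in samples if t >= now - window]
        return all(pred(p) for p in recent)
    if sustained(lambda p: p > SCALE_UP_THRESHOLD, SCALE_UP_SUSTAIN):
        return "scale_up"
    if sustained(lambda p: p < SCALE_DOWN_THRESHOLD, SCALE_DOWN_SUSTAIN):
        return "scale_down"
    return "hold"

samples = [(t, 0.9) for t in range(0, 181, 30)]   # 3 minutes above 0.8
scale_decision(samples, now=180)                  # sustained -> scale up
```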

The GetScaleDecision RPC returns the current recommendation with triggered rules. An external orchestrator or the AI agent itself calls ApplyScaleAction to execute.

Environments = Graph Topology

Environments are not database records — they are labels on nodes. Promoting a release means traversing the graph:

# Release promotion follows graph edges
dev (Node C) → staging (Node B) → prod (Node A)

# QA gates evaluated at each hop
promote v6.0.0 from dev → staging
✓ unit_tests passed
✓ integration passed
✓ security_scan clean
→ deployed to staging
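The hop-by-hop promotion can be sketched as a walk along the environment edges, running every gate at each hop. The topology and gate shapes here are assumptions; the spec only says gates are evaluated per hop:

```python
# Hypothetical edge map: each environment points at the next one upstream.
TOPOLOGY = {"dev": "staging", "staging": "prod"}

def promote(release, source, target, gates):
    """Walk from source toward target, evaluating QA gates at each hop."""
    log, env = [], source
    while env != target:
        nxt = TOPOLOGY[env]
        for name, check in gates.items():
            if not check(release):
                raise RuntimeError(f"{name} failed for {release} at {env} -> {nxt}")
            log.append(f"ok {name}")
        log.append(f"deployed {release} to {nxt}")
        env = nxt
    return log

gates = {"unit_tests": lambda r: True,
         "integration": lambda r: True,
         "security_scan": lambda r: True}
log = promote("v6.0.0", "dev", "staging", gates)
```

A failed gate aborts the hop, so a release can never skip an environment whose checks it has not passed.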

Plugin API = Capabilities

A plugin is a module that registers capabilities on a node. The scheduler sees capabilities in the graph and routes tasks accordingly.

# Node registers a security scanning module
NodeModule {
  name: "security-scan"
  version: "1.2.0"
  endpoint: "localhost:9100"
  provides: ["sast", "dependency-audit"]
}

# Task requiring "sast" is automatically routed to this node
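Capability routing reduces to a lookup over the registered modules in the graph. A minimal sketch, taking first match for brevity (a real scheduler would combine this with the placement score); the registry shape is an assumption:

```python
# Illustrative module registry, mirroring the NodeModule fields above.
modules = [
    {"node": "node-1", "name": "security-scan", "version": "1.2.0",
     "endpoint": "localhost:9100", "provides": ["sast", "dependency-audit"]},
    {"node": "node-2", "name": "builder", "version": "2.0.1",
     "endpoint": "localhost:9200", "provides": ["docker-build"]},
]

def route_task(capability, registry):
    """Return a node advertising the required capability (first match)."""
    for mod in registry:
        if capability in mod["provides"]:
            return mod["node"]
    raise LookupError(f"no node provides {capability!r}")

route_task("sast", modules)   # routed to the node running security-scan
```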

Node Roles

Role      What it does             Use case
FULL      Server + runner + git    Single-binary, small teams
SERVER    API + scheduler only     Dedicated control plane
RUNNER    Compute only             Elastic build capacity
STORAGE   Git + registry only      Dedicated storage tier

Comparison

System        Consensus       State store      Cluster DB needed
Kubernetes    Raft (etcd)     etcd             Yes
Nomad         Raft + Gossip   Raft log         Yes (built-in)
Docker Swarm  Raft            Raft log         Yes (built-in)
Kovanex       Gossip only     Git + local DB   No

gRPC API

ClusterService is the 17th gRPC service (18 total including AgentMemoryService) and exposes 17 RPCs:

# Node lifecycle
JoinCluster, LeaveCluster, ListNodes, GetNode, RemoveNode

# Health & metrics
ReportHealth, GetClusterGraph, GetHealthMatrix

# Gossip (node-to-node, public)
Gossip, SyncState (bidirectional stream)

# Scheduling
ScheduleTask, RescheduleTask, GetTaskPlacement

# Replication & scaling
SetReplicationLevel, GetReplicationStatus
GetScaleDecision, ApplyScaleAction

Full proto definitions: kovanex.proto

Minimum Viable Cluster

Phase 1 — Observable: gossip between nodes, health matrix, cluster graph in CLI/UI.
Phase 2 — Smart: graph-based scheduling, repo replication, task reschedule.
Phase 3 — Enterprise: sync replication, multi-region, plugin API, auto-scale executor.