The Problem
Traditional HA requires a clustered database (etcd, CockroachDB, Consul+Raft). This adds operational complexity, latency, and a new failure domain. Kovanex takes a different approach: each node keeps its own database, and a gossip protocol shares only what matters — health scores and data locality.
Core Idea: Health Matrix
Every node emits a vector of metrics (CPU, memory, disk, queue depth, runner slots, DB latency). Sampled over time, these vectors stack into a matrix. A convolution over the matrix produces a single health score (0.0–1.0), which becomes the weight of that node's vertex in the cluster graph.
```
              t0    t1    t2    t3    t4
cpu_pct     [ 45.2, 48.1, 52.3, 49.7, 47.0 ]
mem_pct     [ 62.0, 62.5, 63.1, 62.8, 62.3 ]
db_lat_ms   [  2.1,  2.3,  2.5,  2.2,  2.0 ]
queue_depth [  3,    5,    8,    4,    2   ]
runner_free [  2,    1,    0,    1,    2   ]

# Convolution → health_score = 0.74 (healthy)
```
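The pipeline above can be sketched as follows. The smoothing kernel, per-metric worst-case bounds, and aggregation weights here are illustrative assumptions, not Kovanex's actual coefficients; the point is the shape of the computation: convolve each metric's time series, normalize, then aggregate into one score.

```python
# Sketch: metric matrix (rows = metrics, cols = time) → single health score.
# Kernel, bounds, and weights are illustrative assumptions.

def smooth(series, kernel=(0.2, 0.3, 0.5)):
    """1D convolution over the most recent samples, weighting newer ones more."""
    window = series[-len(kernel):]
    return sum(w * x for w, x in zip(kernel, window))

def health_score(matrix, bounds, weights):
    """matrix: {metric: [samples]}; bounds: {metric: worst-case value}."""
    score = 0.0
    for metric, series in matrix.items():
        smoothed = smooth(series)
        # Normalize so 0.0 = at the worst-case bound, 1.0 = fully idle/free.
        normalized = max(0.0, 1.0 - smoothed / bounds[metric])
        score += weights[metric] * normalized
    return round(score, 2)

matrix = {
    "cpu_pct":     [45.2, 48.1, 52.3, 49.7, 47.0],
    "mem_pct":     [62.0, 62.5, 63.1, 62.8, 62.3],
    "db_lat_ms":   [2.1, 2.3, 2.5, 2.2, 2.0],
    "queue_depth": [3, 5, 8, 4, 2],
}
bounds  = {"cpu_pct": 100, "mem_pct": 100, "db_lat_ms": 50, "queue_depth": 32}
weights = {"cpu_pct": 0.3, "mem_pct": 0.3, "db_lat_ms": 0.2, "queue_depth": 0.2}

print(health_score(matrix, bounds, weights))
```

Because the kernel favors recent samples, a node that is recovering scores higher than one that is degrading, even with the same average load.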
Cluster Graph
Every node sees the entire system as a weighted graph. AI agents and the scheduler use this graph to make placement decisions.
```
labels: prod           labels: staging         labels: dev
repos: 12              repos: 12 (replica)     repos: 3
runners: 3/4           runners: 0/2            runners: 2/2
      |                      |                       |
[Runner 1: 0.95]       [Runner 2: 0.12]        [DB: 0.78]
```
How It Works
Gossip Protocol
SWIM-based protocol (like HashiCorp Serf). Each node picks 3 random peers every 5 seconds and exchanges state. Full convergence in O(log N) rounds — 100 nodes converge in ~35 seconds.
No heartbeat → suspected (30s) → dead (60s) → tasks rescheduled.
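The two mechanics above, epidemic convergence and the heartbeat state machine, fit in a few lines. The function names are illustrative; the 30s/60s thresholds and the ~35s figure for 100 nodes come from the text:

```python
import math

# Epidemic dissemination reaches all nodes in O(log N) rounds.
def convergence_time(n_nodes, round_interval_s=5.0):
    return math.ceil(math.log2(n_nodes)) * round_interval_s

# Heartbeat state machine: alive → suspected (30s) → dead (60s).
SUSPECT_AFTER = 30.0  # seconds of silence before "suspected"
DEAD_AFTER = 60.0     # seconds of silence before "dead"

def node_state(last_seen_s, now_s):
    silence = now_s - last_seen_s
    if silence >= DEAD_AFTER:
        return "dead"        # tasks get rescheduled at this point
    if silence >= SUSPECT_AFTER:
        return "suspected"
    return "alive"

print(convergence_time(100))   # 35.0 seconds, matching the ~35s above
print(node_state(0.0, 45.0))   # suspected
```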
Scheduling
Tasks are placed on nodes by a scoring function:
score = health × capacity × locality × label_match
Locality: 1.0 if node owns repo, 0.7 if replica exists, 0.3 if remote clone needed. Four strategies: best-fit, locality, round-robin, spread.
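A minimal sketch of the best-fit scoring above. The locality constants come from the text; the node fields and the "capacity = free/total runners" definition are illustrative assumptions:

```python
# Sketch of score = health × capacity × locality × label_match.
# Node/task field names are illustrative, not Kovanex's schema.

LOCALITY = {"owner": 1.0, "replica": 0.7, "remote": 0.3}

def score(node, task):
    if not task["labels"].issubset(node["labels"]):
        return 0.0                      # label mismatch disqualifies the node
    capacity = node["free_runners"] / node["total_runners"]
    return node["health"] * capacity * LOCALITY[node["repo_locality"]]

nodes = [
    {"health": 0.95, "free_runners": 1, "total_runners": 4,
     "repo_locality": "owner",   "labels": {"prod"}},
    {"health": 0.91, "free_runners": 2, "total_runners": 2,
     "repo_locality": "replica", "labels": {"prod", "staging"}},
]
task = {"labels": {"prod"}}

best = max(nodes, key=lambda n: score(n, task))
```

Note how a replica node with free capacity can beat the repo's owner when the owner is nearly saturated; this is the trade-off the locality factor encodes.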
Git as State Machine
A task is a git branch. The branch contains the prompt, context, scripts, and code changes. If a node dies mid-task, another node pulls the branch, reads the last commit, and continues.
No separate state store needed — git IS the state.
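The resume logic can be sketched like this. A real node would run `git log` on the task branch; here commits are simulated as a list, and the `step:<name>` commit-message convention is an illustrative assumption, not Kovanex's actual format:

```python
# Sketch: recover a task's position by reading its branch history.
# The "step:<name>" commit convention is an illustrative assumption.

PIPELINE = ["plan", "generate", "test", "review"]

def next_step(branch_commits):
    """Return the first pipeline step with no commit on the branch."""
    done = {m.split(":", 1)[1] for m in branch_commits if m.startswith("step:")}
    for step in PIPELINE:
        if step not in done:
            return step
    return None  # all steps committed: task complete

# Node A died after committing "generate"; node B pulls the branch and resumes.
commits = ["step:plan", "step:generate"]
print(next_step(commits))
```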
Replication
Per-repo, three levels:
- None — dev sandbox, not critical
- Async — mirror after commit (default)
- Sync — mirror before ack to client (critical)
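The three levels differ only in where mirroring sits relative to the client ack. A sketch of that ordering, with illustrative stub events standing in for real git pushes:

```python
# Sketch: how the replication level reorders commit, mirror, and ack.
# Event strings are illustrative stubs for the real git operations.

from enum import Enum

class Replication(Enum):
    NONE = "none"    # dev sandbox, not critical
    ASYNC = "async"  # mirror after commit (default)
    SYNC = "sync"    # mirror before ack to client (critical)

def commit(repo, level, events):
    events.append(f"commit:{repo}")
    if level is Replication.SYNC:
        events.append(f"mirror:{repo}")   # replicate first, then ack
        events.append(f"ack:{repo}")
    elif level is Replication.ASYNC:
        events.append(f"ack:{repo}")      # ack immediately, mirror in background
        events.append(f"mirror:{repo}")
    else:
        events.append(f"ack:{repo}")      # never mirrored

events = []
commit("api", Replication.SYNC, events)
```

Sync buys durability at the cost of ack latency; async keeps commits fast but can lose the newest commit if the owner dies before the mirror completes.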
Auto-Scaling
The cluster continuously monitors pressure:
```
queue_pressure = pending_tasks / available_runners

# Asymmetric: scale up fast, scale down slow
Scale UP:   pressure > 0.8 sustained for 2 minutes
Scale DOWN: pressure < 0.2 sustained for 10 minutes
```
The GetScaleDecision RPC returns the current recommendation with triggered rules.
An external orchestrator or the AI agent itself calls ApplyScaleAction to execute.
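The asymmetric rules amount to a hysteresis check over recent pressure samples. This sketch stands in for the GetScaleDecision response; the thresholds and sustain windows come from the text, while the sampling interval and return values are illustrative assumptions:

```python
# Sketch: asymmetric scale decision with sustain windows.
# Thresholds/windows are from the text; 10s sampling is an assumption.

UP_THRESHOLD, UP_SUSTAIN_S = 0.8, 120       # scale up fast: 2 minutes
DOWN_THRESHOLD, DOWN_SUSTAIN_S = 0.2, 600   # scale down slow: 10 minutes

def scale_decision(samples, interval_s=10):
    """samples: recent queue_pressure readings, oldest first."""
    def sustained(pred, window_s):
        n = window_s // interval_s
        return len(samples) >= n and all(pred(p) for p in samples[-n:])
    if sustained(lambda p: p > UP_THRESHOLD, UP_SUSTAIN_S):
        return "scale_up"
    if sustained(lambda p: p < DOWN_THRESHOLD, DOWN_SUSTAIN_S):
        return "scale_down"
    return "hold"

print(scale_decision([0.9] * 12))   # 12 × 10s = 2 min above 0.8
```

The asymmetry means a brief lull never tears capacity down, but a sustained spike adds it within two minutes.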
Environments = Graph Topology
Environments are not database records — they are labels on nodes. Promoting a release is traversing the graph:
```
dev (Node C) → staging (Node B) → prod (Node A)

# QA gates evaluated at each hop
promote v6.0.0 from dev → staging
  ✓ unit_tests passed
  ✓ integration passed
  ✓ security_scan clean
  → deployed to staging
```
Plugin API = Capabilities
A plugin is a module that registers capabilities on a node. The scheduler sees capabilities in the graph and routes tasks accordingly.
```
NodeModule {
  name:     "security-scan"
  version:  "1.2.0"
  endpoint: "localhost:9100"
  provides: ["sast", "dependency-audit"]
}

# Task requiring "sast" is automatically routed to this node
```
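Capability routing is a set-containment filter over the graph. A sketch, reusing the field names from the NodeModule example above; the routing function itself and the node identifiers are illustrative assumptions:

```python
# Sketch: filter cluster nodes by the capabilities their plugins provide.
# Node names and the function are illustrative; "provides" mirrors NodeModule.

nodes = {
    "node-a": {"modules": [
        {"name": "security-scan", "provides": ["sast", "dependency-audit"]},
    ]},
    "node-b": {"modules": []},
}

def capable_nodes(nodes, required):
    """Return names of nodes whose plugins cover every required capability."""
    matches = []
    for name, node in nodes.items():
        caps = {c for mod in node["modules"] for c in mod["provides"]}
        if required <= caps:
            matches.append(name)
    return matches

print(capable_nodes(nodes, {"sast"}))
```

The scheduler would then apply the usual scoring function to this filtered candidate set, so capabilities act as a hard constraint ahead of health and locality.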
Node Roles
| Role | What it does | Use case |
|---|---|---|
| FULL | Server + runner + git | Single-binary, small teams |
| SERVER | API + scheduler only | Dedicated control plane |
| RUNNER | Compute only | Elastic build capacity |
| STORAGE | Git + registry only | Dedicated storage tier |
Comparison
| System | Consensus | State store | Cluster DB needed |
|---|---|---|---|
| Kubernetes | Raft (etcd) | etcd | Yes |
| Nomad | Raft + Gossip | Raft log | Yes (built-in) |
| Docker Swarm | Raft | Raft log | Yes (built-in) |
| Kovanex | Gossip only | Git + local DB | No |
gRPC API
ClusterService — 17th gRPC service (18 total with AgentMemoryService), 17 RPCs:
```
# Membership
JoinCluster, LeaveCluster, ListNodes, GetNode, RemoveNode

# Health & metrics
ReportHealth, GetClusterGraph, GetHealthMatrix

# Gossip (node-to-node, public)
Gossip, SyncState (bidirectional stream)

# Scheduling
ScheduleTask, RescheduleTask, GetTaskPlacement

# Replication & scaling
SetReplicationLevel, GetReplicationStatus
GetScaleDecision, ApplyScaleAction
```
Full proto definitions: kovanex.proto
Minimum Viable Cluster
Phase 1 — Observable: gossip between nodes, health matrix, cluster graph in CLI/UI.
Phase 2 — Smart: graph-based scheduling, repo replication, task reschedule.
Phase 3 — Enterprise: sync replication, multi-region, plugin API, auto-scale executor.