Distributed Orchestration

WFL/WAG Distributed Workflow

Author: Joel Johnston
Date: 2026-04-06
Domain: Distributed Orchestration
Stroke Timeline: Pre-stroke design

Abstract

WFL (Workflow) and WAG (Waggler) are the orchestration primitives for roboNet. Workflows are directed acyclic graphs of tasks. Wagglers are worker-role coordinators that manage pools of workers for specific capability classes. Together they implement a scatter-gather orchestration model that routes work to capable nodes, survives coordinator failure, and guarantees no work is lost.

Core Primitives

WFL — Workflow

A workflow is a directed acyclic graph (DAG) of tasks with defined:

Inputs: data or signals required to start the workflow
Tasks: individual units of work, each with capability requirements
Dependencies: edges in the DAG — task B cannot start until task A completes
Outputs: the results produced when all terminal tasks complete
Failure policy: what happens when a task fails (retry, skip, abort)

Workflows are declarative. The workflow definition specifies what needs to happen and in what order. It does not specify which node runs each task. Routing is the routing engine's responsibility.
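As an illustration of a declarative definition, the sketch below models a workflow as plain data and derives one valid execution order from the dependency edges. It is a minimal sketch, not roboNet's actual schema: the names `Task`, `Workflow`, `FailurePolicy`, `execution_order`, and `terminal_tasks` are assumptions, and the capability strings and input values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import TopologicalSorter  # Python 3.9+


class FailurePolicy(Enum):
    RETRY = "retry"
    SKIP = "skip"
    ABORT = "abort"


@dataclass(frozen=True)
class Task:
    name: str
    capability: str          # required capability class (placeholder strings)
    depends_on: tuple = ()   # names of tasks that must complete first


@dataclass
class Workflow:
    inputs: dict
    tasks: dict              # task name -> Task
    failure_policy: FailurePolicy = FailurePolicy.ABORT

    def execution_order(self) -> list:
        """Return one dependency-respecting order; raise ValueError if invalid."""
        for task in self.tasks.values():
            for dep in task.depends_on:
                if dep not in self.tasks:
                    raise ValueError(f"{task.name} depends on unknown task {dep}")
        graph = {t.name: set(t.depends_on) for t in self.tasks.values()}
        # static_order() raises graphlib.CycleError (a ValueError) on a cycle.
        return list(TopologicalSorter(graph).static_order())

    def terminal_tasks(self) -> list:
        """Tasks that no other task depends on; their results are the outputs."""
        needed = {dep for t in self.tasks.values() for dep in t.depends_on}
        return sorted(name for name in self.tasks if name not in needed)


# Hypothetical three-task workflow: scan, then analyze, then report.
wf = Workflow(
    inputs={"target": "example-input"},
    tasks={
        "scan": Task("scan", "network-scanning"),
        "analyze": Task("analyze", "ai-inference", depends_on=("scan",)),
        "report": Task("report", "code-execution", depends_on=("analyze",)),
    },
)
```

Nothing in the definition names a node. The tasks carry only capability requirements, which leaves placement to the routing layer.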

WAG — Waggler

A Waggler is a coordinator for a specific capability class. It manages a pool of workers that share a capability (e.g., AI inference, code execution, network scanning).

A Waggler:

Maintains a registry of available workers in its capability class
Accepts task dispatch requests from the workflow engine
Selects a worker based on availability and load
Monitors task execution and reports results back to the workflow engine
Handles worker failures (reassigns to another worker in the pool)

The name "Waggler" is intentional: it manages the waggling — the signaling between the workflow engine and the workers. The Waggler is not a worker. It does not execute tasks. It coordinates.
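The registry and selection duties listed above can be sketched as follows. This is an illustration under stated assumptions: `WorkerInfo`, `WorkerRegistry`, and the "fewest active tasks wins" rule are hypothetical, since the text only says selection is based on availability and load.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerInfo:
    worker_id: str
    available: bool = True
    active_tasks: int = 0


class WorkerRegistry:
    """Registry of workers for one capability class (hypothetical structure)."""

    def __init__(self, capability: str):
        self.capability = capability
        self._workers = {}

    def register(self, worker: WorkerInfo) -> None:
        self._workers[worker.worker_id] = worker

    def select(self, exclude=frozenset()) -> Optional[WorkerInfo]:
        """Pick the least-loaded available worker, skipping any in `exclude`."""
        candidates = [
            w for w in self._workers.values()
            if w.available and w.worker_id not in exclude
        ]
        if not candidates:
            return None
        # Tie-break on worker_id so selection is deterministic.
        return min(candidates, key=lambda w: (w.active_tasks, w.worker_id))


registry = WorkerRegistry("ai-inference")
registry.register(WorkerInfo("A1", active_tasks=2))
registry.register(WorkerInfo("A2", active_tasks=0))
registry.register(WorkerInfo("A3", available=False))
```

The `exclude` parameter is there for reassignment: after a worker failure, the Waggler can call `select` again while excluding the worker that just failed.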

Scatter-Gather Orchestration

The orchestration model follows the scatter-gather pattern:

Workflow Engine
      |
      | scatter tasks
      ↓
WAG (capability A)    WAG (capability B)    WAG (capability C)
      |                      |                      |
      | dispatch             | dispatch             | dispatch
      ↓                      ↓                      ↓
Worker A1  A2  A3    Worker B1  B2          Worker C1
      |                      |                      |
      | results              | results              | results
      ↓                      ↓                      ↓
WAG (capability A)    WAG (capability B)    WAG (capability C)
      |                      |                      |
      | gather               | gather               | gather
      ↓
Workflow Engine
      |
      | unified result
      ↓
Requester

Scatter: The workflow engine identifies the tasks that are ready to execute (their dependencies are satisfied). It routes each task to the appropriate Waggler based on capability requirements.

Execute: Each Waggler dispatches its task to an available worker. Workers execute in isolation. Worker state is ephemeral.

Gather: Workers return results to their Waggler. The Waggler aggregates the results and forwards them to the workflow engine. The workflow engine checks whether each result satisfies the task's acceptance criteria. If it does, the task is complete and its dependents become eligible to run.
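The three phases can be sketched as a small loop. This is a simplified illustration, not the engine itself: `run_scatter_gather`, the shape of `tasks`, and the use of plain callables in place of Wagglers are all assumptions. It also runs in waves, waiting for a whole wave to finish before computing the next ready set, where a real engine would likely release dependents as soon as each task completes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_scatter_gather(tasks, wagglers, accept=lambda name, result: True):
    """Run tasks in dependency order and return {task name: result}.

    tasks:    name -> (capability, [names of dependencies])
    wagglers: capability -> callable(name, dependency_results), standing in
              for a Waggler dispatching to one of its workers
    accept:   acceptance check applied to every gathered result
    """
    results = {}
    pending = dict(tasks)
    with ThreadPoolExecutor() as pool:
        while pending:
            # Scatter: every task whose dependencies are all satisfied.
            ready = [n for n, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise RuntimeError("no runnable task: cycle or missing dependency")
            futures = {}
            for name in ready:
                capability, deps = pending.pop(name)
                dep_results = {d: results[d] for d in deps}
                futures[pool.submit(wagglers[capability], name, dep_results)] = name
            # Gather: collect results as they arrive and check acceptance.
            for future in as_completed(futures):
                name = futures[future]
                result = future.result()
                if not accept(name, result):
                    raise RuntimeError(f"result of {name} rejected")
                results[name] = result
    return results


# Toy workflow: "a" and "b" are independent, "c" depends on both.
tasks = {
    "a": ("math", []),
    "b": ("math", []),
    "c": ("math", ["a", "b"]),
}
wagglers = {"math": lambda name, deps: sum(deps.values()) + 1}
results = run_scatter_gather(tasks, wagglers)
```

In this toy run, "a" and "b" are scattered together in the first wave, and "c" becomes eligible only once both results have been gathered and accepted.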

State Machines

WFL State Machine

PENDING → RUNNING → GATHERING → COMPLETE
                              ↘ FAILED

State	Description
PENDING	Workflow created, waiting for trigger or dependency
RUNNING	At least one task is actively executing
GATHERING	All tasks dispatched, waiting for final results
COMPLETE	All terminal tasks completed successfully
FAILED	At least one task failed and failure policy = abort

PENDING → RUNNING: triggered when workflow inputs are available and at least one task has no unsatisfied dependencies.

RUNNING → GATHERING: last task dispatched, waiting for results.

GATHERING → COMPLETE: all terminal tasks report success.

GATHERING → FAILED: a task reports failure and the workflow failure policy is abort.

With failure policy skip, a failed task is marked failed and the workflow continues with the remaining tasks. With failure policy retry, the workflow engine requeues the task, up to a configurable maximum number of retries, before declaring it failed.
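The workflow transitions and failure policies can be written down as a small table plus one decision function. This is a sketch under assumptions: the names `WflState`, `ALLOWED`, `transition`, and `on_task_failure` are hypothetical, the default of 3 retries is a placeholder, and treating an exhausted retry budget like abort is one reading of the text rather than a documented rule.

```python
from enum import Enum


class WflState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    GATHERING = "GATHERING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Transitions exactly as listed in the text; COMPLETE and FAILED are final.
ALLOWED = {
    WflState.PENDING: {WflState.RUNNING},
    WflState.RUNNING: {WflState.GATHERING},
    WflState.GATHERING: {WflState.COMPLETE, WflState.FAILED},
    WflState.COMPLETE: set(),
    WflState.FAILED: set(),
}


def transition(current: WflState, target: WflState) -> WflState:
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def on_task_failure(policy: str, retries_used: int, max_retries: int = 3) -> str:
    """Decide what the engine does with a failed task.

    Returns "requeue", "skip", or "abort". Treating an exhausted retry
    budget as "abort" is an assumption of this sketch.
    """
    if policy == "retry":
        return "requeue" if retries_used < max_retries else "abort"
    if policy == "skip":
        return "skip"
    if policy == "abort":
        return "abort"
    raise ValueError(f"unknown failure policy: {policy}")
```

Keeping the legal transitions in one table makes an illegal move, such as PENDING directly to COMPLETE, fail loudly instead of silently corrupting workflow state.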

WAG State Machine

IDLE → DISPATCHING → MONITORING → COLLECTING → IDLE
                                ↘ FAILOVER

State	Description
IDLE	No active tasks
DISPATCHING	Selecting a worker and sending the task
MONITORING	Task is executing, Waggler is tracking heartbeats
COLLECTING	Task complete, collecting results
FAILOVER	Worker failed, reassigning to backup worker

FAILOVER → DISPATCHING: Waggler reassigns the task to a different worker in its pool. Task state is preserved in the workflow engine — the new worker starts from the same input state.
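One way to picture the Waggler state machine is as an event-driven lookup table. The states come from the table above; the event names (`task_received`, `worker_dead`, and so on) are hypothetical labels introduced only for this sketch.

```python
from enum import Enum


class WagState(Enum):
    IDLE = "IDLE"
    DISPATCHING = "DISPATCHING"
    MONITORING = "MONITORING"
    COLLECTING = "COLLECTING"
    FAILOVER = "FAILOVER"


# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    (WagState.IDLE, "task_received"): WagState.DISPATCHING,
    (WagState.DISPATCHING, "task_sent"): WagState.MONITORING,
    (WagState.MONITORING, "task_finished"): WagState.COLLECTING,
    (WagState.MONITORING, "worker_dead"): WagState.FAILOVER,
    (WagState.FAILOVER, "backup_selected"): WagState.DISPATCHING,
    (WagState.COLLECTING, "results_forwarded"): WagState.IDLE,
}


def step(state: WagState, event: str) -> WagState:
    """Apply one event; raise ValueError if it is not valid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not valid in {state.value}") from None


def replay(events, state: WagState = WagState.IDLE) -> WagState:
    """Feed a sequence of events through the machine and return the end state."""
    for event in events:
        state = step(state, event)
    return state


# A task whose first worker dies and whose second worker succeeds.
final_state = replay([
    "task_received", "task_sent", "worker_dead",
    "backup_selected", "task_sent", "task_finished", "results_forwarded",
])
```

The failover path re-enters DISPATCHING rather than resuming mid-task, which matches the rule that the new worker starts from the same input state.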

Heartbeat Chains

Workers send heartbeats to their Waggler at configurable intervals (default: 5 seconds). The Waggler expects a heartbeat within a deadline window (default: 15 seconds = 3 missed heartbeats).

Missed heartbeat sequence:

1 missed heartbeat: log, continue monitoring
2 missed heartbeats: alert, begin selecting backup worker
3 missed heartbeats: declare worker dead, enter FAILOVER state
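The escalation sequence above can be sketched as two small functions. The function names and the returned action labels are assumptions made for illustration; the 5-second interval and the three-step escalation come from the text.

```python
def missed_heartbeats(last_seen: float, now: float, interval: float = 5.0) -> int:
    """Number of whole heartbeat intervals elapsed since the last heartbeat."""
    return max(0, int((now - last_seen) // interval))


def escalation(missed: int) -> str:
    """Map a missed-heartbeat count to the Waggler's reaction."""
    if missed <= 0:
        return "ok"
    if missed == 1:
        return "log"
    if missed == 2:
        return "alert_and_select_backup"
    return "failover"  # 3 or more: worker is declared dead
```

With the default interval, a worker that has been silent for 15 seconds reaches a count of 3 and maps to `"failover"`, which agrees with the default deadline window.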

The heartbeat is not just a liveness signal — it carries a task progress report:

Steps completed within the current task (for multi-step tasks)
Estimated time to completion
Current resource consumption

This progress data feeds the workflow engine's ETA calculations and dashboard reporting.

Heartbeat Chain Continuity

Heartbeats are sequenced (monotonically increasing sequence number). A heartbeat with a sequence number lower than expected indicates a replay or network reorder. A gap in the heartbeat sequence (sequence 5 received after sequence 3, with sequence 4 never arriving) is treated as one missed heartbeat. This prevents false failovers from brief network disruptions.
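A minimal sketch of the sequence check, assuming hypothetical helper names `classify_heartbeat` and `count_missed`. The text does not say what the Waggler does with a stale heartbeat, so ignoring it here is an assumption.

```python
def classify_heartbeat(last_seq: int, seq: int):
    """Return (kind, missed) for a heartbeat arriving after `last_seq`."""
    expected = last_seq + 1
    if seq < expected:
        return ("stale", 0)          # replay or reordered delivery
    if seq == expected:
        return ("in_order", 0)
    return ("gap", seq - expected)   # each skipped number is one missed heartbeat


def count_missed(seqs) -> int:
    """Total missed heartbeats implied by a stream of sequence numbers."""
    last = seqs[0]
    missed = 0
    for seq in seqs[1:]:
        kind, n = classify_heartbeat(last, seq)
        if kind != "stale":
            last = seq               # stale heartbeats never move the cursor
        missed += n
    return missed
```

For the example in the text, `classify_heartbeat(3, 5)` reports a gap of one missed heartbeat. This sketch does not retract the miss if sequence 4 later arrives out of order; whether roboNet does so is not stated.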

WAG Failover

When a Waggler fails, its in-progress tasks are at risk. WAG failover ensures no work is lost.

Detection: Other nodes detect Waggler failure through the mesh's heartbeat monitoring (each Waggler is itself a mesh node subject to HCTH trust evaluation and Sentinel monitoring).

Failover trigger: A Waggler that stops sending its mesh-level heartbeat triggers the WAG failover protocol. This is the heartbeat the Waggler itself sends to the mesh, not the heartbeats exchanged with its workers.

Failover execution:

Workflow engine queries remaining Wagglers for capability match
Surviving Waggler with matching capability and available capacity is selected as successor
Workflow engine transfers all pending tasks from the failed Waggler's queue to the successor
In-flight tasks (tasks a worker was actively executing at failover time) are re-evaluated. If the worker is still alive and sending heartbeats to the mesh, the workflow engine sends the worker's contact information to the successor Waggler, which inherits the monitoring relationship

Task state persistence: Task state lives in the workflow engine, not the Waggler. This is the key design decision. A Waggler that fails loses its routing state, but the workflow engine retains all task state. Failover is coordination recovery, not data recovery.
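The failover steps can be sketched as a single function over in-memory records. Everything named here is hypothetical: `WagglerRecord`, `fail_over`, the `capacity` field, and the "most free capacity wins" rule. Re-queuing an in-flight task whose worker is also gone is an assumption that follows the ephemeral-worker model, where a dead worker has no state to recover.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WagglerRecord:
    waggler_id: str
    capability: str
    capacity: int                                   # max queued tasks (hypothetical)
    alive: bool = True
    queue: list = field(default_factory=list)       # pending task ids
    monitored: dict = field(default_factory=dict)   # task id -> worker contact


def fail_over(failed, wagglers, in_flight, live_workers) -> Optional[WagglerRecord]:
    """Move the failed Waggler's work to a successor; return it, or None.

    in_flight:    task id -> contact of the worker that was executing it
    live_workers: worker contacts still heartbeating to the mesh
    """
    def free(w):
        return w.capacity - len(w.queue)

    candidates = [
        w for w in wagglers
        if w is not failed and w.alive
        and w.capability == failed.capability
        and free(w) >= len(failed.queue)
    ]
    if not candidates:
        return None
    successor = max(candidates, key=free)
    # Pending tasks move over as-is; their state lives in the workflow engine.
    successor.queue.extend(failed.queue)
    failed.queue.clear()
    for task_id, worker in in_flight.items():
        if worker in live_workers:
            successor.monitored[task_id] = worker   # inherit monitoring
        else:
            successor.queue.append(task_id)         # dead worker: re-dispatch
    return successor


wag_a1 = WagglerRecord("wag-a1", "ai-inference", capacity=4, alive=False,
                       queue=["t1", "t2"])
wag_a2 = WagglerRecord("wag-a2", "ai-inference", capacity=8, queue=["t9"])
wag_b1 = WagglerRecord("wag-b1", "code-execution", capacity=8)
successor = fail_over(
    wag_a1, [wag_a1, wag_a2, wag_b1],
    in_flight={"t3": "worker-A1", "t4": "worker-A2"},
    live_workers={"worker-A1"},
)
```

Only task identifiers and worker contacts move between records. No task payload is copied, which reflects the point that failover is coordination recovery rather than data recovery.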

Thread-Based Ephemeral Workers

Workers are thread-based and ephemeral. Each task spawns a worker thread. The worker executes the task in isolation, reports the result, and terminates. There is no persistent worker process.

Benefits:

No state accumulation: a worker cannot be corrupted by previous tasks
Clean blast radius: a worker that crashes affects only its current task
Trivial scaling: adding worker capacity means allowing more concurrent threads (up to the node's resource limits)
Simple failover: a dead worker has no state to recover — just re-dispatch the task

Tradeoff: thread startup overhead per task. For short tasks (under ~100ms), thread overhead is significant. roboNet mitigates this with a thread pool per capability class: idle threads wait for tasks, so a fresh thread does not have to be spawned for every task.
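A pool per capability class can be sketched with Python's standard `concurrent.futures.ThreadPoolExecutor`, which keeps idle threads around and reuses them for later submissions. The wrapper class `CapabilityPools` and the capability names are assumptions for illustration, not roboNet's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CapabilityPools:
    """One reusable thread pool per capability class."""

    def __init__(self, sizes):
        # sizes: capability class -> maximum concurrent worker threads
        self._pools = {
            cap: ThreadPoolExecutor(max_workers=n, thread_name_prefix=cap)
            for cap, n in sizes.items()
        }

    def submit(self, capability, fn, *args):
        """Run fn(*args) on a thread of the capability's pool."""
        return self._pools[capability].submit(fn, *args)

    def shutdown(self):
        """Wait for running tasks to finish and release all threads."""
        for pool in self._pools.values():
            pool.shutdown(wait=True)


pools = CapabilityPools({"code-execution": 2, "ai-inference": 1})
answer = pools.submit("code-execution", pow, 2, 10).result()
thread_name = pools.submit(
    "ai-inference", lambda: threading.current_thread().name
).result()
pools.shutdown()
```

Raising `max_workers` for a class is the "allow more concurrent threads" scaling knob. Task isolation still has to come from the task code itself, because pooled threads share the process.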

Integration Points

With Capability-Based Routing

The workflow engine does not know which Waggler to send a task to. It knows the task's capability requirements. The capability routing layer resolves capability requirements to the set of Wagglers that can satisfy them. The workflow engine then selects from the returned Waggler set based on availability.
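The two-step lookup, resolve then select, might look like the sketch below. The catalog shape, the function names, and the use of queue depth as the availability measure are all assumptions made for illustration.

```python
def resolve_wagglers(required: str, catalog: dict) -> list:
    """Return ids of Wagglers whose capability class matches `required`.

    catalog: waggler id -> capability class
    """
    return sorted(w for w, cap in catalog.items() if cap == required)


def select_waggler(candidates: list, queue_depth: dict):
    """Pick the candidate with the shortest queue; None if there is none."""
    if not candidates:
        return None
    return min(candidates, key=lambda w: (queue_depth.get(w, 0), w))


catalog = {
    "wag-a1": "ai-inference",
    "wag-a2": "ai-inference",
    "wag-b1": "code-execution",
}
candidates = resolve_wagglers("ai-inference", catalog)
chosen = select_waggler(candidates, {"wag-a1": 5, "wag-a2": 1})
```

Splitting resolution from selection keeps capability knowledge in the routing layer, so the workflow engine only ever chooses among candidates it is handed.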

With HCTH Trust

Task dispatch to a Waggler requires a minimum trust threshold (configurable, default: Established tier). A Waggler with low trust receives fewer task dispatches and eventually none until trust is rebuilt. This prevents compromised Wagglers from collecting sensitive tasks.
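A threshold check of this kind could be sketched as below. Only the Established tier is named in the text. The other tier names and their ordering are placeholders invented for this example, and the sketch covers only the hard cutoff, not the gradual reduction in dispatches.

```python
# Ordered from least to most trusted. Every name except "Established"
# is a placeholder, not an actual HCTH tier.
TIERS = ["Untrusted", "Provisional", "Established", "Trusted"]


def meets_threshold(tier: str, minimum: str = "Established") -> bool:
    """True if `tier` is at or above the configured minimum tier."""
    return TIERS.index(tier) >= TIERS.index(minimum)


def eligible_wagglers(trust_by_waggler: dict, minimum: str = "Established") -> list:
    """Wagglers allowed to receive task dispatches at this threshold."""
    return sorted(
        w for w, tier in trust_by_waggler.items()
        if meets_threshold(tier, minimum)
    )


eligible = eligible_wagglers({
    "wag-a1": "Trusted",
    "wag-a2": "Provisional",
    "wag-b1": "Established",
})
```

Passing a stricter `minimum` for sensitive tasks would narrow the eligible set further, if per-task thresholds are wanted.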

With Sentinel

Wagglers are mesh nodes subject to Sentinel behavioral monitoring. A Waggler that begins exhibiting anomalous behavior (unusual task acceptance patterns, modified results, anomalous resource usage) is quarantined. Quarantined Wagglers receive no new task dispatches. In-progress task assignments are migrated through the failover protocol.

With MVC Query Engine

The MVC query engine uses WFL/WAG for its scatter-gather execution. A query is a workflow with one task per capable node (scatter) and a synthesis task that depends on all scatter tasks completing (gather). The WAG layer handles the actual dispatch to capable nodes.
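The query-as-workflow shape can be sketched in a few lines. The function name, the task naming scheme, and the dictionary layout are assumptions for illustration and are not the MVC query engine's actual format.

```python
def build_query_workflow(query: str, capable_nodes: list) -> dict:
    """Return task name -> task spec for one query.

    One scatter task per capable node, plus a synthesis task that
    depends on every scatter task.
    """
    tasks = {
        f"scatter:{node}": {"node": node, "query": query, "depends_on": []}
        for node in capable_nodes
    }
    # sorted(tasks) is evaluated before "synthesize" is added, so it
    # lists only the scatter tasks.
    tasks["synthesize"] = {"query": query, "depends_on": sorted(tasks)}
    return tasks


query_wf = build_query_workflow("status?", ["n1", "n2"])
```

Because the synthesis task depends on all scatter tasks, the ordinary DAG rules provide the gather barrier without any query-specific logic.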