WFL/WAG Distributed Workflow
Author: Joel Johnston
Date: 2026-04-06
Domain: Distributed Orchestration
Stroke Timeline: Pre-stroke design
Abstract
WFL (Workflow) and WAG (Waggler) are the orchestration primitives for roboNet. Workflows are directed acyclic graphs of tasks. Wagglers are worker-role coordinators that manage pools of workers for specific capability classes. Together they implement a scatter-gather orchestration model that routes work to capable nodes, survives coordinator failure, and guarantees no work is lost.
Core Primitives
WFL — Workflow
A workflow is a directed acyclic graph (DAG) of tasks with defined:
- Inputs: data or signals required to start the workflow
- Tasks: individual units of work, each with capability requirements
- Dependencies: edges in the DAG — task B cannot start until task A completes
- Outputs: the results produced when all terminal tasks complete
- Failure policy: what happens when a task fails (retry, skip, abort)
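Workflows are declarative. The workflow definition specifies what needs to happen and in what order. It does not specify which node runs each task. That is decided at run time: the capability routing layer resolves each task to the Wagglers able to serve it, and the chosen Waggler selects the worker.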
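A minimal Python sketch of what a declarative definition could look like, assuming a simple in-memory representation. The names `Task`, `Workflow`, `FailurePolicy`, `capability` and `depends_on` are hypothetical, not roboNet's actual schema. No task carries a node or worker field; `validate()` only checks that every edge resolves and that the graph is acyclic, using the standard library's `graphlib`.

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import CycleError, TopologicalSorter


class FailurePolicy(Enum):
    RETRY = "retry"
    SKIP = "skip"
    ABORT = "abort"


@dataclass(frozen=True)
class Task:
    name: str
    capability: str                   # what the task needs, never which node runs it
    depends_on: tuple[str, ...] = ()  # incoming DAG edges


@dataclass
class Workflow:
    inputs: tuple[str, ...]
    tasks: dict[str, Task]
    failure_policy: FailurePolicy = FailurePolicy.ABORT

    def validate(self) -> list[str]:
        """Check that every edge resolves and the graph is acyclic.

        Returns one valid execution order (a topological sort)."""
        for task in self.tasks.values():
            for dep in task.depends_on:
                if dep not in self.tasks:
                    raise ValueError(f"{task.name!r} depends on unknown task {dep!r}")
        graph = {name: set(task.depends_on) for name, task in self.tasks.items()}
        try:
            return list(TopologicalSorter(graph).static_order())
        except CycleError as exc:
            raise ValueError(f"workflow is not a DAG, cycle: {exc.args[1]}") from exc

    def terminal_tasks(self) -> set[str]:
        """Tasks no other task depends on; their results are the outputs."""
        upstream = {dep for task in self.tasks.values() for dep in task.depends_on}
        return set(self.tasks) - upstream


wf = Workflow(
    inputs=("target_range",),
    tasks={
        "scan": Task("scan", "network-scanning"),
        "analyze": Task("analyze", "ai-inference", ("scan",)),
        "report": Task("report", "code-execution", ("analyze",)),
    },
)
print(wf.validate())        # ['scan', 'analyze', 'report']
print(wf.terminal_tasks())  # {'report'}
```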
WAG — Waggler
A Waggler is a coordinator for a specific capability class. It manages a pool of workers that share a capability (e.g., AI inference, code execution, network scanning).
A Waggler:
- Maintains a registry of available workers in its capability class
- Accepts task dispatch requests from the workflow engine
- Selects a worker based on availability and load
- Monitors task execution and reports results back to the workflow engine
- Handles worker failures (reassigns to another worker in the pool)
The name "Waggler" is intentional: it manages the waggling — the signaling between the workflow engine and the workers. The Waggler is not a worker. It does not execute tasks. It coordinates.
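The selection duty above ("based on availability and load") can be sketched as a least-loaded pick over the Waggler's registry. This is an illustration under assumptions: `WorkerInfo`, `max_tasks` (a per-worker concurrency limit) and the load-ratio rule are hypothetical, and the real selection policy may weigh other signals. Note that the class only records assignments; it never executes a task.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkerInfo:
    worker_id: str
    available: bool     # registered and accepting work
    active_tasks: int   # current load
    max_tasks: int      # hypothetical per-worker concurrency limit (> 0)


class Waggler:
    """Coordinates one capability class. It selects workers; it never runs tasks."""

    def __init__(self, capability: str) -> None:
        self.capability = capability
        self.registry: dict[str, WorkerInfo] = {}
        self.assignments: dict[str, str] = {}  # task id -> worker id, for monitoring

    def register(self, worker: WorkerInfo) -> None:
        self.registry[worker.worker_id] = worker

    def select_worker(self, exclude: frozenset[str] = frozenset()) -> WorkerInfo | None:
        """Least-loaded available worker with spare capacity, skipping `exclude`."""
        candidates = [
            w for w in self.registry.values()
            if w.available and w.active_tasks < w.max_tasks and w.worker_id not in exclude
        ]
        if not candidates:
            return None
        # Load ratio first, worker_id as a deterministic tie-breaker.
        return min(candidates, key=lambda w: (w.active_tasks / w.max_tasks, w.worker_id))

    def dispatch(self, task_id: str, exclude: frozenset[str] = frozenset()) -> str | None:
        """Assign task_id to a worker; None means the pool has no free capacity."""
        worker = self.select_worker(exclude)
        if worker is None:
            return None
        worker.active_tasks += 1  # the actual send to the worker would happen here
        self.assignments[task_id] = worker.worker_id
        return worker.worker_id


wag = Waggler("ai-inference")
wag.register(WorkerInfo("A1", available=True, active_tasks=0, max_tasks=2))
wag.register(WorkerInfo("A2", available=True, active_tasks=1, max_tasks=2))
wag.register(WorkerInfo("A3", available=False, active_tasks=0, max_tasks=2))
```

The `exclude` parameter is there so the same call can serve failure handling: when reassigning, the Waggler passes the failed worker's id.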
Scatter-Gather Orchestration
The orchestration model follows the scatter-gather pattern:
                         Workflow Engine
                                |
                                | scatter tasks
          +---------------------+---------------------+
          ↓                     ↓                     ↓
 WAG (capability A)    WAG (capability B)    WAG (capability C)
          |                     |                     |
          | dispatch            | dispatch            | dispatch
          ↓                     ↓                     ↓
  Worker A1  A2  A3       Worker B1  B2           Worker C1
          |                     |                     |
          | results             | results             | results
          ↓                     ↓                     ↓
 WAG (capability A)    WAG (capability B)    WAG (capability C)
          |                     |                     |
          | gather              | gather              | gather
          +---------------------+---------------------+
                                ↓
                         Workflow Engine
                                |
                                | unified result
                                ↓
                            Requester
Scatter: The workflow engine identifies the tasks that are ready to execute (their dependencies are satisfied). It routes each task to the appropriate Waggler based on capability requirements.
Execute: Each Waggler dispatches its task to an available worker. Workers execute in isolation. Worker state is ephemeral.
Gather: Workers return results to their Waggler, which aggregates them and forwards them to the workflow engine. The workflow engine checks whether each result satisfies the task's acceptance criteria. If it does, the task is complete and its dependents become eligible to run.
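The scatter, execute and gather steps can be sketched as an event loop over a task-to-prerequisites map. Assumptions: each Waggler is modelled as a plain callable that runs a task on some worker and returns its result, `accept` stands in for the task's acceptance criteria, and a failed acceptance simply aborts (failure policies are left out). All names here are hypothetical. The loop uses `concurrent.futures.wait(..., return_when=FIRST_COMPLETED)` so a dependent becomes eligible as soon as its own prerequisites finish, not when a whole wave finishes.

```python
from __future__ import annotations

from collections.abc import Callable
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait


def ready_tasks(deps: dict[str, set[str]], done: set[str], in_flight: set[str]) -> list[str]:
    """Tasks whose prerequisites are all complete and that are not yet dispatched."""
    return sorted(t for t, pre in deps.items()
                  if t not in done and t not in in_flight and pre <= done)


def run_scatter_gather(
    deps: dict[str, set[str]],
    capability: dict[str, str],
    wagglers: dict[str, Callable[[str], object]],
    accept: Callable[[str, object], bool],
) -> dict[str, object]:
    done: set[str] = set()
    results: dict[str, object] = {}
    pending: dict[Future, str] = {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # Scatter: route every ready task to the Waggler for its capability.
            for task in ready_tasks(deps, done, set(pending.values())):
                pending[pool.submit(wagglers[capability[task]], task)] = task
            if not pending:
                raise RuntimeError("no runnable tasks: unsatisfiable dependencies")
            # Gather: process whichever results arrive first.
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in finished:
                task = pending.pop(fut)
                result = fut.result()
                if not accept(task, result):
                    raise RuntimeError(f"task {task!r} failed its acceptance criteria")
                results[task] = result
                done.add(task)  # dependents are picked up on the next pass
    return results


deps = {"scan_a": set(), "scan_b": set(), "synthesize": {"scan_a", "scan_b"}}
capability = {"scan_a": "network-scanning", "scan_b": "network-scanning",
              "synthesize": "ai-inference"}
wagglers = {cap: (lambda task: f"{task}:ok") for cap in ("network-scanning", "ai-inference")}
results = run_scatter_gather(deps, capability, wagglers, lambda task, r: r.endswith(":ok"))
```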
State Machines
WFL State Machine
PENDING → RUNNING → GATHERING → COMPLETE
                         ↘ FAILED
| State | Description |
|---|---|
| PENDING | Workflow created, waiting for trigger or dependency |
| RUNNING | At least one task is actively executing |
| GATHERING | All tasks dispatched, waiting for final results |
| COMPLETE | All terminal tasks completed successfully |
| FAILED | At least one task failed and failure policy = abort |
PENDING → RUNNING: triggered when workflow inputs are available and at least one task has no unsatisfied dependencies.
RUNNING → GATHERING: last task dispatched, waiting for results.
GATHERING → COMPLETE: all terminal tasks report success.
GATHERING → FAILED: a task reports failure and the workflow failure policy is abort.
With failure policy skip, the failed task is marked failed and the workflow continues with the remaining tasks. With failure policy retry, the workflow engine requeues the task, up to a configurable maximum number of retries, before giving up on it.
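The transition list and the three failure policies can be encoded as a small table-driven state machine. This is a sketch under assumptions: `WorkflowRun`, the default of three retries, and the choice to treat exhausted retries like an abort are all hypothetical, since the text does not say what happens after the last retry. The transition table contains only the edges drawn in the diagram.

```python
from enum import Enum


class WflState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    GATHERING = "GATHERING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Legal transitions, exactly as drawn in the state diagram.
TRANSITIONS = {
    WflState.PENDING: {WflState.RUNNING},
    WflState.RUNNING: {WflState.GATHERING},
    WflState.GATHERING: {WflState.COMPLETE, WflState.FAILED},
    WflState.COMPLETE: set(),
    WflState.FAILED: set(),
}


class WorkflowRun:
    def __init__(self, policy: str, max_retries: int = 3) -> None:
        if policy not in ("retry", "skip", "abort"):
            raise ValueError(f"unknown failure policy {policy!r}")
        self.state = WflState.PENDING
        self.policy = policy
        self.max_retries = max_retries       # hypothetical default
        self.attempts: dict[str, int] = {}   # retries used per task
        self.failed_tasks: set[str] = set()

    def transition(self, new: WflState) -> None:
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new.name}")
        self.state = new

    def on_task_failure(self, task: str) -> str:
        """Apply the failure policy and return the engine's next action."""
        if self.policy == "skip":
            self.failed_tasks.add(task)      # mark failed, keep running the rest
            return "continue"
        if self.policy == "retry":
            used = self.attempts.get(task, 0)
            if used < self.max_retries:
                self.attempts[task] = used + 1
                return "requeue"
            # Assumption: a task that exhausts its retries is handled as an abort.
        self.failed_tasks.add(task)
        return "abort"                       # the engine then moves the workflow to FAILED
```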
WAG State Machine
IDLE → DISPATCHING → MONITORING → COLLECTING → IDLE
                           ↘ FAILOVER
| State | Description |
|---|---|
| IDLE | No active tasks |
| DISPATCHING | Selecting a worker and sending the task |
| MONITORING | Task is executing, Waggler is tracking heartbeats |
| COLLECTING | Task complete, collecting results |
| FAILOVER | Worker failed, reassigning to backup worker |
FAILOVER → DISPATCHING: Waggler reassigns the task to a different worker in its pool. Task state is preserved in the workflow engine — the new worker starts from the same input state.
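The WAG states for a single task, including the FAILOVER → DISPATCHING edge, can be sketched as follows. `TaskCoordinator` and its fields are hypothetical. The point it illustrates is the one in the paragraph above: the task input is held outside the worker, so a re-dispatch after failover sends the same input to a different worker from the pool.

```python
from enum import Enum


class WagState(Enum):
    IDLE = "IDLE"
    DISPATCHING = "DISPATCHING"
    MONITORING = "MONITORING"
    COLLECTING = "COLLECTING"
    FAILOVER = "FAILOVER"


WAG_TRANSITIONS = {
    WagState.IDLE: {WagState.DISPATCHING},
    WagState.DISPATCHING: {WagState.MONITORING},
    WagState.MONITORING: {WagState.COLLECTING, WagState.FAILOVER},
    WagState.COLLECTING: {WagState.IDLE},
    WagState.FAILOVER: {WagState.DISPATCHING},
}


class TaskCoordinator:
    """Follows one task through the WAG states (hypothetical helper).

    task_input stands for state owned by the workflow engine, so every
    (re)dispatch sends the same input."""

    def __init__(self, task_input: str, pool: list[str]) -> None:
        self.state = WagState.IDLE
        self.task_input = task_input
        self.pool = pool
        self.tried: list[str] = []

    def _go(self, new: WagState) -> None:
        if new not in WAG_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new.name}")
        self.state = new

    def dispatch(self) -> tuple[str, str]:
        untried = [w for w in self.pool if w not in self.tried]
        if not untried:
            raise RuntimeError("no backup worker left in the pool")
        self._go(WagState.DISPATCHING)
        worker = untried[0]
        self.tried.append(worker)
        self._go(WagState.MONITORING)
        return worker, self.task_input

    def worker_failed(self) -> None:
        self._go(WagState.FAILOVER)   # the next dispatch() picks a different worker

    def task_finished(self) -> None:
        self._go(WagState.COLLECTING)  # results would be forwarded to the engine here
        self._go(WagState.IDLE)
```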
Heartbeat Chains
Workers send heartbeats to their Waggler at configurable intervals (default: 5 seconds). The Waggler expects a heartbeat within a deadline window (default: 15 seconds = 3 missed heartbeats).
Missed heartbeat sequence:
- 1 missed heartbeat: log, continue monitoring
- 2 missed heartbeats: alert, begin selecting backup worker
- 3 missed heartbeats: declare worker dead, enter FAILOVER state
The heartbeat is not just a liveness signal — it carries a task progress report:
- Steps completed within the current task (for multi-step tasks)
- Estimated time to completion
- Current resource consumption
This progress data feeds the workflow engine's ETA calculations and dashboard reporting.
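The defaults and the escalation sequence above reduce to a few lines. In this sketch, the `Heartbeat` field names and the split of "resource consumption" into CPU and memory are hypothetical. Missed heartbeats are counted as whole 5-second intervals elapsed since the last heartbeat, so the 15-second deadline window corresponds to three misses.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 5.0  # default interval from the text
MISSED_LIMIT = 3            # 3 missed heartbeats = 15 s deadline window


@dataclass
class Heartbeat:
    worker_id: str
    seq: int
    steps_completed: int  # progress within a multi-step task
    eta_s: float          # estimated time to completion
    cpu_percent: float    # hypothetical resource-consumption field
    mem_mb: float         # hypothetical resource-consumption field


def missed_heartbeats(last_seen_s: float, now_s: float) -> int:
    """Whole heartbeat intervals elapsed since the last heartbeat arrived."""
    return int((now_s - last_seen_s) // HEARTBEAT_INTERVAL_S)


def escalation(missed: int) -> str:
    """Map the missed-heartbeat count to the Waggler's response."""
    if missed >= MISSED_LIMIT:
        return "declare-dead-enter-failover"
    if missed == 2:
        return "alert-select-backup"
    if missed == 1:
        return "log-continue-monitoring"
    return "ok"
```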
Heartbeat Chain Continuity
Heartbeats are sequenced with a monotonically increasing sequence number. A heartbeat with a sequence number lower than expected indicates a replay or network reordering. A gap in the sequence (sequence 5 received after sequence 3, with sequence 4 never arriving) is treated as one missed heartbeat rather than as a failure, since the later heartbeat shows the worker is still alive. This prevents false failovers from brief network disruptions.
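A per-worker sequence tracker following these rules might look like this. `HeartbeatSequence` is hypothetical, and two behaviours are assumptions the text does not state: stale (lower-than-expected) heartbeats are ignored, and a gap of any width counts as a single missed heartbeat, extrapolating from the one-heartbeat example.

```python
class HeartbeatSequence:
    """Tracks sequence numbers from one worker (hypothetical helper)."""

    def __init__(self) -> None:
        self.expected = 1
        self.missed = 0  # gaps observed so far

    def receive(self, seq: int) -> str:
        if seq < self.expected:
            # Replay or reordering. Assumption: ignore it, and do not treat it
            # as fresh evidence of liveness.
            return "stale"
        if seq > self.expected:
            # Assumption: any gap counts as ONE missed heartbeat, however wide.
            self.missed += 1
            verdict = "gap"
        else:
            verdict = "in-order"
        self.expected = seq + 1
        return verdict


s = HeartbeatSequence()
print([s.receive(n) for n in (1, 2, 3, 5, 4, 6)])
# ['in-order', 'in-order', 'in-order', 'gap', 'stale', 'in-order']
```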
WAG Failover
When a Waggler fails, its in-progress tasks are at risk. WAG failover ensures no work is lost.
Detection: Other nodes detect Waggler failure through the mesh's heartbeat monitoring (each Waggler is itself a mesh node subject to HCTH trust evaluation and Sentinel monitoring).
Failover trigger: A Waggler that stops sending its mesh-level heartbeat (the heartbeat it sends to the mesh, not the heartbeats exchanged with its workers) triggers the WAG failover protocol.
Failover execution:
- Workflow engine queries remaining Wagglers for capability match
- Surviving Waggler with matching capability and available capacity is selected as successor
- Workflow engine transfers all pending tasks from the failed Waggler's queue to the successor
- In-flight tasks (tasks a worker was actively executing at failover time) are re-evaluated: if the worker is still alive and sending heartbeats to the mesh, the workflow engine sends the worker's contact information to the successor Waggler, which inherits the monitoring relationship; if the worker is dead, the task is re-dispatched from the input state held by the workflow engine
Task state persistence: Task state lives in the workflow engine, not the Waggler. This is the key design decision. A Waggler that fails loses its routing state, but the workflow engine retains all task state. Failover is coordination recovery, not data recovery.
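The four failover steps can be sketched from the workflow engine's point of view. Everything here is hypothetical: `WagglerView`, `fail_over`, "most free capacity" as the successor rule, and the `worker_alive` map standing in for mesh heartbeat status. In-flight tasks whose worker is dead are re-queued for re-dispatch, consistent with the document's rule that a dead worker's task is simply re-dispatched. Only task ids move; task state never leaves the engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WagglerView:
    """What the workflow engine tracks about one Waggler (hypothetical shape)."""
    name: str
    capability: str
    capacity: int                                            # free task slots
    alive: bool = True
    queue: list[str] = field(default_factory=list)           # pending task ids
    in_flight: dict[str, str] = field(default_factory=dict)  # task id -> worker id


def fail_over(failed: WagglerView, wagglers: list[WagglerView],
              worker_alive: dict[str, bool]) -> WagglerView:
    # Steps 1-2: a surviving Waggler with the same capability and spare capacity.
    candidates = [w for w in wagglers
                  if w is not failed and w.alive
                  and w.capability == failed.capability and w.capacity > 0]
    if not candidates:
        raise RuntimeError(f"no successor for capability {failed.capability!r}")
    successor = max(candidates, key=lambda w: w.capacity)

    # Step 3: transfer every pending task id; the task state stays in the engine.
    successor.queue.extend(failed.queue)
    failed.queue.clear()

    # Step 4: re-evaluate in-flight tasks.
    for task_id, worker_id in failed.in_flight.items():
        if worker_alive.get(worker_id, False):
            successor.in_flight[task_id] = worker_id  # inherit the monitoring relationship
        else:
            successor.queue.append(task_id)           # re-dispatch from preserved input
    failed.in_flight.clear()
    return successor
```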
Thread-Based Ephemeral Workers
Workers are thread-based and ephemeral. Each task spawns a worker thread. The worker executes the task in isolation, reports the result, and terminates. There is no persistent worker process.
Benefits:
- No state accumulation: a worker cannot be corrupted by previous tasks
- Clean blast radius: a worker that crashes affects only its current task
- Trivial scaling: adding worker capacity means allowing more concurrent threads (up to the node's resource limits)
- Simple failover: a dead worker has no state to recover — just re-dispatch the task
Tradeoff: thread startup overhead per task. For short tasks (under ~100ms), this overhead is significant. roboNet mitigates it with a thread pool per capability class: idle threads wait for tasks instead of a fresh thread being spawned for each task.
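One way to get a pool per capability class while keeping per-task state ephemeral is one `concurrent.futures.ThreadPoolExecutor` per class, with each submission building its own task-local state. The pool sizes and the `submit_task` helper are hypothetical. A task that raises does not take down the pooled thread: the executor stores the exception on the returned `Future`, which keeps the blast radius to that one task.

```python
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Hypothetical pool sizes per capability class, capped by node resources.
POOL_SIZES = {"ai-inference": 2, "code-execution": 8}

POOLS = {
    cap: ThreadPoolExecutor(max_workers=size, thread_name_prefix=cap)
    for cap, size in POOL_SIZES.items()
}


def submit_task(capability: str, fn: Callable[[dict], Any], task_input: dict) -> Future:
    """Run one task on a pooled thread using only task-local state."""

    def worker() -> Any:
        scratch = dict(task_input)  # shallow copy built per task, dropped on return
        return fn(scratch)          # an exception here is stored on the Future

    return POOLS[capability].submit(worker)
```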
Integration Points
With Capability-Based Routing
The workflow engine does not know which Waggler to send a task to. It knows the task's capability requirements. The capability routing layer resolves capability requirements to the set of Wagglers that can satisfy them. The workflow engine then selects from the returned Waggler set based on availability.
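The split of responsibilities can be sketched as two functions: the routing layer resolves a capability to candidate Wagglers, and the workflow engine picks among them by availability. The index contents, the Waggler names, and "most free slots" as the availability rule are all hypothetical.

```python
from __future__ import annotations

# Hypothetical capability index kept by the routing layer: Waggler -> capability class.
CAPABILITY_INDEX = {
    "wag-ai-1": "ai-inference",
    "wag-ai-2": "ai-inference",
    "wag-exec-1": "code-execution",
}


def resolve(required: str, index: dict[str, str]) -> list[str]:
    """Routing layer: every Waggler that serves the required capability class."""
    return sorted(w for w, cap in index.items() if cap == required)


def choose(candidates: list[str], free_slots: dict[str, int]) -> str | None:
    """Workflow engine: the candidate with the most free capacity (ties by name)."""
    available = [w for w in candidates if free_slots.get(w, 0) > 0]
    if not available:
        return None
    return min(available, key=lambda w: (-free_slots[w], w))


candidates = resolve("ai-inference", CAPABILITY_INDEX)
target = choose(candidates, {"wag-ai-1": 0, "wag-ai-2": 3})
```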
With HCTH Trust
Task dispatch to a Waggler requires a minimum trust threshold (configurable, default: Established tier). A Waggler with low trust receives fewer task dispatches and eventually none until its trust is rebuilt. This prevents compromised Wagglers from collecting sensitive tasks.
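A sketch of threshold-gated, trust-weighted dispatch, assuming HCTH trust can be read as a numeric score. The scores, the value 0.6 for the Established tier, and weighting picks in proportion to trust are all illustrative assumptions, not HCTH's actual scale or policy. Below the threshold, a Waggler gets no dispatches at all; above it, lower trust means proportionally fewer.

```python
from __future__ import annotations

import random

# Hypothetical numeric trust scores in [0, 1]. The Established tier
# threshold is assumed to be 0.6 purely for illustration.
ESTABLISHED = 0.6


def eligible(trust: dict[str, float], threshold: float = ESTABLISHED) -> dict[str, float]:
    """Wagglers below the threshold receive no dispatches at all."""
    return {w: t for w, t in trust.items() if t >= threshold}


def pick_by_trust(trust: dict[str, float], rng: random.Random) -> str | None:
    """Trust-weighted pick: lower trust means proportionally fewer dispatches."""
    pool = eligible(trust)
    if not pool:
        return None
    names = sorted(pool)
    return rng.choices(names, weights=[pool[n] for n in names], k=1)[0]


rng = random.Random(7)
picks = [pick_by_trust({"wag-a": 0.9, "wag-b": 0.65, "wag-c": 0.3}, rng) for _ in range(1000)]
```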
With Sentinel
Wagglers are mesh nodes subject to Sentinel behavioral monitoring. A Waggler that begins exhibiting anomalous behavior (unusual task acceptance patterns, modified results, anomalous resource usage) is quarantined. Quarantined Wagglers receive no new task dispatches. In-progress task assignments are migrated through the failover protocol.
With MVC Query Engine
The MVC query engine uses WFL/WAG for its scatter-gather execution. A query is a workflow with one task per capable node (scatter) and a synthesis task that depends on all scatter tasks completing (gather). The WAG layer handles the actual dispatch to capable nodes.
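The shape of such a query workflow is easy to build as a task-to-prerequisites map: one scatter task per capable node and one synthesis task depending on all of them. The task naming scheme and `query_workflow` are hypothetical. A topological sort from `graphlib` confirms that synthesis can only come last.

```python
from graphlib import TopologicalSorter


def query_workflow(query_id: str, capable_nodes: list[str]) -> dict[str, set[str]]:
    """Task -> prerequisites: one scatter task per capable node, plus a
    synthesis task that depends on every scatter task (the gather step)."""
    if not capable_nodes:
        raise ValueError("a query needs at least one capable node")
    scatter = {f"{query_id}/query@{node}": set() for node in capable_nodes}
    synthesis = f"{query_id}/synthesize"
    return {**scatter, synthesis: set(scatter)}


dag = query_workflow("q42", ["node-1", "node-2", "node-3"])
order = list(TopologicalSorter(dag).static_order())
print(order[-1])  # q42/synthesize
```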