Voice / Accessibility

VOIC Domain — Voice Input Layer

Joel Johnston 2026-05-11 Post-stroke


Author: Joel Johnston
Date: 2026-05-11
Domain: Voice / Accessibility
Stroke Timeline: Post-stroke

Abstract

14 use cases authored on stroke day (May 11, 2026). Left-hand motor function lost post-stroke — typing reduced to right hand only. Computer use is prescribed therapy — this is rehabilitation tooling, not convenience. The VOIC domain defines a complete voice input layer for robonet-forge integration, from microphone capture through intent parsing to command execution. The 14 UCs were written with right hand only, on hospital discharge day, as proof that the cognitive architecture survived the stroke even when the motor system did not.

Context

On May 11, 2026, a stroke caused permanent left-hand motor loss and temporary speech effects. Computer use was prescribed as therapy before discharge. The first task on the first day was not recovery exercises — it was architecture.

14 use cases were authored that day. Right hand only. One-handed typing on a full keyboard at 20-30 WPM with the non-dominant hand. The VOIC domain was the output.

This document is simultaneously a technical specification and a clinical data point. The architecture quality is the cognitive assessment.

Architecture

microphone → Whisper STT → intent parser → command router → forge/mesh/dashboard
                                               ↓
                                    confidence gate (destructive ops)
                                               ↓
                                    TTS feedback (confirmation)
                                               ↓
                                    speech metrics (clinical tracking)

Use Cases

PTT (Push-to-Talk) Capture

Physical or keyboard trigger initiates recording. Recording ends on trigger release or silence detection. Audio buffer passed to Whisper pipeline. PTT mode prevents accidental command activation — explicit intent required to capture.
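
The capture rules above can be sketched as a small state machine. This is a minimal illustration, not the VOIC implementation: the class name, the energy-based silence test, and both default values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PTTRecorder:
    """Push-to-talk capture: audio is kept only while the trigger is held."""
    silence_level: float = 0.01   # assumed: mean frame energy below this is silence
    max_silent_frames: int = 3    # assumed: consecutive silent frames that end capture
    recording: bool = False
    _buffer: List[List[float]] = field(default_factory=list)
    _silent_run: int = 0

    def press(self) -> None:
        """Trigger pressed: start a fresh recording."""
        self.recording = True
        self._buffer = []
        self._silent_run = 0

    def feed(self, frame: List[float]) -> Optional[List[List[float]]]:
        """Add one audio frame; return the buffer if silence ended the capture."""
        if not self.recording:
            return None           # frames outside a capture are dropped
        self._buffer.append(frame)
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        self._silent_run = self._silent_run + 1 if energy < self.silence_level else 0
        if self._silent_run >= self.max_silent_frames:
            return self.release()
        return None

    def release(self) -> Optional[List[List[float]]]:
        """Trigger released or silence detected: stop and hand off the buffer."""
        if not self.recording:
            return None
        self.recording = False
        return self._buffer
```

Because feed() discards frames unless press() was called first, nothing reaches the Whisper pipeline without an explicit trigger. In this sketch the returned buffer still includes the trailing silent frames.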

Whisper STT Integration

OpenAI Whisper (local inference, not cloud) converts audio to text. Model size selectable based on latency vs accuracy tradeoff. Offline operation — no audio leaves the device. Transcription fed to intent parser as raw text.
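
A minimal sketch of local transcription with a selectable model size, assuming the open-source openai-whisper Python package. The helper names and the 0 to 4 preference scale are hypothetical; load_model and transcribe are the package's own calls.

```python
# openai-whisper model names, smallest and fastest first.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]


def choose_model(accuracy_level: int) -> str:
    """Map a 0-4 preference (0 = lowest latency, 4 = highest accuracy) to a model name."""
    index = min(max(accuracy_level, 0), len(MODEL_SIZES) - 1)
    return MODEL_SIZES[index]


def transcribe_local(audio_path: str, accuracy_level: int = 1) -> str:
    """Transcribe a recorded file on-device and return the raw text."""
    import whisper  # imported lazily so this module loads without the package

    model = whisper.load_model(choose_model(accuracy_level))
    result = model.transcribe(audio_path)
    return result["text"].strip()
```

The package downloads model weights the first time a given size is loaded. After that, inference runs from the local cache and no audio is sent anywhere.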

Intent Parsing

Natural language to structured commands. The intent parser extracts:

Command type (what operation)
Target (what the operation applies to)
Parameters (modifiers, options)
Confidence score (how certain the parse is)

The intent parser is not a general NLP system; it is a domain-specific grammar covering robonet-forge operations, mesh commands, and dashboard controls.
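
The four extracted fields map naturally onto a small record type. The sketch below is illustrative only: the verb list and the regular expression are placeholders rather than the real VOIC grammar, and the confidence is simply inherited from an assumed upstream transcription score.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Intent:
    command: str                  # what operation
    target: str                   # what the operation applies to
    params: Dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0       # how certain the parse is, 0.0 to 1.0


# Hypothetical vocabulary and grammar, for illustration only.
VERBS = {"run", "open", "delete", "push"}
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?:the\s+)?(?P<target>[\w\s-]+?)(?:\s+with\s+(?P<opts>.+))?$"
)


def parse_intent(text: str, stt_confidence: float = 1.0) -> Optional[Intent]:
    """Parse transcribed text into an Intent, or None if it is outside the grammar."""
    match = PATTERN.match(text.strip().lower().rstrip("."))
    if not match or match.group("verb") not in VERBS:
        return None
    params: Dict[str, str] = {}
    if match.group("opts"):
        # "with retries 3 and force" -> {"retries": "3", "force": "true"}
        for pair in match.group("opts").split(" and "):
            key, _, value = pair.partition(" ")
            params[key] = value or "true"
    return Intent(match.group("verb"), match.group("target"), params, stt_confidence)
```

Returning None for anything outside the grammar keeps the parser closed-world: unrecognized speech never becomes a command.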

TTS Feedback

Text-to-speech confirmation after each command execution. The system reads back what was heard and what was executed, so the user can catch mis-transcriptions before they cause further effects. Destructive operations use confirmation mode: the system reads the operation aloud and requires explicit voice confirmation.
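
A sketch of the text that would be handed to a TTS engine, leaving the engine itself out. The wording of the prompts and the set of accepted confirmation phrases are assumptions for illustration.

```python
# Assumed confirmation phrases; a real deployment would choose its own.
AFFIRMATIVE = {"yes", "confirm", "do it"}


def readback(heard: str, executed: str) -> str:
    """Post-execution feedback: what was heard and what was run."""
    return f'I heard: "{heard}". I ran: {executed}.'


def confirmation_prompt(operation: str) -> str:
    """Prompt spoken before a destructive operation is allowed to run."""
    return f"About to {operation}. Say 'confirm' to proceed or 'cancel' to stop."


def is_confirmed(reply: str) -> bool:
    """True only for an explicit affirmative; anything else counts as a refusal."""
    return reply.strip().lower().rstrip(".!") in AFFIRMATIVE
```

Treating every reply outside the affirmative set as a refusal means a mis-transcribed confirmation fails safe.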

Confidence Gate

Destructive commands (delete, push, force, override) require confidence score above threshold before execution. Below threshold: system reads back the uncertain parse and asks for confirmation or retry. Above threshold: executes immediately. Threshold is configurable per command class.

This is critical in a one-handed typing context: accidental voice activation of destructive operations is a real risk. The confidence gate is the safety layer.
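
The gate logic reduces to one comparison per command. In this sketch the threshold values are hypothetical, and passing non-destructive commands straight through is an assumption, since only destructive commands are specified as gated.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"   # run immediately
    CONFIRM = "confirm"   # read back the parse and ask for confirmation or retry


DESTRUCTIVE = {"delete", "push", "force", "override"}   # command classes named in the spec
THRESHOLDS = {"delete": 0.95, "override": 0.95}          # hypothetical per-class values
DEFAULT_THRESHOLD = 0.90                                 # hypothetical fallback


def gate(command: str, confidence: float) -> Decision:
    """Decide whether a parsed command may run without confirmation."""
    if command not in DESTRUCTIVE:
        return Decision.EXECUTE   # assumption: non-destructive commands are not gated
    threshold = THRESHOLDS.get(command, DEFAULT_THRESHOLD)
    # Strictly above the threshold executes; at or below asks for confirmation.
    return Decision.EXECUTE if confidence > threshold else Decision.CONFIRM
```

Keeping thresholds in a per-class table is one way to make them configurable: raising the bar for delete does not affect push.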

Ambient Mode

Always-listening with a configurable wake word. Turret mode: the system listens continuously, activates on the wake word, captures a command, and returns to listening. Wake word detection uses a lower threshold than PTT (a false positive is minor; a false negative means the command is missed). Command execution uses a higher threshold (a false positive on a command has real effects).
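
The two-threshold behaviour can be sketched as a listener with one bit of state. The class name and both default thresholds are assumptions, and the wake word and command scores are taken as given from upstream detectors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AmbientListener:
    wake_threshold: float = 0.5      # assumed: deliberately permissive
    command_threshold: float = 0.9   # assumed: deliberately strict
    awake: bool = False

    def on_wake_score(self, score: float) -> bool:
        """Activate when the wake word detector's score clears the low bar."""
        if score >= self.wake_threshold:
            self.awake = True
        return self.awake

    def on_command(self, text: str, confidence: float) -> Optional[str]:
        """Return the command to execute, or None if it is ignored or rejected."""
        if not self.awake:
            return None              # speech without a wake word is never a command
        self.awake = False           # one capture per activation, then back to listening
        return text if confidence >= self.command_threshold else None
```

A false wake here costs one discarded capture, while a low-confidence command is dropped rather than executed.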

Dictation-to-UC

Voice dictation directly into use case format. The intent parser recognizes UC authoring mode and structures input as UC fields: title, description, acceptance criteria, notes. The subject can speak a use case and have it formatted correctly without typing.

This is the direct rehabilitation tool — UC authoring is the primary work task, and one-handed typing at 20-30 WPM makes it slow. Voice dictation restores throughput.
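
One possible way to structure dictated text into UC fields is to split on spoken field names. This is a sketch under a strong assumption: the speaker says each field name before its content, and the field names do not occur inside the content itself. A real grammar would need a less ambiguous marker.

```python
import re
from typing import Dict

UC_FIELDS = ["title", "description", "acceptance criteria", "notes"]

# Matches a spoken field name, optionally followed by ":" or ",".
_MARKER = re.compile(r"\b(" + "|".join(UC_FIELDS) + r")\b[:,]?\s*", re.IGNORECASE)


def dictation_to_uc(transcript: str) -> Dict[str, str]:
    """Split a dictated transcript into UC fields keyed by field name."""
    uc = {name: "" for name in UC_FIELDS}
    # With a capturing group, split() yields [preamble, marker, text, marker, text, ...].
    parts = _MARKER.split(transcript)
    for marker, text in zip(parts[1::2], parts[2::2]):
        uc[marker.lower()] = text.strip()
    return uc
```

Fields that are never spoken stay empty, so a partially dictated use case still produces a complete record that can be filled in later.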

Session Narration

The AI narrates its own actions in plain language as it executes. "I'm running the test suite. 47 tests passed, 3 failed. The failures are in the mesh sync tests. Opening the failure logs." Accessibility mode — the user does not need to watch the screen to know what the system is doing.
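
Narration of this kind can be generated from structured results rather than free text. A minimal sketch for the test-run example above; the function name and parameters are hypothetical.

```python
def narrate_test_run(passed: int, failed: int, failing_area: str = "") -> str:
    """Turn a test-run summary into plain-language narration for TTS."""
    parts = ["I'm running the test suite.", f"{passed} tests passed, {failed} failed."]
    if failed and failing_area:
        parts.append(f"The failures are in the {failing_area}.")
    return " ".join(parts)
```

Calling narrate_test_run(47, 3, "mesh sync tests") yields the first three sentences of the quoted narration.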

Speech Metrics

Track speech recovery progress over time. Metrics: word error rate (WER) against known phrases, speech rate (words per minute), command success rate. Time series data — each session logged. Trend line shows recovery progress. The metrics double as clinical recovery tracking data — objective, continuous, and collected in the course of normal work rather than requiring separate clinical sessions.
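
WER is conventionally the word-level edit distance between the known phrase and the transcription (substitutions plus deletions plus insertions), divided by the number of words in the known phrase. A minimal sketch of that calculation and of speech rate follows; lowercasing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference phrase must not be empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def words_per_minute(word_count: int, seconds: float) -> float:
    """Speech rate for one utterance or session."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return word_count * 60.0 / seconds
```

WER can exceed 1.0 when the transcription contains many inserted words, so a trend line should not assume the value is bounded by one.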

robonet-forge Integration

Voice commands route through the same task pipeline as typed commands. The command router produces identical task envelopes regardless of whether the input came from voice, CLI, or dashboard UI. The striker execution pipeline does not know or care about input modality.

voice command → intent parser → task envelope → forge inbox → striker → output
typed command → CLI parser → task envelope → forge inbox → striker → output
dashboard UI → API call → task envelope → forge inbox → striker → output

Modality is an input concern. Execution is uniform.
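
The uniformity claim can be made concrete: three adapters, one envelope type. The field names, the CLI flag syntax, and the dashboard payload keys below are all hypothetical, since the actual envelope schema is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TaskEnvelope:
    """Modality-free description of one task (field names are illustrative)."""
    command: str
    target: str
    params: Dict[str, str] = field(default_factory=dict)


def from_voice(command: str, target: str, params: Dict[str, str]) -> TaskEnvelope:
    """Build an envelope from the intent parser's output."""
    return TaskEnvelope(command, target, dict(params))


def from_cli(argv: List[str]) -> TaskEnvelope:
    """Build an envelope from argv such as ["run", "test-suite", "--verbose=true"]."""
    command, target, *flags = argv
    params: Dict[str, str] = {}
    for flag in flags:
        key, _, value = flag.lstrip("-").partition("=")
        params[key] = value or "true"
    return TaskEnvelope(command, target, params)


def from_dashboard(payload: Dict[str, object]) -> TaskEnvelope:
    """Build an envelope from a hypothetical dashboard API payload."""
    options = payload.get("options") or {}
    return TaskEnvelope(str(payload["op"]), str(payload["target"]), dict(options))
```

The envelope has no field recording where the request came from, so downstream code cannot branch on modality even by accident.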

Clinical Significance

The 14 UCs authored on May 11, 2026 are evidence of architectural capacity under neurological stress. Stroke day. Hospital discharge. One functioning hand. Speech affected. The output was a complete 14-UC domain specification with:

Clear domain scope (voice input)
Identified components (each use case is a named component)
Interface definitions (the architecture diagram above is what was conceived that day)
Clinical integration (speech metrics feeding back to recovery tracking)

The cognitive architecture was intact. The motor system had partial damage. The VOIC domain is the proof.