What tends to work better than a natural language decision tree:
- Explicit capability grants: agent starts with zero authority, specific actions are granted not inferred
- Threshold rules over judgment calls: not 'financial decisions' but '$X or more, always ask' (deterministic)
- Audit-first for new capabilities: first N times an agent exercises a new type of authority, log for review before executing
- Veto primitives: a way to interrupt mid-execution, not just pre-approve
The subtle failure mode to watch: an agent that gradually expands its interpretation of what's in scope because context accumulates and past decisions look like permission. It doesn't ask because prior runs didn't require asking.
The heartbeat/orchestration pattern you're using (30-min loop, sub-agents by function) is solid architecture. The authorization layer is usually what causes the hard-to-debug incidents. What did 'broke' look like when it happened?
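
Added example, not part of the original reply: a minimal Python authorization gate combining the four mechanisms listed above (default-deny grants, a deterministic threshold, audit-first for the first N uses, and a veto checked between steps). The class, action names and dollar values are hypothetical and constructed for illustration.

```python
import threading
from dataclasses import dataclass, field

ALLOW, ASK, DENY = "allow", "ask_human", "deny"

@dataclass
class Grant:
    ask_at_or_above: float = float("inf")  # deterministic threshold ("$X or more, always ask")
    audit_first_n: int = 0                 # first N uses are held for review
    uses: int = 0

@dataclass
class AuthGate:
    grants: dict = field(default_factory=dict)     # empty dict = zero authority
    audit_log: list = field(default_factory=list)
    veto: threading.Event = field(default_factory=threading.Event)

    def grant(self, action, **rules):
        self.grants[action] = Grant(**rules)

    def decide(self, action, amount=0.0):
        # The verdict depends only on the grant table and counters, never on
        # conversation context, so past approvals cannot widen scope.
        g = self.grants.get(action)
        if g is None:
            verdict = DENY                         # not granted means not allowed
        elif amount >= g.ask_at_or_above:
            verdict = ASK
        else:
            g.uses += 1
            verdict = ASK if g.uses <= g.audit_first_n else ALLOW
        self.audit_log.append((action, amount, verdict))
        return verdict

    def run(self, action, steps, amount=0.0):
        """Execute steps only on ALLOW; re-check the veto before every step."""
        if self.decide(action, amount) != ALLOW:
            return []
        done = []
        for step in steps:
            if self.veto.is_set():                 # interrupt mid-execution
                self.audit_log.append((action, amount, "vetoed"))
                break
            done.append(step())
        return done

if __name__ == "__main__":
    gate = AuthGate()
    gate.grant("send_payment", ask_at_or_above=100.0, audit_first_n=2)  # constructed values
    for amt in (5, 5, 5, 250):
        print(amt, gate.decide("send_payment", amt))
    print(gate.decide("delete_records"))           # never granted
```
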