Common failure modes I've seen: no visibility into what the agent did step-by-step, surprise LLM bills from untracked token usage, risky outputs going undetected, and no audit trail for post-mortems.
I've been building AgentShield (https://useagentshield.com) — an observability SDK for AI agents. It does execution tracing, risk detection on outputs, cost tracking per agent/model, and human-in-the-loop approval for high-risk actions. Plugs into LangChain, CrewAI, and OpenAI Agents SDK with a 2-line integration.
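For anyone rolling their own instead, the two core ideas here (step-by-step execution tracing and per-call cost accounting) are small enough to sketch. This is a hypothetical illustration, not AgentShield's actual API: `CostTracker`, `trace_step`, and the `PRICE_PER_1K` rate are all made-up names.

```python
# Hypothetical sketch of execution tracing + cost tracking. NOT AgentShield's API;
# CostTracker, trace_step, and PRICE_PER_1K are invented for illustration.
import functools
import time

PRICE_PER_1K = {"gpt-4o": 0.005}  # assumed $/1K tokens, illustrative only

class CostTracker:
    def __init__(self):
        self.steps = []       # ordered trace of what the agent did
        self.total_cost = 0.0

    def trace_step(self, model):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.time()
                # convention here: the wrapped fn returns (output, tokens_used)
                result, tokens = fn(*args, **kwargs)
                cost = tokens / 1000 * PRICE_PER_1K[model]
                self.total_cost += cost
                self.steps.append({
                    "step": fn.__name__,
                    "model": model,
                    "tokens": tokens,
                    "cost": cost,
                    "seconds": time.time() - start,
                })
                return result
            return wrapper
        return decorator

tracker = CostTracker()

@tracker.trace_step(model="gpt-4o")
def summarize(text):
    # stand-in for a real LLM call; returns (output, tokens_used)
    return text[:10], 2000

summarize("some long customer document...")
```

The point of the decorator shape is that the trace and the bill come from the same hook, so they can never disagree about what the agent did.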
Curious what others are using. Rolling your own monitoring? LangSmith? Langfuse? Or just hoping for the best?
The LLM never acts on production. It builds scripts, and you review those scripts with the greatest care.
Clone your customer data and run everything against the clone first.
Treat the LLM as a dangerous tool: assume it will fail every time it gets the chance.
Even with all these LLM-specific habits, you still get a 100x productivity gain.
Each of these precautions can be implemented by LLMs, for LLMs, in many ways. It's almost free; just plan for it.
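The "assume it will fail" habit can be sketched concretely: run the LLM-built script in its own subprocess with a timeout, and only accept a clean exit. Everything below is a minimal sketch under those assumptions; the script text is a stand-in for whatever the LLM generated.

```python
# Minimal "assume the script will fail" sketch: execute LLM-generated code in a
# separate process with a timeout, never exec() it in-process, and treat anything
# but a clean exit as the expected failure case.
import os
import subprocess
import sys
import tempfile

def run_untrusted(script_text, timeout=30):
    """Run an LLM-built script in a subprocess; return (ok, stdout, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script_text)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.returncode == 0, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        # A hung script is a failure too, not something to wait out.
        return False, "", "timed out"
    finally:
        os.unlink(path)

ok, out, err = run_untrusted("print('dry run against cloned data')")
bad, _, _ = run_untrusted("raise SystemExit(1)")
```

Point this at the cloned dataset, never at production, and the worst case is a failed subprocess instead of a damaged system.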
Some of the things you mention are more often addressed by guardrails. Others (quality) require an eval to produce the measurement, but the results can feed into the same monitoring stack.
The gap isn't monitoring. It's what happens automatically when degradation gets detected. Right now the answer for every team I've talked to is "page a human." That human reads logs, guesses, deploys a fix. The system already shifted while they were debugging.
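That "act automatically, page second" gap can be sketched as a tiny remediation policy: if the error rate crosses a threshold against baseline, roll back first, then notify. The names here (`remediate`, the 2x threshold, the `rollback`/`notify` callbacks) are all hypothetical stand-ins for whatever your deploy tooling exposes.

```python
# Hypothetical sketch of automatic remediation on detected degradation.
# The 2x-baseline threshold and the callback names are illustrative assumptions.
def remediate(error_rate, baseline, rollback, notify):
    if error_rate > 2 * baseline:
        rollback()                      # the automatic action happens first
        notify("auto-rolled-back")      # the human gets a fait accompli, not raw logs
        return "rolled_back"
    return "ok"

events = []
status = remediate(
    error_rate=0.12, baseline=0.05,
    rollback=lambda: events.append("rollback"),
    notify=events.append,
)
```

The design choice is that the human is downstream of the fix rather than on its critical path; debugging happens after the system is already safe.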