Sentiox

Synthetic monitoring for AI phone agents — test calls on a schedule, judged by Claude, verified by callback.

5 Cloudflare Workers in the pipeline · 5 D1 tables behind them · 6 failure categories the judge can cite

The problem

Pharmacies and clinics are handing their front desks to AI voice agents. When one starts failing — telling a patient a refill went through when it didn't — nobody finds out until the patient does. I work in voice support every day, and this is the failure mode that worries me most: the agent sounds confident, the dashboards stay green, and the answer is wrong.

Sentiox is my answer: Datadog for AI phone agents. A Cloudflare Workers pipeline places scheduled test calls to the customer's voice agent through Telnyx, records and transcribes them, then has Claude judge each transcript against expected behavior — pass, warn, or fail, with a score and evidence quotes tied to exact transcript lines.

The part I care about most is Tier 2 callback verification. About 20 minutes after a test call, Sentiox calls back and asks the agent whether the promised action actually happened. An agent that confirms a refill and never files it gets caught within the hour, not at the pharmacy counter a week later.

How it works

Every five minutes a cron Worker wakes up and finds the scenarios that are due — hourly, daily, or weekly cadence, per scenario.
The call-engine places a real phone call to the customer's voice agent through Telnyx. Each call gets dual-channel recording, transcription, and a run ID like SX-20260704-D4K2Q9.
The evaluator hands the transcript to Claude with a strict JSON rubric: pass, warn, or fail; a 0–1 score; six failure categories (hallucination, wrong info, no answer, no escalation, overpromise, drift); evidence quotes pinned to exact transcript lines.
The alerter fires on regressions — before a patient hears the mistake.
Tier 2 closes the loop. About 20 minutes later, a second call asks the agent whether the thing it promised actually happened, and the result is recorded as confirmed, denied, or inconclusive. Hallucinated confirmations don't survive it.

The whole run costs about $0.03. Five Cloudflare Workers, wired with service bindings, doing the QA a pharmacy's patients would otherwise do for free.
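To make the scheduling step concrete, here is a minimal sketch of how a five-minute cron tick could pick the scenarios that are due. It is an illustration under stated assumptions, not the production scheduler: the `Scenario` shape, its field names, and the `dueScenarios` helper are all hypothetical.

```typescript
// Hypothetical shapes: field names are assumptions made for illustration.
type Cadence = "hourly" | "daily" | "weekly";

interface Scenario {
  id: string;
  cadence: Cadence;
  lastRunAt: number | null; // epoch milliseconds, or null if never run
}

const HOUR_MS = 60 * 60 * 1000;

// Minimum gap between two runs of the same scenario, per cadence.
const CADENCE_MS: Record<Cadence, number> = {
  hourly: HOUR_MS,
  daily: 24 * HOUR_MS,
  weekly: 7 * 24 * HOUR_MS,
};

// A scenario is due if it has never run or its cadence interval has elapsed.
function isDue(scenario: Scenario, now: number): boolean {
  return (
    scenario.lastRunAt === null ||
    now - scenario.lastRunAt >= CADENCE_MS[scenario.cadence]
  );
}

// Called on each cron tick with the current time.
function dueScenarios(all: Scenario[], now: number): Scenario[] {
  return all.filter((s) => isDue(s, now));
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check deterministic and easy to test.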

See it working

A scripted replay of a real failure mode on fabricated data — invented pharmacy, 555 number, fake run ID. The rubric and the verdict shape mirror the production evaluator.

SX-20260704-D4K2Q9 · Magnolia Family Pharmacy · scenario: Weekend hours + refill confirmation · demo data

Open the demo dashboard →

Under the hood

Claude as judge

The evaluator Worker has Claude judge each transcript against a strict JSON rubric: pass/warn/fail, a 0–1 score, six failure categories (hallucination, wrong info, no answer, no escalation, overpromise, drift), and evidence quotes pinned to exact transcript lines. Every verdict cites its receipts.
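One way to picture the rubric is as a typed verdict plus a validator that rejects anything out of shape. The sketch below is an assumption about what such a check could look like, not the production evaluator: the field names, the snake_case category identifiers, the 1-based line numbering, and the `parseVerdict` helper are all hypothetical.

```typescript
// Category identifiers are assumed spellings of the six categories named above.
const FAILURE_CATEGORIES = [
  "hallucination",
  "wrong_info",
  "no_answer",
  "no_escalation",
  "overpromise",
  "drift",
] as const;
type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

interface Evidence {
  line: number; // 1-based transcript line (an assumption)
  quote: string; // text that must appear on that line
}

interface Verdict {
  status: "pass" | "warn" | "fail";
  score: number; // 0 to 1 inclusive
  categories: FailureCategory[];
  evidence: Evidence[];
}

// Parses the judge's raw JSON and throws if it breaks the rubric.
function parseVerdict(raw: string, transcriptLines: string[]): Verdict {
  const v = JSON.parse(raw) as Partial<Verdict>;
  const status = v.status;
  const score = v.score;
  if (status !== "pass" && status !== "warn" && status !== "fail") {
    throw new Error("status must be pass, warn, or fail");
  }
  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new Error("score must be a number between 0 and 1");
  }
  const categories = v.categories ?? [];
  for (const c of categories) {
    if (FAILURE_CATEGORIES.indexOf(c) === -1) {
      throw new Error("unknown failure category: " + String(c));
    }
  }
  const evidence = v.evidence ?? [];
  for (const e of evidence) {
    // Every quote must really appear on the transcript line it cites.
    const line = transcriptLines[e.line - 1];
    if (line === undefined || line.indexOf(e.quote) === -1) {
      throw new Error("evidence quote not found on line " + e.line);
    }
  }
  return { status, score, categories, evidence };
}
```

Checking each quote against the cited line means the judge cannot cite evidence that is not in the transcript.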

Tier 2 callback verification

A second call about 20 minutes later asks the agent whether the promised action really happened. The schema models it end to end — callback run IDs, confirmed/denied/inconclusive results — so hallucinated confirmations become a first-class, alertable failure.
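As a rough illustration of how the callback could be modeled, the sketch below schedules the second call and maps its result to an alert level. The type names, the exact 20-minute constant, and the alert policy (denied fails, inconclusive warns) are assumptions for the example, not the production schema.

```typescript
type CallbackResult = "confirmed" | "denied" | "inconclusive";
type AlertLevel = "none" | "warn" | "fail";

// Hypothetical record linking a test run to its follow-up call.
interface CallbackCheck {
  runId: string; // the original test run
  callbackRunId: string; // the Tier 2 follow-up call
  result: CallbackResult;
}

// "About 20 minutes" is modeled here as exactly 20 minutes.
const CALLBACK_DELAY_MS = 20 * 60 * 1000;

// When the follow-up call should be placed, given when the first call ended.
function callbackDueAt(callEndedAt: number): number {
  return callEndedAt + CALLBACK_DELAY_MS;
}

// Assumed policy: a denied promise is a failure, an unclear answer is a warning.
function alertFor(check: CallbackCheck): AlertLevel {
  switch (check.result) {
    case "denied":
      return "fail";
    case "inconclusive":
      return "warn";
    case "confirmed":
      return "none";
  }
}
```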

Five-Worker pipeline on service bindings

A cron scheduler works out which scenarios are due every five minutes; the call-engine dispatches Telnyx calls with dual-channel recording and run IDs like SX-20260409-xxxxxx; the evaluator judges; the alerter escalates — all behind a Hono API gateway running real D1 SQL analytics: 30-day uptime, daily accuracy, per-scenario pass rates.
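The run ID format shown above (SX, an eight-digit date, a six-character suffix) can be sketched as follows. This is a guess at the shape from the examples alone: the use of UTC dates, the uppercase alphanumeric alphabet, and the `makeRunId` helper are assumptions.

```typescript
// Assumed suffix alphabet: uppercase letters and digits, as in "D4K2Q9".
const RUN_ID_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
const RUN_ID_PATTERN = /^SX-\d{8}-[A-Z0-9]{6}$/;

function pad2(n: number): string {
  return (n < 10 ? "0" : "") + n;
}

// Builds an ID like SX-20260704-D4K2Q9. The random source is injectable
// so the function can be tested deterministically.
function makeRunId(date: Date, random: () => number = Math.random): string {
  const day =
    String(date.getUTCFullYear()) +
    pad2(date.getUTCMonth() + 1) +
    pad2(date.getUTCDate());
  let suffix = "";
  for (let i = 0; i < 6; i++) {
    const index = Math.floor(random() * RUN_ID_ALPHABET.length);
    suffix += RUN_ID_ALPHABET.charAt(index);
  }
  return "SX-" + day + "-" + suffix;
}
```

`Math.random` is adequate for a human-readable label. If the ID also has to be unguessable, a cryptographic source such as `crypto.getRandomValues` would be the safer choice.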

Unit economics designed in, not discovered later

I costed the loop before building it: about $0.03 per test run across the Telnyx call, recording, transcription, and the Claude evaluation. That works out to roughly a 97.7% gross margin at the $199/month tier.
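The margin arithmetic can be reproduced in a few lines. The monthly run volume is not stated above, so the 150 runs per month used here is an assumption, chosen because it is roughly the volume at which $0.03 per run yields the quoted margin.

```typescript
// Gross margin as a fraction of price: (price - cost of runs) / price.
function grossMargin(
  pricePerMonth: number,
  costPerRun: number,
  runsPerMonth: number
): number {
  return (pricePerMonth - costPerRun * runsPerMonth) / pricePerMonth;
}

// Assumed volume: 150 test runs per month (about five a day).
const margin = grossMargin(199, 0.03, 150);
console.log((margin * 100).toFixed(1) + "%"); // prints "97.7%"
```

At that volume the run costs total $4.50 a month against $199 in revenue, so the margin is sensitive mainly to how many runs a customer schedules.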

TypeScript
Cloudflare Workers
Cloudflare D1
Hono
Claude API
Telnyx Call Control
React 19
Vite
Tailwind CSS 4

By the numbers

Cloudflare Workers in the pipeline: 5
D1 tables behind them: 5
failure categories the judge can cite: 6
designed REST routes: ~0
per test run, all-in: ~$0.03
gross margin at the $199 tier: ~97.7%
pricing tiers, $199 to $999: 0
