Sentiox
Synthetic monitoring for AI phone agents — test calls on a schedule, judged by Claude, verified by callback.
01
The problem
Pharmacies and clinics are handing their front desks to AI voice agents. When one starts failing — telling a patient a refill went through when it didn't — nobody finds out until the patient does. I work in voice support every day, and this is the failure mode that worries me most: the agent sounds confident, the dashboards stay green, and the answer is wrong.
Sentiox is my answer: Datadog for AI phone agents. A Cloudflare Workers pipeline places scheduled test calls to the customer's voice agent through Telnyx, records and transcribes them, then has Claude judge each transcript against expected behavior — pass, warn, or fail, with a score and evidence quotes tied to exact transcript lines.
The part I care about most is Tier 2 callback verification. Twenty minutes after a test call, Sentiox calls back and asks the agent whether the promised action actually happened. An agent that confirms a refill and never files it gets caught within the hour — not at the pharmacy counter a week later.
02
How it works
- Every five minutes a cron Worker wakes up and finds the scenarios that are due — hourly, daily, or weekly cadence, per scenario.
- The call-engine places a real phone call to the customer's voice agent through Telnyx. Dual-channel recording, transcription, run IDs like SX-20260704-D4K2Q9.
- The evaluator hands the transcript to Claude with a strict JSON rubric: pass, warn, or fail; a 0–1 score; six failure categories (hallucination, wrong info, no answer, no escalation, overpromise, drift); evidence quotes pinned to exact transcript lines.
- The alerter fires on regressions — before a patient hears the mistake.
- Tier 2 closes the loop. About 20 minutes later, a second call asks the agent whether the action it promised actually happened, and the answer is recorded as confirmed, denied, or inconclusive. Hallucinated confirmations don't survive it.
The whole run costs about $0.03. Five Cloudflare Workers, wired with service bindings, doing the QA a pharmacy's patients would otherwise do for free.
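The scheduling step at the top of that list can be sketched roughly as follows. This is a minimal illustration, not the production scheduler: the `Scenario` shape, the `lastRunAt` field, and the `dueScenarios` helper are all hypothetical names, and the only facts taken from the text are the three cadences and the five-minute cron tick.

```typescript
// Hypothetical shapes: the text names the cadences but not the schema.
type Cadence = "hourly" | "daily" | "weekly";

interface Scenario {
  id: string;
  cadence: Cadence;
  lastRunAt: number | null; // epoch ms of the last run, null if never run
}

const INTERVAL_MS: Record<Cadence, number> = {
  hourly: 60 * 60 * 1000,
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
};

// Returns the scenarios whose cadence interval has elapsed since their last run.
// A scenario that has never run is always due.
function dueScenarios(scenarios: Scenario[], now: number): Scenario[] {
  return scenarios.filter(
    (s) => s.lastRunAt === null || now - s.lastRunAt >= INTERVAL_MS[s.cadence],
  );
}
```

A cron tick every five minutes would call something like `dueScenarios(all, Date.now())` and hand each result to the call-engine; because the check is "interval elapsed", a missed tick delays a run rather than dropping it.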
03
See it working
A scripted replay of a real failure mode on fabricated data — invented pharmacy, 555 number, fake run ID. The rubric and the verdict shape mirror the production evaluator.
04
Under the hood
Claude as judge
The evaluator Worker has Claude grade each transcript against a strict JSON rubric: pass/warn/fail, a 0–1 score, six failure categories (hallucination, wrong info, no answer, no escalation, overpromise, drift), and evidence quotes pinned to exact transcript lines. Every verdict cites its receipts.
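To make the rubric concrete, here is a hedged sketch of what validating a verdict might look like. The field names (`status`, `score`, `categories`, `evidence`), the snake_case category spellings, and the `parseVerdict` helper are assumptions for illustration; only the three statuses, the 0–1 score, the six categories, and the line-pinned quotes come from the description above.

```typescript
const STATUSES = ["pass", "warn", "fail"] as const;
const CATEGORIES = [
  "hallucination", "wrong_info", "no_answer",
  "no_escalation", "overpromise", "drift",
] as const;

type Status = (typeof STATUSES)[number];
type Category = (typeof CATEGORIES)[number];

interface Evidence { line: number; quote: string } // line is 1-based (assumed)
interface Verdict {
  status: Status;
  score: number;
  categories: Category[];
  evidence: Evidence[];
}

// Parses the judge's JSON and rejects anything that breaks the rubric,
// including a quote that does not appear on the transcript line it cites.
function parseVerdict(raw: string, transcriptLines: string[]): Verdict {
  const v = JSON.parse(raw) as Partial<Verdict>;
  if (!STATUSES.includes(v.status as Status)) throw new Error("bad status");
  if (typeof v.score !== "number" || v.score < 0 || v.score > 1) {
    throw new Error("score must be between 0 and 1");
  }
  const categories = Array.isArray(v.categories) ? v.categories : [];
  if (!categories.every((c) => CATEGORIES.includes(c))) {
    throw new Error("unknown failure category");
  }
  const evidence = Array.isArray(v.evidence) ? v.evidence : [];
  for (const e of evidence) {
    const line = transcriptLines[e.line - 1];
    if (line === undefined || !line.includes(e.quote)) {
      throw new Error(`evidence not found on line ${e.line}`);
    }
  }
  return { status: v.status as Status, score: v.score, categories, evidence };
}
```

Checking each quote against its cited line is one way to keep the judge honest: a verdict whose evidence cannot be located in the transcript is rejected instead of stored.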
Tier 2 callback verification
A second call about 20 minutes later asks the agent whether the promised action really happened. The schema models it end to end — callback run IDs, confirmed/denied/inconclusive results — so hallucinated confirmations become a first-class, alertable failure.
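A rough sketch of the Tier 2 decision, under stated assumptions: the three result values and the roughly 20-minute delay come from the text, while `callbackDueAt`, `judgeCallback`, and the `Tier2Outcome` shape are hypothetical names, and the real schema may differ.

```typescript
type CallbackResult = "confirmed" | "denied" | "inconclusive";

// "About 20 minutes" per the description; the exact delay is an assumption.
const CALLBACK_DELAY_MS = 20 * 60 * 1000;

// When the verification call should be placed, given when the test call ended.
function callbackDueAt(testCallEndedAt: number): number {
  return testCallEndedAt + CALLBACK_DELAY_MS;
}

interface Tier2Outcome {
  alert: boolean;
  reason: string;
}

// Combines what the agent promised on the first call with what the
// callback found. A promise the callback denies is the alertable case.
function judgeCallback(promisedAction: boolean, result: CallbackResult): Tier2Outcome {
  if (!promisedAction) {
    return { alert: false, reason: "nothing was promised, nothing to verify" };
  }
  switch (result) {
    case "confirmed":
      return { alert: false, reason: "promised action confirmed" };
    case "denied":
      return { alert: true, reason: "hallucinated confirmation" };
    case "inconclusive":
      return { alert: false, reason: "callback inconclusive" };
  }
}
```

Treating "inconclusive" as a non-alerting state is a design choice made here for illustration; a stricter setup could retry the callback or raise a warning instead.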
Five-Worker pipeline on service bindings
A cron scheduler works out which scenarios are due every five minutes; the call-engine dispatches Telnyx calls with dual-channel recording and run IDs like SX-20260409-xxxxxx; the evaluator judges; the alerter escalates — all behind a Hono API gateway running real D1 SQL analytics: 30-day uptime, daily accuracy, per-scenario pass rates.
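The run IDs quoted above follow a visible `SX-YYYYMMDD-XXXXXX` pattern. A minimal sketch of a generator that produces that shape is below; the alphabet, the use of the UTC date, and the `makeRunId` name are assumptions, since the text shows only example IDs and not how they are minted.

```typescript
// Assumed alphabet: uppercase letters and digits, matching the sample "D4K2Q9".
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

// Builds an ID like SX-20260704-D4K2Q9 from a date and a random source.
// `rand` is injectable so the function can be tested deterministically.
function makeRunId(date: Date, rand: () => number = Math.random): string {
  // toISOString() is always UTC, so this yields the UTC date as YYYYMMDD.
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, "");
  let suffix = "";
  for (let i = 0; i < 6; i++) {
    suffix += ALPHABET[Math.floor(rand() * ALPHABET.length)];
  }
  return `SX-${ymd}-${suffix}`;
}
```

In a Worker, a cryptographically stronger source such as `crypto.getRandomValues` could replace `Math.random`; six characters from a 36-symbol alphabet give about 2.2 billion suffixes per day, so a uniqueness check in the database is still a sensible backstop.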
Unit economics designed in, not discovered later
I costed the loop before building it: about $0.03 per test run across the Telnyx call, recording, transcription, and the Claude evaluation. That works out to roughly a 97.7% gross margin at the $199/month tier.
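The margin figure depends on how many test runs a subscriber consumes each month, which is not stated here. The arithmetic can be sketched as below; `grossMargin` and `runsForMargin` are illustrative helpers, and the run volume they produce is inferred by working backward from the quoted numbers, not taken from the product's actual plan limits.

```typescript
// Gross margin as a fraction: 1 - (variable cost / revenue).
function grossMargin(monthlyPrice: number, costPerRun: number, runsPerMonth: number): number {
  return 1 - (costPerRun * runsPerMonth) / monthlyPrice;
}

// Inverse: how many runs per month a given margin implies.
function runsForMargin(monthlyPrice: number, costPerRun: number, margin: number): number {
  return (monthlyPrice * (1 - margin)) / costPerRun;
}

// Working backward from the quoted figures ($199 tier, $0.03 per run, 97.7%).
const impliedRuns = runsForMargin(199, 0.03, 0.977);
const marginAt153 = grossMargin(199, 0.03, 153);
console.log(impliedRuns.toFixed(1), (marginAt153 * 100).toFixed(1) + "%");
```

Under these inputs the quoted margin corresponds to roughly 153 runs a month, or about five a day, with variable cost near $4.59 against $199 of revenue. Fixed costs such as the Workers plan or phone number rental are outside this calculation.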
- TypeScript
- Cloudflare Workers
- Cloudflare D1
- Hono
- Claude API
- Telnyx Call Control
- React 19
- Vite
- Tailwind CSS 4
05
By the numbers
- 5 Cloudflare Workers in the pipeline
- D1 tables behind them
- 6 failure categories the judge can cite
- designed REST routes
- ~$0.03 per test run, all-in
- ~97.7% gross margin at the $199 tier
- pricing tiers, $199 to $999