Product Requirements (PRD)
Vision, personas, the three launch skills, functional & non-functional requirements, KPIs and roadmap.
01 — Product Requirements Document (PRD)
AutoSpeak B2B Voice AI Platform · v1.0 · 2026-06-02
Read 00 — Master Index first. Architecture detail lives in 02. This is a product overview, not legal advice; directional and subject to change.
1. Vision & problem
1.1 Problem
Every business with a phone number loses money and goodwill on calls:
- Missed calls = missed revenue. SMBs miss a large share of inbound calls (after hours, busy, understaffed). Each missed call at a clinic, salon, or hotel is a lost booking.
- Repetitive calls burn expensive humans. Front desks, recruiters, and reservation agents spend most of their day on the same 20 questions and routine scheduling.
- Legacy IVR ("press 1 for…") is hated. It's rigid, can't understand free speech, and can't do anything beyond routing.
- Hiring/training/retaining phone staff is hard and doesn't scale to spikes (seasonal, campaign-driven, overnight).
1.2 Vision
A human-quality AI voice agent that any business can configure in minutes to answer and make calls, take actions, and hand off to a person when needed — in any language, 24/7, at a fraction of the cost of a human seat. Not a smarter IVR; a colleague on the phone.
1.3 What makes it different
- Sounds human — emotive, low-latency, cloned/branded voice (your best receptionist's voice, or a custom brand voice).
- Does things, not just talks — books, screens, routes, takes payment, updates systems of record.
- Horizontal runtime, vertical skills — one engine, pre-built solutions for Reception, Recruit (recruiting), and Stay (hospitality), plus an open "build-your-own-agent" designer.
- Compliance-native — AI disclosure, consent, recording controls, and data residency are first-class features, not afterthoughts.
2. Goals & non-goals
2.1 Goals (v1 → Phase 2)
- Multi-tenant platform: a business self-serves an AI phone agent end-to-end.
- Three production vertical skills on one runtime: Reception, Recruit, Stay.
- Sub-second, natural, interruptible conversations in ≥3 languages.
- Actions via integrations: calendars, ATS, PMS, CRM, payments, SMS/email follow-up.
- A dashboard for setup, live monitoring, transcripts, analytics, and billing.
- Compliance controls (disclosure, consent, recording, residency) configurable per tenant/region.
- Metered + subscription billing.
2.2 Non-goals (explicitly out of scope for v1)
- Full contact-center suite (workforce mgmt, omnichannel chat/email ticketing) — integrate, don't rebuild.
- Outbound mass marketing/telemarketing campaigns (regulatory landmine; we gate outbound tightly).
- On-device/offline inference.
- Real-time video avatar (Phase 2).
- Building our own carrier network (we ride established telephony providers).
3. Personas
3.1 Buyers (who pays)
| Persona | Vertical | Pain | Buying trigger |
|---|---|---|---|
| Priya, SMB owner (clinic/salon/agency, 1–20 staff) | Reception | Missing calls = lost bookings; can't afford a receptionist | Wants 24/7 coverage cheaply |
| Rahul, Talent Acquisition lead (50–5000 hires/yr) | Recruit | Recruiters waste hours on phone screens; slow funnel | Volume hiring season, cost per hire |
| Sara, Hotel GM / Front-office mgr (independent or chain) | Stay | Front desk overwhelmed; after-hours bookings lost; multilingual guests | OTA commission fatigue, staffing gaps |
| Tom, Head of CX / Ops (mid-market) | Horizontal | Call volume spikes, IVR hated, CSAT low | Cost + CSAT mandate |
3.2 Users (who interacts)
- Admin — configures agents, integrations, billing.
- Agent designer / ops — writes prompts/flows, reviews transcripts, tunes.
- Human fallback agent — receives escalations/handoffs.
- The caller / candidate / guest — the end person on the phone (must always be respected: disclosure, opt-out, human escalation).
4. The product: one runtime, three skills
4.1 Shared Voice Agent Runtime (the platform core)
Every skill is a configuration over the same runtime. The runtime provides:
- Telephony I/O — inbound answer, outbound dial, transfer, DTMF, voicemail (carrier media streams).
- Real-time loop — VAD → STT → dialogue manager (LLM + state) → TTS, with barge-in (caller can interrupt) and <800ms turn latency target.
- Dialogue manager — system persona + business knowledge (RAG over the tenant's docs/FAQ) + tools (function calling) + guardrails.
- Actions/tools — calendar, ATS, PMS, CRM, payments, send SMS/email, lookup/booking APIs, warm/cold transfer to human.
- Memory — per-caller context, conversation history, CRM enrichment.
- Knowledge — tenant uploads docs/URLs → chunked + embedded → retrieved at call time.
- Voice — pick a stock voice, clone a human voice with that person's consent, or use a custom brand voice; configured per tenant.
- Languages — auto-detect + multilingual response; configurable allow-list.
- Compliance layer — AI disclosure line, consent capture, recording on/off, redaction, residency routing.
- Observability — live transcript, recording, summary, sentiment, QA scoring, analytics.
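As a rough illustration of the Knowledge item above (chunk, embed, retrieve at call time), here is a minimal sketch. It swaps in a toy bag-of-words vector for a real embedding model, and every name in it (`chunk`, `embed`, `retrieve`, the sample documents) is hypothetical rather than part of the platform.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list:
    """Split a document into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical tenant knowledge, indexed once at upload time.
docs = [
    "We are open Monday to Friday from 9am to 6pm.",
    "Parking is available behind the building on Elm Street.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# At call time, the retrieved chunk would be passed to the dialogue manager.
top = retrieve("What time are you open on Monday?", index)
```

A production system would presumably use a learned embedding model and a vector store; the shape of the flow (index at upload, retrieve per turn) is the point here.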
4.2 Skill: AutoSpeak Reception (general business / SMB)
Job: Be the front desk. Answer the main line; understand intent; handle or route.
Capabilities:
- Greet with business branding + AI disclosure.
- Answer FAQs from the tenant's knowledge base (hours, location, services, pricing).
- Qualify and route ("Are you a new or existing patient?" → correct destination).
- Book / reschedule / cancel appointments (Google/Microsoft/Calendly, or vertical scheduler).
- Take a message + send structured summary (SMS/email/CRM) when it can't resolve.
- Warm transfer to a human with spoken context handoff.
- After-hours mode, overflow mode (only when humans are busy), and full-time mode.
Success: % calls fully resolved by AI, bookings created, missed-call rate ↓, after-hours capture.
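The three answering modes listed above (after-hours, overflow, full-time) reduce to a small decision function. The sketch below is one possible reading; the `Mode` enum, the parameter names, and the assumption of same-day business hours are illustrative, not a specified interface.

```python
from datetime import time
from enum import Enum

class Mode(Enum):
    AFTER_HOURS = "after_hours"   # AI answers only outside business hours
    OVERFLOW = "overflow"         # AI answers only when no human is free
    FULL_TIME = "full_time"       # AI answers every call

def ai_should_answer(mode: Mode, now: time, opens: time, closes: time, free_humans: int) -> bool:
    """Decide whether the AI agent picks up an inbound call.

    Assumes opening hours do not wrap past midnight.
    """
    is_open = opens <= now < closes
    if mode is Mode.FULL_TIME:
        return True
    if mode is Mode.AFTER_HOURS:
        return not is_open
    # OVERFLOW: pick up only when every human is busy or absent.
    return free_humans == 0
```

For example, a clinic open 09:00 to 17:00 in after-hours mode would let humans take a 10:00 call and route a 20:00 call to the AI.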
4.3 Skill: AutoSpeak Recruit (HR screening)
Job: Conduct structured first-round phone screens at scale (outbound scheduled or inbound), score, and schedule next steps.
Capabilities:
- Outbound (consented) or inbound ("call this number to complete your screening") — outbound gated by consent + quiet-hours + DLT/TCPA controls.
- Structured interview from a question template per role (skills, availability, salary expectation, work authorization, scenario questions).
- Adaptive follow-ups — ask clarifying questions, but stay on-script for fairness/auditability.
- Scoring & summary against a rubric; flags, transcript, and recommendation to recruiter.
- Schedule the human round (calendar slot) for passers; polite decline/hold for others.
- ATS sync (Greenhouse, Lever, Workday, Zoho Recruit) — write back results.
- Bias/fairness guardrails (see 03 §HR): no protected-class questions, consistent questions, human-in-the-loop decisioning, candidate notice + opt-out to a human.
Success: screens completed, recruiter hours saved, time-to-schedule ↓, candidate CSAT, adverse-impact monitoring.
⚠️ Highest-regulation skill. Employment screening is "high-risk" under the EU AI Act and, in the US, falls under NYC Local Law 144, Illinois AIVIA, and EEOC guidance. The AI assists; humans decide. Detail in 03.
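The outbound gating named in the capabilities (consent, quiet hours, opt-out) could be enforced with a pre-dial check along these lines. This is a sketch under assumptions: the function names, the reason codes, and the 21:00 to 08:00 default window are placeholders, and the real window would come from the tenant's compliance profile.

```python
from datetime import time
from typing import Tuple

def in_quiet_hours(local: time, start: time, end: time) -> bool:
    """True if `local` falls inside the quiet window, which may wrap past midnight."""
    if start <= end:
        return start <= local < end
    return local >= start or local < end

def may_dial(has_consent: bool, opted_out: bool, callee_local: time,
             quiet_start: time = time(21, 0), quiet_end: time = time(8, 0)) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed outbound call.

    `callee_local` is the current time in the callee's own time zone.
    """
    if opted_out:
        return False, "opted_out"
    if not has_consent:
        return False, "no_consent"
    if in_quiet_hours(callee_local, quiet_start, quiet_end):
        return False, "quiet_hours"
    return True, "ok"
```

Returning a reason code alongside the decision makes it straightforward to record why a call was blocked in the compliance audit trail.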
4.4 Skill: AutoSpeak Stay (hotels / hospitality)
Job: Be the reservations + front-desk voice agent for a property.
Capabilities:
- Reservations — check availability, quote rates, create/modify/cancel bookings via PMS/channel manager (Cloudbeds, Mews, Opera/OHIP, SiteMinder).
- FAQs & concierge — check-in/out times, amenities, directions, restaurant hours, local recommendations.
- Multilingual — greet/serve guests in their language (key differentiator for hospitality).
- Take payment / deposit — secure pay-by-link or PCI-compliant capture (never raw card on our servers; see 03 §PCI).
- Upsell (room upgrades, late checkout) within configured rules.
- Escalate to front desk for complex/VIP/complaint cases.
Success: direct (commission-free) bookings captured, after-hours reservations, call deflection, guest CSAT, languages served.
4.5 Skill: Build-your-own agent (horizontal, self-serve)
A no-/low-code Agent Designer: define persona, knowledge, questions/flows, tools, voice, languages, and compliance — for use cases we didn't pre-build (dental, real estate, logistics dispatch, debt-collection-with-care, surveys, appointment reminders). This is what makes the platform horizontal.
The broader menu of candidate verticals — organized into reusable interaction patterns and prioritized into tiers — is in 07 — Vertical Opportunity Map. The three launch skills above are the deliberate beachhead; the menu is the expansion path.
5. Functional requirements
5.1 Onboarding & tenant setup
- FR-1: Self-serve signup → create Organization (tenant) with isolated data.
- FR-2: Provision/port a phone number or forward an existing line; per-country number support.
- FR-3: Pick a skill template (Reception/Recruit/Stay/Blank) to pre-fill config.
- FR-4: Upload knowledge (docs, URLs, FAQ) → indexed for retrieval.
- FR-5: Choose/clone a voice (with consent capture for cloning a real person).
- FR-6: Connect integrations via OAuth (calendar, ATS, PMS, CRM, payments).
- FR-7: Set compliance profile by region (disclosure script, recording on/off, consent prompts, quiet hours, data residency).
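To make FR-7 concrete, a per-region compliance profile might be modeled as an immutable record with regional defaults and tenant overrides. The field names follow the items FR-7 lists; the class, the `profile_for` helper, and all default values are hypothetical placeholders, not legal guidance.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class ComplianceProfile:
    region: str
    disclosure_script: str
    recording_enabled: bool
    consent_prompt: Optional[str]
    quiet_hours: Tuple[str, str]   # local "HH:MM" start/end for outbound calls
    data_residency: str            # storage region, e.g. "EU" or "US"

# Placeholder defaults for illustration only; real values need legal review.
REGION_DEFAULTS = {
    "EU": ComplianceProfile("EU", "I'm an AI assistant.", False,
                            "May I record this call?", ("20:00", "09:00"), "EU"),
    "US": ComplianceProfile("US", "I'm an AI assistant.", True,
                            "This call may be recorded.", ("21:00", "08:00"), "US"),
}

def profile_for(region: str, **overrides) -> ComplianceProfile:
    """Start from the regional default and apply tenant-level overrides."""
    return replace(REGION_DEFAULTS[region], **overrides)

# A tenant in the EU that opts in to recording.
tenant_profile = profile_for("EU", recording_enabled=True)
```

Keeping the record frozen means an override always produces a new profile, so the regional defaults cannot be mutated by one tenant's settings.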
5.2 Agent configuration
- FR-8: Persona & instructions (brand tone, do/don't, escalation rules).
- FR-9: Tools/actions catalog with per-tool enable + parameters.
- FR-10: Conversation guardrails (topics to avoid, max call length, profanity/abuse handling, hallucination guards, "I don't know → human" rules).
- FR-11: Routing rules (intents → destinations / humans / hours).
- FR-12: Multilingual settings (allowed languages, default).
- FR-13: Versioning + test bench ("call the agent" simulator + transcript review before go-live).
5.3 Runtime / call handling
- FR-14: Answer inbound within 1–2 rings; place consented outbound.
- FR-15: Speak AI disclosure per compliance profile at call start.
- FR-16: Real-time STT→LLM→TTS with barge-in and <800ms median turn latency.
- FR-17: Execute tools mid-call (book, look up, pay-by-link) and confirm verbally.
- FR-18: Warm/cold transfer to human with context; voicemail capture if no answer.
- FR-19: Graceful failure: on uncertainty/abuse/STT failure → fallback script → human/message.
- FR-20: Capture consent + (where enabled) recording; redact sensitive fields (card, etc.).
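For the redaction part of FR-20, one common approach is to find digit runs of card length in a transcript and mask those that pass the Luhn checksum. The sketch below assumes digits appear as numerals in the transcript (spoken-word digits would need normalization first); the function names and the `[REDACTED_CARD]` token are illustrative.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    """Replace Luhn-valid card-like numbers with a fixed token; leave other numbers alone."""
    def _sub(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED_CARD]" if luhn_ok(digits) else match.group()
    return CARD_CANDIDATE.sub(_sub, text)
```

The Luhn check keeps the filter from masking order numbers or phone numbers that merely happen to be long, at the cost of missing mistyped card numbers.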
5.4 Post-call
- FR-21: Transcript, recording (if enabled), structured summary, extracted entities (name, intent, booking, outcome).
- FR-22: Write-back to system of record (CRM/ATS/PMS) + notifications (SMS/email/Slack).
- FR-23: Per-call QA score + sentiment + disposition.
- FR-24: Analytics dashboards (volume, resolution rate, bookings, CSAT, language mix).
5.5 Dashboard (web)
- FR-25: Live call monitor ("calls happening now" + listen/whisper/barge for supervisors).
- FR-26: Call history with search/filter, transcripts, recordings, summaries.
- FR-27: Analytics & reports (per skill); export.
- FR-28: Agent designer + test bench.
- FR-29: Integrations manager, number management, team/roles (RBAC), audit log.
- FR-30: Billing & usage (plan, minutes consumed, invoices).
5.6 Admin / platform
- FR-31: Multi-tenant isolation, RBAC, SSO (Phase 3: SAML/SCIM).
- FR-32: Usage metering per tenant (minutes, calls, actions) → billing.
- FR-33: Rate limits, fraud/abuse controls, concurrency caps per plan.
- FR-34: Audit trail for compliance (who changed what, consent/disclosure logs, recording access).
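FR-32 feeds metered usage into billing. A minimal sketch of how that could work follows, assuming per-call rounding up to whole minutes and a plan with a base fee, an included allowance, and an overage rate. All of these are assumptions; the PRD does not specify rounding or pricing.

```python
import math
from decimal import Decimal

def billable_minutes(call_seconds: list) -> int:
    """Round each call up to a whole minute, then sum (rounding rule is an assumption)."""
    return sum(math.ceil(s / 60) for s in call_seconds)

def invoice_total(minutes: int, base_fee: Decimal, included_minutes: int,
                  overage_rate: Decimal) -> Decimal:
    """Subscription fee plus metered overage beyond the included allowance."""
    overage = max(0, minutes - included_minutes)
    return base_fee + overage_rate * overage

# Hypothetical tenant usage: three calls of 61 s, 120 s, and 30 s.
minutes = billable_minutes([61, 120, 30])
total = invoice_total(minutes, Decimal("49.00"), 3, Decimal("0.10"))
```

`Decimal` is used instead of floats so that currency arithmetic stays exact.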
5.7 Mobile (operator companion)
- FR-35: A mobile operator companion app: live calls, transcripts, summaries, approvals (e.g., approve an escalation), push notifications — not the primary admin surface. (A solopreneur/SMB "answer my line" mode remains available as a niche path.)
6. Non-functional requirements
| Category | Requirement |
|---|---|
| Latency | Median end-of-speech → start-of-speech < 800 ms; p95 < 1.5 s. Barge-in stop < 200 ms. |
| Availability | 99.9% for call path (Phase 1) → 99.95% target (Phase 2) → 99.95%+ with SLAs (Phase 3). Telephony failover across carriers. |
| Scale | Thousands of concurrent calls; horizontal autoscale of media/inference workers. |
| Reliability | No dropped audio; graceful degradation (if LLM slow → filler/hold phrase; if STT fails → reprompt/human). |
| Security | Encryption in transit + at rest; secrets vault; least-privilege; tenant isolation; PII/PCI redaction. (See 03.) |
| Privacy/residency | Per-region data storage (EU/India/US); configurable retention + deletion. |
| Observability | Tracing per call (latency budget per stage), error budgets, alerting, call-quality scoring. |
| Accessibility | Disclosure + easy human escalation always available ("talk to a person"). |
| Internationalization | Multi-language STT/TTS/LLM; locale-aware dates/currency; right-to-left text in dashboard later. |
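The Reliability row's "if LLM slow → filler/hold phrase" behavior can be sketched with `asyncio`: wait briefly for the model, speak a filler if it has not answered, then speak the real reply once it arrives. The `llm_call` and `speak` callables, the 0.4 s threshold, and the filler text are all hypothetical.

```python
import asyncio

async def reply_with_filler(llm_call, speak, filler_after: float = 0.4,
                            filler: str = "One moment, please.") -> str:
    """Speak a filler phrase if the LLM takes longer than `filler_after` seconds."""
    task = asyncio.create_task(llm_call())
    # asyncio.wait with a timeout does not cancel the task; it just stops waiting.
    done, _pending = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        await speak(filler)
    text = await task
    await speak(text)
    return text
```

Using `asyncio.wait` rather than `asyncio.wait_for` matters here, since `wait_for` would cancel the in-flight LLM request on timeout instead of letting it finish.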
6.1 The latency budget (why this is hard)
A natural turn must fit ~800 ms end-to-end:
caller stops speaking
→ VAD endpointing ~100–200 ms
→ STT final transcript ~100–300 ms (streaming)
→ LLM first token + response ~150–400 ms (fast inference)
→ TTS first audio chunk ~100–300 ms (streaming TTS)
→ network/jitter buffer ~50–150 ms
≈ 500–900 ms perceived when stages overlap via streaming (run strictly in sequence, the ranges above sum to 500–1,350 ms)
Design implications: stream everything, start TTS on first LLM tokens, use fast inference for the LLM, pre-warm models, and co-locate inference near the media server. Detail in 02.
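As a quick check of the arithmetic, the stage ranges in the budget above can be summed directly. The dictionary keys are illustrative labels; the numbers and the 800 ms / 1.5 s targets are the ones stated in this section and in the Latency requirement.

```python
# Per-stage (low, high) estimates in milliseconds, copied from the budget above.
STAGES_MS = {
    "vad_endpointing": (100, 200),
    "stt_final_transcript": (100, 300),
    "llm_first_token": (150, 400),
    "tts_first_audio": (100, 300),
    "network_jitter": (50, 150),
}

# Strictly sequential totals, i.e. with no overlap between stages.
best = sum(lo for lo, _ in STAGES_MS.values())
worst = sum(hi for _, hi in STAGES_MS.values())

MEDIAN_TARGET_MS = 800
P95_TARGET_MS = 1500

best_fits_median = best <= MEDIAN_TARGET_MS
worst_fits_p95 = worst <= P95_TARGET_MS
print(f"sequential range: {best}-{worst} ms")  # prints "sequential range: 500-1350 ms"
```

The sequential worst case of 1,350 ms sits inside the 1.5 s p95 target but well above the 800 ms median target, which is why overlapping stages through streaming is listed first among the design implications.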
7. Conversation design principles
- Disclose early, disclose clearly — "Hi, you've reached <business>. I'm an AI assistant — I can help with X, or connect you to a person anytime."
- Short turns — the AI speaks in 1–2 sentences, then yields. No monologues.
- Always interruptible (barge-in). Humans interrupt; the agent must stop instantly.
- Confirm before acting — read back bookings/payments. "I'll book Tuesday 3pm under Priya — correct?"
- Graceful unknowns — never invent. "I'm not sure about that — let me take a message / connect you."
- Easy exit to human — at any time, "talk to a person" works.
- Match pace & language — detect and switch language; mirror formality.
- Respect the caller — quiet hours (outbound), opt-out honored, no pressure tactics.
8. Success metrics (KPIs)
8.1 Product/runtime
- Containment / resolution rate (% calls fully handled by AI without human).
- Median turn latency and p95.
- Barge-in correctness, STT (ASR) word error rate, interruption/over-talk rate.
- Call completion vs. drop/abandon rate.
- CSAT (post-call survey / sentiment proxy).
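A few of these runtime KPIs can be computed from per-call records as follows. This is a sketch: the record shape (`escalated`, `latency_ms`) is assumed, and p95 uses the nearest-rank method, which is only one of several percentile definitions.

```python
import statistics

def containment_rate(calls: list) -> float:
    """Share of calls handled by the AI without escalation to a human."""
    handled = sum(1 for c in calls if not c["escalated"])
    return handled / len(calls)

def p95(values: list) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]

# Hypothetical sample: 20 calls, every fifth one escalated.
calls = [{"escalated": i % 5 == 0, "latency_ms": 100 * (i + 1)} for i in range(20)]
latencies = [c["latency_ms"] for c in calls]

rate = containment_rate(calls)
median_ms = statistics.median(latencies)
p95_ms = p95(latencies)
```

Tracking median and p95 together matters because a healthy median can hide a long tail of slow turns that callers still notice.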
8.2 Vertical outcomes
- Reception: bookings created, missed-call rate ↓, after-hours capture, transfer rate.
- Recruit: screens completed, recruiter-hours saved, time-to-schedule, pass-through accuracy, adverse-impact metrics.
- Stay: direct bookings, deflection rate, languages served, revenue/upsell.
8.3 Business
- Activation (signup → first live call), time-to-value.
- Minutes processed, retention (logo & revenue), and customer growth.
9. Roadmap (phased overview)
Phases are sequential; each ends with a go/no-go gate. Timing depends on team size.
Phase 0 — Productionize the core (multi-tenant)
- Extract the existing STT→LLM→TTS pipeline behind a clean provider-abstraction interface (swap vendors via config).
- Introduce tenant/org model + isolation; move secrets/config to managed configuration.
- Usage metering + structured logging/tracing per call.
- Gate: a single tenant can be provisioned by config and take a real call with disclosure + transcript + summary.
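The provider-abstraction idea in Phase 0 (swap vendors via config) might look like the following for the TTS stage. The `TTSProvider` protocol, the two vendor classes, and the `tts_provider` config key are invented for illustration; the same pattern would apply to STT and LLM providers.

```python
from typing import Callable, Dict, Protocol

class TTSProvider(Protocol):
    """Interface the runtime depends on; vendors are hidden behind it."""
    def synthesize(self, text: str) -> bytes: ...

class VendorATTS:
    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode()   # stand-in for a real vendor call

class VendorBTTS:
    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode()   # stand-in for a real vendor call

# Registry maps a config value to a provider factory.
TTS_REGISTRY: Dict[str, Callable[[], TTSProvider]] = {
    "vendor_a": VendorATTS,
    "vendor_b": VendorBTTS,
}

def build_tts(config: dict) -> TTSProvider:
    """Pick the provider named in config; an unknown name raises KeyError."""
    return TTS_REGISTRY[config["tts_provider"]]()

audio = build_tts({"tts_provider": "vendor_b"}).synthesize("hi")
```

Because callers only see `TTSProvider`, moving a tenant to self-hosted inference in a later phase would be a config change rather than a code change.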
Phase 1 — Reception MVP + design partners
- Dashboard (setup, call history, transcripts, analytics), Agent Designer v1, test bench.
- Reception skill: FAQ/RAG, booking (calendar), routing, warm transfer, after-hours.
- Compliance profile v1 (disclosure, recording toggle, consent), billing (subscription + metered).
- Onboard a small cohort of design partners (pilots).
- Gate: committed design partners with live call volume + positive resolution rate.
Phase 1b — Horizontal: Recruit + Stay
- Recruit: question templates, scoring rubric, ATS integration (start with 1–2), outbound-with-consent + fairness guardrails, candidate opt-out.
- Stay: PMS integration (start with 1–2), multilingual, pay-by-link, modify/cancel.
- Generalize a Build-your-own agent template from the three launch skills.
- Gate: at least one paying customer per vertical.
Phase 2 — Scale, Hybrid AI, Video avatar
- Hybrid AI: introduce self-hosted inference behind the provider abstraction while keeping premium voices, to improve efficiency at higher volume.
- Reliability: carrier failover, autoscaling media/inference, 99.95% target.
- SOC 2 Type II / ISO 27001 kickoff; enterprise security review pack.
- Video/avatar module (see §10).
- Gate: improved efficiency at scale, first mid-market/enterprise logo, audit started.
Phase 3 — Own models + enterprise
- Self-host/fine-tune STT+LLM+TTS; in-house voice cloning (privacy + best efficiency at scale).
- Enterprise: SSO (SAML/SCIM), data residency guarantees, SLAs, private deployment options, on-prem/VPC for regulated buyers.
- Marketplace of skills/integrations; partner/reseller program.
- Gate: majority of volume on owned models; enterprise contracts.
10. Phase 2 deep-dive: Video / talking-avatar module
Concept: the same brain (STT→LLM→TTS) drives a photoreal talking human on a screen — a "video receptionist." Use cases:
- Web concierge (avatar on the business's website/booking page).
- Lobby/kiosk (hotel check-in avatar, clinic front desk).
- Video voicemail / personalized outreach (pre-rendered).
Two technical modes:
- Pre-rendered (easy): generate a video from a script after the fact. Good for voicemail/marketing, not live conversation.
- Real-time avatar (hard): live lip-synced talking head streaming as the AI speaks. Latency + cost are the challenges; layer it on top of the existing voice loop.
Recommendation (Phase 2): start with a real-time avatar provider gated to web/kiosk where latency tolerance is higher than phone; evaluate self-hosting only at scale. Architecture detail in 02 §Video. Keep v1 voice-only.
11. Assumptions, risks & mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Latency feels robotic | Core UX fails | Stream everything; fast LLM; filler phrases; rigorous latency budget + monitoring |
| AI hallucinates wrong info (price, availability) | Trust/legal | RAG + "don't know → human"; confirm before acting; read-backs; tool-grounded answers only |
| Regulatory breach (recording/robocall/disclosure) | Fines, shutdown | Compliance-native runtime; per-region profiles; outbound gating; legal review (see 03) |
| Voice-clone misuse / deepfake concerns | Brand/legal | Consent capture for clones; watermark/disclosure; restrict to consented voices |
| HR bias claims | Legal, reputational | Human-in-the-loop; consistent questions; adverse-impact monitoring; documented screening process; opt-out |
| Vendor lock-in / price hikes | Margin | Provider abstraction; Phase 2/3 self-host path; multi-vendor |
| Telephony deliverability (spam labeling, blocked outbound) | Calls fail | STIR/SHAKEN, branded caller ID, registered numbers, reputation mgmt |
12. Market context (orientation)
Voice AI is an active, fast-growing market with both horizontal voice-agent platforms and vertical incumbents (human-led answering services, HR screening tools, hospitality voice assistants). AutoSpeak's approach is one platform that spans Reception, Recruit, and Stay with vertical depth, compliance by default, multilingual support, and strong integrations.
The strategy: build the horizontal runtime so a single-vertical competitor can't out-feature it, but lead sales with one wedge at a time.
Continue to → 02 — Technical Architecture & Build Plan