Vision, personas, the three launch skills, functional & non-functional requirements, KPIs and roadmap.

01 — Product Requirements Document (PRD)

AutoSpeak B2B Voice AI Platform · v1.0 · 2026-06-02

Read 00 — Master Index first. Architecture detail lives in 02. This is a product overview, not legal advice; directional and subject to change.


1. Vision & problem

1.1 Problem

Every business with a phone number loses money and goodwill on calls:

  • Missed calls = missed revenue. SMBs miss a large share of inbound calls (after hours, busy, understaffed). Each missed call at a clinic, salon, or hotel is a lost booking.
  • Repetitive calls burn expensive humans. Front desks, recruiters, and reservation agents spend most of their day on the same 20 questions and routine scheduling.
  • Legacy IVR ("press 1 for…") is hated. It's rigid, can't understand free speech, and can't do anything beyond routing.
  • Hiring/training/retaining phone staff is hard and doesn't scale to spikes (seasonal, campaign-driven, overnight).

1.2 Vision

A human-quality AI voice agent that any business can configure in minutes to answer and make calls, take actions, and hand off to a person when needed — in any language, 24/7, at a fraction of the cost of a human seat. Not a smarter IVR; a colleague on the phone.

1.3 What makes it different

  • Sounds human — emotive, low-latency, cloned/branded voice (your best receptionist's voice, or a custom brand voice).
  • Does things, not just talks — books, screens, routes, takes payment, updates systems of record.
  • Horizontal runtime, vertical skills — one engine, pre-built solutions for Reception, Recruiting, Hospitality, and an open "build-your-own-agent" designer.
  • Compliance-native — AI disclosure, consent, recording controls, and data residency are first-class features, not afterthoughts.

2. Goals & non-goals

2.1 Goals (v1 → Phase 2)

  1. Multi-tenant platform: a business self-serves an AI phone agent end-to-end.
  2. Three production vertical skills on one runtime: Reception, Recruit, Stay.
  3. Sub-second, natural, interruptible conversations in ≥3 languages.
  4. Actions via integrations: calendars, ATS, PMS, CRM, payments, SMS/email follow-up.
  5. A dashboard for setup, live monitoring, transcripts, analytics, and billing.
  6. Compliance controls (disclosure, consent, recording, residency) configurable per tenant/region.
  7. Metered + subscription billing.

2.2 Non-goals (explicitly out of scope for v1)

  • Full contact-center suite (workforce mgmt, omnichannel chat/email ticketing) — integrate, don't rebuild.
  • Outbound mass marketing/telemarketing campaigns (regulatory landmine; we gate outbound tightly).
  • On-device/offline inference.
  • Real-time video avatar (Phase 2).
  • Building our own carrier network (we ride established telephony providers).

3. Personas

3.1 Buyers (who pays)

| Persona | Vertical | Pain | Buying trigger |
|---|---|---|---|
| Priya, SMB owner (clinic/salon/agency, 1–20 staff) | Reception | Missing calls = lost bookings; can't afford a receptionist | Wants 24/7 coverage cheaply |
| Rahul, Talent Acquisition lead (50–5000 hires/yr) | Recruit | Recruiters waste hours on phone screens; slow funnel | Volume hiring season, cost per hire |
| Sara, Hotel GM / Front-office mgr (independent or chain) | Stay | Front desk overwhelmed; after-hours bookings lost; multilingual guests | OTA commission fatigue, staffing gaps |
| Tom, Head of CX / Ops (mid-market) | Horizontal | Call volume spikes, IVR hated, CSAT low | Cost + CSAT mandate |

3.2 Users (who interacts)

  • Admin — configures agents, integrations, billing.
  • Agent designer / ops — writes prompts/flows, reviews transcripts, tunes.
  • Human fallback agent — receives escalations/handoffs.
  • The caller / candidate / guest — the end person on the phone (must always be respected: disclosure, opt-out, human escalation).

4. The product: one runtime, three skills

4.1 Shared Voice Agent Runtime (the platform core)

Every skill is a configuration over the same runtime. The runtime provides:

  • Telephony I/O — inbound answer, outbound dial, transfer, DTMF, voicemail (carrier media streams).
  • Real-time loop — VAD → STT → dialogue manager (LLM + state) → TTS, with barge-in (caller can interrupt) and a median turn latency target under 800 ms.
  • Dialogue manager — system persona + business knowledge (RAG over the tenant's docs/FAQ) + tools (function calling) + guardrails.
  • Actions/tools — calendar, ATS, PMS, CRM, payments, send SMS/email, lookup/booking APIs, warm/cold transfer to human.
  • Memory — per-caller context, conversation history, CRM enrichment.
  • Knowledge — tenant uploads docs/URLs → chunked + embedded → retrieved at call time.
  • Voice — pick a stock voice, clone a consented human voice, or use a custom brand voice; configured per tenant.
  • Languages — auto-detect + multilingual response; configurable allow-list.
  • Compliance layer — AI disclosure line, consent capture, recording on/off, redaction, residency routing.
  • Observability — live transcript, recording, summary, sentiment, QA scoring, analytics.
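The barge-in behaviour in the real-time loop above can be illustrated with a small asyncio sketch. It is not the runtime's implementation: `speak` and `run_turn` are hypothetical names, TTS playback is simulated with short sleeps, and the VAD signal is replaced by a fixed delay.

```python
import asyncio

async def speak(chunks, played):
    """Pretend to stream TTS audio chunk by chunk (stand-in for real playback)."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # simulated playback time per chunk

async def run_turn(chunks, barge_in_after=None):
    """Play the agent's reply; cancel playback if the caller starts talking.

    barge_in_after is a stand-in for a VAD event: seconds until caller speech.
    Returns the chunks actually played and whether playback was interrupted.
    """
    played = []
    tts = asyncio.create_task(speak(chunks, played))
    if barge_in_after is not None:
        await asyncio.sleep(barge_in_after)  # "VAD detects caller speech" here
        tts.cancel()                         # stop speaking immediately
    try:
        await tts
    except asyncio.CancelledError:
        pass  # expected when the caller interrupted
    return played, tts.cancelled()
```

Calling `asyncio.run(run_turn(["Sure,", "I can", "book that"], barge_in_after=0.02))` stops after the first chunk. A real loop would also need to flush any audio already buffered at the carrier to approach the 200 ms barge-in stop target.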

4.2 Skill: AutoSpeak Reception (general business / SMB)

Job: Be the front desk. Answer the main line; understand intent; handle or route.

Capabilities:

  • Greet with business branding + AI disclosure.
  • Answer FAQs from the tenant's knowledge base (hours, location, services, pricing).
  • Qualify and route ("Are you a new or existing patient?" → correct destination).
  • Book / reschedule / cancel appointments (Google/Microsoft/Calendly, or vertical scheduler).
  • Take a message + send structured summary (SMS/email/CRM) when it can't resolve.
  • Warm transfer to a human with spoken context handoff.
  • After-hours mode, overflow mode (only when humans are busy), and full-time mode.
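The three answering modes listed above reduce to a small decision rule. The sketch below is one possible reading, assuming a hypothetical `ReceptionConfig` with fixed opening hours and a count of free human agents supplied by the telephony layer.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class ReceptionConfig:
    mode: str     # "after_hours", "overflow", or "full_time" (assumed names)
    opens: time
    closes: time

def ai_should_answer(cfg: ReceptionConfig, now: time, humans_available: int) -> bool:
    """Decide whether the AI agent picks up this inbound call."""
    if cfg.mode == "full_time":
        return True
    if cfg.mode == "after_hours":
        is_open = cfg.opens <= now < cfg.closes
        return not is_open
    if cfg.mode == "overflow":
        return humans_available == 0  # only when every human is busy
    raise ValueError(f"unknown mode: {cfg.mode}")
```

A tenant might well combine modes (for example overflow during the day and after-hours at night); this sketch keeps them exclusive for clarity.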

Success: % calls fully resolved by AI, bookings created, missed-call rate ↓, after-hours capture.

4.3 Skill: AutoSpeak Recruit (HR screening)

Job: Conduct structured first-round phone screens at scale (outbound scheduled or inbound), score, and schedule next steps.

Capabilities:

  • Outbound (consented) or inbound ("call this number to complete your screening") — outbound gated by consent + quiet-hours + DLT/TCPA controls.
  • Structured interview from a question template per role (skills, availability, salary expectation, work authorization, scenario questions).
  • Adaptive follow-ups — ask clarifying questions, but stay on-script for fairness/auditability.
  • Scoring & summary against a rubric; flags, transcript, and recommendation to recruiter.
  • Schedule the human round (calendar slot) for passers; polite decline/hold for others.
  • ATS sync (Greenhouse, Lever, Workday, Zoho Recruit) — write back results.
  • Bias/fairness guardrails (see 03 §HR): no protected-class questions, consistent questions, human-in-the-loop decisioning, candidate notice + opt-out to a human.

Success: screens completed, recruiter hours saved, time-to-schedule ↓, candidate CSAT, adverse-impact monitoring.

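⚠️ Highest-regulation skill. Employment screening is classed as "high-risk" under the EU AI Act, and automated hiring tools can fall under rules such as NYC Local Law 144 and Illinois AIVIA, as well as EEOC anti-discrimination enforcement. The AI assists; humans decide. Detail in 03.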
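The outbound gating named in the Recruit capabilities (consent plus quiet hours) could be enforced with a pre-dial check along these lines. `may_dial` is a hypothetical helper, and the 21:00 to 08:00 default window is a placeholder; the actual window would come from the tenant's compliance profile and the rules in 03.

```python
from datetime import time

def may_dial(has_consent: bool, opted_out: bool, local_time: time,
             quiet_start: time = time(21, 0), quiet_end: time = time(8, 0)) -> bool:
    """Return True only if an outbound call is allowed right now.

    local_time must be the callee's local time, not the server's.
    """
    if not has_consent or opted_out:
        return False
    if quiet_start <= quiet_end:
        in_quiet = quiet_start <= local_time < quiet_end
    else:
        # Window wraps past midnight, e.g. 21:00 -> 08:00.
        in_quiet = local_time >= quiet_start or local_time < quiet_end
    return not in_quiet
```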

4.4 Skill: AutoSpeak Stay (hotels / hospitality)

Job: Be the reservations + front-desk voice agent for a property.

Capabilities:

  • Reservations — check availability, quote rates, create/modify/cancel bookings via PMS/channel manager (Cloudbeds, Mews, Opera/OHIP, SiteMinder).
  • FAQs & concierge — check-in/out times, amenities, directions, restaurant hours, local recommendations.
  • Multilingual — greet/serve guests in their language (key differentiator for hospitality).
  • Take payment / deposit — secure pay-by-link or PCI-compliant capture (never raw card on our servers; see 03 §PCI).
  • Upsell (room upgrades, late checkout) within configured rules.
  • Escalate to front desk for complex/VIP/complaint cases.

Success: direct (commission-free) bookings captured, after-hours reservations, call deflection, guest CSAT, languages served.

4.5 Skill: Build-your-own agent (horizontal, self-serve)

A no-/low-code Agent Designer: define persona, knowledge, questions/flows, tools, voice, languages, and compliance — for use cases we didn't pre-build (dental, real estate, logistics dispatch, debt-collection-with-care, surveys, appointment reminders). This is what makes the platform horizontal.

The broader menu of candidate verticals — organized into reusable interaction patterns and prioritized into tiers — is in 07 — Vertical Opportunity Map. The three launch skills above are the deliberate beachhead; the menu is the expansion path.


5. Functional requirements

5.1 Onboarding & tenant setup

  • FR-1: Self-serve signup → create Organization (tenant) with isolated data.
  • FR-2: Provision/port a phone number or forward an existing line; per-country number support.
  • FR-3: Pick a skill template (Reception/Recruit/Stay/Blank) to pre-fill config.
  • FR-4: Upload knowledge (docs, URLs, FAQ) → indexed for retrieval.
  • FR-5: Choose/clone a voice (with consent capture for cloning a real person).
  • FR-6: Connect integrations via OAuth (calendar, ATS, PMS, CRM, payments).
  • FR-7: Set compliance profile by region (disclosure script, recording on/off, consent prompts, quiet hours, data residency).
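As an illustration of FR-7, a per-region compliance profile might be modelled as an immutable record with tenant overrides applied on top of regional defaults. All field names and values below are placeholders for illustration, not legal guidance and not the platform's actual schema.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ComplianceProfile:
    disclosure_script: str
    recording_enabled: bool
    consent_prompt: bool
    quiet_hours: tuple[int, int]  # (start_hour, end_hour), callee local time
    data_region: str

# Placeholder defaults per region; real values belong to legal review (see 03).
REGION_DEFAULTS = {
    "EU": ComplianceProfile("I'm an AI assistant.", False, True, (20, 9), "EU"),
    "IN": ComplianceProfile("I'm an AI assistant.", True, True, (21, 9), "IN"),
    "US": ComplianceProfile("I'm an AI assistant.", True, True, (21, 8), "US"),
}

def profile_for(region: str, **overrides) -> ComplianceProfile:
    """Return the region default with tenant-specific overrides applied."""
    if region not in REGION_DEFAULTS:
        raise ValueError(f"no compliance profile configured for region {region!r}")
    return replace(REGION_DEFAULTS[region], **overrides)
```

Refusing unknown regions, rather than silently falling back, is a deliberate choice in this sketch: a call should not start without an explicit profile.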

5.2 Agent configuration

  • FR-8: Persona & instructions (brand tone, do/don't, escalation rules).
  • FR-9: Tools/actions catalog with per-tool enable + parameters.
  • FR-10: Conversation guardrails (topics to avoid, max call length, profanity/abuse handling, hallucination guards, "I don't know → human" rules).
  • FR-11: Routing rules (intents → destinations / humans / hours).
  • FR-12: Multilingual settings (allowed languages, default).
  • FR-13: Versioning + test bench ("call the agent" simulator + transcript review before go-live).
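The routing rules in FR-11 map an intent to a destination that can differ by business hours. A minimal sketch, assuming hypothetical intent names and destination strings:

```python
# Hypothetical rule table: intent -> destination while open / while closed.
ROUTING_RULES = {
    "billing": {"open": "transfer:billing_desk", "closed": "take_message"},
    "new_booking": {"open": "ai:book_appointment", "closed": "ai:book_appointment"},
    "complaint": {"open": "transfer:manager", "closed": "take_message"},
}
FALLBACK = "take_message"  # unknown intents never dead-end

def route(intent: str, business_open: bool) -> str:
    """Resolve an intent to a destination, falling back to message-taking."""
    rule = ROUTING_RULES.get(intent)
    if rule is None:
        return FALLBACK
    return rule["open" if business_open else "closed"]
```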

5.3 Runtime / call handling

  • FR-14: Answer inbound calls within 1–2 rings; place consented outbound calls.
  • FR-15: Speak AI disclosure per compliance profile at call start.
  • FR-16: Real-time STT→LLM→TTS with barge-in and <800ms median turn latency.
  • FR-17: Execute tools mid-call (book, look up, pay-by-link) and confirm verbally.
  • FR-18: Warm/cold transfer to human with context; voicemail capture if no answer.
  • FR-19: Graceful failure: on uncertainty/abuse/STT failure → fallback script → human/message.
  • FR-20: Capture consent + (where enabled) recording; redact sensitive fields (card, etc.).
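For the redaction part of FR-20, one common approach to card numbers in transcripts is a digit-run pattern filtered by the Luhn checksum, so that order or reference numbers are not masked by mistake. This is a sketch under that assumption; `redact_cards` is a hypothetical helper, and transcript redaction alone does not make a system PCI-compliant.

```python
import re

# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    """Replace Luhn-valid card-like numbers with a placeholder."""
    def repl(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED CARD]" if luhn_ok(digits) else match.group()
    return CARD_RE.sub(repl, text)
```

For example, `redact_cards("my card is 4111 1111 1111 1111 thanks")` returns `"my card is [REDACTED CARD] thanks"`, while a 13-digit reference that fails the checksum is left untouched.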

5.4 Post-call

  • FR-21: Transcript, recording (if enabled), structured summary, extracted entities (name, intent, booking, outcome).
  • FR-22: Write-back to system of record (CRM/ATS/PMS) + notifications (SMS/email/Slack).
  • FR-23: Per-call QA score + sentiment + disposition.
  • FR-24: Analytics dashboards (volume, resolution rate, bookings, CSAT, language mix).

5.5 Dashboard (web)

  • FR-25: Live call monitor ("calls happening now" + listen/whisper/barge for supervisors).
  • FR-26: Call history with search/filter, transcripts, recordings, summaries.
  • FR-27: Analytics & reports (per skill); export.
  • FR-28: Agent designer + test bench.
  • FR-29: Integrations manager, number management, team/roles (RBAC), audit log.
  • FR-30: Billing & usage (plan, minutes consumed, invoices).

5.6 Admin / platform

  • FR-31: Multi-tenant isolation, RBAC, SSO (Phase 3: SAML/SCIM).
  • FR-32: Usage metering per tenant (minutes, calls, actions) → billing.
  • FR-33: Rate limits, fraud/abuse controls, concurrency caps per plan.
  • FR-34: Audit trail for compliance (who changed what, consent/disclosure logs, recording access).
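FR-32 feeds metered usage into billing, and goal 7 combines metering with a subscription. A worked sketch of that arithmetic follows; the base fee, included minutes, overage rate, and the round-up-to-whole-minutes policy are all assumptions for illustration, not AutoSpeak pricing.

```python
from decimal import Decimal

def invoice_total(base_fee: Decimal, included_minutes: int,
                  overage_rate: Decimal, used_seconds: int) -> Decimal:
    """Subscription fee plus per-minute overage beyond the plan allowance."""
    billable_minutes = -(-used_seconds // 60)  # ceiling division to whole minutes
    overage_minutes = max(0, billable_minutes - included_minutes)
    return base_fee + overage_rate * overage_minutes

# 31,230 s -> 521 billable minutes -> 21 over a 500-minute plan.
total = invoice_total(Decimal("49.00"), 500, Decimal("0.10"), 31_230)
```

Here `total` is `Decimal("51.10")`: the 49.00 base fee plus 21 overage minutes at 0.10 each. `Decimal` avoids the float rounding drift that would otherwise accumulate across invoices.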

5.7 Mobile (operator companion)

  • FR-35: A mobile operator companion app: live calls, transcripts, summaries, approvals (e.g., approve an escalation), push notifications — not the primary admin surface. (A solopreneur/SMB "answer my line" mode remains available as a niche path.)

6. Non-functional requirements

| Category | Requirement |
|---|---|
| Latency | Median end-of-speech → start-of-speech < 800 ms; p95 < 1.5 s. Barge-in stop < 200 ms. |
| Availability | 99.9% for call path (Phase 1) → 99.95%+ with SLAs (Phase 3). Telephony failover across carriers. |
| Scale | Thousands of concurrent calls; horizontal autoscale of media/inference workers. |
| Reliability | No dropped audio; graceful degradation (if LLM slow → filler/hold phrase; if STT fails → reprompt/human). |
| Security | Encryption in transit + at rest; secrets vault; least-privilege; tenant isolation; PII/PCI redaction. (See 03.) |
| Privacy/residency | Per-region data storage (EU/India/US); configurable retention + deletion. |
| Observability | Tracing per call (latency budget per stage), error budgets, alerting, call-quality scoring. |
| Accessibility | Disclosure + easy human escalation always available ("talk to a person"). |
| Internationalization | Multi-language STT/TTS/LLM; locale-aware dates/currency; right-to-left text in dashboard later. |

6.1 The latency budget (why this is hard)

A natural turn must fit ~800 ms end-to-end:

caller stops speaking
  → VAD endpointing            ~100–200 ms
  → STT final transcript       ~100–300 ms (streaming)
  → LLM first token + response  ~150–400 ms (fast inference)
  → TTS first audio chunk       ~100–300 ms (streaming TTS)
  → network/jitter buffer       ~50–150 ms
≈ 500–900 ms perceived (stages overlap when streamed; summed serially, the ranges give 500–1,350 ms)

Design implications: stream everything, start TTS on first LLM tokens, use fast inference for the LLM, pre-warm models, and co-locate inference near the media server. Detail in 02.
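The arithmetic behind the budget can be checked directly. The sketch below sums the per-stage ranges from the breakdown above and flags stages that exceed their upper bound in a measured call; the stage keys and the `over_budget` helper are illustrative names.

```python
# Per-stage (low, high) budget in milliseconds, taken from the breakdown above.
BUDGET_MS = {
    "vad": (100, 200),
    "stt": (100, 300),
    "llm": (150, 400),
    "tts": (100, 300),
    "network": (50, 150),
}
TARGET_MS = 800

best_case = sum(low for low, _ in BUDGET_MS.values())     # 500
worst_case = sum(high for _, high in BUDGET_MS.values())  # 1350

def over_budget(measured_ms: dict) -> list:
    """Return the stages whose measured time exceeds the upper budget bound."""
    return [stage for stage, ms in measured_ms.items() if ms > BUDGET_MS[stage][1]]
```

Run serially at the upper bounds, the stages total 1,350 ms, well past the 800 ms target. That gap is why streaming and overlapping stages is a requirement rather than an optimization.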


7. Conversation design principles

  1. Disclose early, disclose clearly — "Hi, you've reached <business>. I'm an AI assistant — I can help with X, or connect you to a person anytime."
  2. Short turns — the AI speaks in 1–2 sentences, then yields. No monologues.
  3. Always interruptible (barge-in). Humans interrupt; the agent must stop instantly.
  4. Confirm before acting — read back bookings/payments. "I'll book Tuesday 3pm under Priya — correct?"
  5. Graceful unknowns — never invent. "I'm not sure about that — let me take a message / connect you."
  6. Easy exit to human — at any time, "talk to a person" works.
  7. Match pace & language — detect and switch language; mirror formality.
  8. Respect the caller — quiet hours (outbound), opt-out honored, no pressure tactics.
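Principle 4 (confirm before acting) implies that a tool call stays pending until the caller affirms the read-back. A minimal sketch, assuming a hypothetical `PendingBooking` record and a keyword match where a production agent would more likely use intent classification:

```python
from dataclasses import dataclass
from typing import Callable

AFFIRMATIVE = {"yes", "yeah", "correct", "that's right"}

@dataclass
class PendingBooking:
    name: str
    slot: str

def read_back(booking: PendingBooking) -> str:
    """Phrase the pending action as a question before executing it."""
    return f"I'll book {booking.slot} under {booking.name}. Is that correct?"

def resolve(booking: PendingBooking, reply: str,
            book: Callable[[PendingBooking], str]) -> str:
    """Execute the booking only on an explicit affirmative reply."""
    if reply.strip().lower().rstrip(".!") in AFFIRMATIVE:
        return book(booking)
    return "No problem, nothing has been booked. What should I change?"
```

Anything other than a clear "yes" leaves the system of record untouched, which keeps the failure mode safe when STT mishears the reply.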

8. Success metrics (KPIs)

8.1 Product/runtime

  • Containment / resolution rate (% calls fully handled by AI without human).
  • Median turn latency and p95.
  • Barge-in correctness, ASR word error rate, interruption/over-talk rate.
  • Call completion vs. drop/abandon rate.
  • CSAT (post-call survey / sentiment proxy).
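The runtime KPIs above can be computed from per-call records. This sketch assumes a hypothetical record shape (`resolved_by_ai`, `turn_latencies_ms`) and uses the nearest-rank method for p95, which is one of several accepted percentile definitions.

```python
import math
from statistics import median

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def runtime_kpis(calls):
    """Containment rate plus median and p95 turn latency across all calls."""
    latencies = [ms for call in calls for ms in call["turn_latencies_ms"]]
    contained = sum(1 for call in calls if call["resolved_by_ai"])
    return {
        "containment_rate": contained / len(calls),
        "median_latency_ms": median(latencies),
        "p95_latency_ms": percentile(latencies, 95),
    }

calls = [
    {"resolved_by_ai": True, "turn_latencies_ms": [600, 700, 650, 900, 1600]},
    {"resolved_by_ai": False, "turn_latencies_ms": [720, 680, 750, 640, 710]},
]
kpis = runtime_kpis(calls)
```

For this sample the median is 705 ms (inside the 800 ms target) while p95 is 1,600 ms (outside the 1.5 s target), which shows why both figures are tracked: a single slow turn is invisible in the median.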

8.2 Vertical outcomes

  • Reception: bookings created, missed-call rate ↓, after-hours capture, transfer rate.
  • Recruit: screens completed, recruiter-hours saved, time-to-schedule, pass-through accuracy, adverse-impact metrics.
  • Stay: direct bookings, deflection rate, languages served, revenue/upsell.

8.3 Business

  • Activation (signup → first live call), time-to-value.
  • Minutes processed, retention (logo & revenue), and customer growth.

9. Roadmap (phased overview)

Phases are sequential; each ends with a clear go/no-go gate. Timing depends on team size.

Phase 0 — Productionize the core (multi-tenant)

  • Extract the existing STT→LLM→TTS pipeline behind a clean provider-abstraction interface (swap vendors via config).
  • Introduce tenant/org model + isolation; move secrets/config to managed configuration.
  • Usage metering + structured logging/tracing per call.
  • Gate: a single tenant can be provisioned by config and take a real call with disclosure + transcript + summary.
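The provider-abstraction idea in Phase 0 (swap vendors via config) is commonly expressed as an interface plus a registry keyed by a config value. The sketch below shows that pattern for STT only; the class names, the `stt_provider` key, and the toy providers are all hypothetical and stand in for real vendor adapters.

```python
from typing import Protocol

class STTProvider(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class EchoSTT:
    """Toy provider: treats the audio bytes as UTF-8 text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class UpperSTT:
    """Second toy provider, to show that swapping needs no caller changes."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").upper()

STT_REGISTRY = {"echo": EchoSTT, "upper": UpperSTT}

def build_stt(config: dict) -> STTProvider:
    """Instantiate the STT provider named in tenant or platform config."""
    name = config["stt_provider"]
    if name not in STT_REGISTRY:
        raise ValueError(f"unknown STT provider: {name!r}")
    return STT_REGISTRY[name]()
```

The call path depends only on `STTProvider.transcribe`, so moving a tenant from one vendor to another (or to self-hosted inference in Phase 2) becomes a config change rather than a code change.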

Phase 1 — Reception MVP + design partners

  • Dashboard (setup, call history, transcripts, analytics), Agent Designer v1, test bench.
  • Reception skill: FAQ/RAG, booking (calendar), routing, warm transfer, after-hours.
  • Compliance profile v1 (disclosure, recording toggle, consent), billing (subscription + metered).
  • Onboard a small cohort of design partners (pilots).
  • Gate: committed design partners with live call volume + positive resolution rate.

Phase 1b — Horizontal: Recruit + Stay

  • Recruit: question templates, scoring rubric, ATS integration (start with 1–2), outbound-with-consent + fairness guardrails, candidate opt-out.
  • Stay: PMS integration (start with 1–2), multilingual, pay-by-link, modify/cancel.
  • Build-your-own Agent template generalized from the three.
  • Gate: at least one paying customer per vertical.

Phase 2 — Scale, Hybrid AI, Video avatar

  • Hybrid AI: introduce self-hosted inference behind the provider abstraction while keeping premium voices, to improve efficiency at higher volume.
  • Reliability: carrier failover, autoscaling media/inference, 99.95% target.
  • SOC 2 Type II / ISO 27001 kickoff; enterprise security review pack.
  • Video/avatar module (Phase 2 feature) — see §10.
  • Gate: improved efficiency at scale, first mid-market/enterprise logo, audit started.

Phase 3 — Own models + enterprise

  • Self-host/fine-tune STT+LLM+TTS; in-house voice cloning (privacy + best efficiency at scale).
  • Enterprise: SSO (SAML/SCIM), data residency guarantees, SLAs, private deployment options, on-prem/VPC for regulated buyers.
  • Marketplace of skills/integrations; partner/reseller program.
  • Gate: majority of volume on owned models; enterprise contracts.

10. Phase 2 deep-dive: Video / talking-avatar module

Concept: the same brain (STT→LLM→TTS) drives a photoreal talking human on a screen — a "video receptionist." Use cases:

  • Web concierge (avatar on the business's website/booking page).
  • Lobby/kiosk (hotel check-in avatar, clinic front desk).
  • Video voicemail / personalized outreach (pre-rendered).

Two technical modes:

  1. Pre-rendered (easy): generate a video from a script after the fact. Good for voicemail/marketing, not live conversation.
  2. Real-time avatar (hard): live lip-synced talking head streaming as the AI speaks. Latency + cost are the challenges; layer it on top of the existing voice loop.

Recommendation (Phase 2): start with a real-time avatar provider gated to web/kiosk where latency tolerance is higher than phone; evaluate self-hosting only at scale. Architecture detail in 02 §Video. Keep v1 voice-only.


11. Assumptions, risks & mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Latency feels robotic | Core UX fails | Stream everything; fast LLM; filler phrases; rigorous latency budget + monitoring |
| AI hallucinates wrong info (price, availability) | Trust/legal | RAG + "don't know → human"; confirm before acting; read-backs; tool-grounded answers only |
| Regulatory breach (recording/robocall/disclosure) | Fines, shutdown | Compliance-native runtime; per-region profiles; outbound gating; legal review (see 03) |
| Voice-clone misuse / deepfake concerns | Brand/legal | Consent capture for clones; watermark/disclosure; restrict to consented voices |
| HR bias claims | Legal, reputational | Human-in-the-loop; consistent questions; adverse-impact monitoring; documented; opt-out |
| Vendor lock-in / price hikes | Margin | Provider abstraction; Phase 2/3 self-host path; multi-vendor |
| Telephony deliverability (spam labeling, blocked outbound) | Calls fail | STIR/SHAKEN, branded caller ID, registered numbers, reputation mgmt |

12. Market context (orientation)

Voice AI is an active, fast-growing market with both horizontal voice-agent platforms and vertical incumbents (human-led answering services, HR screening tools, hospitality voice assistants). AutoSpeak's approach is one platform that spans Reception, Recruit, and Stay with vertical depth, compliance by default, multilingual support, and strong integrations.

The strategy in one line: build the horizontal runtime so a single-vertical competitor can't out-feature us, but lead sales with one wedge at a time.


Continue to → 02 — Technical Architecture & Build Plan