01 — Product Requirements Document (PRD)

AutoSpeak B2B Voice AI Platform · v1.0 · 2026-06-02

Read 00 — Master Index first. Architecture detail lives in 02. This is a product overview, not legal advice; it is directional and subject to change.

1. Vision & problem

1.1 Problem

Every business with a phone number loses money and goodwill on calls:

Missed calls = missed revenue. SMBs miss a large share of inbound calls (after hours, busy, understaffed). Each missed call at a clinic, salon, or hotel is a lost booking.
Repetitive calls burn expensive humans. Front desks, recruiters, and reservation agents spend most of their day on the same 20 questions and routine scheduling.
Legacy IVR ("press 1 for…") is hated. It's rigid, can't understand free speech, and can't do anything beyond routing.
Hiring/training/retaining phone staff is hard and doesn't scale to spikes (seasonal, campaign-driven, overnight).

1.2 Vision

A human-quality AI voice agent that any business can configure in minutes to answer and make calls, take actions, and hand off to a person when needed — in any language, 24/7, at a fraction of the cost of a human seat. Not a smarter IVR; a colleague on the phone.

1.3 What makes it different

Sounds human — emotive, low-latency, cloned/branded voice (your best receptionist's voice, or a custom brand voice).
Does things, not just talks — books, screens, routes, takes payment, updates systems of record.
Horizontal runtime, vertical skills — one engine, pre-built solutions for Reception, Recruiting, Hospitality, and an open "build-your-own-agent" designer.
Compliance-native — AI disclosure, consent, recording controls, and data residency are first-class features, not afterthoughts.

2. Goals & non-goals

2.1 Goals (v1 → Phase 2)

Multi-tenant platform: a business self-serves an AI phone agent end-to-end.
Three production vertical skills on one runtime: Reception, Recruit, Stay.
Sub-second, natural, interruptible conversations in ≥3 languages.
Actions via integrations: calendars, ATS, PMS, CRM, payments, SMS/email follow-up.
A dashboard for setup, live monitoring, transcripts, analytics, and billing.
Compliance controls (disclosure, consent, recording, residency) configurable per tenant/region.
Metered + subscription billing.

2.2 Non-goals (explicitly out of scope for v1)

Full contact-center suite (workforce mgmt, omnichannel chat/email ticketing) — integrate, don't rebuild.
Outbound mass marketing/telemarketing campaigns (regulatory landmine; we gate outbound tightly).
On-device/offline inference.
Real-time video avatar (Phase 2).
Building our own carrier network (we ride established telephony providers).

3. Personas

3.1 Buyers (who pays)

Persona	Vertical	Pain	Buying trigger
Priya, SMB owner (clinic/salon/agency, 1–20 staff)	Reception	Missing calls = lost bookings; can't afford a receptionist	Wants 24/7 coverage cheaply
Rahul, Talent Acquisition lead (50–5000 hires/yr)	Recruit	Recruiters waste hours on phone screens; slow funnel	Volume hiring season, cost per hire
Sara, Hotel GM / Front-office mgr (independent or chain)	Stay	Front desk overwhelmed; after-hours bookings lost; multilingual guests	OTA commission fatigue, staffing gaps
Tom, Head of CX / Ops (mid-market)	Horizontal	Call volume spikes, IVR hated, CSAT low	Cost + CSAT mandate

3.2 Users (who interacts)

Admin — configures agents, integrations, billing.
Agent designer / ops — writes prompts/flows, reviews transcripts, tunes.
Human fallback agent — receives escalations/handoffs.
The caller / candidate / guest — the end person on the phone (must always be respected: disclosure, opt-out, human escalation).

4. The product: one runtime, three skills

4.1 Shared Voice Agent Runtime (the platform core)

Every skill is a configuration over the same runtime. The runtime provides:

Telephony I/O — inbound answer, outbound dial, transfer, DTMF, voicemail (carrier media streams).
Real-time loop — VAD → STT → dialogue manager (LLM + state) → TTS, with barge-in (caller can interrupt) and <800ms turn latency target.
Dialogue manager — system persona + business knowledge (RAG over the tenant's docs/FAQ) + tools (function calling) + guardrails.
Actions/tools — calendar, ATS, PMS, CRM, payments, send SMS/email, lookup/booking APIs, warm/cold transfer to human.
Memory — per-caller context, conversation history, CRM enrichment.
Knowledge — tenant uploads docs/URLs → chunked + embedded → retrieved at call time.
Voice — pick a stock voice, clone a consented human voice, or a brand voice; per-tenant.
Languages — auto-detect + multilingual response; configurable allow-list.
Compliance layer — AI disclosure line, consent capture, recording on/off, redaction, residency routing.
Observability — live transcript, recording, summary, sentiment, QA scoring, analytics.
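
As a rough illustration of the real-time loop and barge-in behavior listed above, the sketch below runs one turn with each stage passed in as a plain callable. The names `Turn`, `run_turn`, and `caller_is_speaking` are hypothetical and not taken from the architecture in 02. It also checks for barge-in only between reply chunks, whereas a production loop would need to cut audio mid-chunk to meet the interruption target.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Turn:
    heard: str                                    # final STT transcript for this turn
    spoken: List[str] = field(default_factory=list)
    interrupted: bool = False


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], Iterable[str]],
    tts: Callable[[str], bytes],
    play: Callable[[bytes], None],
    caller_is_speaking: Callable[[], bool],
) -> Turn:
    """Run one STT -> LLM -> TTS turn, stopping as soon as the caller barges in."""
    turn = Turn(heard=stt(audio))
    for sentence in llm(turn.heard):      # consume the reply as it streams in
        if caller_is_speaking():          # barge-in check before each chunk
            turn.interrupted = True
            break
        play(tts(sentence))               # synthesize and play this chunk
        turn.spoken.append(sentence)
    return turn
```

Passing the stages in as callables keeps the loop independent of any particular STT, LLM, or TTS vendor, which is the same idea as the provider abstraction planned for Phase 0.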

4.2 Skill: AutoSpeak Reception (general business / SMB)

Job: Be the front desk. Answer the main line; understand intent; handle or route.

Capabilities:

Greet with business branding + AI disclosure.
Answer FAQs from the tenant's knowledge base (hours, location, services, pricing).
Qualify and route ("Are you a new or existing patient?" → correct destination).
Book / reschedule / cancel appointments (Google/Microsoft/Calendly, or vertical scheduler).
Take a message + send structured summary (SMS/email/CRM) when it can't resolve.
Warm transfer to a human with spoken context handoff.
After-hours mode, overflow mode (only when humans are busy), and full-time mode.

Success: % calls fully resolved by AI, bookings created, missed-call rate ↓, after-hours capture.
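
The three answering modes above could be reduced to a single pick-up decision. The following is a minimal sketch under stated assumptions: `AnswerMode` and `ai_should_answer` are illustrative names, business hours fall within one calendar day, and `humans_busy` is supplied by the telephony layer.

```python
from datetime import time
from enum import Enum


class AnswerMode(Enum):
    AFTER_HOURS = "after_hours"   # AI answers only outside business hours
    OVERFLOW = "overflow"         # AI answers only when every human line is busy
    FULL_TIME = "full_time"       # AI answers every call


def ai_should_answer(mode: AnswerMode, now: time, opens: time,
                     closes: time, humans_busy: bool) -> bool:
    """Decide whether the AI agent picks up; assumes opens < closes on the same day."""
    if mode is AnswerMode.FULL_TIME:
        return True
    if mode is AnswerMode.AFTER_HOURS:
        return not (opens <= now < closes)
    return humans_busy                # OVERFLOW mode
```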

4.3 Skill: AutoSpeak Recruit (HR screening)

Job: Conduct structured first-round phone screens at scale (outbound scheduled or inbound), score, and schedule next steps.

Capabilities:

Outbound (consented) or inbound ("call this number to complete your screening") — outbound gated by consent + quiet-hours + DLT/TCPA controls.
Structured interview from a question template per role (skills, availability, salary expectation, work authorization, scenario questions).
Adaptive follow-ups — ask clarifying questions, but stay on-script for fairness/auditability.
Scoring & summary against a rubric; flags, transcript, and recommendation to recruiter.
Schedule the human round (calendar slot) for passers; polite decline/hold for others.
ATS sync (Greenhouse, Lever, Workday, Zoho Recruit) — write back results.
Bias/fairness guardrails (see 03 §HR): no protected-class questions, consistent questions, human-in-the-loop decisioning, candidate notice + opt-out to a human.

Success: screens completed, recruiter hours saved, time-to-schedule ↓, candidate CSAT, adverse-impact monitoring.

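⚠️ Highest-regulation skill. The EU AI Act classifies AI used for employment screening as "high-risk", and US rules such as NYC Local Law 144, the Illinois AI Video Interview Act (AIVIA), and EEOC-enforced anti-discrimination law may also apply. The AI assists; humans decide. Detail in 03.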
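
The outbound gating described for Recruit (consent, opt-out, quiet hours) can be expressed as one pre-dial check. This is a sketch only: `Contact`, `in_quiet_hours`, and `may_dial` are hypothetical names, and the quiet-hours window is a parameter because the actual limits depend on the region's compliance profile (see 03).

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Contact:
    has_consent: bool   # candidate agreed to receive a screening call
    opted_out: bool     # candidate later asked not to be called


def in_quiet_hours(local_now: time, start: time, end: time) -> bool:
    """True inside the quiet window; handles windows that cross midnight."""
    if start <= end:
        return start <= local_now < end
    return local_now >= start or local_now < end


def may_dial(contact: Contact, local_now: time,
             quiet_start: time, quiet_end: time) -> bool:
    """Allow an outbound call only with consent, no opt-out, and outside quiet hours."""
    return (contact.has_consent
            and not contact.opted_out
            and not in_quiet_hours(local_now, quiet_start, quiet_end))
```

`local_now` is meant to be the time in the callee's timezone, not the server's, which is an easy detail to get wrong.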

4.4 Skill: AutoSpeak Stay (hotels / hospitality)

Job: Be the reservations + front-desk voice agent for a property.

Capabilities:

Reservations — check availability, quote rates, create/modify/cancel bookings via PMS/channel manager (Cloudbeds, Mews, Opera/OHIP, SiteMinder).
FAQs & concierge — check-in/out times, amenities, directions, restaurant hours, local recommendations.
Multilingual — greet/serve guests in their language (key differentiator for hospitality).
Take payment / deposit — secure pay-by-link or PCI-compliant capture (never raw card on our servers; see 03 §PCI).
Upsell (room upgrades, late checkout) within configured rules.
Escalate to front desk for complex/VIP/complaint cases.

Success: direct (commission-free) bookings captured, after-hours reservations, call deflection, guest CSAT, languages served.

4.5 Skill: Build-your-own agent (horizontal, self-serve)

A no-/low-code Agent Designer: define persona, knowledge, questions/flows, tools, voice, languages, and compliance — for use cases we didn't pre-build (dental, real estate, logistics dispatch, debt-collection-with-care, surveys, appointment reminders). This is what makes the platform horizontal.

The broader menu of candidate verticals — organized into reusable interaction patterns and prioritized into tiers — is in 07 — Vertical Opportunity Map. The three launch skills above are the deliberate beachhead; the menu is the expansion path.

5. Functional requirements

5.1 Onboarding & tenant setup

FR-1: Self-serve signup → create Organization (tenant) with isolated data.
FR-2: Provision/port a phone number or forward an existing line; per-country number support.
FR-3: Pick a skill template (Reception/Recruit/Stay/Blank) to pre-fill config.
FR-4: Upload knowledge (docs, URLs, FAQ) → indexed for retrieval.
FR-5: Choose/clone a voice (with consent capture for cloning a real person).
FR-6: Connect integrations via OAuth (calendar, ATS, PMS, CRM, payments).
FR-7: Set compliance profile by region (disclosure script, recording on/off, consent prompts, quiet hours, data residency).
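
One possible shape for the per-region compliance profile in FR-7 is shown below. The field names, the `PROFILES` table, and every value in it are assumptions for illustration; real settings would come out of the legal review referenced in 03.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ComplianceProfile:
    disclosure_script: str
    recording_enabled: bool
    consent_prompt: bool
    quiet_hours: Tuple[str, str]   # local "HH:MM" start and end for outbound calls
    data_residency: str            # storage region for transcripts and recordings


# Illustrative values only, not legal guidance.
PROFILES: Dict[str, ComplianceProfile] = {
    "EU": ComplianceProfile("I'm an AI assistant for {business}.",
                            False, True, ("20:00", "09:00"), "EU"),
    "US": ComplianceProfile("I'm an AI assistant for {business}.",
                            True, True, ("21:00", "08:00"), "US"),
}


def opening_line(region: str, business: str) -> str:
    """Build the first spoken line; unknown regions raise KeyError (fail closed)."""
    profile = PROFILES[region]
    line = profile.disclosure_script.format(business=business)
    if profile.recording_enabled and profile.consent_prompt:
        line += " This call may be recorded; is that okay?"
    return line
```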

5.2 Agent configuration

FR-8: Persona & instructions (brand tone, do/don't, escalation rules).
FR-9: Tools/actions catalog with per-tool enable + parameters.
FR-10: Conversation guardrails (topics to avoid, max call length, profanity/abuse handling, hallucination guards, "I don't know → human" rules).
FR-11: Routing rules (intents → destinations / humans / hours).
FR-12: Multilingual settings (allowed languages, default).
FR-13: Versioning + test bench ("call the agent" simulator + transcript review before go-live).

5.3 Runtime / call handling

FR-14: Answer inbound within 1–2 rings; place consented outbound.
FR-15: Speak AI disclosure per compliance profile at call start.
FR-16: Real-time STT→LLM→TTS with barge-in and <800ms median turn latency.
FR-17: Execute tools mid-call (book, look up, pay-by-link) and confirm verbally.
FR-18: Warm/cold transfer to human with context; voicemail capture if no answer.
FR-19: Graceful failure: on uncertainty/abuse/STT failure → fallback script → human/message.
FR-20: Capture consent + (where enabled) recording; redact sensitive fields (card, etc.).
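
For the redaction part of FR-20, a transcript filter might look like the sketch below. It is an assumption-laden example rather than the platform's redaction design: `redact_cards` and the `[REDACTED CARD]` placeholder are made up here, and the approach (a digit-run pattern plus a Luhn checksum to cut false positives) covers card numbers only.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Replace Luhn-valid card-like numbers; leave other digit runs untouched."""
    def repl(match: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED CARD]" if luhn_ok(digits) else match.group()
    return CARD_RE.sub(repl, text)
```

Redacting text after the fact is a second line of defense; the pay-by-link flow in 4.4 is what keeps card data from reaching the transcript in the first place.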

5.4 Post-call

FR-21: Transcript, recording (if enabled), structured summary, extracted entities (name, intent, booking, outcome).
FR-22: Write-back to system of record (CRM/ATS/PMS) + notifications (SMS/email/Slack).
FR-23: Per-call QA score + sentiment + disposition.
FR-24: Analytics dashboards (volume, resolution rate, bookings, CSAT, language mix).

5.5 Dashboard (web)

FR-25: Live call monitor ("calls happening now" + listen/whisper/barge for supervisors).
FR-26: Call history with search/filter, transcripts, recordings, summaries.
FR-27: Analytics & reports (per skill); export.
FR-28: Agent designer + test bench.
FR-29: Integrations manager, number management, team/roles (RBAC), audit log.
FR-30: Billing & usage (plan, minutes consumed, invoices).

5.6 Admin / platform

FR-31: Multi-tenant isolation, RBAC, SSO (Phase 3: SAML/SCIM).
FR-32: Usage metering per tenant (minutes, calls, actions) → billing.
FR-33: Rate limits, fraud/abuse controls, concurrency caps per plan.
FR-34: Audit trail for compliance (who changed what, consent/disclosure logs, recording access).
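
To make the link between usage metering (FR-32) and metered plus subscription billing concrete, here is a small worked example. The function name, the per-call round-up to whole minutes, and all prices are hypothetical; amounts are kept in integer cents to avoid floating-point drift.

```python
import math
from typing import Iterable


def monthly_invoice_cents(plan_fee_cents: int, included_minutes: int,
                          overage_cents_per_min: int,
                          call_seconds: Iterable[int]) -> int:
    """Subscription fee plus overage for minutes beyond the plan allowance."""
    # Assumption: each call is rounded up to the next whole minute.
    billable_minutes = sum(math.ceil(s / 60) for s in call_seconds)
    overage_minutes = max(0, billable_minutes - included_minutes)
    return plan_fee_cents + overage_minutes * overage_cents_per_min
```

With a 4,900-cent plan that includes 10 minutes and charges 12 cents per extra minute, calls of 30 s, 61 s, and 600 s bill as 1 + 2 + 10 = 13 minutes, so the invoice is 4,900 + 3 × 12 = 4,936 cents.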

5.7 Mobile (operator companion)

FR-35: A mobile operator companion app: live calls, transcripts, summaries, approvals (e.g., approve an escalation), push notifications — not the primary admin surface. (A solopreneur/SMB "answer my line" mode remains available as a niche path.)

6. Non-functional requirements

Category	Requirement
Latency	Median end-of-speech → start-of-speech < 800 ms; p95 < 1.5 s. Barge-in stop < 200 ms.
Availability	99.9% for call path (Phase 1) → 99.95% target (Phase 2) → 99.95%+ backed by SLAs (Phase 3). Telephony failover across carriers.
Scale	Thousands of concurrent calls; horizontal autoscale of media/inference workers.
Reliability	No dropped audio; graceful degradation (if LLM slow → filler/hold phrase; if STT fails → reprompt/human).
Security	Encryption in transit + at rest; secrets vault; least-privilege; tenant isolation; PII/PCI redaction. (See 03.)
Privacy/residency	Per-region data storage (EU/India/US); configurable retention + deletion.
Observability	Tracing per call (latency budget per stage), error budgets, alerting, call-quality scoring.
Accessibility	Disclosure + easy human escalation always available ("talk to a person").
Internationalization	Multi-language STT/TTS/LLM; locale-aware dates/currency; right-to-left text in dashboard later.
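
The latency row above sets two thresholds (median under 800 ms, p95 under 1.5 s). A monitoring check for them could be sketched as follows, assuming per-turn latencies are collected in milliseconds and using the nearest-rank definition of p95. Both the function names and that percentile method are choices made for this example.

```python
import statistics
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_latency_slo(turn_ms: Sequence[float],
                      median_limit_ms: float = 800,
                      p95_limit_ms: float = 1500) -> bool:
    """True when both the median and the p95 turn latency are under their limits."""
    return (statistics.median(turn_ms) < median_limit_ms
            and p95(turn_ms) < p95_limit_ms)
```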

6.1 The latency budget (why this is hard)

A natural turn must fit ~800 ms end-to-end:

caller stops speaking
  → VAD endpointing            ~100–200 ms
  → STT final transcript       ~100–300 ms (streaming)
  → LLM first token + response  ~150–400 ms (fast inference)
  → TTS first audio chunk       ~100–300 ms (streaming TTS)
  → network/jitter buffer       ~50–150 ms
≈ 500–900 ms perceived (stages overlap when streamed; a strict serial sum of the ranges above is 500–1,350 ms)

Design implications: stream everything, start TTS on first LLM tokens, use fast inference for the LLM, pre-warm models, and co-locate inference near the media server. Detail in 02.
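
The arithmetic behind the budget can be checked directly. The snippet below adds up the stage ranges from the breakdown above as if each stage waited for the previous one to finish; the stage keys are shorthand invented for this example.

```python
from typing import Dict, Tuple

# (best case, worst case) per stage in milliseconds, copied from the budget above.
STAGES_MS: Dict[str, Tuple[int, int]] = {
    "vad_endpointing": (100, 200),
    "stt_final_transcript": (100, 300),
    "llm_first_token": (150, 400),
    "tts_first_audio": (100, 300),
    "network_jitter": (50, 150),
}


def serial_budget(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum best and worst cases assuming no overlap between stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


print(serial_budget(STAGES_MS))   # (500, 1350)
```

The serial worst case of 1,350 ms is well above the 800 ms target, which is why the design leans on streaming and overlapping stages rather than on making any single stage faster.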

7. Conversation design principles

Disclose early, disclose clearly — "Hi, you've reached <business>. I'm an AI assistant — I can help with X, or connect you to a person anytime."
Short turns — the AI speaks in 1–2 sentences, then yields. No monologues.
Always interruptible (barge-in). Humans interrupt; the agent must stop instantly.
Confirm before acting — read back bookings/payments. "I'll book Tuesday 3pm under Priya — correct?"
Graceful unknowns — never invent. "I'm not sure about that — let me take a message / connect you."
Easy exit to human — at any time, "talk to a person" works.
Match pace & language — detect and switch language; mirror formality.
Respect the caller — quiet hours (outbound), opt-out honored, no pressure tactics.

8. Success metrics (KPIs)

8.1 Product/runtime

Containment / resolution rate (% calls fully handled by AI without human).
Median turn latency and p95.
Barge-in correctness, ASR word error rate, interruption/over-talk rate.
Call completion vs. drop/abandon rate.
CSAT (post-call survey / sentiment proxy).
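
Of the runtime metrics above, ASR word error rate has a standard definition: substitutions, deletions, and insertions divided by the number of reference words. A minimal way to compute it is a word-level edit distance, sketched here with whitespace tokenization as a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For the reference "book a table for two" and the hypothesis "book table for too", there is one deletion and one substitution over five reference words, giving a WER of 0.4.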

8.2 Vertical outcomes

Reception: bookings created, missed-call rate ↓, after-hours capture, transfer rate.
Recruit: screens completed, recruiter-hours saved, time-to-schedule, pass-through accuracy, adverse-impact metrics.
Stay: direct bookings, deflection rate, languages served, revenue/upsell.
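
The adverse-impact metrics for Recruit are not specified in this document. One widely used screening heuristic is the four-fifths rule, which compares each group's pass rate with the highest group's rate and flags ratios below 0.8. The sketch below assumes that heuristic and hypothetical group labels and counts; whether it is the appropriate test is a question for 03 and legal review.

```python
from typing import Dict, List


def impact_ratios(passed: Dict[str, int], screened: Dict[str, int]) -> Dict[str, float]:
    """Each group's pass rate divided by the highest group's pass rate."""
    rates = {group: passed[group] / screened[group] for group in screened}
    top = max(rates.values())
    if top == 0:                     # nobody passed: no group is favored
        return {group: 1.0 for group in rates}
    return {group: rate / top for group, rate in rates.items()}


def flagged_groups(passed: Dict[str, int], screened: Dict[str, int],
                   threshold: float = 0.8) -> List[str]:
    """Groups whose impact ratio falls below the threshold (four-fifths by default)."""
    ratios = impact_ratios(passed, screened)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)
```

A flag from this check is a prompt for human review of the question template and rubric, not a finding on its own.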

8.3 Business

Activation (signup → first live call), time-to-value.
Minutes processed, retention (logo & revenue), and customer growth.

9. Roadmap (phased overview)

Phases run sequentially, and each ends with a go/no-go gate. Timing depends on team size.

Phase 0 — Productionize the core (multi-tenant)

Extract the existing STT→LLM→TTS pipeline behind a clean provider-abstraction interface (swap vendors via config).
Introduce tenant/org model + isolation; move secrets/config to managed configuration.
Usage metering + structured logging/tracing per call.
Gate: a single tenant can be provisioned by config and take a real call with disclosure + transcript + summary.
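
The provider-abstraction idea in Phase 0 (swap vendors via config) might look something like this for the STT stage. Everything here is illustrative: `SpeechToText`, the two toy providers, the `stt_provider` config key, and `build_stt` are assumptions, and the real interface is defined in 02.

```python
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class EchoSTT:
    """Toy provider: treats the audio bytes as UTF-8 text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseSTT:
    """Second toy provider, to show that a swap needs no caller changes."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").upper()


STT_PROVIDERS: Dict[str, Callable[[], SpeechToText]] = {
    "echo": EchoSTT,
    "upper": UppercaseSTT,
}


def build_stt(config: dict) -> SpeechToText:
    """Instantiate the STT provider named in config; reject unknown names."""
    name = config["stt_provider"]
    if name not in STT_PROVIDERS:
        raise ValueError(f"unknown STT provider: {name!r}")
    return STT_PROVIDERS[name]()
```

Callers depend only on `transcribe`, so changing `{"stt_provider": "echo"}` to `{"stt_provider": "upper"}` switches vendors without touching call-handling code. The same pattern would apply to the LLM and TTS stages.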

Phase 1 — Reception MVP + design partners

Dashboard (setup, call history, transcripts, analytics), Agent Designer v1, test bench.
Reception skill: FAQ/RAG, booking (calendar), routing, warm transfer, after-hours.
Compliance profile v1 (disclosure, recording toggle, consent), billing (subscription + metered).
Onboard a small cohort of design partners (pilots).
Gate: committed design partners with live call volume + positive resolution rate.

Phase 1b — Horizontal: Recruit + Stay

Recruit: question templates, scoring rubric, ATS integration (start with 1–2), outbound-with-consent + fairness guardrails, candidate opt-out.
Stay: PMS integration (start with 1–2), multilingual, pay-by-link, modify/cancel.
Build-your-own Agent template generalized from the three.
Gate: at least one paying customer per vertical.

Phase 2 — Scale, Hybrid AI, Video avatar

Hybrid AI: introduce self-hosted inference behind the provider abstraction while keeping premium voices, to improve efficiency at higher volume.
Reliability: carrier failover, autoscaling media/inference, 99.95% target.
SOC 2 Type II / ISO 27001 kickoff; enterprise security review pack.
Video/avatar module (Phase 2 feature) — see §10.
Gate: improved efficiency at scale, first mid-market/enterprise logo, audit started.

Phase 3 — Own models + enterprise

Self-host/fine-tune STT+LLM+TTS; in-house voice cloning (privacy + best efficiency at scale).
Enterprise: SSO (SAML/SCIM), data residency guarantees, SLAs, private deployment options, on-prem/VPC for regulated buyers.
Marketplace of skills/integrations; partner/reseller program.
Gate: majority of volume on owned models; enterprise contracts.

10. Phase 2 deep-dive: Video / talking-avatar module

Concept: the same brain (STT→LLM→TTS) drives a photoreal talking human on a screen — a "video receptionist." Use cases:

Web concierge (avatar on the business's website/booking page).
Lobby/kiosk (hotel check-in avatar, clinic front desk).
Video voicemail / personalized outreach (pre-rendered).

Two technical modes:

Pre-rendered (easy): generate a video from a script after the fact. Good for voicemail/marketing, not live conversation.
Real-time avatar (hard): live lip-synced talking head streaming as the AI speaks. Latency + cost are the challenges; layer it on top of the existing voice loop.

Recommendation (Phase 2): start with a real-time avatar provider gated to web/kiosk where latency tolerance is higher than phone; evaluate self-hosting only at scale. Architecture detail in 02 §Video. Keep v1 voice-only.

11. Assumptions, risks & mitigations

Risk	Impact	Mitigation
Latency feels robotic	Core UX fails	Stream everything; fast LLM; filler phrases; rigorous latency budget + monitoring
AI hallucinates wrong info (price, availability)	Trust/legal	RAG + "don't know → human"; confirm before acting; read-backs; tool-grounded answers only
Regulatory breach (recording/robocall/disclosure)	Fines, shutdown	Compliance-native runtime; per-region profiles; outbound gating; legal review (see 03)
Voice-clone misuse / deepfake concerns	Brand/legal	Consent capture for clones; watermark/disclosure; restrict to consented voices
HR bias claims	Legal, reputational	Human-in-the-loop; consistent questions; adverse-impact monitoring; documented; opt-out
Vendor lock-in / price hikes	Margin	Provider abstraction; Phase 2/3 self-host path; multi-vendor
Telephony deliverability (spam labeling, blocked outbound)	Calls fail	STIR/SHAKEN, branded caller ID, registered numbers, reputation mgmt

12. Market context (orientation)

Voice AI is an active and fast-growing market with both horizontal voice-agent platforms and vertical incumbents (human-led answering services, HR screening tools, hospitality voice assistants). AutoSpeak's approach is to provide one platform that spans Reception, Recruit, and Stay with vertical depth, compliance-by-default, multilingual support, and strong integrations.

The strategy in one line: build the horizontal runtime so that a single-vertical competitor can't out-feature AutoSpeak, but lead sales with one wedge at a time.

Continue to → 02 — Technical Architecture & Build Plan
