Master Index
The whole picture — vision, the product, the three launch skills, and how the pieces fit together.
AutoSpeak — Master Document & Index
The complete picture: what we are building, how it works, and how we stay compliant.
Status: Product documentation v1.0
Scope: Global / multi-region vision · Horizontal platform (3 skills) · AI strategy presented as a progression toward owned models · Video/avatar = roadmap.
Execution sequence: India-first launch, then global springboard (see 06). The full vertical menu beyond the launch three is in 07.
Note: This documentation is directional and informational, not legal advice. Compliance requirements vary by jurisdiction; validate against current regulations and qualified counsel before launch.
0. How to read this suite
This is a suite of linked documents plus this master index. Read them in order; each builds on the last.
| # | Document | Answers the question | Primary audience |
|---|---|---|---|
| 00 | This master index | What is the whole picture? | Everyone |
| 01 | Product Requirements Document | What are we building and for whom? | Product, eng, design |
| 02 | Technical Architecture | How does it work? Architecture, AI brain, payments | Engineering |
| 03 | Compliance & Security | How do we stay legal with data, consent & AI disclosure? | Legal, security |
| 06 | India-First Implementation Route Map | Where/how do we launch first? India execution path | Founders, ops, eng |
| 07 | Vertical Opportunity Map | What else can it do? Use cases + prioritization | Product, sales |
1. The one-paragraph thesis
We already built the hard part. AutoSpeak today is a working real-time voice agent: it answers a phone call, understands the caller (speech-to-text), thinks (LLM), and replies in a natural, human-sounding voice (text-to-speech) — over both cloud telephony and a real phone SIM. The core pipeline is proven. The opportunity is to turn that same pipeline into a multi-tenant platform that businesses use to handle their phone calls: a receptionist that never sleeps, a recruiter that screens hundreds of candidates overnight, and a hotel agent that takes reservations in any language. Much of the technology is reusable; the work ahead is multi-tenancy, vertical "skills," integrations (calendars, ATS, PMS, payments), and a compliance + billing layer.
2. What we are building (in plain terms)
AutoSpeak is a Voice AI platform. A business signs up, picks a phone number (or forwards their existing one), configures an AI agent through a dashboard, and that agent answers/makes calls on their behalf — speaking like a real human, taking actions (booking, routing, logging, screening), and handing off to a person when needed. Every call is transcribed, summarized, analyzed, and (where law allows) recorded.
On top of one shared Voice Agent Runtime, we ship three vertical solutions at launch:
| Working name | Vertical | What the AI does | Buyer |
|---|---|---|---|
| AutoSpeak Reception | General business / SMB front desk | Answers the main line, qualifies callers, books appointments, routes/escalates, takes messages, answers FAQs | Owner / office manager |
| AutoSpeak Recruit | HR / recruiting | Outbound + inbound candidate screening: asks role questions, scores answers, schedules interviews, syncs to ATS | Recruiter / TA lead |
| AutoSpeak Stay | Hotels / hospitality | Inbound reservations, availability, FAQs, concierge, modify/cancel bookings, takes payment | GM / front-office manager |
A roadmap video/avatar layer would add a photoreal "talking human" front-end (web concierge, kiosk, video voicemail) driven by the same brain. Documented as roadmap, not v1.
3. Why now / why us
- The pipeline exists and is code-verified. STT → LLM → TTS with voice cloning, cloud telephony media streams, native call handling, persistent data store, and analytics.
- Voice AI quality crossed the "is this a human?" threshold. Sub-second latency + emotive TTS now make natural phone conversations viable — the core bet of this product.
- Every business with a phone number is a potential customer. Reception, recruiting and hospitality are three concrete wedges into a horizontal market (any inbound/outbound voice workflow).
- The economics improve with scale. Our staged AI strategy (below) moves the platform from managed APIs toward owned, self-hosted models as volume grows — improving cost, privacy, and quality control over time.
4. The staged AI strategy
The "brain" of every call is three models in a loop: STT (hear), LLM (think), TTS (speak). How we source those models shapes cost, speed, quality, and privacy posture. We move through stages rather than picking one — each is the right answer at a different point of growth. Full detail in 02.
| Stage | Name | Approach | Why |
|---|---|---|---|
| 1 | Orchestrate | Best-in-class managed STT / LLM / TTS APIs | Fastest to market, top quality, zero ML ops. Validate product and experience. |
| 2 | Hybrid | Self-host the parts that are most cost-effective to run in-house; keep managed services where quality matters most | Balance cost and quality as volume grows. |
| 3 | Self-host / fine-tune | Fine-tuned open STT + LLM + open TTS, voice cloning in-house | Best control over cost, data privacy, and voice quality at scale. |
Approach: Build Stage 1 now, architect every model behind a provider-abstraction interface (swap STT/LLM/TTS via config, no app rewrite), and treat later stages as an evolution triggered by volume — not a rewrite.
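The provider-abstraction idea above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the interface names (`SpeechToText`, `LanguageModel`, `TextToSpeech`), the registries, and the stand-in providers (`EchoSTT`, `CannedLLM`, `BytesTTS`) are all hypothetical, and the stand-ins do no real speech or language processing. The point is only that a call turn depends on the interfaces, so a provider can be swapped by changing a config value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Stand-in providers (hypothetical) so the sketch runs without network calls.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, transcript: str) -> str:
        return f"You said: {transcript}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


# Registries map a config string to a provider factory.
STT_PROVIDERS: Dict[str, Callable[[], SpeechToText]] = {"echo": EchoSTT}
LLM_PROVIDERS: Dict[str, Callable[[], LanguageModel]] = {"canned": CannedLLM}
TTS_PROVIDERS: Dict[str, Callable[[], TextToSpeech]] = {"bytes": BytesTTS}


@dataclass
class Brain:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Run one hear -> think -> speak turn."""
        transcript = self.stt.transcribe(caller_audio)
        answer = self.llm.reply(transcript)
        return self.tts.synthesize(answer)


def build_brain(config: Dict[str, str]) -> Brain:
    """Select each provider by name; an unknown name raises KeyError."""
    return Brain(
        stt=STT_PROVIDERS[config["stt"]](),
        llm=LLM_PROVIDERS[config["llm"]](),
        tts=TTS_PROVIDERS[config["tts"]](),
    )


brain = build_brain({"stt": "echo", "llm": "canned", "tts": "bytes"})
reply_audio = brain.handle_turn(b"book a table")
```

Under this design, moving a component from a managed API to a self-hosted model would mean registering one more factory and changing one config value; `Brain.handle_turn` stays untouched.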
5. Core capabilities at a glance
Full architecture in 02 — Technical Architecture.
| Category | Purpose |
|---|---|
| Telephony | Carrier connectivity, numbers, SIP, and real-time call media |
| STT | Speech → text |
| LLM | Reasoning / dialogue |
| TTS / voice clone | Text → natural, cloned speech |
| Compute | App servers and (later) GPU inference |
| Database | Operational store for accounts, calls, and configuration |
| Auth | User login (enterprise SSO on the roadmap) |
| Billing | Subscription and usage billing for business customers |
| In-call payments | Securely take caller payments where applicable (e.g. hotels), without raw card data touching our servers |
| Analytics | Call analytics, summaries, and product insights |
Each layer is designed behind a clean interface so the underlying provider can evolve as the platform scales.
6. Compliance & regulation — the headline risks
Voice AI that calls real people is one of the most regulated things you can build. This is not optional polish — it is a gating requirement, especially for global launch. Detail in 03 — Compliance & Security. The five that matter most:
- AI disclosure: A growing number of jurisdictions require the bot to say it's an AI (e.g. EU AI Act Art. 50, California B.O.T. Act, Utah, Colorado). We build "You're speaking with an AI assistant" disclosure as a configurable, on-by-default runtime feature.
- Call recording consent: All-party-consent jurisdictions (often called "two-party consent", e.g. CA, FL) require every party's agreement before a call is recorded. In the EU, recording needs a lawful basis and clear notice under GDPR, and national rules add further conditions. We provide per-jurisdiction consent prompts and recording toggles.
- Robocall / outbound rules: US TCPA (prior express consent for autodialed/AI calls), India TRAI DLT registration for commercial calling, and EU national rules. Outbound (especially AutoSpeak Recruit) is the highest-risk surface.
- Data privacy: GDPR (EU), DPDP Act 2023 (India), CCPA/CPRA (California). Voice can constitute biometric/personal data; voice clones are special-category-adjacent. We support consent, DPAs, data-residency, deletion, and a sub-processor list.
- Payments: PCI DSS applies the moment a caller reads out a card number. Raw card numbers never touch our servers; we use tokenized pay-by-link / certified DTMF capture.
Sector overlays: HR screening → anti-discrimination/EEOC, NYC Local Law 144, Illinois AI Video Interview Act, EU AI Act "high-risk" for employment. Hospitality → payment and cancellation/consumer rules.
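The disclosure and recording-consent items above describe configurable, per-jurisdiction behavior. The sketch below shows one way such a policy table could drive the opening of a call. Everything in it is an assumption made for illustration: the `CallPolicy` fields, the placeholder region keys, and the prompt wording are hypothetical, and real policy values would have to come from counsel, not from code defaults.

```python
from dataclasses import dataclass
from typing import Dict, List

AI_DISCLOSURE = "You're speaking with an AI assistant."
CONSENT_PROMPT = "This call may be recorded. Do you consent to recording?"


@dataclass(frozen=True)
class CallPolicy:
    disclose_ai: bool = True            # on by default, as described above
    recording_enabled: bool = False     # recording stays off unless enabled
    ask_recording_consent: bool = True  # ask before recording when enabled


# Placeholder regions (hypothetical); not a statement of any real law.
POLICIES: Dict[str, CallPolicy] = {
    "region-a": CallPolicy(recording_enabled=True),
    "region-b": CallPolicy(recording_enabled=False),
}
DEFAULT_POLICY = CallPolicy()  # strictest fallback for unknown regions


def opening_prompts(region: str) -> List[str]:
    """Return the lines the agent must speak before the conversation starts."""
    policy = POLICIES.get(region, DEFAULT_POLICY)
    prompts: List[str] = []
    if policy.disclose_ai:
        prompts.append(AI_DISCLOSURE)
    if policy.recording_enabled and policy.ask_recording_consent:
        prompts.append(CONSENT_PROMPT)
    return prompts


def may_record(region: str, caller_consented: bool) -> bool:
    """Recording is allowed only if enabled and, where required, consented to."""
    policy = POLICIES.get(region, DEFAULT_POLICY)
    if not policy.recording_enabled:
        return False
    return caller_consented or not policy.ask_recording_consent
```

Falling back to a default with recording disabled means an unmapped region fails closed: the agent still discloses that it is an AI but never records.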
This section is a directional overview of our compliance posture, not legal advice.
7. The roadmap in one view
Detailed in 01 — Product Requirements Document. Phases are indicative and sequential.
| Phase | Focus | Outcome |
|---|---|---|
| 0 | Harden the core (multi-tenant) | Productionize the existing pipeline |
| 1 | Reception MVP + design partners | 1 vertical live, billing, dashboard |
| 1b | Recruit + Stay skills | Horizontal: 3 verticals on 1 runtime |
| 2 | Scale + Hybrid AI + Video avatar | Add avatar, security certifications |
| 3 | Self-host models + enterprise | Owned models, SSO, SLAs, global |
8. Platform overview
AutoSpeak is built as a single, shared Voice Agent Runtime that serves many isolated business accounts (multi-tenancy), with vertical "skills" layered on top. The same brain powers every skill, so improvements to latency, quality, and reliability benefit all customers at once. The platform is designed to scale from early design partners to enterprise customers with SSO, SLAs, and multi-region data residency.
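As a rough illustration of one runtime serving isolated tenants with skills layered on top, consider the sketch below. The class and field names (`VoiceAgentRuntime`, `Tenant`, `SKILL_GREETINGS`), the greeting text, and the in-memory storage are all hypothetical; a production system would enforce isolation in the database and auth layers, not in a Python dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical opening line for each launch skill.
SKILL_GREETINGS: Dict[str, str] = {
    "reception": "Thanks for calling. How can I direct your call?",
    "recruit": "Hi, I'm calling about your job application.",
    "stay": "Thanks for calling. Would you like to make a reservation?",
}


@dataclass
class Tenant:
    tenant_id: str
    skill: str
    transcripts: List[str] = field(default_factory=list)


class VoiceAgentRuntime:
    """One shared runtime; every call is resolved to exactly one tenant."""

    def __init__(self) -> None:
        self._tenants: Dict[str, Tenant] = {}
        self._number_to_tenant: Dict[str, str] = {}

    def register(self, tenant: Tenant, phone_number: str) -> None:
        if tenant.skill not in SKILL_GREETINGS:
            raise ValueError(f"unknown skill: {tenant.skill}")
        self._tenants[tenant.tenant_id] = tenant
        self._number_to_tenant[phone_number] = tenant.tenant_id

    def handle_inbound(self, dialed_number: str, caller_text: str) -> str:
        # The dialed number decides the tenant; the tenant decides the skill.
        tenant = self._tenants[self._number_to_tenant[dialed_number]]
        tenant.transcripts.append(caller_text)
        return SKILL_GREETINGS[tenant.skill]

    def transcripts_for(self, tenant_id: str) -> List[str]:
        # Return a copy so callers cannot mutate another tenant's data.
        return list(self._tenants[tenant_id].transcripts)
```

Because the skill is just data attached to a tenant, adding a fourth vertical in this sketch would mean adding one entry to the skill table, with no change to call routing.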
9. Glossary
| Term | Meaning |
|---|---|
| STT / ASR | Speech-to-text / automatic speech recognition |
| LLM | Large language model — the "thinking"/dialogue layer |
| TTS | Text-to-speech — generates the spoken voice |
| Barge-in | Caller interrupting the AI mid-sentence; AI stops and listens |
| Turn latency | Time from the caller finishing speaking to the AI starting to speak (target: under 800 ms) |
| Multi-tenancy | One platform instance serving many isolated business accounts |
| Skill / vertical | A pre-built configuration of the agent for a use case (Reception/Recruit/Stay) |
| ATS / PMS | Applicant Tracking System (HR) / Property Management System (hotels) |
| IVR | Interactive Voice Response — the legacy "press 1 for sales" menus we replace |
| DLT | Distributed Ledger Technology: the TRAI-mandated registration platform for commercial calling in India |
| DPA | Data Processing Agreement — contract for handling personal data |
| VAD | Voice Activity Detection — detects when the caller is speaking |
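The turn-latency definition in the glossary reduces to simple arithmetic over two timestamps per turn. The sketch below assumes, hypothetically, that the runtime logs the moment VAD detects the end of caller speech and the moment the first reply audio is sent, both in milliseconds; the names `Turn` and `share_within_target` are illustrative only.

```python
from dataclasses import dataclass
from typing import List

TARGET_MS = 800  # glossary target: turn latency under 800 ms


@dataclass(frozen=True)
class Turn:
    caller_stopped_ms: int  # when VAD detected the end of caller speech
    agent_started_ms: int   # when the first reply audio was sent

    @property
    def latency_ms(self) -> int:
        return self.agent_started_ms - self.caller_stopped_ms


def share_within_target(turns: List[Turn], target_ms: int = TARGET_MS) -> float:
    """Fraction of turns whose latency is strictly under the target."""
    if not turns:
        raise ValueError("no turns to evaluate")
    within = sum(1 for t in turns if t.latency_ms < target_ms)
    return within / len(turns)


turns = [
    Turn(caller_stopped_ms=10_000, agent_started_ms=10_650),  # 650 ms
    Turn(caller_stopped_ms=15_000, agent_started_ms=15_790),  # 790 ms
    Turn(caller_stopped_ms=21_000, agent_started_ms=22_200),  # 1200 ms
]
share = share_within_target(turns)
```

Tracking the share of turns under target, in addition to an average, keeps a few slow turns from being hidden by many fast ones.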
Continue to → 01 — Product Requirements Document