AutoSpeak — Master Document & Index

The complete picture: what we are building, how it works, and how we stay compliant.

Status: Product documentation v1.0
Scope: Global / multi-region vision · Horizontal platform (3 skills) · AI strategy presented as a progression toward owned models · Video/avatar = roadmap.
Execution sequence: India-first launch, then global springboard (see 06). The full vertical menu beyond the launch three is in 07.

Note: This documentation is directional and informational, not legal advice. Compliance requirements vary by jurisdiction; validate against current regulations and qualified counsel before launch.

0. How to read this suite

This is a suite of linked documents plus this master index. Read them in order; each builds on the last.

#	Document	Answers the question	Primary audience
00	This master index	What is the whole picture?	Everyone
01	Product Requirements Document	What are we building and for whom?	Product, eng, design
02	Technical Architecture	How does it work? Architecture, AI brain, payments	Engineering
03	Compliance & Security	How do we stay legal with data, consent & AI disclosure?	Legal, security
06	India-First Implementation Route Map	Where/how do we launch first? India execution path	Founders, ops, eng
07	Vertical Opportunity Map	What else can it do? Use cases + prioritization	Product, sales

1. The one-paragraph thesis

We already built the hard part. AutoSpeak today is a working real-time voice agent: it answers a phone call, understands the caller (speech-to-text), thinks (LLM), and replies in a natural, human-sounding voice (text-to-speech) — over both cloud telephony and a real phone SIM. The core pipeline is proven. The opportunity is to turn that same pipeline into a multi-tenant platform that businesses use to handle their phone calls: a receptionist that never sleeps, a recruiter that screens hundreds of candidates overnight, and a hotel agent that takes reservations in any language. Much of the technology is reusable; the work ahead is multi-tenancy, vertical "skills," integrations (calendars, ATS, PMS, payments), and a compliance + billing layer.

2. What we are building (in plain terms)

AutoSpeak is a Voice AI platform. A business signs up, picks a phone number (or forwards their existing one), configures an AI agent through a dashboard, and that agent answers/makes calls on their behalf — speaking like a real human, taking actions (booking, routing, logging, screening), and handing off to a person when needed. Every call is transcribed, summarized, analyzed, and (where law allows) recorded.

On top of one shared Voice Agent Runtime, we ship three vertical solutions at launch:

Working name	Vertical	What the AI does	Buyer
AutoSpeak Reception	General business / SMB front desk	Answers the main line, qualifies callers, books appointments, routes/escalates, takes messages, answers FAQs	Owner / office manager
AutoSpeak Recruit	HR / recruiting	Outbound + inbound candidate screening: asks role questions, scores answers, schedules interviews, syncs to ATS	Recruiter / TA lead
AutoSpeak Stay	Hotels / hospitality	Inbound reservations, availability, FAQs, concierge, modify/cancel bookings, takes payment	GM / front-office manager

A roadmap video/avatar layer would add a photoreal "talking human" front-end (web concierge, kiosk, video voicemail) driven by the same brain. It is documented as roadmap, not as part of v1.

3. Why now / why us

The pipeline exists and is code-verified: STT → LLM → TTS with voice cloning, cloud telephony media streams, native call handling, a persistent data store, and analytics.
Voice AI quality crossed the "is this a human?" threshold. Sub-second latency + emotive TTS now make natural phone conversations viable — the core bet of this product.
Every business with a phone number is a potential customer. Reception, recruiting and hospitality are three concrete wedges into a horizontal market (any inbound/outbound voice workflow).
The economics improve with scale. Our staged AI strategy (below) moves the platform from managed APIs toward owned, self-hosted models as volume grows — improving cost, privacy, and quality control over time.

4. The staged AI strategy

The "brain" of every call is three models in a loop: STT (hear), LLM (think), TTS (speak). How we source those models shapes cost, speed, quality, and privacy posture. We move through stages rather than picking one — each is the right answer at a different point of growth. Full detail in 02.

Stage	Name	Approach	Why
1	Orchestrate	Best-in-class managed STT / LLM / TTS APIs	Fastest to market, top quality, zero ML ops. Validate product and experience.
2	Hybrid	Self-host the parts that are most cost-effective to run in-house; keep managed services where quality matters most	Balance cost and quality as volume grows.
3	Self-host / fine-tune	Fine-tuned open STT + LLM + open TTS, voice cloning in-house	Best control over cost, data privacy, and voice quality at scale.

Approach: Build Stage 1 now, architect every model behind a provider-abstraction interface (swap STT/LLM/TTS via config, no app rewrite), and treat later stages as an evolution triggered by volume — not a rewrite.
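
As a rough illustration of the provider-abstraction idea, the sketch below puts each model behind a small interface and selects implementations by config key. All class names, registry keys, and the stand-in providers are hypothetical and chosen only for this example; a real provider would wrap a managed API or a self-hosted model behind the same methods.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Hypothetical stand-in providers used only to make the sketch runnable.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, transcript: str) -> str:
        return f"You said: {transcript}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


# Registries keyed by the names that would appear in config (illustrative).
STT_PROVIDERS: Dict[str, Callable[[], SpeechToText]] = {"echo": EchoSTT}
LLM_PROVIDERS: Dict[str, Callable[[], LanguageModel]] = {"canned": CannedLLM}
TTS_PROVIDERS: Dict[str, Callable[[], TextToSpeech]] = {"bytes": BytesTTS}


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """One hear -> think -> speak pass over a single caller utterance."""
        text = self.stt.transcribe(caller_audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


def build_pipeline(config: Dict[str, str]) -> VoicePipeline:
    """Pick each provider by config key, so swapping one needs no app change."""
    return VoicePipeline(
        stt=STT_PROVIDERS[config["stt"]](),
        llm=LLM_PROVIDERS[config["llm"]](),
        tts=TTS_PROVIDERS[config["tts"]](),
    )


pipeline = build_pipeline({"stt": "echo", "llm": "canned", "tts": "bytes"})
reply_audio = pipeline.handle_turn(b"book a room")
```

Under this shape, moving a model from a managed API to a self-hosted one would mean registering a new provider and changing one config value, while `VoicePipeline` stays untouched.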

5. Core capabilities at a glance

Full architecture in 02 — Technical Architecture.

Category	Purpose
Telephony	Carrier connectivity, numbers, SIP, and real-time call media
STT	Speech → text
LLM	Reasoning / dialogue
TTS / voice clone	Text → natural, cloned speech
Compute	App servers and (later) GPU inference
Database	Operational store for accounts, calls, and configuration
Auth	Login and account access (enterprise SSO on the roadmap)
Billing	Subscription and usage billing for business customers
In-call payments	Securely take caller payments where applicable (e.g. hotels), without raw card data touching our servers
Analytics	Call analytics, summaries, and product insights

Each layer is designed behind a clean interface so the underlying provider can evolve as the platform scales.

6. Compliance & regulation — the headline risks

Voice AI that calls real people is one of the most regulated things you can build. This is not optional polish — it is a gating requirement, especially for global launch. Detail in 03 — Compliance & Security. The five that matter most:

AI disclosure: A growing number of jurisdictions require the bot to say it's an AI (e.g. EU AI Act Art. 50, California B.O.T. Act, Utah, Colorado). We build "You're speaking with an AI assistant" disclosure as a configurable, on-by-default runtime feature.
Call recording consent: All-party-consent (often called "two-party") US states (e.g. CA, FL) require every participant's consent to record, and EU rules generally require at least clear notice and a lawful basis for recording. We provide per-jurisdiction consent prompts and recording toggles.
Robocall / outbound rules: US TCPA (prior express consent for autodialed/AI calls), India TRAI DLT registration for commercial calling, and EU national rules all apply. Outbound (especially AutoSpeak Recruit) is the highest-risk surface.
Data privacy: GDPR (EU), DPDP Act 2023 (India), CCPA/CPRA (California). Voice can constitute biometric/personal data; voice clones are special-category-adjacent. We support consent, DPAs, data-residency, deletion, and a sub-processor list.
Payments: PCI DSS applies the moment a caller reads a card number. Raw card numbers never touch our servers; we use tokenized pay-by-link / certified DTMF capture.
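
The first two items (on-by-default AI disclosure, per-jurisdiction consent prompts and recording toggles) could be modelled roughly as below. This is a minimal sketch under stated assumptions: the jurisdiction codes, the policy table, the prompt wording for consent, and the choice to treat unknown jurisdictions as consent-required are all illustrative, and the real table would need to come from counsel.

```python
from dataclasses import dataclass
from typing import Dict, List

AI_DISCLOSURE = "You're speaking with an AI assistant."
CONSENT_PROMPT = "This call may be recorded. Do you consent to recording?"

# Illustrative table only. CA and FL are the examples named in the text;
# "EU" is treated conservatively as consent-required for this sketch.
RECORDING_CONSENT_REQUIRED: Dict[str, bool] = {
    "US-CA": True,
    "US-FL": True,
    "EU": True,
    "XX-NOTICE-ONLY": False,  # hypothetical placeholder, not a real jurisdiction
}


@dataclass(frozen=True)
class CallPolicy:
    jurisdiction: str
    ai_disclosure: bool = True      # on by default, configurable per tenant
    recording_enabled: bool = True  # per-tenant recording toggle


def consent_required(jurisdiction: str) -> bool:
    """Unknown jurisdictions fall back to the stricter behaviour (assumption)."""
    return RECORDING_CONSENT_REQUIRED.get(jurisdiction, True)


def opening_prompts(policy: CallPolicy) -> List[str]:
    """Prompts the agent speaks before the conversation proper begins."""
    prompts: List[str] = []
    if policy.ai_disclosure:
        prompts.append(AI_DISCLOSURE)
    if policy.recording_enabled and consent_required(policy.jurisdiction):
        prompts.append(CONSENT_PROMPT)
    return prompts


def may_record(policy: CallPolicy, caller_consented: bool) -> bool:
    """Recording needs the toggle on and, where required, caller consent."""
    if not policy.recording_enabled:
        return False
    if consent_required(policy.jurisdiction):
        return caller_consented
    return True
```

Keeping the decision in one pure function makes it straightforward to unit-test each jurisdiction entry and to audit why a given call was or was not recorded.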

Sector overlays: HR screening → anti-discrimination/EEOC, NYC Local Law 144, Illinois AI Video Interview Act, EU AI Act "high-risk" for employment. Hospitality → payment and cancellation/consumer rules.

This section is a directional overview of our compliance posture, not legal advice.

7. The roadmap in one view

Detailed in 01 — Product Requirements Document. Phases are indicative and sequential.

 PHASE 0  Harden the core (multi-tenant)        ── productionize the existing pipeline
 PHASE 1  Reception MVP + design partners       ── 1 vertical live, billing, dashboard
 PHASE 1b Recruit + Stay skills                 ── horizontal: 3 verticals on 1 runtime
 PHASE 2  Scale + Hybrid AI + Video avatar      ── add avatar, security certifications
 PHASE 3  Self-host models + enterprise         ── owned models, SSO, SLAs, global

8. Platform overview

AutoSpeak is built as a single, shared Voice Agent Runtime that serves many isolated business accounts (multi-tenancy), with vertical "skills" layered on top. The same brain powers every skill, so improvements to latency, quality, and reliability benefit all customers at once. The platform is designed to scale from early design partners to enterprise customers with SSO, SLAs, and multi-region data residency.
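
One way to picture the isolation described here is a store where every read and write is scoped by tenant, with a skill attached to each account. The sketch below is an in-memory illustration only; `TenantStore`, `CallRecord`, and the skill identifiers are hypothetical names, and a real deployment would enforce the same scoping in the database layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three launch skills, using illustrative identifiers.
SKILLS = {"reception", "recruit", "stay"}


@dataclass
class CallRecord:
    call_id: str
    transcript: str


@dataclass
class Tenant:
    tenant_id: str
    skill: str
    calls: List[CallRecord] = field(default_factory=list)


class TenantStore:
    """Every operation takes a tenant_id; no method spans tenants."""

    def __init__(self) -> None:
        self._tenants: Dict[str, Tenant] = {}

    def register(self, tenant_id: str, skill: str) -> None:
        if skill not in SKILLS:
            raise ValueError(f"unknown skill: {skill}")
        if tenant_id in self._tenants:
            raise ValueError(f"tenant already exists: {tenant_id}")
        self._tenants[tenant_id] = Tenant(tenant_id, skill)

    def log_call(self, tenant_id: str, record: CallRecord) -> None:
        self._tenants[tenant_id].calls.append(record)

    def calls_for(self, tenant_id: str) -> List[CallRecord]:
        # Return a copy so callers cannot mutate another tenant's data by reference.
        return list(self._tenants[tenant_id].calls)


store = TenantStore()
store.register("clinic-a", "reception")
store.register("hotel-b", "stay")
store.log_call("clinic-a", CallRecord("c1", "Caller asked to book an appointment."))
```

After these calls, `store.calls_for("clinic-a")` holds one record and `store.calls_for("hotel-b")` is empty, which is the isolation property the shared runtime has to preserve.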

9. Glossary

Term	Meaning
STT / ASR	Speech-to-text / automatic speech recognition
LLM	Large language model — the "thinking"/dialogue layer
TTS	Text-to-speech — generates the spoken voice
Barge-in	Caller interrupting the AI mid-sentence; AI stops and listens
Turn latency	Time from when the caller stops speaking to when the AI starts speaking (target < 800 ms)
Multi-tenancy	One platform instance serving many isolated business accounts
Skill / vertical	A pre-built configuration of the agent for a use case (Reception/Recruit/Stay)
ATS / PMS	Applicant Tracking System (HR) / Property Management System (hotels)
IVR	Interactive Voice Response — the legacy "press 1 for sales" menus we replace
DLT	India's commercial-calling registry (TRAI)
DPA	Data Processing Agreement — contract for handling personal data
VAD	Voice Activity Detection — detects when the caller is speaking
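
The turn-latency definition above lends itself to a small worked example. The sketch assumes two timestamps per turn, one when VAD reports the end of caller speech and one when the first TTS audio plays; the `Turn` type, field names, and sample values are hypothetical, and only the 800 ms target comes from the glossary.

```python
from dataclasses import dataclass
from typing import List

TARGET_TURN_LATENCY_MS = 800  # target stated in the glossary


@dataclass(frozen=True)
class Turn:
    caller_stopped_ms: int  # when VAD reports the caller stopped speaking
    agent_started_ms: int   # when the first agent audio starts playing

    @property
    def latency_ms(self) -> int:
        return self.agent_started_ms - self.caller_stopped_ms


def fraction_within_target(turns: List[Turn]) -> float:
    """Share of turns whose latency is strictly under the target."""
    if not turns:
        return 0.0
    ok = sum(1 for t in turns if t.latency_ms < TARGET_TURN_LATENCY_MS)
    return ok / len(turns)


# Made-up sample timestamps, in milliseconds since call start.
turns = [Turn(1000, 1650), Turn(5000, 5900), Turn(9000, 9700)]
latencies = [t.latency_ms for t in turns]  # [650, 900, 700]
share = fraction_within_target(turns)      # 2 of 3 turns meet the target
```

Tracking the share of turns under target, rather than only the average, keeps occasional slow turns visible, since those are the ones a caller notices.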

Continue to → 01 — Product Requirements Document
