AutoSpeak — Master Document & Index

The complete picture: what we are building, how it works, and how we stay compliant.

Status: Product documentation v1.0
Scope: Global / multi-region vision · Horizontal platform (3 skills) · AI strategy presented as a progression toward owned models · Video/avatar = roadmap.
Execution sequence: India-first launch, then global springboard (see 06).
The full vertical menu beyond the launch three: see 07.

Note: This documentation is directional and informational, not legal advice. Compliance requirements vary by jurisdiction; validate against current regulations and qualified counsel before launch.


0. How to read this suite

This is a suite of linked documents plus this master index. Read them in order; each builds on the last.

| # | Document | Answers the question | Primary audience |
|---|---|---|---|
| 00 | This master index | What is the whole picture? | Everyone |
| 01 | Product Requirements Document | What are we building and for whom? | Product, eng, design |
| 02 | Technical Architecture | How does it work? Architecture, AI brain, payments | Engineering |
| 03 | Compliance & Security | How do we stay legal with data, consent & AI disclosure? | Legal, security |
| 06 | India-First Implementation Route Map | Where/how do we launch first? India execution path | Founders, ops, eng |
| 07 | Vertical Opportunity Map | What else can it do? Use cases + prioritization | Product, sales |

1. The one-paragraph thesis

We already built the hard part. AutoSpeak today is a working real-time voice agent: it answers a phone call, understands the caller (speech-to-text), thinks (LLM), and replies in a natural, human-sounding voice (text-to-speech) — over both cloud telephony and a real phone SIM. The core pipeline is proven. The opportunity is to turn that same pipeline into a multi-tenant platform that businesses use to handle their phone calls: a receptionist that never sleeps, a recruiter that screens hundreds of candidates overnight, and a hotel agent that takes reservations in any language. Much of the technology is reusable; the work ahead is multi-tenancy, vertical "skills," integrations (calendars, ATS, PMS, payments), and a compliance + billing layer.


2. What we are building (in plain terms)

AutoSpeak is a Voice AI platform. A business signs up, picks a phone number (or forwards their existing one), configures an AI agent through a dashboard, and that agent answers/makes calls on their behalf — speaking like a real human, taking actions (booking, routing, logging, screening), and handing off to a person when needed. Every call is transcribed, summarized, analyzed, and (where law allows) recorded.

On top of one shared Voice Agent Runtime, we ship three vertical solutions at launch:

| Working name | Vertical | What the AI does | Buyer |
|---|---|---|---|
| AutoSpeak Reception | General business / SMB front desk | Answers the main line, qualifies callers, books appointments, routes/escalates, takes messages, answers FAQs | Owner / office manager |
| AutoSpeak Recruit | HR / recruiting | Outbound + inbound candidate screening: asks role questions, scores answers, schedules interviews, syncs to ATS | Recruiter / TA lead |
| AutoSpeak Stay | Hotels / hospitality | Inbound reservations, availability, FAQs, concierge, modify/cancel bookings, takes payment | GM / front-office manager |

A video/avatar layer on the roadmap would add a photoreal "talking human" front-end (web concierge, kiosk, video voicemail) driven by the same brain. It is documented as a roadmap item, not part of v1.


3. Why now / why us

  • The pipeline exists and is code-verified. STT → LLM → TTS with voice cloning, cloud telephony media streams, native call handling, persistent data store, and analytics.
  • Voice AI quality crossed the "is this a human?" threshold. Sub-second latency + emotive TTS now make natural phone conversations viable — the core bet of this product.
  • Every business with a phone number is a potential customer. Reception, recruiting and hospitality are three concrete wedges into a horizontal market (any inbound/outbound voice workflow).
  • The economics improve with scale. Our staged AI strategy (below) moves the platform from managed APIs toward owned, self-hosted models as volume grows — improving cost, privacy, and quality control over time.

4. The staged AI strategy

The "brain" of every call is three models in a loop: STT (hear), LLM (think), TTS (speak). How we source those models shapes cost, speed, quality, and privacy posture. We move through stages rather than picking one — each is the right answer at a different point of growth. Full detail in 02.

| Stage | Name | Approach | Why |
|---|---|---|---|
| 1 | Orchestrate | Best-in-class managed STT / LLM / TTS APIs | Fastest to market, top quality, zero ML ops. Validate product and experience. |
| 2 | Hybrid | Self-host the parts that are most cost-effective to run in-house; keep managed services where quality matters most | Balance cost and quality as volume grows. |
| 3 | Self-host / fine-tune | Fine-tuned open STT + LLM + open TTS, voice cloning in-house | Best control over cost, data privacy, and voice quality at scale. |

Approach: Build Stage 1 now, architect every model behind a provider-abstraction interface (swap STT/LLM/TTS via config, no app rewrite), and treat later stages as an evolution triggered by volume — not a rewrite.
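A minimal sketch of what a provider-abstraction interface could look like, assuming a Python runtime. Every class, registry, and config key below (`EchoSTT`, `CannedLLM`, `BytesTTS`, `build_pipeline`) is a hypothetical stand-in invented for illustration; no real vendor SDK is called, and the actual interface is defined in 02.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Hypothetical stand-ins for real providers (assumption: illustration only).
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, transcript: str) -> str:
        return f"You said: {transcript}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


# Registries map a config string to a factory, so swapping a provider
# is a config change rather than an application rewrite.
STT_PROVIDERS: Dict[str, Callable[[], SpeechToText]] = {"echo": EchoSTT}
LLM_PROVIDERS: Dict[str, Callable[[], LanguageModel]] = {"canned": CannedLLM}
TTS_PROVIDERS: Dict[str, Callable[[], TextToSpeech]] = {"bytes": BytesTTS}


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one hear -> think -> speak turn."""
        transcript = self.stt.transcribe(audio)
        response = self.llm.reply(transcript)
        return self.tts.synthesize(response)


def build_pipeline(config: Dict[str, str]) -> VoicePipeline:
    """Look up each provider by the name given in config."""
    return VoicePipeline(
        stt=STT_PROVIDERS[config["stt"]](),
        llm=LLM_PROVIDERS[config["llm"]](),
        tts=TTS_PROVIDERS[config["tts"]](),
    )


pipeline = build_pipeline({"stt": "echo", "llm": "canned", "tts": "bytes"})
audio_out = pipeline.handle_turn(b"book a room")
```

In this sketch, moving from Stage 1 to a later stage would mean registering a self-hosted class under a new key and changing the config value; `VoicePipeline` itself stays untouched.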


5. Core capabilities at a glance

Full architecture in 02 — Technical Architecture.

| Category | Purpose |
|---|---|
| Telephony | Carrier connectivity, numbers, SIP, and real-time call media |
| STT | Speech → text |
| LLM | Reasoning / dialogue |
| TTS / voice clone | Text → natural, cloned speech |
| Compute | App servers and (later) GPU inference |
| Database | Operational store for accounts, calls, and configuration |
| Auth | Login / SSO (and enterprise SSO on the roadmap) |
| Billing | Subscription and usage billing for business customers |
| In-call payments | Securely take caller payments where applicable (e.g. hotels), without raw card data touching our servers |
| Analytics | Call analytics, summaries, and product insights |

Each layer is designed behind a clean interface so the underlying provider can evolve as the platform scales.


6. Compliance & regulation — the headline risks

Voice AI that calls real people is one of the most regulated things you can build. This is not optional polish — it is a gating requirement, especially for global launch. Detail in 03 — Compliance & Security. The five that matter most:

  1. AI disclosure: A growing number of jurisdictions require the bot to say it's an AI (e.g. EU AI Act Art. 50, California B.O.T. Act, Utah, Colorado). We build "You're speaking with an AI assistant" disclosure as a configurable, on-by-default runtime feature.
  2. Call recording consent: All-party-consent jurisdictions (often called two-party-consent; e.g. California, Florida) and the EU require every party's consent to record. We provide per-jurisdiction consent prompts and recording toggles.
  3. Robocall / outbound rules: US TCPA (prior express consent for autodialed/AI calls), India TRAI DLT registration for commercial calling, and EU national rules. Outbound (especially AutoSpeak Recruit) is the highest-risk surface.
  4. Data privacy: GDPR (EU), DPDP Act 2023 (India), CCPA/CPRA (California). Voice can constitute biometric/personal data; voice clones are special-category-adjacent. We support consent, DPAs, data-residency, deletion, and a sub-processor list.
  5. Payments: PCI-DSS applies the moment a caller reads a card number. Raw card numbers never touch our servers; we use tokenized pay-by-link / certified DTMF capture.
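The disclosure and recording-consent controls from items 1 and 2 could be modelled roughly as follows. This is a sketch under stated assumptions: `CallPolicy`, `policy_for`, the consent prompt wording, and the jurisdiction key are hypothetical placeholders, and none of it encodes actual legal requirements. The default is deliberately conservative (disclose, and ask before recording).

```python
from dataclasses import dataclass
from typing import Dict, List

# Disclosure wording quoted from the text above.
AI_DISCLOSURE = "You're speaking with an AI assistant."
# Hypothetical wording (assumption); real prompts need review by counsel.
CONSENT_PROMPT = "This call may be recorded. Do you consent to recording?"


@dataclass(frozen=True)
class CallPolicy:
    disclose_ai: bool = True        # on by default, as the text describes
    recording_enabled: bool = True  # per-jurisdiction recording toggle
    ask_consent: bool = True        # conservative default: always ask first


DEFAULT_POLICY = CallPolicy()

# Placeholder key and settings for illustration only (assumption).
POLICY_OVERRIDES: Dict[str, CallPolicy] = {
    "EXAMPLE-NO-RECORDING": CallPolicy(recording_enabled=False),
}


def policy_for(jurisdiction: str) -> CallPolicy:
    """Return the override for a jurisdiction, else the conservative default."""
    return POLICY_OVERRIDES.get(jurisdiction, DEFAULT_POLICY)


def opening_prompts(policy: CallPolicy) -> List[str]:
    """Prompts the agent speaks before the conversation starts."""
    prompts: List[str] = []
    if policy.disclose_ai:
        prompts.append(AI_DISCLOSURE)
    if policy.recording_enabled and policy.ask_consent:
        prompts.append(CONSENT_PROMPT)
    return prompts


def may_record(policy: CallPolicy, caller_consented: bool) -> bool:
    """Recording is allowed only if enabled and consent rules are satisfied."""
    if not policy.recording_enabled:
        return False
    return caller_consented or not policy.ask_consent
```

Falling back to the strictest behaviour for any unlisted jurisdiction means a missing config entry results in an extra prompt rather than an unconsented recording.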

Sector overlays: HR screening → anti-discrimination/EEOC, NYC Local Law 144, Illinois AI Video Interview Act, EU AI Act "high-risk" for employment. Hospitality → payment and cancellation/consumer rules.

This section is a directional overview of our compliance posture, not legal advice.


7. The roadmap in one view

Detailed in 01 — Product Requirements Document. Phases are indicative and sequential.

 PHASE 0  Harden the core (multi-tenant)        ── productionize the existing pipeline
 PHASE 1  Reception MVP + design partners       ── 1 vertical live, billing, dashboard
 PHASE 1b Recruit + Stay skills                 ── horizontal: 3 verticals on 1 runtime
 PHASE 2  Scale + Hybrid AI + Video avatar      ── add avatar, security certifications
 PHASE 3  Self-host models + enterprise         ── owned models, SSO, SLAs, global

8. Platform overview

AutoSpeak is built as a single, shared Voice Agent Runtime that serves many isolated business accounts (multi-tenancy), with vertical "skills" layered on top. The same brain powers every skill, so improvements to latency, quality, and reliability benefit all customers at once. The platform is designed to scale from early design partners to enterprise customers with SSO, SLAs, and multi-region data residency.
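As a rough illustration of one shared runtime serving isolated tenants with a skill layered on top, consider the sketch below. The class names, greeting templates, phone numbers, and business names are all hypothetical, and a real runtime would isolate tenants at the data-store and auth layers rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical per-skill greeting templates (assumption: illustration only).
SKILL_GREETINGS: Dict[str, str] = {
    "reception": "Thank you for calling {business}. How can I help?",
    "recruit": "Hello, this is the hiring assistant for {business}.",
    "stay": "Thank you for calling {business}. Would you like to make a reservation?",
}


@dataclass
class Tenant:
    tenant_id: str
    business_name: str
    skill: str
    call_log: List[str] = field(default_factory=list)  # visible to this tenant only


class VoiceAgentRuntime:
    """One shared runtime; each dialed number resolves to exactly one tenant."""

    def __init__(self) -> None:
        self._tenants_by_number: Dict[str, Tenant] = {}

    def register(self, phone_number: str, tenant: Tenant) -> None:
        if tenant.skill not in SKILL_GREETINGS:
            raise ValueError(f"unknown skill: {tenant.skill}")
        self._tenants_by_number[phone_number] = tenant

    def answer(self, dialed_number: str, caller_text: str) -> str:
        tenant = self._tenants_by_number[dialed_number]
        tenant.call_log.append(caller_text)  # logged against this tenant only
        return SKILL_GREETINGS[tenant.skill].format(business=tenant.business_name)


runtime = VoiceAgentRuntime()
clinic = Tenant("t1", "Lakeside Clinic", "reception")
hotel = Tenant("t2", "Harbor Inn", "stay")
runtime.register("+10000000001", clinic)
runtime.register("+10000000002", hotel)

greeting = runtime.answer("+10000000002", "Do you have a room for Friday?")
```

Because both tenants share the same `VoiceAgentRuntime` code path, an improvement to `answer` reaches every tenant, while each `call_log` stays separate.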


9. Glossary

| Term | Meaning |
|---|---|
| STT / ASR | Speech-to-text / automatic speech recognition |
| LLM | Large language model: the "thinking"/dialogue layer |
| TTS | Text-to-speech: generates the spoken voice |
| Barge-in | Caller interrupting the AI mid-sentence; AI stops and listens |
| Turn latency | Time from caller stops speaking → AI starts speaking (target <800ms) |
| Multi-tenancy | One platform instance serving many isolated business accounts |
| Skill / vertical | A pre-built configuration of the agent for a use case (Reception/Recruit/Stay) |
| ATS / PMS | Applicant Tracking System (HR) / Property Management System (hotels) |
| IVR | Interactive Voice Response: the legacy "press 1 for sales" menus we replace |
| DLT | India's commercial-calling registry (TRAI) |
| DPA | Data Processing Agreement: contract for handling personal data |
| VAD | Voice Activity Detection: detects when the caller is speaking |
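The turn-latency definition above is simple arithmetic, and a short sketch shows how it might be checked against the 800 ms target. The function names and sample timestamps are made up for illustration; timestamps are assumed to be integer milliseconds from a single clock.

```python
from statistics import median
from typing import Dict, List, Tuple

TARGET_MS = 800  # glossary target: turn latency under 800 ms


def turn_latency_ms(caller_stopped_ms: int, agent_started_ms: int) -> int:
    """Time from the caller stopping to the agent starting, in milliseconds."""
    if agent_started_ms < caller_stopped_ms:
        raise ValueError("agent start precedes end of caller speech")
    return agent_started_ms - caller_stopped_ms


def summarize(turns: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (caller_stopped_ms, agent_started_ms) pairs for one call."""
    latencies = [turn_latency_ms(stop, start) for stop, start in turns]
    return {
        "median_ms": median(latencies),
        "worst_ms": max(latencies),
        # A turn misses the target when it is not strictly under 800 ms.
        "turns_over_target": sum(1 for ms in latencies if ms >= TARGET_MS),
    }


# Made-up sample data: three turns with latencies of 650, 920, and 700 ms.
turns = [(10_000, 10_650), (25_000, 25_920), (40_000, 40_700)]
report = summarize(turns)
```

Here the median (700 ms) meets the target while one turn (920 ms) does not, which is why tracking the worst case alongside the median can be useful.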

Continue to → 01 — Product Requirements Document