02 — Technical Architecture

AutoSpeak · Technical Overview

How the platform is built: system design, the AI pipeline, knowledge grounding, integrations, payments, and security. See 01 — PRD for what we build and why.

This document describes architecture concepts and is directional, not a contractual specification.

1. Design principles

Provider abstraction everywhere. Speech-to-text (STT), language models (LLM), text-to-speech (TTS), telephony, and payments each sit behind a clean interface so providers can be swapped via configuration, not code rewrites. This is the single most important architectural decision.
Stream everything. Latency is the product. No request is fully buffered when it can be streamed.
Multi-tenant from the metadata up. Every record, every call, and every secret is tenant-scoped.
Compliance is a runtime feature, not a checklist. Disclosure, consent, recording, and data residency are first-class code paths.
Stateless media workers, stateful control plane. Scale call handling horizontally while keeping durable state in databases.

2. System architecture

                          ┌───────────────────────── CONTROL PLANE ─────────────────────────┐
                          │  Dashboard (web)   Mobile operator app   Public API / Webhooks   │
                          │        │                  │                      │               │
                          │        └──────────────────┴──────────────────────┘               │
                          │                          API Gateway                              │
                          │   Auth/RBAC · Tenant mgr · Agent config · Billing · Analytics     │
                          └───────┬───────────────────────────────────────────────┬──────────┘
                                  │                                                │
              config/secrets/knowledge                                   usage/events/metering
                                  │                                                │
   ┌──────────────────────────── DATA ───────────────────────┐        ┌───────── PLATFORM SVCS ─────────┐
   │ Operational DB    Transactional DB (billing)             │        │ Billing · Product analytics      │
   │ Vector DB (RAG)   Object store (recordings)  Cache/state │        │ Job queue   Secrets vault        │
   └─────────────────────────────────────────────────────────┘        └─────────────────────────────────┘
                                  │
   ┌────────────────────────── MEDIA / REALTIME PLANE (autoscaled) ───────────────────────────┐
   │  Telephony (media streams, SIP)  ⇄  Media Worker (per call)                                │
   │      Worker loop:  VAD → STT → Dialogue Mgr (LLM + tools + RAG + guardrails) → TTS         │
   │      Tools: calendar · ATS · PMS · CRM · payments · SMS/email · transfer-to-human          │
   └───────────────────────────────────────────────────────────────────────────────────────────┘
                                  │
   ┌───────────────────────────── AI INFERENCE (abstracted) ──────────────────────────────────┐
   │  Speech-to-text · Language model + tools · Text-to-speech / voice                          │
   │  Delivered behind provider interfaces so the inference layer can evolve over time          │
   └───────────────────────────────────────────────────────────────────────────────────────────┘

Three planes:

Control plane — dashboard, API, auth, configuration, billing, analytics (low QPS, durable).
Media/realtime plane — one ephemeral worker per active call; autoscaled; latency-critical.
AI inference — speech, language, and voice models sitting behind the abstraction layer.

3. Component breakdown

3.1 Telephony layer

Inbound: carrier → telephony provider → media stream → media worker.
Outbound: consent-gated dial API → same media worker.
Transfer: SIP REFER / dial-and-bridge to a human, with a whispered context handoff first.
Numbers: provision or port per country; manage caller ID, STIR/SHAKEN attestation, and branded caller ID where supported.
Abstraction: a TelephonyProvider interface (answer, dial, sendAudio, transfer, hangup, dtmf) keeps the platform carrier-agnostic and enables failover.
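
A minimal TypeScript sketch of how such an interface could enable carrier failover. Only the six method names come from the list above; the signatures, the CallId type, and the FailoverDialer and FakeCarrier classes are assumptions made for illustration, not the platform's actual API.

```typescript
// Hypothetical shapes: only the method names come from the document.
type CallId = string;

interface TelephonyProvider {
  readonly name: string;
  answer(call: CallId): Promise<void>;
  dial(to: string, from: string): Promise<CallId>;
  sendAudio(call: CallId, frame: Uint8Array): Promise<void>;
  transfer(call: CallId, target: string): Promise<void>;
  hangup(call: CallId): Promise<void>;
  dtmf(call: CallId, digits: string): Promise<void>;
}

// Tries each configured carrier in order until one accepts the outbound dial.
class FailoverDialer {
  private readonly providers: TelephonyProvider[];
  constructor(providers: TelephonyProvider[]) {
    this.providers = providers;
  }

  async dial(to: string, from: string): Promise<{ provider: string; call: CallId }> {
    const errors: string[] = [];
    for (const p of this.providers) {
      try {
        return { provider: p.name, call: await p.dial(to, from) };
      } catch (err) {
        errors.push(`${p.name}: ${String(err)}`);
      }
    }
    throw new Error(`all carriers failed: ${errors.join("; ")}`);
  }
}

// In-memory stand-in for a real carrier SDK, used only to exercise the sketch.
class FakeCarrier implements TelephonyProvider {
  readonly name: string;
  private readonly healthy: boolean;
  private dialed = 0;
  constructor(name: string, healthy: boolean) {
    this.name = name;
    this.healthy = healthy;
  }
  async answer(): Promise<void> {}
  async dial(): Promise<CallId> {
    if (!this.healthy) throw new Error("carrier unavailable");
    this.dialed += 1;
    return `${this.name}-call-${this.dialed}`;
  }
  async sendAudio(): Promise<void> {}
  async transfer(): Promise<void> {}
  async hangup(): Promise<void> {}
  async dtmf(): Promise<void> {}
}
```

With a failing "primary" and a healthy "backup" carrier, FailoverDialer.dial resolves with provider "backup". A production version would likely add health checks and per-carrier timeouts.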

3.2 Media worker (the call brain, per call)

A stateless process per active call that runs the loop:

Receive audio frames; voice activity detection (VAD) detects speech start/end.
Stream to STT; get partial and final transcripts.
On endpoint, push to the Dialogue Manager: assemble the prompt (persona + RAG context + tools + history), call the LLM in streaming mode, and parse tool calls.
Execute tools (book, lookup, pay-link) as needed; inject results back into the dialogue.
Stream LLM tokens into TTS; stream audio back to the caller; support barge-in (cancel TTS and flush on new speech).
Emit events (transcript, latency, disposition) to the control plane; persist on call end.
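
The barge-in step in the loop above can be sketched as a small event reducer. The event names, the TurnState shape, and the replay helper are hypothetical; a real worker would handle audio frames asynchronously, which this sketch leaves out.

```typescript
// Hypothetical event and state shapes for one call's turn-taking.
type WorkerEvent =
  | { kind: "speech_start" }                    // VAD detected the caller talking
  | { kind: "speech_end"; transcript: string }  // endpoint reached, final transcript ready
  | { kind: "tts_audio"; chunk: string }        // synthesized audio queued for playback
  | { kind: "playback_finished" };              // all queued audio has been played

interface TurnState {
  agentSpeaking: boolean;
  outbound: string[];      // audio chunks queued for the caller
  pendingTurns: string[];  // final transcripts waiting for the dialogue manager
  bargeIns: number;
}

const INITIAL_STATE: TurnState = { agentSpeaking: false, outbound: [], pendingTurns: [], bargeIns: 0 };

function step(state: TurnState, ev: WorkerEvent): TurnState {
  switch (ev.kind) {
    case "tts_audio":
      return { ...state, agentSpeaking: true, outbound: [...state.outbound, ev.chunk] };
    case "playback_finished":
      return { ...state, agentSpeaking: false, outbound: [] };
    case "speech_start":
      // Barge-in: the caller spoke over the agent, so cancel TTS and flush queued audio.
      if (state.agentSpeaking) {
        return { ...state, agentSpeaking: false, outbound: [], bargeIns: state.bargeIns + 1 };
      }
      return state;
    case "speech_end":
      return { ...state, pendingTurns: [...state.pendingTurns, ev.transcript] };
  }
}

// Replays a sequence of events from the initial state.
function replay(events: WorkerEvent[]): TurnState {
  return events.reduce(step, INITIAL_STATE);
}
```

Keeping the turn logic as a pure function makes the barge-in rule easy to test separately from the audio transport.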

Concurrency: workers are horizontally scaled; a scheduler/queue assigns calls; the in-memory cache/state store (see §3.6) holds live call state for monitoring and handoff.

3.3 Dialogue manager

Prompt assembly: system persona (per tenant) + business knowledge (RAG) + tool schemas + conversation memory + compliance instructions.
Tool/function calling: structured actions with deterministic confirmation before any mutating action (bookings, payments).
Guardrails: input guardrails (abuse, jailbreak) and output guardrails (no PII leakage, no hallucinated facts — answers are grounded in tools/RAG, with a "don't know → escalate to human" fallback), plus length and topic limits, enforced with a lightweight policy layer and structured-output validation.
State machine overlay: for structured flows such as Recruit (interview screening) and payments, a deterministic flow constrains the LLM to preserve fairness and auditability rather than letting it improvise.
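
One way the deterministic confirmation step could look, as a sketch. The ToolCall shape, the list of mutating tools, and the accepted confirmation phrases are assumptions; the point is only that the LLM's proposed action is held until the caller explicitly agrees.

```typescript
// Hypothetical tool-call shape; which tools count as mutating is an assumption.
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

const MUTATING_TOOLS = new Set(["createBooking", "takePayment"]);

type GateResult =
  | { action: "execute"; call: ToolCall }
  | { action: "ask_confirmation"; prompt: string }
  | { action: "cancelled" };

class ConfirmationGate {
  private pending: ToolCall | null = null;

  // Called when the LLM proposes a tool call. Read-only tools run immediately.
  propose(call: ToolCall): GateResult {
    if (!MUTATING_TOOLS.has(call.name)) return { action: "execute", call };
    this.pending = call;
    const details = Object.keys(call.args)
      .map((k) => `${k}: ${call.args[k]}`)
      .join(", ");
    return { action: "ask_confirmation", prompt: `To confirm, ${call.name} with ${details}. Is that correct?` };
  }

  // Called with the caller's next utterance. Only an explicit yes executes the held call.
  resolve(utterance: string): GateResult {
    const call = this.pending;
    this.pending = null;
    if (call && /^(yes|yeah|correct|confirm)\b/i.test(utterance.trim())) {
      return { action: "execute", call };
    }
    return { action: "cancelled" };
  }
}
```

The gate is plain code rather than prompt text, so a mutating action cannot run just because the model decided the caller had agreed.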

3.4 Knowledge / RAG

A tenant uploads documents, URLs, or FAQs → content is chunked → embedded → stored in a vector database.
At call time, the most relevant chunks are retrieved into the prompt. This keeps answers grounded and updatable without re-engineering prompts.
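
The chunk, embed, store, retrieve flow can be sketched end to end as follows. The embed function here is a toy word-count vector standing in for a real embedding model, and VectorStore is an in-memory stand-in for a vector database; both are assumptions for illustration.

```typescript
// Toy stand-in for a learned embedding: a sparse word-count vector.
type Vector = Map<string, number>;

function embed(text: string): Vector {
  const v: Vector = new Map();
  for (const w of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    v.set(w, (v.get(w) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach((x, k) => {
    dot += x * (b.get(k) ?? 0);
    na += x * x;
  });
  b.forEach((y) => {
    nb += y * y;
  });
  return na > 0 && nb > 0 ? dot / Math.sqrt(na * nb) : 0;
}

// Splits text into chunks of at most `size` whitespace-separated words.
function chunk(text: string, size: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += size) out.push(words.slice(i, i + size).join(" "));
  return out;
}

interface StoredChunk {
  tenantId: string;
  text: string;
  vector: Vector;
}

class VectorStore {
  private rows: StoredChunk[] = [];

  ingest(tenantId: string, doc: string, chunkSize = 12): void {
    for (const text of chunk(doc, chunkSize)) {
      this.rows.push({ tenantId, text, vector: embed(text) });
    }
  }

  // Returns the top-k chunks for this tenant only, ranked by cosine similarity.
  retrieve(tenantId: string, query: string, k = 2): string[] {
    const q = embed(query);
    return this.rows
      .filter((r) => r.tenantId === tenantId)
      .map((r) => ({ text: r.text, score: cosine(q, r.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((r) => r.text);
  }
}
```

Retrieval filters by tenant before ranking, so one tenant's knowledge can never be pulled into another tenant's prompt.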

3.5 Control plane / API

A unified API handles auth/RBAC, tenant CRUD, agent configuration, integrations, number management, analytics, billing, and webhooks.
AuthN/Z: token-based auth with org-scoped RBAC; enterprise SSO (SAML/SCIM) on the roadmap.
Metering: every call emits usage events (minutes, actions) into billing and analytics.
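
A sketch of what an org-scoped RBAC check might look like. The role names, permission strings, and TokenClaims shape are hypothetical; the document does not define a role model.

```typescript
// Hypothetical roles and permissions, for illustration only.
type Role = "admin" | "operator" | "viewer";
type Permission = "agent:write" | "billing:read" | "calls:read" | "numbers:write";

const ROLE_PERMISSIONS: Record<Role, Permission[]> = {
  admin: ["agent:write", "billing:read", "calls:read", "numbers:write"],
  operator: ["agent:write", "calls:read"],
  viewer: ["calls:read"],
};

// Claims as they might appear in an already-verified access token.
interface TokenClaims {
  userId: string;
  orgId: string;
  role: Role;
}

function authorize(claims: TokenClaims, resourceOrgId: string, needed: Permission): boolean {
  // Org scoping comes first: a token never grants access outside its own organization.
  if (claims.orgId !== resourceOrgId) return false;
  return ROLE_PERMISSIONS[claims.role].indexOf(needed) !== -1;
}
```

For example, an operator in org-a can read calls in org-a but cannot read billing, and an admin in org-a has no access to org-b resources.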

3.6 Data stores

Store	Role	Notes
Operational database	Tenants, agents, calls, transcripts, summaries, voice profiles	Tenant-scoped with appropriate indexing
Transactional database	Billing integrity: subscriptions, invoices, usage ledger	ACID guarantees for anything involving money
Vector database	RAG embeddings	Standard embedding pipeline
Object store	Call recordings, audio samples	Per-region buckets for residency; encrypted; lifecycle-expired
Cache / state store	Live call state, queues, rate limits, sessions	Ephemeral, fast
Analytics warehouse	Call and product analytics	Columnar warehouse plus product analytics tooling

4. The AI pipeline

The voice pipeline is built around a single abstraction so the underlying speech, language, and voice models can evolve over time without changing application logic.

4.1 The abstraction

interface SttProvider   { stream(audio) -> partial/final transcripts }
interface LlmProvider   { complete(messages, tools) -> streamed tokens + tool calls }
interface TtsProvider   { synthesize(text, voiceId) -> streamed audio }
interface VoiceCloner   { clone(samples, consent) -> voiceId }

Configuration selects the implementation per tenant, region, or plan. Application logic never calls a provider SDK directly. This keeps the inference layer interchangeable.
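
As an illustration of configuration-driven selection, the sketch below resolves a TTS implementation per call. The provider keys, the ProviderConfig shape, and the precedence order (tenant, then region, then plan, then default) are assumptions, and synthesize returns an array rather than a stream to keep the example short.

```typescript
// Simplified provider interface: a real one would stream audio.
interface TtsProvider {
  readonly id: string;
  synthesize(text: string, voiceId: string): string[];
}

// Fake implementations stand in for real vendor adapters.
function fakeTts(id: string): TtsProvider {
  return { id, synthesize: (text, voiceId) => [`${id}:${voiceId}:${text}`] };
}

// Hypothetical registry keys; the document names no specific providers.
const TTS_REGISTRY: Record<string, () => TtsProvider> = {
  "standard-voice": () => fakeTts("standard-voice"),
  "premium-voice": () => fakeTts("premium-voice"),
  "eu-hosted-voice": () => fakeTts("eu-hosted-voice"),
};

interface ProviderConfig {
  defaultTts: string;
  byPlan?: Record<string, string>;
  byRegion?: Record<string, string>;
  byTenant?: Record<string, string>;
}

interface CallContext {
  tenantId: string;
  region: string;
  plan: string;
}

// Most specific setting wins: tenant, then region, then plan, then the default.
function resolveTts(cfg: ProviderConfig, ctx: CallContext): TtsProvider {
  const key =
    cfg.byTenant?.[ctx.tenantId] ??
    cfg.byRegion?.[ctx.region] ??
    cfg.byPlan?.[ctx.plan] ??
    cfg.defaultTts;
  const factory = TTS_REGISTRY[key];
  if (!factory) throw new Error(`unknown TTS provider "${key}"`);
  return factory();
}
```

Application code would call resolveTts and use the returned interface, so swapping a vendor means editing the config and registry rather than call-handling logic.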

4.2 Speech-to-text (STT)

Streaming transcription produces partial and final transcripts with low latency, with multilingual support suited to phone-band audio across accents and dialects.

4.3 Language model (LLM)

A multi-model approach routes most conversational turns to a fast, cost-effective model and reserves higher-capability models for harder reasoning. The router sits behind the LlmProvider interface so models can be added or changed by configuration.
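
A routing decision of this kind might be sketched as below. The tier names, the TurnSignals fields, and every threshold are hypothetical placeholders; real routing signals would come out of evaluation.

```typescript
// Hypothetical model tiers and routing signals; thresholds are illustrative only.
type ModelTier = "fast" | "capable";

interface TurnSignals {
  userText: string;
  toolCallsSoFar: number;          // tool calls already made in this conversation
  previousAttemptFailed: boolean;  // e.g. the fast model's output failed validation
}

function routeTurn(s: TurnSignals): ModelTier {
  if (s.previousAttemptFailed) return "capable"; // escalate after a failed attempt
  if (s.toolCallsSoFar >= 3) return "capable";   // long multi-step task
  const wordCount = s.userText.split(/\s+/).filter(Boolean).length;
  if (wordCount > 40) return "capable";          // long, likely complex request
  return "fast";                                 // default for most turns
}
```

A short question such as "What time do you open?" with no prior tool calls routes to the fast tier; a retry after a failed validation routes to the capable tier.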

4.4 Text-to-speech and voice (TTS)

Streaming, low-latency speech synthesis with natural prosody, plus consent-based voice cloning for branded voices. Different voice tiers can be offered to match quality and cost needs per tenant.

4.5 Model evaluation

No model change ships without passing the evaluation harness: golden conversations, word-error-rate (WER) for STT, task success, hallucination rate, and latency for the LLM, and naturalness scoring for TTS. This ties to the test bench in the PRD.
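
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch of that computation follows; the lowercasing and whitespace tokenization are simplifying assumptions, and production harnesses usually apply more careful text normalization first.

```typescript
// Word error rate: word-level Levenshtein distance divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : Infinity;

  // prev[j] holds the distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from the hypothesis
      const insertion = curr[j - 1] + 1; // extra word in the hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For the reference "book a room for two nights" and the hypothesis "book room for to nights", there is one deletion and one substitution over six reference words, so the WER is 2/6.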

Model and provider choices are directional and may change as the technology landscape evolves.

5. Dependencies

The platform depends on a set of standard building blocks, each behind an abstraction to limit lock-in.

5.1 Core runtime dependencies

Need	Approach	Lock-in risk	Mitigation
Telephony / numbers	Carrier media-stream provider	Medium (ported numbers, provider-specific code)	TelephonyProvider abstraction; multi-carrier capable
STT	Streaming transcription provider	Low	Abstraction; self-host path available
LLM	Multi-model router	Low to medium	Abstraction; multi-model routing
TTS / voice clone	Voice synthesis provider	Medium	Tiered voices; abstraction
Vector / RAG	Standard vector store	Low	Standard embeddings
Compute / GPU	Cloud GPU compute	Medium	Containerized and portable
Databases	Operational + transactional + vector + cache	Low	Standard technologies
Auth	Token-based auth	Low	Enterprise SSO option
Hosting	Containerized cloud platform	Medium	Containerized for portability
Email / SMS	Messaging provider	Low	Abstraction
Observability	OpenTelemetry-based	Low	Open standard
Product analytics	Product analytics tooling	Low	—
Error tracking	Error-tracking tooling	Low	—

5.2 Integration dependencies (per vertical)

Vertical	System type	Example targets
Reception	Calendar / scheduling	Google Calendar, Microsoft 365, Calendly, Cal.com
Reception	CRM	HubSpot, Salesforce, Zoho, Pipedrive
Recruit	ATS	Greenhouse, Lever, Workday, Ashby, Zoho Recruit
Stay	PMS / channel manager	Cloudbeds, Mews, Oracle OPERA (OHIP), SiteMinder, RoomRaccoon
All	Messaging	Slack, WhatsApp Business, SMS, email
All	Payments	See §7

5.3 Build vs. buy approach

Component	Approach	Rationale
Carrier network	Buy	Telephony is not something to build in-house
Speech / language / voice models	Standard providers behind an abstraction	Speed and quality
Orchestration / runtime	Build	This is the core product
Billing engine	Buy	Money infrastructure is best left to specialists
Dashboard / agent designer	Build	Core UX and differentiation
Video avatar	Buy first	Validate before considering self-hosting
SSO / SCIM	Buy	Enterprise table-stakes, not differentiating
Compliance tooling	Buy	Accelerate certifications

6. Integrations architecture

OAuth-based connectors per system; tokens stored encrypted and tenant-scoped.
Tool adapters expose a uniform interface to the dialogue manager (createBooking, checkAvailability, pushCandidateResult, takePayment).
Outbound webhooks (call ended, booking made) drive customer automation; a public API enables programmatic control.
iPaaS escape hatch: Zapier/Make-style connectors cover the long tail so we don't have to build every integration directly.
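
A sketch of how a tool adapter could present a uniform interface to the dialogue manager. Only the tool names checkAvailability and createBooking come from the text above; the argument shapes, the InMemoryCalendar class, and the runTool dispatcher are assumptions, and the calls are synchronous here only for brevity.

```typescript
// Hypothetical adapter contract; real connectors would be asynchronous.
interface BookingAdapter {
  checkAvailability(date: string): string[];                        // free slots for a date
  createBooking(date: string, slot: string, name: string): string;  // returns a booking id
}

// In-memory stand-in for a calendar connector.
class InMemoryCalendar implements BookingAdapter {
  private taken: Record<string, string[]> = {};
  private slots: string[] = ["09:00", "10:00", "11:00"];

  checkAvailability(date: string): string[] {
    const used = this.taken[date] ?? [];
    return this.slots.filter((s) => used.indexOf(s) === -1);
  }

  createBooking(date: string, slot: string, name: string): string {
    if (this.checkAvailability(date).indexOf(slot) === -1) throw new Error("slot unavailable");
    this.taken[date] = (this.taken[date] ?? []).concat(slot);
    return `${date}T${slot}/${name}`;
  }
}

// Dispatches a structured tool call from the dialogue manager to the adapter.
function runTool(adapter: BookingAdapter, name: string, args: Record<string, string>): unknown {
  switch (name) {
    case "checkAvailability":
      return adapter.checkAvailability(args["date"]);
    case "createBooking":
      return adapter.createBooking(args["date"], args["slot"], args["name"]);
    default:
      throw new Error(`unknown tool "${name}"`);
  }
}
```

Because the dialogue manager only sees runTool and the adapter contract, a Google Calendar or Calendly connector could replace InMemoryCalendar without changes to dialogue logic.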

7. Payments architecture (two distinct flows)

7.1 Flow A — Billing the business (subscriptions + usage)

Subscription billing combines plans (seats/tiers) with metered usage (minutes, calls, actions). A usage ledger feeds the billing provider.
Local payment methods and tax handling are supported per region, including a Merchant-of-Record option to manage global sales-tax/VAT/GST compliance.
Dunning, invoices, and proration are handled by the billing provider.
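
To make the plan-plus-usage model concrete, the sketch below rolls up a usage ledger and estimates a charge. The Plan fields and prices are invented for illustration, and in practice the billing provider, not this code, would compute the actual invoice.

```typescript
// Hypothetical plan shape and prices, in integer cents to avoid floating-point drift.
interface Plan {
  baseCents: number;
  includedMinutes: number;
  overageCentsPerMinute: number;
  centsPerAction: number;
}

interface UsageEvent {
  tenantId: string;
  kind: "minutes" | "action";
  quantity: number;
}

// Sums one tenant's ledger entries for the billing period.
function rollUp(tenantId: string, ledger: UsageEvent[]): { minutes: number; actions: number } {
  let minutes = 0;
  let actions = 0;
  for (const e of ledger) {
    if (e.tenantId !== tenantId) continue;
    if (e.kind === "minutes") minutes += e.quantity;
    else actions += e.quantity;
  }
  return { minutes, actions };
}

// Base fee plus overage minutes (rounded up) plus per-action charges.
function estimateInvoiceCents(plan: Plan, tenantId: string, ledger: UsageEvent[]): number {
  const usage = rollUp(tenantId, ledger);
  const overageMinutes = Math.max(0, Math.ceil(usage.minutes) - plan.includedMinutes);
  return plan.baseCents + overageMinutes * plan.overageCentsPerMinute + usage.actions * plan.centsPerAction;
}
```

With a 9900-cent base, 500 included minutes, 12 cents per overage minute, and 5 cents per action, a tenant with 525.5 minutes and 3 actions is estimated at 9900 + 26 × 12 + 3 × 5 = 10227 cents.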

7.2 Flow B — The AI takes a payment from a caller

This is the PCI-sensitive path, designed to keep card data out of the voice channel wherever possible (see 03 — Compliance & Security). Options, in preferred order:

Pay-by-link (preferred): the AI sends an SMS/WhatsApp/email payment link and the caller pays on a hosted page. No card data ever touches the voice channel or our servers. Lowest PCI scope.
Agent-assisted with certified DTMF capture: the caller types card details on the keypad into a PCI-certified capture service that masks digits from the AI and the recording. Higher scope; used only if needed.
Tokenized profile on file: charge a stored, tokenized method (for returning guests) — the AI never sees the card number.

Never transcribe a spoken card number into logs or recordings. Detect-and-redact plus recording pause during payment are enforced.
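
A minimal sketch of the detect-and-redact idea for transcripts, assuming card numbers appear as digit runs. It matches 13 to 19 digits with optional spaces or hyphens and redacts only runs that pass the Luhn checksum; the placeholder string and function names are hypothetical. Numbers spoken as words ("four two four two") would need handling upstream, which this sketch does not attempt.

```typescript
// Luhn checksum: doubles every second digit from the right and sums the results.
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

function redactCardNumbers(text: string): string {
  // 13 to 19 digits, optionally separated by single spaces or hyphens.
  return text.replace(/\b\d(?:[ -]?\d){12,18}\b/g, (match) => {
    const digits = match.replace(/[ -]/g, "");
    return luhnValid(digits) ? "[REDACTED_CARD]" : match;
  });
}
```

The Luhn check reduces false positives on order numbers and similar digit strings. A stricter policy could redact every long digit run regardless, trading readability for safety.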

8. Video / avatar architecture (roadmap)

Pre-rendered: script → avatar vendor → video file, for voicemail and marketing use cases.
Real-time avatar: drive a streaming talking-head from the same LLM + TTS output.
- Buy first: integrate interactive-avatar vendors that consume our audio/text stream while we continue to own the conversational brain.
- Build later: self-host open lip-sync/talking-head models only if volume justifies it and latency is acceptable.
Gate to web/kiosk first, where latency tolerance is higher than on the phone.
Architecture: the media worker emits (audio + visemes/text) → avatar renderer → WebRTC video to the browser or kiosk.

9. Security architecture (summary; full treatment in 03 — Compliance & Security)

TLS everywhere; encryption at rest (databases, object store, backups).
A secrets vault keeps all keys out of code and source control.
Tenant isolation at the data layer; per-request tenant scoping; RBAC; audit logs.
PII/PCI redaction in transcripts and recordings; configurable retention with hard delete.
Per-region data residency routing (EU / India / US buckets and databases).
Least-privilege service accounts; signed webhooks; rate limiting; abuse detection.
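
Two of the points above, per-request tenant scoping and per-region residency routing, might look like this in outline. The region codes, bucket names, key layout, and the TenantScopedCalls class are all hypothetical.

```typescript
// Hypothetical region codes and bucket names.
type Region = "eu" | "in" | "us";

const RECORDING_BUCKETS: Record<Region, string> = {
  eu: "recordings-eu",
  in: "recordings-in",
  us: "recordings-us",
};

interface Tenant {
  id: string;
  region: Region;
}

interface CallRecord {
  tenantId: string;
  callId: string;
}

// Recordings are written to the bucket for the tenant's home region.
function recordingKey(tenant: Tenant, callId: string): string {
  return `${RECORDING_BUCKETS[tenant.region]}/${tenant.id}/${callId}.wav`;
}

// Repository bound to one tenant: there is no method that reads across tenants.
class TenantScopedCalls {
  private readonly tenantId: string;
  private readonly rows: CallRecord[];

  constructor(tenantId: string, rows: CallRecord[]) {
    this.tenantId = tenantId;
    this.rows = rows;
  }

  list(): CallRecord[] {
    return this.rows.filter((r) => r.tenantId === this.tenantId);
  }

  get(callId: string): CallRecord | undefined {
    return this.list().filter((r) => r.callId === callId)[0];
  }
}
```

Binding the tenant when the repository is constructed, typically once per request, means individual queries cannot forget the tenant filter.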

10. Tech stack summary

Backend / API: Node + TypeScript.
Media workers: Node/TypeScript, with a performance path available for latency-critical audio if needed.
Inference serving: Python-based serving on GPU for self-hosted components.
Frontend dashboard: React (web).
Mobile: React Native / Expo, as an operator companion app.
Data: operational + transactional + vector databases, a cache/state store, an object store, and an analytics warehouse.
Infra: containerized cloud with infrastructure-as-code, CI/CD, and OpenTelemetry-based observability.
Payments: subscription billing plus regional payment methods.

Continue to → 03 — Compliance & Security