Technical Architecture
How the platform is built: the real-time voice pipeline, the runtime, integrations and reliability.
02 — Technical Architecture
AutoSpeak · Technical Overview
How the platform is built: system design, the AI pipeline, knowledge grounding, integrations, payments, and security. See 01 — PRD for what we build and why.
This document describes architecture concepts and is directional, not a contractual specification.
1. Design principles
- Provider abstraction everywhere. Speech-to-text (STT), language models (LLM), text-to-speech (TTS), telephony, and payments each sit behind a clean interface so providers can be swapped via configuration, not code rewrites. This is the single most important architectural decision.
- Stream everything. Latency is the product. No request is fully buffered when it can be streamed.
- Multi-tenant from the metadata up. Every record, every call, and every secret is tenant-scoped.
- Compliance is a runtime feature, not a checklist — disclosure, consent, recording, and data residency are first-class code paths.
- Stateless media workers, stateful control plane. Scale call handling horizontally while keeping durable state in databases.
2. System architecture
┌───────────────────────── CONTROL PLANE ─────────────────────────┐
│  Dashboard (web)   Mobile operator app   Public API / Webhooks  │
│         │                   │                      │            │
│         └───────────────────┴──────────────────────┘            │
│                           API Gateway                           │
│   Auth/RBAC · Tenant mgr · Agent config · Billing · Analytics   │
└───────┬───────────────────────────────────────────────┬─────────┘
        │                                               │
config/secrets/knowledge                      usage/events/metering
        │                                               │
┌──────────────────────────── DATA ───────────────────────┐ ┌───────── PLATFORM SVCS ─────────┐
│ Operational DB   Transactional DB (billing)             │ │  Billing · Product analytics    │
│ Vector DB (RAG)  Object store (recordings)  Cache/state │ │  Job queue   Secrets vault      │
└─────────────────────────────────────────────────────────┘ └─────────────────────────────────┘
        │
┌──────────────────────────── MEDIA / REALTIME PLANE (autoscaled) ────────────────────────────┐
│  Telephony (media streams, SIP) ⇄ Media Worker (per call)                                   │
│  Worker loop: VAD → STT → Dialogue Mgr (LLM + tools + RAG + guardrails) → TTS               │
│  Tools: calendar · ATS · PMS · CRM · payments · SMS/email · transfer-to-human               │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
        │
┌───────────────────────────────── AI INFERENCE (abstracted) ─────────────────────────────────┐
│  Speech-to-text · Language model + tools · Text-to-speech / voice                           │
│  Delivered behind provider interfaces so the inference layer can evolve over time           │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Three planes:
- Control plane — dashboard, API, auth, configuration, billing, analytics (low QPS, durable).
- Media/realtime plane — one ephemeral worker per active call; autoscaled; latency-critical.
- AI inference — speech, language, and voice models sitting behind the abstraction layer.
3. Component breakdown
3.1 Telephony layer
- Inbound: carrier → telephony provider → media stream → media worker.
- Outbound: consent-gated dial API → same media worker.
- Transfer: SIP REFER / dial-and-bridge to a human, with a whispered context handoff first.
- Numbers: provision or port per country; manage caller ID, STIR/SHAKEN attestation, and branded caller ID where supported.
- Abstraction: a TelephonyProvider interface (answer, dial, sendAudio, transfer, hangup, dtmf) keeps the platform carrier-agnostic and enables failover.
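To make the failover idea concrete, here is a minimal TypeScript sketch of a TelephonyProvider interface with an ordered failover wrapper around dial. Only the six method names come from the list above. The parameter shapes, the `name` field, `dialWithFailover`, and `fakeCarrier` are hypothetical, and a real carrier SDK would sit behind each implementation.

```typescript
// Hypothetical shapes: only the six method names come from the document.
type CallId = string;

interface TelephonyProvider {
  readonly name: string;
  answer(callId: CallId): Promise<void>;
  dial(to: string): Promise<CallId>;
  sendAudio(callId: CallId, frame: Uint8Array): Promise<void>;
  transfer(callId: CallId, target: string): Promise<void>;
  hangup(callId: CallId): Promise<void>;
  dtmf(callId: CallId, digits: string): Promise<void>;
}

// Try each configured carrier in order until one places the call.
async function dialWithFailover(
  providers: TelephonyProvider[],
  to: string,
): Promise<{ provider: string; callId: CallId }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      const callId = await p.dial(to);
      return { provider: p.name, callId };
    } catch (err) {
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all carriers failed: ${errors.join("; ")}`);
}

// In-memory stand-in for a carrier SDK, used only to exercise the wrapper.
function fakeCarrier(name: string, healthy: boolean): TelephonyProvider {
  const ok = async (): Promise<void> => {};
  return {
    name,
    answer: ok,
    sendAudio: ok,
    transfer: ok,
    hangup: ok,
    dtmf: ok,
    dial: async (to) => {
      if (!healthy) throw new Error("carrier unavailable");
      return `${name}:${to}`;
    },
  };
}
```

Because callers depend only on the interface, adding a second carrier is a matter of registering another implementation in the ordered list rather than touching call-handling code.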
3.2 Media worker (the call brain, per call)
A stateless process per active call that runs the loop:
- Receive audio frames; voice activity detection (VAD) detects speech start/end.
- Stream to STT; get partial and final transcripts.
- When the end of the caller's turn is detected (endpointing), push to the Dialogue Manager: assemble the prompt (persona + RAG context + tools + history), call the LLM in streaming mode, and parse tool calls.
- Execute tools (book, lookup, pay-link) as needed; inject results back into the dialogue.
- Stream LLM tokens into TTS; stream audio back to the caller; support barge-in (cancel TTS and flush on new speech).
- Emit events (transcript, latency, disposition) to the control plane; persist on call end.
Concurrency: workers are horizontally scaled; a scheduler/queue assigns calls; an in-memory cache holds live call state for monitoring and handoff.
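The barge-in step in the loop above (cancel TTS and flush on new speech) can be sketched as a small per-call controller. This is an illustration under assumptions: `TurnController`, the generation counter, and the use of strings in place of audio buffers are all hypothetical and not taken from the platform.

```typescript
type Playback = "idle" | "speaking";

// Hypothetical per-call controller; strings stand in for audio buffers.
class TurnController {
  private state: Playback = "idle";
  private queue: string[] = []; // synthesized chunks not yet played
  private generation = 0;       // bumped on each barge-in to invalidate stale audio

  // Called as the TTS stream yields audio for the reply started at `generation`.
  enqueue(chunk: string, generation: number): boolean {
    if (generation !== this.generation) return false; // reply was cancelled, drop it
    this.queue.push(chunk);
    this.state = "speaking";
    return true;
  }

  // Called by VAD when the caller starts talking.
  onSpeechStart(): number {
    if (this.state === "speaking") {
      this.queue.length = 0; // flush audio that has not been played yet
      this.generation += 1;  // late chunks from the old reply will be rejected
      this.state = "idle";
    }
    return this.generation;
  }

  currentGeneration(): number {
    return this.generation;
  }

  pending(): number {
    return this.queue.length;
  }
}
```

The generation counter matters because LLM and TTS output is streamed: chunks from a cancelled reply can still arrive after the flush, and tagging each reply lets the worker discard them without extra coordination.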
3.3 Dialogue manager
- Prompt assembly: system persona (per tenant) + business knowledge (RAG) + tool schemas + conversation memory + compliance instructions.
- Tool/function calling: structured actions with deterministic confirmation before any mutating action (bookings, payments).
- Guardrails: input guardrails screen for abuse and jailbreak attempts. Output guardrails block PII leakage and hallucinated facts: answers are grounded in tools/RAG, with a "don't know → escalate to human" fallback. Length and topic limits also apply. All of these are enforced by a lightweight policy layer and structured-output validation.
- State machine overlay: for structured flows such as Recruit (interview screening) and payments, a deterministic flow constrains the LLM to preserve fairness and auditability rather than letting it improvise.
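The "deterministic confirmation before any mutating action" rule above could look roughly like the gate below. It is a sketch, assuming a hypothetical `ToolGate` class and a hypothetical set of mutating tool names. The point is that the decision to execute is made by plain code, not by the model.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type GateResult =
  | { status: "executed"; result: string }
  | { status: "needs_confirmation"; prompt: string };

// Assumption: which tools count as mutating would come from tool metadata.
const MUTATING_TOOLS = new Set(["createBooking", "takePayment"]);

class ToolGate {
  private pending: ToolCall | null = null;
  private readonly run: (call: ToolCall) => string;

  constructor(run: (call: ToolCall) => string) {
    this.run = run;
  }

  // Read-only tools run immediately; mutating tools are parked until confirmed.
  propose(call: ToolCall): GateResult {
    if (!MUTATING_TOOLS.has(call.name)) {
      return { status: "executed", result: this.run(call) };
    }
    this.pending = call;
    return {
      status: "needs_confirmation",
      prompt: `Please confirm ${call.name} with ${JSON.stringify(call.args)}.`,
    };
  }

  // `callerSaidYes` should come from deterministic matching (keyword or keypad),
  // not from the LLM's own judgment. Returns null if nothing was executed.
  confirm(callerSaidYes: boolean): GateResult | null {
    const call = this.pending;
    this.pending = null;
    if (!call || !callerSaidYes) return null;
    return { status: "executed", result: this.run(call) };
  }
}
```

A declined or missing confirmation clears the pending call, so a later "yes" cannot trigger an action the caller never heard read back.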
3.4 Knowledge / RAG
- A tenant uploads documents, URLs, or FAQs → content is chunked → embedded → stored in a vector database.
- At call time, the most relevant chunks are retrieved into the prompt. This keeps answers grounded and updatable without re-engineering prompts.
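The chunk → embed → store → retrieve flow above can be illustrated end to end with a toy in-memory store. Everything here is an assumption made for illustration: the word-count "embedding" stands in for a real embedding model, and `TenantVectorStore` stands in for a vector database. Only the overall shape of the pipeline and its tenant scoping reflect the text.

```typescript
// Split text into chunks of at most `size` words (no overlap, for brevity).
function chunk(text: string, size: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += size) {
    out.push(words.slice(i, i + size).join(" "));
  }
  return out;
}

// Toy embedding: term-frequency vector. A real pipeline would call an embedding model.
function embed(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const w of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    v.set(w, (v.get(w) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach((x, k) => {
    dot += x * (b.get(k) ?? 0);
    na += x * x;
  });
  b.forEach((x) => {
    nb += x * x;
  });
  return na > 0 && nb > 0 ? dot / Math.sqrt(na * nb) : 0;
}

interface StoredChunk {
  tenantId: string;
  text: string;
  vector: Map<string, number>;
}

class TenantVectorStore {
  private rows: StoredChunk[] = [];

  ingest(tenantId: string, doc: string, chunkSize = 12): void {
    for (const text of chunk(doc, chunkSize)) {
      this.rows.push({ tenantId, text, vector: embed(text) });
    }
  }

  // Return the k most similar chunks, restricted to the calling tenant.
  retrieve(tenantId: string, query: string, k = 2): string[] {
    const q = embed(query);
    return this.rows
      .filter((r) => r.tenantId === tenantId)
      .map((r) => ({ text: r.text, score: cosine(q, r.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((r) => r.text);
  }
}
```

The tenant filter is applied before ranking, so a closer match in another tenant's knowledge base can never reach the prompt.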
3.5 Control plane / API
- A unified API handles auth/RBAC, tenant CRUD, agent configuration, integrations, number management, analytics, billing, and webhooks.
- AuthN/Z: token-based auth with org-scoped RBAC; enterprise SSO (SAML/SCIM) on the roadmap.
- Metering: every call emits usage events (minutes, actions) into billing and analytics.
3.6 Data stores
| Store | Role | Notes |
|---|---|---|
| Operational database | Tenants, agents, calls, transcripts, summaries, voice profiles | Tenant-scoped with appropriate indexing |
| Transactional database | Billing integrity: subscriptions, invoices, usage ledger | ACID guarantees for anything involving money |
| Vector database | RAG embeddings | Standard embedding pipeline |
| Object store | Call recordings, audio samples | Per-region buckets for residency; encrypted; lifecycle-expired |
| Cache / state store | Live call state, queues, rate limits, sessions | Ephemeral, fast |
| Analytics warehouse | Call and product analytics | Columnar warehouse plus product analytics tooling |
4. The AI pipeline
The voice pipeline is built around a single abstraction so the underlying speech, language, and voice models can evolve over time without changing application logic.
4.1 The abstraction
interface SttProvider { stream(audio) -> partial/final transcripts }
interface LlmProvider { complete(messages, tools) -> streamed tokens + tool calls }
interface TtsProvider { synthesize(text, voiceId) -> streamed audio }
interface VoiceCloner { clone(samples, consent) -> voiceId }
Configuration selects the implementation per tenant, region, or plan. Application logic never calls a provider SDK directly. This keeps the inference layer interchangeable.
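As one way to picture "configuration selects the implementation", the sketch below renders the TTS interface in TypeScript and resolves a provider key from config. The `name` field, `ProviderRegistry`, `TtsSelection`, the tenant-then-region-then-global precedence, and `fakeTts` are all assumptions for illustration. The document specifies only that selection can vary per tenant, region, or plan.

```typescript
// Mirrors the TtsProvider pseudocode above; `name` is added for illustration.
interface TtsProvider {
  readonly name: string;
  synthesize(text: string, voiceId: string): AsyncIterable<Uint8Array>;
}

// Maps a config key to a factory, so application code never imports a vendor SDK.
class ProviderRegistry<T> {
  private readonly factories = new Map<string, () => T>();

  register(key: string, factory: () => T): void {
    this.factories.set(key, factory);
  }

  resolve(key: string): T {
    const factory = this.factories.get(key);
    if (!factory) throw new Error(`unknown provider "${key}"`);
    return factory();
  }
}

// Hypothetical config shape and precedence: tenant override, then region, then global.
interface TtsSelection {
  global: string;
  byRegion: Partial<Record<string, string>>;
  byTenant: Partial<Record<string, string>>;
}

function selectTtsKey(cfg: TtsSelection, tenantId: string, region: string): string {
  return cfg.byTenant[tenantId] ?? cfg.byRegion[region] ?? cfg.global;
}

// Stand-in implementation that yields one placeholder frame per word.
function fakeTts(name: string): TtsProvider {
  return {
    name,
    async *synthesize(text: string, _voiceId: string) {
      for (const word of text.split(/\s+/).filter(Boolean)) {
        yield new Uint8Array(word.length);
      }
    },
  };
}
```

Swapping a provider then means registering a new factory and changing a config value, which is the property the abstraction is meant to guarantee.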
4.2 Speech-to-text (STT)
Streaming transcription produces partial and final transcripts with low latency, with multilingual support suited to phone-band audio across accents and dialects.
4.3 Language model (LLM)
A multi-model approach routes most conversational turns to a fast, cost-effective model and reserves higher-capability models for harder reasoning. The router sits behind the LlmProvider interface so models can be added or changed by configuration.
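A router of this kind might be as simple as the function below. The signals (`userTokens`, `toolsInPlay`, `priorFailures`) and thresholds are hypothetical, since the document does not say how turns are classified. The sketch only shows that routing can be a pure, configurable decision behind the LlmProvider interface.

```typescript
type ModelTier = "fast" | "capable";

// Hypothetical per-turn signals; the real routing criteria are not specified.
interface TurnSignals {
  userTokens: number;    // rough length of the latest caller utterance
  toolsInPlay: number;   // tools the model may need to call this turn
  priorFailures: number; // times the fast tier already failed validation on this turn
}

// Assumed to come from configuration, tuned against the evaluation harness.
interface RouterConfig {
  maxFastTokens: number;
  maxFastTools: number;
}

function routeTurn(signals: TurnSignals, cfg: RouterConfig): ModelTier {
  if (signals.priorFailures > 0) return "capable"; // escalate after a failed attempt
  if (signals.userTokens > cfg.maxFastTokens) return "capable";
  if (signals.toolsInPlay > cfg.maxFastTools) return "capable";
  return "fast";
}
```

Keeping the decision free of side effects makes it easy to replay recorded turns through alternative thresholds before changing the configuration.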
4.4 Text-to-speech and voice (TTS)
Streaming, low-latency speech synthesis with natural prosody, plus consent-based voice cloning for branded voices. Different voice tiers can be offered to match quality and cost needs per tenant.
4.5 Model evaluation
Every model change must pass an evaluation harness before it ships: golden conversations, word error rate (WER) for STT, task success, hallucination rate, and latency for the LLM, and naturalness scoring for TTS. No model swap ships without passing this bench, which corresponds to the test bench in the PRD.
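For the STT metric, WER is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below computes it that way. The ASCII-only tokenizer is a simplification chosen for brevity and would need replacing for multilingual transcripts.

```typescript
// Lowercase, drop punctuation, split on whitespace (ASCII-only for brevity).
function tokenize(s: string): string[] {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9' ]+/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a rolling-row Levenshtein distance over words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : Infinity;

  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = cur[j - 1] + 1;
      cur.push(Math.min(substitution, deletion, insertion));
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, `wordErrorRate("book a table for two", "book the table for two")` is one substitution over five reference words, or 0.2. Note that WER can exceed 1 when the hypothesis contains many insertions.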
Model and provider choices are directional and may change as the technology landscape evolves.
5. Dependencies
The platform depends on a set of standard building blocks, each behind an abstraction to limit lock-in.
5.1 Core runtime dependencies
| Need | Approach | Lock-in risk | Mitigation |
|---|---|---|---|
| Telephony / numbers | Carrier media-stream provider | Medium (numbers, code) | TelephonyProvider abstraction; multi-carrier capable |
| STT | Streaming transcription provider | Low | Abstraction; self-host path available |
| LLM | Multi-model router | Low–Med | Abstraction; multi-model routing |
| TTS / voice clone | Voice synthesis provider | Medium | Tiered voices; abstraction |
| Vector / RAG | Standard vector store | Low | Standard embeddings |
| Compute / GPU | Cloud GPU compute | Med | Containerized and portable |
| Databases | Operational + transactional + vector + cache | Low | Standard technologies |
| Auth | Token-based auth | Low | Enterprise SSO option |
| Hosting | Containerized cloud platform | Med | Containerized for portability |
| Email / SMS | Messaging provider | Low | Abstraction |
| Observability | OpenTelemetry-based | Low | Open standard |
| Product analytics | Product analytics tooling | Low | — |
| Error tracking | Error-tracking tooling | Low | — |
5.2 Integration dependencies (per vertical)
| Vertical | System type | Example targets |
|---|---|---|
| Reception | Calendar / scheduling | Google Calendar, Microsoft 365, Calendly, Cal.com |
| Reception | CRM | HubSpot, Salesforce, Zoho, Pipedrive |
| Recruit | ATS | Greenhouse, Lever, Workday, Ashby, Zoho Recruit |
| Stay | PMS / channel manager | Cloudbeds, Mews, Oracle OPERA (OHIP), SiteMinder, RoomRaccoon |
| All | Messaging | Slack, WhatsApp Business, SMS, email |
| All | Payments | See §7 |
5.3 Build vs. buy approach
| Component | Approach | Rationale |
|---|---|---|
| Carrier network | Buy | Telephony is not something to build in-house |
| Speech / language / voice models | Standard providers behind an abstraction | Speed and quality |
| Orchestration / runtime | Build | This is the core product |
| Billing engine | Buy | Money infrastructure is best left to specialists |
| Dashboard / agent designer | Build | Core UX and differentiation |
| Video avatar | Buy first | Validate before considering self-hosting |
| SSO / SCIM | Buy | Enterprise table-stakes, not differentiating |
| Compliance tooling | Buy | Accelerate certifications |
6. Integrations architecture
- OAuth-based connectors per system; tokens stored encrypted and tenant-scoped.
- Tool adapters expose a uniform interface to the dialogue manager (createBooking, checkAvailability, pushCandidateResult, takePayment).
- Outbound webhooks (call ended, booking made) drive customer automation; a public API enables programmatic control.
- iPaaS escape hatch: Zapier/Make-style connectors cover the long tail so we don't have to build every integration directly.
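A uniform tool-adapter interface, as described in the list above, might be shaped like this. The `ToolResult` type, the `BookingAdapter` split, the `dispatch` helper, and the in-memory calendar are hypothetical. Only the tool names createBooking and checkAvailability come from the document, and a real adapter would wrap an OAuth-authenticated client for the target system.

```typescript
// One result shape for every adapter, so the dialogue manager handles them uniformly.
type ToolResult =
  | { ok: true; data: Record<string, unknown> }
  | { ok: false; error: string };

interface BookingAdapter {
  checkAvailability(slot: string): Promise<ToolResult>;
  createBooking(slot: string, customer: string): Promise<ToolResult>;
}

// In-memory stand-in for a calendar connector.
class InMemoryCalendar implements BookingAdapter {
  private readonly taken = new Set<string>();

  async checkAvailability(slot: string): Promise<ToolResult> {
    return { ok: true, data: { slot, available: !this.taken.has(slot) } };
  }

  async createBooking(slot: string, customer: string): Promise<ToolResult> {
    if (this.taken.has(slot)) return { ok: false, error: "slot_taken" };
    this.taken.add(slot);
    return { ok: true, data: { slot, customer } };
  }
}

// Validate LLM-supplied arguments before they reach the external system.
async function dispatch(
  adapter: BookingAdapter,
  name: string,
  args: Record<string, unknown>,
): Promise<ToolResult> {
  const slot = args["slot"];
  if (typeof slot !== "string") return { ok: false, error: "invalid_args" };
  switch (name) {
    case "checkAvailability":
      return adapter.checkAvailability(slot);
    case "createBooking": {
      const customer = args["customer"];
      if (typeof customer !== "string") return { ok: false, error: "invalid_args" };
      return adapter.createBooking(slot, customer);
    }
    default:
      return { ok: false, error: `unknown_tool:${name}` };
  }
}
```

Returning errors as values rather than throwing lets the dialogue manager feed a failure such as `slot_taken` back into the conversation and offer an alternative.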
7. Payments architecture (two distinct flows)
7.1 Flow A — Billing the business (subscriptions + usage)
- Subscription billing combines plans (seats/tiers) with metered usage (minutes, calls, actions). A usage ledger feeds the billing provider.
- Local payment methods and tax handling are supported per region, including a Merchant-of-Record option to manage global sales-tax/VAT/GST compliance.
- Dunning, invoices, and proration are handled by the billing provider.
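To show how a usage ledger could feed a plan-plus-metered invoice, here is a small worked example. All prices, the `Plan` fields, and the round-up-to-whole-minutes rule are invented for illustration. In the architecture above, the billing provider performs the actual invoicing.

```typescript
// Hypothetical plan shape; amounts are integer cents to avoid floating-point drift.
interface Plan {
  baseCents: number;
  includedMinutes: number;
  overageCentsPerMinute: number;
  centsPerAction: number;
}

interface UsageEvent {
  tenantId: string;
  kind: "minutes" | "action";
  quantity: number;
}

// Sum one tenant's ledger entries and price them against the plan.
function invoiceCents(plan: Plan, ledger: UsageEvent[], tenantId: string): number {
  let minutes = 0;
  let actions = 0;
  for (const e of ledger) {
    if (e.tenantId !== tenantId) continue;
    if (e.kind === "minutes") minutes += e.quantity;
    else actions += e.quantity;
  }
  // Assumption: total minutes are rounded up once per period, not per call.
  const overageMinutes = Math.max(0, Math.ceil(minutes) - plan.includedMinutes);
  return (
    plan.baseCents +
    overageMinutes * plan.overageCentsPerMinute +
    actions * plan.centsPerAction
  );
}
```

With a 4900-cent base, 100 included minutes, 12 cents per overage minute, and 25 cents per action, a tenant that used 120.5 minutes and 4 actions owes 4900 + 21 × 12 + 4 × 25 = 5252 cents.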
7.2 Flow B — The AI takes a payment from a caller
This is the PCI-sensitive path and is designed carefully (see 03 — Compliance & Security). Options, in preferred order:
- Pay-by-link (preferred): the AI sends an SMS/WhatsApp/email payment link and the caller pays on a hosted page. No card data ever touches the voice channel or our servers. Lowest PCI scope.
- Agent-assisted with certified DTMF capture: the caller types card details on the keypad into a PCI-certified capture service that masks digits from the AI and the recording. Higher scope; used only if needed.
- Tokenized profile on file: charge a stored, tokenized method (for returning guests) — the AI never sees the card number.
- In every option, a spoken card number is never transcribed into logs or recordings. Detect-and-redact plus a recording pause during payment are enforced.
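The detect-and-redact step can be approximated for digit strings with a pattern match plus a Luhn checksum, which filters out most non-card numbers such as order IDs. This is a minimal sketch: the function names and the `[REDACTED_CARD]` placeholder are assumptions, and it does not catch numbers transcribed as words ("four one one one"), which a production redactor would also need to handle.

```typescript
// Luhn checksum: double every second digit from the right, sum, check mod 10.
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// Redact runs of 13 to 19 digits (optionally separated by single spaces or
// hyphens) that pass the Luhn check; leave other digit runs untouched.
function redactCardNumbers(text: string): string {
  return text.replace(/\d(?:[ -]?\d){12,18}/g, (match) => {
    const digits = match.replace(/[ -]/g, "");
    return luhnValid(digits) ? "[REDACTED_CARD]" : match;
  });
}
```

Running this on transcript text before it is persisted means a card number that slips through the recording pause still never reaches logs in clear form.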
8. Video / avatar architecture (roadmap)
- Pre-rendered: script → avatar vendor → video file, for voicemail and marketing use cases.
- Real-time avatar: drive a streaming talking-head from the same LLM + TTS output.
- Buy first: integrate interactive-avatar vendors that consume our audio/text stream while we continue to own the conversational brain.
- Build later: self-host open lip-sync/talking-head models only if volume justifies it and latency is acceptable.
- Gate to web/kiosk first, where latency tolerance is higher than on the phone. Architecture: the media worker emits (audio + visemes/text) → avatar renderer → WebRTC video to the browser or kiosk.
9. Security architecture (summary; full treatment in 03 — Compliance & Security)
- TLS everywhere; encryption at rest (databases, object store, backups).
- A secrets vault keeps all keys out of code and source control.
- Tenant isolation at the data layer; per-request tenant scoping; RBAC; audit logs.
- PII/PCI redaction in transcripts and recordings; configurable retention with hard delete.
- Per-region data residency routing (EU / India / US buckets and databases).
- Least-privilege service accounts; signed webhooks; rate limiting; abuse detection.
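One common way to get per-request tenant scoping at the data layer is to make the scoped handle the only read path, so an unscoped query cannot be written by accident. The sketch below shows that pattern with an in-memory store. `CallRecord`, `CallStore`, and `TenantScopedCalls` are hypothetical names, and a real implementation would enforce the same rule in database queries or row-level policies.

```typescript
interface CallRecord {
  id: string;
  tenantId: string;
  summary: string;
}

// Read handle bound to a single tenant for the lifetime of a request.
class TenantScopedCalls {
  private readonly tenantId: string;
  private readonly rows: readonly CallRecord[];

  constructor(tenantId: string, rows: readonly CallRecord[]) {
    this.tenantId = tenantId;
    this.rows = rows;
  }

  list(): CallRecord[] {
    return this.rows.filter((r) => r.tenantId === this.tenantId);
  }

  // Looking up another tenant's ID behaves exactly like a missing record.
  get(id: string): CallRecord | undefined {
    return this.list().find((r) => r.id === id);
  }
}

class CallStore {
  private readonly rows: CallRecord[] = [];

  insert(row: CallRecord): void {
    this.rows.push(row);
  }

  // The store exposes no unscoped read method by design.
  forTenant(tenantId: string): TenantScopedCalls {
    return new TenantScopedCalls(tenantId, this.rows);
  }
}
```

A request handler would derive `tenantId` from the authenticated token and pass only the scoped handle to downstream code.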
10. Tech stack summary
- Backend / API: Node + TypeScript.
- Media workers: Node/TypeScript, with a performance path available for latency-critical audio if needed.
- Inference serving: Python-based serving on GPU for self-hosted components.
- Frontend dashboard: React (web).
- Mobile: React Native / Expo, as an operator companion app.
- Data: operational + transactional + vector databases, a cache/state store, an object store, and an analytics warehouse.
- Infra: containerized cloud with infrastructure-as-code, CI/CD, and OpenTelemetry-based observability.
- Payments: subscription billing plus regional payment methods.
Continue to → 03 — Compliance & Security