readium/speech-server

Readium Speech Server

A remote text-to-speech HTTP service for the Readium ecosystem. Exposes a uniform API for listing voices and synthesizing speech, backed by open neural TTS models running on CPU — no GPU required.

Designed to pair with Readium Speech and any Readium-compatible reading application.


Overview

API: GET /v1/voices · POST /v1/synthesize
Providers: PocketTTS (v1) · Kokoro, ElevenLabs, Azure (planned)
Languages: English · French · Italian · German · Spanish · Portuguese
Formats: MP3 · WAV · Opus
Word boundaries: Schema ready; supported when the provider supplies timing data
Deployment: Docker · CPU-only · single named volume for model weights

Quick start

You need Docker — nothing else. No Python, no PyTorch, no model files on your machine.

make configure        # interactive setup → writes .env
make build            # build the image (~2 min)
make dev-docker       # start server — downloads models on first run

Server: http://localhost:8000
Interactive API docs: http://localhost:8000/docs

Quick test: synthesize and play (open is the macOS launcher; on Linux, substitute xdg-open):

curl -s -X POST http://localhost:8000/v1/synthesize \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello world","voice":"urn:readium:tts:pocket:en-alba"}' \
  -o /tmp/speech.mp3 && open /tmp/speech.mp3

First start downloads the selected language models (~240 MB each) into a persistent Docker volume. Every restart after that skips the download because the models are already cached.

Production

make start   # detached, restarts automatically on crash or reboot
make stop    # stop all containers
make logs    # tail container logs

Setup wizard

make configure opens an interactive wizard. It handles both first-time setup and ongoing management:

Readium Speech Server

  Current: languages=en  workers=1

  1) Show full config
  2) Add a language
  3) Remove a language
  4) Change workers
  5) Update HF token
  6) First-time setup (re-run / overwrite)
  7) Reset
  q) Quit

Adding a language updates .env in place — only the new model is downloaded on next restart. Removing a language preserves the model files in the Docker volume; disk is reclaimed only if you choose to purge the volume.


Voices

156 voices across 6 languages (26 voice identities × 6 languages). Every voice is available in every language: alba speaking English, alba speaking French, alba speaking German, and so on. Only languages listed in LANGUAGES are loaded at startup (~240 MB RAM per language per worker).

The 26 voice identities, sourced from kyutai/tts-voices:

| Voice | Gender | Origin |
|---|---|---|
| alba | female | Alba MacKenna (CC BY 4.0) |
| anna, vera, fantine, charles, paul, eponine, azelma, george, mary, jane, michael, eve | mixed | VCTK dataset (CC BY 4.0) |
| bill_boerst, peter_yearsley, stuart_bell, caro_davy | mixed | Voice Zero / LibriVox (CC0) |
| marius, javert | male | Voice donations (CC0) |
| cosette | female | Expresso dataset (CC BY-NC 4.0) |
| jean | male | EARS dataset (CC BY-NC 4.0) |
| estelle | female | Unmute production voices |
| giovanni, lola, juergen, rafael | mixed | Kyutai (language reference voices) |

Voice URIs are language-scoped — the same speaker in different languages gets a distinct URI:

urn:readium:tts:pocket:en-alba    # Alba speaking English
urn:readium:tts:pocket:fr-alba    # Alba speaking French
urn:readium:tts:pocket:de-alba    # Alba speaking German

Supported language codes: en, fr, it, de, es, pt — derived directly from PocketTTS model names.
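
The URI layout above can be handled with a small helper. This is a minimal sketch inferred only from the three example URIs: the names VoiceRef, build_voice_uri, and parse_voice_uri are hypothetical, and it assumes language codes never contain a hyphen (true for en, fr, it, de, es, pt).

```python
from typing import NamedTuple

# Prefix shared by the example URIs in this README.
VOICE_URN_PREFIX = "urn:readium:tts:"


class VoiceRef(NamedTuple):
    """Hypothetical container for the parts of a voice URI."""
    provider: str
    language: str
    speaker: str


def build_voice_uri(provider: str, language: str, speaker: str) -> str:
    """Compose urn:readium:tts:<provider>:<language>-<speaker>."""
    return f"{VOICE_URN_PREFIX}{provider}:{language}-{speaker}"


def parse_voice_uri(uri: str) -> VoiceRef:
    """Split a voice URI into provider, language, and speaker."""
    if not uri.startswith(VOICE_URN_PREFIX):
        raise ValueError(f"not a Readium TTS voice URI: {uri!r}")
    provider, colon, voice = uri[len(VOICE_URN_PREFIX):].partition(":")
    # Split on the first hyphen only, so speakers like bill_boerst stay whole.
    language, dash, speaker = voice.partition("-")
    if not (colon and dash and provider and language and speaker):
        raise ValueError(f"malformed voice URI: {uri!r}")
    return VoiceRef(provider, language, speaker)
```

Switching a speaker to another language then amounts to parsing the URI and rebuilding it with a different language code.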

How it works: PocketTTS pre-computes voice embeddings for every voice × language combination (stored as .safetensors files in kyutai/pocket-tts-without-voice-cloning). The voice sample is encoded once at model-load time — no per-request cloning overhead.


API reference

Health

| Method | Path | Description |
|---|---|---|
| GET | /healthz | Liveness: 200 when the process is running |
| GET | /readyz | Readiness: 503 until models are loaded and ffmpeg is available |
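
Because first start downloads models, a client or deploy script may want to wait for readiness before sending traffic. The following is a sketch of one way to poll /readyz. The function names, retry counts, and delays are illustrative, and it treats 200 as "ready", which the table implies but does not state.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def readyz_status(base_url: str = "http://localhost:8000") -> int:
    """Return the HTTP status of GET /readyz, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx; the status code is still what we want.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_ready(
    probe: Callable[[], int] = readyz_status,
    attempts: int = 60,
    delay: float = 2.0,
) -> bool:
    """Poll until the probe returns 200 or the attempts run out."""
    for _ in range(attempts):
        if probe() == 200:
            return True
        time.sleep(delay)
    return False
```

The probe is injectable so the loop can be exercised without a running server.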

Voices

GET /v1/voices
GET /v1/voices?language=fr
GET /v1/voices?provider=pocket
GET /v1/voices?offset=0&limit=20

Returns an array of voice objects. Null-valued optional fields are omitted. Each voice includes a boundary field indicating whether that provider supports word-level timing marks.

Filter and pagination query params:

| Param | Type | Description |
|---|---|---|
| language | string | Filter by BCP-47 language prefix (e.g. en, fr) |
| provider | string | Filter by provider id (e.g. pocket) |
| offset | int ≥ 0 | Voices to skip (default: 0) |
| limit | int ≥ 1 | Max voices to return (default: all) |

Response headers:

| Header | Description |
|---|---|
| X-Total-Count | Total matching voices before pagination |
| X-Offset | Applied offset |
| X-Limit | Applied limit (omitted when no limit set) |
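
A client can walk the whole catalog by advancing offset until it reaches X-Total-Count. The sketch below is one way to do that with the standard library. The helper names (voices_url, http_fetch_page, iter_voices) and the page size are hypothetical, and it assumes the response body is a plain JSON array as described above.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional, Tuple
from urllib.parse import urlencode

Page = Tuple[List[Dict], int]  # (voices on this page, X-Total-Count)


def voices_url(base_url: str, *, language: Optional[str] = None,
               provider: Optional[str] = None, offset: int = 0,
               limit: Optional[int] = None) -> str:
    """Build a /v1/voices URL, leaving out parameters that are not set."""
    params = {"language": language, "provider": provider,
              "offset": offset, "limit": limit}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/v1/voices?{query}"


def http_fetch_page(base_url: str) -> Callable[[int, int], Page]:
    """Return a fetcher that reads one page plus the total from the headers."""
    def fetch(offset: int, limit: int) -> Page:
        url = voices_url(base_url, offset=offset, limit=limit)
        with urllib.request.urlopen(url) as resp:
            total = int(resp.headers["X-Total-Count"])
            return json.load(resp), total
    return fetch


def iter_voices(fetch_page: Callable[[int, int], Page],
                page_size: int = 20) -> Iterator[Dict]:
    """Yield every voice, requesting page_size voices at a time."""
    offset = 0
    while True:
        voices, total = fetch_page(offset, page_size)
        yield from voices
        offset += len(voices)
        # Stop on an empty page too, in case the total changes mid-walk.
        if not voices or offset >= total:
            return
```

Against a local server this would be used as list(iter_voices(http_fetch_page("http://localhost:8000"))).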

Synthesize

POST /v1/synthesize
Content-Type: application/json

Minimal request — returns binary MP3:

{
  "text": "Hello, world!",
  "voice": "urn:readium:tts:pocket:en-alba"
}
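
The same minimal request can be sent from Python without third-party packages. This is a sketch, not an official client: the function names are hypothetical, the base URL is the quick-start default, and the X-API-Key header is only relevant when API_KEY_ENABLED=true.

```python
import json
import urllib.request
from typing import Optional


def build_synthesize_request(text: str, voice: str,
                             base_url: str = "http://localhost:8000",
                             api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST /v1/synthesize request with the minimal JSON body."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(f"{base_url}/v1/synthesize",
                                  data=body, headers=headers, method="POST")


def synthesize_to_file(text: str, voice: str, path: str) -> None:
    """Send the request and write the binary audio response to disk."""
    request = build_synthesize_request(text, voice)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

For example, synthesize_to_file("Hello, world!", "urn:readium:tts:pocket:en-alba", "/tmp/speech.mp3") mirrors the curl quick test.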

Full request:

{
  "id": "urn:uuid:019f178c-cc7c-7bb3-a39b-d185f43d3cc4",
  "text": "Ceci est un test.",
  "language": "fr",
  "voice": "urn:readium:tts:pocket:fr-estelle",
  "ssml": false,
  "prev_utterance": "La nuit était sombre.",
  "next_utterance": "La pièce était froide.",
  "publication_id": "urn:isbn:9780000000000",
  "boundary": false,
  "output": {
    "format": "mp3",
    "bitrate": 64,
    "speed": 1.0,
    "pitch": null
  }
}

Response (default, boundary: false):

Binary audio with Content-Type: audio/mpeg (or audio/wav, audio/ogg).

Response (boundary: true):

{
  "audio": "<base64-encoded audio>",
  "format": "mp3",
  "boundaries": [
    { "name": "word", "charIndex": 0,  "charLength": 5,  "elapsedTime": 0.0  },
    { "name": "word", "charIndex": 6,  "charLength": 3,  "elapsedTime": 0.38 }
  ]
}

boundaries is null when the provider does not support word timing. Check voice.boundary before requesting: if it is false, boundaries will always be null.

Word boundary fields mirror the Web Speech API boundary event: charIndex and charLength index into the original text; elapsedTime is seconds from audio start.
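
A boundary response can be unpacked into raw audio plus a word timeline as follows. This is a sketch under one assumption worth checking: it treats charIndex and charLength as Python string (code point) offsets, whereas the Web Speech API counts UTF-16 code units, so texts containing characters outside the Basic Multilingual Plane may need an adjustment. The function name is hypothetical.

```python
import base64
import json
from typing import List, Tuple


def decode_boundary_response(payload: str, text: str) -> Tuple[bytes, List[Tuple[float, str]]]:
    """Return (audio bytes, [(elapsed seconds, word), ...]) from a boundary response."""
    data = json.loads(payload)
    audio = base64.b64decode(data["audio"])
    timeline = []
    # boundaries is null when the provider has no timing data.
    for mark in data["boundaries"] or []:
        start = mark["charIndex"]
        word = text[start:start + mark["charLength"]]
        timeline.append((mark["elapsedTime"], word))
    return audio, timeline
```

The timeline pairs each spoken word with the moment it starts, which is what a reading application needs to highlight text in sync with playback.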

Output formats:

| format | Content-Type | Notes |
|---|---|---|
| mp3 | audio/mpeg | Default |
| wav | audio/wav | No transcoding; fastest |
| opus | audio/ogg | Smallest file size |
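
The wav row is the fastest because raw PCM only needs a header, not an encoder pass. The sketch below illustrates the idea with Python's standard wave module. It is not the server's code, and the 24 kHz, 16-bit, mono parameters are assumptions chosen for illustration.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container (no re-encoding)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)         # header sizes are patched on close
    return buffer.getvalue()
```

The output is the input samples preceded by a 44-byte header, which is why this path costs almost nothing compared with MP3 or Opus encoding through ffmpeg.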

Errors:

All errors return a consistent shape:

{ "error": { "code": "voice_not_found", "message": "...", "detail": null } }

| Status | Code | Cause |
|---|---|---|
| 400 | validation_failed | Empty or whitespace text |
| 404 | voice_not_found | Voice URI not registered |
| 413 | payload_too_large | Text exceeds MAX_TEXT_LENGTH (default 2000 chars) |
| 415 | unsupported_format | format value not in mp3, wav, opus |
| 422 | | Request schema invalid (Pydantic detail) |
| 503 | | Models not yet loaded |
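
On the client side, the error envelope can be turned into an exception. This is a sketch with hypothetical names (SpeechServerError, parse_error). Since the 422 and 503 rows list no code and may not use the same envelope, it falls back to the raw body when the expected shape is missing.

```python
import json
from typing import Any, Optional


class SpeechServerError(Exception):
    """Hypothetical client-side exception carrying the server's error fields."""

    def __init__(self, status: int, code: Optional[str], message: str,
                 detail: Any = None) -> None:
        super().__init__(f"{status} {code or 'error'}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.detail = detail

    @property
    def retryable(self) -> bool:
        # 503 means models are still loading, so trying again later is reasonable.
        return self.status == 503


def parse_error(status: int, body: bytes) -> SpeechServerError:
    """Build an exception from an error response, tolerating other body shapes."""
    try:
        err = json.loads(body)["error"]
        return SpeechServerError(status, err.get("code"),
                                 err.get("message", ""), err.get("detail"))
    except (ValueError, KeyError, TypeError, AttributeError):
        return SpeechServerError(status, None, body.decode("utf-8", "replace"))
```

With urllib, the status and body come from the HTTPError raised by urlopen (exc.code and exc.read()).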

Configuration

Run make configure to generate .env, or run bash scripts/configure.sh directly.

| Variable | Default | Description |
|---|---|---|
| LANGUAGES | en | Comma-separated BCP-47 language codes to load. Supported: en fr it de es pt |
| HF_TOKEN | (empty) | HuggingFace token. Optional; prevents rate-limiting on first-run model downloads |
| WORKERS | 1 | Uvicorn worker processes. Each loads a full copy of every active language model |
| MAX_CONCURRENT_SYNTHESES | 2 | Max parallel CPU inference jobs per worker |
| API_KEY_ENABLED | false | Require X-API-Key header on all routes |
| API_KEY | (empty) | Key value when API_KEY_ENABLED=true |
| LOG_LEVEL | INFO | DEBUG · INFO · WARNING · ERROR |
| PORT | 8000 | Listen port |
| MAX_TEXT_LENGTH | 2000 | Maximum characters per synthesis request |
| FFMPEG_BIN | ffmpeg | Path to ffmpeg binary (bundled in the Docker image) |
| POCKET_DEFAULT_VOICE | alba | Default voice when none is specified |
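
To make the defaults concrete, here is a sketch of how a subset of these variables could be read and validated from the environment. It is not the server's actual settings code: the Settings class and load_settings function are hypothetical, and the validation rules (rejecting unknown languages, requiring a key when API keys are enabled) are assumptions about sensible behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

SUPPORTED_LANGUAGES = ("en", "fr", "it", "de", "es", "pt")


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; defaults mirror the table above."""
    languages: Tuple[str, ...] = ("en",)
    workers: int = 1
    max_concurrent_syntheses: int = 2
    api_key_enabled: bool = False
    api_key: str = ""
    port: int = 8000
    max_text_length: int = 2000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment-like mapping, applying the defaults."""
    raw = env.get("LANGUAGES", "en")
    languages = tuple(code.strip() for code in raw.split(",") if code.strip())
    unknown = [code for code in languages if code not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    api_key_enabled = env.get("API_KEY_ENABLED", "false").strip().lower() == "true"
    api_key = env.get("API_KEY", "")
    if api_key_enabled and not api_key:
        raise ValueError("API_KEY must be set when API_KEY_ENABLED=true")
    return Settings(
        languages=languages,
        workers=int(env.get("WORKERS", "1")),
        max_concurrent_syntheses=int(env.get("MAX_CONCURRENT_SYNTHESES", "2")),
        api_key_enabled=api_key_enabled,
        api_key=api_key,
        port=int(env.get("PORT", "8000")),
        max_text_length=int(env.get("MAX_TEXT_LENGTH", "2000")),
    )
```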

RAM estimate: WORKERS × active languages × ~240 MB

Example: 2 workers, English + French = 2 × 2 × 240 MB ≈ 960 MB
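
The estimate is easy to script when sizing a host. The sketch below encodes the formula and inverts it to find how many workers fit in a memory budget. The function names are hypothetical, and the figure covers model weights only, using the ~240 MB approximation from this README.

```python
MODEL_MB = 240  # approximate RAM per language model per worker, per this README


def estimate_ram_mb(workers: int, languages: int, per_model_mb: int = MODEL_MB) -> int:
    """WORKERS x active languages x ~240 MB."""
    return workers * languages * per_model_mb


def max_workers(budget_mb: int, languages: int, per_model_mb: int = MODEL_MB) -> int:
    """Largest WORKERS value whose model copies fit in budget_mb."""
    return budget_mb // (languages * per_model_mb)
```

For instance, estimate_ram_mb(2, 2) reproduces the 960 MB example, and with three languages each worker needs about 720 MB, so a 4096 MB budget leaves room for five workers before accounting for anything else the process uses.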


Development

Commands

| Command | Description |
|---|---|
| make configure | Run setup wizard |
| make build | Build Docker image |
| make dev-docker | Start dev server with hot-reload |
| make dev-docker-build | Build then start |
| make start | Start production stack (detached) |
| make stop | Stop containers |
| make logs | Tail app logs |
| make test-docker | Fast test suite; no models needed |
| make test-integration-docker | Integration tests; requires models in volume |
| make ci-docker | Lint + format check + typecheck + tests |
| make lint-docker | ruff check |
| make fmt-docker | ruff format |
| make typecheck-docker | mypy |
| make test | Fast tests via uv (no Docker) |
| make ci | Full local CI |
| make clean | Remove __pycache__ and .pyc files |

Running tests locally

The fast suite requires no models and no ffmpeg — everything is mocked:

uv sync
uv run pytest tests/ -m 'not integration and not slow' -v

Adding a provider

  1. Create app/providers/<name>.py implementing TTSProvider
  2. Declare id, supported_languages, and supports_boundaries as class variables
  3. Implement _all_voices() and synthesize()
  4. Register in app/main.py _build_registry()

No changes to routes, synthesizer, or voice catalog are needed. Language filtering and boundary capability are inherited automatically from the base class.
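
The four steps can be pictured with a toy provider. Everything below is a sketch: the TTSProvider class shown here is a hypothetical stand-in written for this example (the project's real base class, method signatures, and voice schema may differ), and SilenceProvider simply returns silent PCM so the example runs without any model.

```python
from abc import ABC, abstractmethod
from typing import ClassVar, Dict, List, Optional, Tuple


class TTSProvider(ABC):
    """Stand-in base class (assumption); see app/providers for the real one."""
    id: ClassVar[str]
    supported_languages: ClassVar[Tuple[str, ...]]
    supports_boundaries: ClassVar[bool] = False

    @abstractmethod
    def _all_voices(self) -> List[Dict[str, object]]:
        """Return every voice this provider offers."""

    @abstractmethod
    def synthesize(self, text: str, voice_uri: str) -> bytes:
        """Return raw PCM audio for the text."""

    def voices(self, language: Optional[str] = None) -> List[Dict[str, object]]:
        # Language filtering and the boundary flag live in the base class,
        # so concrete providers do not reimplement them.
        return [dict(voice, boundary=self.supports_boundaries)
                for voice in self._all_voices()
                if language is None or voice["language"] == language]


class SilenceProvider(TTSProvider):
    """Toy provider: 50 ms of 16-bit mono silence per input character."""
    id = "silence"
    supported_languages = ("en", "fr")
    supports_boundaries = False
    SAMPLE_RATE = 24000  # assumed sample rate for this example

    def _all_voices(self) -> List[Dict[str, object]]:
        return [{"uri": f"urn:readium:tts:{self.id}:{lang}-mute", "language": lang}
                for lang in self.supported_languages]

    def synthesize(self, text: str, voice_uri: str) -> bytes:
        samples = len(text) * self.SAMPLE_RATE // 20
        return b"\x00\x00" * samples
```

The remaining step, registration in _build_registry(), depends on code not shown in this README and is left out.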


Architecture

Client
  └─ POST /v1/synthesize
       └─ Synthesizer
            ├─ validate text length + content
            ├─ resolve voiceURI → (provider, voiceURI)  (VoiceCatalog)
            ├─ provider.synthesize()  ← runs in thread pool, bounded by semaphore
            │    └─ TTSModel.generate_audio()  — CPU inference
            └─ encode PCM → mp3/opus  (ffmpeg driver)  or  wrap → wav

  • Routes are async; all CPU-bound inference runs off the event loop via anyio.to_thread.run_sync
  • A semaphore (MAX_CONCURRENT_SYNTHESES) prevents model thrashing under concurrent load
  • Model throughput scales by adding worker processes (WORKERS), not threads
  • Model weights live in a named Docker volume: downloaded once, then reused on every subsequent start
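
The thread-pool-plus-semaphore pattern can be demonstrated in isolation. The sketch below swaps in asyncio.to_thread (standard library, Python 3.9+) for anyio.to_thread.run_sync so that it runs without extra dependencies, and uses a sleep as a stand-in for model inference. The function names are hypothetical, and the peak counter exists only to show that the limit holds.

```python
import asyncio
import threading
import time
from typing import List, Tuple


def fake_inference(text: str) -> bytes:
    """Stand-in for blocking CPU inference."""
    time.sleep(0.05)
    return text.encode("utf-8")


async def synthesize_all(texts: List[str], limit: int = 2) -> Tuple[List[bytes], int]:
    """Run blocking jobs off the event loop, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)  # created inside the running loop
    lock = threading.Lock()
    stats = {"active": 0, "peak": 0}

    def tracked(text: str) -> bytes:
        # Runs in a worker thread; the lock protects the shared counters.
        with lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        try:
            return fake_inference(text)
        finally:
            with lock:
                stats["active"] -= 1

    async def one(text: str) -> bytes:
        async with semaphore:  # waits here when `limit` jobs are already running
            return await asyncio.to_thread(tracked, text)

    results = await asyncio.gather(*(one(text) for text in texts))
    return list(results), stats["peak"]
```

Running asyncio.run(synthesize_all(["a", "b", "c", "d", "e"])) returns the results in input order while never exceeding two concurrent jobs. Requests beyond the limit wait on the semaphore without blocking the event loop.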

Provider roadmap

| Provider | Status | Notes |
|---|---|---|
| PocketTTS | Current | CPU · 6 languages · 156 voices (26 identities × 6 languages) |
| Kokoro | 📆 Coming soon | Referenced, not vendored (IP cleanliness) |
| ElevenLabs | 📆 Coming soon | Proxied · word boundaries supported |
| Azure Speech | 📆 Coming soon | Proxied · word boundaries supported |

Related projects
