A remote text-to-speech HTTP service for the Readium ecosystem. Exposes a uniform API for listing voices and synthesizing speech, backed by open neural TTS models running on CPU — no GPU required.
Designed to pair with Readium Speech and any Readium-compatible reading application.
| API | GET /v1/voices · POST /v1/synthesize |
| Providers | PocketTTS (v1) · Kokoro, ElevenLabs, Azure (planned) |
| Languages | English · French · Italian · German · Spanish · Portuguese |
| Formats | MP3 · WAV · Opus |
| Word boundaries | Schema ready; supported when provider supplies timing data |
| Deployment | Docker · CPU-only · single named volume for model weights |
You need Docker — nothing else. No Python, no PyTorch, no model files on your machine.
make configure # interactive setup → writes .env
make build # build the image (~2 min)
make dev-docker # start server; downloads models on first run
Server: http://localhost:8000
Interactive API docs: http://localhost:8000/docs
Quick test — synthesize and play:
curl -s -X POST http://localhost:8000/v1/synthesize \
-H 'Content-Type: application/json' \
-d '{"text":"Hello world","voice":"urn:readium:tts:pocket:en-alba"}' \
-o /tmp/speech.mp3 && open /tmp/speech.mp3
First start downloads the selected language models (~240 MB each) into a persistent Docker volume. Every restart after that skips the download because the models are already cached.
make start # detached, restarts automatically on crash or reboot
make stop # stop all containers
make logs # tail container logs
make configure opens an interactive wizard. It handles both first-time setup and ongoing management:
Readium Speech Server
Current: languages=en workers=1
1) Show full config
2) Add a language
3) Remove a language
4) Change workers
5) Update HF token
6) First-time setup (re-run / overwrite)
7) Reset
q) Quit
Adding a language updates .env in place — only the new model is downloaded on next restart. Removing a language preserves the model files in the Docker volume; disk is reclaimed only if you choose to purge the volume.
156 voices across 6 languages (26 voice identities × 6 languages). Every voice is available in every language: alba speaking English, alba speaking French, alba speaking German, and so on. Only languages listed in LANGUAGES are loaded at startup (~240 MB RAM per language per worker).
The 26 voice identities, sourced from kyutai/tts-voices:
| Voice | Gender | Origin |
|---|---|---|
| alba | female | Alba MacKenna (CC BY 4.0) |
| anna, vera, fantine, charles, paul, eponine, azelma, george, mary, jane, michael, eve | mixed | VCTK dataset (CC BY 4.0) |
| bill_boerst, peter_yearsley, stuart_bell, caro_davy | mixed | Voice Zero / LibriVox (CC0) |
| marius, javert | male | Voice donations (CC0) |
| cosette | female | Expresso dataset (CC BY-NC 4.0) |
| jean | male | EARS dataset (CC BY-NC 4.0) |
| estelle | female | Unmute production voices |
| giovanni, lola, juergen, rafael | mixed | Kyutai (language reference voices) |
Voice URIs are language-scoped — the same speaker in different languages gets a distinct URI:
urn:readium:tts:pocket:en-alba # Alba speaking English
urn:readium:tts:pocket:fr-alba # Alba speaking French
urn:readium:tts:pocket:de-alba # Alba speaking German
Supported language codes: en, fr, it, de, es, pt — derived directly from PocketTTS model names.
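A minimal client-side sketch of splitting a voice URI into its parts. The layout urn:readium:tts:<provider>:<language>-<voice> is inferred from the three examples above; the names VoiceRef and parse_voice_urn are hypothetical, and the server's own parsing may differ.

```python
from typing import NamedTuple

# Language codes listed in this README.
SUPPORTED_LANGUAGES = {"en", "fr", "it", "de", "es", "pt"}
URN_PREFIX = "urn:readium:tts:"


class VoiceRef(NamedTuple):
    provider: str
    language: str
    identity: str


def parse_voice_urn(urn: str) -> VoiceRef:
    """Split e.g. urn:readium:tts:pocket:en-alba into (pocket, en, alba)."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a Readium TTS voice URN: {urn!r}")
    provider, colon, voice = urn[len(URN_PREFIX):].partition(":")
    # Split on the first hyphen only; identities such as bill_boerst use underscores.
    language, dash, identity = voice.partition("-")
    if not (colon and dash and provider and identity):
        raise ValueError(f"malformed voice URN: {urn!r}")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return VoiceRef(provider, language, identity)
```

For example, parse_voice_urn("urn:readium:tts:pocket:fr-alba") yields provider pocket, language fr, identity alba.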
How it works: PocketTTS pre-computes voice embeddings for every voice × language combination (stored as .safetensors files in kyutai/pocket-tts-without-voice-cloning). The voice sample is encoded once at model-load time, so there is no per-request cloning overhead.
| Method | Path | Description |
|---|---|---|
| GET | /healthz | Liveness: 200 when the process is running |
| GET | /readyz | Readiness: 503 until models are loaded and ffmpeg is available |
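Because /readyz answers 503 while models are still loading, a client or deploy script may want to poll it before sending requests. The sketch below is one way to do that with the standard library; http_status and wait_until_ready are hypothetical helper names, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses, e.g. 503
        return exc.code
    except OSError:  # connection refused, DNS failure, timeout
        return 0


def wait_until_ready(
    probe: Callable[[], int],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe() == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Usage against a local server: wait_until_ready(lambda: http_status("http://localhost:8000/readyz")).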
GET /v1/voices
GET /v1/voices?language=fr
GET /v1/voices?provider=pocket
GET /v1/voices?offset=0&limit=20
Returns an array of voice objects. Null-valued optional fields are omitted. Each voice includes a boundary field indicating whether that provider supports word-level timing marks.
Pagination query params:
| Param | Type | Description |
|---|---|---|
| language | string | Filter by BCP-47 language prefix (e.g. en, fr) |
| provider | string | Filter by provider id (e.g. pocket) |
| offset | int ≥ 0 | Voices to skip (default: 0) |
| limit | int ≥ 1 | Max voices to return (default: all) |
Response headers:
| Header | Description |
|---|---|
| X-Total-Count | Total matching voices before pagination |
| X-Offset | Applied offset |
| X-Limit | Applied limit (omitted when no limit set) |
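A sketch of how a client might walk the paginated voice list using offset, limit, and X-Total-Count. The helpers voices_url and iter_voices are hypothetical; fetch_page stands in for whatever HTTP call the client makes and is assumed to return the decoded JSON array together with the X-Total-Count value.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode

# (voices on this page, value of the X-Total-Count header)
Page = tuple[list[dict], int]


def voices_url(
    base_url: str,
    *,
    language: Optional[str] = None,
    provider: Optional[str] = None,
    offset: int = 0,
    limit: Optional[int] = None,
) -> str:
    """Build a GET /v1/voices URL from the documented query params."""
    params: dict[str, object] = {"offset": offset}
    if language is not None:
        params["language"] = language
    if provider is not None:
        params["provider"] = provider
    if limit is not None:
        params["limit"] = limit
    return f"{base_url}/v1/voices?{urlencode(params)}"


def iter_voices(fetch_page: Callable[[int, int], Page], page_size: int = 20) -> Iterator[dict]:
    """Yield every voice, requesting page_size voices at a time."""
    offset = 0
    while True:
        voices, total = fetch_page(offset, page_size)
        yield from voices
        offset += len(voices)
        # Stop on an empty page as well, so a bad total cannot loop forever.
        if not voices or offset >= total:
            return
```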
POST /v1/synthesize
Content-Type: application/json
Minimal request — returns binary MP3:
{
"text": "Hello, world!",
"voice": "urn:readium:tts:pocket:en-alba"
}
Full request:
{
"id": "urn:uuid:019f178c-cc7c-7bb3-a39b-d185f43d3cc4",
"text": "Ceci est un test.",
"language": "fr",
"voice": "urn:readium:tts:pocket:fr-estelle",
"ssml": false,
"prev_utterance": "La nuit était sombre.",
"next_utterance": "La pièce était froide.",
"publication_id": "urn:isbn:9780000000000",
"boundary": false,
"output": {
"format": "mp3",
"bitrate": 64,
"speed": 1.0,
"pitch": null
}
}
Response (default, boundary: false):
Binary audio with Content-Type: audio/mpeg (or audio/wav, audio/ogg).
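The same call can be made from Python with only the standard library. This is a sketch, not an official client: build_synthesize_request and save_speech are hypothetical names, and sending an output object that contains only format assumes the remaining output fields are optional.

```python
import json
import urllib.request
from typing import Optional


def build_synthesize_request(
    base_url: str, text: str, voice: str, fmt: Optional[str] = None
) -> urllib.request.Request:
    """Prepare a POST /v1/synthesize request without sending it."""
    payload: dict[str, object] = {"text": text, "voice": voice}
    if fmt is not None:
        # Assumption: a partial "output" object is accepted by the server.
        payload["output"] = {"format": fmt}
    return urllib.request.Request(
        f"{base_url}/v1/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(req: urllib.request.Request, path: str) -> str:
    """Send the request, write the binary audio to path, return its Content-Type."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req) as resp:
        content_type = resp.headers.get("Content-Type", "")
        audio = resp.read()
    with open(path, "wb") as out:
        out.write(audio)
    return content_type
```

Usage: save_speech(build_synthesize_request("http://localhost:8000", "Hello world", "urn:readium:tts:pocket:en-alba"), "/tmp/speech.mp3").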
Response (boundary: true):
{
"audio": "<base64-encoded audio>",
"format": "mp3",
"boundaries": [
{ "name": "word", "charIndex": 0, "charLength": 5, "elapsedTime": 0.0 },
{ "name": "word", "charIndex": 6, "charLength": 3, "elapsedTime": 0.38 }
]
}
boundaries is null when the provider does not support word timing. Check voice.boundary before requesting: if it is false, boundaries will always be null.
Word boundary fields mirror the Web Speech API boundary event: charIndex and charLength index into the original text; elapsedTime is seconds from audio start.
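A sketch of consuming a boundary: true response on the client: decode the base64 audio and map each mark back to the word it covers. decode_boundary_response is a hypothetical helper, and the slicing assumes charIndex and charLength count Python string positions, which may not hold for text containing characters outside the Basic Multilingual Plane if the server counts UTF-16 units.

```python
import base64


def decode_boundary_response(body: dict, text: str) -> tuple[bytes, list[tuple[str, float]]]:
    """Return (audio bytes, [(word, seconds from audio start), ...])."""
    audio = base64.b64decode(body["audio"])
    # boundaries is null when the provider has no timing data.
    marks = body.get("boundaries") or []
    words = [
        (text[m["charIndex"]:m["charIndex"] + m["charLength"]], m["elapsedTime"])
        for m in marks
    ]
    return audio, words
```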
Output formats:
| format | Content-Type | Notes |
|---|---|---|
| mp3 | audio/mpeg | Default |
| wav | audio/wav | No transcoding (fastest) |
| opus | audio/ogg | Smallest file size |
Errors:
All errors return a consistent shape:
{ "error": { "code": "voice_not_found", "message": "...", "detail": null } }
| Status | Code | Cause |
|---|---|---|
| 400 | validation_failed | Empty or whitespace text |
| 404 | voice_not_found | Voice URI not registered |
| 413 | payload_too_large | Text exceeds MAX_TEXT_LENGTH (default 2000 chars) |
| 415 | unsupported_format | format value not in mp3, wav, opus |
| 422 | (none) | Request schema invalid (Pydantic detail) |
| 503 | (none) | Models not yet loaded |
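A sketch of turning an error response into a Python exception on the client side. ApiError and parse_error are hypothetical names. Since the 422 and 503 rows list no code, the sketch assumes their bodies may not follow the error envelope and falls back to the raw body text.

```python
import json
from typing import Optional


class ApiError(Exception):
    """Client-side view of a failed request."""

    def __init__(self, status: int, code: Optional[str], message: str) -> None:
        super().__init__(f"{status} {code or 'error'}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_error(status: int, body: bytes) -> ApiError:
    """Build an ApiError from an HTTP status and response body."""
    try:
        err = json.loads(body)["error"]
        return ApiError(status, err.get("code"), err.get("message", ""))
    except (ValueError, KeyError, TypeError, AttributeError):
        # Body is not the documented envelope (for example a 422 or 503).
        return ApiError(status, None, body.decode("utf-8", "replace"))
```

A caller might retry on status 503 (models still loading) and surface everything else to the user.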
Run make configure to generate .env, or run bash scripts/configure.sh directly.
| Variable | Default | Description |
|---|---|---|
| LANGUAGES | en | Comma-separated BCP-47 language codes to load. Supported: en fr it de es pt |
| HF_TOKEN | (empty) | HuggingFace token. Optional; prevents rate-limiting on first-run model downloads |
| WORKERS | 1 | Uvicorn worker processes. Each loads a full copy of every active language model |
| MAX_CONCURRENT_SYNTHESES | 2 | Max parallel CPU inference jobs per worker |
| API_KEY_ENABLED | false | Require X-API-Key header on all routes |
| API_KEY | (empty) | Key value when API_KEY_ENABLED=true |
| LOG_LEVEL | INFO | DEBUG · INFO · WARNING · ERROR |
| PORT | 8000 | Listen port |
| MAX_TEXT_LENGTH | 2000 | Maximum characters per synthesis request |
| FFMPEG_BIN | ffmpeg | Path to ffmpeg binary (bundled in the Docker image) |
| POCKET_DEFAULT_VOICE | alba | Default voice when none is specified |
RAM estimate: WORKERS × active languages × ~240 MB
Example: 2 workers, English + French = 2 × 2 × 240 MB ≈ 960 MB
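The estimate is simple enough to script when sizing a host. A minimal sketch, where estimate_ram_mb is a hypothetical helper and 240 MB is the approximate per-model figure quoted above:

```python
MB_PER_MODEL = 240  # approximate RAM per language per worker


def estimate_ram_mb(workers: int, languages: str) -> int:
    """Estimate model RAM from WORKERS and a LANGUAGES value such as 'en,fr'."""
    active = [code.strip() for code in languages.split(",") if code.strip()]
    return workers * len(active) * MB_PER_MODEL


print(estimate_ram_mb(2, "en,fr"))  # 960
```

The figure covers model weights only; the Python process, ffmpeg, and in-flight audio buffers come on top.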
| Command | Description |
|---|---|
| make configure | Run setup wizard |
| make build | Build Docker image |
| make dev-docker | Start dev server with hot-reload |
| make dev-docker-build | Build then start |
| make start | Start production stack (detached) |
| make stop | Stop containers |
| make logs | Tail app logs |
| make test-docker | Fast test suite (no models needed) |
| make test-integration-docker | Integration tests (requires models in volume) |
| make ci-docker | Lint + format check + typecheck + tests |
| make lint-docker | ruff check |
| make fmt-docker | ruff format |
| make typecheck-docker | mypy |
| make test | Fast tests via uv (no Docker) |
| make ci | Full local CI |
| make clean | Remove __pycache__ and .pyc files |
The fast suite requires no models and no ffmpeg — everything is mocked:
uv sync
uv run pytest tests/ -m 'not integration and not slow' -v
To add a new provider:
- Create app/providers/<name>.py implementing TTSProvider
- Declare id, supported_languages, and supports_boundaries as class variables
- Implement _all_voices() and synthesize()
- Register it in _build_registry() in app/main.py
No changes to routes, synthesizer, or voice catalog. Language filtering and boundary capability are inherited automatically from the base class.
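The steps above might look roughly like the sketch below. It is illustrative only: the TTSProvider class here is a stand-in written for this example, because the real base class and its method signatures are not shown in this README. SilenceProvider, the voices() helper, the voice dictionary keys, and the 24 kHz sample rate are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import ClassVar, Optional


class TTSProvider(ABC):
    """Stand-in for the project's base class (real signatures may differ)."""

    id: ClassVar[str]
    supported_languages: ClassVar[tuple[str, ...]]
    supports_boundaries: ClassVar[bool] = False

    @abstractmethod
    def _all_voices(self) -> list[dict]: ...

    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes: ...

    def voices(self, language: Optional[str] = None) -> list[dict]:
        """Language filtering lives in the base class, not in each provider."""
        found = self._all_voices()
        if language is None:
            return found
        return [v for v in found if v["language"].startswith(language)]


class SilenceProvider(TTSProvider):
    """Toy provider that returns silent PCM instead of running a model."""

    id = "silence"
    supported_languages = ("en", "fr")
    supports_boundaries = False

    def _all_voices(self) -> list[dict]:
        return [
            {
                "voiceURI": f"urn:readium:tts:{self.id}:{lang}-mute",
                "language": lang,
                "boundary": self.supports_boundaries,
            }
            for lang in self.supported_languages
        ]

    def synthesize(self, text: str, voice: str) -> bytes:
        # 16-bit mono silence: 240 samples (10 ms at an assumed 24 kHz) per character.
        return b"\x00\x00" * (240 * len(text))
```

The last step, registering the class in _build_registry(), is omitted because the registry's API is not documented here.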
Client
└─ POST /v1/synthesize
└─ Synthesizer
├─ validate text length + content
├─ resolve voiceURI → (provider, voiceURI) (VoiceCatalog)
├─ provider.synthesize() ← runs in thread pool, bounded by semaphore
│ └─ TTSModel.generate_audio() — CPU inference
└─ encode PCM → mp3/opus (ffmpeg driver) or wrap → wav
- Routes are async; all CPU-bound inference runs off the event loop via anyio.to_thread.run_sync
- A semaphore (MAX_CONCURRENT_SYNTHESES) prevents model thrashing under concurrent load
- Model throughput scales by adding worker processes (WORKERS), not threads
- Model weights live in a named Docker volume: downloaded once, then reused on every subsequent start
| Provider | Status | Notes |
|---|---|---|
| PocketTTS | Current | CPU · 6 languages · 156 voices (26 identities × 6 languages) |
| Kokoro | 📆 Coming soon | Referenced, not vendored (IP cleanliness) |
| ElevenLabs | 📆 Coming soon | Proxied · word boundaries supported |
| Azure Speech | 📆 Coming soon | Proxied · word boundaries supported |
- readium/speech — TypeScript read-aloud library this server is designed to pair with
- HadrienGardeur/web-speech-recommended-voices — voice catalog schema reference (CC0)
- pocket-tts — the underlying CPU TTS engine