Readium Speech Server

A remote text-to-speech HTTP service for the Readium ecosystem. Exposes a uniform API for listing voices and synthesizing speech, backed by open neural TTS models running on CPU — no GPU required.

Designed to pair with Readium Speech and any Readium-compatible reading application.

Overview


API	`GET /v1/voices` · `POST /v1/synthesize`
Providers	PocketTTS (v1) · Kokoro, ElevenLabs, Azure (planned)
Languages	English · French · Italian · German · Spanish · Portuguese
Formats	MP3 · WAV · Opus
Word boundaries	Schema ready; supported when provider supplies timing data
Deployment	Docker · CPU-only · single named volume for model weights

Quick start

You need Docker — nothing else. No Python, no PyTorch, no model files on your machine.

make configure        # interactive setup → writes .env
make build            # build the image (~2 min)
make dev-docker       # start server — downloads models on first run

Server: http://localhost:8000
Interactive API docs: http://localhost:8000/docs

Quick test — synthesize and play:

curl -s -X POST http://localhost:8000/v1/synthesize \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello world","voice":"urn:readium:tts:pocket:en-alba"}' \
  -o /tmp/speech.mp3 && open /tmp/speech.mp3
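
The same request can be issued from Python with only the standard library. This is a minimal sketch: the helper name `build_synthesize_request` is illustrative and not part of the server, and the base URL assumes the default port shown above.

```python
import json
import urllib.request


def build_synthesize_request(text, voice, base_url="http://localhost:8000"):
    """Build the POST /v1/synthesize request from the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running server; the default response body is binary MP3):
#   req = build_synthesize_request("Hello world", "urn:readium:tts:pocket:en-alba")
#   with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as out:
#       out.write(resp.read())
```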

First start downloads the selected language models (~240 MB each) into a persistent Docker volume. Later restarts skip the download because the models are already cached in that volume.

Production

make start   # detached, restarts automatically on crash or reboot
make stop    # stop all containers
make logs    # tail container logs

Setup wizard

make configure opens an interactive wizard. It handles both first-time setup and ongoing management:

Readium Speech Server

  Current: languages=en  workers=1

  1) Show full config
  2) Add a language
  3) Remove a language
  4) Change workers
  5) Update HF token
  6) First-time setup (re-run / overwrite)
  7) Reset
  q) Quit

Adding a language updates .env in place — only the new model is downloaded on next restart. Removing a language preserves the model files in the Docker volume; disk is reclaimed only if you choose to purge the volume.

Voices

156 voices across 6 languages (26 voice identities × 6 languages). Every voice is available in every language: alba speaking English, alba speaking French, alba speaking German, and so on. Only languages listed in LANGUAGES are loaded at startup (~240 MB RAM per language per worker).

The 26 voice identities, sourced from kyutai/tts-voices:

Voice	Gender	Origin
alba	female	Alba MacKenna (CC BY 4.0)
anna, vera, fantine, charles, paul, eponine, azelma, george, mary, jane, michael, eve	mixed	VCTK dataset (CC BY 4.0)
bill_boerst, peter_yearsley, stuart_bell, caro_davy	mixed	Voice Zero / LibriVox (CC0)
marius, javert	male	Voice donations (CC0)
cosette	female	Expresso dataset (CC BY-NC 4.0)
jean	male	EARS dataset (CC BY-NC 4.0)
estelle	female	Unmute production voices
giovanni, lola, juergen, rafael	mixed	Kyutai (language reference voices)

Voice URIs are language-scoped — the same speaker in different languages gets a distinct URI:

urn:readium:tts:pocket:en-alba    # Alba speaking English
urn:readium:tts:pocket:fr-alba    # Alba speaking French
urn:readium:tts:pocket:de-alba    # Alba speaking German

Supported language codes: en, fr, it, de, es, pt — derived directly from PocketTTS model names.
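
Clients that need the provider, language, or speaker separately can split the URI. The sketch below assumes the shape `urn:readium:tts:<provider>:<language>-<voice>`, which is inferred from the three examples above rather than taken from a formal grammar; `VoiceRef`, `parse_voice_urn`, and `build_voice_urn` are hypothetical helper names.

```python
from typing import NamedTuple

URN_PREFIX = "urn:readium:tts:"


class VoiceRef(NamedTuple):
    provider: str
    language: str
    name: str


def parse_voice_urn(urn: str) -> VoiceRef:
    """Split a voice URI into provider, language code, and voice name."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a Readium TTS voice URI: {urn!r}")
    provider, colon, rest = urn[len(URN_PREFIX):].partition(":")
    # Split on the first hyphen only; voice names such as bill_boerst
    # use underscores, so the language code is everything before it.
    language, dash, name = rest.partition("-")
    if not (provider and colon and language and dash and name):
        raise ValueError(f"malformed voice URI: {urn!r}")
    return VoiceRef(provider, language, name)


def build_voice_urn(provider: str, language: str, name: str) -> str:
    """Inverse of parse_voice_urn under the same assumed shape."""
    return f"{URN_PREFIX}{provider}:{language}-{name}"
```

Switching a speaker to another language then amounts to rebuilding the URI with a different language code, for example `build_voice_urn("pocket", "de", "alba")`.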

How it works: PocketTTS pre-computes voice embeddings for every voice × language combination (stored as .safetensors files in kyutai/pocket-tts-without-voice-cloning). The voice sample is encoded once at model-load time — no per-request cloning overhead.

API reference

Health

Method	Path	Description
`GET`	`/healthz`	Liveness — 200 when the process is running
`GET`	`/readyz`	Readiness — 503 until models are loaded and ffmpeg is available
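
Because `/readyz` returns 503 until models are loaded, a deployment script may want to poll it before sending traffic. The following is a sketch under stated assumptions: `readyz_status` and `wait_until_ready` are hypothetical names, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.error
import urllib.request


def readyz_status(base_url="http://localhost:8000"):
    """Return the HTTP status of GET /readyz, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while models are still loading
    except OSError:
        return 0  # connection refused, DNS failure, timeout


def wait_until_ready(probe=readyz_status, timeout=120.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if probe() == 200:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `probe`, `sleep`, and `clock` parameters are injectable so the loop can be exercised in tests without a running server.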

Voices

GET /v1/voices
GET /v1/voices?language=fr
GET /v1/voices?provider=pocket
GET /v1/voices?offset=0&limit=20

Returns an array of voice objects. Null-valued optional fields are omitted. Each voice includes a boundary field indicating whether that provider supports word-level timing marks.

Query parameters (filtering and pagination):

Param	Type	Description
`language`	string	Filter by BCP-47 language prefix (e.g. `en`, `fr`)
`provider`	string	Filter by provider id (e.g. `pocket`)
`offset`	int ≥ 0	Voices to skip (default: 0)
`limit`	int ≥ 1	Max voices to return (default: all)

Response headers:

Header	Description
`X-Total-Count`	Total matching voices before pagination
`X-Offset`	Applied offset
`X-Limit`	Applied limit (omitted when no limit set)
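
A client can combine the query parameters and response headers to page through the catalog. This sketch only builds URLs and interprets headers; `voices_url` and `has_more` are hypothetical helpers, and the headers are assumed to arrive as a simple name-to-string mapping.

```python
from urllib.parse import urlencode


def voices_url(base_url, language=None, provider=None, offset=0, limit=None):
    """Build a GET /v1/voices URL from the documented query parameters."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if limit is not None and limit < 1:
        raise ValueError("limit must be >= 1")
    params = {}
    if language:
        params["language"] = language
    if provider:
        params["provider"] = provider
    if offset:
        params["offset"] = offset
    if limit is not None:
        params["limit"] = limit
    query = urlencode(params)
    return f"{base_url}/v1/voices" + (f"?{query}" if query else "")


def has_more(headers, returned):
    """True when voices remain after this page, judged by X-Total-Count."""
    total = int(headers["X-Total-Count"])
    offset = int(headers.get("X-Offset", "0"))
    return offset + returned < total
```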

Synthesize

POST /v1/synthesize
Content-Type: application/json

Minimal request — returns binary MP3:

{
  "text": "Hello, world!",
  "voice": "urn:readium:tts:pocket:en-alba"
}

Full request:

{
  "id": "urn:uuid:019f178c-cc7c-7bb3-a39b-d185f43d3cc4",
  "text": "Ceci est un test.",
  "language": "fr",
  "voice": "urn:readium:tts:pocket:fr-estelle",
  "ssml": false,
  "prev_utterance": "La nuit était sombre.",
  "next_utterance": "La pièce était froide.",
  "publication_id": "urn:isbn:9780000000000",
  "boundary": false,
  "output": {
    "format": "mp3",
    "bitrate": 64,
    "speed": 1.0,
    "pitch": null
  }
}

Response (default, boundary: false):

Binary audio with Content-Type: audio/mpeg (or audio/wav, audio/ogg).

Response (boundary: true):

{
  "audio": "<base64-encoded audio>",
  "format": "mp3",
  "boundaries": [
    { "name": "word", "charIndex": 0,  "charLength": 5,  "elapsedTime": 0.0  },
    { "name": "word", "charIndex": 6,  "charLength": 3,  "elapsedTime": 0.38 }
  ]
}

boundaries is null when the provider does not support word timing. Check voice.boundary before requesting; if it is false, boundaries will always be null.

Word boundary fields mirror the Web Speech API boundary event: charIndex and charLength index into the original text; elapsedTime is seconds from audio start.
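
To illustrate how a client might consume the `boundary: true` response, the sketch below decodes the audio and pairs each boundary with the word it covers. It assumes `charIndex` and `charLength` can be used directly as Python string offsets, which the text above does not spell out; `decode_boundary_response` is a hypothetical helper.

```python
import base64


def decode_boundary_response(payload, text):
    """Return (audio_bytes, [(word, elapsed_seconds), ...])."""
    audio = base64.b64decode(payload["audio"])
    words = []
    # boundaries is null (None) when the provider has no timing data.
    for mark in payload.get("boundaries") or []:
        start = mark["charIndex"]
        end = start + mark["charLength"]
        words.append((text[start:end], mark["elapsedTime"]))
    return audio, words
```

A reading application could use the resulting list to highlight each word when playback reaches its `elapsedTime`.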

Output formats:

`format`	`Content-Type`	Notes
`mp3`	`audio/mpeg`	Default
`wav`	`audio/wav`	No transcoding — fastest
`opus`	`audio/ogg`	Smallest file size

Errors:

All errors return a consistent shape:

{ "error": { "code": "voice_not_found", "message": "...", "detail": null } }

Status	Code	Cause
400	`validation_failed`	Empty or whitespace text
404	`voice_not_found`	Voice URI not registered
413	`payload_too_large`	Text exceeds `MAX_TEXT_LENGTH` (default 2000 chars)
415	`unsupported_format`	`format` value not in `mp3`, `wav`, `opus`
422	—	Request schema invalid (Pydantic detail)
503	—	Models not yet loaded
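
On the client side, the error envelope can be mapped onto a single exception type. This is a sketch, not the server's own code: `SpeechServerError` and `error_from_response` are hypothetical names, and because the 422 and 503 rows list no code, the parser falls back to the raw body when the envelope is absent. Treating only 503 as retryable is likewise an assumption.

```python
import json


class SpeechServerError(Exception):
    """Client-side view of an error response."""

    def __init__(self, status, code, message, detail=None):
        super().__init__(f"{status} {code or 'error'}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.detail = detail

    @property
    def retryable(self):
        # 503 means models are not yet loaded, so a later retry may succeed.
        return self.status == 503


def error_from_response(status, body):
    """Parse {"error": {...}}; tolerate bodies with a different shape."""
    try:
        err = json.loads(body)["error"]
        return SpeechServerError(
            status, err.get("code"), err.get("message"), err.get("detail")
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        return SpeechServerError(status, None, body.decode("utf-8", "replace"))
```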

Configuration

Run make configure to generate .env, or run bash scripts/configure.sh directly.

Variable	Default	Description
`LANGUAGES`	`en`	Comma-separated BCP-47 language codes to load. Supported: `en fr it de es pt`
`HF_TOKEN`	(empty)	HuggingFace token. Optional — prevents rate-limiting on first-run model downloads
`WORKERS`	`1`	Uvicorn worker processes. Each loads a full copy of every active language model
`MAX_CONCURRENT_SYNTHESES`	`2`	Max parallel CPU inference jobs per worker
`API_KEY_ENABLED`	`false`	Require `X-API-Key` header on all routes
`API_KEY`	(empty)	Key value when `API_KEY_ENABLED=true`
`LOG_LEVEL`	`INFO`	`DEBUG` · `INFO` · `WARNING` · `ERROR`
`PORT`	`8000`	Listen port
`MAX_TEXT_LENGTH`	`2000`	Maximum characters per synthesis request
`FFMPEG_BIN`	`ffmpeg`	Path to ffmpeg binary (bundled in the Docker image)
`POCKET_DEFAULT_VOICE`	`alba`	Default voice when none is specified
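
As an illustration of how these variables fit together, the sketch below reads a subset of them from a mapping and applies the defaults from the table. It is not the server's actual settings code: `load_settings` is a hypothetical name, and the boolean parsing rule (only the string `true`, case-insensitive) is an assumption.

```python
DEFAULTS = {
    "LANGUAGES": "en",
    "WORKERS": "1",
    "MAX_CONCURRENT_SYNTHESES": "2",
    "API_KEY_ENABLED": "false",
    "PORT": "8000",
    "MAX_TEXT_LENGTH": "2000",
}
SUPPORTED_LANGUAGES = {"en", "fr", "it", "de", "es", "pt"}


def load_settings(env):
    """Merge env over the documented defaults and convert types."""
    raw = {**DEFAULTS, **{k: v for k, v in env.items() if k in DEFAULTS}}
    languages = [c.strip() for c in raw["LANGUAGES"].split(",") if c.strip()]
    unknown = sorted(set(languages) - SUPPORTED_LANGUAGES)
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return {
        "languages": languages,
        "workers": int(raw["WORKERS"]),
        "max_concurrent_syntheses": int(raw["MAX_CONCURRENT_SYNTHESES"]),
        "api_key_enabled": raw["API_KEY_ENABLED"].lower() == "true",
        "port": int(raw["PORT"]),
        "max_text_length": int(raw["MAX_TEXT_LENGTH"]),
    }
```

Passing `os.environ` as `env` would read the process environment; passing a plain dict keeps the function easy to test.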

RAM estimate: WORKERS × active languages × ~240 MB

Example: 2 workers, English + French = 2 × 2 × 240 MB ≈ 960 MB
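
The estimate can be computed directly from the `WORKERS` and `LANGUAGES` values. This small sketch uses the ~240 MB per-model figure quoted above; `estimate_ram_mb` is a hypothetical helper, and actual usage will vary with the host and workload.

```python
MB_PER_MODEL = 240  # approximate per-language, per-worker figure


def estimate_ram_mb(workers, languages):
    """WORKERS x active languages x ~240 MB, with languages as a CSV string."""
    active = [code.strip() for code in languages.split(",") if code.strip()]
    return workers * len(active) * MB_PER_MODEL
```

For the example above, `estimate_ram_mb(2, "en,fr")` gives 960.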

Development

Commands

Command	Description
`make configure`	Run setup wizard
`make build`	Build Docker image
`make dev-docker`	Start dev server with hot-reload
`make dev-docker-build`	Build then start
`make start`	Start production stack (detached)
`make stop`	Stop containers
`make logs`	Tail app logs
`make test-docker`	Fast test suite — no models needed
`make test-integration-docker`	Integration tests — requires models in volume
`make ci-docker`	Lint + format check + typecheck + tests
`make lint-docker`	`ruff check`
`make fmt-docker`	`ruff format`
`make typecheck-docker`	`mypy`
`make test`	Fast tests via `uv` (no Docker)
`make ci`	Full local CI
`make clean`	Remove `__pycache__` and `.pyc` files

Running tests locally

The fast suite requires no models and no ffmpeg — everything is mocked:

uv sync
uv run pytest tests/ -m 'not integration and not slow' -v

Adding a provider

1. Create app/providers/<name>.py implementing TTSProvider.
2. Declare id, supported_languages, and supports_boundaries as class variables.
3. Implement _all_voices() and synthesize().
4. Register in app/main.py _build_registry().

No changes to routes, synthesizer, or voice catalog. Language filtering and boundary capability are inherited automatically from the base class.
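
The steps above might look roughly like the following. This is a sketch only: `ProviderBase` is a hypothetical stand-in for the project's `TTSProvider`, whose real method signatures, voice fields (`voiceURI` and `language` are assumed here), and return types may differ. `SilenceProvider` is a toy example that emits silence.

```python
from abc import ABC, abstractmethod
from typing import ClassVar, Optional


class ProviderBase(ABC):
    """Hypothetical stand-in for TTSProvider; the real base class may differ."""

    id: ClassVar[str]
    supported_languages: ClassVar[tuple]
    supports_boundaries: ClassVar[bool] = False

    @abstractmethod
    def _all_voices(self) -> list:
        """Return every voice this provider can serve."""

    @abstractmethod
    def synthesize(self, text: str, voice_uri: str) -> bytes:
        """Return raw audio for the text (signature assumed)."""

    def voices(self, language: Optional[str] = None) -> list:
        # Shared behaviour: language filtering plus the boundary flag.
        return [
            {**voice, "boundary": self.supports_boundaries}
            for voice in self._all_voices()
            if language is None or voice["language"].startswith(language)
        ]


class SilenceProvider(ProviderBase):
    """Toy provider that emits silence; useful only as a template."""

    id = "silence"
    supported_languages = ("en",)
    supports_boundaries = False

    def _all_voices(self) -> list:
        return [{"voiceURI": "urn:readium:tts:silence:en-quiet", "language": "en"}]

    def synthesize(self, text: str, voice_uri: str) -> bytes:
        # 16-bit mono PCM: two zero bytes per sample, 100 samples per character.
        return b"\x00\x00" * (100 * len(text))
```

The subclass supplies only the three class variables and two methods; filtering and the `boundary` field come from the base class, mirroring the inheritance described above.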

Architecture

Client
  └─ POST /v1/synthesize
       └─ Synthesizer
            ├─ validate text length + content
            ├─ resolve voiceURI → (provider, voiceURI)  (VoiceCatalog)
            ├─ provider.synthesize()  ← runs in thread pool, bounded by semaphore
            │    └─ TTSModel.generate_audio()  — CPU inference
            └─ encode PCM → mp3/opus  (ffmpeg driver)  or  wrap → wav

Routes are async; all CPU-bound inference runs off the event loop via anyio.to_thread.run_sync
A semaphore (MAX_CONCURRENT_SYNTHESES) prevents model thrashing under concurrent load
Model throughput scales by adding worker processes (WORKERS), not threads
Model weights live in a named Docker volume — downloaded once, instant on every subsequent start
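
The thread-pool-plus-semaphore pattern can be sketched with the standard library alone. Note the substitution: the server uses `anyio.to_thread.run_sync`, while this example uses `asyncio.to_thread` and `asyncio.Semaphore` to stay dependency-free. `FakeModel` is a hypothetical stand-in for the TTS model that sleeps instead of running inference and records peak concurrency.

```python
import asyncio
import threading
import time


class FakeModel:
    """Stand-in for a TTS model; tracks how many calls overlap."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0

    def generate_audio(self, text):
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        time.sleep(0.05)  # pretend to do CPU-bound inference
        with self._lock:
            self._active -= 1
        return text.encode("utf-8")


async def synthesize_all(model, texts, max_concurrent=2):
    """Run blocking inference off the event loop, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(text):
        async with semaphore:
            return await asyncio.to_thread(model.generate_audio, text)

    return await asyncio.gather(*(one(t) for t in texts))
```

Running `asyncio.run(synthesize_all(FakeModel(), ["a", "b", "c"]))` returns the results in input order while never running more than `max_concurrent` inference calls at once.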

Provider roadmap

Provider	Status	Notes
PocketTTS	Current	CPU · 6 languages · 156 voices (26 identities × 6 languages)
Kokoro	📆 Coming soon	Referenced, not vendored (IP cleanliness)
ElevenLabs	📆 Coming soon	Proxied · word boundaries supported
Azure Speech	📆 Coming soon	Proxied · word boundaries supported

Related projects

readium/speech — TypeScript read-aloud library this server is designed to pair with
HadrienGardeur/web-speech-recommended-voices — voice catalog schema reference (CC0)
pocket-tts — the underlying CPU TTS engine
