docs-puller mirrors vendor, reference, and local project docs into Markdown, builds a local SQLite FTS5 index, and gives agents a fast private search surface with reproducible retrieval evals.
This is the open-core local CLI. The hosted Team tier (multi-tenant control plane, billing, managed corpora) is proprietary. See OPEN-CORE.md for the commercial boundary.
Requirements: Go 1.26+.
go install github.com/nstranquist/docs-puller@latest
Or from a checkout:
git clone https://github.com/nstranquist/docs-puller.git
cd docs-puller
go install .
The following smoke test creates a tiny local corpus, indexes it, checks health, then searches it:
tmp="$(mktemp -d)"
mkdir -p "$tmp/input"
printf '# PurpleWidget setup\n\nRun `purplewidget init` to configure the local docs mirror.\n' > "$tmp/input/setup.md"
docs-puller pull --local "$tmp/input" --name smoke --out "$tmp/corpus"
docs-puller reindex --out "$tmp/corpus"
docs-puller status --out "$tmp/corpus" --check
docs-puller search "purplewidget init" --out "$tmp/corpus" --source smoke --limit 1 --json
The search result should include setup.md.
docs-puller pull --from urls.md --out ~/code/docs
docs-puller pull-url https://example.com/docs/page --out ~/code/docs
docs-puller pull --local ~/projects/my-app --name my-app --out ~/code/docs
docs-puller pull-local-batch --source app=~/projects/my-app --source docs=~/code/docs --out ~/code/docs
docs-puller pull --github-repo owner/repo --name repo-docs --out ~/code/docs
docs-puller reindex --out ~/code/docs
docs-puller status --out ~/code/docs --check
docs-puller status --out ~/code/docs --check --check-embeddings
docs-puller search "supabase row level security" --out ~/code/docs --compact
docs-puller pins refresh --write --out ~/code/docs
docs-puller search "flatlist performance" --out ~/code/docs --source react-native --version 0.79
docs-puller search "react native debugging" --out ~/code/docs --source react-native --all-versions
Embeddings are stored separately from FTS at <out>/.cache/embeddings.db; the FTS index remains <out>/.cache/search.db. Whole-doc embedding runs also write a flat vector sidecar (embeddings-<model>.vec) used by --rerank-hybrid before falling back to SQLite.
status reports missing or stale embedding sidecars, but status --check only fails on core corpus/index health. Add --check-embeddings when rerank readiness should be part of the gate.
docs-puller embed --out ~/code/docs --model text-embedding-3-small
docs-puller embed --out ~/code/docs --model text-embedding-3-small --write-flat-only
docs-puller embed --out ~/code/docs --migrate-legacy
docs-puller search "how do I count tokens with Anthropic" --out ~/code/docs --rerank-llm --rerank-hybrid --rerank-k 10
The embedding batcher retries per-input token cap failures and recursively splits batches when the provider rejects total batch tokens.
Query logging is opt-in:
docs-puller search "deploy Azure Functions from the CLI" --out ~/code/docs --log-query --intent support
DOCS_PULLER_QUERY_LOG=1 docs-puller search "react native list performance" --out ~/code/docs
Curate observed queries into a candidate fixture:
docs-puller telemetry log --limit 20
docs-puller telemetry fixture --intent support --out-file eval/support-candidates.yaml
Telemetry-derived fixtures use the observed top hit as expect and include a note to verify before promotion.
Latest docs stay canonical at <out>/<source>/. Versioned docs are bounded overlays generated from lockfiles, then searched through the same FTS5 index. Sources without source-specific crawl pages seed one entrypoint; high-breakage sources can define a small versioned_pages set in version_policy.yaml.
docs-puller pins refresh --out ~/code/docs --json
docs-puller pins refresh --out ~/code/docs --write
docs-puller pins sync --out ~/code/docs --write
docs-puller pull-pins --out ~/code/docs --source react-native --write
docs-puller pins gc --out ~/code/docs --grace-days 14 --write
pull-pins --write stages a complete pinned source directory before replacing the live overlay, then refreshes only those source IDs in FTS5. Latest docs remain untouched and keep ranking first for migration/latest-intent queries.
Generated pins live at <out>/_DOCS_PINS.json. Source families keep their stable names (react-native), while pinned source IDs use <family>__v<version> (react-native__v0.79). Search defaults prefer the current workspace pin, then other workspace pins, latest docs, tools pins, and finally other pins. Use --all-versions when mirror hits should remain separate, --version latest for upgrade work, or --version <tag> for an exact lane.
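The naming scheme and default lane order can be sketched in Go as follows. This is an illustration only: `pinnedID`, `splitID`, the lane constants, and `orderByLane` are hypothetical names, and the sketch treats lanes as strict tiers, whereas the real ranker may blend lane preference with relevance scores.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pinnedID builds a pinned source ID in the <family>__v<version> form.
func pinnedID(family, version string) string {
	return family + "__v" + version
}

// splitID reverses pinnedID. An ID without the "__v" marker is treated
// as the latest (unpinned) source for that family.
func splitID(id string) (family, version string) {
	if i := strings.LastIndex(id, "__v"); i >= 0 {
		return id[:i], id[i+len("__v"):]
	}
	return id, "latest"
}

// Lanes in the default preference order described above (hypothetical names).
const (
	laneCurrentWorkspacePin = iota
	laneOtherWorkspacePin
	laneLatest
	laneToolsPin
	laneOtherPin
)

type hit struct {
	SourceID string
	Lane     int
}

// orderByLane sorts hits by lane and keeps the incoming order within a lane.
func orderByLane(hits []hit) {
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].Lane < hits[j].Lane })
}

func main() {
	id := pinnedID("react-native", "0.79")
	family, version := splitID(id)
	fmt.Println(id, family, version) // react-native__v0.79 react-native 0.79

	hits := []hit{
		{"react-native", laneLatest},
		{"react-native__v0.79", laneCurrentWorkspacePin},
		{"react-native__v0.72", laneOtherPin},
	}
	orderByLane(hits)
	for _, h := range hits {
		fmt.Println(h.SourceID)
	}
}
```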
docs-puller serve runs a local search server with an embedded web UI — no build
step, no extra dependencies:
docs-puller serve --out ~/code/docs
# open http://127.0.0.1:7799
The UI supports live search with source filtering and doc preview over the JSON API:
GET /api/search?q=<query>&source=<id>&limit=<n>
GET /api/sources
GET /api/status
GET /api/doc?path=<rel>
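A client can assemble the search URL with the standard library, as in this minimal Go sketch. `searchURL` is a hypothetical helper, and treating `source` and `limit` as optional is an assumption; the response schema is not shown here, so the sketch stops at building the request URL.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// searchURL builds a GET /api/search URL against base (the server root).
// source and limit are left out when empty or zero, which assumes the
// server treats them as optional.
func searchURL(base, query, source string, limit int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/search"
	q := url.Values{}
	q.Set("q", query)
	if source != "" {
		q.Set("source", source)
	}
	if limit > 0 {
		q.Set("limit", strconv.Itoa(limit))
	}
	u.RawQuery = q.Encode() // escapes values and sorts keys alphabetically
	return u.String(), nil
}

func main() {
	s, err := searchURL("http://127.0.0.1:7799", "purplewidget init", "smoke", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// http://127.0.0.1:7799/api/search?limit=1&q=purplewidget+init&source=smoke
}
```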
Security defaults: binds 127.0.0.1 and refuses a non-loopback --addr unless a
bearer token is set (--auth-token, --auth-token-file, or $DOCS_SERVE_TOKEN).
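The bind rule amounts to a small guard. The Go sketch below shows one way such a check could look; `checkBind` and `isLoopback` are hypothetical names, and treating the literal host `localhost` as loopback is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isLoopback reports whether host is a loopback IP literal.
// Accepting the name "localhost" is an assumption of this sketch.
func isLoopback(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// checkBind rejects a non-loopback listen address unless a token is set.
// An empty host (":7799") listens on all interfaces, so it needs a token too.
func checkBind(addr, token string) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return err
	}
	if isLoopback(host) || token != "" {
		return nil
	}
	return errors.New("refusing non-loopback address without a bearer token")
}

func main() {
	for _, addr := range []string{"127.0.0.1:7799", "[::1]:7799", "0.0.0.0:7799", ":7799"} {
		fmt.Printf("%-16s no token: %v\n", addr, checkBind(addr, ""))
	}
	fmt.Println("0.0.0.0:7799 with token:", checkBind("0.0.0.0:7799", "example-token"))
}
```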
The server picks up out-of-process pull/reindex runs automatically — no
restart needed.
vscode-extension/ ships a VS Code client for the same endpoint ("Docs Puller:
Search").
You do not need config for pull, search, reindex, eval, or the smoke test above. Config is for power users who want cwd-based profile selection, monorepo pin scanning, and custom source keyword boosts.
docs-puller config init
# edits ~/.docs-puller/config.yaml paths + ~/.docs-puller/profiles/my-stack.yaml sources
docs-puller profile list
docs-puller search "your query" --profile my-stack --out ~/code/docs
config init writes:
- ~/.docs-puller/config.yaml (from config.example.yaml)
- ~/.docs-puller/profiles/<profile>.yaml (from profiles/example.yaml)
Use --profile NAME to pick a different profile name. Pass --force to overwrite
existing files.
Check where config resolves:
docs-puller config path
Override location with DOCS_PULLER_CONFIG=/path/to/config.yaml.
mkdir -p ~/.docs-puller/profiles
cp config.example.yaml ~/.docs-puller/config.yaml
cp profiles/example.yaml ~/.docs-puller/profiles/my-stack.yaml
# edit paths + profile sources, then verify:
docs-puller profile list
Profile lookup order: <corpus>/profiles/ → ~/.docs-puller/profiles/ → profiles
beside your config file → embedded profiles/example.yaml.
See config.example.yaml for the schema (cwd_profiles, pin_scan_roots,
tools_pin_scopes, source_keywords).
- Default corpus: ~/code/docs (override with DOCS_PULLER_OUT=<dir>)
- Isolated corpus: pass --out <dir> on pull, reindex, status, search, eval, and pins commands
- Index: <out>/.cache/search.db
- Embeddings: <out>/.cache/embeddings.db plus optional flat vector sidecars
- Query log: opt-in, controlled by --log-query or DOCS_PULLER_QUERY_LOG=1
- Ranking-hygiene policy: DOCS_PULLER_HYGIENE_POLICY=/path/to/policy.json appends your own downranked path patterns (same JSON shape as internal/sourcehygiene/policy.json) to the built-in set, which is useful for keeping generated notes or scratch exports out of results
- Legacy shared-state paths: set DOCS_PULLER_LEGACY_NDEV_PATHS=1 only when intentionally sharing corpus state with a private wrapper install (operator builds only)
Run these before publishing a public change:
go build -tags sqlite_fts5 ./...
go vet -tags sqlite_fts5 ./...
go test -tags sqlite_fts5 ./...
docs-puller eval --check-fixture
docs-puller eval --answer-context --record-run
docs-puller eval-suite --overview-md retrieval-metrics.md --overview-html retrieval-metrics.html
docs-puller eval-leaderboard --format json
docs-puller curation lint
eval-suite --overview-md/--overview-html writes per-library and per-query-type retrieval metrics, including Hit@K, MRR, latency, returned-token estimates, and full answer-context token counts. Overview generation enables answer-context counting automatically so the token columns reflect the returned Markdown docs rather than only snippet metadata.
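For readers new to the metrics, Hit@K is the fraction of queries whose expected doc appears in the top K results, and MRR is the mean of 1/rank of the expected doc. The Go sketch below uses the standard definitions with made-up ranks; it is not the eval harness's code, and `hitAtK` and `mrr` are hypothetical names.

```go
package main

import "fmt"

// hitAtK returns the fraction of queries whose expected doc ranked within
// the top k. A rank of 0 means the expected doc was not returned.
func hitAtK(ranks []int, k int) float64 {
	if len(ranks) == 0 {
		return 0
	}
	hits := 0
	for _, r := range ranks {
		if r >= 1 && r <= k {
			hits++
		}
	}
	return float64(hits) / float64(len(ranks))
}

// mrr returns the mean reciprocal rank; missing docs contribute 0.
func mrr(ranks []int) float64 {
	if len(ranks) == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range ranks {
		if r >= 1 {
			sum += 1 / float64(r)
		}
	}
	return sum / float64(len(ranks))
}

func main() {
	// Made-up example: 24 queries, 23 answered at rank 1 and one at rank 2.
	ranks := make([]int, 24)
	for i := range ranks {
		ranks[i] = 1
	}
	ranks[23] = 2
	fmt.Printf("Hit@1 %.3f  Hit@5 %.3f  MRR %.3f\n",
		hitAtK(ranks, 1), hitAtK(ranks, 5), mrr(ranks))
	// Hit@1 0.958  Hit@5 1.000  MRR 0.979
}
```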
The eval harness ships vendor-style YAML fixtures (eval/*.yaml) you can run against your own corpus:
docs-puller eval --check-fixture
docs-puller eval-suite --json
A fully reproducible baseline ships in eval/sample-corpus/:
24 pinned public doc pages (SQLite, Go, PostgreSQL) + 24 queries + a frozen
BM25-only baseline (Hit@1 95.8% / Hit@5 100% / MRR 0.979) that anyone can
replay with no API key:
corpus="$(mktemp -d)"
docs-puller pull --from eval/sample-corpus/sources.md --out "$corpus"
docs-puller reindex --out "$corpus"
docs-puller eval --fixture eval/sample-corpus/fixture.yaml --out "$corpus"
docs-puller eval-leaderboard --fixtures eval/sample-corpus --out "$corpus" --format json
The main eval/*.yaml fixture numbers are measured on the maintainer's larger
multi-vendor corpus mirror; treat those as operator-measured until you rebuild
an equivalent corpus.