fix: NFR review findings — error-path hardening + protocol/contract test coverage by tiantt · Pull Request #116 · volcengine/agentkit-sdk-python

tiantt · 2026-07-03T09:06:53Z

Summary

Three rounds of fixes driven by a systematic NFR (non-functional requirements) review across security / observability / reliability / performance / test-quality dimensions.

Round 1 (27169e1)

Remove dead duplicate POST /run_sse registration (debug leftover with print; the telemetry-instrumented handler registered first and moved to routes[0] is the live one — runtime behavior unchanged)
Redact credential headers (authorization/token) before recording gen_ai.request.headers on spans in simple_app (they bypassed the logging RedactionFilter; mirrors agent_server middleware)
ve_sign: thread signing scope through as parameters instead of mutating module globals (concurrent calls for different services/regions could sign with each other's scope); reuse a shared requests.Session for connection pooling; add golden-vector tests pinning the signing algorithm
utils.misc.retry: exponential backoff with cap, explicit retry_on, documented idempotency contract; runtime polling gains mild backoff (3s → 10s cap)
Stream pipeline step-log / runtime failure-log downloads to disk chunk-wise (flat peak memory)
clean_env fixture snapshots/restores the full environment — raw os.environ writes can no longer leak across tests

Round 2 (a50edd6)

Fix 3 order-dependent tests that only "passed" via leaked CLOUD_PROVIDER env (they never validated the global-config path they claimed to; also patch global_config_exists())
/run_sse SSE error events no longer echo internal exception details to clients
Cap server-provided Retry-After at 30s (+ tests)
a2a executor wrapper: initialize result before try — executor failures previously got masked by UnboundLocalError in the finally-block telemetry call
simple_app: run sync entrypoints via asyncio.to_thread (blocking user code no longer stalls the event loop)
code_pipeline log export: bounded-pool concurrent downloads stitched in original order
+49 tests: real-FastAPI route-table assembly (regression guard for the duplicate-route class), agentkit.apps public export surface, A2A JSON-RPC round-trip validated with a2a.types, MCP round-trip via fastmcp in-memory client, offline unit tests for the 5 previously untested SDK service clients

Round 3 (25d737c) — shadow-review blockers + cross-cutting majors

cr: bound the instance-creation status poll with a 30-min deadline (was an unbounded while True — an instance stuck in Creating hung the CLI forever)
sandbox exec: terminal websocket gains a 30s connect timeout + protocol-level ping keepalive (30s interval / 10s timeout) — a silently dead server is now detected instead of hanging the session until Ctrl-C (the app-level ping is server-initiated, so the client needed its own probe)
apps: sensitive-header exclusion centralized in agentkit.apps.utils.SENSITIVE_HEADERS and extended with x-security-token / cookie / x-api-key / proxy-authorization / api-key / set-cookie — was two hand-synced two-entry sets plus one inline copy; X-Security-Token (STS credentials, see agentkit.auth._sigv4) previously leaked onto spans
a2a telemetry: inline the user-id fallback instead of importing google-adk's private _get_user_id (closes the undeclared-dependency ImportError flagged in earlier rounds), wrap trace_a2a_agent with dont_throw so a telemetry failure in the finally block can no longer mask the business exception, and log execute failures with stack + context_id
agent server /invoke: error frames no longer echo raw str(e) into the SSE payload (information leak + unescaped JSON); parity with /run_sse — generic message + error_type, json.dumps-built
CI: add a test workflow (3.10/3.12 matrix, pytest + coverage report artifact — no --cov-fail-under yet: establish the baseline first, then ratchet); declare pytest / pytest-cov / google-adk in the dev dependency group; pytest + coverage config in pyproject

Test status

python3 -m pytest tests → 910 passed, 0 failed (branch started at 856 passed / 4 failed; holds after round 3).

Deliberately not addressed (separate decisions)

CORS default allow_origins=["*"] + allow_credentials=True, and no default endpoint auth — behavior/product decisions, breaking for existing deployments
Dependency CVEs reported by GitHub (4 critical / 21 high) — separate upgrade branch recommended
base_service_client response validation can leak raw pydantic.ValidationError outside the ApiError hierarchy
requirements.txt ↔ pyproject.toml dependency drift (each lists packages the other lacks; several core deps unpinned) — needs a single-source-of-truth decision
Logging RedactionFilter's opaque-token regex over-redacts (request_id/trace_id/paths become ***), while the auth credential chain has no logging at all — both need a scoped design pass

Note on base branch

Stacked on fix/agentkit-latent-bugs (this work builds on the _signed_request/http_defaults engineering-standards changes there). When that branch merges, this PR retargets automatically.

…ility/perf Observability & routing: - agent_server_app: remove dead duplicate /run_sse registration (leftover debug copy with print(); the telemetry-instrumented handler registered first and moved to routes[0] is the live one) - simple_app telemetry: redact credential headers (authorization/token) before recording gen_ai.request.headers on spans — they bypassed the logging RedactionFilter; mirrors agent_server middleware _EXCLUDED_HEADERS Security/concurrency & performance (ve_sign): - thread signing scope (service/version/region/host/content_type/scheme) through as parameters instead of mutating module globals — concurrent calls for different services/regions could sign with each other's scope - reuse a shared requests.Session for pooled keep-alive connections - add golden-vector tests for the signing algorithm (canonical request, HMAC chain, Authorization header) and per-call-scope isolation Reliability: - utils.misc.retry: exponential backoff with cap, explicit retry_on parameter, document the idempotency contract - ve_agentkit runtime polling: mild backoff (3s -> 10s cap) Performance (memory): - code_pipeline step-log download and ve_agentkit failure-log download now stream to disk chunk-wise; failure-log read-back bounded to displayed lines Test hygiene: - tests/platform clean_env: snapshot/restore full environ so raw os.environ writes cannot leak across tests (order-dependent failures) - global_config_io: leave debug traces on best-effort fallbacks (chmod 0600 / mtime cache), add module logger Not addressed on purpose: CORS default allow_origins=["*"] and missing default endpoint auth are behavior/product decisions (breaking for existing deployments) — flagged in the NFR report for a separate decision. Pre-existing failures (also fail on the base commit, standalone): test_cli_config_interactive_provider_resolution (2) and test_executor_platform_context_provider — order-dependent platform-context tests previously masked by env leakage; left for a follow-up. Suite: 857 passed, 3 pre-existing failures (see above).

…und 2) Test fixes (unblocks CI): - cli_config_interactive / executor_platform_context tests: also patch global_config_exists() and clear provider env vars — they previously "passed" only when earlier tests leaked CLOUD_PROVIDER into the process environment, and never validated the global-config path they claimed to Security/robustness: - /run_sse SSE error events no longer echo internal exception details to clients (error_type only; full detail stays in server logs) - cap server-provided Retry-After at 30s so a misconfigured/hostile server cannot stall signed-request retries for arbitrary durations (+ tests) - a2a executor wrapper: initialize result before try — an executor failure previously masked the original exception with UnboundLocalError in the finally-block telemetry call (mcp_app already had this fix) Performance: - simple_app: run sync entrypoints via asyncio.to_thread so blocking user code no longer stalls the event loop for concurrent requests - code_pipeline log export: download step logs concurrently (bounded pool, streamed to temp files) and stitch in original order — export time no longer grows linearly with pipeline size New test coverage (+49 tests, suite now 910 passed / 0 failed): - agent_server route-table assembly on a real FastAPI app: /run_sse registered once and first (regression guard for the removed duplicate), /invoke not shadowed by the A2A mount, telemetry middleware attached - agentkit.apps public export surface (__all__ lazy __getattr__) - A2A protocol round-trip against the real A2AStarletteApplication: agent-card discovery + JSON-RPC message/send validated with a2a.types - MCP round-trip via fastmcp in-memory client: tools/list schema + calls - offline unit tests for the 5 previously untested SDK service clients (knowledge/mcp/memory/skills/tools): wire payload, response parsing, error mapping Known follow-ups (reported, not addressed here): base_service_client response validation can leak raw pydantic.ValidationError outside the ApiError hierarchy; a2a telemetry imports a private google-adk symbol that is not declared in dependencies.

…nd 3) - cr: bound the instance-creation status poll with a 30min deadline (was an unbounded while True that hung the CLI forever on a stuck instance) - sandbox exec: give the terminal websocket a connect timeout and protocol-level ping keepalive so a silently dead server is detected instead of hanging the session - apps: centralize the sensitive-header exclusion list in apps.utils (single source of truth, extended to x-security-token / cookie / x-api-key / proxy-authorization; was two hand-synced two-entry sets plus one inline copy) - a2a telemetry: stop importing google-adk's private _get_user_id (undeclared dependency; fallback logic inlined), protect trace_a2a_agent with dont_throw so a telemetry failure can no longer mask the original business exception from the finally block, and log execute failures with stack + context_id - agent server /invoke: stop echoing raw str(e) into the SSE error frame (parity with /run_sse: generic message + error_type only, json.dumps-built) - CI: add a test workflow (3.10/3.12, pytest + coverage report) and declare pytest/pytest-cov/google-adk in the dev dependency group; add pytest/coverage config to pyproject

tiantt added 3 commits July 3, 2026 16:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: NFR review findings — error-path hardening + protocol/contract test coverage#116

fix: NFR review findings — error-path hardening + protocol/contract test coverage#116
tiantt wants to merge 3 commits into
fix/agentkit-latent-bugsfrom
fix/nfr-scan-findings

tiantt commented Jul 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tiantt commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test status

Deliberately not addressed (separate decisions)

Note on base branch

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tiantt commented Jul 3, 2026 •

edited

Loading