Skip to content

fix: NFR review findings — error-path hardening + protocol/contract test coverage#116

Open
tiantt wants to merge 3 commits into
fix/agentkit-latent-bugsfrom
fix/nfr-scan-findings
Open

fix: NFR review findings — error-path hardening + protocol/contract test coverage#116
tiantt wants to merge 3 commits into
fix/agentkit-latent-bugsfrom
fix/nfr-scan-findings

Conversation

@tiantt

@tiantt tiantt commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Three rounds of fixes driven by a systematic NFR (non-functional requirements) review across security / observability / reliability / performance / test-quality dimensions.

Round 1 (27169e1)

  • Remove dead duplicate POST /run_sse registration (debug leftover with print; the telemetry-instrumented handler registered first and moved to routes[0] is the live one — runtime behavior unchanged)
  • Redact credential headers (authorization/token) before recording gen_ai.request.headers on spans in simple_app (they bypassed the logging RedactionFilter; mirrors agent_server middleware)
  • ve_sign: thread signing scope through as parameters instead of mutating module globals (concurrent calls for different services/regions could sign with each other's scope); reuse a shared requests.Session for connection pooling; add golden-vector tests pinning the signing algorithm
  • utils.misc.retry: exponential backoff with cap, explicit retry_on, documented idempotency contract; runtime polling gains mild backoff (3s → 10s cap)
  • Stream pipeline step-log / runtime failure-log downloads to disk chunk-wise (flat peak memory)
  • clean_env fixture snapshots/restores the full environment — raw os.environ writes can no longer leak across tests

Round 2 (a50edd6)

  • Fix 3 order-dependent tests that only "passed" via leaked CLOUD_PROVIDER env (they never validated the global-config path they claimed to; also patch global_config_exists())
  • /run_sse SSE error events no longer echo internal exception details to clients
  • Cap server-provided Retry-After at 30s (+ tests)
  • a2a executor wrapper: initialize result before try — executor failures previously got masked by UnboundLocalError in the finally-block telemetry call
  • simple_app: run sync entrypoints via asyncio.to_thread (blocking user code no longer stalls the event loop)
  • code_pipeline log export: bounded-pool concurrent downloads stitched in original order
  • +49 tests: real-FastAPI route-table assembly (regression guard for the duplicate-route class), agentkit.apps public export surface, A2A JSON-RPC round-trip validated with a2a.types, MCP round-trip via fastmcp in-memory client, offline unit tests for the 5 previously untested SDK service clients

Round 3 (25d737c) — shadow-review blockers + cross-cutting majors

  • cr: bound the instance-creation status poll with a 30-min deadline (was an unbounded while True — an instance stuck in Creating hung the CLI forever)
  • sandbox exec: terminal websocket gains a 30s connect timeout + protocol-level ping keepalive (30s interval / 10s timeout) — a silently dead server is now detected instead of hanging the session until Ctrl-C (the app-level ping is server-initiated, so the client needed its own probe)
  • apps: sensitive-header exclusion centralized in agentkit.apps.utils.SENSITIVE_HEADERS and extended with x-security-token / cookie / x-api-key / proxy-authorization / api-key / set-cookie — was two hand-synced two-entry sets plus one inline copy; X-Security-Token (STS credentials, see agentkit.auth._sigv4) previously leaked onto spans
  • a2a telemetry: inline the user-id fallback instead of importing google-adk's private _get_user_id (closes the undeclared-dependency ImportError flagged in earlier rounds), wrap trace_a2a_agent with dont_throw so a telemetry failure in the finally block can no longer mask the business exception, and log execute failures with stack + context_id
  • agent server /invoke: error frames no longer echo raw str(e) into the SSE payload (information leak + unescaped JSON); parity with /run_sse — generic message + error_type, json.dumps-built
  • CI: add a test workflow (3.10/3.12 matrix, pytest + coverage report artifact — no --cov-fail-under yet: establish the baseline first, then ratchet); declare pytest / pytest-cov / google-adk in the dev dependency group; pytest + coverage config in pyproject

Test status

python3 -m pytest tests910 passed, 0 failed (branch started at 856 passed / 4 failed; holds after round 3).

Deliberately not addressed (separate decisions)

  • CORS default allow_origins=["*"] + allow_credentials=True, and no default endpoint auth — behavior/product decisions, breaking for existing deployments
  • Dependency CVEs reported by GitHub (4 critical / 21 high) — separate upgrade branch recommended
  • base_service_client response validation can leak raw pydantic.ValidationError outside the ApiError hierarchy
  • requirements.txtpyproject.toml dependency drift (each lists packages the other lacks; several core deps unpinned) — needs a single-source-of-truth decision
  • Logging RedactionFilter's opaque-token regex over-redacts (request_id/trace_id/paths become ***), while the auth credential chain has no logging at all — both need a scoped design pass

Note on base branch

Stacked on fix/agentkit-latent-bugs (this work builds on the _signed_request/http_defaults engineering-standards changes there). When that branch merges, this PR retargets automatically.

tiantt added 3 commits July 3, 2026 16:25
…ility/perf

Observability & routing:
- agent_server_app: remove dead duplicate /run_sse registration (leftover
  debug copy with print(); the telemetry-instrumented handler registered
  first and moved to routes[0] is the live one)
- simple_app telemetry: redact credential headers (authorization/token)
  before recording gen_ai.request.headers on spans — they bypassed the
  logging RedactionFilter; mirrors agent_server middleware _EXCLUDED_HEADERS

Security/concurrency & performance (ve_sign):
- thread signing scope (service/version/region/host/content_type/scheme)
  through as parameters instead of mutating module globals — concurrent
  calls for different services/regions could sign with each other's scope
- reuse a shared requests.Session for pooled keep-alive connections
- add golden-vector tests for the signing algorithm (canonical request,
  HMAC chain, Authorization header) and per-call-scope isolation

Reliability:
- utils.misc.retry: exponential backoff with cap, explicit retry_on
  parameter, document the idempotency contract
- ve_agentkit runtime polling: mild backoff (3s -> 10s cap)

Performance (memory):
- code_pipeline step-log download and ve_agentkit failure-log download now
  stream to disk chunk-wise; failure-log read-back bounded to displayed lines

Test hygiene:
- tests/platform clean_env: snapshot/restore full environ so raw
  os.environ writes cannot leak across tests (order-dependent failures)
- global_config_io: leave debug traces on best-effort fallbacks
  (chmod 0600 / mtime cache), add module logger

Not addressed on purpose: CORS default allow_origins=["*"] and missing
default endpoint auth are behavior/product decisions (breaking for existing
deployments) — flagged in the NFR report for a separate decision.
Pre-existing failures (also fail on the base commit, standalone):
test_cli_config_interactive_provider_resolution (2) and
test_executor_platform_context_provider — order-dependent platform-context
tests previously masked by env leakage; left for a follow-up.

Suite: 857 passed, 3 pre-existing failures (see above).
…und 2)

Test fixes (unblocks CI):
- cli_config_interactive / executor_platform_context tests: also patch
  global_config_exists() and clear provider env vars — they previously
  "passed" only when earlier tests leaked CLOUD_PROVIDER into the process
  environment, and never validated the global-config path they claimed to

Security/robustness:
- /run_sse SSE error events no longer echo internal exception details to
  clients (error_type only; full detail stays in server logs)
- cap server-provided Retry-After at 30s so a misconfigured/hostile server
  cannot stall signed-request retries for arbitrary durations (+ tests)
- a2a executor wrapper: initialize result before try — an executor failure
  previously masked the original exception with UnboundLocalError in the
  finally-block telemetry call (mcp_app already had this fix)

Performance:
- simple_app: run sync entrypoints via asyncio.to_thread so blocking user
  code no longer stalls the event loop for concurrent requests
- code_pipeline log export: download step logs concurrently (bounded pool,
  streamed to temp files) and stitch in original order — export time no
  longer grows linearly with pipeline size

New test coverage (+49 tests, suite now 910 passed / 0 failed):
- agent_server route-table assembly on a real FastAPI app: /run_sse
  registered once and first (regression guard for the removed duplicate),
  /invoke not shadowed by the A2A mount, telemetry middleware attached
- agentkit.apps public export surface (__all__ lazy __getattr__)
- A2A protocol round-trip against the real A2AStarletteApplication:
  agent-card discovery + JSON-RPC message/send validated with a2a.types
- MCP round-trip via fastmcp in-memory client: tools/list schema + calls
- offline unit tests for the 5 previously untested SDK service clients
  (knowledge/mcp/memory/skills/tools): wire payload, response parsing,
  error mapping

Known follow-ups (reported, not addressed here): base_service_client
response validation can leak raw pydantic.ValidationError outside the
ApiError hierarchy; a2a telemetry imports a private google-adk symbol that
is not declared in dependencies.
…nd 3)

- cr: bound the instance-creation status poll with a 30min deadline
  (was an unbounded while True that hung the CLI forever on a stuck
  instance)
- sandbox exec: give the terminal websocket a connect timeout and
  protocol-level ping keepalive so a silently dead server is detected
  instead of hanging the session
- apps: centralize the sensitive-header exclusion list in apps.utils
  (single source of truth, extended to x-security-token / cookie /
  x-api-key / proxy-authorization; was two hand-synced two-entry sets
  plus one inline copy)
- a2a telemetry: stop importing google-adk's private _get_user_id
  (undeclared dependency; fallback logic inlined), protect
  trace_a2a_agent with dont_throw so a telemetry failure can no longer
  mask the original business exception from the finally block, and log
  execute failures with stack + context_id
- agent server /invoke: stop echoing raw str(e) into the SSE error
  frame (parity with /run_sse: generic message + error_type only,
  json.dumps-built)
- CI: add a test workflow (3.10/3.12, pytest + coverage report) and
  declare pytest/pytest-cov/google-adk in the dev dependency group;
  add pytest/coverage config to pyproject
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant