fix: NFR review findings — error-path hardening + protocol/contract test coverage#116
Open
tiantt wants to merge 3 commits into
Open
fix: NFR review findings — error-path hardening + protocol/contract test coverage#116tiantt wants to merge 3 commits into
tiantt wants to merge 3 commits into
Conversation
…ility/perf Observability & routing: - agent_server_app: remove dead duplicate /run_sse registration (leftover debug copy with print(); the telemetry-instrumented handler registered first and moved to routes[0] is the live one) - simple_app telemetry: redact credential headers (authorization/token) before recording gen_ai.request.headers on spans — they bypassed the logging RedactionFilter; mirrors agent_server middleware _EXCLUDED_HEADERS Security/concurrency & performance (ve_sign): - thread signing scope (service/version/region/host/content_type/scheme) through as parameters instead of mutating module globals — concurrent calls for different services/regions could sign with each other's scope - reuse a shared requests.Session for pooled keep-alive connections - add golden-vector tests for the signing algorithm (canonical request, HMAC chain, Authorization header) and per-call-scope isolation Reliability: - utils.misc.retry: exponential backoff with cap, explicit retry_on parameter, document the idempotency contract - ve_agentkit runtime polling: mild backoff (3s -> 10s cap) Performance (memory): - code_pipeline step-log download and ve_agentkit failure-log download now stream to disk chunk-wise; failure-log read-back bounded to displayed lines Test hygiene: - tests/platform clean_env: snapshot/restore full environ so raw os.environ writes cannot leak across tests (order-dependent failures) - global_config_io: leave debug traces on best-effort fallbacks (chmod 0600 / mtime cache), add module logger Not addressed on purpose: CORS default allow_origins=["*"] and missing default endpoint auth are behavior/product decisions (breaking for existing deployments) — flagged in the NFR report for a separate decision. Pre-existing failures (also fail on the base commit, standalone): test_cli_config_interactive_provider_resolution (2) and test_executor_platform_context_provider — order-dependent platform-context tests previously masked by env leakage; left for a follow-up. Suite: 857 passed, 3 pre-existing failures (see above).
…und 2) Test fixes (unblocks CI): - cli_config_interactive / executor_platform_context tests: also patch global_config_exists() and clear provider env vars — they previously "passed" only when earlier tests leaked CLOUD_PROVIDER into the process environment, and never validated the global-config path they claimed to Security/robustness: - /run_sse SSE error events no longer echo internal exception details to clients (error_type only; full detail stays in server logs) - cap server-provided Retry-After at 30s so a misconfigured/hostile server cannot stall signed-request retries for arbitrary durations (+ tests) - a2a executor wrapper: initialize result before try — an executor failure previously masked the original exception with UnboundLocalError in the finally-block telemetry call (mcp_app already had this fix) Performance: - simple_app: run sync entrypoints via asyncio.to_thread so blocking user code no longer stalls the event loop for concurrent requests - code_pipeline log export: download step logs concurrently (bounded pool, streamed to temp files) and stitch in original order — export time no longer grows linearly with pipeline size New test coverage (+49 tests, suite now 910 passed / 0 failed): - agent_server route-table assembly on a real FastAPI app: /run_sse registered once and first (regression guard for the removed duplicate), /invoke not shadowed by the A2A mount, telemetry middleware attached - agentkit.apps public export surface (__all__ lazy __getattr__) - A2A protocol round-trip against the real A2AStarletteApplication: agent-card discovery + JSON-RPC message/send validated with a2a.types - MCP round-trip via fastmcp in-memory client: tools/list schema + calls - offline unit tests for the 5 previously untested SDK service clients (knowledge/mcp/memory/skills/tools): wire payload, response parsing, error mapping Known follow-ups (reported, not addressed here): base_service_client response validation can leak raw pydantic.ValidationError outside the ApiError hierarchy; a2a telemetry imports a private google-adk symbol that is not declared in dependencies.
…nd 3) - cr: bound the instance-creation status poll with a 30min deadline (was an unbounded while True that hung the CLI forever on a stuck instance) - sandbox exec: give the terminal websocket a connect timeout and protocol-level ping keepalive so a silently dead server is detected instead of hanging the session - apps: centralize the sensitive-header exclusion list in apps.utils (single source of truth, extended to x-security-token / cookie / x-api-key / proxy-authorization; was two hand-synced two-entry sets plus one inline copy) - a2a telemetry: stop importing google-adk's private _get_user_id (undeclared dependency; fallback logic inlined), protect trace_a2a_agent with dont_throw so a telemetry failure can no longer mask the original business exception from the finally block, and log execute failures with stack + context_id - agent server /invoke: stop echoing raw str(e) into the SSE error frame (parity with /run_sse: generic message + error_type only, json.dumps-built) - CI: add a test workflow (3.10/3.12, pytest + coverage report) and declare pytest/pytest-cov/google-adk in the dev dependency group; add pytest/coverage config to pyproject
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three rounds of fixes driven by a systematic NFR (non-functional requirements) review across security / observability / reliability / performance / test-quality dimensions.
Round 1 (
27169e1)POST /run_sseregistration (debug leftover withprint; the telemetry-instrumented handler registered first and moved toroutes[0]is the live one — runtime behavior unchanged)authorization/token) before recordinggen_ai.request.headerson spans in simple_app (they bypassed the logging RedactionFilter; mirrors agent_server middleware)ve_sign: thread signing scope through as parameters instead of mutating module globals (concurrent calls for different services/regions could sign with each other's scope); reuse a sharedrequests.Sessionfor connection pooling; add golden-vector tests pinning the signing algorithmutils.misc.retry: exponential backoff with cap, explicitretry_on, documented idempotency contract; runtime polling gains mild backoff (3s → 10s cap)clean_envfixture snapshots/restores the full environment — rawos.environwrites can no longer leak across testsRound 2 (
a50edd6)CLOUD_PROVIDERenv (they never validated the global-config path they claimed to; also patchglobal_config_exists())/run_sseSSE error events no longer echo internal exception details to clientsRetry-Afterat 30s (+ tests)resultbeforetry— executor failures previously got masked byUnboundLocalErrorin the finally-block telemetry callasyncio.to_thread(blocking user code no longer stalls the event loop)agentkit.appspublic export surface, A2A JSON-RPC round-trip validated witha2a.types, MCP round-trip via fastmcp in-memory client, offline unit tests for the 5 previously untested SDK service clientsRound 3 (
25d737c) — shadow-review blockers + cross-cutting majorscr: bound the instance-creation status poll with a 30-min deadline (was an unboundedwhile True— an instance stuck in Creating hung the CLI forever)agentkit.apps.utils.SENSITIVE_HEADERSand extended withx-security-token/cookie/x-api-key/proxy-authorization/api-key/set-cookie— was two hand-synced two-entry sets plus one inline copy;X-Security-Token(STS credentials, seeagentkit.auth._sigv4) previously leaked onto spans_get_user_id(closes the undeclared-dependency ImportError flagged in earlier rounds), wraptrace_a2a_agentwithdont_throwso a telemetry failure in thefinallyblock can no longer mask the business exception, and log execute failures with stack +context_id/invoke: error frames no longer echo rawstr(e)into the SSE payload (information leak + unescaped JSON); parity with/run_sse— generic message +error_type,json.dumps-built--cov-fail-underyet: establish the baseline first, then ratchet); declarepytest/pytest-cov/google-adkin thedevdependency group; pytest + coverage config in pyprojectTest status
python3 -m pytest tests→ 910 passed, 0 failed (branch started at 856 passed / 4 failed; holds after round 3).Deliberately not addressed (separate decisions)
allow_origins=["*"]+allow_credentials=True, and no default endpoint auth — behavior/product decisions, breaking for existing deploymentsbase_service_clientresponse validation can leak rawpydantic.ValidationErroroutside theApiErrorhierarchyrequirements.txt↔pyproject.tomldependency drift (each lists packages the other lacks; several core deps unpinned) — needs a single-source-of-truth decisionRedactionFilter's opaque-token regex over-redacts (request_id/trace_id/paths become***), while the auth credential chain has no logging at all — both need a scoped design passNote on base branch
Stacked on
fix/agentkit-latent-bugs(this work builds on the_signed_request/http_defaults engineering-standards changes there). When that branch merges, this PR retargets automatically.