fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568
fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568nbayati wants to merge 4 commits into
Conversation
…tartup hang and crash In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError. Previously, `os.path.exists()` caught this PermissionError and returned `False`. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError. This change: 1. Updates `_is_certificate_file_ready` to use `os.stat` and propagate `PermissionError`. 2. Catches `PermissionError` immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing. 3. Adds corresponding unit tests.
There was a problem hiding this comment.
Code Review
This pull request improves error handling when checking and accessing agent identity certificates. Specifically, it updates _is_certificate_file_ready to propagate PermissionError while catching other OSError exceptions, and handles PermissionError in get_agent_identity_certificate_path by logging a warning and falling back to unbound tokens. Corresponding unit tests have been added to verify these scenarios. There are no review comments, so I have no additional feedback to provide.
There was a problem hiding this comment.
Thank you for this contribution!
System-level / Off-diff Finding:
Graceful OSError Handling in get_and_parse_agent_identity_certificate()
get_and_parse_agent_identity_certificate() (around line 208) does not handle exceptions when reading the certificate file. Because os.path.getsize only requires directory traversal / metadata search permissions, _is_certificate_file_ready can evaluate to True even for a file that the current process does not have read permissions to open (e.g., owned by root with 600). When open(cert_path, "rb") is called, it will crash with a raw PermissionError.
Wrap the file-read in a try...except OSError block to gracefully fail-closed with a controlled RefreshError:
try:
with open(cert_path, "rb") as cert_file:
cert_bytes = cert_file.read()
except OSError as e:
raise exceptions.RefreshError(
f"Failed to read agent identity certificate file at {cert_path}: {e}. "
"Token binding protection is failing. You can turn off this protection by setting "
f"{environment_vars.GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES} to false "
"to fall back to unbound tokens."
) from eThere was a problem hiding this comment.
I think we may want to catch here too. I think it's possible for get_agent_identity_certificate_path to work correctly but then have .read to raise a PermissionError since .stat and .read require different permissions
There was a problem hiding this comment.
Done! I have wrapped the file read block in a try...except OSError block to catch read-level permissions errors and fall back to unbound tokens.
| if not path: | ||
| return False | ||
| try: | ||
| return os.path.getsize(path) > 0 |
There was a problem hiding this comment.
nit: the PR description mentions using os.stat , but the code uses os.path.getsize . Functionally equivalent since getsize uses stat under the hood, but just pointing it out!
There was a problem hiding this comment.
The new commit uses os.stat directly. I'll double check the PR description after making all the changes requested by reviewers to ensure it's still valid. Thanks for pointing it out :)
| try: | ||
| with open(cert_path, "rb") as cert_file: | ||
| cert_bytes = cert_file.read() | ||
| except OSError as e: |
There was a problem hiding this comment.
Should we keep the fallback limited to PermissionError?
In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError.
Previously,
os.path.exists()caught this PermissionError and returnedFalse. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError.This change:
_is_certificate_file_readyto useos.statand propagatePermissionError.PermissionErrorimmediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing.Fixes b/522957573