Skip to content

[COMPRESS-727] Provide a symlink-resistant and concurrency-safe archive Extractor#776

Open
tomasilluminati wants to merge 4 commits into
apache:masterfrom
tomasilluminati:COMPRESS-727-secure-extractor
Open

[COMPRESS-727] Provide a symlink-resistant and concurrency-safe archive Extractor#776
tomasilluminati wants to merge 4 commits into
apache:masterfrom
tomasilluminati:COMPRESS-727-secure-extractor

Conversation

@tomasilluminati

Copy link
Copy Markdown

Adds a security-first Extractor under archivers. extractor with a SecureDirectoryStream-based race-safe path on Linux and a best-effort base extractor elsewhere. Symlink and special-file policies, safe hard-link resolution, tests and a changes.xml entry are included.

…ve Extractor

Adds a security-first Extractor under archivers. extractor with a SecureDirectoryStream-based race-safe path on Linux and a best-effort base extractor elsewhere. Symlink and special-file policies, safe hard-link resolution, tests and a changes.xml entry are included.
@tomasilluminati

Copy link
Copy Markdown
Author

@ppkarwasz PR Opened.

Compare Path values instead of toString() so the platform separator does not break the test.
@tomasilluminati

Copy link
Copy Markdown
Author

@ppkarwasz Fixed the Windows Test Errors.

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @tomasilluminati
This is interesting and I am hoping we can extract something more generic on this theme.

The background nugget is hidden in Apache Commons Text in the package-private class PathFence.

I extracted the class for addition in Apache Commons IO in PR 805 but it might not be good enough. The IO PR is something I had started to iterate with @ppkarwasz but other things popped up since and now seems like a good time to revist.

There are two concerns: resolving relative paths and dealing with various types of file system links. Handling the relative path issue is "simple", the links less so. At first glance it looks like you've covered many cases here.

I'm not sure if both concerns should be intermingled in one class or composed.

Can you see a PathFence class as something that can be improved on and reused here?

The PathFence could also be reused in Apache Commons Configuration for example.

Looking forward to your thoughts.

TY!

@tomasilluminati

Copy link
Copy Markdown
Author

@garydgregory
Thanks, this is a great direction, I'm keen to reuse PathFence instead of carrying my own containment code.

The syntactic part is a clean fit. toAbsolutePath().normalize() plus the within-root check is generic, and Compress, Configuration and Text can all share it as-is.

Links are the tricky bit, and I'd lean toward composing rather than folding them in. Part of why they resist living in a path predicate is that during extraction the entry doesn't exist yet, so I can't just toRealPath() the target. The real protection ends up being operational, resolve each component no-follow, create with CREATE_NEW so an existing link is never followed, and on Linux write relative to the directory handle (SecureDirectoryStream) to close the TOCTOU window. That's bound to the act of creating files, not to judging a path.

Where @ppkarwasz point lands well is the already-on-disk case, like a trusted config dir. There a resolving check (toRealPath then the same within-root test) makes sense, and that could be an optional mode in PathFence. The extractor would lean on the syntactic decision and keep its no-follow creation layer on top.

So my vote is composed. PathFence owns the containment decision, extraction-specific link safety sits around it, and PathFence stays light enough for Configuration to reuse.

Happy to dig into the IO PR with you and Piotr whenever it's a good time, and to bring over the link cases I already cover as tests. Thanks.

@tomasilluminati

Copy link
Copy Markdown
Author

@Marcono1234 thanks, both are good calls.

On ALLOW_WITHIN_ROOT, you're right. The lexical normalize() check can't model how the OS resolves .. after a symlink component, and with the forward-reference ordering even a real-path check at creation wouldn't catch it, since the second link doesn't exist yet when the first is validated. So that mode can't promise every created link stays inside the root once they are followed together. I'd document it as best-effort and lexical, calling out that it cannot prevent escapes via chained or indirect symlinks, with REJECT (the default) or SKIP as the safe choices for untrusted archives. Worth noting that extraction itself never follows these links (the write path resolves every component no-follow), so nothing is written outside the root during extraction. The hazard is a created link that escapes when something follows it later, and the docs should say so plainly. If a hard guarantee is ever needed, the only way I see is a post-extraction pass that toRealPath-validates every created symlink and drops the escapers, which could be a future opt-in.

On resolveWithinRoot and resolved.equals(rootDirectory), agreed, better to be explicit than to rely on the downstream ops being safe today. Entries that resolve to the root itself (/, ., a/.. and friends) carry nothing to materialize, so I'd skip them rather than target the root path. I'll have resolveWithinRoot signal that case and let the caller skip, so a benign / entry doesn't fail the extraction and nothing ever writes to or replaces the root, plus tests for those names.

Before I push this, @garydgregory @ppkarwasz does this direction sound right to you (document ALLOW_WITHIN_ROOT as best-effort, and skip entries that resolve to the root)? If you are good with it, I'll proceed.

@garydgregory

Copy link
Copy Markdown
Member

@garydgregory Thanks, this is a great direction, I'm keen to reuse PathFence instead of carrying my own containment code.

The syntactic part is a clean fit. toAbsolutePath().normalize() plus the within-root check is generic, and Compress, Configuration and Text can all share it as-is.

Links are the tricky bit, and I'd lean toward composing rather than folding them in. Part of why they resist living in a path predicate is that during extraction the entry doesn't exist yet, so I can't just toRealPath() the target. The real protection ends up being operational, resolve each component no-follow, create with CREATE_NEW so an existing link is never followed, and on Linux write relative to the directory handle (SecureDirectoryStream) to close the TOCTOU window. That's bound to the act of creating files, not to judging a path.

Where @ppkarwasz point lands well is the already-on-disk case, like a trusted config dir. There a resolving check (toRealPath then the same within-root test) makes sense, and that could be an optional mode in PathFence. The extractor would lean on the syntactic decision and keep its no-follow creation layer on top.

So my vote is composed. PathFence owns the containment decision, extraction-specific link safety sits around it, and PathFence stays light enough for Configuration to reuse.

Happy to dig into the IO PR with you and Piotr whenever it's a good time, and to bring over the link cases I already cover as tests. Thanks.

Let's iron out the PathFence in Commons IO if you are available. I'd love to get that out and reuse it as it fits best. If we need a separate link utility, that sounds like a good Commons IO addition as well. Then we could compose in Compress, Configuration, Text, and maybe others like JEXL.
WDYT?

@tomasilluminati

tomasilluminati commented Jun 30, 2026

Copy link
Copy Markdown
Author

@garydgregory

I'm in, happy to help with you and @ppkarwasz.

On "one class or composed", I'd compose and keep PathFence a purely lexical predicate (toAbsolutePath().normalize() then startsWith, no filesystem access), so it stays trivially reusable for Configuration, Text and JEXL.

One thing I'd make explicit in its contract though: lexical containment alone does not stop symlink traversal. normalize() never touches the filesystem, so a symlinked component slips straight through. It should therefore state plainly that PathFence is a syntactic gate, safe on its own only for trusted or link-free trees, and must be paired with real resolution for untrusted input.

That resolution is the separate utility you mentioned. Refining my own earlier comment, I'd keep the toRealPath check in the utility rather than as a mode inside PathFence, so PathFence never touches the filesystem. It carries the two modes the cases need:

  • toRealPath then containment for the trusted on-disk case (the CATALINA_BASE conf symlink). It resolves links but needs the path to exist, so it fits reads, not targets that don't exist yet.
  • no-follow resolution for untrusted input and for creation, never following a link, which is what the Extractor already does.

That puts the security-sensitive resolution in one place and keeps PathFence small enough to reuse everywhere.

If that sounds right to you, I can start there.

Meanwhile I'll keep #776 moving on @Marcono1234 two points, neither depends on PathFence.

…t-resolving entries

ALLOW_WITHIN_ROOT can't fully guarantee containment, so the Javadoc now
says it is lexical and best-effort, with REJECT and SKIP as the safe
modes for untrusted input, plus a Safe Extraction section in the user
guide.

resolveWithinRoot returns null for names that resolve to the root itself
like "." or "a/..", and the caller skips them rather than writing at the
root; escapes still throw. Also folded the four containment checks into
one isWithinRoot helper, and noted that hard-link creation carries the
same residual TOCTOU as symlink creation.
@tomasilluminati

Copy link
Copy Markdown
Author

@Marcono1234 thanks, both were right, addressed both.

ALLOW_WITHIN_ROOT: you're right that it can't be a hard guarantee, so I rewrote the constant's Javadoc to say plainly it is best-effort and lexical (it does not stop a link reached through another link, or an in-root chain that escapes only once the OS resolves it), with REJECT (default) and SKIP called out as the safe modes for untrusted input, plus a new "Safe Extraction" section in the user guide. Your indirect-symlink PoC is now a test: blocked under REJECT and SKIP, and under ALLOW_WITHIN_ROOT/ALLOW_ALL the links are created but nothing is written outside the root during extraction.

resolveWithinRoot: it now returns null when the entry resolves to the root itself and the caller skips it, so a benign . or a/.. no longer fails and nothing ever targets or replaces the root. I kept it a nullable Path rather than Optional to match the codebase, escapes still throw, and I folded the four containment checks into one isWithinRoot helper with a sibling-prefix test.

Pushed. Thanks for the careful review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants