[COMPRESS-727] Provide a symlink-resistant and concurrency-safe archive Extractor by tomasilluminati · Pull Request #776 · apache/commons-compress

tomasilluminati · 2026-06-28T18:35:04Z

Adds a security-first Extractor under archivers. extractor with a SecureDirectoryStream-based race-safe path on Linux and a best-effort base extractor elsewhere. Symlink and special-file policies, safe hard-link resolution, tests and a changes.xml entry are included.

…ve Extractor Adds a security-first Extractor under archivers. extractor with a SecureDirectoryStream-based race-safe path on Linux and a best-effort base extractor elsewhere. Symlink and special-file policies, safe hard-link resolution, tests and a changes.xml entry are included.

tomasilluminati · 2026-06-28T18:35:33Z

@ppkarwasz PR Opened.

Compare Path values instead of toString() so the platform separator does not break the test.

tomasilluminati · 2026-06-28T21:57:41Z

@ppkarwasz Fixed the Windows Test Errors.

garydgregory

Hello @tomasilluminati
This is interesting and I am hoping we can extract something more generic on this theme.

The background nugget is hidden in Apache Commons Text in the package-private class PathFence.

I extracted the class for addition in Apache Commons IO in PR 805 but it might not be good enough. The IO PR is something I had started to iterate with @ppkarwasz but other things popped up since and now seems like a good time to revist.

There are two concerns: resolving relative paths and dealing with various types of file system links. Handling the relative path issue is "simple", the links less so. At first glance it looks like you've covered many cases here.

I'm not sure if both concerns should be intermingled in one class or composed.

Can you see a PathFence class as something that can be improved on and reused here?

The PathFence could also be reused in Apache Commons Configuration for example.

Looking forward to your thoughts.

TY!

tomasilluminati · 2026-06-29T04:13:23Z

@garydgregory
Thanks, this is a great direction, I'm keen to reuse PathFence instead of carrying my own containment code.

The syntactic part is a clean fit. toAbsolutePath().normalize() plus the within-root check is generic, and Compress, Configuration and Text can all share it as-is.

Links are the tricky bit, and I'd lean toward composing rather than folding them in. Part of why they resist living in a path predicate is that during extraction the entry doesn't exist yet, so I can't just toRealPath() the target. The real protection ends up being operational, resolve each component no-follow, create with CREATE_NEW so an existing link is never followed, and on Linux write relative to the directory handle (SecureDirectoryStream) to close the TOCTOU window. That's bound to the act of creating files, not to judging a path.

Where @ppkarwasz point lands well is the already-on-disk case, like a trusted config dir. There a resolving check (toRealPath then the same within-root test) makes sense, and that could be an optional mode in PathFence. The extractor would lean on the syntactic decision and keep its no-follow creation layer on top.

So my vote is composed. PathFence owns the containment decision, extraction-specific link safety sits around it, and PathFence stays light enough for Configuration to reuse.

Happy to dig into the IO PR with you and Piotr whenever it's a good time, and to bring over the link cases I already cover as tests. Thanks.

tomasilluminati · 2026-06-30T16:08:05Z

@Marcono1234 thanks, both are good calls.

On ALLOW_WITHIN_ROOT, you're right. The lexical normalize() check can't model how the OS resolves .. after a symlink component, and with the forward-reference ordering even a real-path check at creation wouldn't catch it, since the second link doesn't exist yet when the first is validated. So that mode can't promise every created link stays inside the root once they are followed together. I'd document it as best-effort and lexical, calling out that it cannot prevent escapes via chained or indirect symlinks, with REJECT (the default) or SKIP as the safe choices for untrusted archives. Worth noting that extraction itself never follows these links (the write path resolves every component no-follow), so nothing is written outside the root during extraction. The hazard is a created link that escapes when something follows it later, and the docs should say so plainly. If a hard guarantee is ever needed, the only way I see is a post-extraction pass that toRealPath-validates every created symlink and drops the escapers, which could be a future opt-in.

On resolveWithinRoot and resolved.equals(rootDirectory), agreed, better to be explicit than to rely on the downstream ops being safe today. Entries that resolve to the root itself (/, ., a/.. and friends) carry nothing to materialize, so I'd skip them rather than target the root path. I'll have resolveWithinRoot signal that case and let the caller skip, so a benign / entry doesn't fail the extraction and nothing ever writes to or replaces the root, plus tests for those names.

Before I push this, @garydgregory @ppkarwasz does this direction sound right to you (document ALLOW_WITHIN_ROOT as best-effort, and skip entries that resolve to the root)? If you are good with it, I'll proceed.

garydgregory · 2026-06-30T19:13:26Z

@garydgregory Thanks, this is a great direction, I'm keen to reuse PathFence instead of carrying my own containment code.

The syntactic part is a clean fit. toAbsolutePath().normalize() plus the within-root check is generic, and Compress, Configuration and Text can all share it as-is.

Links are the tricky bit, and I'd lean toward composing rather than folding them in. Part of why they resist living in a path predicate is that during extraction the entry doesn't exist yet, so I can't just toRealPath() the target. The real protection ends up being operational, resolve each component no-follow, create with CREATE_NEW so an existing link is never followed, and on Linux write relative to the directory handle (SecureDirectoryStream) to close the TOCTOU window. That's bound to the act of creating files, not to judging a path.

Where @ppkarwasz point lands well is the already-on-disk case, like a trusted config dir. There a resolving check (toRealPath then the same within-root test) makes sense, and that could be an optional mode in PathFence. The extractor would lean on the syntactic decision and keep its no-follow creation layer on top.

So my vote is composed. PathFence owns the containment decision, extraction-specific link safety sits around it, and PathFence stays light enough for Configuration to reuse.

Happy to dig into the IO PR with you and Piotr whenever it's a good time, and to bring over the link cases I already cover as tests. Thanks.

Let's iron out the PathFence in Commons IO if you are available. I'd love to get that out and reuse it as it fits best. If we need a separate link utility, that sounds like a good Commons IO addition as well. Then we could compose in Compress, Configuration, Text, and maybe others like JEXL.
WDYT?

tomasilluminati · 2026-06-30T22:43:42Z

@garydgregory

I'm in, happy to help with you and @ppkarwasz.

On "one class or composed", I'd compose and keep PathFence a purely lexical predicate (toAbsolutePath().normalize() then startsWith, no filesystem access), so it stays trivially reusable for Configuration, Text and JEXL.

One thing I'd make explicit in its contract though: lexical containment alone does not stop symlink traversal. normalize() never touches the filesystem, so a symlinked component slips straight through. It should therefore state plainly that PathFence is a syntactic gate, safe on its own only for trusted or link-free trees, and must be paired with real resolution for untrusted input.

That resolution is the separate utility you mentioned. Refining my own earlier comment, I'd keep the toRealPath check in the utility rather than as a mode inside PathFence, so PathFence never touches the filesystem. It carries the two modes the cases need:

toRealPath then containment for the trusted on-disk case (the CATALINA_BASE conf symlink). It resolves links but needs the path to exist, so it fits reads, not targets that don't exist yet.
no-follow resolution for untrusted input and for creation, never following a link, which is what the Extractor already does.

That puts the security-sensitive resolution in one place and keeps PathFence small enough to reuse everywhere.

If that sounds right to you, I can start there.

Meanwhile I'll keep #776 moving on @Marcono1234 two points, neither depends on PathFence.

…t-resolving entries ALLOW_WITHIN_ROOT can't fully guarantee containment, so the Javadoc now says it is lexical and best-effort, with REJECT and SKIP as the safe modes for untrusted input, plus a Safe Extraction section in the user guide. resolveWithinRoot returns null for names that resolve to the root itself like "." or "a/..", and the caller skips them rather than writing at the root; escapes still throw. Also folded the four containment checks into one isWithinRoot helper, and noted that hard-link creation carries the same residual TOCTOU as symlink creation.

…re-extractor

tomasilluminati · 2026-07-02T03:48:18Z

@Marcono1234 thanks, both were right, addressed both.

ALLOW_WITHIN_ROOT: you're right that it can't be a hard guarantee, so I rewrote the constant's Javadoc to say plainly it is best-effort and lexical (it does not stop a link reached through another link, or an in-root chain that escapes only once the OS resolves it), with REJECT (default) and SKIP called out as the safe modes for untrusted input, plus a new "Safe Extraction" section in the user guide. Your indirect-symlink PoC is now a test: blocked under REJECT and SKIP, and under ALLOW_WITHIN_ROOT/ALLOW_ALL the links are created but nothing is written outside the root during extraction.

resolveWithinRoot: it now returns null when the entry resolves to the root itself and the caller skips it, so a benign . or a/.. no longer fails and nothing ever targets or replaces the root. I kept it a nullable Path rather than Optional to match the codebase, escapes still throw, and I folded the four containment checks into one isWithinRoot helper with a sibling-prefix test.

Pushed. Thanks for the careful review.

[COMPRESS-727] Fix symlink target assertions on Windows

9db201e

Compare Path values instead of toString() so the platform separator does not break the test.

garydgregory reviewed Jun 29, 2026

View reviewed changes

Marcono1234 reviewed Jun 30, 2026

View reviewed changes

Comment thread src/main/java/org/apache/commons/compress/archivers/extractor/Extractor.java

Marcono1234 reviewed Jun 30, 2026

View reviewed changes

Comment thread src/main/java/org/apache/commons/compress/archivers/extractor/Extractor.java

Marcono1234 reviewed Jul 1, 2026

View reviewed changes

Comment thread src/main/java/org/apache/commons/compress/archivers/extractor/Extractor.java

tomasilluminati added 2 commits July 2, 2026 00:36

Merge remote-tracking branch 'upstream/master' into COMPRESS-727-secu…

4afd063

…re-extractor

tomasilluminati requested a review from Marcono1234 July 2, 2026 04:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[COMPRESS-727] Provide a symlink-resistant and concurrency-safe archive Extractor#776

[COMPRESS-727] Provide a symlink-resistant and concurrency-safe archive Extractor#776
tomasilluminati wants to merge 4 commits into
apache:masterfrom
tomasilluminati:COMPRESS-727-secure-extractor

tomasilluminati commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

garydgregory left a comment

Uh oh!

tomasilluminati commented Jun 29, 2026

Uh oh!

Uh oh!

Uh oh!

tomasilluminati commented Jun 30, 2026

Uh oh!

garydgregory commented Jun 30, 2026

Uh oh!

tomasilluminati commented Jun 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

tomasilluminati commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

tomasilluminati commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

garydgregory left a comment

Choose a reason for hiding this comment

Uh oh!

tomasilluminati commented Jun 29, 2026

Uh oh!

Uh oh!

Uh oh!

tomasilluminati commented Jun 30, 2026

Uh oh!

garydgregory commented Jun 30, 2026

Uh oh!

tomasilluminati commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

tomasilluminati commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tomasilluminati commented Jun 30, 2026 •

edited

Loading