PDF text: dual-layer + single-layer rendering with PdfTextMode option#579
Open
andiwand wants to merge 28 commits into
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.
The active mode is controlled by `HtmlConfig::pdf_text_mode`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
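For orientation, a minimal sketch of how such an option might look. Only the names `PdfTextMode`, `dual_layer`, `single_layer` and `HtmlConfig::pdf_text_mode` come from the commit message; the `enum class` spelling, the reduced `HtmlConfig` and the `describe` helper are assumptions for illustration, not the PR's code.

```cpp
#include <cassert>
#include <string_view>

// Assumed shape of the option; the real declaration may differ.
enum class PdfTextMode { dual_layer, single_layer };

// Reduced stand-in for the real config struct (all other fields omitted).
struct HtmlConfig {
  PdfTextMode pdf_text_mode{PdfTextMode::dual_layer};  // dual_layer is the default
};

// Hypothetical helper: summarizes the layer layout each mode produces.
constexpr std::string_view describe(PdfTextMode mode) {
  switch (mode) {
    case PdfTextMode::dual_layer:
      return "visual + selection layers";
    case PdfTextMode::single_layer:
      return "single combined layer";
  }
  return "unknown";
}
```

A caller would opt into the alternative renderer by setting `config.pdf_text_mode = PdfTextMode::single_layer` before translating.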
Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.
Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
Runs grouped into line blocks by baseline; space detection inserts
gap spans. Each run span is display:inline-block with CSS justify
(text-align:justify; text-align-last:justify; text-justify:inter-
character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.
Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
per font, then picks the most-frequent glyph as the cmap entry —
so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real
Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
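The single-layer pre-pass described above can be sketched roughly as follows. This is an illustrative reduction, assuming one font and runs already decoded into (uchar, glyph) pairs; the type aliases and the function names `pick_winners` and `is_clean` are hypothetical, and the tie-break rule is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using UChar = char32_t;         // Unicode code point
using GlyphId = std::uint16_t;  // glyph index within one font
using Run = std::vector<std::pair<UChar, GlyphId>>;  // one decoded text run

// Pre-pass: count (uchar, glyph) co-occurrences, then keep the most
// frequent glyph per uchar. Ties go to the lowest glyph id (assumption).
std::map<UChar, GlyphId> pick_winners(const std::vector<Run>& runs) {
  std::map<std::pair<UChar, GlyphId>, std::size_t> counts;
  for (const Run& run : runs) {
    for (const auto& entry : run) ++counts[entry];
  }
  std::map<UChar, GlyphId> winners;
  std::map<UChar, std::size_t> best;
  for (const auto& [key, n] : counts) {
    const auto [uc, glyph] = key;
    if (n > best[uc]) {
      best[uc] = n;
      winners[uc] = glyph;
    }
  }
  return winners;
}

// A run is clean when every pair agrees with the winning mapping, so its
// real Unicode text can be rendered directly in the embedded font.
bool is_clean(const Run& run, const std::map<UChar, GlyphId>& winners) {
  for (const auto& [uc, glyph] : run) {
    const auto it = winners.find(uc);
    if (it == winners.end() || it->second != glyph) return false;
  }
  return true;
}
```

Counting first and choosing afterwards is what makes the common case win: a rare alternate glyph for a character that happens to appear early in the document cannot claim the cmap slot.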
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74f51ee76f
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs). The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.
Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
This was referenced Jul 2, 2026
pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class helper, the run's left/top-or-matrix placement classes, and the post-pass font-face/style writer were each copy-pasted between the dual-layer and single-layer paths. Hoist them into shared statics (add_position_classes, font_is_usable, font_class, write_font_face) used by both. Verified byte-identical document.html output for both PdfTextMode values across several PDF fixtures before/after. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Tight-continuation runs were merged into the previous .sr span's text without recomputing its declared width, leaving it at the first sub-run's width while the visible text grew arbitrarily longer (e.g. "Particle Acceleration and Detection" declared 10px wide). Track each open run's starting x-offset and re-derive the width on merge. Also propagate font-size to the selection layer (runs, gap spacers, and the trailing space that closes a line), which previously inherited the browser default and could overflow/clip against the PDF-derived width, desyncing the invisible hit-test text from the true glyph run.
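A small sketch of the width fix described above: remember where the open span started and re-derive its width on every merge. The `OpenRun` struct, its field names and `merge_continuation` are hypothetical; only the idea of tracking the starting x-offset comes from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical open selection-run span (.sr); field names are illustrative.
struct OpenRun {
  std::string text;
  double start_x;  // x-offset where the first sub-run began
  double width;    // declared CSS width of the span
};

// Append a tight-continuation sub-run that starts at x and advances by
// `advance`, then re-derive the declared width from the span's own start
// instead of keeping the first sub-run's width.
void merge_continuation(OpenRun& run, const std::string& text, double x,
                        double advance) {
  assert(x + advance >= run.start_x);  // continuation must not end before the start
  run.text += text;
  run.width = (x + advance) - run.start_x;
}
```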
…rder .sg (gap spacer) lacked the overflow:hidden that .sr (text run) has; per CSS an inline-block's baseline is its content's text baseline when overflow is visible but the bottom margin edge otherwise, so the two box types baseline-aligned differently within the same line, visibly shifting spaces in y. Give .sg the same overflow:hidden. Also content-stream order doesn't always run top-to-bottom (margins, columns), which made drag-selection highlight rows inconsistently. Stable-sort each page's selection lines by baseline y after the page is fully processed, keeping content-stream (x) order intact for lines on the same row.
…cmap Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the overflow into Supplementary PUA-A and emit a format-12 cmap subtable to cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex to 0xFFFF. Also add configurable dual-layer selection fallback fonts and a size-adjust so the invisible selection text widths track the PDF boxes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
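The slot arithmetic behind the overflow can be sketched as below. The ranges are the Unicode private-use blocks (BMP PUA U+E000..U+F8FF, 6400 code points; Supplementary PUA-A U+F0000..U+FFFFD); the allocation order and the names `pua_code_point` and `needs_format12` are assumptions, not the PR's actual helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr std::uint32_t bmp_pua_first = 0xE000;
constexpr std::uint32_t bmp_pua_last = 0xF8FF;  // 6400 slots in the BMP PUA
constexpr std::uint32_t spua_a_first = 0xF0000;
constexpr std::uint32_t spua_a_last = 0xFFFFD;  // Supplementary PUA-A

// Map the n-th re-encoded glyph (0-based) to a private-use code point,
// filling the BMP PUA first and spilling the rest into PUA-A.
std::optional<char32_t> pua_code_point(std::uint32_t n) {
  constexpr std::uint32_t bmp_slots = bmp_pua_last - bmp_pua_first + 1;
  constexpr std::uint32_t spua_slots = spua_a_last - spua_a_first + 1;
  if (n < bmp_slots) return static_cast<char32_t>(bmp_pua_first + n);
  n -= bmp_slots;
  if (n < spua_slots) return static_cast<char32_t>(spua_a_first + n);
  return std::nullopt;  // out of private-use space in this sketch
}

// Code points above U+FFFF do not fit a 16-bit cmap subtable, which is why
// a format-12 subtable is needed for the spilled glyphs.
constexpr bool needs_format12(char32_t cp) { return cp > 0xFFFF; }
```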
Correctness:
- Treat pure 180° rotation (a=d=-1) as a matrix transform by also requiring m.a > 0 for the axis-aligned fast path; previously it fed a negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs (which never emit a literal space) and must skip composite fonts (PDF Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's font size, not the current run's, so it can't drift from the visual and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines on the same row don't reorder same-row lines.
Cleanups:
- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass double parse.
Test: emit the single-layer PdfTextMode alongside the dual-layer output for one representative PDF under a `-single` suffix, so both text modes are covered by reference-output diffing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
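To illustrate the rotation fix, here is a hedged sketch of an axis-aligned fast-path test. The `Matrix` layout follows the PDF `[a b c d e f]` convention; the function name, the epsilon and the exact set of conditions are assumptions, and only the added `m.a > 0` requirement is taken from the commit message.

```cpp
#include <cassert>
#include <cmath>

// Affine matrix in PDF order [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// Hypothetical fast-path gate: no rotation/skew terms and a positive
// x-scale. A pure 180° rotation (a = d = -1, b = c = 0) has zero
// off-diagonal terms too, so without the m.a > 0 check it would be
// treated as axis-aligned and feed a negative scale into font-size.
bool is_axis_aligned(const Matrix& m) {
  constexpr double eps = 1e-6;
  return std::abs(m.b) < eps && std::abs(m.c) < eps && m.a > 0.0;
}
```

Runs that fail this gate would take the `matrix()` transform path instead of plain left/top placement.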
Extract the byte-identical machinery the two PdfTextMode orchestrators duplicated into static helpers, so each mode reduces to its actual essence (grouping policy + span emission) and the shared logic has a single source of truth (structurally preventing drift like the earlier line-break / 180°-rotation divergences):
- RunGeometry + run_geometry(): the per-run geometry prelude (transform, is_matrix, ascent, origin, extent, font sizes), consumed via a structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for the mode-specific CSS rules.
Output-neutral: every reference-output document.html (dual and single layer, all engine=odr PDFs) is byte-identical before and after.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Trim redundant inline comments that restated adjacent docstrings, tighten the two long CSS-rationale blocks (fallback font size-adjust and .sr/.sg justify) without losing the reasoning, and hoist the duplicated to255 channel-clamp lambda into a shared helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
andiwand
commented
Jul 3, 2026
Co-authored-by: Andreas Stefl <stefl.andreas@gmail.com>
…Document.core into pdf-text-selection
Font sizes, positions, widths, margins and spacing are a more natural fit for pt, and the SVG viewBox already lives in PDF user-space points, so authoring the text layer in pt drops the pt_to_px conversion entirely. pt and px are both fixed absolute CSS units (4:3 ratio), so rendering is unchanged. The matrix() translation is intrinsically px, so rotated/skewed runs carry their pt translation via a leading translate(...pt). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
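A sketch of the unit reasoning, assuming a hypothetical `css_transform` helper: the fixed 4:3 px-to-pt ratio is standard CSS, while the exact string the PR emits (separators, precision) is not shown here and is assumed.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// CSS defines 1in = 96px = 72pt, so 1pt = 4/3 px.
constexpr double pt_to_px(double pt) { return pt * 4.0 / 3.0; }

// The last two matrix() arguments are unitless and interpreted as px, so
// the pt translation is emitted as a leading translate() and the matrix
// keeps a zero translation. Helper name and formatting are illustrative.
std::string css_transform(double a, double b, double c, double d,
                          double e_pt, double f_pt) {
  std::ostringstream out;
  out << "translate(" << e_pt << "pt," << f_pt << "pt) "
      << "matrix(" << a << ',' << b << ',' << c << ',' << d << ",0,0)";
  return out.str();
}
```

Because the transform list applies the translation outside the matrix, the result matches a single `matrix()` whose translation was converted to px beforehand.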
Drop the baseline-y stable_sort of selection lines. Sorting by y fixed out-of-order single-column pages but interleaved multi-column layouts, which the content stream keeps contiguous. Reading order can't be recovered by a scalar sort key, so revert to plain stream order for now and remove the now-dead SelLineOut.y field. Record the proper fix (recursive XY-cut page segmentation, with a lighter single-pass column-detection first step) in internal/pdf/READING_ORDER.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
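The interleaving problem is easy to reproduce in isolation. The sketch below uses a hypothetical `Line` struct and a two-column page whose content stream emits the left column first; a stable sort by baseline y then alternates between the columns.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical selection line: its text and baseline y (top-down).
struct Line {
  std::string text;
  double y;
};

// Stable-sort lines by baseline y, as the reverted approach did, and
// return the resulting reading order.
std::vector<std::string> sorted_by_y(std::vector<Line> lines) {
  std::stable_sort(lines.begin(), lines.end(),
                   [](const Line& l, const Line& r) { return l.y < r.y; });
  std::vector<std::string> order;
  for (const Line& line : lines) order.push_back(line.text);
  return order;
}
```

For a stream order of L1 (y=10), L2 (y=20), R1 (y=10), R2 (y=20), the sort yields L1, R1, L2, R2, so a drag selection would jump between columns on every row, whereas plain stream order keeps each column contiguous.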
The visual, selection and single text layers each carried their own copy
of the same line-detection bookkeeping — an open-line index plus the
previous run's end/baseline/font/matrix — and re-implemented the same
new-line decision and previous-run update inline. That is exactly the
kind of parallel state that drifts (the reason `starts_new_line` was
already extracted).
Introduce a `LineFlow` struct holding the open-line index and previous-
run geometry, with `decide()` (new-line + margin) and `advance()`
(record predecessor). Each layer keeps its own instance — the state
footprint and downstream emission genuinely differ — but the shared gate
and update now live in one place and cannot diverge:
- visual/single reduce to decide()/advance() almost mechanically;
- single ORs its flow-key change onto the decision;
- selection reuses decide()/advance() and keeps its gap test and its
extra state (ends-space, run-start-ox, prev font-size) as locals; it
is never close()d, preserving run contiguity across drawing ops.
No output change intended — this is a representation-preserving refactor.
Also fix a latent build break carried in the prior cleanup: the single-
layer main pass dereferenced `page` (now a reference) as `page->`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
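As a rough picture of the shape such a struct could take: only the names `LineFlow`, `decide()`, `advance()` and the idea of a `close()` come from the commit message. The `RunBox` and `Decision` types, the fields, and the new-line rule itself (baseline shift above half the previous font size, or backward x movement) are assumptions made for this sketch, and the open-line index is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Hypothetical per-run geometry fed into the flow.
struct RunBox {
  double x, baseline, width, font_size;
};

struct Decision {
  bool new_line;       // start a new line block?
  double margin_left;  // gap to the previous run when staying on the line
};

struct LineFlow {
  std::optional<RunBox> prev;  // previous run on the open line, if any

  // Shared gate: decide whether the run opens a new line and, if not,
  // how far it sits from the end of its predecessor.
  Decision decide(const RunBox& run) const {
    if (!prev) return {true, 0.0};
    const bool shifted =
        std::abs(run.baseline - prev->baseline) > 0.5 * prev->font_size;
    if (shifted || run.x < prev->x) return {true, 0.0};
    return {false, run.x - (prev->x + prev->width)};
  }

  // Record the run as the predecessor for the next decision.
  void advance(const RunBox& run) { prev = run; }

  // Forget the open line, e.g. when a drawing op interrupts the text.
  void close() { prev.reset(); }
};
```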
…ering in READING_ORDER Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
Undo the shared LineFlow struct: the three text layers diverge enough (state footprint, close semantics, downstream emission) that inlining the previous-run bookkeeping and new-line gate reads more directly than routing through decide()/advance(). Behavior is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
…reak space
The single-layer HTML collapse test requires a 1:1 alignment between a run's character codes and its Unicode text (`utf8_length(text) == advances.size()`). Space inference prepends an inferred `U+0020` to `text` without a backing code or advance, so every run that recovers a leading word-break space failed that test outright and was painted via PUA glyphs (generated content + embedded font) instead of collapsing to real, natively selectable Unicode. Because almost every word in running text follows a space break, this affected nearly all body text: on 978-3-030-65771-0 it was 63302 of 63304 unclean runs.
Mark the inferred space explicitly (`TextElement::leading_space_inferred`, set where `show()` injects it) and make the collapse test (and the frequency pre-pass, which previously excluded these runs from voting on the winning glyph) align the codes against the run text *after* that space. A collapsing run that carries one emits the space as a zero-width selectable overlay (like the dual layer's spacer) rather than visible text, so `white-space:pre` cannot shift the glyphs off their placement origin while copy/search still read the recovered space. PUA is left to its real purpose: genuine glyph/Unicode conflicts and `no_unicode` runs.
Effect on that document: unclean (PUA) runs 63304 -> 2 (the two remaining are `no_unicode`); document.html 13.3M -> 11.5M. Single-layer reference outputs need regenerating.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
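The alignment part of that test can be sketched as follows. `utf8_length` and the `leading_space_inferred` flag are named in the commit message; this implementation of `utf8_length`, the `can_collapse` name and its parameters are assumptions, and the real collapse test also checks the glyph mapping, which is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Count code points in valid UTF-8 by skipping continuation bytes
// (those of the form 10xxxxxx).
std::size_t utf8_length(std::string_view s) {
  std::size_t n = 0;
  for (const unsigned char ch : s) {
    if ((ch & 0xC0) != 0x80) ++n;
  }
  return n;
}

// Length-alignment half of the collapse test: the run text must map 1:1
// onto the advances, ignoring an inferred leading space that has no
// backing code or advance.
bool can_collapse(std::string_view text, std::size_t advance_count,
                  bool leading_space_inferred) {
  if (leading_space_inferred) {
    if (text.empty() || text.front() != ' ') return false;
    text.remove_prefix(1);  // align against the text after the space
  }
  return utf8_length(text) == advance_count;
}
```

With the flag ignored, `" word"` has five code points against four advances and fails; with the flag honoured, the same run passes.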
🤖 Generated with Claude Code
Summary
Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.
- `PdfTextMode` enum added to `HtmlConfig` (`dual_layer` default, `single_layer` opt-in)
- Line blocks (`position:absolute` on the line `<div>`, `margin-left` on inline run `<span>`s) rather than per-glyph absolute positioning, forward-compatible with future paragraph grouping
Dual-layer mode (`PdfTextMode::dual_layer`, default)
Similar approach to pdf.js:
- Visual layer (`<div class="vis" aria-hidden="true">`): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
- Selection layer (`<div class="sel">`): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts `display:inline-block` spacer spans. Each run span uses CSS `text-align:justify; text-align-last:justify; text-justify:inter-character` to spread characters to match the PDF advance, with no JavaScript.
Single-layer mode (`PdfTextMode::single_layer`)
Similar approach to pdf2htmlEX:
- Pre-pass frequency analysis: counts `(uchar, glyph)` co-occurrences per font across all pages, then picks the most-frequent glyph for each Unicode character as the cmap winner (common case wins, not first-come-first-serve).
- Unclean runs paint glyphs via `::before{content:attr(data-g)}` CSS generated content with a zero-width `display:inline-block; overflow:hidden` overlay `<span>` carrying the real Unicode for selection.
Test plan
- Dual-layer (`style-various-1.pdf`): `class="vis" aria-hidden` + `class="sel"` divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
- Single-layer (`--single` flag on CLI): `gl` + `ov` classes present; `data-g` attributes contain PUA bytes; inline text contains readable Unicode