PDF text: dual-layer + single-layer rendering with PdfTextMode option by andiwand · Pull Request #579 · opendocument-app/OpenDocument.core

andiwand · 2026-07-01T19:26:10Z

🤖 Generated with Claude Code

Summary

Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.

Adds PdfTextMode enum to HtmlConfig (dual_layer default, single_layer opt-in)
Both modes use line blocks (position:absolute on the line <div>, margin-left on inline run <span>s) rather than per-glyph absolute positioning — forward-compatible with future paragraph grouping
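As a rough illustration of the option described above, the following is a minimal sketch, not the actual declaration. The enum values and the `pdf_text_mode` field name come from this PR; the helper `text_layer_classes` is hypothetical, and the real `HtmlConfig` has other fields that are omitted here.

```cpp
#include <string>

// Hypothetical mirror of the option added by this PR; the real HtmlConfig
// carries many more fields and may be declared differently.
enum class PdfTextMode { dual_layer, single_layer };

struct HtmlConfig {
  // dual_layer is the default, single_layer is opt-in.
  PdfTextMode pdf_text_mode{PdfTextMode::dual_layer};
};

// Illustrative helper (not part of the PR): names the marker classes a page
// is expected to carry for each mode.
std::string text_layer_classes(const HtmlConfig &config) {
  switch (config.pdf_text_mode) {
  case PdfTextMode::dual_layer:
    return "vis sel";
  case PdfTextMode::single_layer:
    return "t";
  }
  return "";
}
```

A caller would opt in to the single-layer path by setting `config.pdf_text_mode = PdfTextMode::single_layer` before translating the PDF.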

Dual-layer mode (`PdfTextMode::dual_layer`, default)

Similar approach to pdf.js:

Visual layer (<div class="vis" aria-hidden="true">): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
Selection/search layer (<div class="sel">): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts display:inline-block spacer spans. Each run span uses CSS text-align:justify; text-align-last:justify; text-justify:inter-character to spread characters to match the PDF advance — no JavaScript.
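The selection-layer markup can be pictured with a small sketch. It assumes inline styles and already-escaped text for brevity, whereas the PR emits shared CSS classes; the function names `selection_run_span` and `gap_spacer_span` are hypothetical.

```cpp
#include <sstream>
#include <string>

// Builds one selection-layer run. `escaped_text` is assumed to be already
// markup-escaped; `advance` is the PDF advance width of the run in CSS px.
// The justify declarations ask the browser to spread the characters across
// exactly that width, so no JavaScript is needed.
std::string selection_run_span(const std::string &escaped_text,
                               double advance) {
  std::ostringstream out;
  out << "<span style=\"display:inline-block;width:" << advance << "px;"
      << "text-align:justify;text-align-last:justify;"
      << "text-justify:inter-character\">" << escaped_text << "</span>";
  return out.str();
}

// Builds an empty spacer for a horizontal gap detected between two runs.
std::string gap_spacer_span(double gap) {
  std::ostringstream out;
  out << "<span style=\"display:inline-block;width:" << gap
      << "px\"></span>";
  return out.str();
}
```

Whether a given browser honours `text-justify:inter-character` is outside the scope of this sketch.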

Single-layer mode (`PdfTextMode::single_layer`)

Similar approach to pdf2htmlEX:

Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font across all pages, then picks the most-frequent glyph for each Unicode character as the cmap winner (common case wins, not first-come-first-serve).
Clean runs (all uchar→glyph pairs match the winner): real Unicode rendered directly in the embedded font — natively selectable and findable.
Unclean runs: glyphs painted via ::before{content:attr(data-g)} CSS generated content with a zero-width display:inline-block; overflow:hidden overlay <span> carrying the real Unicode for selection.
PUA-only characters (no Unicode mapping): remain visible but unselectable.
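The frequency pre-pass and the clean-run check could look roughly like the sketch below, applied once per font. The type alias and both function names are assumptions for illustration, and the tie-breaking rule (lowest glyph id wins) is a choice made here, not something the PR states.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One observation from the pre-pass: (Unicode character, glyph id).
using CharGlyph = std::pair<char32_t, std::uint16_t>;

// Counts co-occurrences for one font and picks, per Unicode character, the
// glyph it was paired with most often. Ties go to the lowest glyph id,
// because the map is iterated in key order and the comparison is strict.
std::map<char32_t, std::uint16_t>
pick_winners(const std::vector<CharGlyph> &observed) {
  std::map<CharGlyph, std::size_t> counts;
  for (const CharGlyph &pair : observed) {
    ++counts[pair];
  }
  std::map<char32_t, std::uint16_t> winners;
  std::map<char32_t, std::size_t> best;
  for (const auto &[pair, count] : counts) {
    if (count > best[pair.first]) {
      best[pair.first] = count;
      winners[pair.first] = pair.second;
    }
  }
  return winners;
}

// A run is clean when every (uchar, glyph) pair matches the winner, so the
// real Unicode can be rendered directly in the embedded font.
bool is_clean_run(const std::vector<CharGlyph> &run,
                  const std::map<char32_t, std::uint16_t> &winners) {
  for (const CharGlyph &pair : run) {
    const auto it = winners.find(pair.first);
    if (it == winners.end() || it->second != pair.second) {
      return false;
    }
  }
  return true;
}
```

Counting first and choosing afterwards is what makes the common pairing win; a first-come mapping would instead lock in whichever glyph happened to appear first in the content stream.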

Test plan

Build passes, all 658 tests pass
Dual-layer output (style-various-1.pdf): class="vis" aria-hidden + class="sel" divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
Single-layer output (--single flag on CLI): gl + ov classes present; data-g attributes contain PUA bytes; inline text contains readable Unicode
Both modes render visually correct in browser
Text selection and find-in-page work in both modes

Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Replaces the single-glyph-per-absolute-span approach with two modes, both using line blocks (position:absolute on the line div, margin-left on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text. Runs grouped into line blocks by baseline; space detection inserts gap spans. Each run span is display:inline-block with CSS justify (text-align:justify; text-align-last:justify; text-justify:inter-character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font, then picks the most-frequent glyph as the cmap entry, so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real Unicode directly in the embedded font and are natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74f51ee76f

Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs). The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

pt_to_px/pt_to_in, the SFNT/CFF usability probe, the fvN/fnN class helper, the run's left/top-or-matrix placement classes, and the post-pass font-face/style writer were each copy-pasted between the dual-layer and single-layer paths. Hoist them into shared statics (add_position_classes, font_is_usable, font_class, write_font_face) used by both.

Verified byte-identical document.html output for both PdfTextMode values across several PDF fixtures before/after.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Tight-continuation runs were merged into the previous .sr span's text without recomputing its declared width, leaving it at the first sub-run's width while the visible text grew arbitrarily longer (e.g. "Particle Acceleration and Detection" declared 10px wide). Track each open run's starting x-offset and re-derive the width on merge.

Also propagate font-size to the selection layer (runs, gap spacers, and the trailing space that closes a line), which previously inherited the browser default and could overflow/clip against the PDF-derived width, desyncing the invisible hit-test text from the true glyph run.
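The width fix described in this commit can be sketched as follows. `OpenRun` and `merge_continuation` are hypothetical names; the only idea taken from the commit message is that the open run remembers its starting x-offset and its width is re-derived from the end of the merged sub-run.

```cpp
#include <string>

// An open selection run (.sr span) that later sub-runs may be merged into.
struct OpenRun {
  std::string text;
  double start_x; // x-offset where the run began
  double width;   // declared CSS width of the span
};

// Appends a tight-continuation sub-run and re-derives the declared width
// from the run's own start to the end of the new sub-run, instead of
// leaving it at the first sub-run's width.
void merge_continuation(OpenRun &run, const std::string &sub_text,
                        double sub_x, double sub_width) {
  run.text += sub_text;
  run.width = (sub_x + sub_width) - run.start_x;
}
```

Deriving the width from absolute positions, rather than summing sub-run widths, also absorbs any small gap between the merged pieces.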

…rder

.sg (gap spacer) lacked the overflow:hidden that .sr (text run) has; per CSS an inline-block's baseline is its content's text baseline when overflow is visible but the bottom margin edge otherwise, so the two box types baseline-aligned differently within the same line, visibly shifting spaces in y. Give .sg the same overflow:hidden.

Also content-stream order doesn't always run top-to-bottom (margins, columns), which made drag-selection highlight rows inconsistently. Stable-sort each page's selection lines by baseline y after the page is fully processed, keeping content-stream (x) order intact for lines on the same row.

…cmap

Large glyph counts exceeded the 6400-slot BMP PUA and threw. Spill the overflow into Supplementary PUA-A and emit a format-12 cmap subtable to cover the beyond-BMP code points, clamping OS/2 usFirst/usLastCharIndex to 0xFFFF.

Also add configurable dual-layer selection fallback fonts and a size-adjust so the invisible selection text widths track the PDF boxes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
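The spill into the supplementary plane can be illustrated with a small sketch. It assumes glyphs are assigned PUA code points sequentially by index, which the commit message does not state; `pua_code_point` and `clamp_os2_char_index` are hypothetical names. The ranges themselves are fixed by Unicode: the BMP Private Use Area is U+E000 to U+F8FF (6400 code points) and Supplementary Private Use Area-A is U+F0000 to U+FFFFD.

```cpp
#include <cstdint>
#include <stdexcept>

// Maps the n-th re-encoded glyph to a private-use code point, filling the
// BMP PUA first and spilling the overflow into Supplementary PUA-A.
char32_t pua_code_point(std::uint32_t index) {
  constexpr std::uint32_t bmp_first = 0xE000;
  constexpr std::uint32_t bmp_slots = 0xF8FF - 0xE000 + 1; // 6400
  constexpr std::uint32_t spua_a_first = 0xF0000;
  constexpr std::uint32_t spua_a_slots = 0xFFFFD - 0xF0000 + 1; // 65534
  if (index < bmp_slots) {
    return static_cast<char32_t>(bmp_first + index);
  }
  index -= bmp_slots;
  if (index < spua_a_slots) {
    return static_cast<char32_t>(spua_a_first + index);
  }
  throw std::out_of_range("glyph index exceeds available PUA slots");
}

// OS/2 usFirstCharIndex/usLastCharIndex are 16-bit fields, so code points
// beyond the BMP are clamped to 0xFFFF.
std::uint16_t clamp_os2_char_index(char32_t code_point) {
  return code_point > 0xFFFF ? std::uint16_t{0xFFFF}
                             : static_cast<std::uint16_t>(code_point);
}
```

Code points above U+FFFF cannot be expressed in a format-4 cmap subtable, which is why the commit adds a format-12 subtable for them.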

Correctness:
- Treat pure 180° rotation (a=d=-1) as a matrix transform by also requiring m.a > 0 for the axis-aligned fast path; previously it fed a negative m.a into font-size and the left/top math. Both modes.
- Guard dual-layer visual word-spacing: it is inert on PUA glyph runs (which never emit a literal space) and must skip composite fonts (PDF Tw applies only to single-byte code 32), matching single-layer.
- Measure the selection-layer line-break against the previous run's font size, not the current run's, so it can't drift from the visual and single-layer heuristics. Extracted a shared starts_new_line().
- Quantize the selection-line sort key to 0.1px so float-noise baselines on the same row don't reorder same-row lines.

Cleanups:
- SingleRunOut::color stores the class name without a leading space.
- Collapse-check loop breaks early and drops a redundant text.font check.
- Unify class prefixes: ws = word-spacing everywhere (w = width).
- Comment escape_markup (why not html::escape_text) and the pre-pass double parse.

Test: emit the single-layer PdfTextMode alongside the dual-layer output for one representative PDF under a `-single` suffix, so both text modes are covered by reference-output diffing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
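Two of the correctness fixes lend themselves to a short sketch. It assumes the fast path previously checked only that the off-diagonal terms are zero, which the commit message implies but does not spell out; the `Matrix` layout and both function names are illustrative.

```cpp
#include <cmath>

// 2D affine transform in PDF order: [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// Axis-aligned fast path: no rotation or skew terms, and a positive x scale.
// Requiring a > 0 sends a pure 180-degree rotation (a = d = -1, b = c = 0)
// down the general matrix path, so a negative value never reaches the
// font-size or left/top calculations.
bool is_axis_aligned(const Matrix &m) {
  return m.b == 0.0 && m.c == 0.0 && m.a > 0.0;
}

// Quantizes a baseline to 0.1px so float noise between runs on the same row
// yields the same sort key.
long selection_sort_key(double baseline_y) {
  return std::lround(baseline_y * 10.0);
}
```

With a quantized key, a stable sort keeps content-stream order for lines whose baselines differ only by rounding noise.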

Extract the byte-identical machinery the two PdfTextMode orchestrators duplicated into static helpers, so each mode reduces to its actual essence (grouping policy + span emission) and the shared logic has a single source of truth (structurally preventing drift like the earlier line-break / 180°-rotation divergences):
- RunGeometry + run_geometry(): the per-run geometry prelude (transform, is_matrix, ascent, origin, extent, font sizes), consumed via a structured binding so the call sites keep their local names.
- color_class(): the non-black paint-colour class suffix.
- PageBox + begin_page(): page-box dimensions, the page to_box transform and the `.p x# y#` class string.
- intern_font(): the font accept/reject bookkeeping shared by both font_family lambdas (each supplies its own per-font array growth).
- write_page_items(): the `<defs>` + paint-order SVG open/close dance over a variant<Line, Path> item list.
- write_header_common(): the document/head prologue with a callback for the mode-specific CSS rules.

Output-neutral: every reference-output document.html (dual and single layer, all engine=odr PDFs) is byte-identical before and after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

Trim redundant inline comments that restated adjacent docstrings, tighten the two long CSS-rationale blocks (fallback font size-adjust and .sr/.sg justify) without losing the reasoning, and hoist the duplicated to255 channel-clamp lambda into a shared helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

Co-authored-by: Andreas Stefl <stefl.andreas@gmail.com>

…Document.core into pdf-text-selection

Font sizes, positions, widths, margins and spacing are a more natural fit for pt, and the SVG viewBox already lives in PDF user-space points, so authoring the text layer in pt drops the pt_to_px conversion entirely. pt and px are both fixed absolute CSS units (4:3 ratio), so rendering is unchanged. The matrix() translation is intrinsically px, so rotated/skewed runs carry their pt translation via a leading translate(...pt).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
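The unit argument can be made concrete with a minimal sketch. The `Matrix` struct and `run_transform` are hypothetical; the CSS facts relied on are that 1pt equals 4/3 px and that `matrix()` takes unitless numbers whose last two are interpreted as px.

```cpp
#include <sstream>
#include <string>

// CSS defines 1in = 96px = 72pt, so 1pt is exactly 4/3 px.
constexpr double pt_to_px(double pt) { return pt * 4.0 / 3.0; }

// 2D affine transform in PDF order, with the translation (e, f) in points.
struct Matrix {
  double a, b, c, d, e, f;
};

// Emits a CSS transform for a rotated or skewed run. The translation goes
// into a leading translate() in pt, and matrix() keeps only the linear part
// with a zero translation, since its own translation slots would be px.
std::string run_transform(const Matrix &m) {
  std::ostringstream out;
  out << "transform:translate(" << m.e << "pt," << m.f << "pt) matrix("
      << m.a << ',' << m.b << ',' << m.c << ',' << m.d << ",0,0)";
  return out.str();
}
```

Because `translate(T) matrix(M)` applies the linear part first and the translation second, the result matches a single `matrix()` whose translation was converted with `pt_to_px`.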

Drop the baseline-y stable_sort of selection lines. Sorting by y fixed out-of-order single-column pages but interleaved multi-column layouts, which the content stream keeps contiguous. Reading order can't be recovered by a scalar sort key, so revert to plain stream order for now and remove the now-dead SelLineOut.y field.

Record the proper fix (recursive XY-cut page segmentation, with a lighter single-pass column-detection first step) in internal/pdf/READING_ORDER.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
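The interleaving problem is easy to reproduce in isolation. The sketch below uses a made-up two-column page whose content stream emits the left column first; `SelLine`, `two_column_page` and `order_after_y_sort` are illustrative names and not the PR's types.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A selection line with its baseline y (top-down, so smaller is higher).
struct SelLine {
  std::string text;
  double y;
};

// Content-stream order for a hypothetical two-column page: the left column
// (L1, L2) is emitted contiguously, followed by the right column (R1, R2).
std::vector<SelLine> two_column_page() {
  return {{"L1", 100.0}, {"L2", 120.0}, {"R1", 100.0}, {"R2", 120.0}};
}

// Applies the dropped approach: a stable sort by baseline y alone.
std::vector<std::string> order_after_y_sort(std::vector<SelLine> lines) {
  std::stable_sort(
      lines.begin(), lines.end(),
      [](const SelLine &lhs, const SelLine &rhs) { return lhs.y < rhs.y; });
  std::vector<std::string> order;
  for (const SelLine &line : lines) {
    order.push_back(line.text);
  }
  return order;
}
```

Sorting this page yields L1, R1, L2, R2, which reads across the columns instead of down each one, while the untouched stream order already reads correctly.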

The visual, selection and single text layers each carried their own copy of the same line-detection bookkeeping (an open-line index plus the previous run's end/baseline/font/matrix) and re-implemented the same new-line decision and previous-run update inline. That is exactly the kind of parallel state that drifts (the reason `starts_new_line` was already extracted).

Introduce a `LineFlow` struct holding the open-line index and previous-run geometry, with `decide()` (new-line + margin) and `advance()` (record predecessor). Each layer keeps its own instance, since the state footprint and downstream emission genuinely differ, but the shared gate and update now live in one place and cannot diverge:
- visual/single reduce to decide()/advance() almost mechanically;
- single ORs its flow-key change onto the decision;
- selection reuses decide()/advance() and keeps its gap test and its extra state (ends-space, run-start-ox, prev font-size) as locals; it is never close()d, preserving run contiguity across drawing ops.

No output change intended: this is a representation-preserving refactor.

Also fix a latent build break carried in the prior cleanup: the single-layer main pass dereferenced `page` (now a reference) as `page->`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

…ering in READING_ORDER

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

Undo the shared LineFlow struct: the three text layers diverge enough (state footprint, close semantics, downstream emission) that inlining the previous-run bookkeeping and new-line gate reads more directly than routing through decide()/advance(). Behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG

…reak space

The single-layer HTML collapse test requires a 1:1 alignment between a run's character codes and its Unicode text (`utf8_length(text) == advances.size()`). Space inference prepends an inferred `U+0020` to `text` without a backing code or advance, so every run that recovers a leading word-break space failed that test outright and was painted via PUA glyphs (generated content + embedded font) instead of collapsing to real, natively selectable Unicode. Because almost every word in running text follows a space break, this affected nearly all body text: on 978-3-030-65771-0 it was 63302 of 63304 unclean runs.

Mark the inferred space explicitly (`TextElement::leading_space_inferred`, set where `show()` injects it) and make the collapse test (and the frequency pre-pass, which previously excluded these runs from voting on the winning glyph) align the codes against the run text *after* that space. A collapsing run that carries one emits the space as a zero-width selectable overlay (like the dual layer's spacer) rather than visible text, so `white-space:pre` cannot shift the glyphs off their placement origin while copy/search still read the recovered space. PUA is left to its real purpose: genuine glyph/Unicode conflicts and `no_unicode` runs.

Effect on that document: unclean (PUA) runs 63304 -> 2 (the two remaining are `no_unicode`); document.html 13.3M -> 11.5M. Single-layer reference outputs need regenerating.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GG4WpNTKKR2uk5dBsqyAdG
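The alignment part of the collapse test might be sketched like this. Only the `utf8_length(text) == advances.size()` condition and the `leading_space_inferred` flag come from the commit message; the function signature is an assumption, and the real test additionally checks each code against the winning glyph.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed input.
std::size_t utf8_length(const std::string &text) {
  std::size_t count = 0;
  for (const unsigned char byte : text) {
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}

// Checks the 1:1 alignment between a run's character codes (one advance per
// code) and its Unicode text. An inferred leading space has no backing code
// or advance, so it is excluded from the text side of the comparison.
bool codes_align_with_text(const std::string &text, std::size_t advance_count,
                           bool leading_space_inferred) {
  std::size_t text_length = utf8_length(text);
  if (leading_space_inferred) {
    if (text_length == 0) {
      return false; // flag set but nothing to skip: treat as misaligned
    }
    --text_length;
  }
  return text_length == advance_count;
}
```

Without the flag, a run such as " hi" with two advances fails the comparison, which is the failure mode the commit describes for nearly all body text.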

andiwand and others added 2 commits July 1, 2026 19:06

chatgpt-codex-connector Bot reviewed Jul 1, 2026

Comment thread src/odr/internal/html/pdf_file.cpp Outdated

Comment thread src/odr/internal/html/pdf_file.cpp Outdated

andiwand and others added 3 commits July 1, 2026 22:29

revert test

d1b4527

This was referenced Jul 2, 2026

PDF: single-layer text selection with gen-time margins #578

Closed

PDF: separate transparent selection layer for text #577

Closed

andiwand and others added 11 commits July 2, 2026 14:19

update refs

532f94a

checkout lfs; cleanup

263cbb4

generalize tests

39c3045

cleanup

e5960c4

andiwand commented Jul 3, 2026

Comment thread src/odr/internal/html/pdf_file.cpp Outdated

andiwand commented Jul 3, 2026

Comment thread src/odr/internal/html/pdf_file.cpp Outdated

andiwand and others added 9 commits July 3, 2026 22:36

Apply suggestions from code review

aa22f26

Merge branch 'pdf-text-selection' of github.com:opendocument-app/Open…

34fa3f5

format

c3e9e0b

minor cleanup

682ce0c

pdf: document writing-mode, non-Manhattan limits, and bottom-up clust…

7cd9c2f

andiwand and others added 3 commits July 4, 2026 01:04

update refs

6fee5ff

update ref

39b0623

andiwand wants to merge 28 commits into main from pdf-text-selection