feat: smart (semantic) diff for subtitle regression outputs by gaurav02081 · Pull Request #1142 · CCExtractor/sample-platform

gaurav02081 · 2026-06-30T19:37:38Z

What this is

A semantic diff for regression-test outputs. Instead of a raw line diff, it
classifies how two subtitle outputs differ, so a person — or an AI agent —
gets an actionable verdict instead of a wall of changed lines.

Example: instead of "53 lines differ", it says "All 13 cues match but are
500 ms late."

Classification kinds

identical, timing_shift (constant offset), timing_drift (growing offset),
text_change, formatting_change (tags/entities only), whitespace_change
(CEA-608 padding only), encoding_change (non-ASCII/accents only), split_cues,
merged_cues, missing_cues, extra_cues, unsupported (no cues parsed), or
mixed. Each result also carries a per-cue changes list (which cue, what kind,
expected/actual snippet, timing offset).

Grounded in CCExtractor's own behaviour

- Mirrors tests/extract_expected.py's normalization (strip styling tags,
  unescape entities, trim CEA-608 trailing padding) so cosmetic differences are
  separated from real text changes.
- Handles the -latin1 charset case (accent folding → encoding_change).
- Verified on real CCExtractor output (CEA-608 English + DVB Spanish with
  <font> tags and accents), vendored as golden fixtures.
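
The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in mod_test/smartdiff/: the names `normalize_text` and `text_category`, the tag regex, and the category labels are assumptions made for the example.

```python
import html
import re

# Assumed pattern for styling tags such as <font color="..."> or </i>.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def normalize_text(raw: str) -> str:
    """Strip styling tags, unescape entities, trim trailing padding per line."""
    cleaned = []
    for line in raw.splitlines():
        line = _TAG_RE.sub("", line)   # drop tags first, so an escaped "&lt;i&gt;" survives as text
        line = html.unescape(line)     # "&amp;" -> "&"
        cleaned.append(line.rstrip())  # CEA-608 output pads lines with trailing spaces
    return "\n".join(cleaned)


def _rstrip_lines(text: str) -> list:
    """Per-line trailing-whitespace trim, leaving tags and entities untouched."""
    return [line.rstrip() for line in text.splitlines()]


def text_category(expected: str, actual: str) -> str:
    """Return 'same', 'whitespace', 'formatting', or 'text' (hypothetical labels)."""
    if expected == actual:
        return "same"
    if _rstrip_lines(expected) == _rstrip_lines(actual):
        return "whitespace"    # only trailing padding differs
    if normalize_text(expected) == normalize_text(actual):
        return "formatting"    # only tags or entities differ
    return "text"              # a real wording change
```

Checking the cheapest cosmetic explanation first (padding, then tags and entities) means a cue is only reported as a real text change once every cosmetic explanation has been ruled out.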

Where it lives

- mod_test/smartdiff/: a Flask-decoupled core (SRT + WebVTT parsers, a format
  dispatcher, normalization, and the classifier). Fully unit-tested.
- TestResultFile.generate_smart_diff() + a JSON endpoint
  GET /diff/<test>/<regression>/<output>/smart, reusable by the web UI, a CLI,
  and agents.
- A "Smart" option next to the existing "Fail" diff link on the result page
  (additive; the raw diff is unchanged).

Notes

- Independent of the REST API work; this is standalone in the platform.
- Tests: parsers, classifier, normalization, real golden fixtures, robustness
  (malformed/garbage input), and the model glue.
- Follow-up (not in this PR): a CLI `sp run diff --smart` consuming the same
  endpoint.

Add a Flask-decoupled smart-diff module that classifies *how* two subtitle outputs differ instead of producing a raw line diff:
- srt.py: parse SubRip content into structured cues (BOM/CRLF tolerant).
- compare.py: align cues and classify as identical, timing_shift (with a consistent offset_ms), text_change, missing_cues, extra_cues, or mixed, with an agent-actionable one-line summary.

Includes unit tests for the parser and every classification branch.
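
As a rough illustration of what a BOM/CRLF-tolerant SubRip parser involves, here is a minimal sketch. It is not the srt.py from this PR: the `Cue` dataclass, its field names, and `parse_srt` are assumptions for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    # Hypothetical cue shape; the PR's real structure may differ.
    start_ms: int
    end_ms: int
    text: str


_TIMING_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list:
    """Parse SubRip text into cues; malformed blocks are skipped, never raised on."""
    # Tolerate a leading BOM and CRLF / lone CR line endings.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    for block in re.split(r"\n{2,}", content.strip("\n")):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = _TIMING_RE.search(line)
            if match:
                groups = match.groups()
                # Keep the cue text raw (padding included) so a later
                # comparison step can decide what counts as cosmetic.
                text = "\n".join(lines[index + 1:])
                cues.append(Cue(_to_ms(*groups[:4]), _to_ms(*groups[4:]), text))
                break
    return cues
```

Blocks without a timing line simply produce no cue, so garbage input yields an empty list rather than an exception.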

- vtt.py: parse WebVTT into cues (skips WEBVTT/NOTE/STYLE/REGION blocks, handles optional hours and trailing cue settings).
- parsing.py: parse_subtitles() picks the parser by explicit hint or by auto-detecting the format from content.
- compare.smart_diff() now takes an optional fmt and works across SRT/VTT.

Adds parser tests for WebVTT and a cross-format auto-detect compare test.
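
The two WebVTT details called out above (optional hours, trailing cue settings) and content-based format detection could look something like this. The function names and the detection rule are assumptions for illustration; the PR's parsing.py may decide differently.

```python
import re

# WebVTT timestamps are mm:ss.ttt with an optional leading hours field.
_VTT_TS_RE = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})$")


def vtt_timestamp_ms(stamp: str) -> int:
    match = _VTT_TS_RE.match(stamp)
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return ((int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_vtt_timing_line(line: str) -> tuple:
    """Return (start_ms, end_ms), ignoring cue settings such as 'align:start'."""
    start, arrow, rest = line.partition("-->")
    parts = rest.split()
    if not arrow or not parts:
        raise ValueError(f"not a WebVTT timing line: {line!r}")
    # parts[0] is the end timestamp; anything after it is cue settings.
    return vtt_timestamp_ms(start.strip()), vtt_timestamp_ms(parts[0])


def detect_format(content: str) -> str:
    """Guess 'vtt' or 'srt' from content (assumed rule: a WEBVTT signature means VTT)."""
    head = content.lstrip("\ufeff").lstrip()
    return "vtt" if head.startswith("WEBVTT") else "srt"
```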

…inds

Mirror CCExtractor's own expected-output handling (tests/extract_expected.py): strip HTML/styling tags, unescape entities, and trim per-line trailing whitespace. This lets the comparator separate cosmetic differences from real text changes, adding two classifications:
- formatting_change: cues differ only in tags/entities, not text.
- whitespace_change: cues differ only in CEA-608 trailing padding.

Parsers now preserve raw cue text (only surrounding blank lines are dropped) so the comparator, not the parser, decides what is cosmetic. Verified against a real CCExtractor CEA-608 sample.

- timing_drift: detect a growing (non-constant) offset across cues, the signature of a progressive sync bug, distinct from a constant timing_shift.
- split_cues / merged_cues: when cue count changes but the text content is unchanged, report re-segmentation instead of missing/extra cues.
- Vendor a real CCExtractor CEA-608 sample (tests/.../fixtures/cea608_real.srt) and add golden-fixture tests so the diff is exercised on true output: identical, constant shift, and cosmetic de-padding.
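
The difference between a constant shift and a drift comes down to how the per-cue start offsets behave. A minimal sketch, assuming the cues are already aligned one-to-one; `classify_timing`, the `tolerance_ms` parameter, and the `first_ms`/`last_ms` keys are hypothetical, and only `offset_ms` is a name taken from the commit messages.

```python
def classify_timing(expected_starts, actual_starts, tolerance_ms=0):
    """Classify aligned cue start times (in ms) by the shape of their offsets."""
    if len(expected_starts) != len(actual_starts):
        raise ValueError("cues must be aligned one-to-one before comparing timing")

    offsets = [actual - expected for expected, actual in zip(expected_starts, actual_starts)]
    if not offsets or all(abs(offset) <= tolerance_ms for offset in offsets):
        return {"kind": "identical"}

    # Constant offset: every cue is late (or early) by the same amount.
    if max(offsets) - min(offsets) <= tolerance_ms:
        return {"kind": "timing_shift", "offset_ms": offsets[0]}

    # Growing offset: the error never shrinks from one cue to the next.
    magnitudes = [abs(offset) for offset in offsets]
    if all(later >= earlier for earlier, later in zip(magnitudes, magnitudes[1:])):
        return {"kind": "timing_drift", "first_ms": offsets[0], "last_ms": offsets[-1]}

    return {"kind": "mixed"}
```

With the description's example of cues that all arrive 500 ms late, every offset is 500, so the spread is zero and the result is a timing_shift with offset_ms 500.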

Add ascii_fold() and an 'encoding' text category so the comparator can tell a charset difference (e.g. CCExtractor's -latin1 output: 'Voilà' vs 'Voila') from a real word change. Surfaced as a new 'encoding_change' classification.
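
One plausible way to implement accent folding uses Unicode decomposition from the standard library. This is a sketch, not the PR's ascii_fold(): the NFKD approach and the helper `is_encoding_only` are assumptions. The non-empty check reflects the fix described in a later commit of this PR, where two different non-Latin texts must classify as a text change.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Reduce text to an ASCII skeleton: 'Voilà' becomes 'Voila'."""
    # NFKD splits accented letters into a base letter plus combining marks;
    # dropping everything outside ASCII then removes the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if ord(char) < 128)


def is_encoding_only(expected: str, actual: str) -> bool:
    """True when two texts differ only by accents or other non-ASCII detail."""
    if expected == actual:
        return False
    skeleton = ascii_fold(expected)
    # An empty skeleton (e.g. pure CJK or Cyrillic text) proves nothing:
    # two unrelated strings would both fold to "" and look "equal".
    return bool(skeleton.strip()) and skeleton == ascii_fold(actual)
```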

…s tests

- Vendor dvb_spanish_real.srt: a genuine CCExtractor DVB Spanish output with <font> colour tags and accented text. Security-scanned before vendoring (no paths/IPs/emails/URLs/secrets) and verified valid UTF-8.
- Strict fixture tests assert exact kinds and values on real output: identical, timing_shift (offset 500), formatting_change (font tags), encoding_change (accent folding), missing_cues.
- Robustness tests: malformed/empty/control-byte/garbage input must classify cleanly and never raise.

Note: the available "Chinese" DVB samples were failed OCR (no real CJK, and invalid UTF-8), so they were deliberately not vendored.

… the UI

- TestResultFile.generate_smart_diff(): reads the expected/actual output files (reusing the encoding-tolerant read_lines) and returns a semantic classification via smart_diff.
- New JSON endpoint GET /diff/<test>/<regression>/<output>/smart, reusable by the web UI, the CLI, and agents. Returns 'unavailable' gracefully if the output files are not on disk.
- Result page: a "Smart" link next to each "Fail" diff link opens a small popup with the difference kind + summary (additive, opt-in).

Includes a unit test of the model glue against real on-disk files.

…offset) smart_diff now returns a capped 'changes' list alongside the verdict: each changed cue with its kind, a per-cue timing offset, and (for text changes) expected/actual snippets. This gives an agent the structured detail to act on without scraping the raw HTML diff. The web "Smart" popup lists these changes (HTML-escaped). Result shape stays backward compatible (additive).

In the result cell, place the Smart link below the Fail link separated by a thin horizontal rule, instead of inline, so the two diff actions read clearly.

- compare: zero-cue outputs (non-srt/vtt formats) no longer report 'identical' when they differ; they return a new 'unsupported' kind so a real failure isn't masked. Equal cue-less outputs still report 'identical'.
- normalize: require a non-empty ASCII skeleton for 'encoding', so two different non-Latin texts (CJK/Cyrillic) classify as 'text', not 'encoding'.
- controllers: smart_diff_view also catches UnicodeDecodeError (outputs with bytes invalid in both utf-8 and cp1252) and returns 'unavailable' instead of a 500.
- compare: drop the misleading "(timing aligned)" from the text_change summary (per-cue offsets are still in `changes`); compute _content(exp) once.
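
The zero-cue rule in the first bullet amounts to a small guard that runs before any cue-level comparison. A sketch under assumptions: the function name and signature are hypothetical, and it only covers the case where neither side produced cues.

```python
def classify_cueless(expected_raw, actual_raw, expected_cues, actual_cues):
    """Handle outputs where no cues were parsed; return None if cues exist."""
    if expected_cues or actual_cues:
        return None  # at least one side parsed, so normal cue comparison applies
    # With nothing parsed there is no semantic evidence either way, so fall
    # back to the raw content: equal is 'identical', anything else must not
    # be reported as a match.
    return "identical" if expected_raw == actual_raw else "unsupported"
```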

sonarqubecloud · 2026-06-30T19:39:13Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gaurav-dev02 added 10 commits June 25, 2026 02:19

style(smartdiff): stack Fail / Smart vertically with a divider

8c1f731


gaurav02081 requested review from canihavesomecoffee and thealphadollar as code owners June 30, 2026 19:37

Merge branch 'master' into feat/smart-diff

ba8d4f0

gaurav02081 wants to merge 11 commits into CCExtractor:master from gaurav02081:feat/smart-diff


2 participants
