feat: smart (semantic) diff for subtitle regression outputs #1142
Open
gaurav02081 wants to merge 11 commits into
Add a Flask-decoupled smart-diff module that classifies *how* two subtitle outputs differ instead of producing a raw line diff:
- srt.py: parse SubRip content into structured cues (BOM/CRLF tolerant).
- compare.py: align cues and classify as identical, timing_shift (with a consistent offset_ms), text_change, missing_cues, extra_cues, or mixed, with an agent-actionable one-line summary.
Includes unit tests for the parser and every classification branch.
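As a rough illustration of what a BOM/CRLF-tolerant SubRip parser could look like, here is a minimal sketch. The `Cue` dataclass, the `parse_srt` name, and the block-splitting strategy are assumptions made for this example and are not taken from the PR's srt.py.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    # Hypothetical cue structure; the PR's own type may differ.
    index: int
    start_ms: int
    end_ms: int
    text: str


_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(content: str) -> List[Cue]:
    # Drop a leading BOM and normalise CRLF / CR line endings to LF.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues: List[Cue] = []
    for block in re.split(r"\n{2,}", content.strip("\n")):
        lines = block.split("\n")
        # Locate the timing line; blocks without one are skipped, not fatal.
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                break
        else:
            continue
        g = match.groups()
        text = "\n".join(lines[i + 1:])
        cues.append(Cue(len(cues) + 1, _to_ms(*g[:4]), _to_ms(*g[4:]), text))
    return cues
```

Blocks are renumbered sequentially rather than trusting the index in the file, which keeps later alignment simple when the input numbering is damaged.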
- vtt.py: parse WebVTT into cues (skips WEBVTT/NOTE/STYLE/REGION blocks, handles optional hours and trailing cue settings).
- parsing.py: parse_subtitles() picks the parser by explicit hint or by auto-detecting the format from content.
- compare.smart_diff() now takes an optional fmt and works across SRT/VTT.
Adds parser tests for WebVTT and a cross-format auto-detect compare test.
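The hint-or-auto-detect dispatch described above might be sketched as follows. `detect_format` and `parse_with_hint` are hypothetical names, the parser table is passed in explicitly for the sake of a self-contained example, and the "starts with WEBVTT" heuristic is an assumption rather than the PR's actual detection rule.

```python
from typing import Callable, Dict, List, Optional


def detect_format(content: str) -> str:
    # A WebVTT file must begin with the "WEBVTT" signature (after an optional BOM).
    head = content.lstrip("\ufeff \t\r\n")
    return "vtt" if head.startswith("WEBVTT") else "srt"


def parse_with_hint(
    content: str,
    parsers: Dict[str, Callable[[str], List]],
    fmt: Optional[str] = None,
) -> List:
    # An explicit hint wins; otherwise fall back to content sniffing.
    chosen = (fmt or detect_format(content)).lower()
    if chosen not in parsers:
        raise ValueError(f"unsupported subtitle format: {chosen}")
    return parsers[chosen](content)
```

Keeping detection separate from dispatch lets a caller that already knows the format bypass the heuristic entirely.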
…inds
Mirror CCExtractor's own expected-output handling (tests/extract_expected.py): strip HTML/styling tags, unescape entities, and trim per-line trailing whitespace. This lets the comparator separate cosmetic differences from real text changes, adding two classifications:
- formatting_change: cues differ only in tags/entities, not text.
- whitespace_change: cues differ only in CEA-608 trailing padding.
Parsers now preserve raw cue text (only surrounding blank lines are dropped) so the comparator, not the parser, decides what is cosmetic. Verified against a real CCExtractor CEA-608 sample.
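One way to separate cosmetic differences from real text changes is to compare the two texts under progressively stronger normalization. The sketch below assumes a simple regex for tags and uses hypothetical helper names (`strip_padding`, `strip_markup`, `text_category`); it does not reproduce the logic of tests/extract_expected.py.

```python
import html
import re

_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_padding(text: str) -> str:
    # Trim trailing whitespace on every line (e.g. CEA-608 column padding).
    return "\n".join(line.rstrip() for line in text.split("\n"))


def strip_markup(text: str) -> str:
    # Remove styling tags first, then decode entities such as &amp;.
    return html.unescape(_TAG.sub("", text))


def text_category(expected: str, actual: str) -> str:
    if expected == actual:
        return "identical"
    if strip_padding(expected) == strip_padding(actual):
        return "whitespace"
    if strip_padding(strip_markup(expected)) == strip_padding(strip_markup(actual)):
        return "formatting"
    return "text"
```

The checks run from weakest to strongest normalization, so a pair is labelled with the mildest explanation that accounts for the difference.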
- timing_drift: detect a growing (non-constant) offset across cues, the signature of a progressive sync bug, distinct from a constant timing_shift.
- split_cues / merged_cues: when cue count changes but the text content is unchanged, report re-segmentation instead of missing/extra cues.
- Vendor a real CCExtractor CEA-608 sample (tests/.../fixtures/cea608_real.srt) and add golden-fixture tests so the diff is exercised on true output: identical, constant shift, and cosmetic de-padding.
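The distinction between a constant shift and a growing drift can be illustrated by looking at per-cue start-time offsets. This is a simplified sketch: the function name, the 1 ms tolerance, and the "magnitude never shrinks" test for drift are assumptions, and it presumes the cues are already aligned one-to-one.

```python
from typing import List, Optional, Tuple


def classify_timing(
    expected_starts: List[int],
    actual_starts: List[int],
    tolerance_ms: int = 1,
) -> Tuple[str, Optional[int]]:
    # Per-cue offset in milliseconds: positive means the actual cue is late.
    offsets = [a - e for e, a in zip(expected_starts, actual_starts)]
    if not offsets or all(abs(o) <= tolerance_ms for o in offsets):
        return "identical", 0
    # Constant shift: every cue is off by (nearly) the same amount.
    if max(offsets) - min(offsets) <= tolerance_ms:
        return "timing_shift", offsets[0]
    # Drift: the magnitude of the offset never shrinks from cue to cue.
    if all(abs(b) >= abs(a) for a, b in zip(offsets, offsets[1:])):
        return "timing_drift", offsets[-1]
    return "mixed", None
```

For example, starts of 0, 1000, 2000 ms against 500, 1500, 2500 ms yield a `timing_shift` of 500, whereas 0, 1100, 2200 ms yield a `timing_drift` ending at 200.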
Add ascii_fold() and an 'encoding' text category so the comparator can tell a charset difference (e.g. CCExtractor's -latin1 output: 'Voilà' vs 'Voila') from a real word change. Surfaced as a new 'encoding_change' classification.
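Accent folding of this kind is commonly done with Unicode decomposition. The sketch below is one possible implementation and may not match the PR's `ascii_fold()`; the `differs_only_by_encoding` helper is hypothetical and includes a non-empty-skeleton guard so that two unrelated non-Latin strings are not mistaken for an encoding difference.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    # Decompose accented letters (à -> a + combining grave), drop the
    # combining marks, then discard anything still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def differs_only_by_encoding(expected: str, actual: str) -> bool:
    if expected == actual:
        return False
    folded_expected = ascii_fold(expected)
    folded_actual = ascii_fold(actual)
    # Require a non-empty ASCII skeleton: CJK or Cyrillic text folds to "",
    # and two empty skeletons say nothing about whether the texts match.
    return bool(folded_expected.strip()) and folded_expected == folded_actual
```

With this sketch, `ascii_fold("Voilà")` returns `"Voila"`, so the pair from the commit message compares equal after folding.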
…s tests
- Vendor dvb_spanish_real.srt: a genuine CCExtractor DVB Spanish output with <font> colour tags and accented text. Security-scanned before vendoring (no paths/IPs/emails/URLs/secrets) and verified valid UTF-8.
- Strict fixture tests assert exact kinds and values on real output: identical, timing_shift (offset 500), formatting_change (font tags), encoding_change (accent folding), missing_cues.
- Robustness tests: malformed/empty/control-byte/garbage input must classify cleanly and never raise.
Note: the available "Chinese" DVB samples were failed OCR (no real CJK, and invalid UTF-8), so they were deliberately not vendored.
… the UI
- TestResultFile.generate_smart_diff(): reads the expected/actual output files (reusing the encoding-tolerant read_lines) and returns a semantic classification via smart_diff.
- New JSON endpoint GET /diff/<test>/<regression>/<output>/smart, reusable by the web UI, the CLI, and agents. Returns 'unavailable' gracefully if the output files are not on disk.
- Result page: a "Smart" link next to each "Fail" diff link opens a small popup with the difference kind + summary (additive, opt-in).
Includes a unit test of the model glue against real on-disk files.
…offset) smart_diff now returns a capped 'changes' list alongside the verdict: each changed cue with its kind, a per-cue timing offset, and (for text changes) expected/actual snippets. This gives an agent the structured detail to act on without scraping the raw HTML diff. The web "Smart" popup lists these changes (HTML-escaped). Result shape stays backward compatible (additive).
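To make the idea of a capped, structured `changes` list concrete, here is a small sketch. The field names (`cue`, `kind`, `offset_ms`, `expected`, `actual`), the cap of 50, and the 80-character snippet length are all assumptions for illustration; the PR's actual result shape is not shown in this conversation.

```python
from typing import Dict, List, Tuple

MAX_CHANGES = 50   # hypothetical cap on reported changes
SNIPPET_LEN = 80   # hypothetical snippet length


def build_changes(
    expected: List[Tuple[int, str]],
    actual: List[Tuple[int, str]],
    cap: int = MAX_CHANGES,
) -> List[Dict]:
    """Compare aligned (start_ms, text) pairs and list the cues that differ."""
    changes: List[Dict] = []
    pairs = zip(expected, actual)
    for number, ((e_start, e_text), (a_start, a_text)) in enumerate(pairs, start=1):
        offset = a_start - e_start
        if e_text != a_text:
            entry = {
                "cue": number,
                "kind": "text_change",
                "offset_ms": offset,
                "expected": e_text[:SNIPPET_LEN],
                "actual": a_text[:SNIPPET_LEN],
            }
        elif offset != 0:
            entry = {"cue": number, "kind": "timing_shift", "offset_ms": offset}
        else:
            continue  # unchanged cue: nothing to report
        changes.append(entry)
        if len(changes) >= cap:
            break  # keep the payload bounded for large regressions
    return changes
```

Capping the list keeps the JSON response small even when every cue in a long file has changed, while the top-level verdict still describes the whole file.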
In the result cell, place the Smart link below the Fail link separated by a thin horizontal rule, instead of inline, so the two diff actions read clearly.
- compare: zero-cue outputs (non-srt/vtt formats) no longer report 'identical' when they differ; they return a new 'unsupported' kind so a real failure isn't masked. Equal cue-less outputs still report 'identical'.
- normalize: require a non-empty ASCII skeleton for 'encoding', so two different non-Latin texts (CJK/Cyrillic) classify as 'text', not 'encoding'.
- controllers: smart_diff_view also catches UnicodeDecodeError (outputs with bytes invalid in both utf-8 and cp1252) and returns 'unavailable' instead of a 500.
- compare: drop the misleading "(timing aligned)" from the text_change summary (per-cue offsets are still in `changes`); compute _content(exp) once.
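The zero-cue guard from the first bullet can be expressed as a small pre-check that runs before the normal classifier. The function name and the convention of returning `None` to mean "defer to the regular classification" are assumptions made for this sketch.

```python
from typing import List, Optional


def classify_cueless(
    expected_raw: str,
    actual_raw: str,
    expected_cues: List,
    actual_cues: List,
) -> Optional[str]:
    # If either side parsed into cues, the normal classifier applies.
    if expected_cues or actual_cues:
        return None
    # Neither side produced cues (e.g. a non-SRT/VTT format): only byte-for-byte
    # equality can be trusted, so any difference is reported as unsupported
    # rather than silently passing as identical.
    return "identical" if expected_raw == actual_raw else "unsupported"
```

Without such a check, two different files that both parse to an empty cue list would compare equal cue-for-cue and a real regression would be masked.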

What this is
A semantic diff for regression-test outputs. Instead of a raw line diff, it
classifies how two subtitle outputs differ, so a person — or an AI agent —
gets an actionable verdict instead of a wall of changed lines.
Example: instead of "53 lines differ", it says "All 13 cues match but are
500 ms late."
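A one-line summary of that kind could be produced from the cue count and the detected offset. The helper below is purely illustrative; its name and wording rules are assumptions, not the PR's code.

```python
def summarize_shift(cue_count: int, offset_ms: int) -> str:
    # Positive offsets mean the actual output lags behind the expected one.
    direction = "late" if offset_ms > 0 else "early"
    return f"All {cue_count} cues match but are {abs(offset_ms)} ms {direction}."
```

Calling `summarize_shift(13, 500)` reproduces the sentence quoted above.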
Classification kinds
identical, timing_shift (constant offset), timing_drift (growing offset), text_change, formatting_change (tags/entities only), whitespace_change (CEA-608 padding only),
encoding_change (non-ASCII/accents only), split_cues, merged_cues, missing_cues, extra_cues, unsupported (no cues parsed), or mixed. Each result also carries a per-cue changes list (which cue, what kind, expected/actual snippet, timing offset).
Grounded in CCExtractor's own behaviour
- Mirrors tests/extract_expected.py's normalization (strip styling tags, unescape entities, trim CEA-608 trailing padding) so cosmetic differences are separated from real text changes.
- Handles the -latin1 charset case (accent folding → encoding_change).
- Exercised on real CCExtractor output (a CEA-608 sample and a DVB Spanish sample with <font> tags and accents), vendored as golden fixtures.
Where it lives
- mod_test/smartdiff/: a Flask-decoupled core (SRT + WebVTT parsers, a format dispatcher, normalization, and the classifier). Fully unit-tested.
- TestResultFile.generate_smart_diff() + a JSON endpoint GET /diff/<test>/<regression>/<output>/smart, reusable by the web UI, a CLI, and agents.
- Result page: a "Smart" link next to each "Fail" diff link (additive; the raw diff is unchanged).
Notes
- Tests cover the parsers, every classification branch, the real-output fixtures, robustness (malformed/garbage input), and the model glue.
- sp run diff --smart consuming the same endpoint.