feat: smart (semantic) diff for subtitle regression outputs #1142

Open
gaurav02081 wants to merge 11 commits into CCExtractor:master from gaurav02081:feat/smart-diff

@gaurav02081

What this is

A semantic diff for regression-test outputs. Rather than emitting a raw line
diff, it classifies how two subtitle outputs differ, so a person or an AI agent
gets an actionable verdict instead of a wall of changed lines.

Example: instead of "53 lines differ", it says "All 13 cues match but are
500 ms late."

Classification kinds

identical, timing_shift (constant offset), timing_drift (growing offset),
text_change, formatting_change (tags/entities only), whitespace_change
(CEA-608 padding only), encoding_change (non-ASCII/accents only), split_cues,
merged_cues, missing_cues, extra_cues, unsupported (no cues parsed), or
mixed. Each result also carries a per-cue changes list (which cue, what kind,
expected/actual snippet, timing offset).
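
As a rough illustration of how a consumer might act on the verdict, the sketch below buckets the kinds listed above. The kind names come from this PR; the `triage` helper, its bucket labels, and the key names in the sample dict are assumptions made for illustration and are not part of the module.

```python
# Kind names are taken from the list above; the grouping is an assumption.
COSMETIC = {"formatting_change", "whitespace_change", "encoding_change"}
TIMING = {"timing_shift", "timing_drift"}
STRUCTURAL = {"split_cues", "merged_cues", "missing_cues", "extra_cues"}


def triage(result):
    """Map a smart-diff result (hypothetical dict shape) to a coarse bucket."""
    kind = result.get("kind")
    if kind == "identical":
        return "pass"
    if kind in COSMETIC:
        return "cosmetic"
    if kind in TIMING:
        return "timing"
    if kind in STRUCTURAL:
        return "structure"
    # text_change, mixed, unsupported, or anything unknown needs a human look.
    return "review"


# Hypothetical result; the field names are guesses based on this description.
sample = {
    "kind": "timing_shift",
    "offset_ms": 500,
    "summary": "All 13 cues match but are 500 ms late.",
    "changes": [],
}
bucket = triage(sample)
```

Anything that is not clearly cosmetic, timing-related, or structural falls through to "review", so an unknown kind is never silently treated as a pass.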

Grounded in CCExtractor's own behaviour

  • Mirrors tests/extract_expected.py's normalization (strip styling tags,
    unescape entities, trim CEA-608 trailing padding) so cosmetic differences are
    separated from real text changes.
  • Handles the -latin1 charset case (accent folding → encoding_change).
  • Verified on real CCExtractor output (CEA-608 English + DVB Spanish with
    <font> tags and accents), vendored as golden fixtures.

Where it lives

  • mod_test/smartdiff/ — a Flask-decoupled core (SRT + WebVTT parsers, a format
    dispatcher, normalization, and the classifier). Fully unit-tested.
  • TestResultFile.generate_smart_diff() + a JSON endpoint
    GET /diff/<test>/<regression>/<output>/smart — reusable by the web UI, a CLI,
    and agents.
  • A "Smart" option next to the existing "Fail" diff link on the result page
    (additive; the raw diff is unchanged).
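
A minimal client sketch for the JSON endpoint, using only the standard library. The route shape is the one given above; the helper names, the base URL, and the assumption that the endpoint can be fetched without extra authentication are mine, and the response is returned as parsed JSON without assuming its exact fields.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen


def smart_diff_path(test_id, regression_id, output_id):
    """Build the /diff/<test>/<regression>/<output>/smart path."""
    parts = (test_id, regression_id, output_id)
    return "/diff/" + "/".join(quote(str(p), safe="") for p in parts) + "/smart"


def fetch_smart_diff(base_url, test_id, regression_id, output_id, timeout=10):
    """Fetch and decode the smart-diff JSON (hypothetical helper)."""
    url = urljoin(base_url, smart_diff_path(test_id, regression_id, output_id))
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `smart_diff_path(4321, 12, 7)` yields `/diff/4321/12/7/smart`; the IDs here are placeholders.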

Notes

  • Independent of the REST API work — this is standalone in the platform.
  • Tests: parsers, classifier, normalization, real golden fixtures, robustness
    (malformed/garbage input), and the model glue.
  • Follow-up (not in this PR): a CLI command, `sp run diff --smart`, that
    consumes the same endpoint.

Add a Flask-decoupled smart-diff module that classifies *how* two subtitle
outputs differ instead of producing a raw line diff:
- srt.py: parse SubRip content into structured cues (BOM/CRLF tolerant).
- compare.py: align cues and classify as identical, timing_shift (with a
  consistent offset_ms), text_change, missing_cues, extra_cues, or mixed,
  with an agent-actionable one-line summary.

Includes unit tests for the parser and every classification branch.
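
For readers unfamiliar with the format, here is a minimal sketch of BOM/CRLF-tolerant SubRip parsing. It is not the code in srt.py: the `Cue` dataclass, the `parse_srt` name, and the regex are assumptions, and the sketch simply skips blocks that have no timing line.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(content):
    """Parse SubRip text into cues; malformed blocks are ignored, never raised on."""
    # Drop a leading BOM and normalize CRLF / CR line endings.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    for block in re.split(r"\n{2,}", content):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                # Keep the raw cue text; only surrounding blank lines are dropped.
                text = "\n".join(lines[i + 1:]).strip("\n")
                cues.append(Cue(_to_ms(*g[:4]), _to_ms(*g[4:]), text))
                break
    return cues
```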
- vtt.py: parse WebVTT into cues (skips WEBVTT/NOTE/STYLE/REGION blocks,
  handles optional hours and trailing cue settings).
- parsing.py: parse_subtitles() picks the parser by explicit hint or by
  auto-detecting the format from content.
- compare.smart_diff() now takes an optional fmt and works across SRT/VTT.

Adds parser tests for WebVTT and a cross-format auto-detect compare test.
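
The hint-or-detect choice can be sketched as follows. This is an assumption about how such a dispatcher might work, not the contents of parsing.py: `detect_format` and `pick_format` are hypothetical names, and the heuristic relies only on the WebVTT signature line and on SRT's comma-separated millisecond timestamps.

```python
import re

# An SRT timing line uses a comma before the milliseconds.
_SRT_TIMING = re.compile(
    r"^\d+:\d{2}:\d{2},\d{3}\s*-->\s*\d+:\d{2}:\d{2},\d{3}", re.MULTILINE
)


def detect_format(content):
    """Guess 'vtt' or 'srt' from content; return None when neither fits."""
    head = content.lstrip("\ufeff")
    if head.startswith("WEBVTT"):
        return "vtt"
    if _SRT_TIMING.search(head):
        return "srt"
    return None


def pick_format(content, fmt=None):
    """An explicit hint wins; otherwise fall back to auto-detection."""
    return fmt.lower() if fmt else detect_format(content)
```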
…inds

Mirror CCExtractor's own expected-output handling (tests/extract_expected.py):
strip HTML/styling tags, unescape entities, and trim per-line trailing
whitespace. This lets the comparator separate cosmetic differences from real
text changes, adding two classifications:
- formatting_change: cues differ only in tags/entities, not text.
- whitespace_change: cues differ only in CEA-608 trailing padding.

Parsers now preserve raw cue text (only surrounding blank lines are dropped)
so the comparator, not the parser, decides what is cosmetic. Verified against
a real CCExtractor CEA-608 sample.
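
The layered comparison described above might look roughly like this. It is a sketch under stated assumptions: the helper names, the tag regex, and the order of the checks are mine, and the real module (and tests/extract_expected.py) may normalize differently.

```python
import html
import re

# Matches simple opening/closing tags such as <i>, </font>, <font color="...">.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def rstrip_lines(text):
    """Trim trailing whitespace on every line (e.g. CEA-608 padding)."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def strip_markup(text):
    """Remove styling tags first, then unescape HTML entities."""
    return html.unescape(_TAG_RE.sub("", text))


def text_category(expected, actual):
    """Return 'same', 'whitespace', 'formatting', or 'text' for one cue pair."""
    if expected == actual:
        return "same"
    if rstrip_lines(expected) == rstrip_lines(actual):
        return "whitespace"
    if rstrip_lines(strip_markup(expected)) == rstrip_lines(strip_markup(actual)):
        return "formatting"
    return "text"
```

Checking the cheapest normalization first means a pair is reported under the narrowest category that explains the difference.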
- timing_drift: detect a growing (non-constant) offset across cues, the
  signature of a progressive sync bug, distinct from a constant timing_shift.
- split_cues / merged_cues: when cue count changes but the text content is
  unchanged, report re-segmentation instead of missing/extra cues.
- Vendor a real CCExtractor CEA-608 sample (tests/.../fixtures/cea608_real.srt)
  and add golden-fixture tests so the diff is exercised on true output:
  identical, constant shift, and cosmetic de-padding.
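
To make the distinctions above concrete, here is a small sketch of how a constant shift can be told apart from drift, and re-segmentation from missing or extra cues. The function names, the 1 ms tolerance, and the "aligned" and "irregular" labels are assumptions for illustration; the timing helper also assumes the cues are already paired one-to-one.

```python
def classify_timing(expected_starts, actual_starts, tolerance_ms=1):
    """Compare paired cue start times (in ms) and describe the offset pattern."""
    offsets = [a - e for e, a in zip(expected_starts, actual_starts)]
    if not offsets or all(abs(o) <= tolerance_ms for o in offsets):
        return ("aligned", 0)
    if max(offsets) - min(offsets) <= tolerance_ms:
        # Every cue is off by (nearly) the same amount.
        return ("timing_shift", offsets[0])
    steps = [b - a for a, b in zip(offsets, offsets[1:])]
    if all(s >= 0 for s in steps) or all(s <= 0 for s in steps):
        # The offset moves steadily in one direction across the file.
        return ("timing_drift", offsets[-1] - offsets[0])
    return ("irregular", None)


def classify_segmentation(expected_texts, actual_texts):
    """When cue counts differ, decide between re-segmentation and lost/added cues."""
    def content(texts):
        # Join all cue text and collapse whitespace so cue boundaries do not matter.
        return " ".join(" ".join(texts).split())

    if len(expected_texts) == len(actual_texts):
        return None
    more_actual = len(actual_texts) > len(expected_texts)
    if content(expected_texts) == content(actual_texts):
        return "split_cues" if more_actual else "merged_cues"
    return "extra_cues" if more_actual else "missing_cues"
```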
Add ascii_fold() and an 'encoding' text category so the comparator can tell a
charset difference (e.g. CCExtractor's -latin1 output: 'Voilà' vs 'Voila')
from a real word change. Surfaced as a new 'encoding_change' classification.
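
One plausible way to implement accent folding is Unicode decomposition followed by dropping everything outside ASCII. The sketch below is an assumption, not the PR's `ascii_fold()`; `is_encoding_only` is a hypothetical helper, and its non-empty check reflects the guard a later commit in this PR describes for non-Latin text.

```python
import unicodedata


def ascii_fold(text):
    """Decompose accented characters, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def is_encoding_only(expected, actual):
    """True when two texts differ only by characters lost in ASCII folding."""
    if expected == actual:
        return False
    folded_expected = ascii_fold(expected)
    folded_actual = ascii_fold(actual)
    # An empty skeleton (e.g. pure CJK) proves nothing, so it does not count.
    return folded_expected == folded_actual and folded_expected.strip() != ""
```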
…s tests

- Vendor dvb_spanish_real.srt: a genuine CCExtractor DVB Spanish output with
  <font> colour tags and accented text. Security-scanned before vendoring
  (no paths/IPs/emails/URLs/secrets) and verified valid UTF-8.
- Strict fixture tests assert exact kinds and values on real output:
  identical, timing_shift (offset 500), formatting_change (font tags),
  encoding_change (accent folding), missing_cues.
- Robustness tests: malformed/empty/control-byte/garbage input must classify
  cleanly and never raise.

Note: the available "Chinese" DVB samples were failed OCR output (no real CJK
text, and invalid UTF-8), so they were deliberately not vendored.
… the UI

- TestResultFile.generate_smart_diff(): reads the expected/actual output files
  (reusing the encoding-tolerant read_lines) and returns a semantic
  classification via smart_diff.
- New JSON endpoint GET /diff/<test>/<regression>/<output>/smart, reusable by
  the web UI, the CLI, and agents. Returns 'unavailable' gracefully if the
  output files are not on disk.
- Result page: a "Smart" link next to each "Fail" diff link opens a small
  popup with the difference kind + summary (additive, opt-in).

Includes a unit test of the model glue against real on-disk files.
…offset)

smart_diff now returns a capped 'changes' list alongside the verdict: each
changed cue with its kind, a per-cue timing offset, and (for text changes)
expected/actual snippets. This gives an agent the structured detail to act on
without scraping the raw HTML diff. The web "Smart" popup lists these changes
(HTML-escaped). Result shape stays backward compatible (additive).

In the result cell, place the Smart link below the Fail link separated by a
thin horizontal rule, instead of inline, so the two diff actions read clearly.

- compare: zero-cue outputs (non-srt/vtt formats) no longer report 'identical'
  when they differ — return a new 'unsupported' kind so a real failure isn't
  masked; equal cue-less outputs still report 'identical'.
- normalize: require a non-empty ASCII skeleton for 'encoding', so two
  different non-Latin texts (CJK/Cyrillic) classify as 'text', not 'encoding'.
- controllers: smart_diff_view also catches UnicodeDecodeError (outputs with
  bytes invalid in both utf-8 and cp1252) and returns 'unavailable' instead of
  a 500.
- compare: drop the misleading "(timing aligned)" from the text_change summary
  (per-cue offsets are still in `changes`); compute _content(exp) once.
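
The zero-cue rule from the first bullet can be sketched in a few lines. `classify_cueless` is a hypothetical helper, and comparing the raw strings directly is an assumption about how "equal" is decided.

```python
def classify_cueless(expected_raw, actual_raw, expected_cues, actual_cues):
    """Handle the case where neither side produced any parsed cues."""
    if expected_cues or actual_cues:
        return None  # At least one side parsed; the normal comparison applies.
    # With nothing to align, only exact equality can justify 'identical'.
    return "identical" if expected_raw == actual_raw else "unsupported"
```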