fix(parsing): remove shared scratch tables; parse fully in PHP (#247) #248

Open
HugoFara wants to merge 2 commits into main from fix/scratch-tables-temporary-247

HugoFara (Owner) commented Jul 2, 2026

Fixes #247 — "Database crashes when importing any text" (InnoDB error 194: Tablespace is missing for a table).

Root cause

Text parsing and vocabulary import used persistent, globally-shared InnoDB scratch tables (temp_word_occurrences, temp_words, tempexprs) that every parse/import TRUNCATEd and refilled. Two problems flowed from that:

  1. The reported crash. On file-per-table InnoDB, TRUNCATE drops and recreates the table's .ibd. When that file goes missing (notably on Windows / managed MySQL), the table is left with a missing tablespace and every subsequent import fails with error 194.
  2. Silent concurrency corruption. Because the tables were shared, two parses/imports at once (two tabs, a feed refresh overlapping a manual import, or any two users in multi-user mode) read and truncated each other's rows.

The fix (two commits)

Phase 1 — session-scoped scratch tables (f0a167496)
Convert the three tables to per-connection CREATE TEMPORARY TABLE. Temporary InnoDB tables live in the shared session temp tablespace (no per-table .ibd to orphan) and are private to the connection, so error 194 can't recur and concurrent parses can't collide. A migration drops the old persistent tables, which also repairs already-orphaned installs on upgrade. The two self-referencing UPDATEs were rewritten via helper temp tables (MySQL can't reopen a TEMPORARY table twice in one statement).
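A rough sketch of the helper-table rewrite described above, not the PR's actual SQL. It assumes a hypothetical renumbering rule (shift every TiSeID so the smallest becomes a given first ID); only the table names temp_word_occurrences and temp_seid_map and the TiSeID column come from this PR. A form such as `UPDATE t CROSS JOIN (SELECT MIN(TiSeID) ... FROM t) m SET ...` reads its own target a second time, which MySQL rejects for TEMPORARY tables, so the aggregate is first snapshotted into a second temp table.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: builds the statements that replace one self-referencing
 * UPDATE on a TEMPORARY table. The function name and the renumbering rule
 * are hypothetical.
 *
 * @return list<string>
 */
function renumberSentenceIdsSql(int $firstSeId): array
{
    return [
        // Start from a clean helper table on this connection.
        'DROP TEMPORARY TABLE IF EXISTS temp_seid_map',
        // Snapshot the aggregate so the scratch table is opened only once
        // per statement.
        'CREATE TEMPORARY TABLE temp_seid_map AS '
            . 'SELECT MIN(TiSeID) AS min_seid FROM temp_word_occurrences',
        // The UPDATE joins the helper instead of re-reading its own target.
        'UPDATE temp_word_occurrences t CROSS JOIN temp_seid_map m '
            . 'SET t.TiSeID = t.TiSeID - m.min_seid + ' . $firstSeId,
        // DROP TEMPORARY TABLE does not implicitly commit a transaction.
        'DROP TEMPORARY TABLE temp_seid_map',
    ];
}
```

Each statement would be executed in order on the same connection, since temporary tables are only visible to the session that created them.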

Phase 2 — parse texts fully in PHP (132722b5d)
Retire the reason the scratch tables exist. Tokenization, multi-word detection, and the sentences / word_occurrences inserts now all happen in PHP.

  • Removed: temp_word_occurrences + tempexprs from parsing entirely; the LOAD DATA LOCAL INFILE path and its two divergent fallbacks; the ~90-line stateful @variable multi-word-detection SQL; TextParsingPersistence and the dead, never-wired ParsingCoordinator.
  • Added: ParsedToken (a value object replacing what used to be a temp_word_occurrences row) and TokenPersistence (inserts sentences → reads back real SeIDs → per-sentence windowed-hash multi-word detection → batch word_occurrences). The tokenizers themselves are unchanged.
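The per-sentence windowed hash lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in TokenPersistence: the ParsedToken fields, the detectExpressions() function, and the shape of the expression set are all hypothetical, and non-word tokens are simply skipped. It assumes PHP 8.1+ for readonly promoted properties.

```php
<?php
declare(strict_types=1);

/** Hypothetical field layout; the real ParsedToken is not shown in this PR text. */
final class ParsedToken
{
    public function __construct(
        public readonly int $sentence, // sentence number within the text
        public readonly int $order,    // token position within the text
        public readonly string $text,
        public readonly bool $isWord,
    ) {
    }
}

/**
 * Slides a window of up to $maxWords words over each sentence and looks
 * every candidate up in a hash set of known expressions.
 *
 * @param list<ParsedToken>  $tokens
 * @param array<string,bool> $expressionSet lower-cased expression text => true
 * @return list<array{sentence:int, order:int, text:string, words:int}>
 */
function detectExpressions(array $tokens, array $expressionSet, int $maxWords, string $glue = ' '): array
{
    // Group word tokens by sentence so a window never crosses a boundary.
    $bySentence = [];
    foreach ($tokens as $token) {
        if ($token->isWord) {
            $bySentence[$token->sentence][] = $token;
        }
    }

    $found = [];
    foreach ($bySentence as $sentence => $words) {
        $count = count($words);
        for ($start = 0; $start < $count; $start++) {
            $window = [];
            for ($len = 1; $len <= $maxWords && $start + $len <= $count; $len++) {
                // ASCII-only case folding keeps the sketch dependency-free.
                $window[] = strtolower($words[$start + $len - 1]->text);
                $candidate = implode($glue, $window);
                if ($len >= 2 && isset($expressionSet[$candidate])) {
                    $found[] = [
                        'sentence' => $sentence,
                        'order' => $words[$start]->order,
                        'text' => $candidate,
                        'words' => $len,
                    ];
                }
            }
        }
    }
    return $found;
}

// Two sentences: "I give up" and "Never give up".
$tokens = [
    new ParsedToken(1, 0, 'I', true),
    new ParsedToken(1, 1, 'give', true),
    new ParsedToken(1, 2, 'up', true),
    new ParsedToken(2, 3, 'Never', true),
    new ParsedToken(2, 4, 'give', true),
    new ParsedToken(2, 5, 'up', true),
];
$hits = detectExpressions($tokens, ['give up' => true, 'up never' => true], 2);
```

Here "give up" is found once in each sentence, while "up never" is not reported because its two words sit in different sentences. For space-less languages the same sketch would be called with an empty `$glue`.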

Net: +941 / −1719 lines.

Bonus: two pre-existing bugs fixed

Both hit the same LOAD-DATA-less installs (managed DB / Windows) as #247:

  1. The fallback's trim($line) ate the \r sentence marker → every text parsed as one sentence with no spacing.
  2. The @variable detector only found multi-word expressions in the first sentence of space-less (MeCab/CJK) languages.
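The first bug comes down to PHP's default trim() character list, which includes "\r". A minimal sketch, assuming a hypothetical input line whose only content is the "\r" sentence marker followed by a newline (the real line format is not shown in this PR text, and both function names are made up for illustration):

```php
<?php
declare(strict_types=1);

// Lossy: trim() with its default character list strips spaces, tabs,
// "\n", "\r", "\0" and "\x0B" from both ends, so a "\r" marker vanishes.
function stripLineLossy(string $line): string
{
    return trim($line);
}

// Safer: remove only the trailing newline and leave the marker alone.
function stripLineKeepingMarker(string $line): string
{
    return rtrim($line, "\n");
}

$line = "\r\n"; // hypothetical: marker followed by the line terminator

var_dump(stripLineLossy($line) === '');           // marker lost
var_dump(stripLineKeepingMarker($line) === "\r"); // marker preserved
```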

Verification

  • Differential harness vs the canonical LOAD DATA output over 14 cases (Latin / CJK char-split / MeCab, 2- & 3-word expressions, overlaps, paragraphs, quotes, decimal numbers, abbreviations, whitespace): 13 byte-identical; the 14th differs only by correctly detecting a multi-word expression the old SQL missed.
  • MeCab tested end-to-end against a real install; buildTokensFromMecab() also has a synthetic-fixture unit test (no binary needed for CI).
  • Full PHPUnit suite green; Psalm (cold cache) and PHPCS (PSR-12) clean.
  • checkText() is now pure — it previously echoed preview HTML despite its docblock saying it doesn't.
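The differential approach in the first bullet can be sketched like this. It is an assumption-laden illustration, not the harness used for this PR: differentialRun() and the two toy closures are hypothetical stand-ins for the old LOAD DATA path and the new PHP path.

```php
<?php
declare(strict_types=1);

/**
 * Runs every case through a reference and a candidate implementation and
 * returns only the cases whose serialized outputs are not byte-identical.
 *
 * @param array<string,string>   $cases     case name => input text
 * @param callable(string):array $reference stand-in for the old parser
 * @param callable(string):array $candidate stand-in for the new parser
 * @return array<string,array{reference:string, candidate:string}>
 */
function differentialRun(array $cases, callable $reference, callable $candidate): array
{
    $diffs = [];
    foreach ($cases as $name => $input) {
        $expected = serialize($reference($input));
        $actual = serialize($candidate($input));
        if ($expected !== $actual) {
            $diffs[$name] = ['reference' => $expected, 'candidate' => $actual];
        }
    }
    return $diffs;
}

// Toy implementations that disagree only on upper-case input.
$old = fn (string $s): array => explode(' ', $s);
$new = fn (string $s): array => explode(' ', strtolower($s));

$diffs = differentialRun(['lower' => 'a b', 'upper' => 'A b'], $old, $new);
// Only the "upper" case is reported as differing.
```

Each reported difference then has to be judged by hand as either a regression or, as with the 14th case here, an intentional fix.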

Reviewer note

For installs where LOAD DATA LOCAL INFILE was unavailable, re-parsed texts will now split into sentences correctly and detect CJK/Japanese multi-word expressions across all sentences — i.e. this is a behavior fix, not a regression, if you diff old vs new parsed output.

Compatibility

SqlValidator::ALLOWED_TABLES keeps its temp_word_occurrences / temp_words entries so restoring old backups still validates; backup itself uses a separate list that never included them.

https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

HugoFara added 2 commits July 2, 2026 14:17

Commit 1 (Phase 1, f0a167496):
temp_word_occurrences, temp_words and tempexprs were persistent,
globally-shared InnoDB tables that every parse/import TRUNCATEd and
refilled. That caused two problems:

- TRUNCATE on file-per-table InnoDB drops and recreates the .ibd. When
  that file went missing (notably on Windows) the table was left with a
  missing tablespace and every import crashed with InnoDB error 194
  "Tablespace is missing for a table" (issue #247).
- Because the tables were shared, concurrent parses/imports read and
  truncated each other's rows, silently corrupting sentences and word
  occurrences.

Convert all three to per-connection CREATE TEMPORARY TABLE:

- Add ScratchTables helper owning the DDL (moved out of baseline.sql).
- TextParsing / TextParsingPersistence: ensure()+DELETE instead of
  TRUNCATE. The two self-referencing UPDATEs (TiSeID and tempexprs
  realignment) are rewritten via helper temp tables temp_seid_map /
  temp_sent_map, because MySQL/MariaDB cannot open a TEMPORARY table
  twice in the same statement.
- CompleteImportService: create temp_words in initTempTables; cleanup
  uses DROP TEMPORARY TABLE so it does not implicitly commit the import
  transaction.
- ParsingCoordinator (unused duplicate path): route through the helper
  and ensure tempexprs exists before reading it.
- Migration 20260702_120000 drops the old persistent tables, which also
  repairs installs whose tablespace was already orphaned.

Temporary InnoDB tables live in the session temp tablespace (no
per-table .ibd to orphan) and are private to the connection, so error
194 cannot recur and concurrent parses can no longer collide.

SqlValidator::ALLOWED_TABLES entries are kept so old backups still
restore; backup uses its own BACKUP_TABLES list that never included
these tables.

Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

Commit 2 (Phase 2, 132722b5d):
Phase 2 of the #247 scratch-table work. Text parsing no longer uses any
scratch table: tokenization, multi-word detection, and the sentence /
word_occurrence inserts all happen in PHP.

Removed:
- temp_word_occurrences and tempexprs from the parse path entirely.
- The LOAD DATA LOCAL INFILE path and its two divergent fallbacks
  (saveWithSql / saveWithSqlFallback and the inline INSERT branch).
- The stateful @-variable multi-word-detection SQL (checkExpressions).
- TextParsingPersistence and the dead, never-wired ParsingCoordinator.

Added:
- ParsedToken value object (was a temp_word_occurrences row).
- TokenPersistence: inserts sentences (reads back real SeIDs), detects
  multi-word expressions with a per-sentence windowed hash lookup, and
  batch-inserts word_occurrences.
- StandardTextParser::tokenize()/splitSentences() and
  JapaneseTextParser::tokenize()/buildTokensFromMecab() return
  ParsedToken[]; the tokenizers themselves are unchanged.

This fixes two pre-existing bugs affecting installs without LOAD DATA
LOCAL INFILE (the same managed-DB / Windows population as #247):
- saveWithSqlFallback trimmed the "\r" sentence marker, so every text
  parsed as a single sentence with no spacing.
- The @-variable detector only found multi-word expressions in the first
  sentence of space-less (MeCab/CJK) languages.

Verified with a differential harness against the canonical LOAD DATA
output over 14 cases (Latin/CJK/MeCab, 2- and 3-word expressions,
overlaps, paragraphs, quotes, numbers, abbreviations): 13 byte-identical,
the 14th differing only by correctly detecting a multi-word expression the
old SQL missed. checkText() is now pure (it previously echoed preview HTML
despite its docblock). The Japanese path was tested end-to-end against a
real MeCab install, and buildTokensFromMecab() has a unit test using
captured MeCab output (no binary needed).

Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

Successfully merging this pull request may close these issues.

Database crashs when importing any type of text
