fix(parsing): remove shared scratch tables; parse fully in PHP (#247) by HugoFara · Pull Request #248 · HugoFara/lwt

HugoFara · 2026-07-02T13:54:09Z

Fixes #247 — "Database crashes when importing any text" (InnoDB error 194: Tablespace is missing for a table).

Root cause

Text parsing and vocabulary import used persistent, globally-shared InnoDB scratch tables (temp_word_occurrences, temp_words, tempexprs) that every parse/import TRUNCATEd and refilled. Two problems flowed from that:

The reported crash. On file-per-table InnoDB, TRUNCATE drops and recreates the table's .ibd. When that file goes missing (notably on Windows / managed MySQL), the table is left with a missing tablespace and every subsequent import fails with error 194.
Silent concurrency corruption. Because the tables were shared, two parses/imports at once (two tabs, a feed refresh overlapping a manual import, or any two users in multi-user mode) read and truncated each other's rows.

The fix (two commits)

Phase 1 — session-scoped scratch tables (f0a167496)
Convert the three tables to per-connection CREATE TEMPORARY TABLE. Temporary InnoDB tables live in the shared session temp tablespace (no per-table .ibd to orphan) and are private to the connection, so error 194 can't recur and concurrent parses can't collide. A migration drops the old persistent tables, which also repairs already-orphaned installs on upgrade. The two self-referencing UPDATEs were rewritten via helper temp tables (MySQL can't reopen a TEMPORARY table twice in one statement).

Phase 2 — parse texts fully in PHP (132722b5d)
Retire the reason the scratch tables exist. Tokenization, multi-word detection, and the sentences / word_occurrences inserts now all happen in PHP.

Removed: temp_word_occurrences + tempexprs from parsing entirely; the LOAD DATA LOCAL INFILE path and its two divergent fallbacks; the ~90-line stateful @variable multi-word-detection SQL; TextParsingPersistence and the dead, never-wired ParsingCoordinator.
Added: ParsedToken (a former scratch-table row) and TokenPersistence (inserts sentences → reads back real SeIDs → per-sentence windowed-hash multi-word detection → batch word_occurrences). The tokenizers themselves are unchanged.

Net: +941 / −1719 lines.

Bonus: two pre-existing bugs fixed

Both hit the same LOAD-DATA-less installs (managed DB / Windows) as #247:

The fallback's trim($line) ate the \r sentence marker → every text parsed as one sentence with no spacing.
The @variable detector only found multi-word expressions in the first sentence of space-less (MeCab/CJK) languages.

Verification

Differential harness vs the canonical LOAD DATA output over 14 cases (Latin / CJK char-split / MeCab, 2- & 3-word expressions, overlaps, paragraphs, quotes, decimal numbers, abbreviations, whitespace): 13 byte-identical; the 14th differs only by correctly detecting a multi-word expression the old SQL missed.
MeCab tested end-to-end against a real install; buildTokensFromMecab() also has a synthetic-fixture unit test (no binary needed for CI).
Full PHPUnit suite green; Psalm (cold cache) and PHPCS (PSR-12) clean.
checkText() is now pure — it previously echoed preview HTML despite its docblock saying it doesn't.

Reviewer note

For installs where LOAD DATA LOCAL INFILE was unavailable, re-parsed texts will now split into sentences correctly and detect CJK/Japanese multi-word expressions across all sentences — i.e. this is a behavior fix, not a regression, if you diff old vs new parsed output.

Compatibility

SqlValidator::ALLOWED_TABLES keeps its temp_word_occurrences / temp_words entries so restoring old backups still validates; backup itself uses a separate list that never included them.

https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

temp_word_occurrences, temp_words and tempexprs were persistent, globally-shared InnoDB tables that every parse/import TRUNCATEd and refilled. That caused two problems: - TRUNCATE on file-per-table InnoDB drops and recreates the .ibd. When that file went missing (notably on Windows) the table was left with a missing tablespace and every import crashed with InnoDB error 194 "Tablespace is missing for a table" (issue #247). - Because the tables were shared, concurrent parses/imports read and truncated each other's rows, silently corrupting sentences and word occurrences. Convert all three to per-connection CREATE TEMPORARY TABLE: - Add ScratchTables helper owning the DDL (moved out of baseline.sql). - TextParsing / TextParsingPersistence: ensure()+DELETE instead of TRUNCATE. The two self-referencing UPDATEs (TiSeID and tempexprs realignment) are rewritten via helper temp tables temp_seid_map / temp_sent_map, because MySQL/MariaDB cannot open a TEMPORARY table twice in the same statement. - CompleteImportService: create temp_words in initTempTables; cleanup uses DROP TEMPORARY TABLE so it does not implicitly commit the import transaction. - ParsingCoordinator (unused duplicate path): route through the helper and ensure tempexprs exists before reading it. - Migration 20260702_120000 drops the old persistent tables, which also repairs installs whose tablespace was already orphaned. Temporary InnoDB tables live in the session temp tablespace (no per-table .ibd to orphan) and are private to the connection, so error 194 cannot recur and concurrent parses can no longer collide. SqlValidator::ALLOWED_TABLES entries are kept so old backups still restore; backup uses its own BACKUP_TABLES list that never included these tables. Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

Phase 2 of the #247 scratch-table work. Text parsing no longer uses any scratch table: tokenization, multi-word detection, and the sentence / word_occurrence inserts all happen in PHP. Removed: - temp_word_occurrences and tempexprs from the parse path entirely. - The LOAD DATA LOCAL INFILE path and its two divergent fallbacks (saveWithSql / saveWithSqlFallback and the inline INSERT branch). - The stateful @-variable multi-word-detection SQL (checkExpressions). - TextParsingPersistence and the dead, never-wired ParsingCoordinator. Added: - ParsedToken value object (was a temp_word_occurrences row). - TokenPersistence: inserts sentences (reads back real SeIDs), detects multi-word expressions with a per-sentence windowed hash lookup, and batch-inserts word_occurrences. - StandardTextParser::tokenize()/splitSentences() and JapaneseTextParser::tokenize()/buildTokensFromMecab() return ParsedToken[]; the tokenizers themselves are unchanged. This fixes two pre-existing bugs affecting installs without LOAD DATA LOCAL INFILE (the same managed-DB / Windows population as #247): - saveWithSqlFallback trimmed the "\r" sentence marker, so every text parsed as a single sentence with no spacing. - The @-variable detector only found multi-word expressions in the first sentence of space-less (MeCab/CJK) languages. Verified with a differential harness against the canonical LOAD DATA output over 14 cases (Latin/CJK/MeCab, 2- and 3-word expressions, overlaps, paragraphs, quotes, numbers, abbreviations): 13 byte-identical, the 14th differing only by correctly detecting a multi-word expression the old SQL missed. checkText() is now pure (it previously echoed preview HTML despite its docblock). The Japanese path was tested end-to-end against a real MeCab install, and buildTokensFromMecab() has a unit test using captured MeCab output (no binary needed). Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi

HugoFara added 2 commits July 2, 2026 14:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(parsing): remove shared scratch tables; parse fully in PHP (#247)#248

fix(parsing): remove shared scratch tables; parse fully in PHP (#247)#248
HugoFara wants to merge 2 commits into
mainfrom
fix/scratch-tables-temporary-247

HugoFara commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

HugoFara commented Jul 2, 2026

Root cause

The fix (two commits)

Bonus: two pre-existing bugs fixed

Verification

Reviewer note

Compatibility

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant