fix(parsing): remove shared scratch tables; parse fully in PHP (#247)#248
Open
HugoFara wants to merge 2 commits into
Open
fix(parsing): remove shared scratch tables; parse fully in PHP (#247)#248HugoFara wants to merge 2 commits into
HugoFara wants to merge 2 commits into
Conversation
temp_word_occurrences, temp_words and tempexprs were persistent, globally-shared InnoDB tables that every parse/import TRUNCATEd and refilled. That caused two problems: - TRUNCATE on file-per-table InnoDB drops and recreates the .ibd. When that file went missing (notably on Windows) the table was left with a missing tablespace and every import crashed with InnoDB error 194 "Tablespace is missing for a table" (issue #247). - Because the tables were shared, concurrent parses/imports read and truncated each other's rows, silently corrupting sentences and word occurrences. Convert all three to per-connection CREATE TEMPORARY TABLE: - Add ScratchTables helper owning the DDL (moved out of baseline.sql). - TextParsing / TextParsingPersistence: ensure()+DELETE instead of TRUNCATE. The two self-referencing UPDATEs (TiSeID and tempexprs realignment) are rewritten via helper temp tables temp_seid_map / temp_sent_map, because MySQL/MariaDB cannot open a TEMPORARY table twice in the same statement. - CompleteImportService: create temp_words in initTempTables; cleanup uses DROP TEMPORARY TABLE so it does not implicitly commit the import transaction. - ParsingCoordinator (unused duplicate path): route through the helper and ensure tempexprs exists before reading it. - Migration 20260702_120000 drops the old persistent tables, which also repairs installs whose tablespace was already orphaned. Temporary InnoDB tables live in the session temp tablespace (no per-table .ibd to orphan) and are private to the connection, so error 194 cannot recur and concurrent parses can no longer collide. SqlValidator::ALLOWED_TABLES entries are kept so old backups still restore; backup uses its own BACKUP_TABLES list that never included these tables. Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi
Phase 2 of the #247 scratch-table work. Text parsing no longer uses any scratch table: tokenization, multi-word detection, and the sentence / word_occurrence inserts all happen in PHP. Removed: - temp_word_occurrences and tempexprs from the parse path entirely. - The LOAD DATA LOCAL INFILE path and its two divergent fallbacks (saveWithSql / saveWithSqlFallback and the inline INSERT branch). - The stateful @-variable multi-word-detection SQL (checkExpressions). - TextParsingPersistence and the dead, never-wired ParsingCoordinator. Added: - ParsedToken value object (was a temp_word_occurrences row). - TokenPersistence: inserts sentences (reads back real SeIDs), detects multi-word expressions with a per-sentence windowed hash lookup, and batch-inserts word_occurrences. - StandardTextParser::tokenize()/splitSentences() and JapaneseTextParser::tokenize()/buildTokensFromMecab() return ParsedToken[]; the tokenizers themselves are unchanged. This fixes two pre-existing bugs affecting installs without LOAD DATA LOCAL INFILE (the same managed-DB / Windows population as #247): - saveWithSqlFallback trimmed the "\r" sentence marker, so every text parsed as a single sentence with no spacing. - The @-variable detector only found multi-word expressions in the first sentence of space-less (MeCab/CJK) languages. Verified with a differential harness against the canonical LOAD DATA output over 14 cases (Latin/CJK/MeCab, 2- and 3-word expressions, overlaps, paragraphs, quotes, numbers, abbreviations): 13 byte-identical, the 14th differing only by correctly detecting a multi-word expression the old SQL missed. checkText() is now pure (it previously echoed preview HTML despite its docblock). The Japanese path was tested end-to-end against a real MeCab install, and buildTokensFromMecab() has a unit test using captured MeCab output (no binary needed). Claude-Session: https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #247 — "Database crashes when importing any text" (
InnoDB error 194: Tablespace is missing for a table).Root cause
Text parsing and vocabulary import used persistent, globally-shared InnoDB scratch tables (
temp_word_occurrences,temp_words,tempexprs) that every parse/importTRUNCATEd and refilled. Two problems flowed from that:TRUNCATEdrops and recreates the table's.ibd. When that file goes missing (notably on Windows / managed MySQL), the table is left with a missing tablespace and every subsequent import fails with error 194.The fix (two commits)
Phase 1 — session-scoped scratch tables (
f0a167496)Convert the three tables to per-connection
CREATE TEMPORARY TABLE. Temporary InnoDB tables live in the shared session temp tablespace (no per-table.ibdto orphan) and are private to the connection, so error 194 can't recur and concurrent parses can't collide. A migration drops the old persistent tables, which also repairs already-orphaned installs on upgrade. The two self-referencingUPDATEs were rewritten via helper temp tables (MySQL can't reopen a TEMPORARY table twice in one statement).Phase 2 — parse texts fully in PHP (
132722b5d)Retire the reason the scratch tables exist. Tokenization, multi-word detection, and the
sentences/word_occurrencesinserts now all happen in PHP.temp_word_occurrences+tempexprsfrom parsing entirely; theLOAD DATA LOCAL INFILEpath and its two divergent fallbacks; the ~90-line stateful@variablemulti-word-detection SQL;TextParsingPersistenceand the dead, never-wiredParsingCoordinator.ParsedToken(a former scratch-table row) andTokenPersistence(inserts sentences → reads back realSeIDs → per-sentence windowed-hash multi-word detection → batchword_occurrences). The tokenizers themselves are unchanged.Net: +941 / −1719 lines.
Bonus: two pre-existing bugs fixed
Both hit the same LOAD-DATA-less installs (managed DB / Windows) as #247:
trim($line)ate the\rsentence marker → every text parsed as one sentence with no spacing.@variabledetector only found multi-word expressions in the first sentence of space-less (MeCab/CJK) languages.Verification
buildTokensFromMecab()also has a synthetic-fixture unit test (no binary needed for CI).checkText()is now pure — it previously echoed preview HTML despite its docblock saying it doesn't.Reviewer note
For installs where
LOAD DATA LOCAL INFILEwas unavailable, re-parsed texts will now split into sentences correctly and detect CJK/Japanese multi-word expressions across all sentences — i.e. this is a behavior fix, not a regression, if you diff old vs new parsed output.Compatibility
SqlValidator::ALLOWED_TABLESkeeps itstemp_word_occurrences/temp_wordsentries so restoring old backups still validates; backup itself uses a separate list that never included them.https://claude.ai/code/session_01X5PujuwBEMtMJxK1BA97fi