Lobbying data pipeline #2158
Draft
nesanders wants to merge 7 commits into
Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch)
and writes structured data to Firestore for joining with MAPLE bill data.
New collections:
- /lobbyingRegistrants — one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings — one doc per (registrant, client, bill, court), with billId
null for Executive/Other chambers so the join guard is type-level
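The null-billId guard for non-legislative rows can be illustrated with a minimal sketch. It assumes the {chamberPrefix}{integer} format this description gives for Bill.id; the function name `build_bill_id` and the prefix set below are hypothetical, not taken from the PR's code.

```python
from typing import Optional

# Assumption: an illustrative prefix set; the real list lives in the scraper.
LEGISLATIVE_PREFIXES = {"H", "S", "HD", "SD"}


def build_bill_id(chamber_prefix: Optional[str], number: Optional[int]) -> Optional[str]:
    """Return an id such as 'H1234', or None when the row cannot join to a bill."""
    if chamber_prefix is None or number is None:
        return None
    prefix = chamber_prefix.strip().upper()
    if prefix not in LEGISLATIVE_PREFIXES:
        # Executive/Other activity has no bill document to join against.
        return None
    return f"{prefix}{int(number)}"
```

Returning None rather than a placeholder string means a consumer has to handle the missing id explicitly before attempting a join.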
Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match
Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps:
d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a
summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore
subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales
to ~50k URLs and is safely resumable
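As a rough illustration of the name normalization mentioned above, the sketch below covers only some of the listed steps (d/b/a stripping, ampersand, punctuation, leading THE, trailing legal entity words). The step order, the word list, and the function name are assumptions; this is not the 10-step reference pipeline and will not match it exactly.

```python
import re

# Assumption: a small illustrative list, not the reference implementation's.
LEGAL_WORDS = {"INC", "LLC", "LLP", "CORP", "CORPORATION", "CO", "LTD", "PC"}
# Matches " d/b/a ...", " dba ..." and everything after it.
DBA_RE = re.compile(r"\s+D/?B/?A\b.*$", re.IGNORECASE)


def normalize_entity_name(raw: str) -> str:
    """Return an upper-cased grouping key for a registrant or client name."""
    name = raw.upper().strip()
    name = DBA_RE.sub("", name)               # drop the trading-name suffix
    name = name.replace("&", " AND ")         # spell out ampersands
    name = re.sub(r"[^A-Z0-9 ]+", " ", name)  # punctuation becomes whitespace
    tokens = name.split()
    if tokens and tokens[0] == "THE":         # drop a leading article
        tokens = tokens[1:]
    while tokens and tokens[-1] in LEGAL_WORDS:
        tokens.pop()                          # drop trailing legal entity words
    return " ".join(tokens)
```

Storing this value next to the raw name, as the description does with the *Norm fields, keeps the original string available for provenance while grouping on the normalized key.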
Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting to classify HTTP clients before examining headers. Python's requests library produces a fingerprint that Imperva allows through; Node.js does not. A standalone Cloud Run container (Python 3.12) is therefore used for the scheduled ingestion instead of a Cloud Function.
lobbying-scraper/, the Cloud Run container (3 pip deps: requests, beautifulsoup4, google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing new) and --mode backfill (full 2005-present history, resumable subcollection cursor). Weekly mode caches summary URL→disc URL mappings so prior-year registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page levels (search POST, summary GET, disclosure GET). Handles both modern (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts, the 10-step entity name normalization pipeline; must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches types.ts (lobbyingRegistrants, lobbyingFilings collections).
scripts/firebase-admin/backfillLobbying.ts: simplified to spawn scrape.py as a subprocess; all HTTP and Firestore logic moved to Python.
functions/src/lobbying/http/: thin Python HTTP helper kept for reference; not used in the current architecture.
Note: server-side IP reputation behavior with Imperva untested. Build and run the container on Cloud Run with --dry-run to validate before full deploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
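The weekly-mode caching of summary URL to disclosure URL mappings described in this commit can be sketched as follows. The function name, the in-memory dict, and the injected `fetch_summary` callable are assumptions for illustration; the real scrape.py may persist and key its cache differently.

```python
from typing import Callable, Dict, Iterable, List


def resolve_disclosure_urls(
    summary_urls: Iterable[str],
    cache: Dict[str, List[str]],
    fetch_summary: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each summary URL to its disclosure URLs, fetching only cache misses."""
    resolved: Dict[str, List[str]] = {}
    for url in summary_urls:
        if url not in cache:
            # Hypothetical fetcher: one HTTP request per cache miss.
            cache[url] = fetch_summary(url)
        resolved[url] = cache[url]
    return resolved
```

On a second run with an unchanged list of summary URLs, every lookup is a cache hit and `fetch_summary` is never called, which is the "zero additional HTTP requests" case the commit describes.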
Per code review feedback: the TypeScript Firebase Function and backfill script added no value. The portal's TLS fingerprinting requirements mean Node.js cannot reach it, so the TS HTTP layer was non-functional and the backfill script was just a thin subprocess wrapper with no benefit over calling scrape.py directly.
Removed:
- functions/src/lobbying/scrapeLobbying.ts (broken Cloud Function)
- functions/src/lobbying/portal.ts (non-functional TS HTTP layer)
- functions/src/lobbying/http/ (unused Python fetch helper)
- scripts/firebase-admin/backfillLobbying.ts (shell wrapper, no value)
- scrapeLobbying export from functions/src/index.ts
Kept:
- functions/src/lobbying/types.ts: Firestore schema; imported by frontend
- functions/src/lobbying/normalize.ts: normalization pipeline
- lobbying-scraper/: the working Cloud Run container (unchanged)
The historical backfill is now run directly:
  python3 lobbying-scraper/scrape.py --mode backfill
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ucture
Portal parser (portal.py):
- Hybrid era (2014-2018): per-client compensation from Panel1 divs — was
silently $0 due to missing code path (e.g. Murphy Donoghue 2016: $990k)
- Legacy era (2009-2013): per-client totals from 'Compensation received'
column, with dedup of (client, amount) pairs before summing — was silently
$0 per client (e.g. ML Strategies 2011: $641k across 23 clients)
- Legacy bill semicolon separator: 'H73; Title' now parsed correctly
- 'Total amount' summary row excluded from compensation pairs
- HTTP retry on 429/500/502/503/504 (was aborting on first transient error)
- parse_summary() and parse_disclosure_detail() split out as pure functions
(no I/O) so the offline reparse driver can call them without a session
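The legacy-era fix above (deduplicating (client, amount) pairs before summing and excluding the 'Total amount' row) can be sketched roughly as below. The row shape and the function name are assumptions; the actual parse_disclosure_detail() in portal.py is not shown in this PR page.

```python
from decimal import Decimal
from typing import Dict, Iterable, Tuple


def sum_compensation(rows: Iterable[Tuple[str, str]]) -> Dict[str, Decimal]:
    """Sum per-client compensation, counting each (client, amount) pair once."""
    seen = set()
    totals: Dict[str, Decimal] = {}
    for client, amount_text in rows:
        client = client.strip()
        if client.lower().startswith("total amount"):
            continue  # summary row, not a client
        amount = Decimal(amount_text.replace("$", "").replace(",", "").strip())
        key = (client, amount)
        if key in seen:
            continue  # the same pair repeated elsewhere on the page
        seen.add(key)
        totals[client] = totals.get(client, Decimal("0")) + amount
    return totals
```

Using Decimal rather than float keeps dollar amounts exact when many rows are summed.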
GCS archiving (archive.py):
- Write-only cold storage: every fetched Summary/CompleteDisclosure page
saved as gs://{project}-lobbying-archive/raw_html/{sha1(url)}.html
- Enabled by ARCHIVE_RAW=1 env var; no-op otherwise
- Failures are logged but never interrupt the live scrape path
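A minimal sketch of the archiving behaviour listed above: the blob name is derived from sha1(url), the ARCHIVE_RAW=1 gate makes it a no-op otherwise, and failures are logged without propagating. The `upload` callable stands in for the GCS write and is hypothetical, as are the function names.

```python
import hashlib
import logging
import os
from typing import Callable

logger = logging.getLogger("archive_sketch")


def archive_blob_name(url: str) -> str:
    """Return raw_html/{sha1(url)}.html for a fetched page URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"raw_html/{digest}.html"


def maybe_archive(url: str, html: str, upload: Callable[[str, str], None]) -> bool:
    """Archive the page if ARCHIVE_RAW=1; never raise into the scrape path."""
    if os.environ.get("ARCHIVE_RAW") != "1":
        return False  # archiving disabled: no-op
    try:
        upload(archive_blob_name(url), html)  # hypothetical stand-in for the GCS write
        return True
    except Exception:
        logger.exception("archive failed for %s", url)
        return False
```

Because the name depends only on the URL, re-fetching the same page overwrites the same blob instead of creating a new one.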
Offline reparse driver (reparse_archive.py):
- Lists CompleteDisclosure blobs from GCS, resolves registrant meta from
Firestore via disclosureUrls array_contains, re-runs pure parsers,
writes back via writer.py; resumable via /scrapers/lobbyingReparse cursor
Pytest suite (tests/test_portal_parser.py, 26 tests):
- All 4 eras verified against committed gzipped fixture pages
- Compensation totals, client counts, bill counts, era detection asserted
- Specific bug-fix regressions: Total-amount artifact, H73 semicolon,
hybrid Panel1 compensation, 2007 _total_salary_ fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- writer.py: move firestore import out of TYPE_CHECKING block so
firestore.ArrayUnion() is available at runtime (was NameError)
- writer.py/scrape.py/reparse_archive.py: strip leading slash from
Firestore path constants (SCRAPER_DOC, BACKFILL_DOC, REPARSE_DOC) —
db.document('/scrapers/x') raises ValueError: odd path element count
- scrape.py: add os import; pass GOOGLE_CLOUD_PROJECT to firestore.Client()
so local runs target the correct project rather than the ADC default
Verified: 3 registrants / 6 disclosures written to digital-testimony-dev;
re-run writes 0 (cursor working).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
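The leading-slash fix in this commit can be illustrated with a small path helper. A document path must alternate collection and document ids, so it needs an even number of segments; a leading slash adds an empty first segment and makes the count odd. The helper name is hypothetical, and the commit itself simply removes the slash from the constants.

```python
def clean_document_path(path: str) -> str:
    """Strip surrounding slashes and check the path addresses a document."""
    parts = path.strip("/").split("/")
    if "" in parts:
        raise ValueError(f"empty path segment in {path!r}")
    if len(parts) % 2 != 0:
        # An odd count means the path ends on a collection, not a document.
        raise ValueError(f"odd path element count in {path!r}")
    return "/".join(parts)
```

For example, `clean_document_path("/scrapers/lobbyingBackfill")` yields a path without the leading slash, which can then be passed to `db.document(...)`.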
Summary
This PR introduces a Python-based lobbying disclosure ingestion pipeline and establishes the associated Firestore data model. No frontend changes are included; those will follow in a subsequent PR.
cf. #855, #1365
What's included
- (lobbying-scraper/): fetches MA SoS lobbying portal pages, parses disclosure HTML, writes lobbyingRegistrants and lobbyingFilings documents to Firestore. Replaces the deleted TypeScript Cloud Function stub.
- (archive.py): when ARCHIVE_RAW=1, each fetched portal page is written to the {project}-lobbying-archive GCS bucket under raw_html/{sha1(url)}.html. Enables offline reparse without re-scraping.
- (reparse_archive.py): reads archived HTML from GCS, re-runs pure parsers, writes back to Firestore. Used when parser logic changes.
- (tests/): 26 tests covering all 4 eras (employer + individual registrants), compensation totals, bill counts, edge cases (semicolon bill separators, "Total amount" artifact rows, executive chamber null billIds).
- […] lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm).
Checklist
- firestore.indexes.json: added 4 composite indexes on lobbyingFilings.
Screenshots
N/A (no frontend changes)
Known issues
- […] billId but no matching bill document.
- […] RegVersionLobbyist.aspx summary pages instead of CompleteDisclosure.aspx. These registrants produce no disclosure detail rows; this is expected portal behavior.
- […] google-cloud-storage to requirements.txt.
Steps to test
See the Incremental Test Plan in docs/lobbying-disclosure-ingestion.md for the full sequence. Quick smoke test:
1. cd lobbying-scraper && python -m pytest tests/ -v
   All 26 tests should pass.
2. python3 scrape.py --mode backfill --year 2024 --limit 3 --dry-run
   Verifies the portal is reachable and HTML parsing returns valid data without writing to Firestore.
3. GOOGLE_CLOUD_PROJECT=digital-testimony-dev python3 scrape.py --mode backfill --year 2024 --limit 3
   Writes 3 registrants and their filings; verify documents appear in the Firestore console with correct billId values for legislative rows and null for Executive rows.