Lobbying data pipeline by nesanders · Pull Request #2158 · codeforboston/maple

nesanders · 2026-06-04T12:46:14Z

Summary

This PR introduces a Python-based lobbying disclosure ingestion pipeline and establishes the associated Firestore data model. No frontend changes are included; those will follow in a subsequent PR.

cf. #855, #1365

What's included

Python Cloud Run scraper (lobbying-scraper/): fetches MA SoS lobbying portal pages, parses disclosure HTML, writes lobbyingRegistrants and lobbyingFilings documents to Firestore. Replaces the deleted TypeScript Cloud Function stub.
Parser coverage for all 4 HTML eras (2005–2008, 2009–2013, 2014–2018, 2019+): each era of the SoS website uses a different table/grid structure for compensation data; all are now handled and tested.
GCS raw HTML archive (archive.py): when ARCHIVE_RAW=1, each fetched portal page is written to the {project}-lobbying-archive GCS bucket under raw_html/{sha1(url)}.html. This enables offline reparsing without re-scraping.
Offline reparse driver (reparse_archive.py): reads archived HTML from GCS, re-runs pure parsers, writes back to Firestore. Used when parser logic changes.
Pytest suite (tests/): 26 tests covering all 4 eras (employer + individual registrants), compensation totals, bill counts, edge cases (semicolon bill separators, "Total amount" artifact rows, executive chamber null billIds).
Firestore composite indexes: 4 new indexes on lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm).
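The archive naming scheme described above can be sketched as follows. This is a minimal illustration, not the code in archive.py: the function names are hypothetical, and it assumes the SHA-1 is computed over the UTF-8 bytes of the URL and rendered as lowercase hex, which the PR does not state explicitly.

```python
import hashlib


def archive_bucket_name(project: str) -> str:
    """Bucket naming taken from the PR description: {project}-lobbying-archive."""
    return f"{project}-lobbying-archive"


def archive_blob_name(url: str) -> str:
    """Object name for one fetched portal page: raw_html/{sha1(url)}.html.

    Assumption: UTF-8 encoding of the URL and a lowercase hex digest.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"raw_html/{digest}.html"
```

Because the key is a pure function of the URL, re-fetching the same page overwrites the same object, and an offline reparse can locate a page from its URL without a separate index.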

Checklist

On the frontend, I've made my strings translate-able. -- N/A
If I've added shared components, I've added a storybook story. -- N/A
I've made pages responsive and look good on mobile. -- N/A
If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json. -- Added 4 composite indexes on lobbyingFilings.

Screenshots

N/A (no frontend changes)

Known issues

Initial backfill required. Weekly incremental runs cover only the current and prior year. Historical data (2005–present) requires a one-time backfill run. See the test plan in docs/lobbying-disclosure-ingestion.md.
Bill joins limited to court 192+ (2021–present). MAPLE's bill collection starts around 2020; older lobbying filings will have a valid billId but no matching bill document.
Portal rate limiting. The MA SoS portal requires ~1s between requests. First-time processing of a year (~500 registrants × 2 disclosures) takes roughly 20–30 minutes.
Pre-2009 individual lobbyists use RegVersionLobbyist.aspx summary pages instead of CompleteDisclosure.aspx. These registrants produce no disclosure detail rows — expected portal behavior.
Docker image rebuild needed after adding google-cloud-storage to requirements.txt.
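To make the rate-limiting item above concrete, here is a hedged sketch of a request throttle and a back-of-the-envelope runtime estimate. The `Throttle` class and `estimate_minutes` function are hypothetical names, not part of the scraper. The estimate assumes one summary page plus two disclosure pages per registrant, which is one plausible reading of "~500 registrants × 2 disclosures".

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def estimate_minutes(
    registrants: int, pages_per_registrant: int, seconds_per_request: float = 1.0
) -> float:
    """Lower bound on wall-clock minutes spent on enforced spacing alone."""
    return registrants * pages_per_registrant * seconds_per_request / 60.0
```

Under those assumptions, `estimate_minutes(500, 3)` gives 25 minutes, which sits inside the 20–30 minute range quoted above. Network latency and parsing time come on top of that.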

Steps to test

See the Incremental Test Plan in docs/lobbying-disclosure-ingestion.md for the full sequence. Quick smoke test:

Run the parser unit tests: cd lobbying-scraper && python -m pytest tests/ -v — all 26 tests should pass.
Dry-run a live portal fetch: python3 scrape.py --mode backfill --year 2024 --limit 3 --dry-run — verifies the portal is reachable and HTML parsing returns valid data without writing to Firestore.
Write test: GOOGLE_CLOUD_PROJECT=digital-testimony-dev python3 scrape.py --mode backfill --year 2024 --limit 3 — writes 3 registrants and their filings; verify documents appear in the Firestore console with correct billId values for legislative rows and null for Executive rows.

Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch) and writes structured data to Firestore for joining with MAPLE bill data.

New collections:
- /lobbyingRegistrants: one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings: one doc per (registrant, client, bill, court), with billId null for Executive/Other chambers so the join guard is type-level

Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps: d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales to ~50k URLs and is safely resumable

Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
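The billId rule in this commit message ({chamberPrefix}{integer}, null for Executive/Other) can be illustrated with a small sketch. The function name and the set of accepted prefixes are assumptions: the message only gives H1234 and SD56 as examples, so the real writer may recognize a different set.

```python
from typing import Optional

# Assumption: prefixes that correspond to joinable legislative bills.
# Only "H" and "SD" appear in the commit message; "S" and "HD" are guesses.
LEGISLATIVE_PREFIXES = {"H", "S", "HD", "SD"}


def build_bill_id(chamber_prefix: Optional[str], number: Optional[int]) -> Optional[str]:
    """Return an id such as 'H1234', or None when the row is not a legislative bill.

    Returning None for Executive/Other rows means a join against the bills
    collection can be guarded by a simple 'billId is not None' check.
    """
    if chamber_prefix is None or number is None:
        return None
    prefix = chamber_prefix.strip().upper()
    if prefix not in LEGISLATIVE_PREFIXES:
        return None
    return f"{prefix}{number}"
```

For example, `build_bill_id("H", 1234)` yields `"H1234"`, while `build_bill_id("Executive", 12)` yields `None`. The raw chamber and integer would still be stored separately, as the commit message describes.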

The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting to classify HTTP clients before examining headers. Python's requests library produces a fingerprint that Imperva allows through; Node.js does not. A standalone Cloud Run container (Python 3.12) is therefore used for the scheduled ingestion instead of a Cloud Function.

lobbying-scraper/ (Cloud Run container; 3 pip deps: requests, beautifulsoup4, google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing new) and --mode backfill (full 2005-present history, resumable subcollection cursor). Weekly mode caches summary URL→disc URL mappings so prior-year registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page levels (search POST, summary GET, disclosure GET). Handles both modern (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts, the 10-step entity name normalization pipeline; must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches types.ts (lobbyingRegistrants, lobbyingFilings collections).

scripts/firebase-admin/backfillLobbying.ts: simplified to spawn scrape.py as a subprocess; all HTTP and Firestore logic moved to Python.

functions/src/lobbying/http/: thin Python HTTP helper kept for reference; not used in the current architecture.

Note: server-side IP reputation behavior with Imperva untested. Build and run the container on Cloud Run with --dry-run to validate before full deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
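For readers unfamiliar with entity name normalization, the sketch below illustrates a few of the step types named in this PR (d/b/a stripping, ampersand, punctuation, THE, legal entity words). It is not the pipeline in normalize.py or normalize.ts: the step order, the word list, the choice to keep the name before "d/b/a", and the function name are all assumptions, and the typo-fix step is omitted.

```python
import re

# Assumed list of legal-entity words; the real pipeline's list is not shown in the PR.
LEGAL_WORDS = {"INC", "LLC", "LLP", "CORP", "CORPORATION", "CO", "LTD", "PC"}


def normalize_entity_name(raw: str) -> str:
    """Illustrative normalization so that spelling variants group together."""
    name = raw.upper()
    # d/b/a stripping (assumption: keep the legal name that precedes the marker).
    name = re.split(r"\s+D/?B/?A\s+", name, maxsplit=1)[0]
    # Ampersand: spell it out so "A & B" and "A AND B" collapse to one form.
    name = name.replace("&", " AND ")
    # Punctuation: replace anything that is not a letter, digit, or space.
    name = re.sub(r"[^A-Z0-9\s]", " ", name)
    tokens = name.split()
    # Leading THE.
    if tokens and tokens[0] == "THE":
        tokens = tokens[1:]
    # Legal entity words.
    tokens = [t for t in tokens if t not in LEGAL_WORDS]
    return " ".join(tokens)
```

Storing both the raw name and the normalized form, as the PR does with the *Norm fields, keeps the original text available for provenance while grouping on the normalized key.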

Per code review feedback: the TypeScript Firebase Function and backfill script added no value. The portal's TLS fingerprinting requirements mean Node.js cannot reach it, so the TS HTTP layer was non-functional and the backfill script was just a thin subprocess wrapper with no benefit over calling scrape.py directly.

Removed:
- functions/src/lobbying/scrapeLobbying.ts (broken Cloud Function)
- functions/src/lobbying/portal.ts (non-functional TS HTTP layer)
- functions/src/lobbying/http/ (unused Python fetch helper)
- scripts/firebase-admin/backfillLobbying.ts (shell wrapper, no value)
- scrapeLobbying export from functions/src/index.ts

Kept:
- functions/src/lobbying/types.ts: Firestore schema; imported by frontend
- functions/src/lobbying/normalize.ts: normalization pipeline
- lobbying-scraper/: the working Cloud Run container (unchanged)

The historical backfill is now run directly:
python3 lobbying-scraper/scrape.py --mode backfill

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ucture

Portal parser (portal.py):
- Hybrid era (2014-2018): per-client compensation from Panel1 divs; was silently $0 due to missing code path (e.g. Murphy Donoghue 2016: $990k)
- Legacy era (2009-2013): per-client totals from 'Compensation received' column, with dedup of (client, amount) pairs before summing; was silently $0 per client (e.g. ML Strategies 2011: $641k across 23 clients)
- Legacy bill semicolon separator: 'H73; Title' now parsed correctly
- 'Total amount' summary row excluded from compensation pairs
- HTTP retry on 429/500/502/503/504 (was aborting on first transient error)
- parse_summary() and parse_disclosure_detail() split out as pure functions (no I/O) so the offline reparse driver can call them without a session

GCS archiving (archive.py):
- Write-only cold storage: every fetched Summary/CompleteDisclosure page saved as gs://{project}-lobbying-archive/raw_html/{sha1(url)}.html
- Enabled by ARCHIVE_RAW=1 env var; no-op otherwise
- Failures are logged but never interrupt the live scrape path

Offline reparse driver (reparse_archive.py):
- Lists CompleteDisclosure blobs from GCS, resolves registrant meta from Firestore via disclosureUrls array_contains, re-runs pure parsers, writes back via writer.py; resumable via /scrapers/lobbyingReparse cursor

Pytest suite (tests/test_portal_parser.py, 26 tests):
- All 4 eras verified against committed gzipped fixture pages
- Compensation totals, client counts, bill counts, era detection asserted
- Specific bug-fix regressions: Total-amount artifact, H73 semicolon, hybrid Panel1 compensation, 2007 _total_salary_ fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
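The legacy-era compensation fix (dedup of (client, amount) pairs before summing, with the 'Total amount' row excluded) might look roughly like the sketch below. The function name and the row shape are hypothetical; in particular, it assumes the 'Total amount' label appears in the client column and that amounts arrive as strings such as "$1,000.00".

```python
from decimal import Decimal
from typing import Dict, Iterable, Set, Tuple


def per_client_totals(rows: Iterable[Tuple[str, str]]) -> Dict[str, Decimal]:
    """Sum compensation per client, counting each (client, amount) pair once."""
    seen: Set[Tuple[str, Decimal]] = set()
    totals: Dict[str, Decimal] = {}
    for client, amount_text in rows:
        client = client.strip()
        # Skip the summary artifact row rather than treating it as a client.
        if client.lower().startswith("total amount"):
            continue
        amount = Decimal(amount_text.replace("$", "").replace(",", "").strip())
        key = (client, amount)
        if key in seen:
            # Same pair repeated, e.g. once per lobbyist row in the table.
            continue
        seen.add(key)
        totals[client] = totals.get(client, Decimal("0")) + amount
    return totals
```

Using `Decimal` rather than `float` keeps dollar amounts exact. One trade-off of pair-level dedup is that two genuinely separate payments of the same amount to the same client would be counted once.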

- writer.py: move firestore import out of TYPE_CHECKING block so firestore.ArrayUnion() is available at runtime (was NameError)
- writer.py/scrape.py/reparse_archive.py: strip leading slash from Firestore path constants (SCRAPER_DOC, BACKFILL_DOC, REPARSE_DOC); db.document('/scrapers/x') raises ValueError: odd path element count
- scrape.py: add os import; pass GOOGLE_CLOUD_PROJECT to firestore.Client() so local runs target the correct project rather than the ADC default

Verified: 3 registrants / 6 disclosures written to digital-testimony-dev; re-run writes 0 (cursor working).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
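The leading-slash fix can be illustrated without the Firestore client. The helper below is a hypothetical sketch, not code from the PR: it strips the leading slash and checks that the result has an even number of non-empty segments, which is what a document path (collection/doc pairs) requires. A leading slash presumably adds an empty first element when the path is split on "/", turning the two-element '/scrapers/x' into three.

```python
def document_path(path: str) -> str:
    """Return a Firestore-style document path without a leading slash.

    Raises ValueError if the cleaned path does not consist of an even number
    of non-empty segments (collection/document pairs).
    """
    cleaned = path.lstrip("/")
    segments = cleaned.split("/")
    if not cleaned or len(segments) % 2 != 0 or any(not s for s in segments):
        raise ValueError(f"not a document path: {path!r}")
    return cleaned
```

For example, `document_path("/scrapers/lobbyingReparse")` returns `"scrapers/lobbyingReparse"`, while a collection path such as `"/scrapers/lobbyingBackfill/processedUrls"` raises `ValueError` because it has three segments.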

nesanders and others added 2 commits June 4, 2026 07:01

initial plan

0348b29

nesanders self-assigned this Jun 4, 2026

nesanders and others added 2 commits June 26, 2026 10:16

Merge remote-tracking branch 'upstream/main' into lobbying-data-pipeline

df7c78e

nesanders wants to merge 7 commits into codeforboston:main from nesanders:lobbying-data-pipeline
