Lobbying data pipeline#2158

Draft
nesanders wants to merge 7 commits into
codeforboston:mainfrom
nesanders:lobbying-data-pipeline

@nesanders nesanders commented Jun 4, 2026

Collaborator

Summary

This PR introduces a Python-based lobbying disclosure ingestion pipeline and establishes the associated Firestore data model. No frontend changes are included; those will follow in a subsequent PR.

cf. #855, #1365

What's included

  • Python Cloud Run scraper (lobbying-scraper/): fetches MA SoS lobbying portal pages, parses disclosure HTML, writes lobbyingRegistrants and lobbyingFilings documents to Firestore. Replaces the deleted TypeScript Cloud Function stub.
  • Parser coverage for all 4 HTML eras (2005–2008, 2009–2013, 2014–2018, 2019+): each era of the SoS website uses a different table/grid structure for compensation data; all are now handled and tested.
  • GCS raw HTML archive (archive.py): when ARCHIVE_RAW=1, each fetched portal page is written to the {project}-lobbying-archive GCS bucket under raw_html/{sha1(url)}.html. This enables offline reparsing without re-scraping.
  • Offline reparse driver (reparse_archive.py): reads archived HTML from GCS, re-runs pure parsers, writes back to Firestore. Used when parser logic changes.
  • Pytest suite (tests/): 26 tests covering all 4 eras (employer + individual registrants), compensation totals, bill counts, edge cases (semicolon bill separators, "Total amount" artifact rows, executive chamber null billIds).
  • Firestore composite indexes: 4 new indexes on lobbyingFilings (generalCourt + billId, chamber, entityNameNorm, clientNameNorm).

Checklist

  • On the frontend, I've made my strings translate-able. -- N/A
  • If I've added shared components, I've added a storybook story. -- N/A
  • I've made pages responsive and look good on mobile. -- N/A
  • If I've added new Firestore queries, I've added any new required indexes to firestore.indexes.json — Added 4 composite indexes on lobbyingFilings.

Screenshots

N/A (no frontend changes)

Known issues

  • Initial backfill required. Weekly incremental runs cover only the current and prior year, so historical data (2005–present) requires a one-time backfill run. See the test plan in docs/lobbying-disclosure-ingestion.md.
  • Bill joins limited to court 192+ (2021–present). MAPLE's bill collection starts around 2020; older lobbying filings will have a valid billId but no matching bill document.
  • Portal rate limiting. The MA SoS portal requires ~1s between requests. First-time processing of a year (~500 registrants × 2 disclosures) takes roughly 20–30 minutes.
  • Pre-2009 individual lobbyists use RegVersionLobbyist.aspx summary pages instead of CompleteDisclosure.aspx. These registrants produce no disclosure detail rows — expected portal behavior.
  • Docker image rebuild needed after adding google-cloud-storage to requirements.txt.
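
The ~1s spacing noted under portal rate limiting could be enforced with a small throttle like the one below. This is a minimal sketch, not the code in portal.py: the `Throttle` class and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without real delays.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to wait().

    Hypothetical helper for illustration; not taken from the PR.
    """

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last = now + delay
        return delay
```

A caller would invoke `throttle.wait()` immediately before each portal request, so the delay is only paid when requests arrive faster than the configured interval.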

Steps to test

See the Incremental Test Plan in docs/lobbying-disclosure-ingestion.md for the full sequence. Quick smoke test:

  1. Run the parser unit tests: cd lobbying-scraper && python -m pytest tests/ -v — all 26 tests should pass.
  2. Dry-run a live portal fetch: python3 scrape.py --mode backfill --year 2024 --limit 3 --dry-run — verifies the portal is reachable and HTML parsing returns valid data without writing to Firestore.
  3. Write test: GOOGLE_CLOUD_PROJECT=digital-testimony-dev python3 scrape.py --mode backfill --year 2024 --limit 3 — writes 3 registrants and their filings; verify documents appear in the Firestore console with correct billId values for legislative rows and null for Executive rows.

nesanders and others added 2 commits June 4, 2026 07:01
Scrapes the MA Secretary of State lobbying portal (sec.state.ma.us/LobbyistPublicSearch)
and writes structured data to Firestore for joining with MAPLE bill data.

New collections:
- /lobbyingRegistrants — one doc per (registrant, year), regType Lobbyist|Employer
- /lobbyingFilings — one doc per (registrant, client, bill, court), with billId
  null for Executive/Other chambers so the join guard is type-level

Key design points:
- billId is constructed as {chamberPrefix}{integer} (e.g. H1234, SD56) to match
  Bill.id in the existing bills collection; raw integer + chamber stored separately
- Entity name normalization pipeline ported from reference implementation (10 steps:
  d/b/a stripping, legal entity words, punctuation, THE, ampersand, typo fix, etc.)
- Both raw and *Norm name fields stored for provenance and grouping
- Live Cloud Function scrapes current+prior year on a 24h schedule with a
  summaryDiscCache to avoid re-fetching summary pages in steady state
- Backfill admin script handles full 2005-present history with a Firestore
  subcollection cursor (/scrapers/lobbyingBackfill/processedUrls) that scales
  to ~50k URLs and is safely resumable
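
The billId rule described above (prefix plus integer for legislative rows, null for Executive/Other) can be sketched as follows. This is an illustration under assumptions, not the code in writer.py: the function name `make_bill_id`, the chamber labels "House" and "Senate", and the set of non-legislative chamber strings are hypothetical.

```python
from typing import Optional

# Assumption: these are the chamber labels that carry no bill. The PR states
# only that Executive/Other chambers get a null billId.
NON_LEGISLATIVE_CHAMBERS = {"Executive", "Other"}


def make_bill_id(chamber: str, prefix: str, number: int) -> Optional[str]:
    """Build a billId such as 'H1234' or 'SD56', or None when no bill applies.

    Returning None (rather than an empty string) keeps the join guard at the
    type level: a filing either has a joinable billId or it has none.
    """
    if chamber in NON_LEGISLATIVE_CHAMBERS:
        return None
    return f"{prefix}{number}"


print(make_bill_id("House", "H", 1234))    # H1234
print(make_bill_id("Senate", "SD", 56))    # SD56
print(make_bill_id("Executive", "", 0))    # None
```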

Files:
- functions/src/lobbying/{types,normalize,portal,scrapeLobbying,index}.ts
- scripts/firebase-admin/backfillLobbying.ts
- firestore.rules + firestore.indexes.json updated
- docs/lobbying-disclosure-ingestion.md: full plan, test plan, future work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nesanders nesanders self-assigned this Jun 4, 2026
vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
maple-dev Ready Ready Preview, Comment Jun 29, 2026 2:33am

The MA SoS portal is protected by Imperva WAF, which uses TLS fingerprinting
to classify HTTP clients before examining headers. Python's requests library
produces a fingerprint that Imperva allows through; Node.js does not. A
standalone Cloud Run container (Python 3.12) is therefore used for the
scheduled ingestion instead of a Cloud Function.

lobbying-scraper/ — Cloud Run container (3 pip deps: requests, beautifulsoup4,
google-cloud-firestore):
- scrape.py: entry point with --mode weekly (incremental, fast exit if nothing
  new) and --mode backfill (full 2005-present history, resumable subcollection
  cursor). Weekly mode caches summary URL→disc URL mappings so prior-year
  registrants with no new filings require zero additional HTTP requests.
- portal.py: HTTP session management + HTML parsing for all three portal page
  levels (search POST, summary GET, disclosure GET). Handles both modern
  (>=2013) and legacy (<2013) disclosure formats.
- normalize.py: port of functions/src/lobbying/normalize.ts — 10-step entity
  name normalization pipeline, must match the TypeScript version exactly.
- writer.py: Firestore document construction and batch writes. Schema matches
  types.ts (lobbyingRegistrants, lobbyingFilings collections).
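
To give a feel for what an entity name normalization pipeline of this kind does, here is a deliberately reduced sketch covering only some of the steps named earlier (d/b/a stripping, ampersand, punctuation, leading THE, legal entity words). It is not a port of normalize.ts: the step order, the regular expressions, the legal-word list, the uppercase output, and the choice to keep the name before "d/b/a" are all assumptions.

```python
import re

# Assumed, incomplete list of legal entity words; the real list is not shown in the PR.
LEGAL_WORDS = {"INC", "LLC", "LLP", "CORP", "CORPORATION", "CO", "LTD", "PC"}


def normalize_entity_name(raw: str) -> str:
    """Reduce a raw registrant/client name to a grouping key (illustrative only)."""
    name = raw.upper()
    # Drop a trailing "D/B/A ..." (or "DBA ...") clause, keeping the legal name.
    name = re.sub(r"\bD/?B/?A\b.*$", "", name)
    # Spell out ampersands so "A & B" and "A AND B" group together.
    name = name.replace("&", " AND ")
    # Replace remaining punctuation with spaces.
    name = re.sub(r"[^A-Z0-9 ]+", " ", name)
    words = name.split()
    # Drop a leading article.
    if words and words[0] == "THE":
        words = words[1:]
    # Drop legal entity words wherever they appear.
    words = [w for w in words if w not in LEGAL_WORDS]
    return " ".join(words)


print(normalize_entity_name("The Acme Group, Inc. d/b/a Acme Advisors"))  # ACME GROUP
print(normalize_entity_name("Smith & Jones LLC"))                         # SMITH AND JONES
```

Because the Python and TypeScript versions must agree exactly, a shared table of input/output pairs exercised by both test suites would be one way to catch drift.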

scripts/firebase-admin/backfillLobbying.ts — simplified to spawn scrape.py
as a subprocess; all HTTP and Firestore logic moved to Python.

functions/src/lobbying/http/ — thin Python HTTP helper kept for reference;
not used in the current architecture.

Note: server-side IP reputation behavior with Imperva untested. Build and run
the container on Cloud Run with --dry-run to validate before full deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per code review feedback: the TypeScript Firebase Function and backfill
script added no value — the portal's TLS fingerprinting requirements mean
Node.js cannot reach it, so the TS HTTP layer was non-functional and the
backfill script was just a thin subprocess wrapper with no benefit over
calling scrape.py directly.

Removed:
- functions/src/lobbying/scrapeLobbying.ts (broken Cloud Function)
- functions/src/lobbying/portal.ts (non-functional TS HTTP layer)
- functions/src/lobbying/http/ (unused Python fetch helper)
- scripts/firebase-admin/backfillLobbying.ts (shell wrapper, no value)
- scrapeLobbying export from functions/src/index.ts

Kept:
- functions/src/lobbying/types.ts — Firestore schema; imported by frontend
- functions/src/lobbying/normalize.ts — normalization pipeline
- lobbying-scraper/ — the working Cloud Run container (unchanged)

The historical backfill is now run directly:
  python3 lobbying-scraper/scrape.py --mode backfill

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nesanders and others added 2 commits June 26, 2026 10:16
…ucture

Portal parser (portal.py):
- Hybrid era (2014-2018): per-client compensation from Panel1 divs — was
  silently $0 due to missing code path (e.g. Murphy Donoghue 2016: $990k)
- Legacy era (2009-2013): per-client totals from 'Compensation received'
  column, with dedup of (client, amount) pairs before summing — was silently
  $0 per client (e.g. ML Strategies 2011: $641k across 23 clients)
- Legacy bill semicolon separator: 'H73; Title' now parsed correctly
- 'Total amount' summary row excluded from compensation pairs
- HTTP retry on 429/500/502/503/504 (was aborting on first transient error)
- parse_summary() and parse_disclosure_detail() split out as pure functions
  (no I/O) so the offline reparse driver can call them without a session
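
The legacy-era compensation fix (dedup of (client, amount) pairs, with the 'Total amount' row excluded) can be sketched as a pure function. This assumes the rows have already been extracted from the HTML as (client, amount text) tuples; the function name `client_totals` and the input shape are hypothetical and may differ from portal.py.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple


def client_totals(rows: Iterable[Tuple[str, str]]) -> Dict[str, Decimal]:
    """Sum compensation per client, counting each (client, amount) pair once."""
    seen = set()
    totals: Dict[str, Decimal] = defaultdict(Decimal)  # Decimal() == Decimal("0")
    for client, amount_text in rows:
        client = client.strip()
        # Skip the portal's summary row so it is not treated as a client.
        if client.lower().startswith("total amount"):
            continue
        amount = Decimal(amount_text.replace("$", "").replace(",", "").strip())
        pair = (client, amount)
        if pair in seen:
            continue  # same pair repeated across rows
        seen.add(pair)
        totals[client] += amount
    return dict(totals)


rows = [
    ("Acme", "$1,000.00"),
    ("Acme", "$1,000.00"),   # duplicate pair, ignored
    ("Acme", "$250.00"),
    ("Beta", "$500.00"),
    ("Total amount", "$1,750.00"),
]
print(client_totals(rows))
```

One trade-off of pair-level dedup is that two genuinely separate payments of the same amount to the same client would collapse into one; whether that can occur in the portal data is not something this sketch addresses.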

GCS archiving (archive.py):
- Write-only cold storage: every fetched Summary/CompleteDisclosure page
  saved as gs://{project}-lobbying-archive/raw_html/{sha1(url)}.html
- Enabled by ARCHIVE_RAW=1 env var; no-op otherwise
- Failures are logged but never interrupt the live scrape path
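
The archiving behavior above (content-addressed blob name, ARCHIVE_RAW gate, failures logged but swallowed) can be sketched as follows. The upload step is passed in as a callable so the example stays independent of the GCS client; `archive_blob_name`, `archive_page`, and the `upload` parameter are hypothetical names, and the use of a UTF-8 hex SHA-1 digest is an assumption about how sha1(url) is computed.

```python
import hashlib
import logging
import os
from typing import Callable

log = logging.getLogger("archive")


def archive_blob_name(url: str) -> str:
    """Derive the object name raw_html/{sha1(url)}.html for a fetched URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"raw_html/{digest}.html"


def archive_page(url: str, html: str, upload: Callable[[str, str], None]) -> bool:
    """Archive one page if ARCHIVE_RAW=1; never raise into the scrape path.

    Returns True only when the upload callable completed without error.
    """
    if os.environ.get("ARCHIVE_RAW") != "1":
        return False  # no-op unless explicitly enabled
    try:
        upload(archive_blob_name(url), html)
        return True
    except Exception:
        # Log and continue: archiving is best-effort cold storage.
        log.exception("archive failed for %s", url)
        return False
```

In a real deployment, `upload` would wrap a call on the storage client for the {project}-lobbying-archive bucket. Keying by a hash of the URL means a re-fetched page overwrites its earlier copy rather than accumulating versions.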

Offline reparse driver (reparse_archive.py):
- Lists CompleteDisclosure blobs from GCS, resolves registrant meta from
  Firestore via disclosureUrls array_contains, re-runs pure parsers,
  writes back via writer.py; resumable via /scrapers/lobbyingReparse cursor

Pytest suite (tests/test_portal_parser.py, 26 tests):
- All 4 eras verified against committed gzipped fixture pages
- Compensation totals, client counts, bill counts, era detection asserted
- Specific bug-fix regressions: Total-amount artifact, H73 semicolon,
  hybrid Panel1 compensation, 2007 _total_salary_ fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- writer.py: move firestore import out of TYPE_CHECKING block so
  firestore.ArrayUnion() is available at runtime (was NameError)
- writer.py/scrape.py/reparse_archive.py: strip leading slash from
  Firestore path constants (SCRAPER_DOC, BACKFILL_DOC, REPARSE_DOC) —
  db.document('/scrapers/x') raises ValueError: odd path element count
- scrape.py: add os import; pass GOOGLE_CLOUD_PROJECT to firestore.Client()
  so local runs target the correct project rather than the ADC default
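
The leading-slash failure is easy to see by counting path segments: a Firestore document path alternates collection and document IDs, so it needs an even number of segments, and a leading slash adds an empty one. The helper below is a hypothetical guard, not code from the PR, and the example path is illustrative.

```python
def normalize_doc_path(path: str) -> str:
    """Strip surrounding slashes and check the path can name a document."""
    parts = path.strip("/").split("/")
    # Document paths alternate collection/document, so the count must be even
    # and no segment may be empty.
    if not all(parts) or len(parts) % 2 != 0:
        raise ValueError(f"not a document path: {path!r}")
    return "/".join(parts)


# A naive split of the slash-prefixed form yields three elements, one of them empty.
print("/scrapers/x".split("/"))             # ['', 'scrapers', 'x']
print(normalize_doc_path("/scrapers/x"))    # scrapers/x
```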

Verified: 3 registrants / 6 disclosures written to digital-testimony-dev;
re-run writes 0 (cursor working).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>