Local-first analytics for turning public eCFR data into agency-level burden, change, and auditability signals.
This project ingests agency mappings and dated eCFR XML snapshots, normalizes them into SQLite, computes transparent text-based metrics, and exposes the results through a FastAPI API and a React dashboard. The goal is not to replace legal analysis; it is to make large regulatory text corpora faster to explore by highlighting where restrictive language, process burden, and recent change activity are concentrated.
Highlights
- Public eCFR ingestion, normalization, and agency-to-CFR mapping
- Agency-level prioritization metrics with transparent formulas
- Reliability hardening, audit tooling, and reproducible local data
The app builds a local analytical view of eCFR content by agency. It flattens the agency directory, deduplicates CFR references, fetches current and dated XML snapshots, extracts scoped text blocks, and computes agency-level metrics such as word count, restrictive-term density, process-term density, change volatility, and content fingerprints. A small web UI then surfaces those metrics in an overview table, per-agency drilldown, and methodology page.
The ranked overview highlights agencies with high burden and active change patterns.
The agency detail page explains the scores in plain language and shows the current burden metrics.
The history table exposes the underlying snapshots, change flags, and fingerprints behind each agency row.
The methodology page publishes the formulas, caveats, and weighted term lists used by the app.
The lower sections call out confidence limits, overlap caveats, and how to interpret the metrics responsibly.
More screenshots and captions: docs/SCREENSHOTS.md
At a high level:
- Fetch the eCFR agency directory and title metadata.
- Flatten the nested agency tree and normalize CFR references.
- Plan whole-title or part-level XML fetches for each snapshot date.
- Parse XML into scoped text blocks.
- Match blocks to agency references and aggregate metrics into SQLite.
- Serve ranked and per-agency views through FastAPI.
- Render the API responses in a React dashboard.
- Validate the data with smoke checks, audits, and repair scripts.
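The flatten-and-deduplicate steps above can be sketched as follows. This is an illustrative recreation, not the project's actual code: the function names, record shapes, and the `slug`/`cfr_references`/`children` keys are assumptions.

```python
# Sketch: flatten a nested agency tree and deduplicate CFR references.
# Record shapes and key names here are illustrative assumptions.

def flatten_agencies(agencies, parent_slug=None):
    """Yield (slug, parent_slug, cfr_references) for every node in the tree."""
    for agency in agencies:
        slug = agency["slug"]
        yield slug, parent_slug, agency.get("cfr_references", [])
        yield from flatten_agencies(agency.get("children", []), parent_slug=slug)

def dedupe_references(references):
    """Normalize references to (title, subtitle, chapter, subchapter, part)
    tuples and drop duplicates while preserving first-seen order."""
    seen, unique = set(), []
    for ref in references:
        key = tuple(ref.get(field) for field in
                    ("title", "subtitle", "chapter", "subchapter", "part"))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

tree = [{"slug": "treasury", "cfr_references": [{"title": 31, "chapter": "I"}],
         "children": [{"slug": "irs", "cfr_references": [
             {"title": 26, "chapter": "I"}, {"title": 26, "chapter": "I"}]}]}]
rows = list(flatten_agencies(tree))
```

Normalizing every reference to the same five-field tuple is what lets duplicates collapse cleanly before persistence, matching the SQLite-key reinforcement described below.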
Architecture notes and a rendered diagram live in docs/ARCHITECTURE.md.
- Review Priority (0–100): a triage rank combining current burden and change volatility.
- Restrictive Terms / 1k Words: weighted restrictive-language hits normalized by text size.
- Process Terms / 1k Words: weighted paperwork/process-language hits normalized by text size.
- Change Volatility: combines how often, how much, and how quickly the observed snapshots changed.
- Content Fingerprint: a stable SHA-256 checksum used for auditability and change detection.
These metrics are intentionally transparent. They are heuristics for prioritization, not legal conclusions or official burden estimates.
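The per-1k-words density metrics can be sketched like this. The term list and weights below are examples only; the app's actual weighted lists are published on the methodology page.

```python
# Illustrative recreation of the density metrics: weighted term hits per
# 1,000 words. The terms and weights are examples, not the app's lists.
import re

RESTRICTIVE_TERMS = {"shall": 1.0, "must": 1.0, "prohibited": 1.5, "required": 0.8}

def weighted_per_1k(text, weighted_terms):
    """Sum the weights of matching words, normalized per 1,000 words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(weighted_terms.get(word, 0.0) for word in words)
    return 1000.0 * hits / len(words)

sample = "The permittee shall file the required report. Late filing is prohibited."
density = weighted_per_1k(sample, RESTRICTIVE_TERMS)  # 3.3 weighted hits / 11 words
```

Normalizing by word count keeps the metric comparable across agencies whose regulatory text differs in size by orders of magnitude.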
- Agency mapping overlap: agency references come from the eCFR reader aid and can overlap. The pipeline normalizes reference tuples and stores them separately from current/history tables so the mapping layer stays inspectable.
- Duplicate CFR references: duplicate `(title, subtitle, chapter, subchapter, part)` mappings are removed before persistence and reinforced by SQLite keys.
- Unreliable dated XML fetches: historical fetches fail closed rather than silently publishing partial history windows.
- Ambiguous change interpretation: fingerprints, stored history points, and plain-language trust notes make it clear when the app is showing observed snapshots rather than amendment-by-amendment legal diffs.
- Current-row drift: audit and repair scripts can recompute current metrics from stored history and flag mismatches.
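The fingerprint and drift-audit ideas above can be sketched as follows. This is a minimal sketch under assumed names: the normalization rule, `detect_drift` helper, and history record shape are illustrative, not the project's API.

```python
# Sketch: content fingerprinting plus a drift check between the stored
# current row and the latest history snapshot. Names are illustrative.
import hashlib

def content_fingerprint(text):
    """Stable SHA-256 checksum of whitespace-normalized text, so that
    layout-only differences do not register as content changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def detect_drift(current_fingerprint, history_points):
    """Flag a mismatch between the current row and the newest history
    point (history_points assumed sorted oldest to newest)."""
    if not history_points:
        return False
    return current_fingerprint != history_points[-1]["fingerprint"]

a = content_fingerprint("part 101  general\nprovisions")
b = content_fingerprint("part 101 general provisions")
# Whitespace-only differences yield identical fingerprints: a == b
```

Because the checksum is deterministic, an audit script can recompute it from stored history at any time and flag current rows that no longer match.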
- Backend: Python 3.11, FastAPI, SQLite, HTTPX
- Frontend: React 18, TypeScript, Vite
- Testing: Pytest, Vitest
- Tooling: Ruff, Make, Docker Compose
```bash
cd backend
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cd ..
make run-backend
```

Open a second terminal:

```bash
cd frontend
npm install --cache ../.npm-cache
npm run dev
```

- Frontend: http://127.0.0.1:5173
- Backend OpenAPI docs: http://127.0.0.1:8000/docs
The repository includes a bundled validated SQLite seed at backend/data/ecfr_insights.sqlite3, so the initial local run does not require a live reseed.
From the repository root:
```bash
make test
make refresh-current-metrics
make audit-metrics
make audit-agencies
make snapshot
make smoke
```

Optional Docker verification:

```bash
docker compose up --build
make smoke-docker
```

When you want to rebuild the local dataset from live eCFR sources:

```bash
cd backend
source .venv/bin/activate
python -m app.ingestion --profile quick
python -m app.ingestion --profile standard
python -m app.ingestion --profile deep
```

If you only need to recompute current metrics from stored history without refetching XML:

```bash
make refresh-current-metrics
make audit-metrics
```

- This is optimized for a reproducible local workflow, not hosted multi-user deployment.
- The eCFR XML service can be unreliable for dated full-title snapshots, so the bundled local seed is the most predictable way to explore the project.
- Agency mappings come from the eCFR reader aid and can overlap.
- The metrics are prioritization heuristics, not legal or economic burden estimates.
- Observed history is based on bounded snapshots, not a full legal redline engine.
- Broader snapshot coverage with a more resilient fetch/cache strategy
- Better treatment of overlapping agency mappings and shared text ownership
- More expressive linguistic metrics beyond weighted term lists
- A hosted demo environment with automated data refreshes
AI tooling was used selectively for scaffolding, iteration speed, and documentation support. The metric design, reliability decisions, data-shape validation, audit logic, and final project framing were verified through direct code review, tests, and local execution.




