This repository implements a three-zone data architecture designed to maintain data integrity, enable safe transformations, and provide production-ready datasets for consumption.
Raw Zone (data/raw/)
Purpose: Immutable landing zone for source data
Characteristics:
- Never Modified: Data in this zone is append-only and immutable
- Authoritative Source: Single source of truth for all downstream processing
- Complete Audit Trail: All ingested data is retained with timestamps
- Organized by Source: Data organized by originating system or provider
Directory Structure:
data/raw/
|-- funding_sources/ # Funding data organized by source
| |-- sff/ # Survival & Flourishing Fund
| |-- open_philanthropy/ # Open Philanthropy
| |-- ai2050/ # Schmidt Sciences AI2050
| |-- macroscopic/ # Macroscopic
| |-- givewiki/ # GiveWiki
| |-- cooperative_ai/ # Cooperative AI Foundation
| `-- catalyze_impact/ # Catalyze Impact
`-- _archive/ # Historical/superseded data
Policies:
- Only automated ingestion processes write to this zone
- Files are named with timestamps: source_YYYY-MM-DD.json
- Never delete files - move to _archive/ if superseded
- Maintain complete lineage from source to ingestion
Access:
- Read: All processes
- Write: Ingestion processes only
- Delete: Never (archive instead)
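The raw-zone policies above (timestamped filenames, append-only writes, per-source directories) can be sketched as a small ingestion helper. The function name `ingest_raw` and its signature are illustrative assumptions, not part of the repository's actual tooling:

```python
import json
from datetime import date
from pathlib import Path

def ingest_raw(source: str, records: list, raw_root: Path) -> Path:
    """Write records to an immutable, timestamped file in the raw zone.

    Hypothetical helper: names follow the source_YYYY-MM-DD.json policy.
    """
    target_dir = raw_root / "funding_sources" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{source}_{date.today().isoformat()}.json"
    if target.exists():
        # Append-only zone: never overwrite an existing file.
        raise FileExistsError(f"refusing to overwrite {target}")
    target.write_text(json.dumps(records, indent=2))
    return target
```

Refusing to overwrite an existing file is what makes the zone effectively append-only: a second ingestion on the same day must either archive the earlier file first or fail loudly.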
Transformed Zone (data/transformed/)
Purpose: Intermediate processing stages
Characteristics:
- Validated Data: Schema-validated and quality-checked
- Cleaned Data: Normalized, deduplicated, and corrected
- Enriched Data: With derived fields and calculations
Directory Structure:
data/transformed/
|-- validated/ # Schema-validated, structurally sound
|-- cleaned/ # Normalized and deduplicated
`-- enriched/ # With derived fields and aggregations
Policies:
- Data flows: raw -> validated -> cleaned -> enriched
- Each stage is idempotent and reproducible
- All transformations are script-driven (no manual edits)
- Failed transformations are logged and halt processing
Access:
- Read: All processes
- Write: Transformation pipelines only
- Delete: Safe to recreate from raw data
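The policies above imply that every stage is a pure, script-driven function of the previous stage's files, which is what makes the zone safe to delete and recreate. A minimal sketch of one such stage runner (the name `run_stage` is an assumption for illustration):

```python
import json
from pathlib import Path

def run_stage(src_dir: Path, dst_dir: Path, transform) -> list:
    """Re-run one transformation stage from scratch.

    Output is fully derived from the input files, so deleting
    dst_dir and re-running is always safe (idempotent, reproducible).
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.json")):
        records = json.loads(src.read_text())
        out = transform(records)  # pure function: no manual edits
        dst = dst_dir / src.name
        dst.write_text(json.dumps(out, indent=2))
        written.append(dst)
    return written
```

The same runner can drive each hop in the raw -> validated -> cleaned -> enriched chain by swapping in a different `transform` callable.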
Serveable Zone (data/serveable/)
Purpose: Production-ready, optimized data for consumption
Characteristics:
- Analytics-Ready: Optimized for dashboards and analysis
- API-Ready: Formatted for direct API serving
- Performance-Optimized: Indexed, compressed, or pre-aggregated
Directory Structure:
data/serveable/
|-- analytics/ # For dashboards and analysis tools
`-- api/ # API-ready formats (JSON, etc.)
Policies:
- Generated from transformed zone only
- Optimized for specific use cases
- May include aggregations and summaries
- Safe to regenerate at any time
Access:
- Read: Applications, APIs, dashboards
- Write: Publishing pipelines only
- Delete: Safe to recreate from transformed data
Data Flow:
[External Source]
|
v
[data/raw/] <- Ingestion (immutable)
|
v
[data/transformed/ <- Validation
validated/]
|
v
[data/transformed/ <- Cleaning & normalization
cleaned/]
|
v
[data/transformed/ <- Enrichment & derivation
enriched/]
|
v
[data/serveable/] <- Publishing (optimized)
|
v
[Applications/APIs]
Migration (raw -> transformed)
Process: Migration script with validation
Script: scripts/migration/migrate.py
Operations:
- Validate source data against schema
- Copy (not move) to validated zone
- Log all operations
- Track checksums for idempotency
When to Run:
- After new data arrives in raw zone
- When schema or validation rules change
- On-demand for reprocessing
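The checksum-based idempotency mentioned in the operations list can be sketched as follows. The helper names and the in-memory `state` dict are assumptions; only the `.migration_state.json` filename comes from this document's troubleshooting notes:

```python
import hashlib
import json
from pathlib import Path

# State file name taken from the troubleshooting section of this document.
STATE_FILE = Path(".migration_state.json")

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def needs_processing(path: Path, state: dict) -> bool:
    """A file is re-processed only if it is new or its bytes changed."""
    return state.get(str(path)) != checksum(path)

def mark_processed(path: Path, state: dict) -> None:
    state[str(path)] = checksum(path)
```

Tracking content hashes rather than filenames or timestamps means a re-delivered file with identical bytes is skipped, while a silently changed file is picked up again.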
Publishing (transformed -> serveable)
Process: Publishing pipeline
Operations:
- Apply optimizations (indexing, compression)
- Generate format-specific outputs
- Create aggregations and summaries
- Verify output integrity
When to Run:
- After successful transformation
- When publishing requirements change
- On schedule for regular updates
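The publishing operations above (aggregate, emit an API-ready format, verify output integrity) can be sketched in a few lines. The `publish` function and the `funding.json` output name are illustrative assumptions:

```python
import json
from pathlib import Path

def publish(enriched_dir: Path, serveable_dir: Path) -> Path:
    """Aggregate enriched records into one API-ready file, then verify it."""
    records = []
    for src in sorted(enriched_dir.glob("*.json")):
        records.extend(json.loads(src.read_text()))
    summary = {"record_count": len(records), "records": records}
    out = serveable_dir / "api" / "funding.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summary))
    # Verify output integrity: the file must round-trip and match its count.
    check = json.loads(out.read_text())
    if check["record_count"] != len(check["records"]):
        raise ValueError("published output failed integrity check")
    return out
```

Because the output is derived entirely from the enriched stage, it is safe to regenerate at any time, exactly as the zone policy states.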
Adding a New Funding Source:
- Create subdirectory in data/raw/funding_sources/
- Update ingestion documentation
- Configure validation rules for source
- Test with sample data
- Document source-specific considerations
Full Reprocessing:
- Clear transformed and serveable zones (backup if needed)
- Run migration script: python scripts/migration/migrate.py
- Verify logs for errors
- Run publishing pipeline
- Validate outputs
Archiving Data:
- Identify data to archive (superseded, outdated)
- Move from raw zone to data/raw/_archive/
- Document reason for archival
- Keep for audit trail - never delete
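The archiving steps above could look like the sketch below. The `archive_file` helper and the `ARCHIVE_LOG.txt` filename are assumptions for illustration; only the `_archive/` location and the move-don't-delete policy come from this document:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_file(path: Path, raw_root: Path, reason: str) -> Path:
    """Move a superseded raw file into _archive/ (never delete) and log why."""
    archive_dir = raw_root / "_archive"
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / path.name
    shutil.move(str(path), dest)
    # Append-only log documenting the reason for archival (assumed filename).
    log = archive_dir / "ARCHIVE_LOG.txt"
    with log.open("a") as f:
        f.write(f"{date.today().isoformat()} {path.name}: {reason}\n")
    return dest
```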
If data corruption detected:
- Stop all processing pipelines
- Identify last known good state
- Clear affected zones
- Restore from raw zone (always authoritative)
- Rerun transformations
- Validate outputs before resuming
Metrics to Monitor:
- Files processed per hour
- Validation failure rate
- Average processing time
- Disk space utilization
- Error frequency by type
Alert When:
- Validation failures exceed threshold
- Disk space below 20% free
- Processing time exceeds SLA
- Checksum mismatches detected
- File corruption detected
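Two of the alert conditions above (failure-rate threshold, disk space below 20% free) can be checked with a short helper. The function name and the 5% failure threshold are illustrative assumptions; only the 20% free-space floor comes from this document:

```python
import shutil

def check_alerts(validation_failures: int, total_records: int,
                 data_path: str = ".", failure_threshold: float = 0.05,
                 min_free_fraction: float = 0.20) -> list:
    """Return a list of alert messages; thresholds are illustrative."""
    alerts = []
    if total_records and validation_failures / total_records > failure_threshold:
        alerts.append("validation failure rate above threshold")
    # 20% free-space floor from the alert conditions in this document.
    usage = shutil.disk_usage(data_path)
    if usage.free / usage.total < min_free_fraction:
        alerts.append("disk space below 20% free")
    return alerts
```

In practice a check like this would run on a schedule and feed whatever alerting channel the team already uses.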
Best Practices:
- Never Modify Raw Data: Always transform through pipelines
- Log Everything: Complete audit trail of all operations
- Validate Early: Catch issues at ingestion, not consumption
- Test Transformations: Use sample data before production runs
- Document Changes: Update schemas and docs with data changes
- Monitor Continuously: Track metrics and set up alerts
- Backup Before Changes: Especially for destructive operations
- Use Checksums: Verify integrity at every stage
- Fail Safely: Halt on errors, don't corrupt good data
- Keep It Simple: Avoid complex transformations in single step
Validation Failures:
- Check logs in logs/validation/
- Examine failing records
- Verify schema is current
- Check for data format changes
- Update validation rules if legitimate
Migration Failures:
- Check logs in logs/migration/
- Verify disk space available
- Check file permissions
- Review .migration_state.json for processed files
- Manually reset state if needed
Performance Issues:
- Check file sizes and counts
- Monitor system resources
- Consider batch processing
- Optimize transformations
- Add indexing or compression
- See RUNBOOK.md for step-by-step operational procedures
- See DATA_DICTIONARY.md for field definitions
- See LINEAGE.md for data flow documentation
- See scripts/migration/README.md for migration tool usage