Data Zone Architecture

Overview

This repository implements a three-zone data architecture designed to maintain data integrity, enable safe transformations, and provide production-ready datasets for consumption.

Zone Definitions

Raw Zone (data/raw/)

Purpose: Immutable landing zone for source data

Characteristics:

  • Never Modified: Data in this zone is append-only and immutable
  • Authoritative Source: Single source of truth for all downstream processing
  • Complete Audit Trail: All ingested data is retained with timestamps
  • Organized by Source: Data organized by originating system or provider

Directory Structure:

data/raw/
|-- funding_sources/          # Funding data organized by source
|   |-- sff/                 # Survival & Flourishing Fund
|   |-- open_philanthropy/   # Open Philanthropy
|   |-- ai2050/              # Schmidt Sciences AI2050
|   |-- macroscopic/         # Macroscopic
|   |-- givewiki/            # GiveWiki
|   |-- cooperative_ai/      # Cooperative AI Foundation
|   `-- catalyze_impact/     # Catalyze Impact
`-- _archive/                # Historical/superseded data

Policies:

  • Only automated ingestion processes write to this zone
  • Files are named with timestamps: source_YYYY-MM-DD.json
  • Never delete files; move them to _archive/ if superseded
  • Maintain complete lineage from source to ingestion
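
A minimal sketch of the naming and append-only policies above; the write_raw helper is illustrative, not part of this repository:

import json
from datetime import date
from pathlib import Path

def write_raw(source: str, records: list[dict]) -> Path:
    """Write an ingested payload to the raw zone with a timestamped name.

    Illustrative only: refuses to overwrite, keeping the zone append-only.
    """
    target_dir = Path("data/raw/funding_sources") / source
    target_dir.mkdir(parents=True, exist_ok=True)
    out_path = target_dir / f"{source}_{date.today().isoformat()}.json"
    if out_path.exists():
        raise FileExistsError(f"Refusing to overwrite raw file: {out_path}")
    out_path.write_text(json.dumps(records, indent=2))
    return out_path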

Access:

  • Read: All processes
  • Write: Ingestion processes only
  • Delete: Never (archive instead)

Transformed Zone (data/transformed/)

Purpose: Intermediate processing stages

Characteristics:

  • Validated Data: Schema-validated and quality-checked
  • Cleaned Data: Normalized, deduplicated, and corrected
  • Enriched Data: With derived fields and calculations

Directory Structure:

data/transformed/
|-- validated/    # Schema-validated, structurally sound
|-- cleaned/      # Normalized and deduplicated
`-- enriched/     # With derived fields and aggregations

Policies:

  • Data flows: raw -> validated -> cleaned -> enriched
  • Each stage is idempotent and reproducible
  • All transformations are script-driven (no manual edits)
  • Failed transformations are logged and halt processing
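
A minimal sketch of a script-driven stage that logs its work and halts on the first failure; the run_stage helper, its text-to-text transform signature, and the logging setup are assumptions, not the real pipeline code:

import logging
from pathlib import Path
from typing import Callable

logging.basicConfig(level=logging.INFO)

def run_stage(name: str, src: Path, dst: Path, transform: Callable[[str], str]) -> None:
    """Apply one transformation stage to every file; halt on any failure."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.json")):
        try:
            out = dst / path.name
            out.write_text(transform(path.read_text()))
            logging.info("%s: %s -> %s", name, path, out)
        except Exception:
            logging.exception("%s failed on %s; halting pipeline", name, path)
            raise  # fail safely: stop rather than write partial output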

Access:

  • Read: All processes
  • Write: Transformation pipelines only
  • Delete: Safe to recreate from raw data

Serveable Zone (data/serveable/)

Purpose: Production-ready, optimized data for consumption

Characteristics:

  • Analytics-Ready: Optimized for dashboards and analysis
  • API-Ready: Formatted for direct API serving
  • Performance-Optimized: Indexed, compressed, or pre-aggregated

Directory Structure:

data/serveable/
|-- analytics/    # For dashboards and analysis tools
`-- api/          # API-ready formats (JSON, etc.)

Policies:

  • Generated from transformed zone only
  • Optimized for specific use cases
  • May include aggregations and summaries
  • Safe to regenerate at any time
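
For illustration, a publishing step could roll enriched records up into a summary file like this; the field names (source, amount_usd) and the output filename are assumptions, not the actual schema:

import json
from collections import defaultdict
from pathlib import Path

def publish_totals_by_source(enriched_dir: Path, out_dir: Path) -> Path:
    """Aggregate enriched records into per-source totals for analytics."""
    totals: dict[str, float] = defaultdict(float)
    for path in enriched_dir.glob("**/*.json"):
        for record in json.loads(path.read_text()):
            totals[record["source"]] += record.get("amount_usd", 0.0)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "totals_by_source.json"
    out_path.write_text(json.dumps(totals, indent=2, sort_keys=True))
    return out_path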

Access:

  • Read: Applications, APIs, dashboards
  • Write: Publishing pipelines only
  • Delete: Safe to recreate from transformed data

Data Flow

[External Source]
     |
     v
[data/raw/]           <- Ingestion (immutable)
     |
     v
[data/transformed/    <- Validation
 validated/]
     |
     v
[data/transformed/    <- Cleaning & normalization
 cleaned/]
     |
     v
[data/transformed/    <- Enrichment & derivation
 enriched/]
     |
     v
[data/serveable/]     <- Publishing (optimized)
     |
     v
[Applications/APIs]

Zone Transition Rules

Raw -> Transformed

Process: Migration script with validation

Script: scripts/migration/migrate.py

Operations:

  1. Validate source data against schema
  2. Copy (not move) to validated zone
  3. Log all operations
  4. Track checksums for idempotency
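
A minimal sketch of the checksum tracking in step 4, assuming a flat JSON map in .migration_state.json; the actual state layout may differ (see scripts/migration/README.md):

import hashlib
import json
from pathlib import Path

STATE_FILE = Path(".migration_state.json")  # layout assumed for illustration

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def already_processed(path: Path) -> bool:
    """Skip files whose checksum was recorded by a previous run (idempotency)."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    return state.get(str(path)) == checksum(path)

def mark_processed(path: Path) -> None:
    """Record a file's checksum so reruns can skip it."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[str(path)] = checksum(path)
    STATE_FILE.write_text(json.dumps(state, indent=2))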

When to Run:

  • After new data arrives in raw zone
  • When schema or validation rules change
  • On-demand for reprocessing

Transformed -> Serveable

Process: Publishing pipeline

Operations:

  1. Apply optimizations (indexing, compression)
  2. Generate format-specific outputs
  3. Create aggregations and summaries
  4. Verify output integrity
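
One way step 4 could be implemented, assuming published files are JSON arrays and the expected record count is known from the transformed inputs; this is a sketch, not the pipeline's actual check:

import json
from pathlib import Path

def verify_output(out_path: Path, expected_records: int) -> None:
    """Fail loudly if a published file is unreadable or missing records."""
    data = json.loads(out_path.read_text())  # raises if the JSON is corrupt
    if len(data) != expected_records:
        raise ValueError(
            f"{out_path}: expected {expected_records} records, found {len(data)}"
        )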

When to Run:

  • After successful transformation
  • When publishing requirements change
  • On schedule for regular updates

Maintenance Procedures

Adding New Data Sources

  1. Create subdirectory in data/raw/funding_sources/
  2. Update ingestion documentation
  3. Configure validation rules for source
  4. Test with sample data
  5. Document source-specific considerations
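
As an illustration of step 3, per-source rules could be kept as a simple required-fields map; the VALIDATION_RULES structure and the field names below are hypothetical, not the project's real schema:

# Hypothetical per-source validation rules; adjust to the real schema.
VALIDATION_RULES = {
    "new_source": {
        "required_fields": ["grant_id", "recipient", "amount_usd", "date"],
        "filename_pattern": r"new_source_\d{4}-\d{2}-\d{2}\.json",
    },
}

def validate_record(source: str, record: dict) -> list[str]:
    """Return a list of problems for one record (empty list means valid)."""
    rules = VALIDATION_RULES[source]
    return [f"missing field: {f}" for f in rules["required_fields"] if f not in record]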

Reprocessing Data

  1. Clear the transformed and serveable zones (back up first if needed)
  2. Run migration script: python scripts/migration/migrate.py
  3. Verify logs for errors
  4. Run publishing pipeline
  5. Validate outputs

Archiving Old Data

  1. Identify data to archive (superseded, outdated)
  2. Move from raw zone to data/raw/_archive/
  3. Document reason for archival
  4. Keep archived data for the audit trail; never delete it
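
A sketch of steps 2 and 3, assuming the archival reason is recorded in a small note file next to the moved data; the .reason.txt convention is an assumption, not an existing convention in this repository:

import shutil
from datetime import date
from pathlib import Path

ARCHIVE_DIR = Path("data/raw/_archive")

def archive(path: Path, reason: str) -> Path:
    """Move a superseded raw file into _archive/ and record why (never delete)."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    dest = ARCHIVE_DIR / path.name
    shutil.move(str(path), str(dest))
    note = dest.parent / (dest.name + ".reason.txt")  # assumed convention
    note.write_text(f"{date.today().isoformat()}: {reason}\n")
    return dest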

Recovery Procedures

If data corruption detected:

  1. Stop all processing pipelines
  2. Identify last known good state
  3. Clear affected zones
  4. Restore from raw zone (always authoritative)
  5. Rerun transformations
  6. Validate outputs before resuming

Monitoring and Alerting

Key Metrics

  • Files processed per hour
  • Validation failure rate
  • Average processing time
  • Disk space utilization
  • Error frequency by type

Alert Conditions

  • Validation failures exceed threshold
  • Free disk space falls below 20%
  • Processing time exceeds SLA
  • Checksum mismatches detected
  • File corruption detected
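
For example, the disk-space condition could be checked with the standard library; the 20% threshold mirrors the list above, and the data path and alerting hook are placeholders:

import shutil

def check_disk_space(path: str = "data", min_free_ratio: float = 0.20) -> None:
    """Raise (and trigger an alert upstream) if free space drops below threshold."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < min_free_ratio:
        raise RuntimeError(f"Low disk space: {free_ratio:.0%} free on {path}")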

Best Practices

  1. Never Modify Raw Data: Always transform through pipelines
  2. Log Everything: Complete audit trail of all operations
  3. Validate Early: Catch issues at ingestion, not consumption
  4. Test Transformations: Use sample data before production runs
  5. Document Changes: Update schemas and docs with data changes
  6. Monitor Continuously: Track metrics and set up alerts
  7. Backup Before Changes: Especially for destructive operations
  8. Use Checksums: Verify integrity at every stage
  9. Fail Safely: Halt on errors, don't corrupt good data
  10. Keep It Simple: Avoid complex transformations in a single step

Troubleshooting

Validation Failures

  1. Check logs in logs/validation/
  2. Examine failing records
  3. Verify schema is current
  4. Check for data format changes
  5. Update validation rules if the format change is legitimate

Migration Stuck

  1. Check logs in logs/migration/
  2. Verify disk space available
  3. Check file permissions
  4. Review .migration_state.json for processed files
  5. Manually reset state if needed
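
If a manual reset is needed (step 5), and assuming .migration_state.json is a flat JSON map of file paths to checksums (an assumption; check the migration tool's README), dropping a single entry might look like this:

import json
from pathlib import Path

def reset_state_entry(state_file: Path, raw_path: str) -> None:
    """Drop one file from the migration state so it is reprocessed on the next run."""
    state = json.loads(state_file.read_text())
    state.pop(raw_path, None)
    state_file.write_text(json.dumps(state, indent=2))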

Performance Issues

  1. Check file sizes and counts
  2. Monitor system resources
  3. Consider batch processing
  4. Optimize transformations
  5. Add indexing or compression

References

  • See RUNBOOK.md for step-by-step operational procedures
  • See DATA_DICTIONARY.md for field definitions
  • See LINEAGE.md for data flow documentation
  • See scripts/migration/README.md for migration tool usage