The first weekly, spatial fire prediction system for Indian forests.
India's forest fire apparatus detects 3.5 lakh fires per year. It predicts zero. EcoFire changes that — predicting which 1km grid cells will ignite next week using satellite vegetation indices, weather reanalysis, fire danger indices, terrain, and historical fire patterns.
40,000 km² of forest → 50 km² shortlist. 800x spatial reduction.
Evaluated on 26 weeks of 2025 validation data (Karnataka, India). Source: eval.py → data/metrics.json.
| Season Band | Weeks | P@50 | Fires Intercepted | Concentration vs Base Rate |
|---|---|---|---|---|
| Overall (W1-26) | 26 | 7.8% | 102 | 13.1x |
| Fire season (W1-20) | 20 | 9.8% | 98 | 12.7x |
| Peak (W5-17, Feb-Apr) | 13 | 10.3% | 67 | 9.9x |
| Hot peak (W6-12, Feb-Mar) | 7 | 16.9% | 59 | 10.8x |
| Monsoon (W21+) | 6 | 1.3% | 4 | — |
- Best single week: W10 = 38% P@50 (19 of the 50 flagged cells burned, in a week with 1,141 fires statewide)
- Naive baseline (historical fire frequency only): 7.5% overall — model adds 4-34% relative lift depending on season band
- 756 experiments across 6 sweeps confirmed the ceiling with offline data
Precision@50 = of the top 50 cells flagged per week, how many actually burned? This is the operationally relevant metric — a forest division can realistically patrol ~50 km² per week.
Concentration = fire probability in the 50 flagged cells vs the forest-wide base rate. At 10.8x during hot peak, the model concentrates fire risk into 0.125% of the search space.
800x spatial reduction = 40,000 km² of forest → 50 km² shortlist (arithmetic, always true regardless of model quality).
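These definitions translate directly to code. A minimal sketch, assuming per-week scored cells with illustrative `score` and `fire` columns (not the actual schema in eval.py):

```python
import pandas as pd

def precision_at_k(week_df, k=50):
    """Fraction of the top-k scored cells that actually burned this week."""
    return week_df.nlargest(k, "score")["fire"].mean()

def concentration(week_df, k=50):
    """Fire rate inside the top-k cells vs the grid-wide base rate."""
    return precision_at_k(week_df, k) / week_df["fire"].mean()

# Deterministic toy week: 100 cells, 5 fires, fires ranked highest
toy = pd.DataFrame({
    "fire":  [1] * 5 + [0] * 95,
    "score": [100, 99, 98, 97, 96] + list(range(95)),
})
print(precision_at_k(toy))   # 0.1  (5 of the top 50 burned)
print(concentration(toy))    # 2.0  (vs a 5% base rate)
```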
| Current State (India) | With EcoFire |
|---|---|
| Zero fire prediction | Weekly 1km predictions |
| FSI FAST alerts detect fires after ignition | Predictions before ignition |
| Rangers patrol 40,000 km² blind | 50 km² shortlist per week |
| FWI pilot (2019) stalled at 2 regions | Statewide coverage, extensible |
India's fire management stack — FSI FAST v1→v3, fire prone mapping, FWI pilot — is entirely reactive or static. EcoFire is the first system that produces dynamic, weekly, spatial fire predictions at 1km resolution. See docs/09-MOEFCC-FIRE-INTELLIGENCE.md for the full gap analysis.
```bash
# Clone
git clone https://github.com/nikhilvelpanur/ecofire.git
cd ecofire

# Install dependencies
pip install -r requirements.txt

# Evaluate with saved model
python eval.py --model best

# Retrain best config and evaluate
python eval.py

# Run full sweep on Modal GPU (requires Modal account)
modal run -d modal_app.py::sweep
```

The processed datasets (data/*.parquet) are too large for GitHub (~3.4 GB). To rebuild from raw sources:
```bash
# 1. Download raw data (requires API keys — see Data Sources below)
python download_firms_batch.py   # NASA FIRMS fire detections
python download_ndvi.py          # Sentinel-2 NDVI (requires GEE service account)
python download_srtm.py          # SRTM elevation
python download_worldcover.py    # ESA WorldCover
python download_era5land.py      # ERA5-Land weather (requires CDS API key)
python download_smap.py          # SMAP soil moisture (requires GEE)
python download_fwi.py           # CEMS FWI fire danger (requires EWDS API key)
python download_ndvi_weekly.py   # Weekly Sentinel-2 composites (requires GEE)

# 2. Build grid + join all features
python prepare.py

# 3. Add Phase 3 features (ERA5-Land, SMAP, FWI)
python rebuild_phase3.py

# 4. Add weekly NDVI/LSWI features
python rebuild_weekly_ndvi.py

# 5. Evaluate
python eval.py
```

XGBoost with a binary:logistic objective, trained on all 40,000 cells per week. The model ranks cells by predicted fire probability; the top 50 per week are flagged.
Key hyperparameters (best config):
- `max_depth=7`, `learning_rate=0.05`, `subsample=0.7`
- `min_child_weight=50`, `gamma=1.0`, `lambda=5.0`
- Negative sampling: 10% of non-fire cells retained during training
- Early stopping on validation AUCPR
| Category | Features | Importance |
|---|---|---|
| Fire history | fire_freq, fire_history_1yr, fire_neighbor_1w/2w, fire_radius_5km_2w | Dominant (fire_freq = #1) |
| Fire danger indices | fwi_mean/max, ffmc_mean, dmc_mean, dc_mean, isi_mean, bui_mean | High (fwi_mean = #2, bui_mean = #3) |
| Weather | vpd_mean, temp_max_c, temp_mean_c, rh_mean, wind_speed_mean, precip_sum_mm, days_no_rain, precip_cumul_30d | Moderate |
| Vegetation | ndvi_mean, ndvi_diff_4w, ndvi_weekly, ndvi_diff_1w, lswi, lswi_diff_*, forest_fraction | Low (redundant with FWI) |
| Terrain | elevation_m, slope_deg, lat, lon | Low-moderate |
| Soil moisture | swvl1-4, stl1, smap_sm_mean/min, lai_hv/lv | Low |
| Human geography | dist_to_settlement/town_km, pop_within_5km, n_settlements_5km, nightlight_mean | Negligible |
| Temporal | week_sin, week_cos | Moderate |
What works: Historical fire frequency + fire danger indices (FWI/BUI) + basic weather. These have been the top 3 features since Sweep 1. No new data source has displaced them.
What doesn't work (confirmed):
- Higher-resolution weather (ERA5-Land 9km vs ERA5 30km) — redundant
- Satellite soil moisture (SMAP) — redundant with days_no_rain
- Weekly vegetation (Sentinel-2 NDVI/LSWI) — redundant with FWI
- Human geography (SHRUG settlements, population, nightlights) — zero lift
- Model architecture changes (LightGBM, ranking objectives, ensembles) — no improvement
- All-cell classification — collapses; only ranking works at 40K-cell scale
Root cause of the ceiling: The model learns where conditions allow fire, but not where someone will light one. Features predict fire weather and vegetation dryness (the "conditions" axis), but ignition in Indian forests is overwhelmingly anthropogenic and essentially random at 1km/1week resolution.
All data is freely available from public sources:
| Source | What | Resolution | API/Access |
|---|---|---|---|
| NASA FIRMS | Fire detections (ground truth) | Point | Free API, CSV download |
| ERA5-Land | Weather + soil + LAI | 9km | CDS API (free registration) |
| ERA5 | Weather reanalysis | 30km | CDS API |
| CEMS FWI | Fire Weather Index components | 8km | EWDS API (free registration) |
| SMAP | Surface soil moisture | 9km | Google Earth Engine |
| Sentinel-2 | NDVI + LSWI | 10m→1km | Google Earth Engine |
| SRTM GL1 | Elevation + slope | 30m | GEE or direct download |
| ESA WorldCover | Land cover / forest fraction | 10m | Direct download |
| SHRUG | Settlements, population | Village | Registration required |
Karnataka state is divided into 40,000 1km grid cells (from 113K total cells, subsampled to 23K fire-history + 17K non-fire cells). The grid is stored in grid/karnataka_grid.parquet.
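The candidate-cell subsampling can be sketched as follows; the cell counts, fire-history fraction, and column names here are illustrative, not the actual grid schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy grid: cell id plus whether the cell has any recorded fire history
grid = pd.DataFrame({
    "cell_id": np.arange(113_000),
    "has_fire_history": rng.random(113_000) < 0.2,
})

fire_cells = grid[grid["has_fire_history"]]          # keep every fire-history cell
no_fire = grid[~grid["has_fire_history"]]
sampled = no_fire.sample(n=17_000, random_state=0)   # random non-fire subset

candidates = pd.concat([fire_cells, sampled])
print(len(candidates))
```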
| Split | Period | Rows | Purpose |
|---|---|---|---|
| Train | 2020 W1 – 2024 W52 | ~10.4M | Model training |
| Val | 2025 W1 – W26 | ~1.04M | Hyperparameter tuning, metric reporting |
| Test | 2025 W27 – W52 | ~1.04M | Held out (monsoon-heavy, lower signal) |
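A strict temporal split like this reduces to simple filtering; a sketch assuming each row carries `year` and `week` columns:

```python
import pandas as pd

# Six toy rows standing in for the ~12.5M weekly cell rows
df = pd.DataFrame({
    "year": [2020, 2023, 2024, 2025, 2025, 2025],
    "week": [10, 30, 52, 5, 26, 40],
})

train = df[df["year"] <= 2024]                          # 2020 W1 – 2024 W52
val   = df[(df["year"] == 2025) & (df["week"] <= 26)]   # 2025 W1 – W26
test  = df[(df["year"] == 2025) & (df["week"] >= 27)]   # 2025 W27 – W52

# The three splits partition the data: no overlap, no leakage
assert len(train) + len(val) + len(test) == len(df)
```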
756 experiments across 6 sweeps. Full details in docs/04-EXPERIMENT-LOG.md.
| Sweep | Experiments | Best P@50 | Key Finding |
|---|---|---|---|
| 1: Baseline | 145 | 9.00% | Established ceiling with base features |
| 2: SHRUG + contagion | 155 | 9.00% | Human geography features = zero lift |
| 3: Architectures | 158 | 9.08% | LightGBM, ranking objectives — model not the bottleneck |
| 4: Phase 3 data | 162 | 9.23% | ERA5-Land + SMAP + FWI — redundant with existing weather |
| 5: All-cell scoring | 68 | 9.00% | Removed candidate filter — classification collapses at scale |
| 6: Weekly NDVI+LSWI | 68 | 8.54% | Weekly vegetation signal redundant with FWI |
This project follows the autoresearch pattern (Karpathy): prepare.py is fixed (data pipeline), while train.py is iterated by an AI research agent guided by program.md. The agent runs experiments autonomously, evaluating against Precision@50 on the validation set.
eval.py is the single source of truth for all metrics. It retrains the best configuration (or loads a saved model), computes per-week P@50 for every validation week, aggregates into season bands, and outputs data/metrics.json.
```bash
python eval.py                 # Retrain + evaluate
python eval.py --model best    # Load saved model + evaluate
python eval.py --split test    # Evaluate on test set
```

This was built after a lesson learned: ad-hoc metric calculations during long research sessions produced inflated numbers that propagated to documentation. All claims now trace back to metrics.json.
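The season-band aggregation step in eval.py can be sketched as follows; the band boundaries come from the results table above, while the per-week P@50 values and output structure here are illustrative:

```python
import json

# Per-week P@50 for validation weeks 1..26 (illustrative values)
weekly_p50 = {w: 0.10 if w <= 20 else 0.01 for w in range(1, 27)}

BANDS = {
    "overall": range(1, 27),       # W1-26
    "fire_season": range(1, 21),   # W1-20
    "peak": range(5, 18),          # W5-17
    "hot_peak": range(6, 13),      # W6-12
    "monsoon": range(21, 27),      # W21+
}

# Mean P@50 per season band
metrics = {
    band: sum(weekly_p50[w] for w in weeks) / len(weeks)
    for band, weeks in BANDS.items()
}

print(json.dumps(metrics, indent=2))  # eval.py writes this to data/metrics.json
```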
```
ecofire/
├── README.md                    # This file
├── eval.py                      # Evaluation pipeline (source of truth)
├── train.py                     # Model training + sweep (iterated by AI)
├── prepare.py                   # Data pipeline: raw → features → parquet
├── modal_app.py                 # Modal serverless GPU runner
├── program.md                   # Autoresearch agent instructions
├── requirements.txt             # Python dependencies
│
├── download_firms_batch.py      # NASA FIRMS fire detections
├── download_ndvi.py             # Monthly Sentinel-2 NDVI (GEE)
├── download_ndvi_weekly.py      # Weekly Sentinel-2 NDVI + LSWI (GEE)
├── download_era5land.py         # ERA5-Land weather (CDS API)
├── download_smap.py             # SMAP soil moisture (GEE)
├── download_fwi.py              # CEMS FWI fire danger (EWDS API)
├── download_srtm.py             # SRTM elevation
├── download_worldcover.py       # ESA WorldCover land use
├── download_shrug.py            # SHRUG settlement data
│
├── rebuild_features.py          # Feature engineering (v1-v7 variants)
├── rebuild_phase3.py            # ERA5-Land + SMAP + FWI integration
├── rebuild_weekly_ndvi.py       # Weekly NDVI/LSWI integration
│
├── data/
│   ├── metrics.json             # Authoritative evaluation metrics
│   ├── best_model.json          # Saved XGBoost model
│   ├── feature_stats.json       # Feature normalization statistics
│   ├── train.parquet            # Training data (not in repo — too large)
│   ├── val.parquet              # Validation data (not in repo)
│   └── test.parquet             # Test data (not in repo)
│
├── grid/
│   └── karnataka_grid.parquet   # 40K cell grid with terrain features
│
├── baselines/
│   └── baselines.json           # Pre-computed baseline results
│
├── docs/
│   ├── 01-PROJECT-OVERVIEW.md           # Motivation, timeline, infrastructure
│   ├── 02-DATA-PIPELINE.md              # All data sources, grid, splits, features
│   ├── 03-ARCHITECTURE.md               # Model evolution, what didn't work
│   ├── 04-EXPERIMENT-LOG.md             # All 756 experiments across 6 sweeps
│   ├── 05-DEPLOYMENT-ROADMAP.md         # Gap analysis, deployment architecture
│   ├── 06-IMPROVEMENT-PLAN.md           # 5-phase improvement plan
│   ├── 07-LANDSCAPE-STUDY.md            # 33 papers/systems literature review
│   ├── 08-COMMERCIAL-DATA-SOURCES.md    # Free vs paid data analysis
│   ├── 09-MOEFCC-FIRE-INTELLIGENCE.md   # India fire management gaps
│   ├── 10-ONLINE-LEARNING-ROADMAP.md    # Path to 30%+ via deployment
│   └── 11-ADJACENT-OPPORTUNITIES.md     # Platform extension opportunities
│
├── sweep_results.txt            # Sweep 1 output
├── sweep_results_v2.txt         # Sweep 2 output
└── sweep_results_v3.txt         # Sweep 3 output
```
A literature review of 33 papers and systems found:
- No weekly fire prediction system exists for India. All Indian studies do static susceptibility mapping.
- ECMWF PoF (Probability of Fire) is the global state-of-the-art — 1km, 10-day forecast. Our work validates XGBoost as competitive with their approach for regional deployment.
- XGBoost matches or beats deep learning for tabular fire prediction (AUC 0.83-0.96 across published studies).
- The biggest gap in Indian fire research is temporal/lagged features — exactly what EcoFire addresses.
The 9% ceiling with offline data is confirmed. The path to 30%+ requires deployment + online learning:
- Ship & observe — deploy model, log predictions, auto-label via FIRMS
- Online recalibration — monthly retrain with fresh FIRMS labels
- Active learning — Thompson sampling to explore under-predicted regions
- Causal/counterfactual — solve the prevention paradox (successful prevention removes positive labels)
- Multi-state transfer — extend beyond Karnataka for more training data diversity
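As a sketch of the active-learning step, Thompson sampling can maintain a Beta posterior over each cell's weekly fire rate and sample from it to pick patrol cells, naturally mixing exploitation with exploration of under-predicted regions. This is purely illustrative and not part of the current codebase:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1000

# Beta(alpha, beta) posterior per cell over its weekly fire probability,
# updated from FIRMS auto-labels after each deployed week
alpha = np.ones(n_cells)   # prior fire observations
beta = np.ones(n_cells)    # prior no-fire weeks

def select_patrol_cells(k=50):
    """Sample a fire rate from each cell's posterior; flag the top-k draws."""
    draws = rng.beta(alpha, beta)
    return np.argsort(draws)[::-1][:k]

def update(cells, burned):
    """After the week, update posteriors from observed FIRMS labels (0/1)."""
    alpha[cells] += burned
    beta[cells] += 1 - burned

cells = select_patrol_cells()
burned = (rng.random(50) < 0.1).astype(int)   # stand-in for FIRMS auto-labels
update(cells, burned)
```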
See docs/10-ONLINE-LEARNING-ROADMAP.md for the full roadmap.
If you use this work, please cite:
```bibtex
@software{ecofire2026,
  author = {Velpanur, Nikhil},
  title  = {EcoFire: Weekly Forest Fire Prediction for India},
  year   = {2026},
  url    = {https://github.com/nikhilvelpanur/ecofire}
}
```
Apache 2.0. See LICENSE.
Built by Emergent Narrative as part of the Ecological DPI initiative. This work was conducted using the autoresearch methodology with Claude (Anthropic) as the AI research agent.
Data sources: NASA FIRMS, Copernicus Climate Data Store (ERA5, ERA5-Land, CEMS FWI), Google Earth Engine (Sentinel-2, SMAP), USGS (SRTM), ESA (WorldCover), Development Data Lab (SHRUG).