A multi-omics AI benchmark for spaceflight biomedical data, featuring 21 ML tasks across 9 modalities and a 100-question LLM evaluation framework. Data from the SpaceX Inspiration4 (I4) civilian astronaut mission, NASA Twins Study, and JAXA Cell-Free Epigenome (CFE) study.
All benchmark tables are derived from OSDR public releases and/or published supplementary tables. Any human sequence-level or restricted files are excluded from the open track; a controlled-access track may require an approved OSDR Data Access Request (DAR).
| ML Tasks | 21 tasks (19 main + 2 supplementary) |
| LLM Evaluation | 100 questions, 5-dimension Claude-as-judge scoring, 9 models evaluated |
| Modalities | Clinical, cfRNA, Proteomics, Metabolomics, Spatial Transcriptomics, Microbiome, Multi-modal, Cross-tissue, Cross-mission |
| Difficulty Tiers | Calibration (1) / Standard (5) / Advanced (9) / Frontier (6) |
| Missions | Inspiration4 (4 crew, 3 days LEO), NASA Twins (340 days ISS), JAXA CFE (6 astronauts, >120 days ISS) |
| Evaluation | Leave-One-Crew-Out, Leave-One-Timepoint-Out, 80/20 feature splits (5 reps) |
| ML Baselines | Random, Majority, LogReg (scikit-learn), RF (scikit-learn), MLP (scikit-learn), XGBoost, LightGBM |
| LLM Evaluated | Claude Sonnet 4.6, Haiku 4.5, Sonnet 4, GPT-4o, GPT-4o Mini, DeepSeek-V3, Gemini 2.5 Flash, Llama-3.3-70B (×2 backends) |
git clone https://github.com/jang1563/SpaceOmicsBench.git
cd SpaceOmicsBench
# Create conda environment
conda create -n spaceomics python=3.11 -y
conda activate spaceomics
pip install -r requirements.txt
# Optional: LLM evaluation dependencies
pip install -r requirements-llm.txt

python baselines/run_baselines.py

This runs all 7 baseline models on all 21 tasks and outputs:
- Per-task metrics (primary + secondary)
- B1 feature ablation study
- Normalized composite scores
- Results saved to `baselines/baseline_results.json`
# Dry run — verify all tasks and splits load correctly
python evaluation/eval_harness.py --dry-run
# Evaluate predictions
python evaluation/eval_harness.py --task all --predictions your_results/ --output results.json

Prediction file format: one JSON per task in the predictions directory. See the evaluation/eval_harness.py header for format details.
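A minimal sketch of writing one per-task prediction file (the field names below are assumptions for illustration; the authoritative format is documented in the eval_harness.py header):

```python
# Illustrative prediction file for one task -- field names are assumptions,
# not the harness's confirmed schema.
import json
from pathlib import Path

prediction = {
    "task_id": "A1",                      # benchmark task ID
    "model_name": "my_model_v1",          # free-text identifier
    "predictions": {                       # sample ID -> predicted label or score
        "sample_001": "preflight",
        "sample_002": "inflight",
    },
}

out_dir = Path("your_results")
out_dir.mkdir(exist_ok=True)
(out_dir / "A1.json").write_text(json.dumps(prediction, indent=2))
```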
Open demo.html in a browser for an interactive visualization of benchmark results, task descriptions, and baseline comparisons.
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| A1 | Flight Phase Classification (Blood Panel) | 3-class | Standard | 28 | macro_f1 | LOCO (4-fold) |
| A2 | Flight Phase Classification (Immune Markers) | 3-class | Standard | 28 | macro_f1 | LOCO (4-fold) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| B1 | Spaceflight-Responsive Gene Ranking | Binary | Advanced | 26,845 | AUPRC | Feature 80/20 (5-rep) |
| B2 | Coregulated Gene Cluster Prediction | Multilabel (16) | Advanced | 466 | micro_f1 | Feature 80/20 (5-rep) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| C1 | Proteomics Phase Classification | 3-class | Standard | 21 | macro_f1 | LOCO (4-fold) |
| C2 | Cross-Biofluid Protein DE Concordance | Binary | Frontier | 380 | AUROC | Feature 80/20 (5-rep) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| D1 | Metabolite Spaceflight Response Prediction | Binary | Advanced | 433 | AUROC | Feature 80/20 (5-rep) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| E1 | Cross-Layer DE (outer_epidermis) | Binary | Advanced | 18,677 | AUPRC | Feature 80/20 (5-rep) |
| E4 | Cross-Layer DE (epidermis) | Binary | Advanced | 18,677 | AUPRC | Feature 80/20 (5-rep) |
| E2* | Cross-Layer DE (inner_epidermis) | Binary | Frontier | 18,677 | AUPRC | Feature 80/20 (5-rep) |
| E3* | Cross-Layer DE (outer_dermis) | Binary | Frontier | 18,677 | AUPRC | Feature 80/20 (5-rep) |
* Supplementary: extreme class imbalance (11-18 positives out of 18,677), metric instability expected.
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| F1 | Body Site Classification (Taxonomy) | 10-class | Standard | 275 | macro_f1 | LOCO (4-fold) |
| F2 | Flight Phase Detection (Taxonomy) | 4-class | Frontier | 275 | macro_f1 | LOCO (4-fold) |
| F3 | Human vs Environmental Classification | Binary | Calibration | 314 | AUROC | LOTO (7-fold) |
| F4 | Body Site Classification (Pathways) | 10-class | Standard | 275 | macro_f1 | LOCO (4-fold) |
| F5 | Flight Phase Detection (Pathways) | 4-class | Frontier | 275 | macro_f1 | LOCO (4-fold) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| G1 | Multi-Modal Phase Classification | 3-class | Advanced | 21 | macro_f1 | LOCO (4-fold) |
Fuses clinical biomarkers + PCA(proteomics) + PCA(metabolomics).
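A minimal sketch of that fusion with scikit-learn, using toy matrices and 5 PCA components (shapes and component counts are assumptions, not the benchmark's exact preprocessing):

```python
# Illustrative multi-modal fusion: clinical features + PCA(proteomics) + PCA(metabolomics).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_samples = 21                                   # G1 sample count
clinical = pd.DataFrame(np.random.rand(n_samples, 12)).add_prefix("clin_")
proteomics = np.random.rand(n_samples, 300)      # toy stand-ins for the real matrices
metabolomics = np.random.rand(n_samples, 400)

prot_pcs = PCA(n_components=5).fit_transform(proteomics)
metab_pcs = PCA(n_components=5).fit_transform(metabolomics)

fused = pd.concat(
    [clinical,
     pd.DataFrame(prot_pcs).add_prefix("prot_pc_"),
     pd.DataFrame(metab_pcs).add_prefix("metab_pc_")],
    axis=1,
)
print(fused.shape)  # (21, 12 + 5 + 5)
```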
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| H1 | Cross-Tissue Gene Conservation | Binary | Advanced | 731 | AUPRC | Feature 80/20 (5-rep) |
| ID | Task | Type | Tier | N | Metric | Split |
|---|---|---|---|---|---|---|
| I1 | Hemoglobin Gene DE Prediction | Binary | Frontier | 26,845 | AUPRC | Feature 80/20 (5-rep) |
| I2 | Cross-Mission Pathway Conservation | Binary | Advanced | 452 | AUROC | Feature 80/20 (5-rep) |
| I3 | Cross-Mission Gene DE Conservation | Binary | Advanced | 15,540 | AUPRC | Feature 80/20 (5-rep) |
Uses NASA Twins Study (340-day ISS, N=1 astronaut with twin control) to predict Inspiration4 patterns. I1 tests whether Twins fold-changes identify hemoglobin pathway genes. I2/I3 test cross-mission conservation at pathway and gene levels.
| Tier | Description | Baseline Behavior |
|---|---|---|
| Calibration | Easy validation tasks | Best baseline AUROC > 0.8 |
| Standard | Learnable with standard methods | Best baseline clearly above random |
| Advanced | Challenging, meaningful signal exists | Some baselines above random |
| Frontier | At the boundary of learnability | Near-random baseline performance |
- Classification (multi-class): macro F1, accuracy, per-class F1
- Binary classification: AUROC, AUPRC, F1
- Multilabel: micro F1, macro F1, Hamming loss
- Direction concordance: for cross-biofluid tasks (C2)
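Most of these are standard scikit-learn metrics (direction concordance for C2 is benchmark-specific; see evaluation/metrics.py). A minimal sketch:

```python
# Minimal examples of the primary metrics using scikit-learn.
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score, hamming_loss,
                             roc_auc_score)

# Multi-class (e.g., flight phase): macro F1
macro_f1 = f1_score(["pre", "in", "post"], ["pre", "in", "in"], average="macro")

# Binary (e.g., DE gene ranking): AUROC / AUPRC from predicted scores
y_true, scores = [0, 1, 0, 1, 1], [0.1, 0.8, 0.3, 0.6, 0.9]
auroc = roc_auc_score(y_true, scores)
auprc = average_precision_score(y_true, scores)

# Multilabel (e.g., B2 clusters): micro F1 and Hamming loss on 0/1 matrices
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
micro_f1 = f1_score(Y_true, Y_pred, average="micro")
hl = hamming_loss(Y_true, Y_pred)
```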
Individual task scores are normalized against the random baseline to handle metric scale differences:
normalized_score = (model_score - random_baseline) / (1.0 - random_baseline)
Category scores are averaged within each category, then the composite is the mean across all categories:
composite = mean(category_averages)
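A minimal sketch of the same computation in Python:

```python
# Normalize each task score against its random baseline, average within each
# category, then take the mean across categories.
def normalized_score(model_score: float, random_baseline: float) -> float:
    return (model_score - random_baseline) / (1.0 - random_baseline)

def composite(category_scores: dict[str, list[float]]) -> float:
    category_averages = [sum(v) / len(v) for v in category_scores.values()]
    return sum(category_averages) / len(category_averages)

# Example with two categories of already-normalized task scores
print(composite({"A_clinical": [0.42, 0.35], "B_cfrna": [0.91, 0.08]}))
```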
| Strategy | Used By | Description |
|---|---|---|
| LOCO | A1, A2, C1, F1-F5, G1 | Leave-One-Crew-Out (4 folds for I4 crew) |
| LOTO | F3 | Leave-One-Timepoint-Out (7 folds) |
| Feature 80/20 | B1, B2, C2, D1, E1-E4, H1, I1-I3 | Stratified 80/20 (5 repetitions, seed=42) |
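A minimal sketch of these strategies with scikit-learn. Official results should use the precomputed indices shipped in splits/; the per-repetition seeding shown here is an assumption, not the harness's exact scheme:

```python
# Illustrative LOCO and stratified 80/20 splits.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

X = np.random.rand(28, 10)              # e.g., A1: 28 samples
y = np.array([0, 1, 2] * 9 + [0])       # 3 flight phases (toy labels)
crew = np.repeat([1, 2, 3, 4], 7)       # LOCO: one fold per I4 crew member

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=crew):
    pass                                 # 4 folds, each holding out one crew member

# Feature 80/20 with 5 repetitions (seed=42 is the stated base seed;
# incrementing it per repetition is an assumption).
for rep in range(5):
    train_feat, test_feat = train_test_split(
        np.arange(len(y)), test_size=0.2, stratify=y, random_state=42 + rep)
```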
Tables below are generated from baselines/baseline_results.json.
To refresh them after re-running baselines:
python scripts/generate_readme_tables.py --update-readme README.md
| Task | Tier | Metric | Random | Majority | LogReg | RF | MLP | XGBoost | LightGBM |
|---|---|---|---|---|---|---|---|---|---|
| A1 | Standard | macro_f1 | 0.214 | 0.200 | 0.546 | 0.294 | 0.310 | 0.332 | 0.200 |
| A2 | Standard | macro_f1 | 0.214 | 0.200 | 0.493 | 0.374 | 0.331 | 0.353 | 0.200 |
| B1 | Advanced | AUPRC | 0.020 | 0.017 | 0.533 | 0.885 | 0.854 | 0.912 | 0.922 |
| B2 | Advanced | micro_f1 | 0.083 | 0.000 | 0.154 | 0.131 | 0.000 | — | — |
| C1 | Standard | macro_f1 | 0.170 | 0.228 | 0.512 | 0.464 | 0.517 | 0.355 | 0.228 |
| C2 | Frontier | AUROC | 0.529 | 0.500 | 0.500 | 0.555 | 0.524 | 0.533 | 0.565 |
| D1 | Advanced | AUROC | 0.481 | 0.500 | 0.561 | 0.676 | 0.557 | 0.617 | 0.638 |
| E1 | Advanced | AUPRC | 0.008 | 0.002 | 0.017 | 0.015 | 0.003 | 0.010 | 0.005 |
| E4 | Advanced | AUPRC | 0.003 | 0.002 | 0.023 | 0.002 | 0.003 | 0.006 | 0.009 |
| E2* | Frontier | AUPRC | 0.001 | 0.001 | 0.031 | 0.050 | 0.011 | 0.020 | 0.005 |
| E3* | Frontier | AUPRC | 0.002 | 0.001 | 0.172 | 0.223 | 0.168 | 0.160 | 0.088 |
| F1 | Standard | macro_f1 | 0.112 | 0.018 | 0.147 | 0.199 | 0.108 | 0.193 | 0.200 |
| F2 | Frontier | macro_f1 | 0.205 | 0.111 | 0.236 | 0.238 | 0.204 | 0.263 | 0.280 |
| F3 | Calibration | AUROC | 0.402 | 0.500 | 0.574 | 0.841 | 0.320 | 0.838 | 0.838 |
| F4 | Standard | macro_f1 | 0.112 | 0.018 | 0.163 | 0.151 | 0.096 | 0.134 | 0.160 |
| F5 | Frontier | macro_f1 | 0.205 | 0.111 | 0.240 | 0.254 | 0.229 | 0.300 | 0.304 |
| G1 | Advanced | macro_f1 | 0.253 | 0.228 | 0.517 | 0.254 | 0.285 | 0.328 | 0.228 |
| H1 | Advanced | AUPRC | 0.060 | 0.048 | 0.176 | 0.266 | 0.062 | 0.213 | 0.284 |
| I1 | Frontier | AUPRC | 0.003 | 0.002 | 0.003 | 0.005 | 0.003 | 0.005 | 0.006 |
| I2 | Advanced | AUROC | 0.504 | 0.500 | 0.586 | 0.706 | 0.580 | 0.716 | 0.735 |
| I3 | Advanced | AUPRC | 0.059 | 0.052 | 0.090 | 0.081 | 0.090 | 0.081 | 0.086 |
Bold = best performing baseline per task. — = not applicable (multilabel task). * = supplementary task (extreme class imbalance; excluded from composite score).
| Model | Composite | Best Categories |
|---|---|---|
| RF | 0.258 | B_cfrna (0.882), F_source (0.735), D_metabolomics (0.375) |
| XGBoost | 0.250 | B_cfrna (0.910), F_source (0.728), D_metabolomics (0.262) |
| LightGBM | 0.238 | B_cfrna (0.921), F_source (0.730), D_metabolomics (0.302) |
| LogReg | 0.201 | B_cfrna (0.523), A_clinical (0.389), G_multimodal (0.353) |
| MLP | 0.133 | B_cfrna (0.851), C_proteomics (0.209), D_metabolomics (0.147) |
The B1 task includes effect-size features (fold-changes, differences) alongside distribution features. Ablation reveals:
| Variant | Features | LogReg | RF | MLP | XGBoost | LightGBM |
|---|---|---|---|---|---|---|
| B1 (all) | All 29 features | 0.533 | 0.885 | 0.854 | 0.912 | 0.922 |
| B1 (effect-only) | Only fold-change/diff | 0.248 | 0.813 | 0.741 | 0.780 | 0.801 |
| B1 (no-effect) | Exclude fold-change/diff | 0.527 | 0.863 | 0.847 | 0.899 | 0.884 |
Distribution-based features (means, ranges, IQRs) carry most of the predictive signal, confirming the task tests genuine biological pattern recognition rather than simple effect-size thresholding. Gradient boosting methods (XGBoost, LightGBM) achieve the highest B1 scores, with LightGBM reaching AUPRC=0.922.
SpaceOmicsBench includes a question-based evaluation framework for assessing LLM understanding of spaceflight multi-omics data.
100 questions across 9 modalities and 4 difficulty levels:
| Modality | Easy | Medium | Hard | Expert | Total |
|---|---|---|---|---|---|
| Clinical | 3 | 3 | 3 | 1 | 10 |
| Transcriptomics | 2 | 3 | 3 | 2 | 10 |
| Proteomics | 2 | 3 | 3 | 2 | 10 |
| Metabolomics | 2 | 3 | 3 | 2 | 10 |
| Spatial | 1 | 4 | 3 | 2 | 10 |
| Microbiome | 2 | 4 | 3 | 1 | 10 |
| Cross-Mission | 2 | 5 | 6 | 5 | 18 |
| Multi-Omics | 1 | 3 | 5 | 3 | 12 |
| Methods | 2 | 4 | 2 | 2 | 10 |
| Total | 17 | 32 | 31 | 20 | 100 |
Question types: factual, interpretation, reasoning, counterfactual, experimental design, cross-mission comparison.
| Dimension | Weight | Description |
|---|---|---|
| Factual Accuracy | 0.25 | Are stated facts correct? |
| Reasoning Quality | 0.25 | Is scientific logic sound? |
| Completeness | 0.20 | Are key factors addressed? |
| Uncertainty Calibration | 0.15 | Appropriate hedging for small-N data? |
| Domain Integration | 0.15 | Cross-omics/mission knowledge? |
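The overall judge score follows directly from the weights in the table; a minimal sketch (dimension key names are illustrative, the canonical schema is evaluation/llm/annotation_schema.json):

```python
# Weighted 5-dimension judge score on a 1-5 scale.
WEIGHTS = {
    "factual_accuracy": 0.25,
    "reasoning_quality": 0.25,
    "completeness": 0.20,
    "uncertainty_calibration": 0.15,
    "domain_integration": 0.15,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores into the weighted overall score."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

print(overall_score({
    "factual_accuracy": 4.5, "reasoning_quality": 4.8, "completeness": 4.2,
    "uncertainty_calibration": 3.9, "domain_integration": 4.1,
}))
```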
# ── Proprietary models ────────────────────────────────────────────────────
# Claude
export ANTHROPIC_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py --model claude-sonnet-4-6 --sample 10
# OpenAI
export OPENAI_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py --model gpt-4o --full
# ── Open-source via API (OpenAI-compatible) ───────────────────────────────
# Groq (Llama 3.3 70B — free tier, fast)
export GROQ_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py --model llama-3.3-70b-versatile \
--base-url https://api.groq.com/openai/v1 --api-key-env GROQ_API_KEY --full
# Together.ai (Llama 3.3 70B — serverless, no dedicated endpoint needed)
export TOGETHER_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py \
--model meta-llama/Llama-3.3-70B-Instruct-Turbo \
--base-url https://api.together.xyz/v1 --api-key-env TOGETHER_API_KEY --full
# DeepSeek-V3
export DEEPSEEK_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py --model deepseek-chat \
--base-url https://api.deepseek.com/v1 --api-key-env DEEPSEEK_API_KEY --full
# Gemini 2.5 Flash (via OpenAI-compatible endpoint; billing must be active)
export GEMINI_API_KEY="your-key"
python evaluation/llm/run_llm_evaluation.py --model models/gemini-2.5-flash \
--base-url https://generativelanguage.googleapis.com/v1beta/openai/ \
--api-key-env GEMINI_API_KEY --full
# ── Open-source via Ollama (fully local, Apple Silicon supported) ─────────
# Pull model first: ollama pull llama3.3
python evaluation/llm/run_llm_evaluation.py --model llama3.3:70b \
--base-url http://localhost:11434/v1 --full
# ── HuggingFace local (Apple Silicon MPS auto-detected) ──────────────────
python evaluation/llm/run_llm_evaluation.py \
--model meta-llama/Llama-3.3-70B-Instruct --sample 10
# ── Scoring (Claude / GPT-4o / open-source as judge) ─────────────────────
# Claude as judge (default)
python evaluation/llm/score_responses.py results/eval_*.json
# GPT-4o as judge
python evaluation/llm/score_responses.py results/eval_*.json \
--judge-backend openai --judge-model gpt-4o
# Open-source as judge via Groq
python evaluation/llm/score_responses.py results/eval_*.json \
--judge-backend compatible --judge-model llama-3.3-70b-versatile \
--judge-base-url https://api.groq.com/openai/v1 \
--judge-api-key-env GROQ_API_KEY
# ── Generate comparison report ────────────────────────────────────────────
python evaluation/llm/generate_report.py results/scored_*.json --compare

LLM reproducibility notes:
- Default generation settings: `temperature=0.3`, `max_tokens=2000`
- No fixed random seed is set by default; expect small variability across runs
- For model comparisons, report the exact model name, temperature, and max_tokens
Reproducing the results table — commands used to generate the numbers in the Results section below:
# Step 1: Run full evaluation (100 questions)
python evaluation/llm/run_llm_evaluation.py --model claude-sonnet-4-6 --full
python evaluation/llm/run_llm_evaluation.py --model gpt-4o --full
# Step 2: Score with Sonnet 4.6 as judge (primary judge for the table)
python evaluation/llm/score_responses.py results/eval_claude-sonnet-4-6_*.json \
--judge-model claude-sonnet-4-6 --judge-backend anthropic \
--output results/scored_eval_claude-sonnet-4-6_judged-by-sonnet46.json
python evaluation/llm/score_responses.py results/eval_claude-sonnet-4-20250514_*.json \
--judge-model claude-sonnet-4-6 --judge-backend anthropic \
--output results/scored_eval_claude-sonnet-4_judged-by-sonnet46.json
python evaluation/llm/score_responses.py results/eval_gpt-4o_*.json \
--judge-model claude-sonnet-4-6 --judge-backend anthropic \
--output results/scored_eval_gpt-4o_judged-by-sonnet46.json
# Step 3: Score with Sonnet 4 as judge (for cross-judge column)
python evaluation/llm/score_responses.py results/eval_claude-sonnet-4-6_*.json \
--judge-model claude-sonnet-4-20250514 --judge-backend anthropic \
--output results/scored_eval_claude-sonnet-4-6_judged-by-sonnet4.json
# Step 4: Score with GPT-4o as judge (for cross-judge column)
python evaluation/llm/score_responses.py results/eval_claude-sonnet-4-20250514_*.json \
--judge-backend openai --judge-model gpt-4o \
--output results/scored_eval_claude-sonnet-4_judged-by-gpt4o.json
python evaluation/llm/score_responses.py results/eval_gpt-4o_*.json \
--judge-backend openai --judge-model gpt-4o \
--output results/scored_eval_gpt-4o_judged-by-gpt4o.json

A summary of the scored outputs used to generate the table is in docs/samples/llm_eval_summary.json.
Supported backends summary:
| Backend flag | Covers | API key env |
|---|---|---|
| `anthropic` (default) | Claude models | `ANTHROPIC_API_KEY` |
| `openai` | GPT-4o, o1/o3 | `OPENAI_API_KEY` |
| `compatible` + `--base-url` | Groq, Together, DeepSeek, Mistral, OpenRouter, Ollama | via `--api-key-env` |
| `huggingface` | Local models (CUDA / Apple MPS / CPU) | — |
For publication-quality benchmarking, aim to evaluate at minimum:
- 2 proprietary frontier models (Claude + GPT-4o)
- 2–3 open-source flagship models (Llama 3.3 70B + DeepSeek R1 + Qwen 2.5 72B)
- 1 biomedical-specialized model (BioMistral, Meditron, etc.)
- Cross-judge verification with ≥ 2 judges (inter-rater reliability)
📊 Interactive Leaderboard → — bar charts, difficulty profiles, radar, modality heatmap, sortable table
9-Model Ranking (Judge: Claude Sonnet 4.6, 100 questions, 1–5 scale, v2.1):
🔒 = proprietary API 🔓 = open-weights
| Rank | Model | Score | Easy | Med | Hard | Expert | Factual | Reasoning | Complete | Uncert | Domain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 🔒 Claude Sonnet 4.6 | 4.62 | 4.57 | 4.55 | 4.68 | 4.70 | 4.65 | 4.97 | 4.77 | 4.09 | 4.33 |
| 2 | 🔒 Claude Haiku 4.5 | 4.41 | 4.34 | 4.48 | 4.41 | 4.36 | 4.39 | 4.84 | 4.54 | 3.83 | 4.12 |
| 3 | 🔓 DeepSeek-V3 | 4.34 | 4.12 | 4.29 | 4.43 | 4.46 | 4.40 | 4.75 | 4.39 | 3.71 | 4.11 |
| 4 | 🔒 Claude Sonnet 4 | 4.03 | 4.02 | 4.09 | 4.01 | 3.99 | 4.28 | 4.47 | 4.07 | 3.14 | 3.74 |
| 5 | 🔒 Gemini 2.5 Flash | 4.00 | 3.52 | 4.11 | 4.11 | 4.04 | 4.45 | 4.36 | 3.96 | 3.22 | 3.45 |
| 6 | 🔒 GPT-4o Mini | 3.32 | 3.36 | 3.45 | 3.30 | 3.13 | 3.93 | 3.54 | 3.21 | 2.78 | 2.64 |
| 7 | 🔓 Llama-3.3-70B (Groq) | 3.31 | 3.48 | 3.38 | 3.30 | 3.06 | 4.03 | 3.52 | 3.21 | 2.61 | 2.57 |
| 8 | 🔓 Llama-3.3-70B (Together) | 3.31 | 3.58 | 3.35 | 3.28 | 3.04 | 4.00 | 3.50 | 3.20 | 2.65 | 2.62 |
| 9 | 🔒 GPT-4o | 3.30 | 3.27 | 3.39 | 3.27 | 3.25 | 3.98 | 3.61 | 3.13 | 2.57 | 2.62 |
v2.1 corrections: Q27/Q28/Q64 `ground_truth_key_facts` fixed; judge `max_tokens` 1000→2048 (fixes Haiku Q10); Gemini re-evaluated with `max_tokens=8192` (thinking mode, no truncation).
Key findings:
- Claude models dominate the top tier; Haiku 4.5 notably outperforms Sonnet 4 (+0.38) despite being a smaller model
- DeepSeek-V3 (#3, 4.34) is the strongest open-weights model, surpassing Claude Sonnet 4 and all GPT/Gemini variants — particularly strong on Hard and Expert questions
- Gemini 2.5 Flash (#5, 4.00) with thinking mode enabled performs uniformly across difficulty tiers (Easy 3.52 → Expert 4.04); the previous apparent Easy/Expert gap was a truncation artifact (`max_tokens=2000` consumed by thinking tokens)
- GPT-4o Mini slightly edges out GPT-4o (3.32 vs 3.30) in this specialized domain; all bottom-tier models cluster at ~3.30
- Uncertainty Calibration is the weakest dimension across all models; small-N spaceflight data requires careful hedging that all models underperform on
- Novel insight flags: DeepSeek-V3 (64), Claude models (45–93), and Gemini (26) generate novel cross-modal reasoning; GPT/Llama variants generate none
Cross-Judge Verification — Sonnet 4, Sonnet 4.6, and GPT-4o were additionally scored by Sonnet 4 and GPT-4o judges for bias analysis:
| Respondent | Sonnet 4 Judge | Sonnet 4.6 Judge | GPT-4o Judge |
|---|---|---|---|
| Claude Sonnet 4.6 | 4.73 | 4.62 | — |
| Claude Sonnet 4 | 4.55 | 4.03 | 4.76 |
| GPT-4o | 3.64 | 3.30 | 4.36 |
Sonnet 4.6 is the strictest judge (scores 0.3–0.5 lower); GPT-4o as judge inflates scores by ~0.2–0.7 but does not change ranking order.
SpaceOmicsBench/
├── README.md
├── demo.html # Interactive benchmark visualization
├── data/
│ └── processed/ # Benchmark data (CSV)
│ ├── clinical_cbc.csv # Clinical CBC features
│ ├── cfrna_3group_de_noleak.csv # cfRNA gene features (no leakage)
│ ├── cross_mission_*.csv # I-series cross-mission data
│ └── ...
├── tasks/ # Task definitions (JSON)
│ ├── A1.json ... H1.json # 19 main + 2 supplementary
│ └── I1.json, I2.json, I3.json # Cross-mission tasks
├── splits/ # Train/test split indices (JSON)
│ ├── loco_clinical.json
│ ├── feature_split_B1.json
│ ├── feature_split_I1.json
│ └── ...
├── evaluation/
│ ├── eval_harness.py # ML evaluation harness
│ ├── metrics.py # Metric implementations
│ ├── signature_query.py # Compare new DE data vs. benchmark signatures
│ └── llm/ # LLM evaluation framework
│ ├── question_bank.json # 100 questions
│ ├── annotation_schema.json # 5-dimension scoring schema
│ ├── data_context/ # 12 markdown context files
│ ├── run_llm_evaluation.py # Run LLM on questions
│ ├── score_responses.py # Claude-as-judge scoring
│ └── generate_report.py # Report generation
├── baselines/
│ ├── run_baselines.py # Baseline runner
│ └── baseline_results.json # Precomputed results
├── scripts/ # Preprocessing and utility scripts
│ ├── preprocess_cross_mission.py # I-series data preprocessing
│ ├── generate_readme_tables.py # Regenerate baseline tables in README
│ ├── validate_tasks.py # Validate all task JSON files against schema
│ └── generate_tasks_and_splits.py # [LEGACY] original task/split generator
└── docs/ # Additional documentation
All data originates from:
- NASA Open Science Data Repository (OSDR) — GeneLab-processed omics files from OSD-569 to OSD-687 (Inspiration4) and OSD-530 (JAXA CFE)
- Published supplementary data — Processed results from peer-reviewed publications on the Inspiration4 and JAXA missions, including:
- SOMA Multi-Omics Atlas (doi:10.1038/s41586-024-07639-y)
- Secretome Proteomics & Metabolomics (doi:10.1038/s41467-024-48841-w)
- Spatial Skin Transcriptomics (doi:10.1038/s41467-024-48625-2)
- JAXA CFE cfRNA (doi:10.1038/s41467-023-41995-z)
- Cross-mission Hemoglobin (doi:10.1038/s41467-024-49289-8)
See docs/CITATIONS.bib for the complete list of source publications.
This repository includes only publicly shareable processed/summary tables. Raw sequencing data and controlled-access human data are not redistributed. For any controlled-access material (e.g., human sequence-level data), users must obtain access directly via the original source (e.g., OSDR DAR, dbGaP/LSDA).
The results/ directory is intended for example outputs and local experiments. If you publish benchmark results, treat baselines/baseline_results.json as the canonical baseline reference and provide your own model results separately.
For a consolidated source/track/license view, see docs/PROVENANCE.md.
- Read task definitions from `tasks/` to understand input/output specifications
- Load data from `data/processed/` and splits from `splits/`
- Generate predictions in the required JSON format
- Run evaluation: `python evaluation/eval_harness.py --task all --predictions your_results/`
Each task JSON specifies:
- `data_files`: which CSV(s) to load
- `input_spec`: feature description and count
- `output_spec`: target type, classes, and class distribution
- `evaluation`: primary and secondary metrics
- `split`: which split file to use
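A minimal sketch of consuming these fields (the top-level keys follow the list above; any nested structure is an assumption, so inspect the JSON itself):

```python
# Load one task definition and print the fields listed above.
import json

with open("tasks/A1.json") as f:
    task = json.load(f)

print(task["data_files"])    # CSV(s) under data/processed/
print(task["output_spec"])   # target type, classes, class distribution
print(task["evaluation"])    # primary + secondary metrics
print(task["split"])         # which split file under splits/ to load
```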
If you have DE results from a new spaceflight mission, compare them against SpaceOmicsBench reference signatures to identify biological overlap with known spaceflight responses.
Input: any CSV with gene, log2FC (or logFC/log2FoldChange), and padj columns.
python evaluation/signature_query.py --input my_de_results.csv

Options:
--padj-threshold FLOAT Adjusted p-value cutoff for DE (default: 0.05)
--fc-threshold FLOAT Minimum |log2FC| for DE (default: 0, disabled)
--signatures IDs... Subset of signatures to query (default: all 8)
--output-dir DIR Output directory for JSON + Markdown report (default: results/)
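A minimal sketch of preparing an input file in the expected column format (the genes and values are made up):

```python
# Build a small DE results table with the accepted column names and save it as CSV.
import pandas as pd

de = pd.DataFrame({
    "gene":   ["HBB", "HBA1", "FLT1", "TP53"],
    "log2FC": [1.8, 1.2, -0.7, 0.1],
    "padj":   [0.001, 0.004, 0.03, 0.9],
})
de.to_csv("my_de_results.csv", index=False)
# then: python evaluation/signature_query.py --input my_de_results.csv --padj-threshold 0.05
```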
Available reference signatures:
| Signature ID | Description | N |
|---|---|---|
| `I4_cfRNA_DRR` | Inspiration4 cfRNA spaceflight-responsive genes (JAXA IHSP) | 466 |
| `I4_Plasma_Proteomics` | Inspiration4 plasma DE proteins | 57 |
| `I4_PBMC_CD4T` | Inspiration4 CD4+ T cell DE genes | 736 |
| `I4_PBMC_CD8T` | Inspiration4 CD8+ T cell DE genes | 661 |
| `I4_PBMC_CD14Mono` | Inspiration4 CD14+ Monocyte DE genes | 709 |
| `GeneLab_Mouse` | GeneLab rodent spaceflight DE genes (conserved with I4) | 134 |
| `JAXA_cfRNA` | JAXA cfRNA DE genes | 36 |
| `CrossMission_Conserved` | Cross-mission conserved spaceflight genes | 814 |
Metrics computed: Jaccard index, overlap coefficient, hypergeometric enrichment (FDR-corrected), direction concordance, and Spearman correlation of log2FC values.
Note: This is an exploratory overlap tool. I4 had n=4 crew; reference signatures reflect
one specific mission cohort. See docs/extension_plan.md for the multi-mission ingestion pipeline.
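For reference, the textbook forms of the set-overlap statistics listed above (illustrative only; the exact implementation in signature_query.py, e.g. its FDR handling, may differ):

```python
# Jaccard index, overlap coefficient, and hypergeometric enrichment for two gene sets.
from scipy.stats import hypergeom

def overlap_stats(query: set, signature: set, universe_size: int):
    k = len(query & signature)
    jaccard = k / len(query | signature)
    overlap_coef = k / min(len(query), len(signature))
    # P(X >= k) when drawing len(query) genes from a universe containing
    # len(signature) signature genes (hypergeometric enrichment).
    p_enrich = hypergeom.sf(k - 1, universe_size, len(signature), len(query))
    return jaccard, overlap_coef, p_enrich

print(overlap_stats({"HBB", "HBA1", "FLT1"}, {"HBB", "HBA1", "ALAS2"}, universe_size=20000))
# Direction concordance and the Spearman correlation are computed on the shared
# genes' log2FC values (e.g., scipy.stats.spearmanr).
```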
Task JSON files are expected to follow the schema in docs/task_schema.json. This can be used for local validation or CI checks if you extend the benchmark.
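A minimal local check using the jsonschema package (assumed to be installed; scripts/validate_tasks.py is the repository's own validator):

```python
# Validate every task definition against docs/task_schema.json.
import json
from pathlib import Path

import jsonschema

schema = json.loads(Path("docs/task_schema.json").read_text())
for task_file in sorted(Path("tasks").glob("*.json")):
    jsonschema.validate(json.loads(task_file.read_text()), schema)
    print(f"{task_file.name}: OK")
```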
v3 expands the benchmark with new missions, advanced ML methods, and biomedical-specialized model evaluation. Paper draft complete; targeting NeurIPS 2026 D&B submission (May 7).
|  | v2 | v3 |
|---|---|---|
| ML Tasks | 21 (7 baselines) | 26 tasks (25 leaderboard, 16 methods) |
| LLM Questions | 100 (9 modalities) | 270 (12 categories) |
| LLM Models | 9 (general-purpose) | 9 (4 general + 5 bio-specialized) |
| Missions | I4, JAXA, Twins | + Axiom-2 Epigenetic |
| Key ML Results | LightGBM AUPRC=0.922 (B1) | TabPFN AUPRC=0.957 (SOTA) |
| Foundation Models | — | ESM2, GNN (negative results) |
- Bio fine-tuning hurts: OpenBioLLM-70B (2.50) scored −0.53 vs base Llama-3.3-70B (3.03) across all categories
- Signal hierarchy: effect-size >> tabular prior (TabPFN) >> protein sequence (ESM2) >> PPI topology (GNN)
- 4-tier LLM structure: Claude/DeepSeek (4.3+) > GPT-4o Mini/Llama (3.0) > OpenBioLLM (2.0-2.5) > Galactica/BioMedLM (1.0-1.2)
- Track A: 26 ML tasks including AX-2 epigenetic clocks, multi-omics fusion, TabPFN, ESM2, GNN
- Track B: 270 LLM questions across 12 categories — 3 new categories (Space Biology Basics, AX-2 Epigenetic, Clinical Applications)
v3 is developed in a separate repository: SpaceOmicsBench-v3. All v2 tasks and questions are preserved in v3.
- Benchmark code (scripts, evaluation framework, baselines): MIT License — free for any use including commercial.
- Benchmark data (processed tables, task definitions, question bank, scored results): CC BY-NC 4.0 — free for academic and research use; commercial use requires a separate license.
- Data sources: Benchmark tables are derived from NASA OSDR public releases and published supplementary materials, which retain their respective terms (see docs/PROVENANCE.md).
Copyright (c) 2026 JangKeun Kim. For commercial licensing inquiries: silveray1563@gmail.com