Replay Azure Machine Learning jobs from a SOURCE workspace into a TARGET workspace—preserving multi‑level pipeline hierarchy, metrics, and original artifacts & logs—without re‑running the original code.
- Extract job metadata (+ optional artifact path manifest) from the source workspace.
- Replay into the target workspace (synthetic jobs that log original metrics; optionally re-upload artifacts/logs by downloading them into `./outputs`).
az login
pip install -r requirements.txt   # or: uv pip install -r requirements.txt
# 1) Extract
python -m extractor.extract_jobs --source config/source_config.json --output data/jobs.json
# (Faster with threads & skip artifact enumeration if not needed yet)
# python -m extractor.extract_jobs --source config/source_config.json --output data/jobs.json --parallel 8 --no-artifacts
# 2) Replay (multi-level hierarchy preserved)
# dry-run to inspect structure (add --debug-hierarchy for tree view)
python -m replayer.build_pipeline --source config/source_config.json --target config/target_config.json --input data/jobs.json --dry-run --debug-hierarchy
# full replay (submits synthetic jobs)
python -m replayer.build_pipeline --source config/source_config.json --target config/target_config.json --input data/jobs.json --debug-hierarchy

Limit top-level jobs:
python -m extractor.extract_jobs --source config/source_config.json --output data/jobs.json --limit 5
python -m replayer.build_pipeline --input data/jobs.json --source config/source_config.json --target config/target_config.json --limit 5

Filter from include file:
python -m extractor.extract_jobs --source config/source_config.json --include-file include.txt --output data/selected.json
python -m replayer.build_pipeline --input data/selected.json --source config/source_config.json --target config/target_config.json

Parallel extraction (includes artifact path manifest):
python -m extractor.extract_jobs --source config/source_config.json --output data/jobs.json --parallel 12

Skip artifact enumeration (faster, no replay artifacts later):
python -m extractor.extract_jobs --source config/source_config.json --output data/jobs.json --no-artifacts --parallel 12

Hierarchy debug (prints pipeline tree before submitting):
python -m replayer.build_pipeline --input data/jobs.json --target config/target_config.json --debug-hierarchy --dry-run

- Extraction – Collect job & pipeline step metadata, metrics, params, and tags, plus optional artifact/log relative paths (no downloads).
- Replay – Build lightweight synthetic jobs that log the original metrics & tags and (optionally) download the original artifacts/logs into `./outputs` so they appear in Studio.
- Multi-level Pipeline Preservation – Each original pipeline (including nested pipelines) becomes a synthetic pipeline containing only its direct leaf (non‑pipeline) steps; nested pipelines are replayed separately and linked via tags.
No original training/inference code is executed.
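As a rough sketch of the hierarchy rule above, each replayed pipeline keeps only its direct non-pipeline children. The record fields (`id`, `parent_id`, `type`) are illustrative assumptions, not the tool's actual extraction schema:

```python
from collections import defaultdict

def group_leaf_steps(jobs):
    """Group direct leaf (non-pipeline) steps under their parent pipeline.

    Nested pipelines are deliberately excluded here: they are replayed as
    separate synthetic pipelines and linked back via tags.
    """
    children = defaultdict(list)
    for job in jobs:
        if job["type"] != "pipeline" and job.get("parent_id"):
            children[job["parent_id"]].append(job["id"])
    return dict(children)

jobs = [
    {"id": "root",   "type": "pipeline", "parent_id": None},
    {"id": "nested", "type": "pipeline", "parent_id": "root"},
    {"id": "train",  "type": "command",  "parent_id": "root"},
    {"id": "score",  "type": "command",  "parent_id": "nested"},
]
print(group_leaf_steps(jobs))  # {'root': ['train'], 'nested': ['score']}
```

Note that `nested` does not appear as a child of `root` here: only leaf steps are grouped, which is what prevents flattening.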
| Capability | Status | Notes |
|---|---|---|
| Cross-workspace migration | ✅ | Source → Target |
| Multi-level pipeline reconstruction | ✅ | Nested pipelines preserved (no flattening) |
| Metrics / params / tags replay | ✅ | MLflow logging |
| Artifact path manifest enumeration | ✅ | Optional; lightweight (no bytes) |
| In-run artifact & log replay | ✅ | Downloads into ./outputs |
| AutoML trial expansion | ❌ | Disabled (previous flag removed; open issue if needed) |
| Dry-run planning | ✅ | Build without submit |
| Filtering (names, list, limit) | ✅ | Flexible selection |
| Logs preservation (namespaced) | ✅ | Under original_logs/ |
| Hierarchy debug tree | ✅ | --debug-hierarchy flag |
| Dataset / data asset recreation | ❌ | Out of scope |
Create config/source_config.json & config/target_config.json from the provided examples:
AutoML trial expansion has been disabled in the current version to prioritize accurate multi-level hierarchy reconstruction.
If you need historical trial node replication, pin to an earlier commit or open an issue describing your use case.
- Extraction phase records relative blob paths (outputs + log families) if not skipped.
- Replay phase (when enabled) downloads those blobs into the run’s local `./outputs`.
- Azure ML automatically surfaces everything written under `./outputs` in the Studio “Outputs + logs” tab.
- Original logs are placed under `outputs/original_logs/<family>/...` to avoid collisions.
- No duplication through MLflow unless explicitly added later.

If you skip enumeration (`--no-artifacts`), artifact replay is naturally absent.
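A minimal sketch of the namespacing rule, assuming a few common Azure ML log directory prefixes (the tool may recognize a different set):

```python
import posixpath

def replay_destination(source_path: str) -> str:
    """Map a source-run relative path to its location in the replayed run.

    Log files are namespaced under outputs/original_logs/<family>/... so
    they cannot collide with the synthetic run's own files; everything
    else lands directly under ./outputs for Studio to surface.
    """
    log_families = ("logs/", "system_logs/", "user_logs/", "azureml-logs/")
    for prefix in log_families:
        if source_path.startswith(prefix):
            return posixpath.join("outputs", "original_logs", source_path)
    return posixpath.join("outputs", source_path)

print(replay_destination("user_logs/std_log.txt"))
# outputs/original_logs/user_logs/std_log.txt
print(replay_destination("model.pkl"))
# outputs/model.pkl
```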
| Flag | Purpose |
|---|---|
| `--source PATH` | Source workspace config JSON |
| `--output FILE` | Output JSON file |
| `--include name1,name2` | Comma-separated top-level job names |
| `--include-file PATH` | File with one name per line |
| `--limit N` | Cap number of exported top-level jobs |
| `--parallel N` | Concurrent extraction workers |
| `--no-artifacts` | Skip artifact path enumeration |
| Flag | Purpose |
|---|---|
| `--input FILE` | Extracted jobs JSON |
| `--target PATH` | Target workspace config JSON |
| `--source PATH` | Source workspace config JSON (for artifacts) |
| `--limit N` | Cap number of replay units |
| `--dry-run` | Build only (no submission) |
| `--copy-artifacts` | Enable artifact & log download into `./outputs` |
| `--debug-hierarchy` | Print multi-level pipeline tree (dry or submit) |
Each replayed run/step carries lineage tags:
- `original_job_id` – Source run identifier.
- `original_parent_pipeline_id` – (Pipelines only) The immediate parent pipeline’s original ID (if nested).
- `original_pipeline_depth` – 0 for root pipelines, increasing with nesting.
- `original_parent_job_id` – For leaf steps inside a replayed pipeline (points to original pipeline parent).
No longer emitted:
- `replayed_from_job` (superseded by `original_job_id`).
- All `automl_*` expansion tags (feature disabled).
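As an illustration, the lineage tags for one replayed run could be assembled like this (the helper is hypothetical, not part of the tool's API; tag names match the list above):

```python
def lineage_tags(original_id, *, parent_pipeline_id=None, depth=None,
                 is_pipeline=False):
    """Build the lineage tag dict for one replayed run or step."""
    tags = {"original_job_id": original_id}
    if is_pipeline:
        # Root pipelines get depth 0; nesting increases it.
        tags["original_pipeline_depth"] = str(depth or 0)
        if parent_pipeline_id:
            tags["original_parent_pipeline_id"] = parent_pipeline_id
    elif parent_pipeline_id:
        # Leaf step inside a replayed pipeline.
        tags["original_parent_job_id"] = parent_pipeline_id
    return tags

# A nested pipeline at depth 1 ...
print(lineage_tags("pipe-child", parent_pipeline_id="pipe-root",
                   depth=1, is_pipeline=True))
# ... and a leaf step inside it.
print(lineage_tags("step-train", parent_pipeline_id="pipe-child"))
```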
| Category | Replayed? | Notes |
|---|---|---|
| Hierarchy (pipelines) | ✅ | Synthetic structure |
| Metrics / params / tags | ✅ | Logged via MLflow |
| Timestamps (wall clock) | Partial | Original stored as metadata only |
| Original code execution | ❌ | Not re-run |
| AutoML trials (expanded) | ❌ | Expansion disabled in current rev |
| Hierarchy (nested) | ✅ | Multi-level preserved |
| Artifacts / logs | ✅ opt | Downloaded into ./outputs |
| Registered datasets/assets | ❌ | Not recreated |
| Symptom | Check |
|---|---|
| Empty Outputs | Artifact enumeration skipped? Permissions? |
| Missing jobs | Name filters / include file / limit reached |
| Slow extraction | Increase --parallel or use --no-artifacts |
| Long replay time | Large artifact set; consider temporarily skipping |
| Logs missing | Original run lacked logs or manifest omitted them |
Logs: logs/extract_jobs_*.log, logs/replayer_*.log.
- Metadata Extraction → JSON manifest of jobs/metrics/paths
- Replay Construction → Build synthetic pipeline/jobs (one per original pipeline level)
- Replay Execution → Metrics logged; optional artifact download; hierarchy & lineage tagging
- Studio Visibility → Files under `./outputs` auto-surfaced
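The manifest produced by the first stage might look roughly like this (an illustrative shape, not the tool's exact schema): metadata, metrics, and relative artifact paths, but no artifact bytes.

```python
import json

# One record in data/jobs.json (hypothetical field names).
record = {
    "id": "train_model_42",
    "type": "command",
    "parent_id": "pipeline_root",
    "params": {"learning_rate": "0.01"},
    "metrics": {"val_accuracy": 0.91},
    "tags": {"team": "ml-platform"},
    # Relative paths only; bytes are fetched later, during replay,
    # and only when --copy-artifacts is set.
    "artifact_paths": ["outputs/model.pkl", "user_logs/std_log.txt"],
}
print(json.dumps(record, indent=2))
```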
- Unified CLI (single entrypoint)
- More filters (dates, tags, depth)
- Optional dataset / data asset recreation
- Adaptive artifact size strategies
MIT
Focused PRs (docs, tests, small flags) welcome.
- Azure ML Docs: https://learn.microsoft.com/azure/machine-learning/
- Open an issue for support / ideas
Happy migrating!

    {
      "subscription_id": "<SUBSCRIPTION_ID>",
      "resource_group": "<RESOURCE_GROUP>",
      "workspace_name": "<WORKSPACE_NAME>",
      "tenant_id": "<TENANT_ID>"
    }
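A hedged sketch of consuming such a config file. The loader below is illustrative; the commented `MLClient` construction reflects the `azure-ai-ml` SDK's documented constructor:

```python
import json

def load_workspace_config(path: str) -> dict:
    """Read a workspace config JSON and check for the required keys."""
    required = {"subscription_id", "resource_group", "workspace_name"}
    with open(path) as handle:
        config = json.load(handle)
    missing = required - config.keys()
    if missing:
        raise ValueError(f"config {path} missing keys: {sorted(missing)}")
    return config

# With azure-ai-ml installed, the config maps onto an MLClient like so:
# from azure.ai.ml import MLClient
# from azure.identity import DefaultAzureCredential
# cfg = load_workspace_config("config/source_config.json")
# client = MLClient(
#     credential=DefaultAzureCredential(),
#     subscription_id=cfg["subscription_id"],
#     resource_group_name=cfg["resource_group"],
#     workspace_name=cfg["workspace_name"],
# )
```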