Task 7: Integration Testing
Depends on: All previous issues (#3-#8)
Smoke Tests
Run each condition for 2-3 batches to verify the full pipeline works end-to-end.
Test 1: Condition B embedding (simplest, no GNN)
./run_pipeline.sh pretrained_noprop_embedding snli "" --primary-batches 3 --finetune-batches 3
Expected: primary (3 batches) → finetune (3 batches) → eval completes.
Test 2: Condition D embedding (full pipeline)
./run_pipeline.sh pretrained_tree_embedding snli "" --contrastive-batches 3 --primary-batches 3 --finetune-batches 3
Expected: contrastive → primary → finetune → eval completes.
Test 3: Condition E embedding (frozen transformer)
./run_pipeline.sh pretrained_tree_frozen_xfmr_embedding snli "" --contrastive-batches 3 --primary-batches 3 --finetune-batches 3
Expected: Same as D. Verify in logs that "Froze pretrained transformer" appears.
Test 4: Condition F embedding (frozen GNN)
./run_pipeline.sh pretrained_tree_frozen_gnn_embedding snli contrastive \
--contrastive-checkpoint /home/jlunder/temp_temp_storage/infonce_wikiqs_20260201_234850/checkpoints/best_model.pt \
--primary-batches 3 --finetune-batches 3
Expected: Logs show missing/unexpected keys (architecture mismatch), then primary → finetune → eval completes.
Test 5: Condition A embedding (text mode)
./run_pipeline.sh pretrained_text_embedding snli "" --primary-batches 3 --finetune-batches 3
Expected: Loads HF tokenizer, runs primary → finetune → eval on text data.
Test 6: One matching variant
./run_pipeline.sh pretrained_tree_matching snli "" --contrastive-batches 3 --primary-batches 3 --finetune-batches 3
Expected: Same as D but with matching paradigm.
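The six smoke tests above can be driven by one small script that stops at the first failure. A minimal sketch, assuming `./run_pipeline.sh` is invoked from the repo root with the arguments exactly as listed above (`run_all` and `SMOKE_TESTS` are hypothetical names, not part of the pipeline):

```python
import subprocess

# Arguments copied from Tests 1-6 above. Test 4's --contrastive-checkpoint
# path must exist locally or that entry will fail at load time.
SMOKE_TESTS = [
    ("B embedding", ["pretrained_noprop_embedding", "snli", "",
                     "--primary-batches", "3", "--finetune-batches", "3"]),
    ("D embedding", ["pretrained_tree_embedding", "snli", "",
                     "--contrastive-batches", "3",
                     "--primary-batches", "3", "--finetune-batches", "3"]),
    ("E embedding", ["pretrained_tree_frozen_xfmr_embedding", "snli", "",
                     "--contrastive-batches", "3",
                     "--primary-batches", "3", "--finetune-batches", "3"]),
    ("F embedding", ["pretrained_tree_frozen_gnn_embedding", "snli", "contrastive",
                     "--contrastive-checkpoint",
                     "/home/jlunder/temp_temp_storage/infonce_wikiqs_20260201_234850/checkpoints/best_model.pt",
                     "--primary-batches", "3", "--finetune-batches", "3"]),
    ("A embedding", ["pretrained_text_embedding", "snli", "",
                     "--primary-batches", "3", "--finetune-batches", "3"]),
    ("tree matching", ["pretrained_tree_matching", "snli", "",
                       "--contrastive-batches", "3",
                       "--primary-batches", "3", "--finetune-batches", "3"]),
]

def run_all(tests, runner=None):
    """Run each smoke test in order; return the label of the first
    failure, or None if everything passed."""
    if runner is None:
        runner = lambda args: subprocess.run(["./run_pipeline.sh", *args]).returncode
    for label, args in tests:
        print(f"=== Smoke test: {label} ===")
        if runner(args) != 0:
            print(f"FAILED: {label}")
            return label
    print("All smoke tests passed")
    return None
```

Injecting `runner` keeps the sequencing logic testable without a GPU; in real use, call `run_all(SMOKE_TESTS)` with no runner argument.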
Freeze Verification
After smoke tests, add a quick check that frozen params actually stay frozen:
# Add temporarily to train_unified.py after model creation for Condition E:
if hasattr(model, '_aggregator'):
    agg = model._aggregator
elif hasattr(model, 'gmn'):
    agg = model.gmn._aggregator
else:
    raise AttributeError("Model has no aggregator to check")
for name, param in agg.named_parameters():
    if name.startswith('encoder.') and param.requires_grad:
        print(f"ERROR: {name} should be frozen!")
        break
else:  # for/else: runs only when no frozen param was found trainable
    print("Freeze verification: PASS")
VRAM Check
Monitor GPU memory during the Condition D smoke test to confirm the model fits:
Expected: <20GB peak for all-MiniLM-L6-v2 (22M params) + prop_heavy GNN (~14.6M) at batch_size=256.
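One way to capture the peak is to sample `nvidia-smi` in a loop while the Condition D smoke test runs. A sketch, assuming `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` is available and GPU 0 is the training device (the 20 GB threshold mirrors the expectation above; function names here are illustrative):

```python
import csv
import io
import subprocess
import time

def parse_used_mib(csv_text):
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output: one integer (MiB) per line, one line per GPU."""
    return [int(row[0]) for row in csv.reader(io.StringIO(csv_text)) if row]

def monitor_peak(duration_s=300, interval_s=1.0):
    """Sample GPU 0 memory use for duration_s seconds; report the peak."""
    peak = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        peak = max(peak, parse_used_mib(out)[0])
        time.sleep(interval_s)
    print(f"Peak VRAM: {peak} MiB ({'OK' if peak < 20 * 1024 else 'OVER 20GB'})")
    return peak
```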
Important Note
Do NOT run integration tests while other training pipelines are using the GPU. Wait for current runs to complete, or use --primary-batch-size 32 to reduce VRAM.
Also: if you've made code changes, you must run pip install -e . from the branch before testing. This will affect any currently running pipeline stages — only do this when no training is in progress.
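A quick guard before kicking off the tests, assuming `nvidia-smi --query-compute-apps=pid,process_name` is available (`gpu_busy` and `assert_gpu_idle` are illustrative names):

```python
import subprocess

def gpu_busy(compute_apps_csv):
    """True if `nvidia-smi --query-compute-apps=pid,process_name
    --format=csv,noheader` reported any running compute process
    (its output is empty when the GPU is idle)."""
    return any(line.strip() for line in compute_apps_csv.splitlines())

def assert_gpu_idle():
    """Raise if any compute process is currently on the GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    if gpu_busy(out):
        raise RuntimeError("GPU busy -- wait for current training runs to finish")
```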