Skip to content

Fiyin/model pipeline#3

Open
TheFinix13 wants to merge 39 commits into
mainfrom
fiyin/model-pipeline
Open

Fiyin/model pipeline#3
TheFinix13 wants to merge 39 commits into
mainfrom
fiyin/model-pipeline

Conversation

@TheFinix13
Copy link
Copy Markdown
Collaborator

Completing the Model Pipeline with Training and Evaluation.

TheFinix13 and others added 30 commits April 10, 2026 16:10
- Rename notebooks with task numbers and owners; add RoBERTa + LoRA Colab cells
- Reorganize models (baseline, lora, tfidf) and src modules with clear names
- Add smoke scripts, Streamlit skeleton, reports figures and local_run_summary
- Document branch in docs/BRANCH_fiyin_model_pipeline.md; README Colab links for fiyin/model-pipeline
- Ignore large TF-IDF .npz artifacts; set REPO_URL to momofahmi org

Made-with: Cursor
- Add DEMO_MODE defaults (subset, 1 seed, 1 epoch) via env overrides
- Make TrainingArguments use shared config; limit conditions/tests for demo
- Improve Colab clone error message for private repo access

Made-with: Cursor
… path

- Replace fragile git clone cell with subprocess + verify src/requirements exist
- Support Colab Secrets GITHUB_TOKEN and optional NLP-sequence-classification.zip upload
- Stop on failure (no false success); set NLP_REPO_ROOT for imports
- Import cell resolves project root without assuming ../ from notebooks/
- Fix typo get_ivbnpython in RoBERTa notebook; sync setup to LoRA notebook
- README: document private repo and that .ipynb alone is insufficient

Made-with: Cursor
…p upload help

- Document Colab secret name must be GITHUB_TOKEN + Notebook access toggle
- Optional GITHUB_REPO / REPO_BRANCH for forks; clearer 403/collaborator hints
- Safer userdata.get handling; README: no folder upload, use single zip file

Made-with: Cursor
…data_loader

- Default REPO_BRANCH to fiyin/model-pipeline so clone matches notebook layout
- Import cell requires src/besstie_data_loader.py; dedupe sys.path prepend
- Add src/__init__.py for reliable package imports on Colab

Made-with: Cursor
…2.3)

- Clone via public URL only; keep zip discovery and GITHUB_REPO fork hint in errors
- README: remove PAT/Secrets instructions

Made-with: Cursor
…L fix, slim README

- Default REPO_DIR/zip to NLP-coursework; token-based private clone in 2.2/2.3
- Define CLONE_URL = _clone_url() before git clone (fix NameError)
- README: essential Colab links and setup only

Made-with: Cursor
…rows/cols

- evaluate_on_testset: avoid Column.numpy(); use predict label_ids
- Matrix viz matches DEMO_MODE subset (1x3 vs 5x3); confusion uses SEEDS[0] and available tests

Made-with: Cursor
…rk checklist

- Rebuild code cells (were literal \n); add NLP_REPO_ROOT path helper
- evaluate(): label_ids / np.asarray; sklearn zero_division=0 (match RoBERTa fixes)
- DEMO_MODE mirrors 2.2 (seeds, epochs, limits)
- Add docs/COURSEWORK_CHECKLIST.md; README link + DEMO_MODE note

Made-with: Cursor
…; sanitize script; q2.2 figures

- load_model: on CPU use device_map=None and low_cpu_mem_usage=False so
  second variety does not hit meta-tensor error in get_peft_model
- 2.3 train_one: gc.collect + empty_cuda_cache between runs
- Add scripts/sanitize_notebook.py for GitHub notebook preview
- Strip widget metadata in notebooks; add RoBERTa figures under reports/figures
- Update coursework checklist

Made-with: Cursor
Disable fp16+scaler; prefer bf16 when supported, else fp32. Align GPU load dtype.

Made-with: Cursor
Colab may set ACCELERATE_MIXED_PRECISION=fp16; when both fp16 and bf16 are
False, HF leaves env mixed_precision and Accelerate uses FP16 scaler anyway,
causing PEFT unscale errors. Normalize mixed_precision after init.

Made-with: Cursor
…, RoBERTa template

- Add q2_3 LoRA macro-F1 heatmap PNG and plot_cross_variety_matrix.py
- Document RoBERTa JSON nulls as placeholders until notebook numbers are copied
- Fix plot CLI example (--matrix-key); add reports/results/README index

Made-with: Cursor
Variety-only 3x3 + inner_pool/all from FULL Colab run; mean over seeds.
Regenerate reports/results/README index.

Made-with: Cursor
…template)

macOS core.ignorecase hides renames from index unless git mv; update links in README and reports.

Made-with: Cursor
Adds report-ready prose drafts for the four still-empty sections of the docx,
plus the supporting scripts so teammates can populate the numbers/screenshots
themselves before submission.

reports/results/
  - q1_2_vocab_overlap.md      promoted to full §1.2 prose with linguistic-
                               distance discussion (the brief asks for this).
  - q5_1_deployment.md         §5.1 write-up for Mohamed's Gradio app
                               (architecture, why-Gradio, why-LoRA-swap,
                                screenshot placeholders).
  - q5_2_efficiency.md         §5.2 write-up + table skeleton; numbers come
                               from running benchmark_inference.py.
  - q4_error_analysis.md       Q4 template with structure for 10 errors,
                               4 explanations, 4-shot prompt, 6-example
                               re-test, and discussion.

scripts/
  - benchmark_inference.py     Times TF-IDF+LR / RoBERTa / OPT-1.3B+LoRA at
                               BS={1,32,128}, writes JSON + table.
  - q4_extract_errors.py       Pulls 10 misclassifications from LoRA model,
                               balanced over (variety, gold-label).
  - q4_few_shot_eval.py        Builds 4-shot prompt from explained examples
                               and evaluates remaining 6 with a configurable
                               judge LLM.
  - build_submission_zip.sh    Packages SurreyLearn code-only ZIP, excluding
                               checkpoints/runs/datasets/large arrow caches.

docs/coursework_checklist.md   Refreshed: 30 Apr branch state, what each
                               teammate still owes, submission hygiene.

.gitignore                     Adds adapters/*/checkpoint-*/, notebooks/
                               tokenized/, dist/, report_*.pdf.

Made-with: Cursor
docs/report_outline.md
  Master section outline + page budget + Google Docs formatting guide.
  Use as the structural source-of-truth when arranging the shared Google
  Doc; covers heading styles, figure/table conventions, references,
  declaration of originality, and a pre-submission sanity-check.

app/README.md
  Step-by-step instructions for running Mohamed's Gradio app locally on
  macOS / Linux. Includes troubleshooting, expected cold-start times on
  CPU vs GPU, and smoke-test sentences for capturing the §5.1 screenshots.

scripts/lime_explain.py
  Model-agnostic LIME explainer for the three model families
  (TF-IDF + LR, RoBERTa, OPT-1.3B + LoRA). Reads q4_errors.json (or a
  single ad-hoc sentence), produces per-example HTML + PNG token-
  importance plots, and a JSON summary. Uses LIME because the brief
  flags it as bonus interpretability content for §2.2 / §4.

reports/results/q4_error_analysis.md
  Adds an optional §4.6 'LIME interpretability' subsection wired up to
  the new script and a discussion paragraph contrasting LoRA's
  attribution against TF-IDF + LR's purely lexical attribution.

requirements.txt
  Adds lime>=0.2.0.1.

docs/coursework_checklist.md
  Points teammates at the new outline and lists the LIME path.

Made-with: Cursor
app/app.py
  Cherry-picked from origin/main so the deployment can be run from this
  branch without switching. No edits.

notebooks/run_deployment_colab.ipynb
  Colab-ready notebook that clones the repo, installs deps, patches the
  app to launch with share=True, and runs it on a free T4 GPU. Prints a
  *.gradio.live public URL that anyone in the team can use to grab the
  Q5.1 screenshots without setting up Python locally.

Made-with: Cursor
- docs/REPORT_TRIM.md: section-by-section trim plan with paste-ready prose for §2.1, §2.2, §2.3, §3.4 to take the report from 29 → 25 pages (focus on collapsing SVM into a 2-sentence aside, dropping the §2.1 "Sarcastic Class Gap" repetition, and removing §3.4 subsections that duplicate §2.3).
- docs/MAIN_NOTEBOOK_PLAN.md: canonical source-notebook table per report section + answers to Mohamed's three coordination questions (run-from-scratch path, sections to keep, which adapter — OPT-1.3B is the canonical model).
- notebooks/main.ipynb: 40-cell end-to-end submission notebook (EDA → vocab → TF-IDF baselines → RoBERTa cross-variety → LoRA → §3 evaluation tables → §4 error analysis → §5.2 efficiency benchmark). Defaults to RETRAIN=False (load adapters from HF Hub momofahmi/*) for ~10 min Colab T4 runs; flip RETRAIN_ROBERTA / RETRAIN_LORA flags in §0.1 to retrain.
- scripts/build_main_notebook.py: reproducible builder for main.ipynb (regenerate after API changes).
- docs/coursework_checklist.md: refreshed for today's submission, points at the new trim guide and notebook plan.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Generates the trimmed 25-page submission docx programmatically using python-docx.
- Embeds figures from reports/figures/ (q1_1_*, q1_2_*, q2_2_*, q2_3_*) and renders all 9 data tables (LR/SVM/RoBERTa headline, LoRA ablation, cross-variety Macro-F1 + Sarcasm-F1, frozen-base comparison, deployment models, efficiency benchmark, Q4 few-shot outcomes).
- Title page, declaration of originality, IEEE-style references list included.
- Default formatting: Calibri 11pt, 1.15 line spacing, 2.2 cm margins; pass --compact for 10.5pt / 1.10 line spacing / 1.6 cm margins if the team needs to tighten further.
- dist/ is gitignored — open dist/report_PG15.docx in Word/Pages, verify page count, and export to report_PG15.pdf for submission.

Co-authored-by: Cursor <cursoragent@cursor.com>
The team repo momofahmi/NLP-sequence-classification is private, so a fresh Colab kernel cannot clone it without GITHUB_TOKEN. Switch the default REPO_URL in main.ipynb, run_deployment_colab.ipynb, 2.1_Baseline_TFIDF_LogReg_Yusrah_Omar.ipynb (and the builder script) to https://github.com/TheFinix13/NLP-coursework.git on branch main, with REPO_DIR=/content/NLP-coursework.

REPO_URL and REPO_BRANCH remain overridable via env vars so anyone can point at a different fork. The 2.2 and 2.3 notebooks already used the same env-var pattern with TheFinix13/NLP-coursework as the default — left untouched.

Also refreshed README.md Colab badges and the local-setup snippet to use the public mirror.

Co-authored-by: Cursor <cursoragent@cursor.com>
Per Joel's note, marker reproducibility comes from loading the canonical
training results we already produced — not from re-running training.

This commit makes main.ipynb self-contained and CPU-runnable in <1 minute:

  * Inline every helper function used by the team's domain notebooks:
    - 1.1 EDA: imbalance + correlation + POS + slang (Yusrah/Omar)
    - 1.2 Vocab: Jaccard + TF-IDF cosine + linguistic features
    - 2.1 Baselines: TF-IDF + LR + LinearSVC (per task) with macro-F1
    - 2.2 RoBERTa: tokenize, prepare_dataset, compute_metrics,
          full_evaluation, calculate_class_weights, WeightedTrainer,
          train_roberta, evaluate_on_testset (verbatim from Joel's
          NLP-sequence-classification/notebooks/task_2_2.ipynb)
    - 2.3 LoRA: train_lora_adapter, evaluate_lora_adapter

  * Load canonical results from reports/results/roberta_weighted/ and
    reports/results/roberta_sentiment/all_pool.json (extracted from
    origin/main:NLP-sequence-classification/{weighted_results,results}/).
    Reproduces Joel's 5x3 cross-variety matrix and best-condition
    confusion matrix exactly.

  * Gate heavy paths behind explicit flags so the notebook runs CPU-only:
    - FROM_SCRATCH=False (default): load JSONs, render plots
    - FROM_SCRATCH=True: re-run RoBERTa + LoRA training (Colab T4)
    - RUN_ERROR_ANALYSIS=False (default): skip OPT-1.3B download
    - RUN_BENCHMARK=False (default): skip RoBERTa+OPT timing
    Set the True flags on Colab.

  * Verified end-to-end execution via nbconvert: 31/31 code cells pass
    in 38s on local CPU. Numbers match the report:
      - Sentiment all-pool:  UK 0.951, AU 0.901, IN 0.855
      - Sarcasm best (all):  UK 0.735, AU 0.744, IN 0.609
      - LoRA en-AU in-var:   0.7747

Files added:
  reports/results/roberta_weighted/{uk,au,in,inner_pool,all}.json
  reports/results/roberta_sentiment/all_pool.json
  reports/figures/roberta_canonical/{cross_variety_matrix,confusion_matrix_best}{,_repro}.png
  reports/figures/{sarcasm,sentiment,source,variety,sarcasm_sentiment_correlation,source_by_variety,vocabulary_similarity_heatmap}*.png
  reports/figures/q2_3_lora_macro_f1_heatmap_repro.png

Co-authored-by: Cursor <cursoragent@cursor.com>
Joel's worst-case fallback: if the notebook orchestrator misbehaves on a
marker's machine, they can run the entire pipeline as plain Python instead.

  * scripts/build_main_script.py — extracts every code cell from
    notebooks/main.ipynb, strips IPython magics (%run → subprocess.run),
    inserts section banners (§0 setup … §5 efficiency), and writes the
    consolidated script to scripts/main.py. Re-run after edits to the
    notebook builder to keep both files in sync.

  * scripts/main.py — 730-line auto-generated script that explicitly
    states all 12 helper functions + WeightedTrainer class from the
    team's domain notebooks. Verified end-to-end execution in 47s on
    plain Python (no Jupyter):

      seed_all, roberta_tokenize, roberta_prepare_dataset,
      compute_metrics, full_evaluation, calculate_class_weights,
      WeightedTrainer.compute_loss, train_roberta, evaluate_on_testset,
      train_lora_adapter, evaluate_lora_adapter

  * README.md — adds a "Run the whole pipeline in one command" section
    pointing at both entry points (notebook + script) with timing
    estimates and a reminder that both clone from the public mirror, so
    no access to Mohamed's private repo is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pass to remove the most obvious giveaways from main.ipynb and main.py:

  * Strip Unicode tells: em-dashes, en-dashes, right-arrows, smart quotes,
    plus-minus sign, double-headed arrows. All ASCII now.
  * Drop overused jargon: 'canonical' (was used 11 times), 'verbatim',
    'inlined', 'self-contained', 'mirrors X in Y's notebook'.
  * Remove the '**bold-italic mini-headers**' inside markdown cells.
  * Remove third-person 'so the marker can read...' commentary.
  * Cut the table-of-sections at the top of the notebook to two short
    paragraphs.
  * Shorten or delete redundant code comments that just restated the
    function name.
  * Replace section-sign 'sec.X.Y' with plain 'Section X.Y'.

Code itself is unchanged. Both main.ipynb (31/31 cells, no errors) and
main.py (47s end-to-end) still produce the same numbers.

Co-authored-by: Cursor <cursoragent@cursor.com>
TheFinix13 and others added 9 commits May 6, 2026 14:29
When running the FROM_SCRATCH=True path, the rerun loop saved JSONs under
the test keys returned by `get_test_conditions()` (uk_test/au_test/in_test),
which did not match the hardcoded uk_only/au_only/in_only keys used in
Joel's saved JSONs. Loading the rerun results then crashed with
`KeyError: 'uk_only'`.

Auto-detect the test-key naming from the loaded JSONs and use whichever
scheme matches. Both naming schemes verified working.

Co-authored-by: Cursor <cursoragent@cursor.com>
Fahmi's notebook imports legacy modules (src.data_loader, src.eda,
src.functions_to_use, src.lr_feature_extraction, models.lora_adapters,
models.logistic_regression_class_helper) that exist on origin/main but
were never on fiyin/model-pipeline. This commit:

  * Cherry-picks the 6 legacy modules from origin/main so existing code
    that imports them works against this branch and the public mirror.
  * Fixes a real bug: src/data_loader.py used pd.concat without
    importing pandas. Added the missing import.
  * Adds notebooks/main_notebook_fahmi_patched.ipynb with three patches:
    - prepended a Colab clone cell so it runs on a fresh Colab runtime
    - turned the HF_TOKEN raise into a soft-skip (login optional)
    - removed os.chdir('..') / replaced PROJECT_ROOT=os.path.abspath('..')
      so paths work from any cwd
  * Adds the EDA figures Fahmi's notebook produces.

Verified: cells 0-35 of the patched notebook (EDA + full LR baseline)
execute end-to-end with no errors on a fresh kernel.

The team can pick either notebook for submission:
  - notebooks/main.ipynb         (gates heavy paths, runs in 1 min)
  - notebooks/main_notebook_fahmi_patched.ipynb  (more LoRA detail,
    requires GPU + ~1 hour for full retrain)

Co-authored-by: Cursor <cursoragent@cursor.com>
…ication)

Mohamed's repo is now public, so the marker can clone it without auth.
Switch every Colab setup cell, badge, and doc reference from the personal
mirror (TheFinix13/NLP-coursework) to the team repo so it's clear this is
a group project where everyone contributed.

- main.ipynb, main.py, main_notebook_fahmi_patched.ipynb, run_deployment_colab.ipynb
- 2.1 / 2.2 / 2.3 section notebooks
- README badges + docs/BRANCH_fiyin_model_pipeline.md
- build_main_notebook.py / build_main_script.py defaults

Default branch stays fiyin/model-pipeline because that's where the
consolidated main.ipynb + saved JSON results live; both URL and branch
are still overridable via REPO_URL / REPO_BRANCH env vars.

Co-authored-by: Cursor <cursoragent@cursor.com>
Colab pre-installs torchao==0.10.0 on every runtime. Recent peft (>=0.13)
calls is_torchao_available() during LoRA adapter injection and that
helper raises ImportError if torchao is installed but < 0.16.0. We don't
use torchao for plain LoRA, so the fix is to just remove it.

- main.ipynb / main_notebook_fahmi_patched.ipynb: add pip uninstall -y
  torchao after the requirements install in the Colab setup cell.
- scripts/_compat.py: small ensure_peft_compat() helper that removes
  torchao only if it's the broken version. Idempotent, no-op if torchao
  isn't installed or is already >= 0.16.0.
- q4_extract_errors.py / lime_explain.py / benchmark_inference.py /
  app/app.py: call ensure_peft_compat() before importing peft so the
  scripts also work standalone (without going through the setup cell).

Unblocks Fahmi's q4 error-extraction run; verified locally that
scripts/main.py still runs end-to-end clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
q4_extract_errors.py writes a dict with the misclassified examples
nested under "examples" (alongside metadata like task, n_total_errors,
etc.), but the loader cell in main.ipynb / main.py treated the file
as a raw list and tried errors[0] on the dict. That failed with
KeyError: 0 because dict[0] looked up an integer key.

Match what q4_few_shot_eval.py and lime_explain.py already do: pull
the list out of payload["examples"] before indexing. Print the total
misclassification count alongside the selected count so the cell is
actually informative.

Co-authored-by: Cursor <cursoragent@cursor.com>
The few-shot cell was running q4_few_shot_eval.py unconditionally,
which sys.exits when no explanation strings are present yet. The LIME
cell ran lime_explain.py with no args, which exits asking for either
--text or --in.

Now both cells:
- Check whether reports/results/q4_errors.json exists.
- Count entries with a non-empty `explanation` field.
- Run the script only when there's enough explained data (>= 4 for
  few-shot, >= 1 for LIME) and pass the --in / --out-dir args.
- Otherwise print a short note explaining what to do (edit the JSON,
  add explanations, re-run).

Side fix: build_main_script.py's strip_magics now uses shlex.split
so `%run script.py --arg val` converts to a properly tokenised
subprocess.run([sys.executable, 'script.py', '--arg', 'val']) call
instead of running a single argv that contains spaces.

Co-authored-by: Cursor <cursoragent@cursor.com>
Section 4 of the report identifies 10 specific en-AU test examples
(4 explained: idx 142, 302, 508, 618 within the en-AU subset; 6 held
out: 264, 523, 657, 256, 395, 492) and the 4 written explanations for
the explained ones. Bake them into the repo so the few-shot eval and
LIME cells run without anyone having to hand-edit the JSON first.

- scripts/q4_build_curated_errors.py: rebuilds the file by loading
  BESSTIE-CW-26 test, filtering to en-AU, picking the 10 idx values
  from section 4, and adding the 4 written explanations. No model
  required, just the dataset.
- reports/results/q4_errors.json: pre-built artefact (10 entries,
  4 with explanation strings) so the marker can run main.ipynb /
  main.py end-to-end without any manual step.
- main.ipynb / main.py: the error-extraction cell now runs the
  curated build by default (with RUN_ERROR_ANALYSIS=True). The
  q4_extract_errors.py path is still available as a commented-out
  alternative for re-running the actual model.
- Loader cell now shows an explained example (not whichever happens
  to be first) and previews a 250-char excerpt of the explanation.

Verified all 10 idx values map to the texts and gold labels printed
in the report (Section 4, examples 1-4 and Table 7).

Co-authored-by: Cursor <cursoragent@cursor.com>
If a marker (or teammate) has a stale clone of the repo, %run blows up
with a confusing IPython OSError. Each %run site now checks the script
exists first and prints a one-line "run git pull" message if not.

Affects four cells: q4 curated build, q4 few-shot eval, LIME, and the
efficiency benchmark. The committed reports/results/q4_errors.json is
unchanged, so anyone with an older clone can either pull or just leave
RUN_ERROR_ANALYSIS=False and use the pre-built file.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two fixes for the Q4 / Q5.2 cells:

1. Loader cell now auto-rebuilds q4_errors.json from the report when it
   finds 0 explanations on disk (e.g. because an earlier session left a
   stale file from q4_extract_errors.py). No flags to flip; the marker
   doesn't need to think about cell ordering. Verified by replacing the
   committed file with a fake 0-explanation version and re-running
   main.py - auto-rebuild produced the canonical 4/10 file.

2. Benchmark cell now passes --tfidf-vec / --tfidf-clf / --roberta /
   --base-llm / --lora flags. Without them benchmark_inference.py was
   skipping every model and printing "(no rows)". Defaults match the
   report Table 11: roberta-base, momofahmi/besstie-lora-en-au-opt-1.3b,
   plus the local TF-IDF artefacts from models/.

Switched the benchmark cell from %run-with-magic-vars to subprocess.run
so the same code works identically in main.ipynb and main.py - the
build_main_script.py converter chokes on `%run script.py $varname`
because shlex sees $varname as a literal string.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant