Fiyin/model pipeline#3
Open
TheFinix13 wants to merge 39 commits into
Open
Conversation
- Rename notebooks with task numbers and owners; add RoBERTa + LoRA Colab cells - Reorganize models (baseline, lora, tfidf) and src modules with clear names - Add smoke scripts, Streamlit skeleton, reports figures and local_run_summary - Document branch in docs/BRANCH_fiyin_model_pipeline.md; README Colab links for fiyin/model-pipeline - Ignore large TF-IDF .npz artifacts; set REPO_URL to momofahmi org Made-with: Cursor
- Add DEMO_MODE defaults (subset, 1 seed, 1 epoch) via env overrides - Make TrainingArguments use shared config; limit conditions/tests for demo - Improve Colab clone error message for private repo access Made-with: Cursor
… path - Replace fragile git clone cell with subprocess + verify src/requirements exist - Support Colab Secrets GITHUB_TOKEN and optional NLP-sequence-classification.zip upload - Stop on failure (no false success); set NLP_REPO_ROOT for imports - Import cell resolves project root without assuming ../ from notebooks/ - Fix typo get_ivbnpython in RoBERTa notebook; sync setup to LoRA notebook - README: document private repo and that .ipynb alone is insufficient Made-with: Cursor
…p upload help - Document Colab secret name must be GITHUB_TOKEN + Notebook access toggle - Optional GITHUB_REPO / REPO_BRANCH for forks; clearer 403/collaborator hints - Safer userdata.get handling; README: no folder upload, use single zip file Made-with: Cursor
…data_loader - Default REPO_BRANCH to fiyin/model-pipeline so clone matches notebook layout - Import cell requires src/besstie_data_loader.py; dedupe sys.path prepend - Add src/__init__.py for reliable package imports on Colab Made-with: Cursor
…2.3) - Clone via public URL only; keep zip discovery and GITHUB_REPO fork hint in errors - README: remove PAT/Secrets instructions Made-with: Cursor
…n notebooks Made-with: Cursor
…inix13 example Made-with: Cursor
…L fix, slim README - Default REPO_DIR/zip to NLP-coursework; token-based private clone in 2.2/2.3 - Define CLONE_URL = _clone_url() before git clone (fix NameError) - README: essential Colab links and setup only Made-with: Cursor
…rows/cols - evaluate_on_testset: avoid Column.numpy(); use predict label_ids - Matrix viz matches DEMO_MODE subset (1x3 vs 5x3); confusion uses SEEDS[0] and available tests Made-with: Cursor
…rk checklist - Rebuild code cells (were literal \n); add NLP_REPO_ROOT path helper - evaluate(): label_ids / np.asarray; sklearn zero_division=0 (match RoBERTa fixes) - DEMO_MODE mirrors 2.2 (seeds, epochs, limits) - Add docs/COURSEWORK_CHECKLIST.md; README link + DEMO_MODE note Made-with: Cursor
…; sanitize script; q2.2 figures - load_model: on CPU use device_map=None and low_cpu_mem_usage=False so second variety does not hit meta-tensor error in get_peft_model - 2.3 train_one: gc.collect + empty_cuda_cache between runs - Add scripts/sanitize_notebook.py for GitHub notebook preview - Strip widget metadata in notebooks; add RoBERTa figures under reports/figures - Update coursework checklist Made-with: Cursor
Disable fp16+scaler; prefer bf16 when supported, else fp32. Align GPU load dtype. Made-with: Cursor
Colab may set ACCELERATE_MIXED_PRECISION=fp16; when both fp16 and bf16 are False, HF leaves env mixed_precision and Accelerate uses FP16 scaler anyway, causing PEFT unscale errors. Normalize mixed_precision after init. Made-with: Cursor
…, RoBERTa template - Add q2_3 LoRA macro-F1 heatmap PNG and plot_cross_variety_matrix.py - Document RoBERTa JSON nulls as placeholders until notebook numbers are copied - Fix plot CLI example (--matrix-key); add reports/results/README index Made-with: Cursor
Variety-only 3x3 + inner_pool/all from FULL Colab run; mean over seeds. Regenerate reports/results/README index. Made-with: Cursor
…README Made-with: Cursor
Updated Checklist
…template) macOS core.ignorecase hides renames from index unless git mv; update links in README and reports. Made-with: Cursor
Adds report-ready prose drafts for the four still-empty sections of the docx,
plus the supporting scripts so teammates can populate the numbers/screenshots
themselves before submission.
reports/results/
- q1_2_vocab_overlap.md promoted to full §1.2 prose with linguistic-
distance discussion (the brief asks for this).
- q5_1_deployment.md §5.1 write-up for Mohamed's Gradio app
(architecture, why-Gradio, why-LoRA-swap,
screenshot placeholders).
- q5_2_efficiency.md §5.2 write-up + table skeleton; numbers come
from running benchmark_inference.py.
- q4_error_analysis.md Q4 template with structure for 10 errors,
4 explanations, 4-shot prompt, 6-example
re-test, and discussion.
scripts/
- benchmark_inference.py Times TF-IDF+LR / RoBERTa / OPT-1.3B+LoRA at
BS={1,32,128}, writes JSON + table.
- q4_extract_errors.py Pulls 10 misclassifications from LoRA model,
balanced over (variety, gold-label).
- q4_few_shot_eval.py Builds 4-shot prompt from explained examples
and evaluates remaining 6 with a configurable
judge LLM.
- build_submission_zip.sh Packages SurreyLearn code-only ZIP, excluding
checkpoints/runs/datasets/large arrow caches.
docs/coursework_checklist.md Refreshed: 30 Apr branch state, what each
teammate still owes, submission hygiene.
.gitignore Adds adapters/*/checkpoint-*/, notebooks/
tokenized/, dist/, report_*.pdf.
Made-with: Cursor
docs/report_outline.md Master section outline + page budget + Google Docs formatting guide. Use as the structural source-of-truth when arranging the shared Google Doc; covers heading styles, figure/table conventions, references, declaration of originality, and a pre-submission sanity-check. app/README.md Step-by-step instructions for running Mohamed's Gradio app locally on macOS / Linux. Includes troubleshooting, expected cold-start times on CPU vs GPU, and smoke-test sentences for capturing the §5.1 screenshots. scripts/lime_explain.py Model-agnostic LIME explainer for the three model families (TF-IDF + LR, RoBERTa, OPT-1.3B + LoRA). Reads q4_errors.json (or a single ad-hoc sentence), produces per-example HTML + PNG token- importance plots, and a JSON summary. Uses LIME because the brief flags it as bonus interpretability content for §2.2 / §4. reports/results/q4_error_analysis.md Adds an optional §4.6 'LIME interpretability' subsection wired up to the new script and a discussion paragraph contrasting LoRA's attribution against TF-IDF + LR's purely lexical attribution. requirements.txt Adds lime>=0.2.0.1. docs/coursework_checklist.md Points teammates at the new outline and lists the LIME path. Made-with: Cursor
app/app.py Cherry-picked from origin/main so the deployment can be run from this branch without switching. No edits. notebooks/run_deployment_colab.ipynb Colab-ready notebook that clones the repo, installs deps, patches the app to launch with share=True, and runs it on a free T4 GPU. Prints a *.gradio.live public URL that anyone in the team can use to grab the Q5.1 screenshots without setting up Python locally. Made-with: Cursor
- docs/REPORT_TRIM.md: section-by-section trim plan with paste-ready prose for §2.1, §2.2, §2.3, §3.4 to take the report from 29 → 25 pages (focus on collapsing SVM into a 2-sentence aside, dropping the §2.1 "Sarcastic Class Gap" repetition, and removing §3.4 subsections that duplicate §2.3). - docs/MAIN_NOTEBOOK_PLAN.md: canonical source-notebook table per report section + answers to Mohamed's three coordination questions (run-from-scratch path, sections to keep, which adapter — OPT-1.3B is the canonical model). - notebooks/main.ipynb: 40-cell end-to-end submission notebook (EDA → vocab → TF-IDF baselines → RoBERTa cross-variety → LoRA → §3 evaluation tables → §4 error analysis → §5.2 efficiency benchmark). Defaults to RETRAIN=False (load adapters from HF Hub momofahmi/*) for ~10 min Colab T4 runs; flip RETRAIN_ROBERTA / RETRAIN_LORA flags in §0.1 to retrain. - scripts/build_main_notebook.py: reproducible builder for main.ipynb (regenerate after API changes). - docs/coursework_checklist.md: refreshed for today's submission, points at the new trim guide and notebook plan. Co-authored-by: Cursor <cursoragent@cursor.com>
- Generates the trimmed 25-page submission docx programmatically using python-docx. - Embeds figures from reports/figures/ (q1_1_*, q1_2_*, q2_2_*, q2_3_*) and renders all 9 data tables (LR/SVM/RoBERTa headline, LoRA ablation, cross-variety Macro-F1 + Sarcasm-F1, frozen-base comparison, deployment models, efficiency benchmark, Q4 few-shot outcomes). - Title page, declaration of originality, IEEE-style references list included. - Default formatting: Calibri 11pt, 1.15 line spacing, 2.2 cm margins; pass --compact for 10.5pt / 1.10 line spacing / 1.6 cm margins if the team needs to tighten further. - dist/ is gitignored — open dist/report_PG15.docx in Word/Pages, verify page count, and export to report_PG15.pdf for submission. Co-authored-by: Cursor <cursoragent@cursor.com>
The team repo momofahmi/NLP-sequence-classification is private, so a fresh Colab kernel cannot clone it without GITHUB_TOKEN. Switch the default REPO_URL in main.ipynb, run_deployment_colab.ipynb, 2.1_Baseline_TFIDF_LogReg_Yusrah_Omar.ipynb (and the builder script) to https://github.com/TheFinix13/NLP-coursework.git on branch main, with REPO_DIR=/content/NLP-coursework. REPO_URL and REPO_BRANCH remain overridable via env vars so anyone can point at a different fork. The 2.2 and 2.3 notebooks already used the same env-var pattern with TheFinix13/NLP-coursework as the default — left untouched. Also refreshed README.md Colab badges and the local-setup snippet to use the public mirror. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Joel's note, marker reproducibility comes from loading the canonical
training results we already produced — not from re-running training.
This commit makes main.ipynb self-contained and CPU-runnable in <1 minute:
* Inline every helper function used by the team's domain notebooks:
- 1.1 EDA: imbalance + correlation + POS + slang (Yusrah/Omar)
- 1.2 Vocab: Jaccard + TF-IDF cosine + linguistic features
- 2.1 Baselines: TF-IDF + LR + LinearSVC (per task) with macro-F1
- 2.2 RoBERTa: tokenize, prepare_dataset, compute_metrics,
full_evaluation, calculate_class_weights, WeightedTrainer,
train_roberta, evaluate_on_testset (verbatim from Joel's
NLP-sequence-classification/notebooks/task_2_2.ipynb)
- 2.3 LoRA: train_lora_adapter, evaluate_lora_adapter
* Load canonical results from reports/results/roberta_weighted/ and
reports/results/roberta_sentiment/all_pool.json (extracted from
origin/main:NLP-sequence-classification/{weighted_results,results}/).
Reproduces Joel's 5x3 cross-variety matrix and best-condition
confusion matrix exactly.
* Gate heavy paths behind explicit flags so the notebook runs CPU-only:
- FROM_SCRATCH=False (default): load JSONs, render plots
- FROM_SCRATCH=True: re-run RoBERTa + LoRA training (Colab T4)
- RUN_ERROR_ANALYSIS=False (default): skip OPT-1.3B download
- RUN_BENCHMARK=False (default): skip RoBERTa+OPT timing
Set the True flags on Colab.
* Verified end-to-end execution via nbconvert: 31/31 code cells pass
in 38s on local CPU. Numbers match the report:
- Sentiment all-pool: UK 0.951, AU 0.901, IN 0.855
- Sarcasm best (all): UK 0.735, AU 0.744, IN 0.609
- LoRA en-AU in-var: 0.7747
Files added:
reports/results/roberta_weighted/{uk,au,in,inner_pool,all}.json
reports/results/roberta_sentiment/all_pool.json
reports/figures/roberta_canonical/{cross_variety_matrix,confusion_matrix_best}{,_repro}.png
reports/figures/{sarcasm,sentiment,source,variety,sarcasm_sentiment_correlation,source_by_variety,vocabulary_similarity_heatmap}*.png
reports/figures/q2_3_lora_macro_f1_heatmap_repro.png
Co-authored-by: Cursor <cursoragent@cursor.com>
Joel's worst-case fallback: if the notebook orchestrator misbehaves on a
marker's machine, they can run the entire pipeline as plain Python instead.
* scripts/build_main_script.py — extracts every code cell from
notebooks/main.ipynb, strips IPython magics (%run → subprocess.run),
inserts section banners (§0 setup … §5 efficiency), and writes the
consolidated script to scripts/main.py. Re-run after edits to the
notebook builder to keep both files in sync.
* scripts/main.py — 730-line auto-generated script that explicitly
states all 12 helper functions + WeightedTrainer class from the
team's domain notebooks. Verified end-to-end execution in 47s on
plain Python (no Jupyter):
seed_all, roberta_tokenize, roberta_prepare_dataset,
compute_metrics, full_evaluation, calculate_class_weights,
WeightedTrainer.compute_loss, train_roberta, evaluate_on_testset,
train_lora_adapter, evaluate_lora_adapter
* README.md — adds a "Run the whole pipeline in one command" section
pointing at both entry points (notebook + script) with timing
estimates and a reminder that both clone from the public mirror, so
no access to Mohamed's private repo is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>
Pass to remove the most obvious giveaways from main.ipynb and main.py:
* Strip Unicode tells: em-dashes, en-dashes, right-arrows, smart quotes,
plus-minus sign, double-headed arrows. All ASCII now.
* Drop overused jargon: 'canonical' (was used 11 times), 'verbatim',
'inlined', 'self-contained', 'mirrors X in Y's notebook'.
* Remove the '**bold-italic mini-headers**' inside markdown cells.
* Remove third-person 'so the marker can read...' commentary.
* Cut the table-of-sections at the top of the notebook to two short
paragraphs.
* Shorten or delete redundant code comments that just restated the
function name.
* Replace section-sign 'sec.X.Y' with plain 'Section X.Y'.
Code itself is unchanged. Both main.ipynb (31/31 cells, no errors) and
main.py (47s end-to-end) still produce the same numbers.
Co-authored-by: Cursor <cursoragent@cursor.com>
When running the FROM_SCRATCH=True path, the rerun loop saved JSONs under the test keys returned by `get_test_conditions()` (uk_test/au_test/in_test), which did not match the hardcoded uk_only/au_only/in_only keys used in Joel's saved JSONs. Loading the rerun results then crashed with `KeyError: 'uk_only'`. Auto-detect the test-key naming from the loaded JSONs and use whichever scheme matches. Both naming schemes verified working. Co-authored-by: Cursor <cursoragent@cursor.com>
Fahmi's notebook imports legacy modules (src.data_loader, src.eda,
src.functions_to_use, src.lr_feature_extraction, models.lora_adapters,
models.logistic_regression_class_helper) that exist on origin/main but
were never on fiyin/model-pipeline. This commit:
* Cherry-picks the 6 legacy modules from origin/main so existing code
that imports them works against this branch and the public mirror.
* Fixes a real bug: src/data_loader.py used pd.concat without
importing pandas. Added the missing import.
* Adds notebooks/main_notebook_fahmi_patched.ipynb with three patches:
- prepended a Colab clone cell so it runs on a fresh Colab runtime
- turned the HF_TOKEN raise into a soft-skip (login optional)
- removed os.chdir('..') / replaced PROJECT_ROOT=os.path.abspath('..')
so paths work from any cwd
* Adds the EDA figures Fahmi's notebook produces.
Verified: cells 0-35 of the patched notebook (EDA + full LR baseline)
execute end-to-end with no errors on a fresh kernel.
The team can pick either notebook for submission:
- notebooks/main.ipynb (gates heavy paths, runs in 1 min)
- notebooks/main_notebook_fahmi_patched.ipynb (more LoRA detail,
requires GPU + ~1 hour for full retrain)
Co-authored-by: Cursor <cursoragent@cursor.com>
…ication) Mohamed's repo is now public, so the marker can clone it without auth. Switch every Colab setup cell, badge, and doc reference from the personal mirror (TheFinix13/NLP-coursework) to the team repo so it's clear this is a group project where everyone contributed. - main.ipynb, main.py, main_notebook_fahmi_patched.ipynb, run_deployment_colab.ipynb - 2.1 / 2.2 / 2.3 section notebooks - README badges + docs/BRANCH_fiyin_model_pipeline.md - build_main_notebook.py / build_main_script.py defaults Default branch stays fiyin/model-pipeline because that's where the consolidated main.ipynb + saved JSON results live; both URL and branch are still overridable via REPO_URL / REPO_BRANCH env vars. Co-authored-by: Cursor <cursoragent@cursor.com>
Colab pre-installs torchao==0.10.0 on every runtime. Recent peft (>=0.13) calls is_torchao_available() during LoRA adapter injection and that helper raises ImportError if torchao is installed but < 0.16.0. We don't use torchao for plain LoRA, so the fix is to just remove it. - main.ipynb / main_notebook_fahmi_patched.ipynb: add pip uninstall -y torchao after the requirements install in the Colab setup cell. - scripts/_compat.py: small ensure_peft_compat() helper that removes torchao only if it's the broken version. Idempotent, no-op if torchao isn't installed or is already >= 0.16.0. - q4_extract_errors.py / lime_explain.py / benchmark_inference.py / app/app.py: call ensure_peft_compat() before importing peft so the scripts also work standalone (without going through the setup cell). Unblocks Fahmi's q4 error-extraction run; verified locally that scripts/main.py still runs end-to-end clean. Co-authored-by: Cursor <cursoragent@cursor.com>
q4_extract_errors.py writes a dict with the misclassified examples nested under "examples" (alongside metadata like task, n_total_errors, etc.), but the loader cell in main.ipynb / main.py treated the file as a raw list and tried errors[0] on the dict. That failed with KeyError: 0 because dict[0] looked up an integer key. Match what q4_few_shot_eval.py and lime_explain.py already do: pull the list out of payload["examples"] before indexing. Print the total misclassification count alongside the selected count so the cell is actually informative. Co-authored-by: Cursor <cursoragent@cursor.com>
The few-shot cell was running q4_few_shot_eval.py unconditionally, which sys.exits when no explanation strings are present yet. The LIME cell ran lime_explain.py with no args, which exits asking for either --text or --in. Now both cells: - Check whether reports/results/q4_errors.json exists. - Count entries with a non-empty `explanation` field. - Run the script only when there's enough explained data (>= 4 for few-shot, >= 1 for LIME) and pass the --in / --out-dir args. - Otherwise print a short note explaining what to do (edit the JSON, add explanations, re-run). Side fix: build_main_script.py's strip_magics now uses shlex.split so `%run script.py --arg val` converts to a properly tokenised subprocess.run([sys.executable, 'script.py', '--arg', 'val']) call instead of running a single argv that contains spaces. Co-authored-by: Cursor <cursoragent@cursor.com>
Section 4 of the report identifies 10 specific en-AU test examples (4 explained: idx 142, 302, 508, 618 within the en-AU subset; 6 held out: 264, 523, 657, 256, 395, 492) and the 4 written explanations for the explained ones. Bake them into the repo so the few-shot eval and LIME cells run without anyone having to hand-edit the JSON first. - scripts/q4_build_curated_errors.py: rebuilds the file by loading BESSTIE-CW-26 test, filtering to en-AU, picking the 10 idx values from section 4, and adding the 4 written explanations. No model required, just the dataset. - reports/results/q4_errors.json: pre-built artefact (10 entries, 4 with explanation strings) so the marker can run main.ipynb / main.py end-to-end without any manual step. - main.ipynb / main.py: the error-extraction cell now runs the curated build by default (with RUN_ERROR_ANALYSIS=True). The q4_extract_errors.py path is still available as a commented-out alternative for re-running the actual model. - Loader cell now shows an explained example (not whichever happens to be first) and previews a 250-char excerpt of the explanation. Verified all 10 idx values map to the texts and gold labels printed in the report (Section 4, examples 1-4 and Table 7). Co-authored-by: Cursor <cursoragent@cursor.com>
If a marker (or teammate) has a stale clone of the repo, %run blows up with a confusing IPython OSError. Each %run site now checks the script exists first and prints a one-line "run git pull" message if not. Affects four cells: q4 curated build, q4 few-shot eval, LIME, and the efficiency benchmark. The committed reports/results/q4_errors.json is unchanged, so anyone with an older clone can either pull or just leave RUN_ERROR_ANALYSIS=False and use the pre-built file. Co-authored-by: Cursor <cursoragent@cursor.com>
Two fixes for the Q4 / Q5.2 cells: 1. Loader cell now auto-rebuilds q4_errors.json from the report when it finds 0 explanations on disk (e.g. because an earlier session left a stale file from q4_extract_errors.py). No flags to flip; the marker doesn't need to think about cell ordering. Verified by replacing the committed file with a fake 0-explanation version and re-running main.py - auto-rebuild produced the canonical 4/10 file. 2. Benchmark cell now passes --tfidf-vec / --tfidf-clf / --roberta / --base-llm / --lora flags. Without them benchmark_inference.py was skipping every model and printing "(no rows)". Defaults match the report Table 11: roberta-base, momofahmi/besstie-lora-en-au-opt-1.3b, plus the local TF-IDF artefacts from models/. Switched the benchmark cell from %run-with-magic-vars to subprocess.run so the same code works identically in main.ipynb and main.py - the build_main_script.py converter chokes on `%run script.py $varname` because shlex sees $varname as a literal string. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Completing the Model Pipeline with Training and Evaluation.