
llama-bench: fix accumulated load_time in perf timings #21794

Open

abhinavuser wants to merge 1 commit into ggml-org:master from abhinavuser:fix/llama-bench-accumulated-load-time

Conversation

@abhinavuser

Overview

Fix accumulated load_time in llama_perf_context_print when running llama-bench with multiple parameter sets (e.g. -n 4,8,16,32).

The load_time keeps growing because each new context inherits t_start_us from the model's original load timestamp, but the model is reused across runs. This change adds a call to llama_perf_context_reset(ctx) after context creation so the timing baseline is reset for each iteration.
Fixes #9286

Additional information

Spotted this while benchmarking with verbose output — the load_time went from ~1s to ~33s across 7 runs even though the model was only loaded once.

Requirements


@abhinavuser
Author

Not AI generated; I traced the bug through the code. Each new context copies t_start_us from the model's original load timestamp (llama-context.cpp line 35), so when the model is reused across bench iterations the load_time keeps growing.
The fix is just resetting the perf context so t_start_us gets a fresh baseline.



Development

Successfully merging this pull request may close these issues.

Bug: llama_print_timings seems to accumulate load_time/total_time in llama-bench
