Code for the NAACL paper "When Quantization Affects Confidence of Large Language Models?" (Jupyter Notebook; updated Dec 30, 2024)
Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90
Evaluation of Llama-3.1-8B Base vs Instruct on TruthfulQA using few-shot prompting and automatic judge models
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions, using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K); a minimal usage sketch follows this list.
Official code for "From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems" (IWSDS 2026)
Multi-agent framework for hallucination detection and correction in LLM outputs using retrieval-grounded verification. MSc AI/ML dissertation (LJMU).
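The deepeval-based comparison described above can be sketched in a few lines. The snippet below is not code from any of the listed repositories; it is a minimal illustration that assumes deepeval's public benchmark classes (`TruthfulQA`, `DeepEvalBaseLLM`) and an Ollama server on its default `localhost:11434` endpoint. The model name `llama3.1` is only an example.

```python
# Minimal sketch: plugging a local Ollama-served model into a deepeval benchmark.
# Assumes `pip install deepeval requests` and a running Ollama server.
import requests
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.benchmarks import TruthfulQA


class OllamaModel(DeepEvalBaseLLM):
    """Thin wrapper exposing an Ollama-served model to deepeval benchmarks."""

    def __init__(self, model_name: str = "llama3.1"):
        self.model_name = model_name

    def load_model(self):
        # Nothing to load locally; the Ollama server holds the weights.
        return self.model_name

    def generate(self, prompt: str) -> str:
        # Ollama's REST generate endpoint; stream=False returns a single JSON object.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model_name, "prompt": prompt, "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


if __name__ == "__main__":
    benchmark = TruthfulQA()  # deepeval's TruthfulQA benchmark with default settings
    benchmark.evaluate(model=OllamaModel("llama3.1"))
    print("TruthfulQA overall score:", benchmark.overall_score)
```

Running the same wrapper through deepeval's MMLU and GSM8K benchmarks would give the "identical conditions" comparison the tool describes, since every model is queried through the same local endpoint and prompt path.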