Evaluation Metric

I'm planning to add an evaluation metric that can be used to benchmark trained Alpaca models.

I'm going to focus on these two datasets for evaluation:

1. [SQuAD Dataset](https://huggingface.co/datasets/squad) - F1 Score
2. [WikiText Dataset](https://huggingface.co/datasets/wikitext) - Perplexity
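
For reference, here is a minimal sketch of how the two scores are usually computed. It assumes the common SQuAD v1.1 convention for F1 (lowercase, strip punctuation and English articles, then compare token overlap, taking the best score over all reference answers) and the usual definition of perplexity as the exponential of the mean negative log-likelihood per token. The function names are illustrative, not part of any existing code in this repo, and the per-token log-probabilities are assumed to come from whatever model is being evaluated.

```python
import math
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_f1(prediction: str, references: list[str]) -> float:
    """Best F1 over all reference answers for a single question."""
    return max(token_f1(prediction, ref) for ref in references)


def perplexity(token_log_probs: list[float]) -> float:
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

The dataset-level F1 would then be the mean of `squad_f1` over all questions, and the WikiText perplexity would be `perplexity` applied to the log-probabilities of every token in the evaluation text.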

I'm not sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think it can be done with the right prompt.
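
One possible shape for such a prompt is sketched below. It assumes the model was fine-tuned with the Stanford Alpaca instruction/input/response template; the instruction wording and the `build_squad_prompt` helper are hypothetical and would likely need tuning before the F1 numbers mean much.

```python
# Assumed to match the Stanford Alpaca "with input" template used for fine-tuning.
ALPACA_PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical instruction text; asking for a short copied span keeps the
# output close to SQuAD's extractive answers.
QA_INSTRUCTION = "Answer the question using a short span copied from the context."


def build_squad_prompt(context: str, question: str) -> str:
    """Wrap one SQuAD example (context + question) in the Alpaca template."""
    model_input = f"Context: {context}\nQuestion: {question}"
    return ALPACA_PROMPT_WITH_INPUT.format(
        instruction=QA_INSTRUCTION, input=model_input
    )
```

The text generated after `### Response:` would be treated as the predicted answer and scored against the SQuAD reference answers.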


Evaluation Metric #44
