Evaluating LLMs
Quantifying the performance of large language models is crucial for evaluating new techniques and validating new approaches, so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve as numeric quantities for comparison across models.
However, model performance is often governed by minor implementation details, and results from one codebase rarely transfer directly to another. Moreover, papers often do not provide the code or sufficient detail needed to fully replicate their evaluations.
To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on exactly the same inputs and codebase. This provides a single, trusted point of reference for evaluating new LLMs and saves practitioners from repeatedly implementing few-shot evaluations, while ensuring that their results can be compared against previous work. The LM Evaluation Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility.
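As a rough illustration, the sketch below shows how a Hugging Face causal language model might be scored through the harness's Python API. The `simple_evaluate` entry point and the specific model and task names are based on recent releases of the `lm-eval` package and may differ across versions.

```python
# Sketch of scoring a Hugging Face causal LM with the LM Evaluation Harness.
# Assumes a recent release of the `lm-eval` package; the entry point and
# argument names may vary between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face causal LM backend
    model_args="pretrained=EleutherAI/pythia-160m", # any causal checkpoint works here
    tasks=["lambada_openai", "hellaswag"],          # benchmark tasks to run
    num_fewshot=0,                                  # zero-shot evaluation
    batch_size=8,
)

# Per-task metrics (e.g. accuracy, perplexity) come back in a nested dict,
# giving the numeric scores that can be compared across models.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Because every model is run through the same task definitions and prompt formatting, the scores produced this way are directly comparable across models and across published results that use the same task versions.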
Releases
Papers