May 21, 2024

How to Benchmark Large Language Model Performance with LlamaIndex Evaluators

This blog post equips you with the knowledge and tools necessary to effectively benchmark large language models.


Benchmarking LLMs accurately is a complex task. LlamaIndex evaluators provide a framework for assessing model responses. They measure several aspects of a response, such as its faithfulness to the retrieved context, its relevancy to the query, and its correctness against a reference answer.

Benchmarking large language models matters because it allows systematic comparison of different models on standardized tasks, which helps stakeholders identify the models that excel in particular domains or applications. It makes it possible to track the progress of language model development over time and gives insight into the pace of innovation in the field. Benchmarking also exposes the strengths and weaknesses of LLMs, which guides further research and development efforts to improve model performance and address limitations.

Moreover, it informs resource allocation decisions, so that resources are invested efficiently in developing and optimizing models. Benchmarks can also evaluate models for fairness, bias, and ethical considerations, which are increasingly important as LLMs are deployed in diverse applications that affect human lives. In short, benchmarking is an important tool for evaluating, improving, and ensuring the effectiveness and fairness of large language models in various contexts.

In this blog post, we aim to equip you with the knowledge and tools necessary to benchmark large language models effectively, so that you can make informed decisions about their deployment and use. Let's start.

Datasets for Benchmarking LLMs

The following datasets are widely used to benchmark LLMs on language understanding, question answering, and code generation.

MMLU Benchmark

The MMLU (Massive Multitask Language Understanding) Benchmark is a comprehensive evaluation framework designed to measure the multitask accuracy of Large Language Models across a wide range of tasks. Developed to assess an LLM's proficiency in understanding and processing natural language in various domains, the MMLU Benchmark covers 57 tasks spanning different branches of knowledge, including humanities, social sciences, STEM, and other areas. The MMLU dataset comprises 15,908 questions manually collected by graduate and undergraduate students. These questions are curated to cover a broad spectrum of topics and difficulty levels. Every question in the MMLU dataset is multiple-choice with four answer options, so a model's answer can be scored automatically by comparing its chosen option with the labeled answer.

To evaluate LLM performance, MMLU averages a model's accuracy within each of the four broad categories (humanities, social sciences, STEM, and other) and then averages these four scores for a final overall score. This scoring mechanism provides a comprehensive overview of the model's performance across different knowledge domains. The MMLU benchmark provides an evaluation platform where researchers and developers can submit their models for assessment. The platform offers detailed insights into the model's performance across different tasks and allows for comparisons with other state-of-the-art models.
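The two-step averaging described above can be sketched in a few lines. This is a minimal illustration only: the accuracy numbers are made up, and the helper names `category_scores` and `overall_score` are hypothetical, not part of any official MMLU tooling.

```python
from statistics import mean

# Hypothetical per-subject accuracies grouped by MMLU's four broad categories.
# The numbers are invented purely for illustration.
subject_accuracy = {
    "humanities": {"philosophy": 0.62, "high_school_world_history": 0.70},
    "social_sciences": {"econometrics": 0.40, "sociology": 0.76},
    "stem": {"college_physics": 0.35, "machine_learning": 0.45},
    "other": {"professional_medicine": 0.58, "nutrition": 0.66},
}


def category_scores(acc_by_category):
    """Average the subject accuracies inside each category."""
    return {cat: mean(subjects.values()) for cat, subjects in acc_by_category.items()}


def overall_score(acc_by_category):
    """Average the category scores into one overall number."""
    return mean(category_scores(acc_by_category).values())


for category, score in category_scores(subject_accuracy).items():
    print(f"{category}: {score:.3f}")
print(f"overall: {overall_score(subject_accuracy):.3f}")
```

Averaging per category first means a category with many subjects does not dominate the overall score.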

MT Bench Benchmark

MT Bench is a comprehensive evaluation framework designed to assess the conversational and instruction-following abilities of Large Language Models in multi-turn dialogue scenarios. Developed to evaluate how well LLMs understand and respond to human prompts across various contexts, MT Bench consists of carefully curated sets of questions that cover a diverse range of topics and interaction styles. The MT Bench dataset includes high-quality, multi-turn questions that reflect real-world conversational and instructional scenarios. Each question is crafted to assess specific aspects of language understanding, which allows for targeted evaluation of an LLM’s capabilities.

MT Bench employs rigorous evaluation metrics to measure the performance of LLMs across different tasks and question types. Metrics may include accuracy, coherence, relevance, and task completion rates, providing detailed insights into the model's strengths and weaknesses in various linguistic contexts. Models submitted to MT Bench are evaluated based on predefined criteria, and their performance scores are recorded on a leaderboard. The leaderboard allows researchers and developers to track model performance over time and compare the effectiveness of different LLMs in handling multi-turn dialogue and instruction-following tasks.

HumanEval Benchmark

The HumanEval benchmark is a fundamental evaluation tool designed to measure the performance of Large Language Models specifically in code generation tasks. Developed to assess LLMs' ability to generate functional and accurate code solutions to programming challenges, HumanEval incorporates a hand-crafted dataset and a novel evaluation metric tailored for code generation tasks. The cornerstone of the benchmark is the HumanEval dataset, which comprises a set of 164 handwritten programming problems meticulously crafted to assess language comprehension, algorithmic reasoning, and mathematical proficiency. Each programming problem within the dataset includes a function signature, docstring, code body, and several unit tests, which provide a comprehensive evaluation of any LLM’s code generation capabilities.

HumanEval introduces an evaluation metric known as Pass@k, which is designed to assess the functional correctness of code generated by LLMs. The Pass@k metric measures the probability that at least one of k code samples generated for a given problem passes the accompanying unit tests. This metric mirrors the way human developers judge code correctness: by whether it satisfies predefined test cases. To evaluate an LLM with HumanEval, the model generates candidate solutions for each problem, and a problem counts as solved if any of the k candidates considered passes all of its unit tests. A higher Pass@k score indicates a greater likelihood of generating functionally correct code across multiple attempts, which reflects the model's proficiency in code generation tasks.
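In practice, Pass@k is usually estimated by generating n samples per problem (with n at least k), counting the c samples that pass, and computing the chance that a random draw of k samples contains at least one pass. The sketch below follows the estimator described in the paper that introduced HumanEval; the function names and the sample counts are our own illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts: 10 samples per problem with 3, 0, and 10 passing.
per_problem = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5):
    print(f"pass@{k} = {mean_pass_at_k(per_problem, k):.3f}")
```

Sampling more than k completions per problem and using this formula gives a lower-variance estimate than generating exactly k samples once.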

The HumanEval benchmark maintains a leaderboard that ranks LLMs based on their performance in code generation tasks. Models are evaluated and compared using the Pass@k metric, which allows researchers and developers to assess the effectiveness of different LLM architectures, training methodologies, and fine-tuning strategies in producing accurate and reliable code solutions.

TruthfulQA Benchmark

The TruthfulQA Benchmark is a specialized evaluation framework designed to measure the ability of Large Language Models to generate accurate and truthful answers to questions. Unlike traditional question-answering benchmarks that focus solely on correctness, TruthfulQA places a strong emphasis on the veracity of the generated responses, with the aim of ensuring that language models provide factually accurate information. TruthfulQA covers a wide range of topics and question types to ensure a comprehensive evaluation of LLMs' truthfulness in answer generation. Questions span various domains, including health, law, finance, politics, and more, providing a diverse set of challenges for language models to tackle. The TruthfulQA dataset comprises carefully curated questions that are designed to test the factual accuracy of a language model’s responses. Questions are sourced from real-world contexts and undergo rigorous validation to ensure their relevance and accuracy.

LLMs participating in the TruthfulQA Benchmark are evaluated based on their ability to generate truthful answers across different question categories. Models that consistently produce accurate and factually correct responses receive higher scores, reflecting their proficiency in truthfully answering questions. TruthfulQA maintains a leaderboard where researchers and developers can submit their language models for evaluation. The leaderboard allows for tracking model performance over time and comparing the truthfulness of different LLMs' answer-generation capabilities.

ARC Benchmark

The ARC (AI2 Reasoning Challenge) Benchmark is a rigorous evaluation framework designed to assess the reasoning abilities of Large Language Models through a series of challenging science-related questions. Developed by the Allen Institute for Artificial Intelligence (AI2), ARC focuses on testing the knowledge and reasoning capabilities of LLMs in understanding and answering complex science questions. The ARC dataset is partitioned into two subsets: the Challenge Set and the Easy Set. The Challenge Set consists of particularly difficult questions, which require advanced reasoning skills to answer correctly. In contrast, the Easy Set includes questions that are relatively simpler and serve as a baseline for comparison. LLMs participating in the ARC Benchmark are evaluated based on their performance in answering science questions accurately. Metrics may include accuracy rates, precision, recall, and F1 score, which provide a comprehensive assessment of model performance across different question types and difficulty levels.

The ARC dataset is meticulously curated to ensure the quality and relevance of the questions. Questions are sourced from various scientific sources and undergo rigorous validation to verify their accuracy and correctness. ARC maintains a leaderboard where researchers and developers can submit their LLMs for evaluation. The leaderboard allows tracking model performance over time and comparing the reasoning abilities of different language models.

MBPP Benchmark

The Mostly Basic Python Programming (MBPP) Benchmark is a specialized evaluation framework designed to assess the ability of Large Language Models to synthesize short Python programs from natural language descriptions. This benchmark focuses on evaluating the programming proficiency of LLMs, particularly in the context of Python, by presenting them with a series of programming tasks ranging from fundamental concepts to standard library functionality. The MBPP dataset consists of carefully curated programming tasks, each accompanied by a natural language description, code solution, and automated test cases to verify functional correctness. Tasks are designed to cover a broad spectrum of Python programming concepts and scenarios, which provide a comprehensive evaluation of the LLM’s programming capabilities.

LLMs participating in the MBPP Benchmark are evaluated based on their ability to generate Python code solutions that pass the provided test cases. Metrics such as accuracy, efficiency, and adherence to programming best practices may be used to assess the quality of the generated code. Models submitted to the MBPP Benchmark are scored based on their performance in generating correct and functional Python code solutions for the given programming tasks. A leaderboard tracks the performance of different LLMs, which allows researchers and developers to compare their programming proficiency across various tasks. The MBPP Benchmark is particularly relevant for assessing LLMs' suitability for tasks that involve automated code generation, such as code completion, code summarization, and program synthesis. It provides insights into the model’s ability to understand natural language descriptions of programming tasks and translate them into executable code.
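To make the "pass the provided test cases" criterion concrete, here is a minimal sketch of how a generated solution can be checked against assert-style tests. The task, the candidate solutions, and the `passes_tests` helper are all hypothetical. Running model output with `exec` is unsafe outside a sandbox, and a real harness would also enforce timeouts.

```python
from typing import List


def passes_tests(candidate_code: str, test_cases: List[str]) -> bool:
    """Return True if the candidate code satisfies every assert statement."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)  # each test is a single assert statement
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True


# Hypothetical task: "Write a function add_two that returns the sum of two numbers."
task_tests = ["assert add_two(1, 2) == 3", "assert add_two(-1, 1) == 0"]
good_candidate = "def add_two(a, b):\n    return a + b\n"
bad_candidate = "def add_two(a, b):\n    return a - b\n"

print(passes_tests(good_candidate, task_tests))
print(passes_tests(bad_candidate, task_tests))
```

The fraction of tasks for which a generated solution passes every test gives the functional-correctness score.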

Utilizing Cloud GPUs for Benchmarking

As large language models grow in size and capability, the demand for cloud GPUs keeps rising. We will use E2E Networks.

Set up your SSH key by visiting Settings.

After creating the SSH key, visit Compute to create a node instance.

Open Visual Studio Code and install the Remote Explorer and Remote SSH extensions. Open a new terminal and log in to your node from your local system with the following command:

ssh root@`your-ip-address`

With this, you’ll be logged in to your node. 

Evaluating LLMs with LlamaIndex

To evaluate LLMs, we will use a RAG (Retrieval Augmented Generation) setup. In this blog post, we will evaluate the Danube 1.8B model with the MMLU Professional Medicine dataset. Danube 1.8B is a notable open-source language model with 1.8 billion parameters, trained on a corpus of 1 trillion tokens sourced from diverse web content. Leveraging techniques refined from predecessors like Llama 2 and Mistral, the model performs well across a spectrum of tasks including common sense reasoning, reading comprehension, summarization, and translation. Even with its relatively modest training data, benchmark results indicate that this model competes favorably with or outperforms other models in its parameter class.

Let’s evaluate the performance of Danube 1.8B with LlamaIndex evaluators on the MMLU Benchmark dataset.

First, we’ll install the dependencies.

%pip install -q llama-index
%pip install -q datasets transformers
%pip install -q llama-index-embeddings-huggingface
%pip install -q llama-index-llms-huggingface
%pip install -q spacy

Then, we will load the MMLU dataset and save it into a directory.

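from datasets import load_dataset

# Load the test split of the MMLU Professional Medicine subject
mmlu = load_dataset('cais/mmlu', 'professional_medicine', split='test')

mmlu_df = mmlu.to_pandas()
!mkdir mmlu
# Save the DataFrame so SimpleDirectoryReader can read it (the file name is an assumption)
mmlu_df.to_csv('mmlu/mmlu.csv', index=False)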
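Each row of this dataset holds a question, a subject, a list of four choices, and the index of the correct choice. As a minimal sketch of how such a row can be turned into a multiple-choice prompt and scored, consider the helpers below. The function names are hypothetical, the prompt layout is one common convention rather than a required format, and the example row is abbreviated from one of the questions used later in this post.

```python
LETTERS = "ABCD"


def format_mmlu_prompt(row: dict) -> str:
    """Render one MMLU row as a lettered multiple-choice prompt."""
    lines = [row["question"].strip()]
    for letter, choice in zip(LETTERS, row["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(model_output: str, row: dict) -> bool:
    """Compare the first character of the model output with the gold letter."""
    stripped = model_output.strip().upper()
    return bool(stripped) and stripped[0] == LETTERS[row["answer"]]


# Abbreviated example row; "answer" is a zero-based index into "choices".
row = {
    "question": "The most likely cause of these findings is entrapment of which of the following on the right?",
    "subject": "professional_medicine",
    "choices": [
        "Median nerve at the wrist",
        "Musculocutaneous nerve at the forearm",
        "Radial nerve at the forearm",
        "Ulnar nerve at the elbow",
    ],
    "answer": 3,
}

print(format_mmlu_prompt(row))
print(is_correct("D", row))
```

Scoring by answer letter is the direct way to measure MMLU accuracy, which differs from the LLM-judged evaluation used in the rest of this walkthrough.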

Now, we’ll load the dataset using LlamaIndex SimpleDirectoryReader.

from llama_index.core import SimpleDirectoryReader
mmlu_dir = SimpleDirectoryReader('./mmlu').load_data()

Then, we’ll initiate the embeddings using Hugging Face.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

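We’ll initiate the large language model with the help of Hugging Face and save the LLM, embed_model, and chunk size in the settings.

from llama_index.llms.huggingface import HuggingFaceLLM
import torch
from llama_index.core import Settings

# Note: the model_name and tokenizer_name arguments are missing from this snippet;
# pass the Hugging Face ID of the Danube 1.8B checkpoint for both, otherwise
# HuggingFaceLLM falls back to its default model.
llm = HuggingFaceLLM(
    # With do_sample=False decoding is greedy, so temperature has no effect
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16},
)

Settings.llm = llm
Settings.chunk_size = 512
Settings.embed_model = embed_model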

Then, we’ll initiate the Splitter and use it in the vector index for transformations along with the MMLU document. 

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex
splitter = SentenceSplitter(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(
    mmlu_dir, transformations=[splitter]
)

The LlamaIndex evaluators run asynchronously, so we’ll allow nested asyncio event loops in the notebook.

import nest_asyncio

nest_asyncio.apply()

After that, we will initiate the evaluators with the LLM.

from llama_index.core import Response
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

faithfulness = FaithfulnessEvaluator(llm=llm)
relevancy = RelevancyEvaluator(llm=llm)
correctness = CorrectnessEvaluator(llm=llm)

Let’s make a question-answer pair using the dataset.

qa = ["What is the most likely cause of decreased sensation in the left thigh of a 67-year-old woman who had a pulmonary embolism and underwent placement of an inferior vena cava (IVC) filter?",
"What is the most likely diagnosis for a 25-year-old gravida 3 para 2 female in active labor at 39 weeks' gestation who is experiencing intermittent, weak contractions despite having a cervix that is 100% effaced and 7 cm dilated?",
"How should the findings of a randomized, double-blind, placebo-controlled clinical trial evaluating a new drug for the treatment of the common cold, which showed a statistically significant difference in symptom resolution time compared to placebo, be interpreted?",
"Which complication is a 9-year-old boy, who immigrated to the United States 2 months ago and presents with a systolic murmur, weak femoral pulses, and left ventricular hypertrophy on ECG, at greatest risk for?",
"What is the most likely cause of numbness and tingling in the right ring and small fingers of a 25-year-old woman with a history of playing softball and severe pain elicited by palpation of the right elbow?"]

ref = ["A 67-year-old woman comes to the physician for a follow-up examination. She had a pulmonary embolism and required treatment in the hospital for 3 weeks. She had a retroperitoneal hemorrhage; anticoagulant therapy was temporarily discontinued, and she underwent placement of an inferior vena cava (IVC) filter. She had a hematoma that was resolving on discharge from the hospital 2 weeks ago. Today, she says she has had a persistent sensation of tingling and numbness of her left thigh that she did not report in the hospital because she thought it would go away; the sensation has improved somewhat during the past week. Her only medication is warfarin. Vital signs are within normal limits. Examination of the skin shows no abnormalities. Muscle strength is normal. Sensation to light touch is decreased over a 5 x 5-cm area on the lateral aspect of the left anterior thigh. Which of the following is the most likely cause of this patient's decreased sensation?,professional_medicine,['Cerebral infarction during the hospitalization'\n 'Complication of the IVC filter placement'\n 'Compression of the lateral femoral cutaneous nerve'\n 'Hematoma of the left thigh'],2",
 "A 25-year-old gravida 3 para 2 female is admitted to the hospital at 39 weeks' gestation in active labor. She had been having regular contractions every 4 minutes, but is now having only a few intermittent, weak contractions. She has received medication for pain twice in the past 6 hours. Examination shows no reason for obstructed labor. The fetal head is engaged, the membranes are intact, the fetal heart tones are normal, and the cervix is 100% effaced and 7 cm dilated. The most likely diagnosis is,professional_medicine,['Braxton Hicks contractions' 'lower uterine retraction ring'\n 'hypotonic uterine dysfunction' 'primary dysfunctional labor'],2", 
"A 5-year-old boy is brought to the physician by his mother because of a 2-day history of a low-grade fever, cough, and runny nose. His temperature is 38°C (100.4°F). Examination findings are consistent with a diagnosis of a common cold. The physician refers to a randomized, double-blind, placebo-controlled clinical trial that evaluated the effectiveness of a new drug for the treatment of the common cold. The mean time for resolution of symptoms for patients receiving the new drug was 6.4 days, compared with a mean time of 6.7 days for patients receiving the placebo (p=0.04). Which of the following is the most appropriate interpretation of these study results?,professional_medicine,['The findings are clinically and statistically significant'\n 'The findings are clinically insignificant but statistically significant'\n 'The findings are clinically significant but statistically insignificant'\n 'The findings are neither clinically nor statistically significant'],1" , 
"A 9-year-old boy is brought to the office by his parents for a well-child examination. The patient and his family immigrated to the United States 2 months ago and he has not been evaluated by a physician in 4 years. He has been generally healthy. Medical history is significant for pneumonia at age 3 years. He takes no medications. He is at the 25th percentile for height, weight, and BMI. Vital signs are temperature 37.0°C (98.6°F), pulse 82/min, respirations 20/min, and blood pressure 112/74 mm Hg. Cardiac examination discloses a grade 3/6 systolic murmur audible along the left sternal border at the third and fourth intercostal spaces. Femoral pulses are weak and brachial pulses are strong; there is a radiofemoral delay. Chest xray discloses mild cardiomegaly with left ventricular prominence. ECG shows left ventricular hypertrophy. This patient is at greatest risk for which of the following complications?,professional_medicine,['Atrial fibrillation' 'Cor pulmonale' 'Systemic hypertension'\n 'Tricuspid valve regurgitation'],2",
 "A 25-year-old woman comes to the physician because of a 2-month history of numbness in her right hand. During this period, she has had tingling in the right ring and small fingers most of the time. She has no history of serious illness and takes no medications. She is employed as a cashier and uses a computer at home. She played as a pitcher in a softball league for 5 years until she stopped 2 years ago. Vital signs are within normal limits. Examination shows full muscle strength. Palpation of the right elbow produces a jolt of severe pain in the right ring and small fingers. Sensation to pinprick and light touch is decreased over the medial half of the right ring finger and the entire small finger. The most likely cause of these findings is entrapment of which of the following on the right?,professional_medicine,['Median nerve at the wrist' 'Musculocutaneous nerve at the forearm'\n 'Radial nerve at the forearm' 'Ulnar nerve at the elbow'],3"]

Now, we’ll run the batch evaluation.

from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness, "relevancy": relevancy, "correctness": correctness},
)

eval_results = await runner.aevaluate_queries(
    vector_index.as_query_engine(llm=llm), queries=qa, reference=ref,
)

Then, we’ll define the function for getting evaluation results.

def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Score: {score}")
    return score
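The function above reports a pass rate, which only counts binary pass/fail outcomes. As a minimal sketch, assuming each result also exposes a numeric score next to its passing flag, we can report the mean score too. This helps when the correctness evaluator marks a response as passing only if its score reaches a threshold, because a pass rate of 0.0 then hides how close the answers came. `FakeEvalResult` below is a hypothetical stand-in rather than a LlamaIndex class, and the scores are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeEvalResult:
    """Hypothetical stand-in with the two fields used in this sketch."""
    passing: Optional[bool]
    score: Optional[float]


def pass_rate(results: List[FakeEvalResult]) -> float:
    """Fraction of results whose passing flag is true."""
    return sum(1 for r in results if r.passing) / len(results)


def mean_score(results: List[FakeEvalResult]) -> float:
    """Mean of the numeric scores, ignoring results without a score."""
    scores = [r.score for r in results if r.score is not None]
    return sum(scores) / len(scores) if scores else float("nan")


# Invented results: no correctness result passes, yet the scores are not all minimal.
fake_results = {
    "correctness": [FakeEvalResult(False, 2.0), FakeEvalResult(False, 3.5), FakeEvalResult(False, 3.0)],
    "relevancy": [FakeEvalResult(True, 1.0), FakeEvalResult(True, 1.0), FakeEvalResult(False, 0.0)],
}

for key, results in fake_results.items():
    print(f"{key}: pass rate {pass_rate(results):.2f}, mean score {mean_score(results):.2f}")
```

Looking at both numbers makes it easier to tell a model that is consistently wrong from one that narrowly misses the passing threshold.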

Let’s see what our scores are:

cor_score = get_eval_results("correctness", eval_results)
rel_score = get_eval_results("relevancy", eval_results)
fai_score = get_eval_results("faithfulness", eval_results)

print('Correctness: ',cor_score)
print('Relevancy: ',rel_score)
print('Faithfulness: ',fai_score)

The outputs are:

Correctness:  0.0
Relevancy:  0.93
Faithfulness:  0.89


In this walkthrough, we evaluated the Danube 1.8B model in a RAG setup using LlamaIndex evaluators. The relevancy (0.93) and faithfulness (0.89) scores were good; however, the correctness score was 0.0, which means that retrieval did not make the responses more accurate.