Evaluating Your RAG Application Using RAGAS | In 3 Easy Steps

Mahimai Raja
Feb 19, 2024


Evaluations are meant for betterment (improvement)!

Hey folks! In this post, I will take you through understanding and evaluating Retrieval Augmented Generation (RAG) systems.

Need for RAG Evaluation

In the world of LLMs and chatbots, hallucination is the most common disease we are fighting against. Hallucinations are usually handled by two common techniques:

  1. Fine-Tuning for a Specific Task
  2. Retrieval Augmented Generation

Of these two options, RAG systems are the more popular choice.

However, there are quite a lot of options to choose from while building a RAG application, as illustrated in Fig. 1.

Fig. 1. Various Options for RAG Apps (Credits: LangChain AI)

As you can see in the figure, there are a lot of options available when building your RAG app. It is crucial to choose the combination that best fits your needs.

Here comes the awesome framework, RAGAS: Automated Evaluation of Retrieval Augmented Generation, to evaluate RAG-based apps. It focuses on Metric Driven Development (MDD) to improve the performance of RAG apps. You can read more about the framework here.

Alright, come on! Let's get our hands dirty!


IMPLEMENTATION

1. Install and Import Packages

(NOTE: We'll use OpenAI's GPT-4 to evaluate the prepared data, so make sure you have your OpenAI API key ready.)

Install the packages using your favourite package manager. Here, I am using pip to install and manage the dependencies.

pip install -U -q ragas tqdm datasets

Import the installed packages.

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity,
)

# Aspect critique metric (optional; not used in the evaluation below)
from ragas.metrics.critique import harmfulness
from ragas import evaluate

I assume that you already have data to evaluate; if not, kindly feel free to use the sample data below. (Optional)

git clone https://github.com/mahimairaja/sample_ragas_dataset.git
cd sample_ragas_dataset

2. Set Up API Keys and Load Data

Copy your API key from the OpenAI platform dashboard and set it as an environment variable. Here, I am passing the variable through Colab secrets, so before running the cell, make sure you have assigned the secret variable with the API key value.

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
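
If you are not running in Colab, a minimal alternative is to prompt for the key at runtime. This is just a sketch; any secure way of setting the environment variable works.

import os
from getpass import getpass

# Outside Colab: prompt for the key instead of reading Colab secrets.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")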

Here, I am loading the data from a JSON file using the datasets library.

from datasets import load_dataset

ragas_dataset = load_dataset('json', data_files='data.json')
data = ragas_dataset['train']
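
If you just want to sanity-check the pipeline without a file, you can also build a tiny in-memory dataset with the same column layout. The record below is made up purely for illustration:

from datasets import Dataset

# One illustrative record: contexts is a list of retrieved passages per
# question, and ground_truth is the reference answer (values are made up).
sample = {
    "question": ["What does RAGAS evaluate?"],
    "answer": ["RAGAS evaluates Retrieval Augmented Generation pipelines."],
    "contexts": [["RAGAS is a framework for automated, metric-driven evaluation of RAG systems."]],
    "ground_truth": ["RAGAS evaluates RAG pipelines using LLM-based metrics."],
}
data = Dataset.from_dict(sample)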

3. Evaluate and Visualize

Using the metrics imported from ragas, evaluate the dataset. It should contain the columns question, answer, contexts, and ground_truth.

result = evaluate(
    data,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity,
    ],
    raise_exceptions=False,
)

print(result)
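
Beyond the aggregate numbers, the result object in ragas also offers a to_pandas() helper that breaks the scores down per sample, which is handy for spotting the questions that drag a metric down:

# Per-question scores as a pandas DataFrame (one row per sample).
df = result.to_pandas()
print(df.head())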

The result I got for evaluating the sample dataset is shown below:

{
    'context_precision': 0.9000,
    'faithfulness': 1.0000,
    'answer_relevancy': 0.9245,
    'context_recall': 1.0000,
    'context_relevancy': 0.1061,
    'answer_correctness': 0.6074,
    'answer_similarity': 0.9396
}

A quick sanity check: the strikingly low context_relevancy suggests the retrieved chunks contain a lot of text that is not needed to answer the questions. Next, visualize the computed metric results as a radar plot using Plotly. (Do leave a comment if you feel other plot types would be better suited for this…)

import plotly.graph_objects as go

# Collect the aggregate scores (named `scores` to avoid shadowing
# the evaluation dataset `data` loaded earlier).
scores = {
    'context_precision': result['context_precision'],
    'faithfulness': result['faithfulness'],
    'answer_relevancy': result['answer_relevancy'],
    'context_recall': result['context_recall'],
    'context_relevancy': result['context_relevancy'],
    'answer_correctness': result['answer_correctness'],
    'answer_similarity': result['answer_similarity']
}

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
    r=list(scores.values()),
    theta=list(scores.keys()),
    fill='toself',
    name='Ensemble RAG'
))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]  # all RAGAS metrics fall in [0, 1]
        )
    ),
    showlegend=True,
    title='Retrieval Augmented Generation - Evaluation',
    width=800,
)

fig.show()

Hurray! Here is our awesome radar plot visualizing the RAG evaluation metrics.

Fig. 2. Visualization of RAG Evaluation: Radar Plot

(NOTE: You can use this plot to compare multiple RAG variants.)
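
For example, you can overlay the scores of a second pipeline as another trace on the same figure. The baseline numbers below are made up; replace them with the results of your own run:

# Hypothetical scores from a second RAG variant, for comparison only.
baseline_scores = {
    'context_precision': 0.78,
    'faithfulness': 0.85,
    'answer_relevancy': 0.81,
    'context_recall': 0.90,
    'context_relevancy': 0.15,
    'answer_correctness': 0.55,
    'answer_similarity': 0.88
}

fig.add_trace(go.Scatterpolar(
    r=list(baseline_scores.values()),
    theta=list(baseline_scores.keys()),
    fill='toself',
    name='Baseline RAG'
))

fig.show()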

Thanks for reading!

If you are interested in building your own dataset for RAGAS evaluation from scratch, stay tuned for the next blog!

You can find the complete code at the end of the page. See you again…
