## Library to load the PDF
%pip install --upgrade --quiet pypdf
## Library for chunking
%pip install --upgrade --quiet tiktoken
%pip install --upgrade --quiet nltk
## Library for the embedding
%pip install --upgrade --quiet gpt4all
%pip install --upgrade --quiet sentence-transformers
## Library to store the embeddings in a vector DB
%pip install --upgrade --quiet chromadb
## Library for information retrieval
%pip install --upgrade --quiet rank_bm25
## Library for the LLM interaction
%pip install --upgrade --quiet langchain
%pip install --upgrade --quiet langchain-community
## Library to save the results in a word document
%pip install --upgrade --quiet python-docx
%pip install --upgrade --quiet markdown
## Library to evaluate the RAG process
%pip install --upgrade --quiet datasets
%pip install --upgrade --quiet ragas
## Library to save evaluation dataset in excel
%pip install --upgrade --quiet pandas
%pip install --upgrade --quiet openpyxl
%pip install --upgrade --quiet plotly
Retrieval-Augmented Generation (RAG) - Technical approach paper on the systematic application of AI in evaluation synthesis and summarization
Evaluating AI Usage for Evaluation Purposes
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.” - Edward Osborne Wilson
Executive Summary
Artificial Intelligence (AI) is presented as the potential trigger for the fifth wave of the evidence revolution (following the four previous ones: 1. Outcome Monitoring, 2. Impact Evaluation, 3. Systematic Reviews and 4. Knowledge Brokering). This reflects a situation where, considering the number of published evaluation reports across the UN system, information retrieval and evidence generalization challenges have arisen: how to extract lessons and learning across contexts, institutions, programs, and evaluations in order to inform strategies and decision-making in other, similar contexts?
The key deliverable from an evaluation is usually a long report (often a PDF file of over 60 pages). From this report, two-page executive “briefs” are usually designed for the consumption of a broader audience, including senior executives. Striking the balance between breadth and depth is a common challenge, but what remains even more challenging is the subjective dimension involved in choosing what to include and what to exclude. Highlighting critical aspects while deciding which less relevant details to omit relies on people’s judgment as to what is important for specific audiences. The potential fear of being, like Cassandra in Greek mythology, the bearer of bad news comes with the structural risk of “cushioning” the real evaluation findings to the point where they get hidden. Relying on automated retrieval can therefore help improve the objectivity and independence of evaluation report summarization.
Retrieval-Augmented Generation (RAG) is an AI question-answering framework that surfaced in 2020 and that combines the capabilities of Large Language Models (LLMs) with information retrieval systems over a specific domain of expertise (hereafter “evaluation reports”). This paper presents the challenges and opportunities associated with this approach in the context of evaluation. It then suggests a potential solution and way forward.
First, we explain how to create an initial two-page evaluation brief using an orchestration of functions and models from the Hugging Face Hub. Rather than relying on ad-hoc user interactions through a black-box point-and-click chat interface, a relevant alternative is to use a data science approach with documented and reproducible scripts that can directly output a Word document. The same approach could be applied to other textual analysis needs, for instance: extracting causal chains from the transcriptions of Focus Group Discussions, performing Quality Assurance reviews on key documents, generating potential theories of change from needs assessment reports, or assessing sufficient usage of programmatic evidence when developing a Strategic Plan for an Operation.
Second, we review the techniques that can be used to evaluate the performance of summarisation scripts, both to optimize them and to minimize the risk of AI hallucinations and misalignment. We generate alternative briefs (#2, #3, #4) and then create a specific test dataset to explore the different metrics that can be used to evaluate the information retrieval process.
Last, we discuss how such an approach can inform decisions and strategies for an efficient AI deployment: while improving the RAG pipeline is the first important step, creating a training dataset with a human in the loop makes it possible to “ground truth” and “fine-tune” an existing model. This not only further increases its performance but also ensures its reliability, both for evidence retrieval and, at a later stage, for learning across systems and contexts.
A short presentation is also available here
Introduction
Building a robust information retrieval system requires the configuration of different components:
A Retrieval & Generation Pipeline: Build a knowledge base and configure how to retrieve the information from it, then define an efficient prompt to query the system;
A Continuous Evaluation Process: Explore and combine various options for both Retrieval and Generation to compare the results.
A Production Deployment Strategy: Organise AI-ready human feedback and prepare data for fine-tuning.
This paper compiles the results of experimentation applied to a practical use case. It includes a cookbook with reproducible recipes so that colleagues can rerun and learn from it. It also contains broader suggestions on the usage of AI for summarizing and synthesizing evaluation products and reports.
A non-technical audience can simply read the executive summary above, the conclusions and the linked presentation.
Environment Set up
The body of this document targets a technical audience that may consider including such techniques within their personal information management toolkit, working safely and fully offline on their own computer. To engage this audience, we used the 2019 Evaluation of UNHCR’s data use and information management approaches for the demo. Readers should be able to adjust this tutorial to their own use cases and are welcome to ask questions and share comments through a ticket in the source repository!
The world of LLMs is a Python one. The scripts below are based on the LangChain Python module, but the same pipeline could also be built with another LLM orchestration module like LlamaIndex.
Make sure to install the latest stable version of Python and create a dedicated Python environment to get a fresh install where all the dependencies between packages can be managed correctly. This can be done with the conda package and environment management utility.
First, directly in your OS shell, create a new environment - here called evalenv:
conda create --name evalenv python=3.11
Then activate it! Et voila!
conda activate evalenv
Once this environment is selected as the kernel to run the notebook, we can install the required Python modules for RAG (the %pip install commands listed at the top of this document), then restart the Jupyter kernel:
# then Restart the jupyter kernel for this notebook
%reset -f
Retrieval & Generation Pipeline
The illustration from HuggingFace RAG Evaluation below nicely visualizes the first two elements of the system architecture: retrieval (which includes chunking, embedding, storing and retrieving) and generation (which includes prompting an LLM).
Information Retrieval
Load the PDF
There are plenty of Python packages available to load PDF files… More details here. Note that loaders also exist for other types of data.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("files/Info_Mngt_eval_2019.pdf")
docs = loader.load_and_split()
Chunking
If you have a large document, you will not be able to process it in one piece because of the model’s context window and memory limits. LangChain offers several built-in text splitters to divide text into smaller chunks based on different criteria.
Examples of options that can be tested are:
- Simple character-level splitting with CharacterTextSplitter,
- Recursive splitting with RecursiveCharacterTextSplitter,
- Splitting by tokens (words or semantic units) with TokenTextSplitter,
- Context-aware splitting with NLTKTextSplitter.
To see an example of how chunking works, check this online viz.
from langchain.text_splitter import CharacterTextSplitter
splitter_text = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks_text = splitter_text.split_documents(docs)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter_recursivecharactertext = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
chunks_recursivecharactertext = splitter_recursivecharactertext.split_documents(docs)
from langchain.text_splitter import TokenTextSplitter
splitter_tokentext = TokenTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks_tokentext = splitter_tokentext.split_documents(docs)
from langchain_text_splitters import NLTKTextSplitter
splitter_nltktext = NLTKTextSplitter(chunk_size=1000)
chunks_nltktext = splitter_nltktext.split_documents(docs)
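Note that NLTKTextSplitter relies on NLTK's sentence tokenizer data; if that data has never been downloaded on the machine, the splitter may fail with a lookup error. A one-off download fixes this (a minimal sketch):
import nltk

# One-off download of the sentence tokenizer data used by NLTKTextSplitter
nltk.download("punkt")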
Instantiate a Vector Database and Generate Embedding
A vector database is a database that allows embeddings to be stored and queried efficiently. Embeddings are numeric representations of text data. This conversion from text to numbers is used to represent words, sentences, or even entire documents in a compact and meaningful way: it captures the essence of a word’s meaning, context, and relationships with other words.
Vector databases extend the capabilities of traditional relational databases to embeddings. However, the key distinguishing feature of a vector database is that query results aren’t an exact match to the query. Instead, using a specified similarity metric, the vector database returns data that are similar to the query.
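To make this concrete, here is a minimal sketch (reusing the all-MiniLM embedding model introduced below, with illustrative sentences) showing that semantically related texts should end up with closer vectors than unrelated ones:
import numpy as np
from langchain_community.embeddings import GPT4AllEmbeddings

# Embed three short texts and compare them with cosine similarity
emb = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf")
v_registration = np.array(emb.embed_query("refugee registration and documentation"))
v_identity = np.array(emb.embed_query("identity papers for displaced people"))
v_water = np.array(emb.embed_query("water and sanitation infrastructure"))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v_registration, v_identity))  # related topics, expected higher similarity
print(cosine(v_registration, v_water))     # unrelated topics, expected lower similarity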
There are here again numerous options in terms of Open Source vector DB that can be used - for instance: ChromaDB, Qdrant, Milvus or FAISS. Here we will just use Chroma.
from langchain_community.vectorstores import Chroma
import chromadb
chroma_client = chromadb.PersistentClient(path="persist/")
## A collection is created with the following
#chroma_collection = chroma_client.create_collection('collection')
To generate embeddings, we need a dedicated model, and there’s no single “best” option. Words with similar contexts tend to have closer vector representations. Some static word embedding models are good at capturing basic semantic relationships and are computationally efficient and fast, but might not capture complex semantics or context-dependent meanings. Contextual embedding models have been developed to capture word meaning based on context, considering surrounding words in a sentence and handling ambiguity, but this can lead to computationally expensive training and usage and to larger models.
Here we start by testing the small 44MB MiniLM embedding model:
from langchain_community.embeddings import GPT4AllEmbeddings
embeddings_bert = GPT4AllEmbeddings(
    model_name="all-MiniLM-L6-v2.gguf2.f16.gguf"
)
Now we can store the embeddings and associated metadata in the Chroma vector database using a specific collection name. Below we create distinct stores for each chunking option.
vectorstore_text_bert = Chroma.from_documents(
    documents=chunks_text,
    embedding=embeddings_bert,
    collection_name="text_bert",
    persist_directory="persist")
vectorstore_recursivecharactertext_bert = Chroma.from_documents(
    documents=chunks_recursivecharactertext,
    embedding=embeddings_bert,
    collection_name="recursivecharactertext_bert",
    persist_directory="persist")
vectorstore_tokentext_bert = Chroma.from_documents(
    documents=chunks_tokentext,
    embedding=embeddings_bert,
    collection_name="tokentext_bert",
    persist_directory="persist")
vectorstore_nltktext_bert = Chroma.from_documents(
    documents=chunks_nltktext,
    embedding=embeddings_bert,
    collection_name="nltktext_bert",
    persist_directory="persist")
Retrieve embeddings from persistent storage
We can re-open a previous database using its folder path:
import chromadb
client = chromadb.PersistentClient(path="persist/")
Then we can get the names of the collections available within that database:
collections = client.list_collections()
print(collections)
and get a previously saved vector collection
from langchain_community.vectorstores import Chroma
vectorstore_text_bert = Chroma(collection_name="text_bert",
    persist_directory="persist/",
    embedding_function=embeddings_bert)
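To check that a reloaded collection really contains the embedded chunks, the vector store can be queried directly before building any chain (a quick sanity-check sketch with an illustrative query):
# Run a quick similarity query against the reloaded collection
hits = vectorstore_text_bert.similarity_search("data and information management", k=2)
for hit in hits:
    print(hit.metadata.get("page"), hit.page_content[:100])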
Content Generation
Set up a local LLM
If you do not have access to an LLM API, an alternative is to install a local one, and there are again plenty of foundation LLM options to select from. Foundation models are AI neural networks trained on massive amounts of raw data (usually with unsupervised learning) that can be adapted to a wide variety of tasks.
Open-source Large Language Models (LLM) have multiple advantages:
Cost & Energy Savings: generally more affordable in the long run as they don’t involve licensing fees once the infrastructure is set up, and/or can be used offline on a local computer. More insights on total cost of ownership can be gained here. Another element is that most open source models have comparatively far fewer parameters (3B to 70B) than the large GPT ones (over 150B), which directly impacts inference costs, i.e. the computing cost to generate an answer.
Data Protection: allow you to work within the data enclave of your own computer without any data being sent to a remote server.
Transparency and Flexibility: accessible to the public, allowing developers to inspect, modify, and distribute the code. This transparency fosters a community-driven development process, leading to rapid innovation and diverse applications.
Added Features and Community Contributions: can leverage multiple providers and internal teams for updates and support, which enables organizations to stay at the forefront of technology and exercise greater control over usage.
Customizability: allow for added features and benefit from community contributions. They are ideal for projects that require customization and those where budget constraints are a primary concern.
There are multiple options to do that. An easy one is to install OLLAMA, which offers a wide variety of open models from the “AI race” competitors, for instance: Llama3 from Meta, Gemma from Google, Phi3 from Microsoft, but also Qwen from the Chinese Alibaba, Falcon from the Emirati Technology Innovation Institute, or Mixtral from the French startup Mistral_AI. LangChain has a dedicated module to work with Ollama.
Below, we start with the Mixtral Sparse Mixture-of-Experts model, and specifically the quantized version 8x7b-instruct-v0.1-q4_K_M, an open-weight model designed to optimize the performance-to-cost ratio, i.e. small enough in size to run on a strong laptop but good in performance. This downloads a model file of around 26GB.
from langchain_community.chat_models import ChatOllama
ollama_mixtral = ChatOllama(
    model="mixtral:8x7b-instruct-v0.1-q4_K_M",
    temperature=0.2,
    request_timeout=500
)
The temperature sets the creativity of the response - the higher, the more creative - below we remain conservative! It is the equivalent of the conversation-style setting in Copilot: creative [1-0.7], balanced ]0.7-0.4], precise ]0.4-0]…
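As a quick illustration (a sketch only, with an arbitrary test question), the same model can be instantiated twice with different temperatures to compare how conservative or creative the answers are:
from langchain_community.chat_models import ChatOllama

# Same model and question, two temperature settings: the low-temperature
# instance should answer more conservatively than the high-temperature one.
conservative = ChatOllama(model="mixtral:8x7b-instruct-v0.1-q4_K_M", temperature=0.1)
creative = ChatOllama(model="mixtral:8x7b-instruct-v0.1-q4_K_M", temperature=0.9)

question = "In one sentence, what is the purpose of an evaluation brief?"
print(conservative.invoke(question).content)
print(creative.invoke(question).content)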
Summarisation Prompt
A prompt is a piece of text, or a set of instructions, used by the LLM to generate a response or perform a task. Writing a good summarization prompt involves a few key steps:
Be Specific: Clearly state what you want to summarize. For example, “Summarize this Operation Strategic Plan in 200 words using abstractive summarization” or “Provide a summary of this needs assessment report, highlighting its key takeaways”.
Define the Scope: Specify the length or depth of the summary you need. For instance, “Summarize this text into two paragraphs with simple language to make it easier to understand” or “Create a summary of this report by summarizing all chapters separately and then generating an overall summary of the report”.
Set the Context: If the summary is for a specific purpose or audience, mention it in the prompt. For example, “I need to write talking points based on this report. Help me summarize this text for better understanding so that I can use it as an introduction email” or “Summarize this for me like I’m 8 years old”.
Use Clear and Concise Language: Avoid unnecessary complexity or ambiguity. A good prompt should provide enough direction to start but leave room for creativity.
Here we will try to create a prompt that generates an “Evaluation Brief” from the larger evaluation report.
Mixtral comes with specific tags to use for the prompt:
<s>\[INST\] Instruction \[/INST\] Model answer</s>\[INST\] Follow-up instruction \[/INST\]
= """
RAG_prompt <s>
[INST]Act if you were a public program evaluation expert working for UNHCR.
Your audience target is composed of Senior Executives that are managing the operation or program that got evaluated.[/INST]
Your task is to generate an executive summary of the report you just ingested.
</s>
[INST]
The summary should follow the following defined structure:
- In the first part titled "What have we learned?", start with a description of the Forcibly Displaced population in the operation and include, as 5 bullet points, the main challenges in relation with the evaluation objectives that have been identified in the document.
For each challenge explain why it's a problem and give a practical example to illustrate the consequence of this problem.
- In a second part titled: "How did we get there?" try to review the common root causes for all the challenges that have been identified.
- In a third part titled "What is working well?", provide a summary of the main successes and achievements, i.e. things that have been identified as good practices and/or effective by the evaluators.
- In the fourth part: "Now What to do?", include and summarize the recommendations proposed by the evaluation. Classify the recommendations according to their relevant level:
1. "Operational Level": i.e recommendations that need to be implemented in the field as an adaptation or change of current practices. Please flag clearly, if this is the case, the recommendations related to practice that should be stopped or discontinued;
2. "Organizational level": i.e recommendations that require changes in staffing or capacity building. Please flag clearly, if this is the case, the recommendations related to practice that should be stopped or discontinued;
3. "Strategic Level": i.e recommendations that require a change in existing policy and rules.
- At the end, for the "Conclusion", craft a reflective conclusion in one sentence that highlights the broader significance of the discussed topic.
[/INST]
"""
Set up the Retriever
A retriever acts as an information gatekeeper in the RAG architecture. Its primary function is to search through a large corpus of data to find relevant pieces of information that can be used for text generation. You can think of it as a specialized librarian who knows exactly which ‘books’ to pull off the ‘shelves’ when you ask a question. In other words, the retriever first fetches relevant parts of the document pertaining to the user query, and then the Large Language Model (LLM) uses this information to generate a response.
The search_type argument within vectorstore.as_retriever in LangChain allows you to specify the retrieval strategy used to find relevant documents in your vector store. Different options are available:
“similarity” (default): This is the most common search type, used if you simply want the most relevant documents. It performs a standard nearest-neighbor search based on vector similarity: the retriever searches for documents in the vector store whose vector representations are closest to the query vector, and documents with higher similarity scores are considered more relevant and are returned first.
“mmr” (Maximum Marginal Relevance): This search type focuses on retrieving documents that are both relevant to the query and diverse from each other, which helps avoid redundancy in the results. MMR is particularly useful when you want a collection of documents that cover different aspects of a topic, rather than just multiple copies of the most similar document.
“similarity_score_threshold”: This search type retrieves documents based on a similarity score threshold and only returns documents whose similarity score is above the specified threshold. This lets you filter out documents with low relevance to the query and ensure a minimum level of relevance.
The retriever also takes a series of potential parameters. The search_kwargs={"k": 2, "score_threshold": 0.8} argument is a dictionary used to configure how documents are retrieved during the search process. This argument lets you control how many results you get (up to two in this case) and how good those results need to be (with a score of at least 0.8):
k (int): This parameter controls the number of documents to retrieve from the search. In this case, k: 2 specifies that the retriever should return up to two documents that match the search query.
score_threshold (float): This parameter sets a minimum score threshold for retrieved documents. Documents with a score lower than 0.8 will be excluded from the results. This essentially acts as a quality filter, ensuring a certain level of relevance between the query and retrieved documents.
The scoring mechanism used by the retriever might depend on the specific retriever implementation. It’s likely based on how well the retrieved documents match the search query. The effectiveness of these parameters depends on your specific use case and the quality of the underlying retrieval system.
Even with “similarity”, the retrieved documents might have varying degrees of relevance. Consider using ranking techniques within LangChain to further refine the results based on additional criteria. The underlying vector store might have limitations on the supported search types. Always refer to the documentation of your specific vector store to confirm available options.
We can build multiple retrievers out of the same vectorstore:
ragRetriever_text_bert = vectorstore_text_bert.as_retriever()
ragRetriever_recursivecharactertext_bert = vectorstore_recursivecharactertext_bert.as_retriever()
ragRetriever_similarity_tokentext_bert = vectorstore_tokentext_bert.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 3,
        "score_threshold": 0.4,
    },
)
ragRetriever_similarity_nltktext_bert = vectorstore_nltktext_bert.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 5,
        "score_threshold": 0.8,
    },
)
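Before wiring a retriever into a chain, it can also be queried directly to inspect which chunks it returns (a quick sanity-check sketch with an illustrative query):
# Query one retriever directly and preview the retrieved chunks
sample_query = "What are the main challenges in information management?"
retrieved_docs = ragRetriever_text_bert.invoke(sample_query)
for doc in retrieved_docs:
    print(doc.metadata.get("page"), doc.page_content[:120])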
Build the Chain
A retrieval question-answer chain acts as a pipe: it takes an incoming question, looks up relevant documents using a retriever, then passes those documents along with the original question into an LLM and returns the answer to the original question.
from langchain_core.prompts import ChatPromptTemplate
prompt_retrieval = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}"""
)
and last the retrieval chain!
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
combine_docs_chain_mixtral = create_stuff_documents_chain(
    ollama_mixtral,
    prompt_retrieval
)
qa_chain_recursivecharactertext_bert = create_retrieval_chain(
    ragRetriever_recursivecharactertext_bert,
    combine_docs_chain_mixtral
)
Note that from this stage, the following steps may take time to run - this will be highly dependent on the power of your computer - obviously the availability of GPUs (Graphics Processing Units) will significantly increase the speed! FYI, this notebook was built on a Thinkpad P53 with a Quadro T1000 GPU.
response_recursivecharactertext_bert = qa_chain_recursivecharactertext_bert.invoke({"input": RAG_prompt})
Save in a word document
To complete the process, let’s save the result directly within a word document!
This can be automated with a custom create_word_doc function that converts the text output from the LLM (which uses standard Markdown) into the equivalent Word formatting:
import docx
from markdown import markdown
import re
def create_word_doc(text, file_name):
    # Create a document
    doc = docx.Document()
    # Add a heading of level 0 (largest heading)
    doc.add_heading('Evaluation Brief', 0)

    # Split the text into lines
    lines = text.split('\n')
    # Create a set to store bolded and italic strings
    bolded_and_italic = set()
    for line in lines:
        # Check if the line is a heading
        if line.startswith('#'):
            level = line.count('#')
            doc.add_heading(line[level:].strip(), level)
        else:
            # Check if the line contains markdown syntax for bold or italic
            if '**' in line or '*' in line:
                # Split the line into parts
                parts = re.split(r'(\*{1,2}(.*?)\*{1,2})', line)
                # Add another paragraph
                p = doc.add_paragraph()
                for i, part in enumerate(parts):
                    # Remove the markdown syntax
                    content = part.strip('*')
                    # Check if the content has been added before
                    if content not in bolded_and_italic:
                        # Add a run with the part and format it
                        run = p.add_run(content)
                        run.font.name = 'Arial'
                        run.font.size = docx.shared.Pt(12)
                        # If the part was surrounded by **, make it bold
                        if '**' in part:
                            run.bold = True
                        # If the part was surrounded by *, make it italic
                        elif '*' in part:
                            run.italic = True
                        # Add the content to the set
                        bolded_and_italic.add(content)
            else:
                # Add another paragraph
                p = doc.add_paragraph()
                # Add a run with the line and format it
                run = p.add_run(line)
                run.font.name = 'Arial'
                run.font.size = docx.shared.Pt(12)

    ## Add a disclaimer... ----------------
    # Add a page break to start a new page
    doc.add_page_break()
    # Add a heading of level 2
    doc.add_heading('DISCLAIMER:', 2)
    doc_para = doc.add_paragraph()
    doc_para.add_run(
        'This document contains material generated by artificial intelligence technology. '
        'While efforts have been made to ensure accuracy, please be aware that AI-generated '
        'content may not always fully represent the intent or expertise of human-authored '
        'material and may contain errors or inaccuracies. An AI model might generate content '
        'that sounds plausible but that is either factually incorrect or unrelated to the '
        'given context. These unexpected outcomes, also called AI hallucinations, can stem '
        'from biases, under-performing information retrieval, lack of real-world understanding, '
        'or limitations in training data.').italic = True

    # Save the document ---------------
    doc.save(file_name)
Now we can simply use this function to get a Word output from the LLM answer!
create_word_doc(
    response_recursivecharactertext_bert["answer"],
    "generated/Evaluation_Brief_response_recursivecharactertext_bert.docx"
)
Continuous Evaluation Process
We were able to get a first brief… still, how can we assess how good this report is? We will first test different settings to create the brief. Then we create a dataset reflecting those settings and evaluate it!
Building Alternative Briefs
Let’s try to generate more reports using different settings.
LangChain often integrates with libraries like Hugging Face Transformers for embeddings. It is best to experiment with different embeddings to see what works best for a specific use case and dataset. There are plenty of options, also depending on the languages.
Let’s try first with a second embedding model… Hugging Face has many options… and there is even a leaderboard to see how they compete… We will select here the embedding model bge-large-en-v1.5, an over 200MB model from the Beijing Academy of Artificial Intelligence. It remains relatively small in size but is efficient and does not consume too much memory.
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
embeddings_bge = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)
We build the vector store using the new embedding…
# Disable TOKENIZERS warning
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

vectorstore_recursivecharactertext_bge = Chroma.from_documents(
    chunks_recursivecharactertext,
    embeddings_bge,
    collection_name="recursivecharactertext_bge",
    persist_directory="persist"
)
We can set a different retriever now using Maximum Marginal Relevance…
ragRetriever_mmr_recursivecharactertext_bge = vectorstore_recursivecharactertext_bge.as_retriever(
    search_type="mmr"
)
Advanced retrieval strategies can also be used to improve the process. For instance, we can test:
- Using ParentDocumentRetriever, the document is split into small child chunks that are embedded and retrieved through dense vector retrieval; the retrieved child chunks are then mapped back to the larger parent documents they came from (kept in an in-memory store), merged when they share the same parent, and those parent documents are used to augment generation.
from langchain.retrievers import ParentDocumentRetriever

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1536)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

from langchain.storage import InMemoryStore

store = InMemoryStore()

ragRetriever_parent_recursivecharactertext_bge = ParentDocumentRetriever(
    vectorstore=vectorstore_recursivecharactertext_bge,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
ragRetriever_parent_recursivecharactertext_bge.add_documents(docs)
- Ensemble retrieval is another technique where a retriever pair is created with a sparse retriever (like Okapi BM25) on one side and a dense retriever (like the embedding similarity we saw before) on the other. The retrieved results are then “fused”, based on their weighting, into a single ranked list using the Reciprocal Rank Fusion algorithm (sketched below), and the resulting documents are used to augment the generation. This same approach has also been experimented with by the World Bank.
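As an illustration of the fusion step only (a toy sketch, not the LangChain EnsembleRetriever implementation), weighted Reciprocal Rank Fusion scores each document by summing weight / (k + rank) over the rankings that contain it:
# Toy weighted Reciprocal Rank Fusion: documents ranked well by several
# retrievers accumulate a higher fused score (k=60 is a common default).
def reciprocal_rank_fusion(ranked_lists, weights, k=60):
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" appears in both rankings, so it comes out on top of the fused list
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4"]], weights=[0.42, 0.58]))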
from langchain.retrievers import BM25Retriever

retriever_bm25 = BM25Retriever.from_documents(chunks_recursivecharactertext)
retriever_bm25.k = 3

retriever_similarity = vectorstore_recursivecharactertext_bge.as_retriever(search_kwargs={"k": 3})

from langchain.retrievers import EnsembleRetriever

ragRetriever_ensemble_recursivecharactertext_bge = EnsembleRetriever(
    retrievers=[retriever_bm25, retriever_similarity],
    # Relative weighting of each retriever needs to sum to 1!
    weights=[0.42, 0.58]
)
We can also use a different LLM: command-r from the start-up COHERE, and specifically the quantized version command-r:35b-v0.1-q4_K_M, an open-weight model designed to optimize RAG, and set up the corresponding chain.
from langchain_community.chat_models import ChatOllama

ollama_commandR = ChatOllama(
    model="command-r:35b-v0.1-q4_K_M",
    temperature=0.2,
    request_timeout=500
)
from langchain.chains.combine_documents import create_stuff_documents_chain

combine_docs_chain_commandR = create_stuff_documents_chain(
    ollama_commandR,
    prompt_retrieval
)
Finally we generate our alternative summaries!
from langchain.chains import create_retrieval_chain

qa_chain_mmr_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_mmr_recursivecharactertext_bge,
    combine_docs_chain_commandR
)

response_mmr_recursivecharactertext_bge = qa_chain_mmr_recursivecharactertext_bge.invoke({"input": RAG_prompt})

create_word_doc(
    response_mmr_recursivecharactertext_bge["answer"],
    "generated/Evaluation_Brief_response_mmr_recursivecharactertext_bge.docx"
)
from langchain.chains import create_retrieval_chain

qa_chain_parent_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_parent_recursivecharactertext_bge,
    combine_docs_chain_commandR
)

response_parent_recursivecharactertext_bge = qa_chain_parent_recursivecharactertext_bge.invoke({"input": RAG_prompt})

create_word_doc(
    response_parent_recursivecharactertext_bge["answer"],
    "generated/Evaluation_Brief_response_parent_recursivecharactertext_bge.docx"
)
from langchain.chains import create_retrieval_chain

qa_chain_ensemble_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_ensemble_recursivecharactertext_bge,
    combine_docs_chain_commandR
)

response_ensemble_recursivecharactertext_bge = qa_chain_ensemble_recursivecharactertext_bge.invoke({"input": RAG_prompt})

create_word_doc(
    response_ensemble_recursivecharactertext_bge["answer"],
    "generated/Evaluation_Brief_response_ensemble_recursivecharactertext_bge.docx"
)
Et voila! We have now 4 alternative briefs:
#1 - Similarity retrieval with Bert embedding using Mixtral LLM
#2 - Maximum Marginal Relevance retrieval with BGE embedding using commandR LLM,
#3 - Parent document retrieval with BGE embedding using commandR LLM,
#4 - Ensemble document retrieval with BGE embedding using commandR LLM,
Each summary is slightly different… which is OK, as it would also be the case if a human were doing it. Still, it is likely that one report is better than the others.
Now let’s evaluate the quality of those summarization pipelines to objectively find out!
Generating Evaluation Dataset
To do the evaluation, we first need to build a large-enough evaluation dataset so that the evaluation is based on multiple outputs. We need to build the following data (an illustrative record is sketched after the list):
question: list[str] - These are the questions the RAG pipeline will be evaluated on.
contexts: list[list[str]] - The contexts which were retrieved and passed into the LLM corresponding to each question. This is a list[list] since each question can retrieve multiple text chunks.
answer: list[str] - The answer that got generated from the RAG pipeline.
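A minimal sketch of what one record is expected to look like (all values below are illustrative placeholders, not actual outputs):
# Illustrative shape of a single record in the evaluation dataset
example_record = {
    "question": "<expert persona> List, as bullet points, all findings... <shared instructions>",
    "contexts": [
        "First chunk of the report retrieved for this question...",
        "Second retrieved chunk...",
    ],
    "answer": "The answer generated by the RAG pipeline for this question.",
}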
One approach is to extract from the report both:
all findings and evidence, i.e. what can be learnt from the specific context of this evaluation study, what the root causes of the findings are in this context, and what the main risks and difficulties are in this context;
all recommendations, flagging clearly whether they relate to practices that should be discontinued or, on the contrary, to practices that should be scaled up, and whether they come with resource allocation requirements.
To provide more perspectives for the extraction, the report can be reviewed by 26 different types of experts who may look at the UNHCR programme from different angles:
4 experts for Strategic Impact: i.e., findings or recommendations that require a change in existing policies and regulations in relation with the specific impact area:
- Attaining favorable protection environments
- Realizing rights in safe environments
- Empowering communities and achieving gender equality
- Securing durable solutions
17 experts for Operational Outcome: i.e., findings or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities in relation with the specific outcome area:
- Access to territory registration and documentation
- Status determination
- Protection policy and law
- Gender-based violence
- Child protection
- Safety and access to justice
- Community engagement and women’s empowerment
- Well-being and basic needs
- Sustainable housing and settlements
- Healthy lives
- Education
- Clean water sanitation and hygiene
- Self-reliance, Economic inclusion, and livelihoods
- Voluntary repatriation and sustainable reintegration
- Resettlement and complementary pathways
- Local integration and other local solutions
5 experts on Organizational Enabler: i.e., findings or recommendations that require changes in management practices, technical approach, business processes, staffing allocation or capacity building in relation with:
- Systems and processes
- Operational support and supply chain
- People and culture
- External engagement and resource mobilization
- Leadership and governance
First, let’s set up the prompt questions:
# Define the list of experts on impact - outcome - organisation
q_experts = [
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the Strategic Impact: ---Attaining favorable protection environments---: i.e., finding or recommendations that require a change in existing policy and regulations. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the Strategic Impact: ---Realizing rights in safe environments---: i.e., finding or recommendations that require a change in existing policy and regulations. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the Strategic Impact: ---Empowering communities and achieving gender equality--- : i.e., finding or recommendations that require a change in existing policy and regulations. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the Strategic Impact: ---Securing durable solutions--- : i.e., finding or recommendations that require a change in existing policy and regulations. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: ---Access to territory registration and documentation ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Status determination ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Protection policy and law---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Gender-based violence ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Child protection ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Safety and access to justice ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Community engagement and women's empowerment ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Well-being and basic needs ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Sustainable housing and settlements ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Healthy lives---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Education ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Clean water sanitation and hygiene ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Self-reliance, Economic inclusion, and livelihoods ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Voluntary repatriation and sustainable reintegration ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Resettlement and complementary pathways---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the specific Operational Outcome: --- Local integration and other local solutions ---, i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on Organizational Enablers related to Systems and processes, i.e. elements that require potential changes in either management practices, technical approach, business processes, staffing allocation or capacity building. [/INST]",
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on Organizational Enablers related to Operational support and supply chain, i.e. elements that require potential changes in either management practices, technical approach, business processes, staffing allocation or capacity building. [/INST]" ,
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on Organizational Enablers related to People and culture, i.e. elements that require potential changes in either management practices, technical approach, business processes, staffing allocation or capacity building. [/INST]" ,
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on Organizational Enablers related to External engagement and resource mobilization, i.e. elements that require potential changes in either management practices, technical approach, business processes, staffing allocation or capacity building. [/INST]" ,
"<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on Organizational Enablers related to Leadership and governance, i.e. elements that require potential changes in either management practices, technical approach, business processes, staffing allocation or capacity building. [/INST]"
]
# Predefined knowledge extraction questions
q_questions = [
" List, as bullet points, all findings and evidences in relation to your specific area of expertise and focus. ",
" Explain, in relation to your specific area of expertise and focus, what are the root causes for the situation. " ,
" Explain, in relation to your specific area of expertise and focus, what are the main risks and difficulties here described. ",
" Explain, in relation to your specific area of expertise and focus, what what can be learnt. ",
" List, as bullet points, all recommendations made in relation to your specific area of expertise and focus. "#,
# "Indicate if mentionnend what resource will be required to implement the recommendations made in relation to your specific area of expertise and focus. ",
# "List, as bullet points, all recommendations made in relation to your specific area of expertise and focus that relates to topics or activities recommended to be discontinued. ",
# "List, as bullet points, all recommendations made in relation to your specific area of expertise and focus that relates to topics or activities recommended to be scaled up. "
# Add more questions here...
]
## Additional instructions!
= """
q_instr </s>
[INST]
Keep your answer grounded in the facts of the contexts.
If the contexts do not contain the facts to answer the QUESTION, return {NONE}
Be concise in the response and when relevant include precise citations from the contexts.
[/INST]
"""
Then, we can reset the two RAG pipelines with their respective LLMs:
from langchain_community.chat_models import ChatOllama

ollama_mixtral = ChatOllama(
    model="mixtral:8x7b-instruct-v0.1-q4_K_M",
    temperature=0.2,
    request_timeout=500
)
ollama_commandR = ChatOllama(
    model="command-r:35b-v0.1-q4_K_M",
    temperature=0.2,
    request_timeout=500
)
Then the two embedding models:
from langchain_community.embeddings import GPT4AllEmbeddings

embeddings_bert = GPT4AllEmbeddings(
    model_name="all-MiniLM-L6-v2.gguf2.f16.gguf"
)

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings_bge = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)
Now we reload the two previous vector stores:
from langchain_community.vectorstores import Chroma
import chromadb

client = chromadb.PersistentClient(path="persist/")

vectorstore_recursivecharactertext_bert = Chroma(
    collection_name="recursivecharactertext_bert",
    persist_directory="persist/",
    embedding_function=embeddings_bert
)
vectorstore_recursivecharactertext_bge = Chroma(
    collection_name="recursivecharactertext_bge",
    persist_directory="persist/",
    embedding_function=embeddings_bge
)
and related retrievers
ragRetriever_recursivecharactertext_bert = vectorstore_recursivecharactertext_bert.as_retriever()

ragRetriever_mmr_recursivecharactertext_bge = vectorstore_recursivecharactertext_bge.as_retriever(
    search_type="mmr"
)

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1536)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

store = InMemoryStore()

ragRetriever_parent_recursivecharactertext_bge = ParentDocumentRetriever(
    vectorstore=vectorstore_recursivecharactertext_bge,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
ragRetriever_parent_recursivecharactertext_bge.add_documents(docs)

from langchain.retrievers import BM25Retriever

retriever_bm25 = BM25Retriever.from_documents(chunks_recursivecharactertext)
retriever_bm25.k = 3

retriever_similarity = vectorstore_recursivecharactertext_bge.as_retriever(search_kwargs={"k": 3})

from langchain.retrievers import EnsembleRetriever

ragRetriever_ensemble_recursivecharactertext_bge = EnsembleRetriever(
    retrievers=[retriever_bm25, retriever_similarity],
    # Relative weighting of each retriever needs to sum to 1!
    weights=[0.42, 0.58]
)
The main prompt template
from langchain_core.prompts import ChatPromptTemplate

prompt_retrieval = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}"""
)
and last the retrieval chain!
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

combine_docs_chain_mixtral = create_stuff_documents_chain(
    ollama_mixtral,
    prompt_retrieval
)
qa_chain_mixtral_recursivecharactertext_bert = create_retrieval_chain(
    ragRetriever_recursivecharactertext_bert,
    combine_docs_chain_mixtral
)

combine_docs_chain_command = create_stuff_documents_chain(
    ollama_commandR,
    prompt_retrieval
)
qa_chain_command_mmr_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_mmr_recursivecharactertext_bge,
    combine_docs_chain_command
)
qa_chain_command_parent_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_parent_recursivecharactertext_bge,
    combine_docs_chain_command
)
qa_chain_command_ensemble_recursivecharactertext_bge = create_retrieval_chain(
    ragRetriever_ensemble_recursivecharactertext_bge,
    combine_docs_chain_command
)
and now build the evaluation datasets by iterating over expert profiles and questions!
The first dataset
# Create dataset (empty list for now)
dataset_mixtral_recursivecharactertext_bert = []

# Iterate through each expert question and its corresponding context list
for expert in q_experts:
    for question in q_questions:
        # Generate response
        response = qa_chain_mixtral_recursivecharactertext_bert.invoke({"input": expert + question + q_instr})
        # Add context-question-response to dataset
        dataset_mixtral_recursivecharactertext_bert.append({
            "question": expert + question + q_instr,
            "contexts": [context.page_content for context in response["context"]],
            "answer": response["answer"]
        })

# Save this to the disk!
import pandas as pd
dataset_mixtral_recursivecharactertext_bert_d = pd.DataFrame(dataset_mixtral_recursivecharactertext_bert)
dataset_mixtral_recursivecharactertext_bert_d.to_excel("dataset/dataset_mixtral_recursivecharactertext_bert.xlsx")
Then producing the second dataset
# Create dataset (empty list for now)
dataset_command_mmr_recursivecharactertext_bge = []

# Iterate through each expert question and its corresponding context list
for expert in q_experts:
    for question in q_questions:
        # Generate response with Ollama
        response = qa_chain_command_mmr_recursivecharactertext_bge.invoke({"input": expert + question + q_instr})
        # Add context-question-response to dataset
        dataset_command_mmr_recursivecharactertext_bge.append({
            "question": expert + question + q_instr,
            "contexts": [context.page_content for context in response["context"]],
            "answer": response["answer"]
        })

# Save this to the disk!
import pandas as pd
dataset_command_mmr_recursivecharactertext_bge_d = pd.DataFrame(dataset_command_mmr_recursivecharactertext_bge)
dataset_command_mmr_recursivecharactertext_bge_d.to_excel("dataset/dataset_command_mmr_recursivecharactertext_bge.xlsx")
# Create dataset (empty list for now)
dataset_command_parent_recursivecharactertext_bge = []

# Iterate through each expert question and its corresponding context list
for expert in q_experts:
    for question in q_questions:
        # Generate response with Ollama
        response = qa_chain_command_parent_recursivecharactertext_bge.invoke({"input": expert + question + q_instr})
        # Add context-question-response to dataset
        dataset_command_parent_recursivecharactertext_bge.append({
            "question": expert + question + q_instr,
            "contexts": [context.page_content for context in response["context"]],
            "answer": response["answer"]
        })

# Save this to the disk!
import pandas as pd
dataset_command_parent_recursivecharactertext_bge_d = pd.DataFrame(dataset_command_parent_recursivecharactertext_bge)
dataset_command_parent_recursivecharactertext_bge_d.to_excel("dataset/dataset_command_parent_recursivecharactertext_bge.xlsx")
# Create dataset (empty list for now)
dataset_command_ensemble_recursivecharactertext_bge = []

# Iterate through each expert question and its corresponding context list
for expert in q_experts:
    for question in q_questions:
        # Generate response with Ollama
        response = qa_chain_command_ensemble_recursivecharactertext_bge.invoke({"input": expert + question + q_instr})
        # Add context-question-response to dataset
        dataset_command_ensemble_recursivecharactertext_bge.append({
            "question": expert + question + q_instr,
            "contexts": [context.page_content for context in response["context"]],
            "answer": response["answer"]
        })

# Save this to the disk!
import pandas as pd
dataset_command_ensemble_recursivecharactertext_bge_d = pd.DataFrame(dataset_command_ensemble_recursivecharactertext_bge)
dataset_command_ensemble_recursivecharactertext_bge_d.to_excel("dataset/dataset_command_ensemble_recursivecharactertext_bge.xlsx")
Computing Assessment Metrics
Developing a proof-of-concept RAG application might seem straightforward, but ensuring its performance meets production standards is a challenging task. Similar to data science projects, it’s essential to assess the RAG pipeline’s performance using a validation dataset and appropriate evaluation metrics.
Several criteria can be used to evaluate a RAG pipeline. Among them, the diagram below provides a simple perspective:
Satisfactory evaluations on context relevance (good chunking and embedding), groundedness (good retriever) and answer relevance (good prompt and LLM) will provide confidence that hallucination risks are minimized.
There are different frameworks available for RAG evaluation. Here we test RAGAS (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of RAG pipelines. “Reference-free” evaluation means that instead of having to rely on human-annotated ground-truth labels in the evaluation dataset, RAGAS leverages LLMs under the hood to conduct the evaluations. It includes the metrics below:
Context Precision (also called Grounding): Measures whether items present in the contexts are ranked higher or not.
Answer relevancy: Measures how directly the answer addresses the question.
Faithfulness (also called groundedness): Measures whether the claims in the generated answer are grounded in the retrieved contexts.
RAGAS expects data to be provided in the datasets format, a format designed to let the community easily add and share new datasets. We need to convert our current list into a dictionary and then export it to the correct format.
from datasets import Dataset

response_evaluation_dataset_mixtral_recursivecharactertext_bert = Dataset.from_dict({
    "question": dataset_mixtral_recursivecharactertext_bert_d["question"].values.tolist(),
    "answer": dataset_mixtral_recursivecharactertext_bert_d["answer"].values.tolist(),
    "contexts": dataset_mixtral_recursivecharactertext_bert_d["contexts"].values.tolist()
})

response_evaluation_dataset_command_mmr_recursivecharactertext_bge = Dataset.from_dict({
    "question": dataset_command_mmr_recursivecharactertext_bge_d["question"].values.tolist(),
    "answer": dataset_command_mmr_recursivecharactertext_bge_d["answer"].values.tolist(),
    "contexts": dataset_command_mmr_recursivecharactertext_bge_d["contexts"].values.tolist()
})
RAGAS requires another LLM to do the assessment. We can use a dedicated model as a critic of the first one. Let us use the latest LLM from Meta, Llama 3:
from langchain_community.chat_models import ChatOllama
ollama_llama3 = ChatOllama(
    model="llama3:70b-instruct",
    temperature=0.2,
    request_timeout=500
)
Now we can compile the different metrics!
#from ragas.metrics.critique import harmfulness
from ragas import evaluate
from ragas.metrics import (
answer_relevancy,
faithfulness,
answer_similarity,
answer_correctness,
context_recall,
context_precision,
context_relevancy
)
## The following ragas metrics require 'ground_truth' information
# answer_similarity,
# answer_correctness,
# context_recall,
# context_precision,
# context_relevancy
raga_result_mixtral_recursivecharactertext_bert = evaluate(
    dataset=response_evaluation_dataset_mixtral_recursivecharactertext_bert,
    llm=ollama_llama3,
    embeddings=embeddings_bert,
    metrics=[
        answer_relevancy,
        faithfulness],
    raise_exceptions=False
)

data_mixtral_recursivecharactertext_bert = {
    'faithfulness': raga_result_mixtral_recursivecharactertext_bert['faithfulness'],
    'answer_relevancy': raga_result_mixtral_recursivecharactertext_bert['answer_relevancy']
}
raga_result_command_mmr_recursivecharactertext_bge = evaluate(
    dataset=response_evaluation_dataset_command_mmr_recursivecharactertext_bge,
    llm=ollama_llama3,
    embeddings=embeddings_bert,
    metrics=[
        answer_relevancy,
        faithfulness],
    raise_exceptions=False
)

data_command_mmr_recursivecharactertext_bge = {
    'faithfulness': raga_result_command_mmr_recursivecharactertext_bge['faithfulness'],
    'answer_relevancy': raga_result_command_mmr_recursivecharactertext_bge['answer_relevancy']
}
We can summarise the results with a radar chart:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
    r=list(data_mixtral_recursivecharactertext_bert.values()),
    theta=list(data_mixtral_recursivecharactertext_bert.keys()),
    fill='toself',
    name='RAG_mixtral_recursivecharactertext_bert'
))
fig.add_trace(go.Scatterpolar(
    r=list(data_command_mmr_recursivecharactertext_bge.values()),
    theta=list(data_command_mmr_recursivecharactertext_bge.keys()),
    fill='toself',
    name='RAG_command_mmr_recursivecharactertext_bge'
))
fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )),
    showlegend=True,
    title='Retrieval Augmented Generation - Evaluation',
    width=800
)
fig.show()
Production Deployment Strategy
Buy or Build?
As presented in the Gartner AI readiness framework, there are graduated deployment stages to consider: consume, embed, extend and build. For each of them, the strategic decision is to define the share of investment between outsourced and internalized capacity.
Providing organisation-wide access to Copilot represents only the very initial consume stage. Creating a dedicated app like “Chat with your Evaluation Reports” is the second one: the embed stage. However, using off-the-shelf solutions in a “consume or embed” mode comes with inherent limitations on the ability:
- to incorporate organization-specific knowledge in a systematic and reliable way (i.e. with an evaluated RAG pipeline!);
- to set up processes for continuous update of the knowledge base used by the model;
- to prevent what are called “hallucinations”, in other words the risk of generating incorrect or misleading information that is not context-aware;
- to develop internal technical capacity around the new way of working that AI offers.
Above, we presented a recipe to extend an existing foundation model, using the first step: data retrieval and prompt engineering. We highlighted the importance of the configuration in ensuring the reliability of the system, and therefore the relevance of managing such a process directly. Building common knowledge on “data retrieval scripts” could be a first achievable target. This would imply tuning a RAG extraction pipeline for each evaluation report and building an evaluation dataset for each of them.
The next stage is to enable task-specific & alignment fine-tuning. It comes with the additional requirement of building AI-ready and validated data. The assumption is that, if you train smaller models in certain areas really well, they can perform almost at the level of a human expert, for instance for causal knowledge extraction from impact evaluations or regulation reviews. Because fine-tuned models are more efficient, they also save money, especially for tasks like RAG workflows and automation in private clouds. Fine-tuning brings the ability to skip providing in-context learning examples, which results in lower token usage on each prompt and lower-latency requests.
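To make the token-usage point concrete, here is a minimal sketch using tiktoken (installed at the top of this document); the two prompts are invented placeholders, one carrying in-context examples and one relying on a fine-tuned model's built-in behaviour.
import tiktoken

# Illustrative prompts (hypothetical): a few-shot prompt with in-context examples
# versus the bare instruction a fine-tuned model could handle on its own.
few_shot_prompt = (
    "Extract the key recommendations from the evaluation excerpt.\n"
    "Example 1: <excerpt>...</excerpt> -> <recommendations>...</recommendations>\n"
    "Example 2: <excerpt>...</excerpt> -> <recommendations>...</recommendations>\n"
    "Now process: <excerpt>The evaluation found that...</excerpt>"
)
fine_tuned_prompt = "Extract the key recommendations: The evaluation found that..."

# cl100k_base is one common tokenizer; exact counts vary by model, the comparison holds
enc = tiktoken.get_encoding("cl100k_base")
print("Few-shot prompt tokens:  ", len(enc.encode(few_shot_prompt)))
print("Fine-tuned prompt tokens:", len(enc.encode(fine_tuned_prompt)))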
The future of Language Model development within organisations is likely to revolve around the creation of “specialized” fine-tuned smaller models. And first comes the training cost component… To have some cost estimation in mind, training a custom large model can easily require 2 months on a big pool of specific hardware, such as A100 GPUs. With an estimated cost of $3 per GPU hour, the total training cost can go above $3M. By comparison, fine-tuning an existing medium-size foundation model can be done for instance with 16 GPUs × $3.00/hour × 24 hours = $1,152…
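The comparison can be made explicit with a back-of-the-envelope calculation; the hourly rate and the fine-tuning figures come from the text above, while the GPU-pool size for full training is an illustrative assumption chosen to land in the ~$3M order of magnitude.
# Back-of-the-envelope GPU cost comparison
hourly_rate = 3.00                                  # USD per A100 GPU hour (from the text)
fine_tune_cost = 16 * hourly_rate * 24              # 16 GPUs for 24 hours
full_training_cost = 700 * hourly_rate * 24 * 60    # ~700 GPUs for ~2 months (illustrative assumption)
print(f"Fine-tuning an existing model: ${fine_tune_cost:,.0f}")      # $1,152
print(f"Training a custom large model: ${full_training_cost:,.0f}")  # ~$3,024,000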
“AI-Ready” data: Human Review for ground_truth
Human review is key to maintaining quality, minimizing the risk of hallucination and enforcing alignment. Such a ground_truth attribute on the evaluation dataset makes it possible to test whether the context is well recalled by the RAG pipeline.
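As a hedged sketch of how such a ground_truth attribute would plug back into the assessment step, the reference-based RAGAS metrics left commented out above could then be activated; human_reviewed_answers is a placeholder for the validated answers collected during review, and the other names follow the ones used earlier.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_correctness

# Same structure as before, plus the human-reviewed ground truth
# (newer ragas versions expect a 'ground_truth' string column; older ones 'ground_truths' lists)
reviewed_dataset = Dataset.from_dict({
    "question": dataset_mixtral_recursivecharactertext_bert_d["question"].values.tolist(),
    "answer": dataset_mixtral_recursivecharactertext_bert_d["answer"].values.tolist(),
    "contexts": dataset_mixtral_recursivecharactertext_bert_d["contexts"].values.tolist(),
    "ground_truth": human_reviewed_answers  # placeholder: one validated answer per question
})

raga_result_with_ground_truth = evaluate(
    dataset=reviewed_dataset,
    llm=ollama_llama3,
    embeddings=embeddings_bert,
    metrics=[context_precision, context_recall, answer_correctness],
    raise_exceptions=False
)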
Human review can be performed both before and after fine-tuning. Human labelling is performed to verify whether a response is relevant, generic or out-of-context. A platform like Label Studio can be used to implement a human review that ranks the quality of the knowledge extraction.
To do so, let's first prepare the data.
import pandas as pd
import json

df1 = pd.read_excel("dataset/dataset_mixtral_recursivecharactertext_bert.xlsx")
df1['id'] = 'mixtral_recursivecharactertext_bert'
df1['title'] = 'LLM: mixtral / Retriever: similarity / Chunking: recursivecharactertext / Embedding: bert'

df2 = pd.read_excel("dataset/dataset_command_mmr_recursivecharactertext_bge.xlsx")
df2['id'] = 'command_mmr_recursivecharactertext_bge'
df2['title'] = 'LLM: commandR / Retriever: mmr / Chunking: recursivecharactertext / Embedding: bge'
#df3 = pd.read_excel("dataset/dataset_command_parent_recursivecharactertext_bge.xlsx")
#df3['id'] = 'command_parent_recursivecharactertext_bge'
#df3['title'] = 'LLM: commandR / Retriever: parent / Chunking: recursivecharactertext / Embedding: bge'
#df4 = pd.read_excel("dataset/dataset_command_ensemble_recursivecharactertext_bge.xlsx")
#df4['id'] = 'command_ensemble_recursivecharactertext_bge'
#df4['title'] = 'LLM: commandR / Retriever: ensemble / Chunking: recursivecharactertext / Embedding: bge'
## Concatenate
df = pd.concat([df1, df2])
df = df.drop('contexts', axis=1)

## Reformat the question for an easier review!
df['question'] = df['question'].str.replace('<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on ', 'As an expert on ')

df['question'] = df['question'].str.replace('<s> [INST] Instructions: Act as a public program evaluation expert working for UNHCR. Your specific area of expertise and focus is strictly on the ', 'As an expert on ')

df['question'] = df['question'].str.replace('specific Operational Outcome', 'Operational Outcome')

df['question'] = df['question'].str.replace("""
</s>
[INST]
Keep your answer grounded in the facts of the contexts.
If the contexts do not contain the facts to answer the QUESTION, return {NONE}
Be concise in the response and when relevant include precise citations from the contexts.
[/INST]
""", '')

df['question'] = df['question'].str.replace(' [/INST] ', ' -- ')
df['question'] = df['question'].str.replace('i.e., finding or recommendations that require a change in existing policy and regulations. ', ' ')
df['question'] = df['question'].str.replace(' i.e. finding or recommendations that require a change that needs to be implemented in the field as an adaptation or change of current activities. ', '')
df['question'] = df['question'].str.replace(' i.e. elements that require changes in management practices, technical approach, business processes, staffing allocation or capacity building. ', '')

# Rename 'question' column to 'prompt'
df = df.rename(columns={'question': 'prompt'})
df = df.rename(columns={'answer': 'body'})

# Group DataFrame by 'prompt'
grouped = df.groupby('prompt')

# List to hold all JSON outputs
json_outputs = []

for prompt, group in grouped:
    # Convert group DataFrame to list of dictionaries
    items = group[['id', 'title', 'body']].to_dict('records')

    # Create final dictionary for this group
    final_dict = {
        "prompt": prompt,
        "items": items
    }

    # Add final dictionary to list
    json_outputs.append(final_dict)

# Save JSON outputs to file
with open('dataset/dataset.json', 'w') as f:
    json.dump(json_outputs, f, indent=2)
We can now set up a specific Label Studio project, “UNHCR Data use and information management”, to review knowledge extraction from the 2019 Evaluation. We can then use the “LLM Ranker” template to review each RAG pipeline output as either “Relevant”, “Too-Generic” or “Out-of-scope”.
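A hedged sketch of this setup, assuming the legacy label-studio-sdk Python client and a locally running Label Studio instance; the URL, API key and the simplified labelling configuration below are placeholders standing in for the built-in “LLM Ranker” template.
from label_studio_sdk import Client

# Placeholders: adapt to the actual Label Studio instance and API key
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")

# Simplified review configuration (a stand-in for the "LLM Ranker" template)
label_config = """
<View>
  <Text name="prompt" value="$prompt"/>
  <Choices name="review" toName="prompt" choice="single">
    <Choice value="Relevant"/>
    <Choice value="Too-Generic"/>
    <Choice value="Out-of-scope"/>
  </Choices>
</View>
"""

project = ls.start_project(
    title="UNHCR Data use and information management - 2019 Evaluation review",
    label_config=label_config,
)

# Import the grouped prompts/answers prepared above
project.import_tasks("dataset/dataset.json")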
Note as well that the Management Response, i.e. the organisation's formal response to evaluation findings, could be another source of ground truth to leverage to enhance the quality of knowledge extraction.
After the peer review is shared with the operation for observations, the operation's feedback on the review can also be collected and used at a later stage to further fine-tune the model.
A Fine-Tuned “expert” Model!
Using the dataset generated from the prompts and then labeled by human reviewers, the next step would be to select an open “foundation” LLM from Hugging Face and fine-tune it. This is not covered in this document, but the recipe can easily be found.
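While the fine-tuning recipe itself is out of scope, a minimal sketch of the data-preparation side is shown below: it flattens the reviewed prompts and answers into a Hugging Face Dataset of prompt/response pairs that a standard supervised fine-tuning script could consume. The 'label' field is hypothetical and assumes the Label Studio review results have been merged back into each item.
import json
from datasets import Dataset

# Load the grouped prompts/answers prepared earlier
with open("dataset/dataset.json") as f:
    groups = json.load(f)

# Flatten into prompt/response pairs, keeping only items reviewed as "Relevant"
# (the 'label' key is a hypothetical field added after the human review)
pairs = [
    {"prompt": group["prompt"], "response": item["body"]}
    for group in groups
    for item in group["items"]
    if item.get("label", "Relevant") == "Relevant"
]

sft_dataset = Dataset.from_list(pairs)
sft_dataset.save_to_disk("dataset/sft_dataset")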
In line with the UN statement promoting open source in general and “open artificial intelligence models” in particular, the resulting fine-tuned model could also be published on the UNHCR Hugging Face organisation account, or on an interagency one to be created…
A fine-tuned model could help front-load many more contexts than a simple foundation model:
Situation – The fine-tuned model would be relevant and specific in relation to the operation profile and the area of focus: one of the strategic impacts, operational outcomes or organizational topics.
Task – The fine-tuned model could be triggered at a specific stage of the operation management cycle for peer review purposes – at any stage of the Plan/Get/Show cycle.
Activity – Based on the combination of situation and task, the fine-tuned model would help re-inject previously found evidence and/or recall recommendations.
Results – The fine-tuned model output would be systematically saved in order to be re-assessed by humans, so the model can be fine-tuned further from this feedback and improve over time (also called reinforcement learning).
Conclusions
Blind trust in AI definitely comes with serious risks to manage: on one side, the lack of transparency and explainability; on the other, the occurrence and reproduction of bias and discrimination.
Trust building will therefore require organizational commitment to control:
- the performance of information retrieval (RAG);
- the ground truthing and alignment of model outputs (Fine-tuning).
This paper advocates for an approach grounded in open data science but backed by human review. Some key considerations for implementing it are:
- Total Cost of Ownership: Off-the-shelf “production-level” solutions do not exist. The real challenge is to correctly balance outsourcing vs insourcing.
- Modular Customization: The “orchestration” solution should be flexible enough to adapt to new developments without changing everything.
- Agility - Iterate & Deliver: Adopt short development rounds to start testing quickly with users.
- Information Formatting: Promote a specific format for report publication, specifically Markdown rather than PDF, to ease the ingestion of content by the models (see the sketch after this list).
- Expertise & Training: Nurture in-house awareness and expertise to understand how RAG works, to test it, and then to help build validation datasets.
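To illustrate the Markdown point above, here is a minimal sketch using LangChain's MarkdownHeaderTextSplitter, which can chunk a report along its own heading structure rather than along arbitrary page breaks; the sample report text is invented.
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Invented sample: a Markdown report keeps its structure explicit
report_md = """
# Evaluation of Data Use and Information Management
## Findings
The evaluation found that ...
## Recommendations
1. Strengthen ...
"""

# Chunk along the report's own heading structure instead of arbitrary page breaks
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
for chunk in splitter.split_text(report_md):
    print(chunk.metadata, "->", chunk.page_content[:60])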
Leveraging the potential of AI for evaluation implies significant investments. Tuning RAG extraction pipelines and building an evaluation dataset for each evaluation report implies setting up dedicated teams and infrastructure. Pooling expertise, sharing scripts and knowledge, and accessing capacity (server infrastructure) around this objective across the UN system would likely be a sustainable way of addressing it.
Acknowledgement
Many thanks to all the AI experts who take the time to build open-source tools for this new technology and to create tutorials. There are many of them and the list below is far from exhaustive:
- advanced_rag
- rag_evaluation
- llm_judge
- rag-from-scratch
- langchain-tutorials
- AI Infrastructure Report
- AI costing
The World Bank Independent Evaluation Group (IEG) has also released a few blogs that focus on the “consume” stage and highlight the inherent limitations that come with a “buy-only” approach:
- Advanced Content Analysis: Can Artificial Intelligence Accelerate Theory-Driven Complex Program Evaluation?
- Setting up Experiments to Test GPT for Evaluation
- Unfulfilled Promises: Using GPT for Synthetic Tasks
- What are the benefits and challenges of using AI in evaluation?
Thanks also to all UNHCR colleagues who took the time to review and proofread this document.