Evaluating AI Usage for Evaluation Purposes

Improving Report Summarization

“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”

Edward Osborne Wilson

Current Challenge

Given the sheer number of published evaluation reports across the UN system, challenges of information retrieval and evidence generalization have arisen.

How can the most relevant findings and recommendations be extracted from one specific context, then reused and re-injected in a different but appropriate context?

The 5th wave of the evidence revolution is triggered by AI

Having human beings scan articles for relevant text for inclusion is likely a very inefficient way to produce reviews. Adopting these technologies will improve the speed and accuracy of evidence synthesis.

The four waves of the evidence revolution, published in Nature, Howard White, 2019

Results Cherry-Picking: how to build an effective “Evaluation Brief”?

Choosing what to include and what to exclude means highlighting the critical aspects while deciding which less relevant details to omit…

Relying on automated retrieval can help improve the objectivity and independence of evaluation report summarization.

Cassandra, bearer of bad news

RAG to the Rescue!

Retrieval-Augmented Generation (RAG) combines the strengths of retrieval-based models and generative large language models.

Embeddings are numeric vector representations generated from text data, which make it possible to search documents by meaning rather than keywords.
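As a minimal illustration, the snippet below generates embeddings with the sentence-transformers library and the bge-large-en-v1.5 model used in the experiment reported later; the library choice and the sample passages are assumptions, not a prescribed setup.

```python
# Minimal sketch: turning report passages into embedding vectors.
# Assumes `pip install sentence-transformers`; the model mirrors the one used
# in the experiment below, and the sample passages are hypothetical.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

passages = [
    "The evaluation found gaps in data interoperability across operations.",
    "Recommendation: invest in staff training on information management.",
]

# Each passage becomes a fixed-size numeric vector (1024 dimensions for this model),
# which can later be compared with a query vector to retrieve relevant text.
embeddings = model.encode(passages, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)
```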

Leaderboard for Large Language Models

Hugging Face hosts the main leaderboards that rank and compare the performance of large language models (LLMs) on various benchmarks and tasks.

These include separate leaderboards for embedding and for generation models, which:

  • Provide a clear and transparent comparison of different LLMs.
  • Help identify the best models for specific tasks or domains.

Building a RAG pipeline requires the following steps (a minimal code sketch follows the list):

  1. Data Collection: Select & Gather relevant reports.

  2. Model Testing: Test different generative and retrieval large language models.

  3. Integration: Combine models & functions into a cohesive pipeline.

  4. Validation: Build a human baseline to benchmark the performance of the integrated system.

  5. Evaluation: Assess accuracy, relevance, and efficiency using predefined metrics.
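A minimal sketch of steps 1 to 3, assuming LangChain for orchestration (as in the experiment reported later), FAISS as a local vector store, and a generative model served locally through Ollama; the file path, model names, and prompt are illustrative assumptions, and exact import paths vary across LangChain versions.

```python
# Minimal RAG pipeline sketch (assumptions: langchain-community, pypdf, faiss-cpu,
# sentence-transformers and a local Ollama server; paths and model names are illustrative).
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama

# 1. Data collection: load one evaluation report.
docs = PyPDFLoader("reports/unhcr_data_use_2019.pdf").load()

# 2. Chunking: split the report into overlapping passages.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Retrieval side: embed the chunks and index them in a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
retriever = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 4})

# 4. Generation side: a small generative LLM served locally.
llm = Ollama(model="mixtral")

# 5. Integration: retrieve relevant chunks, then ask the LLM to answer from them only.
question = "What are the main recommendations on information management?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = llm.invoke(
    f"Using only the context below, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(answer)
```

Swapping the generative model (for instance comparing Command-R and Mixtral, as done in the experiment below) only changes the model line, which is what makes step 2, model testing, cheap to iterate on.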

A RAG Evaluation Framework

Define and apply relevant metrics for both retrieval and generation to systematically and continuously assess the performance of the pipeline against existing models and baselines.
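As one simple illustration of a retrieval-side metric, the sketch below computes precision and recall at k against a human-curated list of relevant passages; the passage identifiers are hypothetical, and generation-side metrics (handled by RAGAS in the experiment below) would complement them.

```python
# Minimal sketch: precision/recall at k for the retrieval step, measured against
# a human baseline of passages judged relevant (identifiers are hypothetical).
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hits = sum(1 for pid in top_k if pid in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: the pipeline retrieved passages 12, 7, 3, 40; humans marked 7, 3 and 21 as relevant.
print(precision_recall_at_k([12, 7, 3, 40], {7, 3, 21}, k=4))  # (0.5, ~0.67)
```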

Applying a “Data Science” Approach!

Thorough Documentation

Keep detailed records of data sources, processing steps, model configurations, and evaluation results.

Write clear guidelines on usage and troubleshooting for a lay audience.

Reproducible Workflows

Ensure that experiments can be replicated by others (a minimal pipeline entry point is sketched after this list):

  • Put code under version control.
  • Automate pipelines and scripts for data processing, model training, and evaluation.
  • Share public repositories for collaborative work.
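One possible way to automate this, sketched under the assumption of a single hypothetical rag_pipeline.py entry point, is to expose each stage as a sub-command so any collaborator can rerun exactly the same step:

```python
# Hypothetical pipeline entry point (rag_pipeline.py): one sub-command per stage,
# so `python rag_pipeline.py evaluate` reruns the same step on any machine.
# The stage bodies are placeholders for the project's own scripts.
import argparse

def ingest():
    print("collect and chunk reports...")          # data processing stage

def index():
    print("embed chunks and build the vector store...")

def evaluate():
    print("run retrieval and generation metrics...")

STAGES = {"ingest": ingest, "index": index, "evaluate": evaluate}

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproducible RAG workflow")
    parser.add_argument("stage", choices=STAGES)
    STAGES[parser.parse_args().stage]()
```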

Transparent Reporting

Clearly communicate methodologies and findings in reports and publications, stating at minimum (an illustrative run configuration follows the list):

  • Type of chunking
  • Name of the embedding model
  • Retrieval strategy
  • Name of the response (generation) LLM
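A lightweight way to report these choices is to keep them in one configuration object that is logged with every run; the values below are illustrative and simply mirror the experiment described later.

```python
# Illustrative run configuration capturing the parameters to report transparently;
# the concrete values mirror the experiment below and are not prescriptive.
import json

RUN_CONFIG = {
    "chunking": {"type": "recursive_character", "chunk_size": 1000, "chunk_overlap": 100},
    "embedding_model": "BAAI/bge-large-en-v1.5",
    "retrieval": {"strategy": "dense_top_k", "k": 4},
    "response_llm": "mixtral",
}

# Log the configuration alongside every evaluation run so results can be traced back to it.
print(json.dumps(RUN_CONFIG, indent=2))
```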

Organising Validation with a Human Feedback Loop

Incorporate ongoing feedback from users to continuously improve model performance.

  • Task-Specific Fine Tuning: Adjust models based on specific application requirements and domain knowledge.
  • Alignment Fine Tuning: Ensure that model outputs align with ethical guidelines and user expectations.

Experimentation Results

See full article here

  1. Report used: the 2019 Evaluation of UNHCR’s data use and information management approaches, with two test summaries (#1 and #2).

  2. Models Tested: small large language models that can run on a powerful laptop: Command-R and Mixtral for generation, bge-large-en-v1.5 for the embeddings.

  3. Integration & Documentation: Use of LangChain for the orchestration. Code shared and documented on GitHub.

  4. Human Validation: Ground-truthing with Label Studio (labelstud.io).

  5. Evaluation: Assess accuracy, relevance, and efficiency using RAGAS (Retrieval Augmented Generation Assessment); a minimal sketch follows.
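For context, here is a minimal RAGAS sketch. It assumes ragas 0.1.x-style imports and the datasets library; the sample question, answer, contexts, and ground truth are hypothetical placeholders, and these metrics call an LLM judge under the hood, so credentials or a locally configured judge model are required.

```python
# Minimal RAGAS sketch (assumes ragas ~0.1.x and the `datasets` library; the sample
# question, answer, contexts and ground truth are hypothetical placeholders).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

sample = {
    "question": ["What are the main recommendations on information management?"],
    "answer": ["The report recommends clearer data governance and staff training."],
    "contexts": [[
        "Recommendation 3: strengthen data governance...",
        "Recommendation 5: invest in staff training...",
    ]],
    "ground_truth": ["Strengthen data governance and invest in staff training."],
}

# Each metric returns a score between 0 and 1 for the evaluated samples.
result = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```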

An interface for Human Review of LLM outputs

AI Deployment: Buy or Build?

Some Considerations

  • Total Cost of Ownership: Off-the-shelf “production-level” solutions do not exist. The real challenge is to strike the right balance between outsourcing and insourcing.
  • Modular Customization: The “orchestration” solution should be flexible enough to adapt to new developments without rebuilding everything.
  • Agility - Iterate & Deliver: Adopt short development rounds to test with users.
  • Information Formatting: Promote a specific format for report publication, specifically Markdown rather than PDF, to ease the ingestion of content by the models.
  • Expertise & Training: Nurture in-house awareness and expertise to understand how RAG works, to test it, and then to help build validation datasets.

Conclusions

Blind trust in AI definitely comes with serious risks to manage: on one side, the lack of transparency and explainability; on the other, the occurrence and reproduction of bias and discrimination.

Trust building will therefore require organizational commitment to control:

  • the performance of information retrieval (RAG);
  • the ground truthing and alignment of model outputs (Fine-tuning).