AI Usage for Evaluation Purposes: Improving Report Summarization
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
Edward Osborne Wilson
Given the number of evaluation reports published across the UN system, challenges have arisen in information retrieval and in generalizing evidence.
How can the most relevant findings and recommendations be extracted from one specific context and reused, or re-injected, in a different but appropriate context?
“Having human beings scan articles for relevant text for inclusion is likely a very inefficient way to produce reviews. Adopting these technologies will improve the speed and accuracy of evidence synthesis.”
Choosing what to include and what to exclude: highlighting critical aspects while deciding which less relevant details to omit.
Relying on automated retrieval can help improve the objectivity and independence of evaluation report summarization.
Retrieval-Augmented Generation (RAG) combines the strengths of retrieval systems and generative large language models.
Embeddings are numeric vector representations generated from text data.
The Hugging Face Hub is the main platform for ranking and comparing the performance of large language models (LLMs) on various benchmarks and tasks. It hosts leaderboards for both embedding and generation models.
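To make these building blocks concrete, the sketch below embeds a few illustrative report passages and retrieves the most relevant one for a query by cosine similarity, the retrieval step of a RAG pipeline. It assumes the sentence-transformers library and uses the bge-large-en-v1.5 embedding model mentioned later; the passages and query are placeholders, not text from an actual report.

```python
# Retrieval half of a RAG pipeline: embed passages, embed the query,
# rank passages by cosine similarity, and build a generation prompt.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Placeholder passages standing in for chunks of an evaluation report.
passages = [
    "Finding: data collection practices vary widely across operations.",
    "Finding: information management roles are not consistently defined.",
    "Recommendation: establish a common data governance framework.",
]
query = "What does the evaluation recommend about data governance?"

# Embeddings are numeric vector representations of the text.
passage_vectors = embedder.encode(passages, normalize_embeddings=True)
query_vector = embedder.encode(query, normalize_embeddings=True)

# Retrieve the passage most similar to the query.
scores = util.cos_sim(query_vector, passage_vectors)[0]
best = int(scores.argmax())

# The retrieved passage would be injected into the prompt of a
# generative model (e.g. Command-R or Mixtral) to produce the answer.
prompt = f"Context:\n{passages[best]}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```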
Data Collection: Select & gather relevant reports.
Model Testing: Test different generative and retrieval large language models.
Integration: Combine models & functions into a cohesive pipeline.
Validation: Build a human-baseline to benchmark the performance of the integrated system.
Evaluation: Assess accuracy, relevance, and efficiency using predefined metrics.
Define and apply relevant metrics for both retrieval and generation to systematically and continuously assess the performance of the pipeline against existing models and baselines; a simple retrieval-side metric is sketched below.
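One simple retrieval metric is hit rate (recall@k): the share of test questions for which at least one human-labelled relevant chunk appears among the top-k chunks returned by the retriever. A dependency-free sketch, using hypothetical chunk identifiers:

```python
# Hit rate (recall@k): fraction of questions whose relevant chunk
# appears among the top-k chunk ids returned by the retriever.
def hit_rate_at_k(retrieved, relevant, k=5):
    hits = sum(
        1 for q, top_ids in retrieved.items() if set(top_ids[:k]) & relevant[q]
    )
    return hits / len(retrieved)

# Hypothetical retrieval results and human-labelled relevant chunks.
retrieved = {"q1": ["c3", "c7", "c1"], "q2": ["c2", "c9", "c4"]}
relevant = {"q1": {"c1"}, "q2": {"c8"}}
print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.5
```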
Thorough Documentation
Keep detailed records of data sources, processing steps, model configurations, and evaluation results.
Clear guidelines on usage and troubleshooting written for a lay audience.
Reproducible Workflows
Ensure that experiments can be replicated by others:
Transparent Reporting
Clearly communicate methodologies and findings in reports and publications:
Incorporate ongoing feedback from users to continuously improve model performance.
Report used: the 2019 Evaluation of UNHCR’s data use and information management approaches, with two test summaries (#1 and #2).
Models Tested: small large language models that can run on a powerful laptop: Command-R and Mixtral for generation, bge-large-en-v1.5 for the embeddings.
Integration & Documentation: Use of LangChain for the orchestration (see the pipeline sketch after this list). Code shared and documented on GitHub.
Human Validation: Ground-truthing with Label Studio (labelstud.io).
Evaluation: Assess accuracy, relevance, and efficiency using RAGAS (Retrieval Augmented Generation Assessment).
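The two sketches below illustrate how such a pipeline and its evaluation might be wired together. First, the orchestration with LangChain: exact import paths vary by LangChain version, and the report.pdf file name and serving Mixtral locally through Ollama are assumptions made for illustration, not the project's documented setup.

```python
# Sketch of a RAG pipeline orchestrated with LangChain (import paths
# follow the langchain / langchain_community split and vary by version).
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# 1. Load the evaluation report and split it into overlapping chunks.
docs = PyPDFLoader("report.pdf").load()  # placeholder file name
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Embed the chunks and index them in a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Combine retriever and generator into a question-answering chain
#    (Mixtral served locally through Ollama is an assumption).
llm = Ollama(model="mixtral")
qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

print(qa.invoke("Summarize the main findings on data use.")["result"])
```

Second, a RAGAS evaluation run: the metric and column names follow the commonly documented RAGAS interface and may differ by version, and the rows are hypothetical placeholders rather than the pilot's actual test data.

```python
# Sketch of a RAGAS evaluation run on one hypothetical test question.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Placeholder row: question, generated answer, retrieved contexts and
# the human-written ground truth (e.g. labelled in Label Studio).
rows = {
    "question": ["What does the evaluation recommend about data governance?"],
    "answer": ["It recommends establishing a common data governance framework."],
    "contexts": [["Recommendation: establish a common data governance framework."]],
    "ground_truth": ["Establish a common data governance framework across operations."],
}

# evaluate() calls a judge LLM under the hood (OpenAI by default
# unless another model is configured).
results = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```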
Blind trust in AI comes with serious risks to manage: on one side, the lack of transparency and explainability; on the other, the occurrence and reproduction of bias and discrimination.
Trust building will therefore require organizational commitment to control: