---
title: Ltu Chat
emoji: 🏢
colorFrom: pink
colorTo: red
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# LTU Chat RAG Evaluation
This repository contains a RAG (Retrieval-Augmented Generation) pipeline for the LTU (Luleå University of Technology) programme data, along with evaluation tools using Ragas.
## Overview
The system uses:
- Qdrant: Vector database for storing and retrieving embeddings
- Haystack: Framework for building the RAG pipeline
- Ragas: Framework for evaluating RAG systems
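
The actual wiring lives in `rag_pipeline.py`. As a rough orientation only, a minimal Haystack 2.x pipeline over a local Qdrant store, with models served through the Nebius OpenAI-compatible API, might look like the sketch below; the component names, prompt template, and example query are illustrative and may differ from the real implementation.

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Local, on-disk Qdrant collection holding the LTU programme embeddings.
# embedding_dim should match the model used when the data was indexed.
document_store = QdrantDocumentStore(path="./qdrant_data", index="ltu_programmes")

template = """Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

pipeline = Pipeline()
pipeline.add_component("embedder", OpenAITextEmbedder(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="BAAI/bge-en-icl",
))
pipeline.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", OpenAIGenerator(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="meta-llama/Llama-3.3-70B-Instruct",
))
pipeline.connect("embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

question = "Which master's programmes does LTU offer in AI?"
result = pipeline.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])
```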
## Files
- `rag_pipeline.py`: Main RAG pipeline implementation
- `ragas_eval.py`: Script to evaluate the RAG pipeline using Ragas
- `testset.json`: JSONL file containing test questions, reference answers, and contexts
- `testset_generation.py`: Script used to generate the test set
## Requirements
```
streamlit==1.42.2
haystack-ai==2.10.3
qdrant-client==1.13.2
python-dotenv==1.0.1
beautifulsoup4==4.13.3
qdrant-haystack==8.0.0
ragas-haystack==2.1.0
rapidfuzz==3.12.2
pandas
```
## Setup
Make sure you have all the required packages installed:

```bash
pip install -r requirements.txt
```

Set up your environment variables (optional):

```bash
export NEBIUS_API_KEY="your_api_key_here"
```

If not set, the script will use the default API key included in the code.
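
Since `python-dotenv` is in the requirements, the key can presumably also be supplied through a local `.env` file. A minimal sketch of how a script might resolve it:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pick up variables from a local .env file, if one exists
api_key = os.getenv("NEBIUS_API_KEY")  # None when unset; the scripts then fall back to their bundled key
```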
## Running the Evaluation
To evaluate the RAG pipeline using Ragas:
```bash
python ragas_eval.py
```
This will:
- Load the Qdrant document store from the local directory
- Load the test set from `testset.json`
- Run the RAG pipeline on each test question
- Evaluate the results using Ragas metrics
- Save the evaluation results to `ragas_evaluation_results.json`
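
In outline, the loop that loads the test set and runs the pipeline might look roughly like the sketch below; `run_rag` is a hypothetical placeholder for whatever interface `rag_pipeline.py` actually exposes for answering a question and returning the retrieved contexts. The Ragas scoring step itself is sketched under the metrics section below.

```python
import json


def run_rag(question: str) -> tuple[str, list[str]]:
    """Hypothetical stand-in for the pipeline call in rag_pipeline.py."""
    raise NotImplementedError


# Load the JSONL test set: one JSON object per line.
samples = []
with open("testset.json", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            samples.append(json.loads(line))

# Run the RAG pipeline on each question and collect what Ragas needs.
records = []
for sample in samples:
    answer, contexts = run_rag(sample["user_input"])
    records.append({
        "user_input": sample["user_input"],
        "response": answer,
        "retrieved_contexts": contexts,
        "reference": sample["reference"],
    })

# `records` is then scored with Ragas and the resulting scores written to
# ragas_evaluation_results.json.
```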
## Ragas Metrics
The evaluation uses the following Ragas metrics:
- Faithfulness: Measures if the generated answer is factually consistent with the retrieved contexts
- Answer Relevancy: Measures if the answer is relevant to the question
- Context Precision: Measures the proportion of retrieved contexts that are relevant
- Context Recall: Measures if the retrieved contexts contain the information needed to answer the question
- Context Relevancy: Measures the relevance of retrieved contexts to the question
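
As a rough illustration, assuming the standalone `ragas.evaluate()` API with a Hugging Face `datasets.Dataset` (the project may instead go through the `ragas-haystack` integration), four of the metrics above could be computed as follows; the row values are placeholders, and the LLM/embedding configuration for the metrics is omitted.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per test question; the values below are placeholders.
data = Dataset.from_dict({
    "question": ["<test question from testset.json>"],
    "answer": ["<answer generated by the RAG pipeline>"],
    "contexts": [["<retrieved context 1>", "<retrieved context 2>"]],
    "ground_truth": ["<reference answer from testset.json>"],
})

scores = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages; scores.to_pandas() gives per-sample values
```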
## Customization
You can customize the evaluation by modifying the RAGEvaluator class parameters:
```python
evaluator = RAGEvaluator(
    embedding_model_name="BAAI/bge-en-icl",              # embedding model served via the Nebius API
    llm_model_name="meta-llama/Llama-3.3-70B-Instruct",  # generator model
    qdrant_path="./qdrant_data",                         # local on-disk Qdrant store
    api_base_url="https://api.studio.nebius.com/v1/",    # OpenAI-compatible Nebius endpoint
    collection_name="ltu_programmes"                     # Qdrant collection with the programme documents
)
```
## Test Set Format
The test set is a JSONL file where each line contains:
- `user_input`: The question
- `reference`: The reference answer
- `reference_contexts`: List of reference contexts that should be retrieved
- `synthesizer_name`: Name of the synthesizer used to generate the reference answer
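
A single line might look like this (the values are purely illustrative and not taken from the actual test set):

```json
{"user_input": "Which master's programmes does LTU offer in AI?", "reference": "LTU offers ...", "reference_contexts": ["Programme page excerpt ..."], "synthesizer_name": "example_synthesizer"}
```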