---
title: Ltu Chat
emoji: 🏢
colorFrom: pink
colorTo: red
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

LTU Chat RAG Evaluation

This repository contains a Retrieval-Augmented Generation (RAG) pipeline for Luleå University of Technology (LTU) programme data, along with tools for evaluating it using Ragas.

Overview

The system uses:

  • Qdrant: Vector database for storing and retrieving embeddings
  • Haystack: Framework for building the RAG pipeline
  • Ragas: Framework for evaluating RAG systems
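
The sketch below shows how these pieces typically fit together. It is illustrative rather than a copy of rag_pipeline.py: the model names, API base URL, Qdrant path, and collection name come from the defaults shown under Customization below, while the choice of Haystack components, the prompt template, top_k, and the embedding dimension are assumptions.

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Local Qdrant store holding the programme embeddings.
# embedding_dim=4096 is an assumption matching the BAAI/bge-en-icl output size.
document_store = QdrantDocumentStore(
    path="./qdrant_data",
    index="ltu_programmes",
    embedding_dim=4096,
)

# Simple prompt that stuffs the retrieved documents into the context.
template = """Answer the question using only the context below.

Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

pipe = Pipeline()
pipe.add_component("embedder", OpenAITextEmbedder(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="BAAI/bge-en-icl",
))
pipe.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store, top_k=5))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("generator", OpenAIGenerator(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="meta-llama/Llama-3.3-70B-Instruct",
))

# Query embedding -> retrieval -> prompt -> generation.
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")

question = "Which master's programmes does LTU offer in computer science?"
result = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])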

Files

  • rag_pipeline.py: Main RAG pipeline implementation
  • ragas_eval.py: Script to evaluate the RAG pipeline using Ragas
  • testset.json: JSONL file containing test questions, reference answers, and contexts
  • testset_generation.py: Script used to generate the test set

Requirements

streamlit==1.42.2
haystack-ai==2.10.3
qdrant-client==1.13.2
python-dotenv==1.0.1
beautifulsoup4==4.13.3
qdrant-haystack==8.0.0
ragas-haystack==2.1.0
rapidfuzz==3.12.2
pandas

Setup

  1. Make sure you have all the required packages installed:

    pip install -r requirements.txt
    
  2. Set up your environment variables (optional):

    export NEBIUS_API_KEY="your_api_key_here"
    

    If not set, the script will use the default API key included in the code.
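
As a rough sketch of how the key might be picked up (python-dotenv is already in the requirements; the actual loading logic lives in the scripts themselves):

import os
from dotenv import load_dotenv

load_dotenv()  # also picks up NEBIUS_API_KEY from a local .env file, if present
api_key = os.getenv("NEBIUS_API_KEY") or "<default key bundled in the code>"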

Running the Evaluation

To evaluate the RAG pipeline using Ragas:

python ragas_eval.py

This will:

  1. Load the Qdrant document store from the local directory
  2. Load the test set from testset.json
  3. Run the RAG pipeline on each test question
  4. Evaluate the results using Ragas metrics
  5. Save the evaluation results to ragas_evaluation_results.json
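
In outline, steps 1–3 look roughly like the sketch below. The answer_question helper is hypothetical; the real pipeline call lives in rag_pipeline.py and ragas_eval.py.

import json

from rag_pipeline import answer_question  # hypothetical helper returning (answer, retrieved context strings)

# testset.json is JSONL: one test sample per line.
with open("testset.json", "r", encoding="utf-8") as f:
    testset = [json.loads(line) for line in f if line.strip()]

samples = []
for row in testset:
    answer, contexts = answer_question(row["user_input"])
    samples.append({
        "user_input": row["user_input"],
        "response": answer,
        "retrieved_contexts": contexts,
        "reference": row["reference"],
    })

The samples list then feeds the Ragas metrics described in the next section, and the resulting scores are written to ragas_evaluation_results.json.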

Ragas Metrics

The evaluation uses the following Ragas metrics:

  • Faithfulness: Measures whether the generated answer is factually consistent with the retrieved contexts
  • Answer Relevancy: Measures whether the answer is relevant to the question
  • Context Precision: Measures the proportion of retrieved contexts that are relevant to the question
  • Context Recall: Measures whether the retrieved contexts contain the information needed to answer the question
  • Context Relevancy: Measures how relevant the retrieved contexts are to the question
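
A hedged sketch of how these metrics can be passed to Ragas is shown below. Import names vary between Ragas versions (the lowercase metric instances are the classic API), the evaluator LLM and embeddings still need to be pointed at your provider, and Context Relevancy is omitted here because its import name depends on the installed version.

from ragas import EvaluationDataset, evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# `samples` is the list of dicts built while running the pipeline over the test set.
dataset = EvaluationDataset.from_list(samples)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    # llm=..., embeddings=...  # evaluator models pointed at your provider
)
print(result)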

Customization

You can customize the evaluation by modifying the RAGEvaluator class parameters:

evaluator = RAGEvaluator(
    embedding_model_name="BAAI/bge-en-icl",
    llm_model_name="meta-llama/Llama-3.3-70B-Instruct",
    qdrant_path="./qdrant_data",
    api_base_url="https://api.studio.nebius.com/v1/",
    collection_name="ltu_programmes"
)

Test Set Format

The test set is a JSONL file where each line contains:

  • user_input: The question
  • reference: The reference answer
  • reference_contexts: List of reference contexts that should be retrieved
  • synthesizer_name: Name of the synthesizer used to generate the reference answer
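
As a quick sanity check (a small sketch, not one of the repository scripts), each line can be validated like this:

import json

REQUIRED = {"user_input", "reference", "reference_contexts", "synthesizer_name"}

with open("testset.json", "r", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED - record.keys()
        if missing:
            raise ValueError(f"testset.json line {lineno} is missing fields: {missing}")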