---
title: Ltu Chat
emoji: 🏢
colorFrom: pink
colorTo: red
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# LTU Chat RAG Evaluation

This repository contains a RAG (Retrieval-Augmented Generation) pipeline for LTU (Luleå University of Technology) programme data, along with tools for evaluating it with Ragas.

## Overview

The system uses:
- **Qdrant**: Vector database for storing and retrieving embeddings
- **Haystack**: Framework for building the RAG pipeline
- **Ragas**: Framework for evaluating RAG systems
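
As a rough sketch of how these pieces fit together (the actual wiring lives in `rag_pipeline.py` and may differ), a Haystack query pipeline backed by the local Qdrant store might look like the following. The model names, Nebius endpoint, Qdrant path, and collection name are taken from the Customization section below; the prompt template, the `embedding_dim` value, and the example question are illustrative assumptions:

```python
from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever

# Local Qdrant store holding the LTU programme embeddings.
document_store = QdrantDocumentStore(
    path="./qdrant_data",
    index="ltu_programmes",
    embedding_dim=4096,  # assumed vector size for BAAI/bge-en-icl; adjust to your collection
)

prompt_template = """
Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
"""

pipeline = Pipeline()
pipeline.add_component("embedder", OpenAITextEmbedder(
    model="BAAI/bge-en-icl",
    api_base_url="https://api.studio.nebius.com/v1/",
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
))
pipeline.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store))
pipeline.add_component("prompt", PromptBuilder(template=prompt_template))
pipeline.add_component("llm", OpenAIGenerator(
    model="meta-llama/Llama-3.3-70B-Instruct",
    api_base_url="https://api.studio.nebius.com/v1/",
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
))

# Embed the query, retrieve matching documents, build the prompt, generate the answer.
pipeline.connect("embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt.documents")
pipeline.connect("prompt.prompt", "llm.prompt")

question = "Which master's programmes does LTU offer?"
result = pipeline.run({"embedder": {"text": question}, "prompt": {"question": question}})
print(result["llm"]["replies"][0])
```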

## Files

- `rag_pipeline.py`: Main RAG pipeline implementation
- `ragas_eval.py`: Script to evaluate the RAG pipeline using Ragas
- `testset.json`: JSONL file containing test questions, reference answers, and contexts
- `testset_generation.py`: Script used to generate the test set

## Requirements

```
streamlit==1.42.2
haystack-ai==2.10.3
qdrant-client==1.13.2
python-dotenv==1.0.1
beautifulsoup4==4.13.3
qdrant-haystack==8.0.0
ragas-haystack==2.1.0
rapidfuzz==3.12.2
pandas
```

## Setup

1. Make sure you have all the required packages installed:
   ```
   pip install -r requirements.txt
   ```

2. Set up your environment variables (optional):
   ```
   export NEBIUS_API_KEY="your_api_key_here"
   ```
   If not set, the script will use the default API key included in the code.
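
For reference, this fallback typically looks roughly like the snippet below (a sketch assuming `python-dotenv`, which is already in the requirements); `DEFAULT_NEBIUS_API_KEY` is a placeholder for whatever default the script ships with:

```python
import os
from dotenv import load_dotenv

# Load variables from a local .env file, if one exists.
load_dotenv()

# Prefer the environment variable; fall back to the key bundled in the code.
DEFAULT_NEBIUS_API_KEY = "replace-with-bundled-default"  # placeholder, not a real key
api_key = os.getenv("NEBIUS_API_KEY", DEFAULT_NEBIUS_API_KEY)
```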

## Running the Evaluation

To evaluate the RAG pipeline using Ragas:

```bash
python ragas_eval.py
```

This will:
1. Load the Qdrant document store from the local directory
2. Load the test set from `testset.json`
3. Run the RAG pipeline on each test question
4. Evaluate the results using Ragas metrics
5. Save the evaluation results to `ragas_evaluation_results.json`
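
In outline, the evaluation loop amounts to the following (a simplified sketch, not the exact contents of `ragas_eval.py`; `run_rag_pipeline` is a hypothetical stand-in for whatever entry point `rag_pipeline.py` exposes):

```python
import json


def run_rag_pipeline(question: str) -> tuple[str, list[str]]:
    """Hypothetical stand-in for the pipeline defined in rag_pipeline.py."""
    raise NotImplementedError("wire this to the real pipeline")


# Load the test set (one JSON object per line).
with open("testset.json", encoding="utf-8") as f:
    testset = [json.loads(line) for line in f if line.strip()]

records = []
for sample in testset:
    # Run the RAG pipeline on each test question.
    answer, retrieved_contexts = run_rag_pipeline(sample["user_input"])
    records.append({
        "user_input": sample["user_input"],
        "response": answer,
        "retrieved_contexts": retrieved_contexts,
        "reference": sample["reference"],
    })

# In ragas_eval.py these records are then scored with the Ragas metrics
# listed below and the scores are written to ragas_evaluation_results.json;
# here we simply persist the collected records as a placeholder.
with open("ragas_evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```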

## Ragas Metrics

The evaluation uses the following Ragas metrics:

- **Faithfulness**: Measures whether the generated answer is factually consistent with the retrieved contexts
- **Answer Relevancy**: Measures how relevant the answer is to the question
- **Context Precision**: Measures the proportion of retrieved contexts that are relevant to the question
- **Context Recall**: Measures whether the retrieved contexts contain the information needed to answer the question
- **Context Relevancy**: Measures the relevance of the retrieved contexts to the question

## Customization

You can customize the evaluation by modifying the `RAGEvaluator` class parameters:

```python
evaluator = RAGEvaluator(
    embedding_model_name="BAAI/bge-en-icl",
    llm_model_name="meta-llama/Llama-3.3-70B-Instruct",
    qdrant_path="./qdrant_data",
    api_base_url="https://api.studio.nebius.com/v1/",
    collection_name="ltu_programmes"
)
```

## Test Set Format

The test set is a JSONL file where each line contains:
- `user_input`: The question
- `reference`: The reference answer
- `reference_contexts`: List of reference contexts that should be retrieved
- `synthesizer_name`: Name of the synthesizer used to generate the reference answer
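
An illustrative line, just to show the shape of each record (all values here are made up, not taken from the actual test set):

```json
{"user_input": "What specialisations does the programme offer?", "reference": "The programme offers specialisations in ...", "reference_contexts": ["...programme page text used as ground-truth context..."], "synthesizer_name": "example_synthesizer"}
```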