---
title: Ltu Chat
emoji: 🏢
colorFrom: pink
colorTo: red
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

LTU Chat RAG Evaluation

This repository contains a Retrieval-Augmented Generation (RAG) pipeline for Luleå University of Technology (LTU) programme data, along with tools for evaluating it using Ragas.

Overview

The system uses:

  • Qdrant: Vector database for storing and retrieving embeddings
  • Haystack: Framework for building the RAG pipeline
  • Ragas: Framework for evaluating RAG systems
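
The sketch below shows how these pieces typically fit together. It is illustrative rather than a copy of rag_pipeline.py: the model names, API base URL, Qdrant path, and collection name come from the defaults shown under Customization below, while the choice of Haystack components, the prompt template, top_k, and the embedding dimension are assumptions.

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Local Qdrant store holding the programme embeddings.
# embedding_dim=4096 is an assumption matching the BAAI/bge-en-icl output size.
document_store = QdrantDocumentStore(
    path="./qdrant_data",
    index="ltu_programmes",
    embedding_dim=4096,
)

# Simple prompt that stuffs the retrieved documents into the context.
template = """Answer the question using only the context below.

Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

pipe = Pipeline()
pipe.add_component("embedder", OpenAITextEmbedder(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="BAAI/bge-en-icl",
))
pipe.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store, top_k=5))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("generator", OpenAIGenerator(
    api_key=Secret.from_env_var("NEBIUS_API_KEY"),
    api_base_url="https://api.studio.nebius.com/v1/",
    model="meta-llama/Llama-3.3-70B-Instruct",
))

# Query embedding -> retrieval -> prompt -> generation.
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")

question = "Which master's programmes does LTU offer in computer science?"
result = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(result["generator"]["replies"][0])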

Files

  • rag_pipeline.py: Main RAG pipeline implementation
  • ragas_eval.py: Script to evaluate the RAG pipeline using Ragas
  • testset.json: JSONL file containing test questions, reference answers, and contexts
  • testset_generation.py: Script used to generate the test set

Requirements

streamlit==1.42.2
haystack-ai==2.10.3
qdrant-client==1.13.2
python-dotenv==1.0.1
beautifulsoup4==4.13.3
qdrant-haystack==8.0.0
ragas-haystack==2.1.0
rapidfuzz==3.12.2
pandas

Setup

  1. Make sure you have all the required packages installed:

    pip install -r requirements.txt
    
  2. Set up your environment variables (optional):

    export NEBIUS_API_KEY="your_api_key_here"
    

    If not set, the script will use the default API key included in the code.
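
As a rough sketch of how the key might be picked up (python-dotenv is already in the requirements; the actual loading logic lives in the scripts themselves):

import os
from dotenv import load_dotenv

load_dotenv()  # also picks up NEBIUS_API_KEY from a local .env file, if present
api_key = os.getenv("NEBIUS_API_KEY") or "<default key bundled in the code>"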

Running the Evaluation

To evaluate the RAG pipeline using Ragas:

python ragas_eval.py

This will:

  1. Load the Qdrant document store from the local directory
  2. Load the test set from testset.json
  3. Run the RAG pipeline on each test question
  4. Evaluate the results using Ragas metrics
  5. Save the evaluation results to ragas_evaluation_results.json
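
In outline, steps 1–3 look roughly like the sketch below. The answer_question helper is hypothetical; the real pipeline call lives in rag_pipeline.py and ragas_eval.py.

import json

from rag_pipeline import answer_question  # hypothetical helper returning (answer, retrieved context strings)

# testset.json is JSONL: one test sample per line.
with open("testset.json", "r", encoding="utf-8") as f:
    testset = [json.loads(line) for line in f if line.strip()]

samples = []
for row in testset:
    answer, contexts = answer_question(row["user_input"])
    samples.append({
        "user_input": row["user_input"],
        "response": answer,
        "retrieved_contexts": contexts,
        "reference": row["reference"],
    })

The samples list then feeds the Ragas metrics described in the next section, and the resulting scores are written to ragas_evaluation_results.json.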

Ragas Metrics

The evaluation uses the following Ragas metrics:

  • Faithfulness: Measures whether the generated answer is factually consistent with the retrieved contexts
  • Answer Relevancy: Measures whether the answer is relevant to the question
  • Context Precision: Measures the proportion of retrieved contexts that are relevant to the question
  • Context Recall: Measures whether the retrieved contexts contain the information needed to answer the question
  • Context Relevancy: Measures how relevant the retrieved contexts are to the question
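
A hedged sketch of how these metrics can be passed to Ragas is shown below. Import names vary between Ragas versions (the lowercase metric instances are the classic API), the evaluator LLM and embeddings still need to be pointed at your provider, and Context Relevancy is omitted here because its import name depends on the installed version.

from ragas import EvaluationDataset, evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# `samples` is the list of dicts built while running the pipeline over the test set.
dataset = EvaluationDataset.from_list(samples)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    # llm=..., embeddings=...  # evaluator models pointed at your provider
)
print(result)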

Customization

You can customize the evaluation by modifying the RAGEvaluator class parameters:

evaluator = RAGEvaluator(
    embedding_model_name="BAAI/bge-en-icl",
    llm_model_name="meta-llama/Llama-3.3-70B-Instruct",
    qdrant_path="./qdrant_data",
    api_base_url="https://api.studio.nebius.com/v1/",
    collection_name="ltu_programmes"
)

Test Set Format

The test set is a JSONL file where each line contains:

  • user_input: The question
  • reference: The reference answer
  • reference_contexts: List of reference contexts that should be retrieved
  • synthesizer_name: Name of the synthesizer used to generate the reference answer
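
As a quick sanity check (a small sketch, not one of the repository scripts), each line can be validated like this:

import json

REQUIRED = {"user_input", "reference", "reference_contexts", "synthesizer_name"}

with open("testset.json", "r", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED - record.keys()
        if missing:
            raise ValueError(f"testset.json line {lineno} is missing fields: {missing}")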