Legal Data Refresh Workflow

Use this sequence whenever new DOCX/PDF files are imported outside the user-facing UI (e.g. nightly ETL or bulk manifests).

Prerequisites

  • Postgres + Redis running.
  • Celery worker online (for interactive uploads), or CELERY_TASK_ALWAYS_EAGER=true set so that tasks run synchronously in the calling process.
  • Tesseract OCR installed (see OCR_SETUP.md).
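
The prerequisites above can be checked before a run with a small preflight script. The following is only a sketch: the host names, the ports (5432 for Postgres, 6379 for Redis), and the function names are assumptions for illustration and are not part of the repository. It tests TCP reachability only, so it does not confirm credentials or that a Celery worker is consuming tasks.

```python
import shutil
import socket


def missing_binaries(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(services, binaries=("tesseract",)):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = [
        f"{name} not reachable at {host}:{port}"
        for name, (host, port) in services.items()
        if not port_open(host, port)
    ]
    problems += [f"{name} not found on PATH" for name in missing_binaries(binaries)]
    return problems


if __name__ == "__main__":
    # Assumed defaults: local Postgres on 5432 and Redis on 6379.
    assumed_services = {"Postgres": ("localhost", 5432), "Redis": ("localhost", 6379)}
    for problem in preflight(assumed_services):
        print("FAIL:", problem)
```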

Manual Command Sequence

cd backend/hue_portal
source ../.venv/bin/activate

python manage.py load_legal_document --file "/path/to/docx" --code DOC-123
python ../scripts/generate_embeddings.py --model legal
python ../scripts/build_faiss_index.py --model legal

Notes:

  • load_legal_document can be substituted with the manifest loader (scripts/load_legal_documents.py) if multiple files need ingestion.
  • The embedding script logs processed sections; expect a SHA checksum for each chunk.
  • FAISS builder writes artifacts under backend/hue_portal/artifacts/faiss_indexes.
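
For callers that prefer Python over a shell session, the manual sequence can be sketched as a list of commands run in order, stopping at the first failure. This is a minimal illustration, not the project's own tooling: build_commands and run_refresh are hypothetical names, and the sketch assumes it is launched with the virtualenv's interpreter and that the working directory is backend/hue_portal, as in the commands above.

```python
import subprocess
import sys


def build_commands(file_path, code, python=sys.executable):
    """Return the three refresh commands as argv lists, in execution order."""
    return [
        [python, "manage.py", "load_legal_document", "--file", file_path, "--code", code],
        [python, "../scripts/generate_embeddings.py", "--model", "legal"],
        [python, "../scripts/build_faiss_index.py", "--model", "legal"],
    ]


def run_refresh(file_path, code, cwd="backend/hue_portal", runner=subprocess.run):
    """Run each step in order; check=True aborts on the first non-zero exit."""
    for argv in build_commands(file_path, code):
        runner(argv, cwd=cwd, check=True)
```

Passing the commands as argv lists rather than one shell string avoids quoting problems with file paths that contain spaces. The runner parameter exists only so the sequence can be exercised without launching real processes.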

Automated Helper

backend/scripts/refresh_legal_data.sh wraps the three steps:

./backend/scripts/refresh_legal_data.sh \
  --file "/path/to/THONG-TU.docx" \
  --code TT-02

Flags:

  • --skip-ingest: skip ingestion and only regenerate the embeddings and FAISS index (useful after editing chunking logic).
  • --python: path to the interpreter to use (default: python3).

CI / Nightly Jobs

  1. Sync new files into tài nguyên/ (the "resources" directory).
  2. Run the helper script for each file (or call the manifest loader first).
  3. Archive FAISS artifacts (upload to object storage) so the chatbot containers can download them at boot.
  4. Record build duration and artifact checksums for auditing.
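
Step 4 can be sketched as a small audit record written after each build. The helper names, the JSON layout, and the choice of SHA-256 are assumptions made for illustration; the source does not specify how checksums or durations are stored.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large indexes are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_audit_record(artifact_dir, started_at, finished_at):
    """Summarise one build: duration plus a checksum for every artifact file."""
    files = sorted(p for p in Path(artifact_dir).iterdir() if p.is_file())
    return {
        "duration_seconds": round(finished_at - started_at, 3),
        "artifacts": {p.name: sha256_of(p) for p in files},
    }


def write_audit_record(record, out_path):
    """Persist the record as JSON (hypothetical layout) for later auditing."""
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))
```

A job would typically capture time.monotonic() before and after the helper script runs, pass both values to build_audit_record together with backend/hue_portal/artifacts/faiss_indexes, and upload the resulting JSON alongside the archived artifacts.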

Verification Checklist

  • generate_embeddings log ends with Completed model=legal.
  • FAISS directory contains freshly timestamped .faiss and .mappings.pkl files.
  • Sample chatbot query (“Thông tư 02 ...”) returns snippets referencing the newly ingested document.
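
The first two checklist items lend themselves to automation. The sketch below is one possible approach and makes two assumptions that the source does not confirm: that the embedding log is available as text, and that "fresh" can be judged by file modification time within a caller-chosen window. The chatbot query still needs a manual or integration-level check.

```python
import time
from pathlib import Path

EXPECTED_LOG_TAIL = "Completed model=legal"
REQUIRED_SUFFIXES = (".faiss", ".mappings.pkl")


def log_completed(log_text):
    """True if the last non-empty log line ends with the completion marker."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith(EXPECTED_LOG_TAIL)


def stale_or_missing(index_dir, max_age_seconds, now=None):
    """Return the required suffixes with no file modified within max_age_seconds."""
    now = time.time() if now is None else now
    problems = []
    for suffix in REQUIRED_SUFFIXES:
        fresh = [
            p for p in Path(index_dir).iterdir()
            if p.name.endswith(suffix) and now - p.stat().st_mtime <= max_age_seconds
        ]
        if not fresh:
            problems.append(suffix)
    return problems
```

An empty list from stale_or_missing together with a True result from log_completed would suggest that the embedding and indexing steps both finished during the current run.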