# Legal Data Refresh Workflow
Use this sequence whenever new DOCX/PDF files are imported outside the user-facing UI (e.g. nightly ETL or bulk manifests).
## Prerequisites
- Postgres and Redis running.
- Celery worker online (for interactive uploads), or `CELERY_TASK_ALWAYS_EAGER=true` for synchronous runs.
- Tesseract OCR installed (see `OCR_SETUP.md`).
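The prerequisites above can be sketched as a quick pre-flight check. This is an illustrative snippet, not part of the repository; it only confirms the CLI binaries are on `PATH` and does not verify that Postgres/Redis are reachable or that a worker is actually consuming tasks:

```shell
# Illustrative pre-flight check for the prerequisites above.
missing=0
for tool in psql redis-cli celery tesseract; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool found"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "pre-flight check complete ($missing tool(s) missing)"
```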
## Manual Command Sequence

```bash
cd backend/hue_portal
source ../.venv/bin/activate
python manage.py load_legal_document --file "/path/to/docx" --code DOC-123
python ../scripts/generate_embeddings.py --model legal
python ../scripts/build_faiss_index.py --model legal
```
Notes:

- `load_legal_document` can be substituted with the manifest loader (`scripts/load_legal_documents.py`) if multiple files need ingestion.
- The embedding script logs processed sections; expect a SHA checksum for each chunk.
- The FAISS builder writes artifacts under `backend/hue_portal/artifacts/faiss_indexes`.
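When several files need ingestion, a bulk run can be driven by a simple loop. The tab-separated "path, code" manifest format below is an assumption for illustration only, and the real `manage.py` call is left commented out so the loop can be dry-run safely:

```shell
# Illustrative bulk-ingest loop over a hypothetical tab-separated manifest
# of "path<TAB>code" rows. Uncomment the manage.py line (run from
# backend/hue_portal with the venv active) to ingest for real.
TAB="$(printf '\t')"
count=0
while IFS="$TAB" read -r path code; do
  [ -n "$path" ] || continue
  echo "would ingest: $path as $code"
  # python manage.py load_legal_document --file "$path" --code "$code"
  count=$((count + 1))
done <<EOF
/data/incoming/doc-a.docx${TAB}DOC-A
/data/incoming/doc-b.docx${TAB}DOC-B
EOF
echo "planned $count ingestion(s)"
```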
## Automated Helper

`backend/scripts/refresh_legal_data.sh` wraps the three steps:

```bash
./backend/scripts/refresh_legal_data.sh \
  --file "/path/to/THONG-TU.docx" \
  --code TT-02
```
Flags:

- `--skip-ingest` to only regenerate embeddings/index (useful after editing chunking logic).
- `--python` to point at a specific interpreter (default `python3`).
## CI / Nightly Jobs

- Sync new files into `tài nguyên/`.
- Run the helper script for each file (or call the manifest loader first).
- Archive FAISS artifacts (upload to object storage) so the chatbot containers can download them at boot.
- Record build duration and artifact checksums for auditing.
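The duration/checksum step above might look like the following sketch. The artifact directory here is a temp dir with a fake file standing in for the real FAISS output, and the wrapper invocation is stubbed out:

```shell
# Illustrative audit step: time a (stubbed) build and record artifact
# checksums. A temp dir stands in for the real
# backend/hue_portal/artifacts/faiss_indexes output.
ART_DIR="$(mktemp -d)"
printf 'fake index bytes' > "$ART_DIR/legal.faiss"
start=$(date +%s)
# ./backend/scripts/refresh_legal_data.sh --file "$f" --code "$c"  # real build here
end=$(date +%s)
duration=$((end - start))
echo "build_seconds=$duration"
checksums="$(sha256sum "$ART_DIR"/*.faiss)"
echo "$checksums"
rm -rf "$ART_DIR"
```

In a real nightly job, the `echo` lines would be appended to an audit log or shipped to a metrics store instead of printed.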
## Verification Checklist

- The `generate_embeddings` log ends with `Completed model=legal`.
- The FAISS directory contains fresh timestamped `.faiss` + `.mappings.pkl` files.
- A sample chatbot query ("Thông tư 02 ...") returns snippets referencing the newly ingested document.
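A minimal freshness check for the artifact item above could look like this; the directory path follows the layout mentioned earlier, and `find -mtime -1` restricts results to files modified within the last 24 hours:

```shell
# Illustrative freshness check for FAISS artifacts.
# IDX_DIR falls back to the default layout if not set in the environment.
IDX_DIR="${IDX_DIR:-backend/hue_portal/artifacts/faiss_indexes}"
if [ -d "$IDX_DIR" ]; then
  fresh="$(find "$IDX_DIR" \( -name '*.faiss' -o -name '*.mappings.pkl' \) -mtime -1)"
  if [ -n "$fresh" ]; then
    echo "fresh artifacts:"
    echo "$fresh"
  else
    echo "WARNING: no artifacts newer than 24h in $IDX_DIR"
  fi
else
  echo "index directory not found: $IDX_DIR"
fi
```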