# Legal Data Refresh Workflow
Use this sequence whenever new DOCX/PDF files are imported outside the user-facing UI (e.g. nightly ETL or bulk manifests).
## Prerequisites
- Postgres + Redis running.
- Celery worker online (for interactive uploads) or `CELERY_TASK_ALWAYS_EAGER=true` for synchronous runs.
- Tesseract OCR installed (see `OCR_SETUP.md`).
## Manual Command Sequence
```
cd backend/hue_portal
source ../.venv/bin/activate
python manage.py load_legal_document --file "/path/to/docx" --code DOC-123
python ../scripts/generate_embeddings.py --model legal
python ../scripts/build_faiss_index.py --model legal
```
Notes:
- `load_legal_document` can be substituted with the manifest loader (`scripts/load_legal_documents.py`) if multiple files need ingestion.
- The embedding script logs processed sections; expect a SHA checksum for each chunk.
- FAISS builder writes artifacts under `backend/hue_portal/artifacts/faiss_indexes`.
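To spot-check a checksum from the embedding log, a chunk's digest can be recomputed by hand. This is a minimal sketch that assumes SHA-256 over the UTF-8 bytes of the chunk text; the notes above only say "SHA", so the actual algorithm and any text normalization in `generate_embeddings.py` may differ. `chunk_checksum` is a hypothetical helper, not part of the project.

```python
import hashlib


def chunk_checksum(text: str) -> str:
    """Return the hex digest of a chunk's text.

    Assumption: SHA-256 over UTF-8 bytes. The real script may use a
    different SHA variant or normalize whitespace first.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Compare this value against the checksum printed in the embedding log.
    print(chunk_checksum("Sample chunk text"))
```

If the recomputed digest never matches the logged one, the script is probably hashing a normalized form of the text or using another SHA variant.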
## Automated Helper
`backend/scripts/refresh_legal_data.sh` wraps the three steps above (ingest, embedding generation, FAISS index build):
```
./backend/scripts/refresh_legal_data.sh \
--file "/path/to/THONG-TU.docx" \
--code TT-02
```
Flags:
- `--skip-ingest` to only regenerate embeddings/index (useful after editing chunking logic).
- `--python` to point at a specific interpreter (default `python3`).
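The effect of the two flags can be pictured as a small planning function. The sketch below is not the helper script itself (its contents are not shown here); `plan_refresh` and `run_refresh` are hypothetical names, and the commands are copied from the manual sequence, assuming they run from `backend/hue_portal`.

```python
import subprocess
from typing import List, Optional


def plan_refresh(
    file: Optional[str] = None,
    code: Optional[str] = None,
    skip_ingest: bool = False,
    python: str = "python3",
) -> List[List[str]]:
    """Build the argv list for each step, mirroring the manual sequence."""
    steps: List[List[str]] = []
    if not skip_ingest:
        # Ingestion needs both the source file and the document code.
        if not file or not code:
            raise ValueError("--file and --code are required unless --skip-ingest is set")
        steps.append(
            [python, "manage.py", "load_legal_document", "--file", file, "--code", code]
        )
    # Embeddings and the FAISS index are always rebuilt.
    steps.append([python, "../scripts/generate_embeddings.py", "--model", "legal"])
    steps.append([python, "../scripts/build_faiss_index.py", "--model", "legal"])
    return steps


def run_refresh(steps: List[List[str]], cwd: str = "backend/hue_portal") -> None:
    """Run each step in order and stop at the first failure."""
    for argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    for argv in plan_refresh(skip_ingest=True):
        print(" ".join(argv))
```

With `skip_ingest=True` only the embedding and index commands are planned, which matches the case of re-running after a chunking change.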
## CI / Nightly Jobs
1. Sync new files into `tài nguyên/`.
2. Run the helper script for each file (or call the manifest loader first).
3. Archive FAISS artifacts (upload to object storage) so the chatbot containers can download them at boot.
4. Record build duration and artifact checksums for auditing.
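Step 4 can be covered by a small script that runs after the index build. The following is a sketch under stated assumptions: SHA-256 as the checksum, a flat artifact directory, and a JSON-friendly record. `sha256_file` and `build_audit_record` are hypothetical helpers, and where the record is stored is left to the job.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Dict, Union


def sha256_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large index files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_audit_record(artifact_dir: Union[str, Path], started: float, finished: float) -> Dict:
    """Collect the build duration and a checksum for every file in the artifact directory."""
    root = Path(artifact_dir)
    checksums = {p.name: sha256_file(p) for p in sorted(root.iterdir()) if p.is_file()}
    return {
        "duration_seconds": round(finished - started, 3),
        "artifacts": checksums,
    }


if __name__ == "__main__":
    started = time.time()
    # ... run the refresh steps here ...
    finished = time.time()
    record = build_audit_record("backend/hue_portal/artifacts/faiss_indexes", started, finished)
    print(json.dumps(record, indent=2))
```

Storing the record next to the uploaded artifacts lets a chatbot container compare checksums after download.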
## Verification Checklist
- `generate_embeddings` log ends with `Completed model=legal`.
- FAISS directory contains freshly timestamped `.faiss` and `.mappings.pkl` files.
- Sample chatbot query (“Thông tư 02 ...”) returns snippets referencing the newly ingested document.
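The first two checks are easy to automate. This sketch makes several assumptions: the embedding log is available as text, "fresh" is judged by file modification time rather than by a timestamp in the file name, and one hour is an arbitrary freshness window. `log_completed` and `fresh_artifacts` are hypothetical helpers. The chatbot query still needs a manual check or a separate integration test.

```python
import time
from pathlib import Path
from typing import Dict, Optional, Union


def log_completed(log_text: str, model: str = "legal") -> bool:
    """True if the last non-empty log line ends with the completion marker."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith(f"Completed model={model}")


def fresh_artifacts(
    index_dir: Union[str, Path],
    max_age_seconds: float = 3600,
    now: Optional[float] = None,
) -> Dict[str, bool]:
    """Report, per expected suffix, whether a recently modified file exists."""
    now = time.time() if now is None else now
    root = Path(index_dir)
    result = {}
    for suffix in (".faiss", ".mappings.pkl"):
        matches = [p for p in root.iterdir() if p.name.endswith(suffix)]
        # Freshness is approximated by mtime (assumption, see lead-in).
        result[suffix] = any(now - p.stat().st_mtime <= max_age_seconds for p in matches)
    return result


if __name__ == "__main__":
    print(log_completed("INFO processed 42 sections\nINFO Completed model=legal\n"))
    print(fresh_artifacts("backend/hue_portal/artifacts/faiss_indexes"))
```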