EpiMistral-7B: Fine-tuned Mistral for Epidemiological Information Extraction
Model Description
EpiMistral-7B is a fine-tuned version of Open-Orca/Mistral-7B-OpenOrca specialized for extracting structured epidemiological information from unstructured disease outbreak reports. The model was trained on the WHO Disease Outbreak News (DONs) curated database (Carlson et al., 2023) to automatically extract key epidemiological features including disease classification, geographical locations, case counts, temporal information, and outbreak characteristics.
Model Details
- Base Model: Open-Orca/Mistral-7B-OpenOrca (based on Mistral-7B-v0.1)
- Base Model License: Apache License 2.0
- Model Type: Causal Language Model (Decoder-only Transformer)
- Fine-tuning Method: Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation)
- Adapter Weights License: CC0-1.0 (Public Domain Dedication) - Note: Only the LoRA adapter weights are released under CC0. The base model weights remain under Apache 2.0.
- Training Data: WHO Disease Outbreak News curated database (3,338 records through 2019)
- Language: English
- Application Domain: Public health surveillance, epidemic intelligence, epidemiological information extraction
License
Licensing Information
This repository contains LoRA adapter weights only, not the full model weights.
Base Model (Mistral 7B / Open-Orca/Mistral-7B-OpenOrca): Licensed under Apache License 2.0
- Copyright (c) Mistral AI (base model)
- Copyright (c) OpenOrca (instruction-tuned variant)
- Permissive open-source license allowing commercial use, modification, and distribution
- Requires preservation of copyright and license notices
LoRA Adapter Weights: Released under CC0 1.0 Universal (Public Domain Dedication)
- The adapter weights can be used without restriction
- To use these adapters, you must have access to the base Mistral-7B-OpenOrca model (available under Apache 2.0)
Attribution: When using this model, please include appropriate attribution:
Mistral 7B and Open-Orca/Mistral-7B-OpenOrca are licensed under the Apache License 2.0,
Copyright (c) Mistral AI and OpenOrca.
EpiMistral-7B LoRA adapter weights are released under CC0 1.0 Universal (Public Domain).
Distribution Notes
- This repository distributes only the fine-tuned LoRA adapter parameters
- Base model weights are unchanged and must be obtained separately from Hugging Face
- Users benefit from Apache 2.0's permissive terms for the base model
- The LoRA adapters are applied on top of the base model weights at inference time
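As an illustration of how these pieces fit together, the adapters can be loaded on top of the unchanged base weights and, optionally, merged into them for adapter-free deployment. A minimal sketch, assuming the adapter repository id used in the Usage section below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Open-Orca/Mistral-7B-OpenOrca"  # Apache 2.0 base weights, obtained separately
adapter_id = "AI4PH/EpiMistral-7B"         # CC0 LoRA adapter weights (this repository)

# Load the unchanged base model, then apply the LoRA adapters on top
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapters into the base weights for adapter-free inference
merged = model.merge_and_unload()
merged.save_pretrained("epimistral-7b-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("epimistral-7b-merged")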
Performance
The model achieved the following results on the evaluation set:
| Metric | Score |
|---|---|
| Rouge-1 | 0.899 ± 0.046 |
| Rouge-2 | 0.853 ± 0.057 |
| Rouge-L | 0.889 ± 0.049 |
| Rouge-Lsum | 0.887 ± 0.047 |
These scores represent overall performance across 5-fold stratified cross-validation, demonstrating strong accuracy in extracting structured epidemiological information from unstructured outbreak reports.
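The exact evaluation tooling is not specified in this card; for orientation, a minimal sketch of computing the same ROUGE variants with the Hugging Face evaluate library (an assumed but common choice), where predictions and references are the generated and gold-standard JSON strings per report:

import evaluate  # pip install evaluate rouge_score

# Hypothetical single example: generated JSON string vs. gold-standard JSON string
predictions = ['[{"DiseaseLevel1": "Yellow fever", "Country": "Liberia", "CasesTotal": 3}]']
references = ['[{"DiseaseLevel1": "Yellow fever", "Country": "Liberia", "CasesTotal": 3}]']

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
# Returns rouge1, rouge2, rougeL and rougeLsum, matching the metrics reported above
print(scores)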
Training Summary
- Best Training Step: 1,060
- Best Training Loss: 0.5338
- Initial Training Loss: 1.6518
- Total Improvement: 1.1180
- Total Training Steps: 1,066
- Final Training Loss: 0.5338
Intended Uses & Limitations
Intended Uses
This model is designed for:
- Automated extraction of epidemiological information from disease outbreak reports
- Public health surveillance systems requiring structured data from unstructured sources
- Epidemic intelligence pipelines for rapid outbreak detection and monitoring
- Research purposes in computational epidemiology and public health informatics
Limitations
- The model is trained specifically on WHO DONs format and may require adaptation for other report formats
- Performance on diseases not well-represented in the training data may vary
- The model extracts information present in the text and does not generate or infer missing data
- Designed for English-language outbreak reports only
- Should be used as a decision-support tool, with human verification for critical public health decisions
Extracted Features
The model extracts the following structured epidemiological information (a schema sketch follows the list):
Disease Information:
- DiseaseLevel1 (primary disease classification)
- DiseaseLevel2 (disease subtype/variant)
Geographical Information:
- Country
- ISO country code
- OutbreakEpicenter (specific location within country)
Case Counts:
- CasesTotal
- CasesSuspected
- CasesProbable
- CasesConfirmed
- Deaths
Temporal Information:
- Outbreak start date (year, month, day)
- Outbreak detection date (year, month, day)
- Outbreak verification date (year, month, day)
- Outbreak end date and status
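For downstream processing, the extracted fields can be represented as a typed record. A minimal sketch; the field names follow the Expected Output Format shown later in this card, while the Optional[int]/Optional[str] hints are an assumption about how the values are typically typed:

from typing import Optional, TypedDict

class OutbreakRecord(TypedDict):
    # Disease information
    DiseaseLevel1: Optional[str]
    DiseaseLevel2: Optional[str]
    # Geographical information
    Country: Optional[str]
    ISO: Optional[str]
    OutbreakEpicenter: Optional[str]
    # Case counts
    CasesTotal: Optional[int]
    CasesSuspected: Optional[int]
    CasesProbable: Optional[int]
    CasesConfirmed: Optional[int]
    Deaths: Optional[int]
    # Temporal information (year / month / day components)
    OutbreakStartYear: Optional[int]
    OutbreakStartMonth: Optional[int]
    OutbreakStartDay: Optional[int]
    OutbreakDetectionYear: Optional[int]
    OutbreakDetectionMonth: Optional[int]
    OutbreakDetectionDay: Optional[int]
    OutbreakVerificationYear: Optional[int]
    OutbreakVerificationMonth: Optional[int]
    OutbreakVerificationDay: Optional[int]
    OutbreakEnd: Optional[str]
    OutbreakEndYear: Optional[int]
    OutbreakEndMonth: Optional[int]
    OutbreakEndDay: Optional[int]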
Training Procedure
Training Data
The model was trained on the WHO Disease Outbreak News curated database (Carlson et al., 2023), which contains:
- 3,338 structured records of disease outbreaks (data through 2019)
- Curated epidemiological information manually extracted from WHO DONs reports
- Standardized format for disease classifications, geographical locations, case counts, and temporal data
Training Approach
The training followed an instruction-tuning paradigm where unstructured outbreak report text is paired with structured JSON output containing the extracted epidemiological features. The prompt format used was as follows (a pair-construction sketch appears after the template):
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction:
Extract disease outbreak information from the given text and format it as JSON.
Return a list containing one JSON object per outbreak mentioned.
Use "None" for missing information. Never invent or guess data.
### Input:
[Outbreak report text]
### Response:
[Extracted JSON with epidemiological features]
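A minimal sketch of rendering a (report text, extracted features) pair into this template for supervised fine-tuning; the helper name and record structure are illustrative rather than the exact training pipeline:

import json

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Extract disease outbreak information from the given text and format it as JSON.\n"
    "Return a list containing one JSON object per outbreak mentioned.\n"
    'Use "None" for missing information. Never invent or guess data.\n\n'
    "### Input:\n{report}\n\n"
    "### Response:\n"
)

def build_training_example(report_text: str, extracted_features: list[dict]) -> str:
    # The gold JSON is appended after "### Response:" so the model learns to complete it
    return PROMPT_TEMPLATE.format(report=report_text) + json.dumps(extracted_features)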
Fine-tuning Configuration
LoRA (Low-Rank Adaptation) Parameters (see the configuration sketch after this list):
- Rank (r): 16
- Alpha (α): 16
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Dropout: 0.05
- Task type: CAUSAL_LM
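Expressed with peft, the configuration above corresponds roughly to the following sketch:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # rank
    lora_alpha=16,             # alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)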
Training Hyperparameters (see the sketch after this list):
- Learning rate: 1e-5 (with linear decay)
- Optimizer: AdamW (8-bit paged)
- Training batch size: 4 per device (8 GPUs)
- Gradient accumulation steps: 4
- Number of epochs: 2 (early convergence)
- Warmup steps: Adaptive (10% of training steps, max 10)
- FP16 mixed precision training
- Weight decay: 0.01
- LR scheduler: Linear
- Seed: 41
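Likewise, the hyperparameters above map onto transformers.TrainingArguments roughly as in the sketch below; the output directory and any unlisted defaults are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="epimistral-7b-lora",   # assumed path
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    optim="paged_adamw_8bit",          # AdamW, 8-bit paged
    per_device_train_batch_size=4,     # across 8 GPUs
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=True,
    warmup_steps=10,                   # adaptive in practice (10% of steps, max 10)
    logging_steps=10,
    seed=41,
)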
Evaluation Strategy (see the cross-validation sketch after this list):
- 5-fold stratified cross-validation
- Evaluation metric: Training loss (model selection based on lowest training loss)
- Early stopping: After 6 consecutive evaluations without improvement
- Logging steps: 10
- Save steps: Adaptive (10% of training steps)
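A minimal sketch of producing the 5-fold stratified splits with scikit-learn; stratifying on the primary disease label is an assumption about the stratification key:

from sklearn.model_selection import StratifiedKFold

# records: curated DON entries; labels: one stratification label per record
# (e.g. DiseaseLevel1 -- the exact stratification key is an assumption)
def make_folds(records, labels, seed=41):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, eval_idx in skf.split(records, labels):
        yield ([records[i] for i in train_idx],
               [records[i] for i in eval_idx])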
Hardware:
- Infrastructure: JRC Big Data Analytics Platform
- System: Linux cluster, Ubuntu 22.04.5 LTS
- CPU: Intel Xeon Platinum 8470 (208 CPUs)
- RAM: 1TB
- GPUs: 8x NVIDIA H100
- Training time: ~20-22 hours per fold
Note: EpiMistral-7B converged rapidly during training, reaching its best performance within approximately 2 epochs, reflecting the efficiency of the Mistral architecture combined with OpenOrca instruction tuning for domain-adaptation tasks.
Quantization
The model uses 8-bit quantization with LoRA during training (a loading sketch follows the list):
- Load in 8-bit: True
- Quantization type: Standard 8-bit
- Compute dtype: bfloat16
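For orientation, a minimal sketch of loading the base model in 8-bit and preparing it for LoRA training with bitsandbytes and peft; this illustrates the listed settings rather than the exact training code, and lora_config refers to the earlier configuration sketch:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # standard 8-bit quantization

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/Mistral-7B-OpenOrca",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,   # dtype used for the non-quantized modules
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # attach the LoRA adapters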
Usage
Installation
pip install transformers==4.52.4
pip install torch==2.3.1
pip install peft==0.12.0
pip install accelerate==1.7.0
pip install bitsandbytes==0.43.3
Basic Usage
Note: You need access to the base Mistral-7B-OpenOrca model (available under Apache 2.0) to use these adapter weights.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load base model (Apache 2.0 licensed)
base_model_id = "Open-Orca/Mistral-7B-OpenOrca"
adapter_model_id = "AI4PH/EpiMistral-7B"  # LoRA adapters (this repository)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer from base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Load and apply LoRA adapters
model = PeftModel.from_pretrained(base_model, adapter_model_id)
# Example outbreak report
outbreak_text = """
WHO has reported 3 suspected cases of yellow fever in Maryland county,
in the south-eastern part of the country. One case with disease onset on
1 August has been confirmed (IgM positive) by the Institut Pasteur in
Abidjan, Côte d'Ivoire. All three cases have died.
"""
# Format prompt
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Extract disease outbreak information from the given text and format it as JSON.
Return a list containing one JSON object per outbreak mentioned.
Always return a list of JSON objects, even for single outbreaks.
Use "None" for missing information. If no outbreak information is found, return an empty list [].
Never invent or guess data.
### Input:
{outbreak_text}
### Response:
"""
# Tokenize and generate (greedy decoding; temperature is ignored when do_sample=False)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=600,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )
# Decode only the newly generated tokens (i.e. without echoing the prompt)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
extracted_info = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(extracted_info)
Expected Output Format
[{
"DiseaseLevel1": "Yellow fever",
"DiseaseLevel2": "",
"Country": "Liberia",
"ISO": "LBR",
"OutbreakEpicenter": "Maryland county",
"CasesTotal": 3,
"CasesSuspected": 2,
"CasesProbable": null,
"CasesConfirmed": 1,
"Deaths": 3,
"OutbreakStartYear": 2001,
"OutbreakStartMonth": 8,
"OutbreakStartDay": 1,
"OutbreakDetectionYear": null,
"OutbreakDetectionMonth": null,
"OutbreakDetectionDay": null,
"OutbreakVerificationYear": null,
"OutbreakVerificationMonth": null,
"OutbreakVerificationDay": null,
"OutbreakEnd": null,
"OutbreakEndYear": null,
"OutbreakEndMonth": null,
"OutbreakEndDay": null
}]
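Because the response is a JSON-formatted string, it can be parsed into Python objects for downstream use. A minimal sketch with a fallback for malformed generations; the helper name is illustrative:

import json

def parse_extraction(generated_text: str):
    """Parse the model's response into a list of outbreak dictionaries."""
    # If the full prompt was decoded as well, keep only the part after the response marker
    if "### Response:" in generated_text:
        generated_text = generated_text.split("### Response:")[-1]
    try:
        records = json.loads(generated_text.strip())
    except json.JSONDecodeError:
        return []  # malformed output; callers may retry or flag for manual review
    return records if isinstance(records, list) else [records]

outbreaks = parse_extraction(extracted_info)
for outbreak in outbreaks:
    print(outbreak.get("DiseaseLevel1"), outbreak.get("Country"), outbreak.get("CasesTotal"))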
Comparison with Other Approaches
In-Context Learning vs Fine-Tuning
This fine-tuned model dramatically outperforms in-context learning (iCL) approaches:
| Approach | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum |
|---|---|---|---|---|
| EpiMistral-7B (fine-tuned) | 0.899 | 0.853 | 0.889 | 0.887 |
| Mistral-7B-OpenOrca (16-shot iCL) | 0.600 | 0.475 | 0.580 | 0.598 |
| LLaMA 3.3-70B (16-shot iCL) | 0.840 | 0.698 | 0.824 | 0.841 |
Performance gain from fine-tuning: roughly 30 ROUGE-1 points over the same base model's 16-shot iCL baseline (0.899 vs. 0.600), demonstrating the substantial benefit of parameter-efficient fine-tuning for this specialized task.
Comparison with Other Fine-tuned Models
| Model | Parameters | Rouge-1 | Rouge-2 | Rouge-L |
|---|---|---|---|---|
| EpiLLaMA 3.3-70B | 70B | 0.937 | 0.896 | 0.928 |
| EpiQwen 2.5-7B | 7B | 0.918 | 0.864 | 0.908 |
| EpiMistral-7B | 7B | 0.899 | 0.853 | 0.889 |
All pairwise comparisons are statistically significant (p < 0.001, Nemenyi post-hoc test with Bonferroni correction).
Key Characteristics:
- Built on the OpenOrca instruction-tuned Mistral architecture, optimized for step-by-step reasoning
- Achieves strong performance with rapid convergence (best model within 2 epochs)
- Ideal for deployment scenarios requiring efficient training and inference
- Demonstrates robust extraction capabilities across diverse outbreak types
Citation
If you use this model in your research, please cite:
@article{consoli2025generative,
  title={Generative AI for Structured Epidemiological Information Extraction: Comparing In-Context Learning and Fine-Tuning Approaches},
  author={Consoli, Sergio and Bertolini, Lorenzo and Stefanovitch, Nicolas and Spagnolo, Luigi and Espinosa, Laura and Stilianakis, Nikolaos I.},
  journal={Epidemiology and Infection},
  note={Submitted, currently under revision},
  year={2025},
  publisher={Cambridge University Press}
}
Please also acknowledge the base models:
@article{jiang2023mistral,
title={Mistral 7B},
author={Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and others},
journal={arXiv preprint arXiv:2310.06825},
year={2023}
}
Ethical Considerations & Dual-Use Implications
Upon evaluation, we identified no dual-use implications for this model. The model is designed specifically for public health surveillance and epidemic intelligence applications to support global health initiatives.
Important Notes:
- The model should be used as a decision-support tool with appropriate human oversight
- Extracted information should be verified by public health professionals before making critical decisions
- The model does not replace human expertise in epidemiological analysis
- Privacy and data protection regulations should be followed when processing outbreak reports
Acknowledgments
We acknowledge:
- Mistral AI for developing and releasing Mistral 7B under the Apache License 2.0
- The OpenOrca team for their instruction-tuned Mistral variant
- The GPT@JRC initiative for providing access to LLMs
- The JRC Big Data Analytics Platform for computational infrastructure
- The WHO Epidemic Intelligence from Open Sources (EIOS) initiative for support
- Colleagues at the European Commission Joint Research Centre (JRC) and the European Centre for Disease Prevention and Control (ECDC)
Framework Versions
- Transformers: 4.52.4
- PyTorch: 2.3.1
- PEFT: 0.12.0
- Accelerate: 1.7.0
- BitsAndBytes: 0.43.3
- Datasets: 2.20.0
Disclaimer: The views expressed are purely those of the authors and may not in any circumstance be regarded as stating an official position of the European Commission.