EpiMistral-7B: Fine-tuned Mistral for Epidemiological Information Extraction
Model Description
EpiMistral-7B is a fine-tuned version of Open-Orca/Mistral-7B-OpenOrca specialized for extracting structured epidemiological information from unstructured disease outbreak reports. The model was trained on the WHO Disease Outbreak News (DONs) curated database (Carlson et al., 2023) to automatically extract key epidemiological features including disease classification, geographical locations, case counts, temporal information, and outbreak characteristics.
Model Details
- Base Model: Open-Orca/Mistral-7B-OpenOrca (based on Mistral-7B-v0.1)
- Base Model License: Apache License 2.0
- Model Type: Causal Language Model (Decoder-only Transformer)
- Fine-tuning Method: Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation)
- Adapter Weights License: CC0-1.0 (Public Domain Dedication) - Note: Only the LoRA adapter weights are released under CC0. The base model weights remain under Apache 2.0.
- Training Data: WHO Disease Outbreak News curated database (3,338 records through 2019)
- Language: English
- Application Domain: Public health surveillance, epidemic intelligence, epidemiological information extraction
License
Licensing Information
This repository contains LoRA adapter weights only, not the full model weights.
Base Model (Mistral 7B / Open-Orca/Mistral-7B-OpenOrca): Licensed under Apache License 2.0
- Copyright (c) Mistral AI (base model)
- Copyright (c) OpenOrca (instruction-tuned variant)
- Permissive open-source license allowing commercial use, modification, and distribution
- Requires preservation of copyright and license notices
LoRA Adapter Weights: Released under CC0 1.0 Universal (Public Domain Dedication)
- The adapter weights can be used without restriction
- To use these adapters, you must have access to the base Mistral-7B-OpenOrca model (available under Apache 2.0)
Attribution: When using this model, please include appropriate attribution:
Mistral 7B and Open-Orca/Mistral-7B-OpenOrca are licensed under the Apache License 2.0,
Copyright (c) Mistral AI and OpenOrca.
EpiMistral-7B LoRA adapter weights are released under CC0 1.0 Universal (Public Domain).
Distribution Notes
- This repository distributes only the fine-tuned LoRA adapter parameters
- Base model weights are unchanged and must be obtained separately from Hugging Face
- Users benefit from Apache 2.0's permissive terms for the base model
- The LoRA adapters are applied on top of the base model weights at inference time
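As an illustration of how these pieces fit together, the adapters can be loaded on top of the unchanged base weights and, optionally, merged into them for adapter-free deployment. A minimal sketch, assuming the adapter repository id used in the Usage section below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Open-Orca/Mistral-7B-OpenOrca"  # Apache 2.0 base weights, obtained separately
adapter_id = "AI4PH/EpiMistral-7B"         # CC0 LoRA adapter weights (this repository)

# Load the unchanged base model, then apply the LoRA adapters on top
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapters into the base weights for adapter-free inference
merged = model.merge_and_unload()
merged.save_pretrained("epimistral-7b-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("epimistral-7b-merged")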
Performance
The model achieved the following results on the evaluation set:
| Metric | Score |
|---|---|
| Rouge-1 | 0.899 ± 0.046 |
| Rouge-2 | 0.853 ± 0.057 |
| Rouge-L | 0.889 ± 0.049 |
| Rouge-Lsum | 0.887 ± 0.047 |
These scores represent overall performance across 5-fold stratified cross-validation, demonstrating strong accuracy in extracting structured epidemiological information from unstructured outbreak reports.
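The exact evaluation tooling is not specified in this card; for orientation, a minimal sketch of computing the same ROUGE variants with the Hugging Face evaluate library (an assumed but common choice), where predictions and references are the generated and gold-standard JSON strings per report:

import evaluate  # pip install evaluate rouge_score

# Hypothetical single example: generated JSON string vs. gold-standard JSON string
predictions = ['[{"DiseaseLevel1": "Yellow fever", "Country": "Liberia", "CasesTotal": 3}]']
references = ['[{"DiseaseLevel1": "Yellow fever", "Country": "Liberia", "CasesTotal": 3}]']

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
# Returns rouge1, rouge2, rougeL and rougeLsum, matching the metrics reported above
print(scores)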
Training Summary
- Best Training Step: 1,060
- Best Training Loss: 0.5338
- Initial Training Loss: 1.6518
- Total Improvement: 1.1180
- Total Training Steps: 1,066
- Final Training Loss: 0.5338
Intended Uses & Limitations
Intended Uses
This model is designed for:
- Automated extraction of epidemiological information from disease outbreak reports
- Public health surveillance systems requiring structured data from unstructured sources
- Epidemic intelligence pipelines for rapid outbreak detection and monitoring
- Research purposes in computational epidemiology and public health informatics
Limitations
- The model is trained specifically on WHO DONs format and may require adaptation for other report formats
- Performance on diseases not well-represented in the training data may vary
- The model extracts information present in the text and does not generate or infer missing data
- Designed for English-language outbreak reports only
- Should be used as a decision-support tool, with human verification for critical public health decisions
Extracted Features
The model extracts the following structured epidemiological information (a schema sketch follows the list):
Disease Information:
- DiseaseLevel1 (primary disease classification)
- DiseaseLevel2 (disease subtype/variant)
Geographical Information:
- Country
- ISO country code
- OutbreakEpicenter (specific location within country)
Case Counts:
- CasesTotal
- CasesSuspected
- CasesProbable
- CasesConfirmed
- Deaths
Temporal Information:
- Outbreak start date (year, month, day)
- Outbreak detection date (year, month, day)
- Outbreak verification date (year, month, day)
- Outbreak end date and status
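For downstream processing, the extracted fields can be represented as a typed record. A minimal sketch; the field names follow the Expected Output Format shown later in this card, while the Optional[int]/Optional[str] hints are an assumption about how the values are typically typed:

from typing import Optional, TypedDict

class OutbreakRecord(TypedDict):
    # Disease information
    DiseaseLevel1: Optional[str]
    DiseaseLevel2: Optional[str]
    # Geographical information
    Country: Optional[str]
    ISO: Optional[str]
    OutbreakEpicenter: Optional[str]
    # Case counts
    CasesTotal: Optional[int]
    CasesSuspected: Optional[int]
    CasesProbable: Optional[int]
    CasesConfirmed: Optional[int]
    Deaths: Optional[int]
    # Temporal information (year / month / day components)
    OutbreakStartYear: Optional[int]
    OutbreakStartMonth: Optional[int]
    OutbreakStartDay: Optional[int]
    OutbreakDetectionYear: Optional[int]
    OutbreakDetectionMonth: Optional[int]
    OutbreakDetectionDay: Optional[int]
    OutbreakVerificationYear: Optional[int]
    OutbreakVerificationMonth: Optional[int]
    OutbreakVerificationDay: Optional[int]
    OutbreakEnd: Optional[str]
    OutbreakEndYear: Optional[int]
    OutbreakEndMonth: Optional[int]
    OutbreakEndDay: Optional[int]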
Training Procedure
Training Data
The model was trained on the WHO Disease Outbreak News curated database (Carlson et al., 2023), which contains:
- 3,338 structured records of disease outbreaks (data through 2019)
- Curated epidemiological information manually extracted from WHO DONs reports
- Standardized format for disease classifications, geographical locations, case counts, and temporal data
Training Approach
The training followed an instruction-tuning paradigm where unstructured outbreak report text is paired with structured JSON output containing the extracted epidemiological features. The prompt format used was as follows (a pair-construction sketch appears after the template):
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction:
Extract disease outbreak information from the given text and format it as JSON.
Return a list containing one JSON object per outbreak mentioned.
Use "None" for missing information. Never invent or guess data.
### Input:
[Outbreak report text]
### Response:
[Extracted JSON with epidemiological features]
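A minimal sketch of rendering a (report text, extracted features) pair into this template for supervised fine-tuning; the helper name and record structure are illustrative rather than the exact training pipeline:

import json

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Extract disease outbreak information from the given text and format it as JSON.\n"
    "Return a list containing one JSON object per outbreak mentioned.\n"
    'Use "None" for missing information. Never invent or guess data.\n\n'
    "### Input:\n{report}\n\n"
    "### Response:\n"
)

def build_training_example(report_text: str, extracted_features: list[dict]) -> str:
    # The gold JSON is appended after "### Response:" so the model learns to complete it
    return PROMPT_TEMPLATE.format(report=report_text) + json.dumps(extracted_features)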
Fine-tuning Configuration
LoRA (Low-Rank Adaptation) Parameters (see the configuration sketch after this list):
- Rank (r): 16
- Alpha (α): 16
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Dropout: 0.05
- Task type: CAUSAL_LM
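Expressed with peft, the configuration above corresponds roughly to the following sketch:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # rank
    lora_alpha=16,             # alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)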
Training Hyperparameters (see the sketch after this list):
- Learning rate: 1e-5 (with linear decay)
- Optimizer: AdamW (8-bit paged)
- Training batch size: 4 per device (8 GPUs)
- Gradient accumulation steps: 4
- Number of epochs: 2 (early convergence)
- Warmup steps: Adaptive (10% of training steps, max 10)
- FP16 mixed precision training
- Weight decay: 0.01
- LR scheduler: Linear
- Seed: 41
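Likewise, the hyperparameters above map onto transformers.TrainingArguments roughly as in the sketch below; the output directory and any unlisted defaults are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="epimistral-7b-lora",   # assumed path
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    optim="paged_adamw_8bit",          # AdamW, 8-bit paged
    per_device_train_batch_size=4,     # across 8 GPUs
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=True,
    warmup_steps=10,                   # adaptive in practice (10% of steps, max 10)
    logging_steps=10,
    seed=41,
)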
Evaluation Strategy (see the cross-validation sketch after this list):
- 5-fold stratified cross-validation
- Evaluation metric: Training loss (model selection based on lowest training loss)
- Early stopping: After 6 consecutive evaluations without improvement
- Logging steps: 10
- Save steps: Adaptive (10% of training steps)
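A minimal sketch of producing the 5-fold stratified splits with scikit-learn; stratifying on the primary disease label is an assumption about the stratification key:

from sklearn.model_selection import StratifiedKFold

# records: curated DON entries; labels: one stratification label per record
# (e.g. DiseaseLevel1 -- the exact stratification key is an assumption)
def make_folds(records, labels, seed=41):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, eval_idx in skf.split(records, labels):
        yield ([records[i] for i in train_idx],
               [records[i] for i in eval_idx])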
Hardware:
- Infrastructure: JRC Big Data Analytics Platform
- System: Linux cluster, Ubuntu 22.04.5 LTS
- CPU: Intel Xeon Platinum 8470 (208 CPUs)
- RAM: 1TB
- GPUs: 8x NVIDIA H100
- Training time: ~20-22 hours per fold
Note: EpiMistral-7B converged rapidly during training, reaching its best performance within approximately 2 epochs, reflecting the efficiency of the Mistral architecture combined with OpenOrca instruction tuning for domain-adaptation tasks.
Quantization
The model uses 8-bit quantization with LoRA during training (a loading sketch follows the list):
- Load in 8-bit: True
- Quantization type: Standard 8-bit
- Compute dtype: bfloat16
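For orientation, a minimal sketch of loading the base model in 8-bit and preparing it for LoRA training with bitsandbytes and peft; this illustrates the listed settings rather than the exact training code, and lora_config refers to the earlier configuration sketch:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # standard 8-bit quantization

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/Mistral-7B-OpenOrca",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,   # dtype used for the non-quantized modules
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)  # attach the LoRA adapters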
Usage
Installation
pip install transformers==4.52.4
pip install torch==2.3.1
pip install peft==0.12.0
pip install accelerate==1.7.0
pip install bitsandbytes==0.43.3
Basic Usage
Note: You need access to the base Mistral-7B-OpenOrca model (available under Apache 2.0) to use these adapter weights.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load base model (Apache 2.0 licensed)
base_model_id = "Open-Orca/Mistral-7B-OpenOrca"
adapter_model_id = "AI4PH/EpiMistral-7B"  # LoRA adapters (this repository)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer from base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Load and apply LoRA adapters
model = PeftModel.from_pretrained(base_model, adapter_model_id)
# Example outbreak report
outbreak_text = """
WHO has reported 3 suspected cases of yellow fever in Maryland county,
in the south-eastern part of the country. One case with disease onset on
1 August has been confirmed (IgM positive) by the Institut Pasteur in
Abidjan, Côte d'Ivoire. All three cases have died.
"""
# Format prompt
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Extract disease outbreak information from the given text and format it as JSON.
Return a list containing one JSON object per outbreak mentioned.
Always return a list of JSON objects, even for single outbreaks.
Use "None" for missing information. If no outbreak information is found, return an empty list [].
Never invent or guess data.
### Input:
{outbreak_text}
### Response:
"""
# Tokenize and generate (greedy decoding; temperature is ignored when do_sample=False)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=600,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )
# Decode only the newly generated tokens (i.e. without echoing the prompt)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
extracted_info = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(extracted_info)
Expected Output Format
[{
"DiseaseLevel1": "Yellow fever",
"DiseaseLevel2": "",
"Country": "Liberia",
"ISO": "LBR",
"OutbreakEpicenter": "Maryland county",
"CasesTotal": 3,
"CasesSuspected": 2,
"CasesProbable": null,
"CasesConfirmed": 1,
"Deaths": 3,
"OutbreakStartYear": 2001,
"OutbreakStartMonth": 8,
"OutbreakStartDay": 1,
"OutbreakDetectionYear": null,
"OutbreakDetectionMonth": null,
"OutbreakDetectionDay": null,
"OutbreakVerificationYear": null,
"OutbreakVerificationMonth": null,
"OutbreakVerificationDay": null,
"OutbreakEnd": null,
"OutbreakEndYear": null,
"OutbreakEndMonth": null,
"OutbreakEndDay": null
}]
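Because the response is a JSON-formatted string, it can be parsed into Python objects for downstream use. A minimal sketch with a fallback for malformed generations; the helper name is illustrative:

import json

def parse_extraction(generated_text: str):
    """Parse the model's response into a list of outbreak dictionaries."""
    # If the full prompt was decoded as well, keep only the part after the response marker
    if "### Response:" in generated_text:
        generated_text = generated_text.split("### Response:")[-1]
    try:
        records = json.loads(generated_text.strip())
    except json.JSONDecodeError:
        return []  # malformed output; callers may retry or flag for manual review
    return records if isinstance(records, list) else [records]

outbreaks = parse_extraction(extracted_info)
for outbreak in outbreaks:
    print(outbreak.get("DiseaseLevel1"), outbreak.get("Country"), outbreak.get("CasesTotal"))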
Comparison with Other Approaches
In-Context Learning vs Fine-Tuning
This fine-tuned model dramatically outperforms in-context learning (iCL) approaches:
| Approach | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum |
|---|---|---|---|---|
| EpiMistral-7B (fine-tuned) | 0.899 | 0.853 | 0.889 | 0.887 |
| Mistral-7B-OpenOrca (16-shot iCL) | 0.600 | 0.475 | 0.580 | 0.598 |
| LLaMA 3.3-70B (16-shot iCL) | 0.840 | 0.698 | 0.824 | 0.841 |
Performance gain from fine-tuning: roughly 30 ROUGE-1 points over the same base model's 16-shot iCL baseline (0.899 vs. 0.600), demonstrating the substantial benefit of parameter-efficient fine-tuning for this specialized task.
Comparison with Other Fine-tuned Models
| Model | Parameters | Rouge-1 | Rouge-2 | Rouge-L |
|---|---|---|---|---|
| EpiLLaMA 3.3-70B | 70B | 0.937 | 0.896 | 0.928 |
| EpiQwen 2.5-7B | 7B | 0.918 | 0.864 | 0.908 |
| EpiMistral-7B | 7B | 0.899 | 0.853 | 0.889 |
All pairwise comparisons are statistically significant (p < 0.001, Nemenyi post-hoc test with Bonferroni correction).
Key Characteristics:
- Built on the OpenOrca instruction-tuned Mistral architecture, optimized for step-by-step reasoning
- Achieves strong performance with rapid convergence (best model within 2 epochs)
- Ideal for deployment scenarios requiring efficient training and inference
- Demonstrates robust extraction capabilities across diverse outbreak types
Citation
If you use this model in your research, please cite:
@article{consoli2025generative,
  title={Generative AI for Structured Epidemiological Information Extraction: Comparing In-Context Learning and Fine-Tuning Approaches},
  author={Consoli, Sergio and Bertolini, Lorenzo and Stefanovitch, Nicolas and Spagnolo, Luigi and Espinosa, Laura and Stilianakis, Nikolaos I.},
  journal={Epidemiology and Infection},
  note={Submitted, currently under revision},
  year={2025},
  publisher={Cambridge University Press}
}
Please also acknowledge the base models:
@article{jiang2023mistral,
title={Mistral 7B},
author={Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and others},
journal={arXiv preprint arXiv:2310.06825},
year={2023}
}
Ethical Considerations & Dual-Use Implications
Upon evaluation, we identified no dual-use implications for this model. The model is designed specifically for public health surveillance and epidemic intelligence applications to support global health initiatives.
Important Notes:
- The model should be used as a decision-support tool with appropriate human oversight
- Extracted information should be verified by public health professionals before making critical decisions
- The model does not replace human expertise in epidemiological analysis
- Privacy and data protection regulations should be followed when processing outbreak reports
Acknowledgments
We acknowledge:
- Mistral AI for developing and releasing Mistral 7B under the Apache License 2.0
- The OpenOrca team for their instruction-tuned Mistral variant
- The GPT@JRC initiative for providing access to LLMs
- The JRC Big Data Analytics Platform for computational infrastructure
- The WHO Epidemic Intelligence from Open Sources (EIOS) initiative for support
- Colleagues at the European Commission Joint Research Centre (JRC) and the European Centre for Disease Prevention and Control (ECDC)
Framework Versions
- Transformers: 4.52.4
- PyTorch: 2.3.1
- PEFT: 0.12.0
- Accelerate: 1.7.0
- BitsAndBytes: 0.43.3
- Datasets: 2.20.0
Disclaimer: The views expressed are purely those of the authors and may not in any circumstance be regarded as stating an official position of the European Commission.