|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
license: cc-by-4.0 |
|
|
tags: |
|
|
- kl3m |
|
|
- kl3m-002 |
|
|
- legal |
|
|
- financial |
|
|
- enterprise |
|
|
- slm |
|
|
- gpt-neox |
|
|
date: '2024-02-20T00:00:00.000Z' |
|
|
pipeline_tag: text-generation |
|
|
widget: |
|
|
- text: "Medical devices are regulated by"
  temperature: 0.3
  do_sample: true
|
|
--- |
|
|
|
|
|
# kl3m-002-170m |
|
|
|
|
|
kl3m-002-170m is a (very) small language model (SLM) trained on clean, legally-permissible data. Originally |
|
|
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai), |
|
|
kl3m-002-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications) |
|
|
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows, |
|
|
with a focus on low toxicity and high efficiency. |
|
|
|
|
|
Given its small size and the absence of instruction-aligned training data, kl3m-002-170m is best suited either as a base for SLM fine-tuning or as a component in training larger models without relying on unethically sourced data or models.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture) |
|
|
- **Size**: 170 million parameters |
|
|
- **Hidden Size**: 1024 |
|
|
- **Layers**: 16 |
|
|
- **Attention Heads**: 16 |
|
|
- **Key-Value Heads**: 8 |
|
|
- **Intermediate Size**: 1024 |
|
|
- **Max Sequence Length**: 4,096 tokens (true size, no sliding window) |
|
|
- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling) |
|
|
- **Language(s)**: Primarily English |
|
|
- **Training Objective**: Next token prediction |
|
|
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai) |
|
|
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
|
|
- **Hardware Requirements**: Runs in real time in fp32 on a MacBook Air M1
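The reported configuration can be checked directly against the files published on the Hugging Face Hub. Below is a minimal sketch using the standard `transformers` `AutoConfig`/`AutoTokenizer` interfaces; the expected values in the comments simply restate the details listed above and should be confirmed against the repository.

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "alea-institute/kl3m-002-170m"

# Load the published configuration and tokenizer from the Hub
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Cross-check against the details listed above
print(config.model_type)               # expected: gpt_neox
print(config.hidden_size)              # expected: 1024
print(config.num_hidden_layers)        # expected: 16
print(config.num_attention_heads)      # expected: 16
print(config.max_position_embeddings)  # expected: 4096
print(tokenizer.vocab_size)            # expected: 32768
```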
|
|
|
|
|
## Use Cases |
|
|
|
|
|
kl3m-002-170m is particularly effective for: |
|
|
|
|
|
- Basic regulatory question answering |
|
|
- Contract provision drafting |
|
|
- Structured JSON information extraction |
|
|
- Foundation for downstream optimization |
|
|
- Base model for domain-specific fine-tuning |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Perplexity Scores |
|
|
| Dataset        | Score |
|----------------|-------|
| Wiki           | 19.58 |
| CNN/Daily Mail | 11.20 |
| Legal Domain   | 2.31  |
|
|
|
|
|
The model demonstrates particularly strong per-parameter performance on legal domain content, outperforming many larger models on this domain as of its release.
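For reference, perplexity is the exponential of the mean next-token cross-entropy loss. The sketch below shows a minimal way to compute it for a single passage with `transformers`; the evaluation corpora and windowing protocol behind the scores above are not reproduced here, and the passage is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alea-institute/kl3m-002-170m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Illustrative passage; not one of the benchmark datasets above
text = "The Commission proposes to amend its rules to require additional disclosure."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids returns the mean next-token
    # cross-entropy loss over the sequence
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```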
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible. |
|
|
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity) |
|
|
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows. |
|
|
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware. |
|
|
|
|
|
## Usage |
|
|
|
|
|
Basic usage for text generation: |
|
|
|
|
|
```python
import json

from transformers import pipeline

# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-002-170m', device='cpu')

# Example usage on CPU
text = "Under this"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```
|
|
|
|
|
```json
[
  "Under this proposed rule, the Federal agency must determine the effect on State, local, and",
  "Under this proposed rule, we are proposing to amend the definition of \u201ccovered product\u201d in ",
  "Under this proposed rule, the FAA is considering issuing this proposed rule after evaluating the information"
]
```
|
|
|
|
|
### Contract Example |
|
|
|
|
|
```python
text = "Governing Law.\n"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.3, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```
|
|
|
|
|
```json
[
  "Governing Law.\n The provisions of the Plan shall be construed and enforced in accordance with",
  "Governing Law.\n The laws of the State of Delaware shall govern the validity, construction, and",
  "Governing Law.\n The laws of the State of New York shall govern the validity, construction, enforcement"
]
```
|
|
|
|
|
### Generation Parameters |
|
|
|
|
|
The model supports various parameters to control the generation process: |
|
|
|
|
|
- `temperature`: Controls randomness (lower = more deterministic) |
|
|
- `top_p`: Nucleus sampling parameter (lower = more focused) |
|
|
- `top_k`: Limits vocabulary selection to top k tokens |
|
|
- `max_new_tokens`: Maximum number of tokens to generate |
|
|
- `do_sample`: Whether to use sampling vs. greedy decoding |
|
|
- `num_return_sequences`: Number of different sequences to generate |
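As an illustration, these parameters can be combined in a single call to the pipeline created in the Usage section above; the values below are arbitrary examples rather than recommended settings.

```python
# Reuses the `p` pipeline from the Usage example above
results = p(
    "The Commission finds that",
    do_sample=True,          # sample rather than greedy decode
    temperature=0.4,         # lower = more deterministic
    top_p=0.9,               # nucleus sampling cutoff
    top_k=50,                # restrict sampling to the 50 most likely tokens
    max_new_tokens=32,       # cap on generated length
    num_return_sequences=2,  # number of completions to return
)

for r in results:
    print(r["generated_text"])
```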
|
|
|
|
|
## Training |
|
|
|
|
|
The model was originally trained between November 2023 and January 2024 on a single node with 12 RTX 4090 GPUs using distributed data parallel (DDP) training. A similar model, with complete source code and data for replication, is being provided as part of the `kl3m-004` family, scheduled for release in Q4 2024.
|
|
|
|
|
Several techniques were used during training:
|
|
|
|
|
- Hybrid next-token prediction (NTP) and supervised fine-tuning (SFT) co-training
|
|
- Dynamic, document-aware segmentation |
|
|
- Randomized padding |
|
|
- Traditional fixed-attention mechanisms |
|
|
|
|
|
### Training Data |
|
|
|
|
|
While the original training data collection and training infrastructure rely on software that was not donated by 273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication code and an API.
|
|
|
|
|
[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data) |
|
|
|
|
|
Data is currently available upon request via S3 under a Requester Pays model. We are actively working toward a zero-cost distribution model as soon as we can secure additional support.
|
|
|
|
|
This model, the original `kl3m-002-170m`, was trained on a US-only subset of the Kelvin Legal DataPack that we believe consists entirely of public domain material. However, to provide maximum transparency to all downstream users in the event of any future determination to the contrary, we are licensing this model under CC-BY 4.0.
|
|
|
|
|
## Intended Usage |
|
|
|
|
|
This model is intended for use in: |
|
|
|
|
|
- Legal and regulatory document processing systems |
|
|
- Contract drafting assistance |
|
|
- Financial and enterprise document workflows |
|
|
- Educational contexts for learning about domain-specific language models |
|
|
- Research on small, efficient language models |
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
kl3m-002-170m uses the following special tokens: |
|
|
|
|
|
- `<s>` (ID: 0): Beginning of sequence token (BOS) |
|
|
- `</s>` (ID: 1): End of sequence token (EOS) |
|
|
- `<pad>` (ID: 2): Padding token |
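These assignments can be confirmed from the tokenizer itself; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-170m")

# Special tokens and their IDs as listed above
print(tokenizer.bos_token, tokenizer.bos_token_id)  # <s>, 0
print(tokenizer.eos_token, tokenizer.eos_token_id)  # </s>, 1
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <pad>, 2
```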
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Limited to a 4,096 token context window |
|
|
- As a small language model (170M parameters), it has limited general knowledge |
|
|
- Not instruction-tuned or aligned with human preferences |
|
|
- May generate plausible-sounding but incorrect legal or regulatory text |
|
|
- Not a substitute for professional legal advice or domain expertise |
|
|
- Performance is optimized for legal and financial domains; general performance may be lower |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- This model should not be used to generate legal advice without human expert review |
|
|
- The model may reflect biases present in the training data despite efforts to use clean data |
|
|
- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness |
|
|
|
|
|
## Source |
|
|
|
|
|
[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research) |
|
|
|
|
|
## References |
|
|
|
|
|
- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247) |
|
|
- Additional tokenizer, dataset, and model publications are pending. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{kl3m-002-170m, |
|
|
author = {ALEA Institute}, |
|
|
title = {kl3m-002-170m: A Small Language Model for Legal and Regulatory Text}, |
|
|
year = {2024}, |
|
|
publisher = {Hugging Face}, |
|
|
howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-170m}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model was originally developed by 273 Ventures and has been donated to the ALEA Institute. |
|
|
|
|
|
The model weights are released under the CC-BY 4.0 License. |
|
|
|
|
|
## Contact |
|
|
|
|
|
The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries: |
|
|
|
|
|
- GitHub: https://github.com/alea-institute/kl3m-model-research |
|
|
- Email: hello@aleainstitute.ai |
|
|
- Website: https://aleainstitute.ai |
|
|
|
|
|
 |
|
|
|