DKSplit
Word segmentation model for concatenated text. It splits domain names, brand names, and phrases into their component words.
Model Description
- Architecture: BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
- Format: ONNX with INT8 quantization
- Size: ~9MB
- Input: Lowercase a-z, 0-9 (max 64 characters)
Usage
Install
pip install dksplit
Python
import dksplit
dksplit.split("chatgptlogin")
# ['chatgpt', 'login']
dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
Direct ONNX
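import onnxruntime as ort
import numpy as np
# Load the quantized network and the CRF parameters listed under Files
session = ort.InferenceSession("dksplit-int8.onnx")
crf_params = np.load("dksplit.npz")
# See GitHub for full inference code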
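When running the ONNX model directly, the CRF layer of a BiLSTM-CRF is typically decoded with the Viterbi algorithm. The sketch below shows that step in plain Python on toy scores. It is not the DKSplit inference code: the function names, the two-tag scheme (0 = continue the current word, 1 = start a new word), and the layout of the emission and transition scores are all assumptions made for illustration. The actual tensor names and the contents of dksplit.npz are not documented here.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for one input string.

    emissions[t][k]   : score of tag k at character t (assumed layout)
    transitions[j][k] : score of moving from tag j to tag k (assumed layout)
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(num_tags):
            # Best previous tag for ending this step in tag k
            best_prev = max(range(num_tags),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path
    tag = max(range(num_tags), key=lambda k: score[k])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def tags_to_words(text, tags):
    """Cut text before every character tagged 1 (assumed: 1 = word start)."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == 1 or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy scores that favour a word start at positions 0 and 6
text = "openaikey"
emissions = [[0.0, 2.0] if i in (0, 6) else [2.0, 0.0] for i in range(len(text))]
transitions = [[0.0, 0.0], [0.0, -1.0]]  # discourage two word starts in a row
print(tags_to_words(text, viterbi(emissions, transitions)))  # ['openai', 'key']
```

In a real pipeline the emission scores would come from the ONNX session output and the transition scores from the CRF parameter file, both mapped to whatever tag scheme the model was trained with.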
Files
- dksplit-int8.onnx - ONNX model (INT8 quantized)
- dksplit.npz - CRF parameters
Limitations
- Input: a-z, 0-9 only
- Max length: 64 characters
- Non-Latin scripts: use Romanized form
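Because of these limits, raw strings usually need cleaning before they reach the model. The helper below is a minimal sketch of one way to do that. `prepare_input` and `MAX_LEN` are hypothetical names, not part of the dksplit package, and the choice to drop unsupported characters instead of rejecting them is an assumption. It does not romanize non-Latin text; that has to happen beforehand.

```python
import re

MAX_LEN = 64  # maximum input length stated in this README


def prepare_input(text: str) -> str:
    """Lowercase text and drop every character outside a-z and 0-9."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    if not cleaned:
        raise ValueError("nothing left after removing unsupported characters")
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input has {len(cleaned)} characters, limit is {MAX_LEN}")
    return cleaned


print(prepare_input("Chat-GPT Login"))  # chatgptlogin
```

Raising on over-long input, instead of silently truncating it, keeps a caller from getting a segmentation of only part of the string.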
Links
License
MIT