DKSplit

DKSplit is a word segmentation model for concatenated text. It splits domain names, brand names, and phrases written without spaces into their component words.

Model Description

Architecture: BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
Format: ONNX with INT8 quantization
Size: ~9MB
Input: Lowercase a-z, 0-9 (max 64 characters)
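The input constraint above can be checked before calling the model. The following is a minimal sketch using only the standard library; `is_valid_input` and `normalize` are hypothetical helper names, not part of the `dksplit` package, and whether the package performs similar cleaning internally is not stated here.

```python
import re

MAX_LEN = 64  # maximum input length given in the model description
_ALLOWED = re.compile(r"[a-z0-9]+")


def is_valid_input(text: str) -> bool:
    """Return True if text is 1 to 64 characters of lowercase a-z and 0-9."""
    return len(text) <= MAX_LEN and _ALLOWED.fullmatch(text) is not None


def normalize(text: str) -> str:
    """Lowercase text and drop every character outside a-z and 0-9.

    Raises ValueError if nothing usable remains or the result is too long.
    """
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    if not cleaned:
        raise ValueError("no a-z or 0-9 characters left after cleaning")
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input longer than {MAX_LEN} characters")
    return cleaned
```

For example, `normalize("Chat-GPT Login")` yields `"chatgptlogin"`, which then satisfies `is_valid_input`. Dropping characters silently is a design choice of this sketch; rejecting such inputs outright may be safer in some pipelines.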

Usage

Install

pip install dksplit

Python

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]

Direct ONNX

import numpy as np
import onnxruntime as ort

# Load the INT8-quantized model; the CRF parameters are stored separately in dksplit.npz
session = ort.InferenceSession("dksplit-int8.onnx")
# See GitHub for full inference code
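Since the tensor names and shapes the model expects are not listed here, one way to discover them is to ask the session itself. The sketch below uses the standard `get_inputs()` and `get_outputs()` methods of an ONNX Runtime session; `describe_io` and `load_and_describe` are hypothetical helper names, and nothing is assumed about what the model actually reports.

```python
def describe_io(session):
    """Return (inputs, outputs) as lists of (name, shape, type) tuples."""
    inputs = [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]
    outputs = [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()]
    return inputs, outputs


def load_and_describe(model_path="dksplit-int8.onnx"):
    """Open the model file and report its input and output signatures."""
    # Imported here so describe_io can be used without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path)
    return describe_io(session)
```

Calling `load_and_describe()` in the directory containing the model file prints nothing by itself; inspect the returned lists to see which tensors to feed before consulting the full inference code on GitHub.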

Files

dksplit-int8.onnx - ONNX model (INT8 quantized)
dksplit.npz - CRF parameters
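To illustrate why CRF parameters ship alongside the network, here is a generic sketch of Viterbi decoding, the usual way a CRF layer turns per-character scores into a tag sequence. It is not DKSplit's actual inference code: the two-tag scheme (0 = character begins a word, 1 = character continues a word), the function names, and the score layout are all assumptions made for illustration, and the array names stored in `dksplit.npz` are not documented here.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][k]   -- score of tag k at position t (e.g. from the BiLSTM)
    transitions[j][k] -- score of moving from tag j to tag k (CRF parameters)
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for ending at tag k at this position.
            best_j = max(range(n_tags), key=lambda j: scores[j] + transitions[j][k])
            pointers.append(best_j)
            new_scores.append(scores[best_j] + transitions[best_j][k] + emit[k])
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    tag = max(range(n_tags), key=lambda k: scores[k])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path


def tags_to_words(text, tags, begin_tag=0):
    """Group characters into words; begin_tag marks the first char of a word."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == begin_tag or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

The transition scores let the decoder override a locally attractive tag when the sequence as a whole scores better another way, which is the main benefit of a CRF layer over picking the best tag per character. To see which arrays the parameter file really contains, `numpy.load("dksplit.npz").files` lists their names.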

Limitations

Input: a-z, 0-9 only
Max length: 64 characters
Non-Latin scripts: not supported directly; convert to the Romanized form first

License

MIT
