DKSplit

A word segmentation model for concatenated text. It splits domain names, brand names, and run-together phrases into their component words.

Model Description

  • Architecture: BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
  • Format: ONNX with INT8 quantization
  • Size: ~9MB
  • Input: Lowercase a-z, 0-9 (max 64 characters)

Usage

Install

pip install dksplit

Python

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
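
Since domain names are a stated use case, here is a minimal sketch of a wrapper that prepares a domain before segmentation. The helper name `split_domain` and its `split_fn` parameter are hypothetical and not part of the dksplit package. It assumes the leftmost label is the part worth segmenting (so `www.example.com` would yield `www`) and treats hyphens as word boundaries that are already known.

```python
from typing import Callable, List


def split_domain(domain: str, split_fn: Callable[[str], List[str]]) -> List[str]:
    """Segment the leftmost label of a domain name into words.

    Hypothetical helper: split_fn is any callable that maps a string to a
    list of words, for example dksplit.split.
    """
    # Keep only the first label: "Chat-GPTLogin.com" -> "chat-gptlogin".
    label = domain.strip().lower().split(".")[0]
    words: List[str] = []
    # Hyphens are already word boundaries, so segment each piece on its own.
    for piece in label.split("-"):
        if piece:
            words.extend(split_fn(piece))
    return words
```

With the package installed, `split_domain("chatgptlogin.com", dksplit.split)` passes `"chatgptlogin"` to the model, the same input as the first example above. Taking the splitter as an argument keeps the helper testable without loading the model.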

Direct ONNX

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("dksplit-int8.onnx")
# See GitHub for full inference code
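
Because the CRF parameters ship in a separate file, decoding presumably happens outside the ONNX graph. The sketch below shows generic Viterbi decoding for a linear-chain CRF in plain Python. It is not DKSplit's actual inference code. It assumes the network produces one emission score per character and label, that the CRF supplies a label-to-label transition matrix, and that a two-label scheme is used (0 = continue the current word, 1 = start a new word). The function names and the label scheme are hypothetical.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])        # best score of any path ending in each label
    backptr: List[List[int]] = []     # backptr[t-1][j] = best previous label for j at t
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    label = max(range(n_labels), key=lambda j: score[j])
    path = [label]
    for ptr in reversed(backptr):
        label = ptr[label]
        path.append(label)
    return path[::-1]


def labels_to_words(text: str, labels: List[int]) -> List[str]:
    """Cut text before every character labeled 1 (assumed 'word start' label)."""
    words: List[str] = []
    for ch, lab in zip(text, labels):
        if lab == 1 or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

In a real pipeline the emissions would come from `session.run(...)` and the transitions from `dksplit.npz`. The tensor names, array keys, and label scheme are not documented here, so check them against the project's inference code before relying on this sketch.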

Files

  • dksplit-int8.onnx - ONNX model (INT8 quantized)
  • dksplit.npz - CRF parameters

Limitations

  • Input: a-z, 0-9 only
  • Max length: 64 characters
  • Non-Latin scripts: use Romanized form
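
The limits above can be enforced before calling the model. This is a minimal sketch using only the standard library. The name `normalize_input` is hypothetical, and the behavior (silently dropping disallowed characters, raising on over-length input) is one possible policy rather than what dksplit itself does. Accent stripping only helps with Latin diacritics. Non-Latin scripts are dropped entirely, so romanize them in a separate step first.

```python
import unicodedata

ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789")
MAX_LEN = 64  # maximum input length stated in the limitations


def normalize_input(text: str) -> str:
    """Lowercase text, strip diacritics, and drop characters outside a-z / 0-9."""
    # NFKD splits accented letters into base letter + combining mark,
    # so the filter below keeps the base letter and discards the mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    cleaned = "".join(ch for ch in decomposed if ch in ALLOWED)
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input has {len(cleaned)} characters; the limit is {MAX_LEN}")
    return cleaned
```

An empty result signals that nothing usable remained, which is worth checking before passing the string to `dksplit.split`.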

License

MIT
