Upload README.md

README.md +62 -3

@@ -1,3 +1,62 @@
----
-license: apache-2.0
----

+urlbert-tiny-base-v4 is a lightweight BERT-based model optimized for URL analysis. It improves on the previous version in several ways:
+- Trained with a teacher-student architecture
+- Pre-trained primarily on masked token prediction
+- Distilled from the logits of a larger model
+- Further trained on 3 specialized tasks to improve its understanding of URL structure
+The result is an efficient model that can be fine-tuned quickly for URL classification tasks with minimal computational resources.
+## Model Details
+- **Parameters:** 3.72M
+- **Tensor Type:** F32
+- **Previous Version:** [urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)
+## Usage Example
+```python
+from transformers import BertTokenizerFast, BertForMaskedLM, pipeline
+import torch
+# Use the GPU when one is available, otherwise fall back to the CPU
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+print(f"Device: {device}")
+# Load the tokenizer and the masked-language-model head from the Hub
+model_name = "CrabInHoney/urlbert-tiny-base-v4"
+tokenizer = BertTokenizerFast.from_pretrained(model_name)
+model = BertForMaskedLM.from_pretrained(model_name)
+model.to(device)
+# Build a fill-mask pipeline (device=0 is the first GPU, -1 is the CPU)
+fill_mask = pipeline(
+    "fill-mask",
+    model=model,
+    tokenizer=tokenizer,
+    device=0 if torch.cuda.is_available() else -1
+)
+# URLs with one [MASK] token each; the model predicts the masked piece
+sentences = [
+    "http://example.[MASK]/"
+]
+for sentence in sentences:
+    print(f"\nInput: {sentence}")
+    results = fill_mask(sentence)
+    # Each result holds a candidate token and its probability
+    for result in results:
+        token_str = result['token_str']
+        score = result['score']
+        print(f"Predicted token: {token_str}, probability: {score:.4f}")
+```
+### Sample Output
+```
+Input: http://example.[MASK]/
+Predicted token: com, probability: 0.7307
+Predicted token: net, probability: 0.1319
+Predicted token: org, probability: 0.0881
+Predicted token: info, probability: 0.0094
+Predicted token: cn, probability: 0.0084
+```
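
The README mentions knowledge distillation from a larger model's logits but does not give the loss that was used. As a rough illustration only, the sketch below shows one common formulation: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names, the temperature of 2.0, and the example logits are all assumptions made for this sketch and are not taken from the model's training code. It is written in plain Python so the arithmetic is visible; an actual training loop would more likely use tensor operations.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature default is an assumption for illustration only.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p_t * (math.log(p_t) - math.log(p_s))
        for p_t, p_s in zip(p_teacher, p_student)
        if p_t > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits for one masked position over a 4-token vocabulary
teacher = [4.0, 1.5, 1.0, -2.0]
student = [2.5, 2.0, 0.5, -1.0]
loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice a term like this is often combined with the ordinary masked-token cross-entropy, though whether this model was trained that way is not stated in the README.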