trapoom555
/

MiniCPM-2B-Text-Embedding-cft

@@ -1,125 +1,125 @@
----
-license: mit
-language:
-- en
-tags:
-- sentence-embedding
-- sentence-similarity
-- transformers
-- feature-extraction
-pipeline_tag: sentence-similarity
----
-# MiniCPM-2B-Text-Embedding-cft
-## Description
-This is a fine-tuned version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) to perform Text Embedding tasks. The model is fine-tuned using the Contrastive Fine-tuning and LoRA technique on NLI datasets.
-## Base Model
-[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)
-## Usage
-1. Clone MiniCPM-2B-dpo-bf16 repository
-```bash
-git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
-```
-2. Change a tokenizer setting in `tokenizer_config.json`
-```json
-"add_eos_token": true
-```
-3. Use the model
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-import numpy as np
-class MiniCPMSentenceEmbedding:
-    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
-        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
-        self.model = AutoModelForCausalLM.from_pretrained(model_path,
-                                                          torch_dtype=torch.bfloat16,
-                                                          device_map='cuda',
-                                                          trust_remote_code=True)
-        if adapter_path != None:
-            # Load fine-tuned LoRA
-            self.model.load_adapter(adapter_path)
-    def get_last_hidden_state(self, text):
-        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
-        with torch.no_grad():
-            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
-        return out.squeeze().float().cpu().numpy()
-    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
-        """
-        Returns a list of embeddings for the given sentences.
-        Args:
-            sentences: List of sentences to encode
-        Returns:
-            List of embeddings for the given sentences
-        """
-        out = []
-        for s in sentences:
-            out.append(self.get_last_hidden_state(s))
-        return out
-minicpm_sentence_embedding = PhiSentenceEmbedding(<your-cloned-base-model-path>, 'trapoom555/MiniCPM-2B-Text-Embedding-cft')
-example_sentences = ["I don't like apples", "I like apples"]
-encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
-print(encoded_sentences)
-```
-## Training Details
-| **Training Details**    | **Value**         |
-|-------------------------|-------------------|
-| Loss                    | InfoNCE           |
-| Batch Size              | 60                |
-| InfoNCE Temperature     | 0.05              |
-| Learning Rate           | 5e-05             |
-| Warmup Steps            | 100               |
-| Learning Rate Scheduler | CosineAnnealingLR |
-| LoRA Rank               | 8                 |
-| LoRA Alpha              | 32                |
-| LoRA Dropout            | 0.1               |
-| Training Precision      | bf16              |
-| Max Epoch               | 1                 |
-| GPU                     | RTX3090           |
-| Num GPUs                | 4                 |
-## Training Scripts
-**_(coming soon...)_**
-## Checkpoints
-We provide checkpoints every 500 training steps which can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).
-## Evaluation Results
-**_(coming soon...)_**
-## Contributors
-Trapoom Ukarapol, Zhicheng Lee, Amy Xin
-## Foot Notes
 This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.

+---
+license: mit
+language:
+- en
+tags:
+- sentence-embedding
+- sentence-similarity
+- transformers
+- feature-extraction
+pipeline_tag: sentence-similarity
+---
+# MiniCPM-2B-Text-Embedding-cft
+## Description
+This is a version of [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) fine-tuned for text embedding tasks. It was trained with contrastive fine-tuning and LoRA on NLI datasets.
+## Base Model
+[MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16)
+## Usage
+1. Clone MiniCPM-2B-dpo-bf16 repository
+```bash
+git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
+```
+2. Change a tokenizer setting in `tokenizer_config.json`
+```json
+"add_eos_token": true
+```
+3. Use the model
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+import numpy as np
+class MiniCPMSentenceEmbedding:
+    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+        self.model = AutoModelForCausalLM.from_pretrained(model_path,
+                                                          torch_dtype=torch.bfloat16,
+                                                          device_map='cuda',
+                                                          trust_remote_code=True)
+        if adapter_path is not None:
+            # Load fine-tuned LoRA
+            self.model.load_adapter(adapter_path)
+    def get_last_hidden_state(self, text):
+        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
+        with torch.no_grad():
+            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
+        return out.squeeze().float().cpu().numpy()
+    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
+        """
+        Returns a list of embeddings for the given sentences.
+        Args:
+            sentences: List of sentences to encode
+        Returns:
+            List of embeddings for the given sentences
+        """
+        out = []
+        for s in sentences:
+            out.append(self.get_last_hidden_state(s))
+        return out
+# Replace the first argument with the path to your cloned base model
+minicpm_sentence_embedding = MiniCPMSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/MiniCPM-2B-Text-Embedding-cft')
+example_sentences = ["I don't like apples", "I like apples"]
+encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
+print(encoded_sentences)
+```
+## Training Details
+| **Training Details**    | **Value**         |
+|-------------------------|-------------------|
+| Loss                    | InfoNCE           |
+| Batch Size              | 60                |
+| InfoNCE Temperature     | 0.05              |
+| Learning Rate           | 5e-05             |
+| Warmup Steps            | 100               |
+| Learning Rate Scheduler | CosineAnnealingLR |
+| LoRA Rank               | 8                 |
+| LoRA Alpha              | 32                |
+| LoRA Dropout            | 0.1               |
+| Training Precision      | bf16              |
+| Max Epoch               | 1                 |
+| GPU                     | RTX3090           |
+| Num GPUs                | 4                 |
+## Training Scripts
+The training scripts for this model are available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).
+## Checkpoints
+We provide checkpoints every 500 training steps which can be found [here](https://huggingface.co/trapoom555/MiniCPM-2B-Text-Embedding-cft-checkpoints).
+## Evaluation Results
+**_(coming soon...)_**
+## Contributors
+Trapoom Ukarapol, Zhicheng Lee, Amy Xin
+## Foot Notes
 This project is the topic-free final project of the Tsinghua University NLP course for Spring 2024.
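
Sentence embeddings such as those returned by `encode` in the usage snippet are commonly compared with cosine similarity. The sketch below is not part of the original model card. `cosine_similarity` is a hypothetical helper, and the short toy vectors stand in for real embeddings so that the example runs without a GPU or the model weights.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (assumption)
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: orthogonal
```

With the real model, the same helper could be applied to two entries of `encoded_sentences`, since it only iterates over the vector elements and works on NumPy arrays as well as lists.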