openbmb/MiniCPM-Embedding

@@ -1,28 +1,34 @@
-## MiniCPM-R
-**MiniCPM-R** 是面壁智能与清华大学自然语言处理实验室（THUNLP）共同开发的中英双语言文本嵌入模型，有如下特点：
 - 出色的中文、英文检索能力。
 - 出色的中英跨语言检索能力。
-MiniCPM-R 基于 [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) 训练，结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式，共使用包括开源数据、机造数据、闭源数据在内的约 600 万条训练数据。
 欢迎关注 RAG 套件系列：
-- 检索模型：[MiniCPM-R](https://huggingface.co/openbmb/MiniCPM-R)
-- 重排模型：[MiniCPM-RR](https://huggingface.co/openbmb/MiniCPM-RR)
 - 面向 RAG 场景的 LoRA 插件：[MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
-**MiniCPM-R** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. and THUNLP, featuring:
 - Exceptional Chinese and English retrieval capabilities.
 - Outstanding cross-lingual retrieval capabilities between Chinese and English.
-MiniCPM-R is trained from [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and uses bidirectional attention and Weighted Mean Pooling [1] in its architecture. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data.
 We also invite you to explore the RAG toolkit series:
-- Retrieval Model: [MiniCPM-R](https://huggingface.co/openbmb/MiniCPM-R)
-- Re-ranking Model: [MiniCPM-RR](https://huggingface.co/openbmb/MiniCPM-RR)
 - LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
 [1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
@@ -42,7 +48,7 @@ We also invite you to explore the RAG toolkit series:
 本模型支持 query 侧指令，格式如下：
-MiniCPM-R supports query-side instructions in the following format:
 ```
 Instruction: {{ instruction }} Query: {{ query }}
@@ -62,7 +68,7 @@ Instruction: Given a claim about climate change, retrieve documents that support
 也可以不提供指令，即采取如下格式：
-MiniCPM-R also works in instruction-free mode in the following format:
 ```
 Query: {{ query }}
@@ -87,7 +93,7 @@ from transformers import AutoModel, AutoTokenizer
 import torch
 import torch.nn.functional as F
-model_name = "openbmb/MiniCPM-R"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
 model.eval()
@@ -145,8 +151,8 @@ print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]
 | gte-Qwen2-1.5B-instruct      | 71.86             | 58.29         |
 | gte-Qwen2-7B-instruct        | 76.03             | 60.25         |
 | bge-multilingual-gemma2      | 73.73             | 59.24         |
-| MiniCPM-R                    | **76.76**         | 58.56         |
-| MiniCPM-R+MiniCPM-RR         | 77.08             | 61.61         |
 ### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
@@ -157,15 +163,15 @@ print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]
 | gte-multilingual-base(Dense) | 68.2               | 39.46              | 45.86              |
 | gte-Qwen2-1.5B-instruct      | 68.52              | 49.11              | 45.05              |
 | gte-Qwen2-7B-instruct        | 68.27              | 49.14              | 49.6               |
-| MiniCPM-R                    | **72.95**          | **52.65**          | **49.95**          |
-| MiniCPM-R+MiniCPM-RR         | 74.33              | 53.21              | 54.12              |
 ## 许可证 License
 - 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
-- MiniCPM-R 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
-- MiniCPM-R 模型权重对学术研究完全开放。如需将模型用于商业用途，请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
 * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
-* Use of the MiniCPM-R model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
-* The MiniCPM-R models and weights are completely free for academic research. After you fill out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, the MiniCPM-R weights are also available for commercial use free of charge.

+---
+language:
+- zh
+- en
+base_model: openbmb/MiniCPM-2B-sft-bf16
+---
+## RankCPM-E
+**RankCPM-E** 是面壁智能与清华大学自然语言处理实验室（THUNLP）共同开发的中英双语言文本嵌入模型，有如下特点：
 - 出色的中文、英文检索能力。
 - 出色的中英跨语言检索能力。
+RankCPM-E 基于 [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) 训练，结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式，共使用包括开源数据、机造数据、闭源数据在内的约 600 万条训练数据。
 欢迎关注 RAG 套件系列：
+- 检索模型：[RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
+- 重排模型：[RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
 - 面向 RAG 场景的 LoRA 插件：[MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
+**RankCPM-E** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. and THUNLP, featuring:
 - Exceptional Chinese and English retrieval capabilities.
 - Outstanding cross-lingual retrieval capabilities between Chinese and English.
+RankCPM-E is trained from [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and uses bidirectional attention and Weighted Mean Pooling [1] in its architecture. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data.
 We also invite you to explore the RAG toolkit series:
+- Retrieval Model: [RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
+- Re-ranking Model: [RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
 - LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
 [1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
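The Weighted Mean Pooling mentioned above can be illustrated with a small, dependency-free sketch. It assumes the position-weighted scheme described in SGPT [1], where each token's weight grows linearly with its position and padding tokens are skipped. The function name `weighted_mean_pool` and the plain-list inputs are hypothetical, chosen only to make the arithmetic visible; the model's own remote code may differ in detail.

```python
from typing import List


def weighted_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Position-weighted mean of token vectors, ignoring padding (mask == 0).

    Assumption: the i-th real token gets weight i (1-based), as in SGPT-style
    weighted mean pooling. This is an illustration, not the model's own code.
    """
    dim = len(hidden[0])
    pooled = [0.0] * dim
    total_weight = 0.0
    position = 0  # counts only real tokens, so padding never shifts the weights
    for vec, keep in zip(hidden, mask):
        if not keep:
            continue
        position += 1
        total_weight += position
        for j in range(dim):
            pooled[j] += position * vec[j]
    if total_weight == 0:
        raise ValueError("mask selects no tokens")
    return [x / total_weight for x in pooled]


# Two real tokens (weights 1 and 2) followed by one padding token.
example = weighted_mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
print(example)  # roughly [1/3, 2/3]
```

Later tokens dominate the average, which suits a model derived from a decoder where later positions have seen more context. In practice the same computation would be done on batched tensors rather than Python lists.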
 本模型支持 query 侧指令，格式如下：
+RankCPM-E supports query-side instructions in the following format:
 ```
 Instruction: {{ instruction }} Query: {{ query }}
 也可以不提供指令，即采取如下格式：
+RankCPM-E also works in instruction-free mode in the following format:
 ```
 Query: {{ query }}
 import torch
 import torch.nn.functional as F
+model_name = "openbmb/RankCPM-E"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
 model.eval()
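The two query formats described in this README (with and without an instruction) can be wrapped in one helper that runs before tokenization. This is a minimal sketch: `format_query` is a hypothetical name and not part of the model's API, and only the two templates themselves come from the text.

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query-side input string.

    With an instruction:    "Instruction: {instruction} Query: {query}"
    Without an instruction: "Query: {query}"
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


# Hypothetical inputs, for illustration only.
print(format_query("What causes tides?"))
print(format_query("What causes tides?", "Retrieve passages that answer the question."))
```

Keeping the template in one place makes it harder for queries to be encoded in a format the model was not trained on.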
 | gte-Qwen2-1.5B-instruct      | 71.86             | 58.29         |
 | gte-Qwen2-7B-instruct        | 76.03             | 60.25         |
 | bge-multilingual-gemma2      | 73.73             | 59.24         |
+| RankCPM-E                    | **76.76**         | 58.56         |
+| RankCPM-E+RankCPM-R          | 77.08             | 61.61         |
 ### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
 | gte-multilingual-base(Dense) | 68.2               | 39.46              | 45.86              |
 | gte-Qwen2-1.5B-instruct      | 68.52              | 49.11              | 45.05              |
 | gte-Qwen2-7B-instruct        | 68.27              | 49.14              | 49.6               |
+| RankCPM-E                    | **72.95**          | **52.65**          | **49.95**          |
+| RankCPM-E+RankCPM-R          | 74.33              | 53.21              | 54.12              |
 ## 许可证 License
 - 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
+- RankCPM-E 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
+- RankCPM-E 模型权重对学术研究完全开放。如需将模型用于商业用途，请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
 * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
+* Use of the RankCPM-E model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+* The RankCPM-E models and weights are completely free for academic research. After you fill out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, the RankCPM-E weights are also available for commercial use free of charge.