feifeinoban committed on
Commit 34f3df9 · verified
1 Parent(s): 412ac3b

Update index.html

Files changed (1)
  1. index.html +24 -31
index.html CHANGED
@@ -360,7 +360,7 @@
 
  <div class="container is-max-desktop has-text-centered">
  <h1 class="publication-title">Shell@Educhat</h1>
-
+
  <h2 class="subtitle is-4" style="color: #4a5568; font-weight: 400;">
  <span class="lang-en">Uncovering and Mitigating Implicit Risks in Domain-Specific LLMs</span>
  <span class="lang-zh" style="font-weight: 700;">大语言模型垂域任务隐式价值观风险挖掘与对齐基准</span>
@@ -374,12 +374,12 @@
  <div class="intro-content">
  <div class="lang-en">
  <p>
- Ensuring the safety of large language models (LLMs) in vertical domains (Education, Finance, Management) is critical. While current alignment efforts primarily target explicit risks like bias and violence, they often fail to address deeper, <strong>domain-specific implicit risks</strong>. We introduce <strong>a comprehensive dataset</strong> of 9,000 queries categorizing risks into Green (Guide), Yellow (Reflect), and Red (Deny), and <strong>MENTOR</strong>, a framework using a Rule Evolution Cycle (REC) and Activation Steering (RV) to effectively detect and mitigate these subtle risks.
+ Ensuring the safety of large language models (LLMs) in vertical domains (Education, Finance, Management) is critical. While current alignment efforts primarily target explicit risks like bias and violence, they often fail to address deeper, <strong>domain-specific implicit risks</strong>. We introduce <strong>a comprehensive dataset</strong> categorizing risks into Green (Guide), Yellow (Reflect), and Red (Deny), and <strong>MENTOR</strong>, a framework using a Rule Evolution Cycle (REC) and Activation Steering (RV) to effectively detect and mitigate these subtle risks.
  </p>
  </div>
  <div class="lang-zh">
  <p>
- 确保垂直领域(教育、金融、管理)中大模型的安全性至关重要。虽然目前的对齐工作主要针对偏见和暴力等显性风险,但往往忽略了更深层次的<strong>特定领域隐性风险</strong>。研发团队推出了<strong>一个包含9,000条查询的基准测试集</strong>,将风险分为引导、反思、禁止三类,以及 <strong>MENTOR</strong> 框架。该框架利用规则演化循环(REC)和激活引导(RV)技术,能够有效发现并缓解这些不易察觉的潜在风险。
+ 确保垂直领域(教育、金融、管理)中大模型的安全性至关重要。虽然目前的对齐工作主要针对偏见和暴力等显性风险,但往往忽略了更深层次的<strong>特定领域隐性风险</strong>。研发团队推出了<strong>一个包含多类场景的基准测试集</strong>,将风险分为引导、反思、禁止三类,以及 <strong>MENTOR</strong> 框架。该框架利用规则演化循环(REC)和激活引导(RV)技术,能够有效发现并缓解这些不易察觉的潜在风险。
  </p>
  </div>
  </div>
@@ -410,8 +410,8 @@
  <span class="lang-zh">领域任务隐式风险数据集</span>
  </h2>
  <p style="color: var(--text-muted);">
- <span class="lang-en">A domain-specific risk evaluation benchmark covering 9,000 queries.</span>
- <span class="lang-zh">涵盖9,000条查询的特定领域风险评估基准。</span>
+ <span class="lang-en">A domain-specific risk evaluation benchmark covering various queries.</span>
+ <span class="lang-zh">涵盖多类查询的特定领域风险评估基准。</span>
  </p>
  </div>
 
@@ -659,14 +659,14 @@
  </thead>
  <tbody>
  <tr>
- <td class="model-col">GPT-5-2025-08-07*</td>
- <td>0.313</td>
+ <td class="model-col">GPT-5-2025-08-07</td>
+ <td>0.308</td>
  <td>0.098</td>
- <td>0.041</td>
- <td>0.026</td>
- <td>0.363</td>
- <td>0.189</td>
+ <td>0.042</td>
+ <td>0.027</td>
+ <td>0.364</td>
  <td>0.370</td>
+ <td>0.190</td>
  <td>0.855</td>
  </tr>
  <tr>
@@ -687,8 +687,8 @@
  <td>0.131</td>
  <td>0.088</td>
  <td>0.696</td>
- <td>0.716</td>
  <td>0.844</td>
+ <td>0.716</td>
  <td>0.581</td>
  </tr>
  <tr>
@@ -709,8 +709,8 @@
  <td>0.030</td>
  <td>0.019</td>
  <td>0.492</td>
- <td>0.300</td>
  <td>0.518</td>
+ <td>0.300</td>
  <td>0.771</td>
  </tr>
  <tr>
@@ -719,9 +719,9 @@
  <td>0.070</td>
  <td>0.035</td>
  <td>0.021</td>
- <td>0.522</td>
  <td>0.672</td>
  <td>0.682</td>
+ <td>0.522</td>
  <td>0.659</td>
  </tr>
  <tr>
@@ -731,8 +731,8 @@
  <td>0.020</td>
  <td>0.011</td>
  <td>0.608</td>
- <td>0.328</td>
  <td>0.482</td>
+ <td>0.328</td>
  <td>0.749</td>
  </tr>
  <tr>
@@ -753,8 +753,8 @@
  <td>0.073</td>
  <td>0.059</td>
  <td>0.790</td>
- <td>0.912</td>
  <td>0.920</td>
+ <td>0.912</td>
  <td>0.496</td>
  </tr>
  <tr>
@@ -764,8 +764,8 @@
  <td>0.009</td>
  <td>0.003</td>
  <td>0.280</td>
- <td>0.174</td>
  <td>0.170</td>
+ <td>0.174</td>
  <td>0.906</td>
  </tr>
  <tr>
@@ -781,13 +781,13 @@
  </tr>
  <tr>
  <td class="model-col">Gemini-2.5-Pro</td>
- <td>0.442</td>
- <td>0.018</td>
+ <td>0.440</td>
+ <td>0.017</td>
  <td>0.003</td>
  <td>0.002</td>
- <td>0.425</td>
- <td>0.400</td>
+ <td>0.418</td>
  <td>0.502</td>
+ <td>0.400</td>
  <td>0.761</td>
  </tr>
  <tr>
@@ -797,8 +797,8 @@
  <td>0.005</td>
  <td>0.003</td>
  <td>0.426</td>
- <td>0.220</td>
  <td>0.346</td>
+ <td>0.220</td>
  <td>0.831</td>
  </tr>
  </tbody>
@@ -837,18 +837,11 @@
  <span class="lang-zh"><strong>免疫分 (Immunity Score):</strong> 量化了模型对隐性风险的抵抗能力 [0-1],越高越好。</span>
  </li>
  <li style="margin-top: 10px; color: #1a202c;">
- <span class="lang-en"><strong>Dataset Composition:</strong> This leaderboard is based on <strong>1,500 curated queries</strong>, equally distributed (500 each) across three vertical domains: <strong>Education (Edu), Management (Mgt), and Finance (Fin)</strong>.</span>
- <span class="lang-zh"><strong>数据集构成:</strong> 本排行榜基于 <strong>1,500 条精选查询</strong>,均匀分布(各500条)于三个垂直领域:<strong>教育 (Edu)、管理 (Mgt) 和金融 (Fin)</strong>。</span>
+ <span class="lang-en"><strong>Dataset Composition:</strong> This leaderboard is based on curated queries, equally distributed across three vertical domains: <strong>Education (Edu), Management (Mgt), and Finance (Fin)</strong>.</span>
+ <span class="lang-zh"><strong>数据集构成:</strong> 本排行榜基于精选查询集,均匀分布于三个垂直领域:<strong>教育 (Edu)、管理 (Mgt) 和金融 (Fin)</strong>。</span>
  </li>
  </ul>
  </div>
-
- <div class="content mt-2">
- <p class="is-size-7 has-text-grey">
- <span class="lang-en"><strong>* Note regarding GPT-5-2025-08-07:</strong> Due to platform safety mechanisms and request interceptions, this model was evaluated on 1302 out of 1500 queries.</span>
- <span class="lang-zh"><strong>* 关于 GPT-5-2025-08-07 的说明:</strong> 由于平台安全机制和请求拦截,该模型在 1500 条查询中实测了 1302 条。</span>
- </p>
- </div>
  </div>
  </section>
 