Updated table style
README.md
CHANGED

@@ -21,7 +21,7 @@ It outperforms other open-source models in the same space on standard benchmarks
- **Cookbook:** [Granite Guardian Recipes](https://github.com/ibm-granite/granite-guardian/tree/main/cookbooks/granite-guardian-3.3)
- **Website**: [Granite Guardian Docs](https://www.ibm.com/granite/docs/models/guardian/)
- **Paper:** [Granite Guardian](https://arxiv.org/abs/2412.07724)
-- **Release Date**:
+- **Release Date**: August 1, 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage

@@ -141,7 +141,7 @@ score, _ = parse_response(response)
print(f"# score: {score}\n") # score: yes
```

-#### Example 3: Detect lack of
+#### Example 3: Detect lack of groundedness of model's response in RAG settings

Here you see how to use the Granite Guardian in thinking mode by passing ```think=True``` in the ```apply_chat_template``` method.
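As a rough illustration of that call (a minimal sketch, not taken from this diff: it assumes the chat template accepts `guardian_config` and `think` keyword arguments, and the `risk_name` value and message roles shown are placeholders to be checked against the full examples in the model card):

```python
# Illustrative sketch only: assumes the ibm-granite/granite-guardian-3.3-8b chat template
# accepts `guardian_config` and `think` as template kwargs; roles and risk names here are
# placeholders, adapt them to the full examples in the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)

messages = [
    {"role": "context", "content": "One benefit of solar energy is that it is renewable."},
    {"role": "assistant", "content": "Solar energy is the cheapest energy source in every country."},
]

# think=True asks the guardian to emit its reasoning trace before the final yes/no verdict.
input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "groundedness"},  # hypothetical risk name for this example
    think=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=1024)

print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```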

@@ -242,44 +242,272 @@ The model is also equipped to detect hallucinations in agentic workflows, such a
Following the general harm definition, Granite-Guardian-3.3-8B is evaluated across the standard benchmarks of [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response).
The following table presents the F1 scores for various harm benchmarks, along with the aggregate F1 score.

-
-
-
-
-
-
+<table>
+<caption style="text-align:center"><b>Harm</b></caption>
+<thead>
+<tr>
+<th style="text-align:left; background-color: #001d6c; color: white;">Model</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AggregateF1</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AegisSafetyTest</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">BeaverTails</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">HarmBench_Prompt</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">OAI_hf</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">SafeRLHF_test</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">simpleSafetyTest</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">toxic_chat</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">xstest_RH</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">xstest_RR</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">xstest_RR(h)</th>
+</tr></thead>
+<tbody>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.1-8b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.79 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.88 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.81 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.78 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.81 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.99 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.73 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.87 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.45 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.83 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.2-5b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.78 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.88 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.81 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.73 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.99 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.73 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.90 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.43 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.82 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (no_think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.81 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.87 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.84 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.77 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.99 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.76 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.90 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.49 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.87 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.79 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.86 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.82 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.80 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.78 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.78 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.99 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.69 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.86 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.50 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.86 </td>
+</tr>
+</tbody></table>

### RAG Hallucination Benchmarks
For detecting hallucinations in RAG settings, the model is evaluated on the [LM-AggreFact](https://llm-aggrefact.github.io/) benchmark. We report balanced accuracy scores on LM-AggreFact below:

-
-
-
-
-
-
+<table>
+<caption style="text-align:center"><b>LM-Aggrefact</b></caption>
+<thead>
+<tr>
+<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AVG</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AggreFact-CNN</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AggreFact-XSum</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">ClaimVerify</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">ExpertQA</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">FactCheck-GPT</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">Lfqa</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">RAGTruth</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">Reveal</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">TofuEval-MediaS</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">TofuEval-MeetB</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">Wice</th>
+</tr></thead>
+<tbody>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.1-8b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.709 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.532 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.570 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.724 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.597 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.759 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.855 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.768 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.877 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.725 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.761 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.635 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.2-5b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.665 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.508 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.530 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.650 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.596 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.743 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.808 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.630 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.872 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.691 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.685 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.604 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (no_think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.761 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.669 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.738 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.767 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.596 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.729 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.878 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.831 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.894 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.736 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.815 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.720 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.765 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.661 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.749 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.759 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.597 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.766 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.870 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.821 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.896 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.739 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.789 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.773 </td>
+</tr>
+</tbody></table>

We also report performance on the TRUE benchmark (balanced accuracy), which measures the faithfulness of LLM responses to the context.

-
-
-
-
-
-
+<table>
+<caption style="text-align:center"><b>TRUE</b></caption>
+<thead>
+<tr>
+<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
+<th style="text-align:left; background-color: #001d6c; color: white;">AVG</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">begin</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">dialfact</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">frank</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">mnbm</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">paws</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">q2</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">qags_cnndm</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">qags_xsum</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">summeval</th>
+</tr></thead>
+<tbody>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.1-8b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.725 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.714 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.630 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.835 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.648 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.780 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.710 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.756 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.717 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.733 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.2-5b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.710 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.740 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.694 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.791 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.635 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.749 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.727 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.723 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.660 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.672 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (no_think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.777 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.733 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.684 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.886 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.660 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.825 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.801 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.814 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.796 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.796 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.773 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.732 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.722 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.864 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.680 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.813 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.799 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.792 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.798 </td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.761 </td>
+</tr>
+</tbody></table>

### Function Calling Hallucination Benchmarks
Model performance is evaluated on the [FC Reward Bench](https://huggingface.co/datasets/ibm-research/fc-reward-bench) dataset. We use balanced accuracy as the metric to compare the various models.

-
-
-
-
-
-
+<table>
+<caption style="text-align:center"><b>fc-reward-bench</b></caption>
+<thead>
+<tr>
+<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
+<th style="text-align:center; background-color: #001d6c; color: white;">AVG</th>
+</tr></thead>
+<tbody>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.1-8b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.64 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.2-5b</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.61 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (no_think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;"> 0.74 </td>
+</tr>
+<tr>
+<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">granite-guardian-3.3-8b (think)</td>
+<td style="text-align:center; background-color: #FFFFFF; color: black;"> 0.71 </td>
+</tr>
+</tbody></table>

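The RAG and function-calling evaluations above report balanced accuracy, i.e. recall averaged over the positive and negative classes. A minimal sketch of the metric itself, using made-up labels rather than the actual evaluation outputs:

```python
# Balanced accuracy = (TPR + TNR) / 2, i.e. recall averaged over both classes.
# Toy labels for illustration only; scikit-learn's helper computes the same value.
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = hallucinated, 0 = faithful (illustrative)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tpr = sum(p == t == 1 for p, t in zip(y_pred, y_true)) / y_true.count(1)  # recall on positives
tnr = sum(p == t == 0 for p, t in zip(y_pred, y_true)) / y_true.count(0)  # recall on negatives

print((tpr + tnr) / 2)                          # 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```
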
## Training Data