Update README.md
# Testing Environment

- Frontend is the staging version of Silly Tavern.
- Backend is the latest version of KoboldCPP for Windows using CUDA 12 (see the sketches after this list).
- Using **CuBLAS** but **not using QuantMatMul (mmq)**.
- **7-10B Models:**
  - All models are loaded in Q8_0 (GGUF).
  - All models are extended to **16K context length** (auto rope from KCPP).
  - **Flash Attention** and **ContextShift** are enabled.
- **11-15B Models:**
  - All models are loaded in Q4_K_M or the highest/closest quant available (GGUF).
  - All models are extended to **12K context length** (auto rope from KCPP).
  - **Flash Attention** and **8-bit cache compression** are enabled.
- Response size is set to 1024 tokens max.
- Fixed seed for all tests: **123**
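
To make the backend settings above concrete, here is a minimal sketch of how they could map onto a KoboldCPP launch for a 7-10B model. The flag names (`--usecublas`, `--contextsize`, `--flashattention`, `--quantkv`) and the model filename are assumptions based on recent KoboldCPP builds, not the exact command used for these tests.

```python
# Hypothetical KoboldCPP launch mirroring the 7-10B settings above.
# Flag names may differ between KoboldCPP releases; the model path is a placeholder.
import subprocess

cmd = [
    "koboldcpp.exe",
    "--model", "some-7b-model.Q8_0.gguf",  # placeholder: Q8_0 GGUF for 7-10B models
    "--usecublas",                         # CuBLAS; 'mmq' simply not passed (no QuantMatMul)
    "--contextsize", "16384",              # 16K context; KCPP auto-ropes the model
    "--flashattention",                    # Flash Attention on
    # ContextShift is KoboldCPP's default behaviour, so no extra flag is shown here.
    # For 11-15B models the run would instead use a Q4_K_M file,
    # "--contextsize", "12288", and "--quantkv", "1" (8-bit KV cache compression).
]
subprocess.run(cmd, check=True)
```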
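
The per-request settings (the 1024-token response cap and the fixed seed) are applied on every test. As an illustration only, the snippet below issues an equivalent request directly against KoboldCPP's KoboldAI-style `/api/v1/generate` endpoint; the field names and the placeholder prompt are assumptions, since the real prompts are built and sent by Silly Tavern.

```python
# Illustrative direct request to a locally running KoboldCPP instance.
# Field names follow the KoboldAI-style API that KoboldCPP exposes, but exact
# names may vary by version; the prompt below is a placeholder.
import requests

payload = {
    "prompt": "<character card + chat history as rendered by Silly Tavern>",
    "max_context_length": 16384,  # 16K for 7-10B models (12K for 11-15B)
    "max_length": 1024,           # response size capped at 1024 tokens
    "sampler_seed": 123,          # fixed seed used for every test
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```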