Update README.md
README.md
CHANGED
@@ -35,7 +35,7 @@ Simply put, I'm making my methodology to evaluate RP models public. While none o
 
 ### DoggoEval
 
-The goal of this test featuring
+The goal of this test, featuring a dog (Rex) and his owner (EsKa), is to determine if a model is good at obeying a system prompt and character card. The trick being that dogs can't talk, but LLM love to.
 
 - [Results and discussions are hosted in this thread](https://huggingface.co/SerialKicked/ModelTestingBed/discussions/1) ([old thread here](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13))
 - [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
@@ -51,6 +51,6 @@ TODO: The goal of this test is to check if a model is able of following a very s
 
 # Limitations
 
-I'm testing for things I'm interested in.
+I'm testing for things I'm interested in. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the amount of variables, a small LLM is still a small LLM at the end of the day. The results for other seeds, or with the smallest of change, are bound to give very different results.
 
 I usually give the different models I'm testing a fair shake in a more casual settings. I regen tons of outputs with random seeds, and while there are (large) variations, it tends to even out to the results shown in testing. Otherwise I'll make a note of it.