Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   prompt caching is enabled
    And   2 slots
    And   . as slot save path
    And   2048 KV cache size
    And   42 as server seed
    And   24 max tokens to predict
    Then  the server is starting
    Then  the server is healthy
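    # NOTE: ". as slot save path" presumably configures where slot KV-cache files are
    # written (e.g. the --slot-save-path option of llama.cpp's server); setting it is
    # what makes the save/restore/erase slot steps used below available.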

  Scenario: Save and Restore Slot
    # First prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is saved with filename "slot1.bin"
    Then  the server responds with status code 200
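    # (The save step presumably maps to the server's slot-save endpoint, e.g.
    #  POST /slots/1?action=save with body {"filename": "slot1.bin"} in llama.cpp's HTTP API.)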
    # With the prompt cache, only the tokens that differ from the cached prompt should be processed
    Given a user prompt "What is the capital of Germany?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special)
    And   7 prompt tokens are processed
    # Loading the original cache into slot 0,
    # we should only be processing 1 prompt token and get the same output
    When  the slot 0 is restored with filename "slot1.bin"
    Then  the server responds with status code 200
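    # (The restore step presumably maps to the matching slot-restore endpoint, e.g.
    #  POST /slots/0?action=restore with body {"filename": "slot1.bin"}.)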
    Given a user prompt "What is the capital of France?"
    And   using slot id 0
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   1 prompt tokens are processed
    # Verify that slot 1 was not corrupted during the slot 0 restore: same checks as above
    Given a user prompt "What is the capital of Germany?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special)
    And   1 prompt tokens are processed

  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is erased
    Then  the server responds with status code 200
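    # (The erase step presumably maps to the slot-erase endpoint, e.g. POST /slots/1?action=erase;
    #  with the cached state gone, the same prompt below must be fully re-processed.)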
    Given a user prompt "What is the capital of France?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed