Feature: Parallel

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 42 as server seed
    And 128 as batch size
    And 256 KV cache size
    And 2 slots
    And continuous batching
    Then the server is starting
    Then the server is healthy
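The Background maps almost one-to-one onto llama.cpp server flags. A minimal launch sketch in Python, assuming the usual flag names of a recent `llama-server` build (check `--help`) and that pointing at the first shard of a split GGUF is enough for the remaining shards to be found:

```python
import subprocess

# Hypothetical launch mirroring the Background; flag names are assumptions
# from recent llama.cpp builds, `llama-server --help` is authoritative.
server = subprocess.Popen([
    "./llama-server",
    "-m", "models/stories15M-00001-of-00003.gguf",  # first shard of the split GGUF
    "--host", "localhost", "--port", "8080",
    "--seed", "42",    # "42 as server seed"
    "-b", "128",       # "128 as batch size"
    "-c", "256",       # "256 KV cache size" (total context, shared by the slots)
    "-np", "2",        # "2 slots"
    "-cb",             # "continuous batching"
])
```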
  Scenario Outline: Multi users completion
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And <n_predict> max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | n_predict |
      | 128       |
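On the wire, "concurrent completion requests" amounts to POSTing the prompts to the server's native `/completion` endpoint in parallel and checking each response's predicted-token count against `<n_predict>`. A sketch, assuming the field names documented in the server README (`prompt`, `n_predict`, `tokens_predicted`); the authoritative step definitions live in the test harness:

```python
import asyncio
import aiohttp

PROMPTS = [
    "Write a very long story about AI.",
    "Write another very long music lyrics.",
]

async def complete(session, prompt, n_predict):
    # llama.cpp's native completion endpoint; body fields per the server README.
    async with session.post("http://localhost:8080/completion",
                            json={"prompt": prompt, "n_predict": n_predict}) as resp:
        return await resp.json()

async def main(n_predict=128):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(complete(session, p, n_predict) for p in PROMPTS))
    for res in results:
        # "all prompts are predicted with <n_predict> tokens"
        assert res["tokens_predicted"] == n_predict

asyncio.run(main())
```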
  Scenario Outline: Multi users OAI completions compatibility
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another poem.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |
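The OAI compatibility scenarios exercise the OpenAI-style wire format instead. A sketch using the `openai` Python client pointed at the local server; the system prompt and model name in the scenario suggest the chat completions shape, but which exact route the step definitions hit is an assumption here:

```python
from openai import OpenAI

# The server speaks the OpenAI wire format under /v1; the key is a dummy.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

def oai_complete(user_prompt, n_predict, streaming):
    resp = client.chat.completions.create(
        model="tinyllama-2",  # name from the scenario; the loaded model answers
        messages=[
            {"role": "system", "content": "You are a writer."},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=n_predict,
        stream=streaming,
    )
    if not streaming:
        return resp.choices[0].message.content
    # With streaming enabled, tokens arrive as SSE deltas instead.
    return "".join(c.choices[0].delta.content or "" for c in resp)
```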
  Scenario Outline: Multi users OAI completions compatibility no v1
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another poem.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests no v1
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |
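The "no v1" variant exists because the server also exposes the same OpenAI-compatible handlers without the `/v1` prefix; with the client above only the base URL changes (again an assumption about which alias the step hits):

```python
# Same requests, un-prefixed alias of the OpenAI-compatible route.
client_no_v1 = OpenAI(base_url="http://localhost:8080", api_key="sk-no-key-required")
```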
  Scenario Outline: Multi users with number of prompts exceeding number of slots
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another poem.
      """
    And a prompt:
      """
      What is LLM?
      """
    And a prompt:
      """
      The sky is blue and I love it.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |
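Here four prompts compete for two slots, so the server itself queues the surplus: the client fires all requests at once with no throttling, and the two that miss a slot simply wait server-side. A sketch reusing the hypothetical `oai_complete()` helper from above:

```python
import asyncio

PROMPTS = [
    "Write a very long book.",
    "Write another poem.",
    "What is LLM?",
    "The sky is blue and I love it.",
]

async def main(n_predict=64):
    # One task per prompt; with 4 prompts and 2 slots the server parks the
    # surplus requests until a slot frees up, then still honors max_tokens.
    return await asyncio.gather(
        *(asyncio.to_thread(oai_complete, p, n_predict, False) for p in PROMPTS))

results = asyncio.run(main())
assert len(results) == 4  # queued requests complete too
```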
  Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    And 128 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted
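The arithmetic behind this edge case, assuming the cache is split evenly across slots: a 256-token cache shared by 2 slots leaves 128 tokens of context per slot, while four prompts each demanding 128 new tokens ask for 512 in total. That oversubscription is presumably why the scenario cites issue #3969, and why the final step only asserts that all prompts complete rather than reaching an exact token count:

```python
n_ctx, n_slots = 256, 2        # "256 KV cache size", "2 slots"
n_prompts, n_predict = 4, 128  # four prompts, "128 max tokens to predict"

per_slot_ctx = n_ctx // n_slots       # 128 tokens per slot (assumed even split)
total_demand = n_prompts * n_predict  # 512 predicted tokens requested overall

assert total_demand > n_ctx  # the scenario deliberately oversubscribes the cache
```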