Disabling/Reducing model reasoning
I have important CoT prompts that guide the LLM on how to think. Using them leads to high latency and large token output, so I'd like to reduce the model's internal reasoning for those reasons.
We hear the ask, and you are not alone. We will add it in the next version.
Ideally there would also be a non-thinking version, or a non-thinking switch, to keep the model responsive for local usage on consumer hardware or when latency is key to the application (such as using text-to-speech to hold a conversation, etc.).
I have a non-thinking version of it here:
https://www.neuroengine.ai/Neuroengine-Large
Just add </think> at the beginning of the assistant section, like this:
<|im_start|>assistant\n</think>
And it will stop reasoning pretty much every time.
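If you are calling the model over an API rather than editing a template, here is a minimal sketch of the same prefill trick using a raw text-completion request. The base URL, API key, model id, and use of the OpenAI Python client are my assumptions, not confirmed by this thread; the only part taken from the post above is the <|im_start|>assistant\n</think> prefix at the end of the prompt.

# Sketch only: base URL, api_key, and model id below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-none")

prompt = (
    "<|im_start|>user\nSummarize llamas in one sentence.<|im_end|>\n"
    # Prefill the assistant turn so the think block is treated as already closed.
    "<|im_start|>assistant\n</think>"
)

resp = client.completions.create(
    model="step-3.5-flash",   # placeholder model id
    prompt=prompt,
    max_tokens=128,
    stop=["<|im_end|>"],      # stop at the end of the assistant turn
)
print(resp.choices[0].text)

This only works against an endpoint that accepts raw completions (so you control the chat markup yourself); a chat-completions endpoint will apply its own template and may ignore the prefill.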
- Make a custom Jinja template, starting from:
https://huggingface.co/stepfun-ai/Step-3.5-Flash/blob/main/chat_template.jinja
- Replace the last lines with:
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n<think>\n</think>\n' }}
{%- endif %}
- Load it in llama.cpp with --chat-template-file jinja.tmpl (a quick sanity check is sketched below)
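As a rough sanity check that the edited template is actually in use, you can hit the llama.cpp server's OpenAI-compatible chat endpoint and look at the start of the reply. The port, prompt, and file names here are assumptions on my part, not something confirmed in this thread.

# Assumes llama-server was started with --chat-template-file jinja.tmpl
# and is listening on port 8080 (both placeholders).
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
reply = resp.json()["choices"][0]["message"]["content"]
# With the edited template, the reply should start with the answer itself
# rather than a <think> block.
print(reply)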
@ortegaalfredo
I have not been able to reproduce this using https://api.stepfun.ai/v1
I tried what you said, but it doesn't seem to work for me.