
Disabling/Reducing model reasoning

#22
by Abdallah1997 - opened

I have important CoT prompts that guide the LLM in how to think. Using them leads to high latency and large token outputs, so I'd like to reduce the model's internal reasoning for those reasons.

Abdallah1997 changed discussion title from Disabling/Reducing reasoning to Disabling/Reducing model reasoning
StepFun org

We hear the ask. You are not alone. We will add it in the next version.

Ideally there would also be a non-thinking version, or a non-thinking switch, to keep the model responsive for local usage on consumer hardware, or when latency is key to the application (such as using text-to-speech to hold a conversation, etc.).

I have a non-thinking version of it here:

https://www.neuroengine.ai/Neuroengine-Large

Just add </think> at the beginning of the assistant section, like this:

<|im_start|>assistant\n</think>

And it will stop reasoning pretty much every time.
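
For example, here is a minimal sketch of that prefill trick in Python against a local llama.cpp server (the URL, port, and prompt text are placeholders of mine; the special tokens follow the model's ChatML-style template):

import requests

# Build the raw prompt by hand so the think block can be pre-closed in the
# assistant turn; the model then answers directly instead of reasoning first.
prompt = (
    "<|im_start|>user\nSummarize TCP slow start in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n</think>\n"
)

# llama.cpp's raw /completion endpoint; host and port depend on how you
# launched llama-server.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 256},
)
print(resp.json()["content"])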

  • Make a custom Jinja template, starting from:

https://huggingface.co/stepfun-ai/Step-3.5-Flash/blob/main/chat_template.jinja

  • Replace the last lines with:
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n</think>\n' }}
{%- endif %}
  • Pass it to llama.cpp with the flag below (a quick sanity check for the template is sketched after this list):

--chat-template-file jinja.tmpl
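
Before wiring the template into llama.cpp, you can sanity-check it with transformers, along these lines (a sketch, assuming the edit above is the only change to the template; the repo ships custom code, hence trust_remote_code):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stepfun-ai/Step-3.5-Flash", trust_remote_code=True)

# Override the built-in template with the edited file.
with open("jinja.tmpl") as f:
    tok.chat_template = f.read()

rendered = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
# The generation prompt should now end with a pre-closed think block.
print(rendered.endswith("<|im_start|>assistant\n<think>\n</think>\n"))  # expect True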

@ortegaalfredo
I have not been able to reproduce this using https://api.stepfun.ai/v1.
I tried what you said, but it doesn't seem to work for me.
