base_model:
- mistralai/Mistral-Nemo-Instruct-2407
pipeline_tag: text-generation
library_name: transformers
---

## Chat template:
```
[SYSTEM]You are an AI focused on providing systematic, well-reasoned responses. Response Structure: - Format: <think>{{reasoning}}</think>{{answer}} - Reasoning: Minimum 6 logical steps only when it required in <think> block - Process: Think first, then answer.[/SYSTEM]
[INST]{user_input}[/INST]
```
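The model's replies follow the `<think>{{reasoning}}</think>{{answer}}` format described above. A minimal sketch for separating the reasoning trace from the final answer; the `split_think` helper is illustrative and not part of this repository:

```python
import re

def split_think(response: str):
    """Split a '<think>{reasoning}</think>{answer}' response into its two parts."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No reasoning block was emitted; treat the whole response as the answer.
        return None, response.strip()
    return match.group(1).strip(), response[match.end():].strip()

reasoning, answer = split_think("<think>1) ... 6) ...</think>The result is 42.")
print(answer)  # -> The result is 42.
```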
## Run the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
import bitsandbytes
import torch._dynamo
import os

torch._dynamo.config.suppress_errors = True
os.environ["TORCHDYNAMO_DISABLE"] = "1"

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    #bnb_8bit_use_double_quant=True,
    #bnb_8bit_quant_type="nf4",
    #bnb_8bit_compute_dtype=torch.bfloat16,
    #llm_int8_threshold=200.0,
    llm_int8_enable_fp32_cpu_offload=True
)

model_id = "CreitinGameplays/Llama-3.1-8B-R1-v0.1"

# Initialize the model and tokenizer with streaming support
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

# Custom streamer that collects the output into a string while streaming
class CollectingStreamer(TextStreamer):
    def __init__(self, tokenizer):
        # skip_prompt=True so only newly generated text is printed and collected
        super().__init__(tokenizer, skip_prompt=True)
        self.output = ""

    def on_finalized_text(self, text: str, stream_end: bool = False):
        self.output += text
        print(text, end="", flush=True)  # print each chunk as it is generated

print("Chat session started. Type 'exit' to quit.\n")

# Initialize the chat history as a list of messages
chat_history = []
chat_history.append({"role": "system", "content": "You are an AI assistant made by Mistral AI"})

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":
        break

    # Append the user message to the chat history
    chat_history.append({"role": "user", "content": user_input})

    # Prepare the prompt by formatting the complete chat history
    inputs = tokenizer.apply_chat_template(
        chat_history,
        return_tensors="pt",
        add_special_tokens=False
    ).to(model.device)

    # Create a new streamer for the current generation
    streamer = CollectingStreamer(tokenizer)

    # Generate the streamed response
    model.generate(
        inputs,
        streamer=streamer,
        temperature=0.3,
        top_p=0.8,
        top_k=50,
        repetition_penalty=1.1,
        max_new_tokens=4096,
        do_sample=True
    )

    # The complete response text is stored in streamer.output
    response_text = streamer.output
    print("\nAssistant:", response_text)

    # Append the assistant response to the chat history
    chat_history.append({"role": "assistant", "content": response_text})
```
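The commented-out options in the quantization config above mix 4-bit names (`nf4`) into 8-bit parameters. If VRAM is tight, a 4-bit NF4 config is the usual bitsandbytes alternative; this is a sketch of an assumed substitution, not a configuration validated for this model:

```python
from transformers import BitsAndBytesConfig
import torch

# Assumed alternative: 4-bit NF4 quantization instead of the 8-bit config above.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Pass it to AutoModelForCausalLM.from_pretrained(...) exactly as in the example above.
```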
### Note: This model was fine-tuned for only 2000 steps.