This is the fully finished chat-ready checkpoint.
Note: for very short prompts, you should prefill a `<think>` XML tag at the start of the assistant response to ensure the model reasons properly.
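With the Hugging Face chat template, this prefill can be done by appending the tag to the formatted prompt. A minimal sketch, assuming a placeholder repo id:

```python
# Minimal sketch of prefilling "<think>" for a very short prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/Fijik-1.5-2.6B")  # placeholder repo id
messages = [{"role": "user", "content": "Hi!"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
prompt += "<think>"  # the model continues its reasoning block from here
```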
Fijik-1.5 2.6B
Trained on H200 and A100 GPUs, with some use of RTX 2000 Ada GPUs.
Fijik-1.5 2.6B boasts serious performance at a fraction of the price, while keeping incredible inference speeds. The model runs at about 300 tokens/s on a single RTX 3080 (bf16 GGUF) and supports 32k context (in theory this can be scaled up to 128k with minimal quality issues), all while keeping a low memory footprint, so many users could use it at the same time, or a single user could run it on an edge device.

What it is
Fijik 1.5 is a generalist LLM with a knowledge cutoff of March 2025, though with limited information after July 2024. The original model was pre-trained on 2T tokens by Hugging Face (this model is based on SmolLM2-135M: https://huggingface.co/HuggingFaceTB/SmolLM2-135M). We then turned the original model into a 32-expert "Franken"-MoE. Obviously, after that stage it was nowhere near finished, so heavy CPT (continual pre-training) was done. This also allowed us to scale the context from 8,192 tokens to 32k, and technically the model should work up to 128k tokens.
This model is completely uncensored, and thus is not ideal for production use cases where safety is a must.
The model should be used for:
- General chat applications
- Fun quick local model
- Code suggestions / generation
- Fine-tuning for domain-specific tasks (e.g. front-end-only generation, title generation, tool calling, etc.)
The model should NOT be used for:
- Anything which needs lots of knowledge (model is too small for that)
- Medical, legal, or other high-risk fields
- Math (from internal testing the model is not good at math, though it could be fine-tuned to excel at it)
Overall, it is a special little model: it has a different style compared to other similarly sized LLMs, is completely uncensored, and is a very small MoE.
Model information
| Feature | Value |
|---|---|
| Chat model? | Yes |
| Architecture | Mixtral |
| max_position_embeddings | 32,768 |
| intermediate_size | 1,536 |
| num_hidden_layers | 30 |
| hidden_size | 576 |
| num_experts_per_tok | 4 |
| num_attention_heads | 9 |
| vocab_size | 49,166 |
| rope_theta | 500,000 |
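For reference, a config with these values could be built roughly as below. This is a sketch reconstructed from the table: `num_local_experts=32` comes from the 32-expert MoE description above, and `num_key_value_heads` is an assumption, since it is not listed in the table.

```python
from transformers import MixtralConfig

config = MixtralConfig(
    vocab_size=49_166,
    hidden_size=576,
    intermediate_size=1_536,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,        # assumption, not listed in the table
    num_local_experts=32,         # from the "32 expert" MoE description
    num_experts_per_tok=4,
    max_position_embeddings=32_768,
    rope_theta=500_000.0,
)
print(config)
```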
CPT (continual pre-training)
To make a properly decent base model for the size, CPT had to be done, both to make the experts actual experts and to improve the model's context length and knowledge.
The CPT data was ~60% synthetic and ~40% non-synthetic (across all CPT stages combined).
5 stages of continual pre-training were done.
Stage 1 started with small batches, high noise, and forced diversion; its dataset included slightly lower-quality general 2025 wiki articles, older wiki articles, synthetic math from DeepSeek R1, a mix of synthetic and non-synthetic code, and some synthetic web datasets (like Cosmopedia). The following stages were similar but with larger batches, and at stage 3 gpt-oss reasoning traces were added. Up to and including stage 4, overall cleaner datasets and slightly higher learning rates were used (full training, not LoRA, for all CPT stages). At stage 5 we used a dataset very similar to stage 4's, but with added DeepSeek R1 reasoning traces, fewer sources, more data focused on code generation (from Qwen3 480B and DeepSeek R1), gpt-oss-generated and cleaned articles, and more 2025 data, at a 32k context length.
By doing this, the model got an effective knowledge cutoff of March 2025, but with limited information past July 2024.
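As a rough illustration of the 32k-context stage, packing documents into fixed-length sequences could look like the sketch below. This is not the actual pipeline: the corpus used here is only a stand-in, and the real CPT mix is not released.

```python
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

SEQ_LEN = 32_768  # stage-5 context length

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
# Stand-in corpus; the real mix (wiki, synthetic math/code, reasoning traces) is not released.
raw = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1%]")

def tokenize(batch):
    return tok(batch["text"])

def pack(batch):
    # Concatenate all token ids, then slice into fixed 32k-token blocks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    usable = (len(ids) // SEQ_LEN) * SEQ_LEN
    blocks = [ids[i : i + SEQ_LEN] for i in range(0, usable, SEQ_LEN)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

packed = (
    raw.map(tokenize, batched=True, remove_columns=raw.column_names)
       .map(pack, batched=True, remove_columns=["input_ids", "attention_mask"])
)
```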
SFT (supervised fine-tuning)
For SFT, a ~549M-token, high-quality, diverse dataset was used. It was almost completely synthetic, with many examples generated by DeepSeek R1 and Qwen3 80B.
Estimated data mix:
- ~12% tool/json
- ~27% code generation (front-end, backend, competitive coding)
- ~43% general chats / instruction following
- ~18% math
These percentages are estimated from the raw dataset mix; the real proportions are unknown.
RL (reinforcement learning)
SFT was not enough, especially these days. After SFT, 3 different "rounds" of DPO (direct preference optimization) were done, which improved instruction following significantly; yet that was still not enough, and more RL was done.
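The preference data and hyperparameters for these rounds are not published; the sketch below only illustrates what one such DPO round could look like with TRL. The checkpoint path and the example pair are placeholders.

```python
# Hedged sketch of a single DPO round with TRL; not the actual recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "path/to/fijik-1.5-sft"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPO expects (prompt, chosen, rejected) triples; this pair is purely illustrative.
pairs = Dataset.from_list([
    {
        "prompt": "Write a Python function that reverses a string.",
        "chosen": "def reverse(s):\n    return s[::-1]",
        "rejected": "You can reverse a string somehow.",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```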
After the 3 DPO stages, DeepSeek-R1-like GRPO was done (note: DPO and GRPO were done with LoRA, except for the final DPO stage discussed below). The GRPO used very hard rewards, so the model had a "hard time" getting good reward, but this actually helped it: before the GRPO stage(s), the model had significant looping issues, more incoherent outputs, and worse instruction following. GRPO helped it think for less time, loop less, and be better overall.
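The actual reward functions are not published either. The sketch below only shows the TRL `GRPOTrainer` interface with a toy anti-looping / short-reasoning reward in the same spirit; the checkpoint path, dataset, and reward design are all assumptions.

```python
# Illustrative GRPO setup; the real rewards and data are not released.
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def anti_loop_reward(completions, **kwargs):
    rewards = []
    for text in completions:
        # Penalize verbatim repetition of 6-grams (a crude loop detector).
        words = text.split()
        ngrams = [" ".join(words[i:i + 6]) for i in range(max(len(words) - 5, 0))]
        loop_penalty = (1.0 - len(set(ngrams)) / len(ngrams)) if ngrams else 0.0
        # Reward closing the reasoning block and keeping it short.
        think = re.search(r"<think>(.*?)</think>", text, flags=re.S)
        think_bonus = 0.5 if think and len(think.group(1).split()) < 300 else 0.0
        rewards.append(think_bonus - loop_penalty)
    return rewards

prompt_dataset = Dataset.from_list([{"prompt": "Why is the sky blue?"}])  # placeholder prompts

trainer = GRPOTrainer(
    model="path/to/fijik-1.5-after-dpo",  # placeholder checkpoint
    reward_funcs=[anti_loop_reward],
    args=GRPOConfig(output_dir="grpo-out", num_generations=8),
    train_dataset=prompt_dataset,
)
trainer.train()
```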
Still, a little more was done. After this, two final stages were run:
- DPO (final): A different DPO dataset with more coding, stricter instruction following, and generalist chat (e.g. "Hi! what are you?"), trained with full fine-tuning enabled (no LoRA).
- GRPO (final): Two epochs of the same dataset and rewards as the previous GRPO stages, just as a last push.
Benchmarks
None done yet; coming soon.
How to run
This model should ideally be run with a system prompt, though it works perfectly fine without one. It uses standard Qwen3 tool calls, but it should be fine-tuned to excel at tool calling, as it currently has some issues with it.
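If the bundled chat template follows the Qwen3 tool-call format, tools can be passed straight to `apply_chat_template`, roughly as below; the repo id and the `get_weather` function are placeholders, not part of the model.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/Fijik-1.5-2.6B")  # placeholder repo id

def get_weather(city: str) -> str:
    """
    Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris right now?"},
]

prompt = tok.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # the model is expected to answer with a <tool_call>{...}</tool_call> block
```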
Recommended sampling parameters:
- Temperature: 0.35
- Top-k: 35
- Repetition penalty: 1.1
- Top-p: 0.85
- Min-p: 0.1 (optional)
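With transformers, those values map onto `generate` roughly as in this sketch; the repo id and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "path/to/Fijik-1.5-2.6B"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

out = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.35,
    top_k=35,
    top_p=0.85,
    min_p=0.1,              # optional
    repetition_penalty=1.1,
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```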
Test it out with a simple prompt, like "Why is the sky blue?" Keep in mind that this model does support multi-turn conversations, but it expects the previous response to also include its reasoning; removing the reasoning from the previous response could save compute and context, but would break the model.
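A multi-turn message list would therefore keep the `<think>...</think>` block in the previous assistant turn, roughly like this (the contents are purely illustrative):

```python
# Illustrative multi-turn history that keeps the previous turn's reasoning.
messages = [
    {"role": "user", "content": "Why is the sky blue?"},
    {
        "role": "assistant",
        # Keep the <think>...</think> block from the earlier turn; stripping it
        # would save context and compute but, per the note above, breaks the model.
        "content": (
            "<think>The user asks about the colour of the sky...</think>"
            "Sunlight scatters off air molecules, and shorter (blue) wavelengths "
            "scatter the most, so the sky looks blue."
        ),
    },
    {"role": "user", "content": "Why are sunsets red, then?"},
]
```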
When fine-tuning, you need at minimum 8 GB of memory for basic QLoRA with low context; ideally, 16 GB.
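A basic 4-bit QLoRA setup with peft/bitsandbytes fits in that budget and could look like the sketch below; the repo id, target modules, and hyperparameters are assumptions, not a recommended recipe.

```python
# Hedged QLoRA sketch: 4-bit base model plus LoRA adapters on the attention projections.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

repo = "path/to/Fijik-1.5-2.6B"  # placeholder repo id
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(repo, quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only; experts left frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```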
Special thanks
This wouldn't have been possible without HuggingFaceTB (they trained SmolLM2-135M), Unsloth, MergeKit, and Transformers.
For questions, open a community discussion.