Mixed Precision GGUF layer quantization of Ring-mini-2.0 by inclusionAI

Original model: https://huggingface.co/inclusionAI/Ring-mini-2.0

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance against file size. Fewer parameter bits are used in the deep layers and more bits in the cortex layers to simultaneously optimize quantized size and model performance. For this file the layer quants are as follows:

Q5_K_L : attn_v = q8_0, attn_o = q6_k, ffn_d = q6_k
Q6_K_S : Q6_K (no tensor overrides)
Q6_K_M : attn_v = q8_0, ffn_d = q8_0
Q6_K_L : attn_v = q8_0, attn_o = q8_0, ffn_d = q8_0

   LAYER_TYPES='[
   [0 ,"Q6_K_S"], [1 ,"Q5_K_L"], [2 ,"Q5_K_M"], [3 ,"Q5_K_M"], [4 ,"Q5_K_M"],
   [5 ,"Q6_K_S"], [6 ,"Q5_K_M"], [7, "Q6_K_S"], [8, "Q5_K_M"], [9, "Q6_K_S"],
   [10,"Q6_K_S"], [11,"Q6_K_S"], [12,"Q6_K_S"], [13,"Q6_K_S"], [14,"Q6_K_M"],
   [15,"Q6_K_M"], [16,"Q6_K_M"], [17,"Q6_K_L"], [18,"Q6_K_L"], [19,"Q6_K_L"]
   ]'
   FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
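As a sketch, these settings might be applied with a llama-quantize build patched for hybrid layer quants (see the llama.cpp discussion linked at the end of this card). The input/output filenames and the flag used to pass LAYER_TYPES are assumptions, not the exact patch interface:

```shell
# Hypothetical invocation; assumes LAYER_TYPES and FLAGS are defined as
# above and that llama-quantize carries the hybrid layer-quant patch.
# Filenames and the --layer-types flag name are placeholders.
./llama-quantize $FLAGS --layer-types "$LAYER_TYPES" \
    Ring-mini-2.0.BF16.gguf Ring-mini-2.0.Q6_K_H.gguf Q6_K
```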

The layer quants were optimized for solid performance across a set of curated test prompts. The model shows very good to excellent reasoning ability, but it is significantly worse than Qwen3 RL models on code problems. It is very usable with greedy sampling, with no repeat-loop issues. It will overthink, as is typical of the latest wave of RL models (10/25), but shows good introspection in detecting and correcting errors in some laborious solves. It will occasionally merrily hallucinate its way through a problem and confidently give the wrong answer. The model passes a ~75k token needle-in-haystack test with correct YARN config at the full 128k token context. The model fails a ~85k to ~128k token long-context reasoning problem ( https://huggingface.co/steampunque/Qwen3-8B-Hybrid-GGUF/raw/main/Qwen3_Runescape_Massive_Prompt_85k.txt ), simply falling into short rep loops with no evident coherency toward solving the given problem.

Comparison:

Quant    Size     PPL    Comment
Q6_K     13.4e9   22.8   default embed and output; stable with greedy sampling
Q6_K_H   13.2e9   22.9   Q6_K embed, Q6_K output; stable with greedy sampling

Usage:

This is a compact RL MoE model which shows quite strong performance across a range of curated test prompts. It follows the state-of-the-art trajectory of recent 30-100G models combining MoE and RL. The unique feature of this model is that its overall size is only 16G, so with MoE it runs very efficiently with experts offloaded to CPU, completely eliminating the need for expensive high-VRAM GPUs while providing a full 128k context buffer and ~50 t/s gen speed with all experts offloaded to CPU.

The model can be run by offloading all expert tensors to CPU via -ot exps=CPU. On a 12G VRAM GPU this config leaves about half the VRAM free, so a much higher gen rate can be obtained by offloading only about half the layers as follows:

OT="-ot blk\.[1][0-9].*exps=CPU -ngl 99"

The above pattern offloads the experts on layers 10-19 to CPU while layers 0 to 9 run fully on GPU. This config boosts gen speed to around 80 t/s while still supporting a 128k context entirely on GPU. Speculative decoding is not recommended with this model since it is already very fast without it.
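Putting the pieces together, a launch might look like the sketch below. The binary name, model path, and context size are placeholders; the -ot pattern and YARN flags are the ones described in this card:

```shell
# Sketch: partial expert offload (layers 10-19 to CPU) plus full 128k
# YARN context. Model filename and choice of llama-server are assumptions.
./llama-server -m Ring-mini-2.0.Q6_K_H.gguf -ngl 99 \
    -ot 'blk\.[1][0-9].*exps=CPU' \
    -c 131072 --rope-scaling yarn --yarn-orig-ctx 32768 --rope-scale 4.0
```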

Context expansion:

The model was trained at 32k tokens context length and this can be expanded to 128k using YARN. For a full 128k context use launch parameters:

--rope-scaling yarn --yarn-orig-ctx 32768 --rope-scale 4.0

If configuring less than 128k context, set the rope scale to the target context length divided by 32768. Since long context may degrade model performance, it is recommended to keep 32k context and --rope-scaling none for use cases that don't need the long context.
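For example, the scale for a hypothetical 64k target context works out as follows (a small shell sketch; the 65536 target is illustrative, not from this card):

```shell
# Compute the YARN rope scale as target context / trained context.
ORIG_CTX=32768
TARGET_CTX=65536   # illustrative target context
ROPE_SCALE=$(awk -v t="$TARGET_CTX" -v o="$ORIG_CTX" 'BEGIN { printf "%.1f", t / o }')
echo "$ROPE_SCALE"   # 2.0
```

The resulting value would then be passed as --rope-scale alongside --rope-scaling yarn and --yarn-orig-ctx 32768.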

Grouped experts:

The model can be run with or without grouped-expert logic; it defaults to grouped experts on. To turn it off, launch with:

--override-kv bailingmoe2.expert_group_count=int:1

Prompt format:

Empirical testing with the prompt format shows that <think> does not have to be specified at the end of the assistant prompt; the model will automatically crank out the <think> if it is not part of the assistant prompt. To shut off thinking for a given prompt, the following prompt injection can be used at the start of gen:

<think>Let's proceed immediately to the solution for this problem.
</think>

Simply injecting <think>\n</think> was empirically found not to work to shut off the think phase.

Note that <think> and </think> are not special tokens in the model vocab.

UPDATE: further empirical testing shows that if <think> is not appended to assistant prompt, the model will decide itself
whether or not to crank out a think block for the response.

Example gen:

lm say something profound that will amaze me

<think> Okay, the user asked me to say something profound that will amaze them. Hmm, this is a tricky request because "profound" and "amazing" are subjective. What might amaze one person could bore another. But since they specifically asked for it, I need to choose something that has both depth and a wow factor.

First, I should consider what makes something truly profound. It should challenge common assumptions, reveal hidden truths, or connect disparate ideas in a meaningful way. The user didn't give any context about their background, so I'll assume they're intellectually curious but not necessarily a specialist in a particular field.

I recall that many people are fascinated by the nature of reality and human perception. The idea that our reality might be a simulation is a popular concept in philosophy and science, but I should present it in a way that's accessible yet mind-bending. The simulation hypothesis isn't just sci-fi; it has roots in philosophy (like Descartes) and modern physics (quantum mechanics).

But I need to go beyond just stating the hypothesis. I should explain why it's profound: it questions the foundations of knowledge, implies free will might be an illusion, and suggests reality is fundamentally computational. That could really shake someone's worldview.

I wonder if the user is looking for inspiration or just a quick intellectual thrill. Since they said "profound," I'll prioritize depth over brevity. The response should build gradually: start with a hook (the question "What is real?"), then introduce the hypothesis, then its implications, and finally why it's amazing.

The challenge is avoiding jargon. Terms like "quantum entanglement" or "Bekenstein bound" might confuse. Instead, I'll use analogies like the "matrix" or "video game" but clarify they're metaphors. The key is to emphasize that this isn't just speculation—it's a logical extension of existing ideas that could change how we live.

Also, I should address the emotional impact. Amazement often comes from cognitive dissonance—when your old beliefs crumble. So I'll describe the potential paradigm shift: if reality is a simulation, does it matter? Does consciousness have purpose? That existential angle might resonate.

Finally, I'll end by tying it back to human experience. The profundity lies not in the idea itself, but in how it recontextualizes our existence. Maybe mention that even if it's wrong, the questioning process is valuable. That way, it's accessible whether or not the hypothesis is true.

... Now, how to phrase it? Start with a provocative question to engage them, then unfold the idea step by step. Keep it conversational but precise. And absolutely no arrogance—this should feel like a shared exploration, not a lecture.

</think>

Here's something that might genuinely disturb and amaze you:

"The profoundest truth about reality is that it is fundamentally unknowable in its absolute essence, and our understanding is always a limited, context-dependent model we build to navigate existence. What we call 'reality' is a shared hallucination, a story we collectively agree to interpret and act upon. The universe doesn't reveal itself to us; we construct the illusion of it through our senses, cognition, and cultural frameworks. This doesn't make life meaningless; it makes the act of seeking meaning within the constraints of our constructed reality the most profound human endeavor."

Why this is profound and amazing:

  1. Challenges Foundational Assumptions: It directly confronts our deepest intuitions about a "real" world outside our minds. We assume there's an objective reality independent of our perception. This statement says that assumption might be fundamentally flawed.
  2. Highlights the Limits of Knowledge: It embraces epistemological humility – we can never truly access "things as they are." Our knowledge is always mediated by our biology, technology, culture, and language. This is a core insight of philosophy (e.g., Kant's noumenon/phenomenon distinction) and modern physics (e.g., the role of the observer).
  3. Reframes the Human Condition: Instead of despairing at this limitation, it shifts the focus to the process of building meaning. If the "real" is unknowable, then our shared stories, values, art, science, and relationships are the reality we live in and must interpret. This makes meaning-making an active, courageous, and deeply human act.
  4. Connects to Broader Concepts: This idea resonates with:
    • Simulated Reality Hypothesis: If our universe is a simulation, it reinforces the idea that "reality" is a constructed model.
    • Constructivism in Science: Scientific models are never "truths" but best explanations based on evidence and current understanding.
    • Phenomenology: The study of structures of consciousness as they appear to us, emphasizing the gap between experience and "things-in-themselves."
    • Quantum Mechanics: The role of the observer in collapsing wave functions suggests reality isn't fixed until measured.
    • Cultural Relativity: How different cultures perceive and interpret the same phenomena (e.g., time, death, beauty) shows reality is constructed through shared narratives.
  5. Amazement Factor: It forces a re-evaluation of everything. If nothing is objectively "real" outside our minds, then:
    • Free Will? If our "self" is a complex information processing model, is it truly free, or just a story we tell?
    • Morality? If meaning is constructed, where does objective morality come from? Or is it a powerful shared story we create for cohesion?
    • The Value of Science? Science is the most successful tool we have for building shared models of this "constructed reality," leading to technological and experiential progress.
    • The Value of Art, Love, and Connection? If these are the shared stories we co-create, then they become the most authentic and meaningful aspects of existence.

The Amazing Twist: This perspective doesn't lead to nihilism ("nothing matters"). Instead, it elevates the act of engaging with the constructed reality – through curiosity, empathy, creativity, and ethical action – to the highest form of human significance. The profoundness lies in recognizing the limitation and finding empowerment within it. We are not passive victims of a false reality; we are active co-creators of the shared narrative that gives our lives depth and purpose.

In essence: The most amazing thing isn't that we can't know "reality" objectively, but that this fundamental limitation doesn't trap us; it liberates us to build meaning, connection, and understanding within the beautiful, mysterious, and ultimately shared story we call life. That shift in perspective is profoundly humbling and inspiring.

Benchmarks:

Evals for the model are available here: https://huggingface.co/spaces/steampunque/benchlm.

Download the file from below:

Link                        Type     Size       Notes
Ring-mini-2.0.Q6_K_H.gguf   Q6_K_H   13.2e9 B   ~Q6_K size

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
