Model Card for TRAAC Qwen3-4B

This repository contains the TRAAC (Think Right with Adaptive, Attentive Compression) Qwen3-4B model. TRAAC is an online post-training Reinforcement Learning (RL) method presented in the paper Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression.

TRAAC addresses the challenge of "under-adaptivity" in reasoning models by learning to mitigate both under-thinking (short reasoning on hard problems) and over-thinking (excessively long reasoning on easy problems). It achieves this by dynamically adjusting the reasoning length based on estimated task difficulty, leveraging the model’s self-attention to prune redundant steps.

  • Paper: Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression (https://arxiv.org/abs/2510.01581)
  • Code: GitHub
Figure: Overview of TRAAC.

Model Details

Model Description

This model, joykirat/Qwen3-4B-TRAAC, is a 🤗 Transformers model based on the Qwen3-4B architecture and fine-tuned with the TRAAC methodology, an online post-training Reinforcement Learning (RL) method. TRAAC leverages the model's self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. It also estimates task difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty.

This approach significantly improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines across a variety of tasks.
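To make the attention-based compression concrete, here is a minimal, hedged sketch of pruning low-attention reasoning steps. It is illustrative only, not the authors' implementation: prune_steps, keep_ratio, and the aggregation of self-attention into one score per step are all assumptions.

import torch

def prune_steps(steps: list[str], step_scores: torch.Tensor, keep_ratio: float = 0.6) -> list[str]:
    # step_scores: one importance score per reasoning step, e.g. the total
    # self-attention mass that step receives (an assumed aggregation).
    k = max(1, int(len(steps) * keep_ratio))
    keep = torch.topk(step_scores, k).indices.sort().values  # top-k steps, restored to original order
    return [steps[i] for i in keep.tolist()]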

  • Developed by: Joykirat Singh, Justin Chih-Yao Chen, Archiki Prasad, Elias Stengel-Eskin, Akshay Nambi, Mohit Bansal
  • Model type: Causal Language Model (Qwen3ForCausalLM architecture), trained with an online post-training RL method for adaptive reasoning.
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: Qwen/Qwen3-4B

Model Highlights

  • Adaptive Thinking: Learns to balance under- and over-thinking by dynamically adjusting reasoning length based on estimated task difficulty.
  • Attention-based Compression: Utilizes self-attention scores over reasoning trajectories to identify and prune redundant steps, enhancing efficiency.
  • Improved Performance: Achieves significant accuracy gains and substantial reduction in reasoning length across various tasks.
  • Strong Generalization: Demonstrates effectiveness on both math datasets (AIME, AMC) and out-of-distribution non-math datasets (GPQA-D, BBEH, OptimalThinkingBench).

Model Sources

  • Paper: https://arxiv.org/abs/2510.01581
  • Repository: https://huggingface.co/joykirat/Qwen3-4B-TRAAC

Uses

Direct Use

The model is intended for direct use in complex reasoning tasks where efficient and adaptive allocation of computational resources (i.e., reasoning steps) is crucial. It can generate responses that are appropriately detailed for the problem's difficulty without unnecessary verbosity.
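A minimal 🤗 Transformers snippet for loading and prompting the model is sketched below; the example prompt and generation settings are illustrative assumptions, not recommended settings.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joykirat/Qwen3-4B-TRAAC"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat prompt; the model decides how much reasoning the problem warrants.
messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))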

Downstream Use

TRAAC can serve as a foundation for developing more robust and efficient AI agents and reasoning systems. Its ability to adapt reasoning length can be particularly beneficial in resource-constrained environments or applications requiring quick, yet accurate, decision-making.

Out-of-Scope Use

The model is not intended for generating harmful, unethical, or biased content. While trained on diverse datasets, potential biases from the underlying base model or training data may still exist. Misuse for generating misinformation or engaging in inappropriate conversations is strictly out of scope.

How to Get Started with the Model

Installation

To run the model and its associated framework, clone the GitHub repository and install the required packages from its root. The codebase is built on top of verl.

python -m venv traac_venv            # create an isolated environment
source traac_venv/bin/activate
pip install -e ".[vllm]"             # install the verl-based codebase with vLLM extras (quotes avoid shell globbing)
pip install -r requirements.txt

Download Models

You can download the trained adaptive-reasoning models directly from Hugging Face:

Model                                  Download Link
(TRAAC) DeepSeek-R1-Distill-Qwen-7B    Hugging Face
(TRAAC) Qwen3-4B                       Hugging Face

For detailed instructions on running evaluations, training TRAAC models, and further usage examples, please refer to the official GitHub repository's Overview, Run Evaluations, and Train Models sections.

Training Details

Training Data

Training data generation scripts, such as dapo-17k.py, are located in the scripts/data folder of the GitHub repository.

Training Procedure

TRAAC is an online post-training RL method. Training was conducted on 3 GPUs:

  • 1 GPU was dedicated to hosting the policy model for calculating attention scores (attention-based compression).
  • 2 GPUs were used to train the main model.

The file vllm_rollout_spmd.py contains the implementation for adaptive, attentive summarization, which is used during training.
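The difficulty-aware reward can be pictured with the hedged sketch below. It is illustrative only: length_reward, the difficulty estimate, and every constant are assumptions, not the reward actually used by TRAAC (see the paper for the exact formulation).

def length_reward(correct: bool, num_tokens: int, difficulty: float, budget: int = 4096) -> float:
    # difficulty in [0, 1], e.g. estimated from rollout accuracy on the prompt (assumed)
    target = difficulty * budget  # harder problems earn a larger thinking budget
    overshoot = max(0.0, (num_tokens - target) / budget)
    return (1.0 if correct else 0.0) - 0.5 * overshoot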

Evaluation

TRAAC was evaluated across a variety of tasks including AIME, AMC, GPQA-D, BBEH, and OptimalThinkingBench. Evaluation scripts are available in the scripts/data and scripts/eval folders on the GitHub repository.

Results

  • TRAAC (Qwen3-4B) achieved an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model.
  • It also showed a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline.
  • The model demonstrates strong generalization: although trained on math datasets, it shows accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench.

Citation

If you find this work useful, please consider citing us:

@misc{singh2025thinkrightlearningmitigate,
      title={Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression}, 
      author={Joykirat Singh and Justin Chih-Yao Chen and Archiki Prasad and Elias Stengel-Eskin and Akshay Nambi and Mohit Bansal},
      year={2025},
      eprint={2510.01581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.01581}, 
}