---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---

# Model Card

## 1. Model Details

This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with Reinforcement Learning with Verifiable Rewards (RLVR) to enhance the base model's reasoning capabilities.

- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
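
The checkpoint can be loaded with the standard `transformers` text-generation API. A minimal sketch (the `MODEL_ID` below is a placeholder, not this repository's actual id; substitute the correct repo name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this repository's actual model id.
MODEL_ID = "Miaow-Lab/RLVR-Linearity-Checkpoint"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the checkpoint and sample one completion for a single prompt."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # DeepSeek-R1-distilled models expect their chat template for reasoning prompts.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=1.0
    )
    # Decode only the newly generated tokens.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```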
## 2. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
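
As a quick sanity check on these numbers (assuming the train batch size counts prompts and the group size counts sampled responses per prompt, which the card does not state explicitly):

```python
# Back-of-the-envelope rollout accounting for one RLVR step.
train_batch_size = 128    # prompts per RL step (assumed)
group_size = 16           # rollouts sampled per prompt (GRPO group, assumed)
ppo_mini_batch_size = 64  # prompts per gradient mini-batch

rollouts_per_step = train_batch_size * group_size             # sampled responses per step
mini_batches_per_step = train_batch_size // ppo_mini_batch_size  # PPO updates per step

print(rollouts_per_step, mini_batches_per_step)  # 2048 2
```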

For the full training configuration, please refer to the `config.json` or the training scripts in our [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).
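
For reference, GRPO's defining step is to normalize each rollout's reward within its prompt group rather than using a learned critic. A minimal sketch of that group-relative advantage (illustrative only, not the repository's implementation):

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: z-score each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 4 rollouts for one prompt with verifiable 0/1 rewards:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct rollouts get positive advantages and incorrect ones negative, and the advantages in a group always sum to zero.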

## 3. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```

> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.