Publication result detail

Reinforcement Learning for Mathematical Reasoning in Small-Scale Language Models with Structured Policy Optimization

TYAGI, N.; JOSHI, R.; DAS, S.; SIKORA, P.; MYŠKA, V.; DUTTA, M.

Original title

Reinforcement Learning for Mathematical Reasoning in Small-Scale Language Models with Structured Policy Optimization

English title

Reinforcement Learning for Mathematical Reasoning in Small-Scale Language Models with Structured Policy Optimization

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

Advancing mathematical reasoning in small-scale language models remains a challenge due to their limited capacity and the high computational demands of standard reinforcement learning methods like Proximal Policy Optimization (PPO). To address this, a resource-efficient training pipeline based on Group Relative Policy Optimization (GRPO) is proposed, enabling fine-tuning of compact models under strict memory constraints. The proposed method introduces structured prompting to explicitly separate reasoning steps from final answers and applies a dual reward system to jointly optimize for format adherence and mathematical correctness. The training incorporates low-overhead techniques such as 8-bit optimization, mixed-precision training, gradient checkpointing, and accelerated decoding for efficient rollout and policy updates. Experimental results show 50.95% accuracy on the GSM8K reasoning benchmark, outperforming several larger models, while training entirely on a single GPU. These findings demonstrate that small-scale models, when trained with structured reinforcement learning, can achieve competitive performance in mathematical reasoning tasks. The approach offers a practical pathway for deploying interpretable, reasoning-capable models in low-resource environments.

English abstract

Advancing mathematical reasoning in small-scale language models remains a challenge due to their limited capacity and the high computational demands of standard reinforcement learning methods like Proximal Policy Optimization (PPO). To address this, a resource-efficient training pipeline based on Group Relative Policy Optimization (GRPO) is proposed, enabling fine-tuning of compact models under strict memory constraints. The proposed method introduces structured prompting to explicitly separate reasoning steps from final answers and applies a dual reward system to jointly optimize for format adherence and mathematical correctness. The training incorporates low-overhead techniques such as 8-bit optimization, mixed-precision training, gradient checkpointing, and accelerated decoding for efficient rollout and policy updates. Experimental results show 50.95% accuracy on the GSM8K reasoning benchmark, outperforming several larger models, while training entirely on a single GPU. These findings demonstrate that small-scale models, when trained with structured reinforcement learning, can achieve competitive performance in mathematical reasoning tasks. The approach offers a practical pathway for deploying interpretable, reasoning-capable models in low-resource environments.
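
The abstract names two mechanisms that a short sketch may help make concrete: a dual reward (format adherence plus mathematical correctness) and the group-relative advantage that GRPO computes from several sampled completions of the same prompt. The Python below is only an illustration under stated assumptions, not the authors' implementation: the `<reasoning>`/`<answer>` tag layout, the reward values 0.5 and 1.0, and all function names are hypothetical, since the abstract does not specify them.

```python
import re
from statistics import mean, pstdev

# Hypothetical structured-output layout: reasoning first, then the final answer.
# The abstract does not give the actual prompt template.
PATTERN = re.compile(
    r"^<reasoning>.+?</reasoning>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 0.5 (assumed weight) if the completion follows the tag layout."""
    return 0.5 if PATTERN.match(completion.strip()) else 0.0


def correctness_reward(completion: str, gold: str) -> float:
    """Return 1.0 (assumed weight) if the extracted answer equals the gold number."""
    match = PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    try:
        predicted = float(match.group(1).strip().replace(",", ""))
        return 1.0 if predicted == float(gold) else 0.0
    except ValueError:
        # Non-numeric answer text counts as incorrect.
        return 0.0


def total_reward(completion: str, gold: str) -> float:
    """Dual reward: format adherence plus mathematical correctness."""
    return format_reward(completion) + correctness_reward(completion, gold)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled completions for the same (made-up) question, gold answer 7.
group = [
    "<reasoning>3 + 4 = 7</reasoning><answer>7</answer>",  # well-formed, correct
    "<reasoning>3 + 4 = 8</reasoning><answer>8</answer>",  # well-formed, wrong
    "The answer is 7",                                     # ignores the format
]
rewards = [total_reward(c, "7") for c in group]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group rather than from a learned value network, no separate critic model needs to be held in memory, which is in line with the memory constraints the abstract emphasizes.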

Keywords

Emotion Flip Reasoning, Therapeutic Dialogue Modelling, Emotional Trajectory Prediction, Transformer

Keywords in English

Emotion Flip Reasoning, Therapeutic Dialogue Modelling, Emotional Trajectory Prediction, Transformer

Authors

TYAGI, N.; JOSHI, R.; DAS, S.; SIKORA, P.; MYŠKA, V.; DUTTA, M.

RIV year

2026

Published

5 November 2025

Publisher

IEEE

Place

Florence, Italy

ISBN

979-8-3315-7675-2

Book

2025 17th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)

Periodical

International Congress on Ultra Modern Telecommunications and Workshops

Country

United States of America

Pages from

272

Pages to

277

Number of pages

6

URL

BibTeX

@inproceedings{BUT200028,
  author="N. {Tyagi} and Rakesh Chandra {Joshi} and S. {Das} and Pavel {Sikora} and Vojtěch {Myška} and Malay Kishore {Dutta}",
  title="Reinforcement Learning for Mathematical Reasoning in Small-Scale Language Models with Structured Policy Optimization",
  booktitle="2025 17th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)",
  year="2025",
  journal="International Congress on Ultra Modern Telecommunications and Workshops",
  pages="272--277",
  publisher="IEEE",
  address="Florence, Italy",
  doi="10.1109/ICUMT67815.2025.11268643",
  isbn="979-8-3315-7675-2",
  url="https://ieeexplore.ieee.org/document/11268643"
}