Spec-T1-Base-7B: an open-source model engineered by SVECTOR for exceptional performance in mathematics and code generation, together with its advanced reinforcement-learning variant, Spec-T1-RL-7B.
The development of advanced reinforcement learning (RL) models in open-source research has often relied on large-scale architectures, such as models with 32 billion parameters, to achieve robust reasoning capabilities in coding and mathematics. Historically, achieving comparable performance in smaller models has been a significant challenge. At SVECTOR, we propose that the reasoning potential of RL-enhanced models is fundamentally rooted in the capabilities of their foundational base model.
Introducing Spec-T1-Base-7B, a 7-billion parameter model meticulously designed from the ground up to excel in reasoning tasks. This open-source model demonstrates remarkable performance, surpassing significantly larger 32-billion parameter models in specific domains. Additionally, we have developed Spec-T1-RL-7B, a reinforcement learning-enhanced variant of the base model, which delivers superior results in mathematics and code reasoning, achieving performance comparable to leading models such as OpenAI's o1.
Key Features and Innovations
Pre-Training: Optimized for Reasoning Excellence
DataGen Pipeline: We developed a sophisticated synthetic data generation framework to create high-quality, reasoning-focused datasets tailored for mathematics and coding. The DataGen Pipeline employs iterative problem synthesis, context-aware augmentation, and quality-driven refinement to produce diverse, high-density reasoning patterns that enhance model training.
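SVECTOR has not published the internals of the DataGen Pipeline, but the loop it describes (synthesize candidates, score them, keep only high-quality items, iterate) can be illustrated with a minimal sketch. Everything below is hypothetical: `synthesize_problem` stands in for LLM-driven problem synthesis, and `quality_score` stands in for rule-based quality refinement.

```python
import random

def synthesize_problem(seed_problem, rng):
    """Hypothetical stand-in for LLM-driven problem synthesis;
    here we merely generate a fresh arithmetic problem."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return {"question": f"What is {a} * {b}?", "answer": a * b}

def quality_score(item):
    """Stand-in quality check: accept only problems whose stated
    answer is verifiable by re-computation."""
    nums = [int(t.strip("?*")) for t in item["question"].split()
            if t.strip("?*").isdigit()]
    a, b = nums
    return 1.0 if a * b == item["answer"] else 0.0

def datagen(seeds, rounds=3, threshold=0.5, seed=0):
    """Iterative synthesis -> quality filter -> refine loop."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        candidates = [synthesize_problem(p, rng) for p in pool]
        pool = [c for c in candidates if quality_score(c) >= threshold]
    return pool
```

In a real pipeline the synthesis and scoring steps would call a generator model and rule-based verifiers; the surviving pool after each round feeds the next round of augmentation.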
Three-Phase Pre-Training: Spec-T1-Base-7B was trained on approximately 5 trillion tokens over 5.5 months using a compact cluster of 3–4 Google Tensor Processing Unit (TPU) v4 chips. This efficient pre-training process utilized a strategic three-phase data mixture to optimize reasoning capabilities.
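The composition of the three-phase data mixture has not been disclosed; the fragment below only illustrates what such a phased schedule might look like. All phase names, token counts per phase, and mixture ratios are placeholders, constrained only by the published total of roughly 5 trillion tokens.

```python
# Illustrative three-phase data-mixture schedule. The ratios and
# per-phase token counts are placeholders, NOT SVECTOR's actual
# mixture; only the ~5T-token total comes from the announcement.
PHASES = [
    {"name": "phase-1-general",   "tokens": 3.0e12,
     "mixture": {"web": 0.60, "code": 0.25, "math": 0.15}},
    {"name": "phase-2-reasoning", "tokens": 1.5e12,
     "mixture": {"web": 0.30, "code": 0.40, "math": 0.30}},
    {"name": "phase-3-anneal",    "tokens": 0.5e12,
     "mixture": {"web": 0.10, "code": 0.45, "math": 0.45}},
]

def sanity_check(phases, total=5.0e12):
    """Verify the phases cover the token budget and each
    mixture's weights sum to one."""
    assert abs(sum(p["tokens"] for p in phases) - total) < 1e9
    for p in phases:
        assert abs(sum(p["mixture"].values()) - 1.0) < 1e-9
    return True
```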
PolyStep Forecast (PSF): An auxiliary training objective, PSF enhances model performance and inference speed by enabling speculative decoding, maximizing the efficiency of pre-training resources.
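The exact PSF objective is not published, but auxiliary multi-step prediction losses of this family typically train an extra head to predict tokens beyond the next one, alongside the standard next-token loss. A minimal sketch under that assumption (the head names, the offset of one extra step, and the weighting are all hypothetical):

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy at a single position."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def psf_loss(main_logits, psf_logits, tokens, psf_weight=0.3):
    """Hypothetical PSF-style objective: the main head predicts
    token t+1, an auxiliary head predicts token t+2, and the two
    losses are combined. Shapes: logits [seq, vocab], tokens [seq]."""
    seq = len(tokens)
    main = np.mean([cross_entropy(main_logits[t], tokens[t + 1])
                    for t in range(seq - 1)])
    aux = np.mean([cross_entropy(psf_logits[t], tokens[t + 2])
                   for t in range(seq - 2)])
    return main + psf_weight * aux
```

At inference time the same auxiliary head can serve as a cheap draft proposer for speculative decoding, which is how a training-time objective also buys generation speed.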
Post-Training: Enhancing Reasoning through Reinforcement Learning
Curated Dataset: For reinforcement learning, we compiled a dataset of 130,000 meticulously verified mathematics and coding problems. Each problem was evaluated using rule-based systems and assessed for difficulty to ensure high quality and relevance.
Tiered Precision Scoring: To address the challenge of sparse rewards in complex coding tasks, we introduced a Tiered Precision Scoring system. This approach assigns granular scores to test cases of varying complexity, enabling more effective policy optimization during RL training.
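SVECTOR does not detail the scoring rule, but the core idea (replacing a sparse pass/fail reward with partial credit across test-case difficulty tiers) can be sketched as follows; the tier structure and weights are illustrative.

```python
def tiered_score(results, tiers):
    """Sketch of a tiered reward. Test cases are grouped by
    difficulty tier, each tier carrying its own weight, so a
    partially correct solution earns a graded score instead of a
    sparse 0/1 reward. `results` maps test id -> bool;
    `tiers` maps tier name -> (weight, [test ids]),
    with weights summing to 1."""
    score = 0.0
    for weight, tests in tiers.values():
        passed = sum(results.get(t, False) for t in tests)
        score += weight * passed / len(tests)
    return score
```

For example, a solution passing all easy tests and half the medium tests would receive a reward strictly between 0 and 1, giving the policy gradient a usable signal even on hard coding problems.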
Balanced Cycle Sampling: For simpler problems, we implemented a Balanced Cycle Sampling approach to improve rollout efficiency and stabilize policy updates in the later stages of RL training.
Reinforcement Learning Infrastructure
StreamPulse Accelerator: Our custom accelerator optimizes RL training and validation by integrating continuous rollout, asynchronous reward computation, and early termination, achieving 2.29× faster training and 1.96× faster validation.
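The StreamPulse Accelerator itself is proprietary, but the pipeline pattern it names (overlapping rollout generation with asynchronous reward computation rather than waiting for whole batches) can be sketched with `asyncio`. Both coroutine bodies are stand-ins: real implementations would call an inference engine and a sandboxed test runner.

```python
import asyncio

async def rollout(problem):
    """Stand-in for model generation (the real engine is not shown)."""
    await asyncio.sleep(0)              # yield to the event loop
    return f"answer-to-{problem}"

async def compute_reward(completion):
    """Stand-in for asynchronous reward computation, e.g. running
    unit tests on generated code in a sandbox."""
    await asyncio.sleep(0)
    return 1.0 if completion.startswith("answer") else 0.0

async def continuous_rollout(problems):
    """Overlap generation and scoring: each rollout's reward is
    computed as soon as that rollout finishes, instead of blocking
    on the slowest member of the batch."""
    async def worker(p):
        completion = await rollout(p)
        return p, await compute_reward(completion)
    return await asyncio.gather(*(worker(p) for p in problems))

results = asyncio.run(continuous_rollout(["p1", "p2", "p3"]))
```

Early termination, the third technique mentioned, would additionally cancel in-flight rollouts once their outcome can no longer change the update; that is omitted here for brevity.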
Nexlify Inference Framework: Support for PSF was seamlessly integrated into our inference engine, enhancing robustness and efficiency for reinforcement learning workflows.
Model Specifications
The Spec-T1-Base-7B model incorporates PolyStep Forecast (PSF) layers, which are tuned during pre-training and remain fixed during reinforcement learning. With a single PSF layer for speculative decoding, the model achieves an acceptance rate of approximately 90%. This architecture ensures robust reasoning capabilities, serving as a strong foundation for the RL-enhanced Spec-T1-RL-7B model.
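To make the ~90% acceptance rate concrete, here is a minimal draft-and-verify loop in the general style of speculative decoding. The `draft_next` and `verify_accepts` callables are hypothetical stand-ins for the PSF head and the main model's verification step.

```python
def speculative_step(draft_next, verify_accepts, prefix, k=4):
    """Minimal draft-and-verify sketch: a cheap draft head
    (PSF-style) proposes k tokens; the main model checks them in
    order and keeps the longest accepted prefix. With a ~90%
    per-token acceptance rate, one step usually emits several
    tokens for roughly the cost of one verification pass."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)             # cheap draft proposal
        drafted.append(t)
        ctx.append(t)
    accepted = []
    for t in drafted:
        if verify_accepts(prefix + accepted, t):
            accepted.append(t)          # main model agrees
        else:
            break                       # reject; resample from here
    return accepted
```

Rejected positions fall back to the main model's own prediction, so output quality matches ordinary decoding while throughput improves with the acceptance rate.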
Both Spec-T1-Base-7B and Spec-T1-RL-7B are fully open-source and available at https://huggingface.co/SVECTOR-CORPORATION. By releasing these models, including their checkpoints, SVECTOR aims to contribute valuable insights to the broader research community, fostering advancements in high-performance reasoning language models.
Performance Evaluation
Note: Spec-T1-Base-7B scores are internal SVECTOR evaluations (temperature = 0.6). AIME scores are averaged over 25 runs; MATH-500 and SuperGPQA scores over 2 runs; all other benchmarks over 10 runs. Spec-T1-RL-7B's mathematics and coding scores are from a single run.