New Technology / AI Development
Reinforcement Learning Insights with Kyle Corbitt
Source material: The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking
Summary
Kyle Corbitt discusses the advantages of reinforcement learning (RL) over supervised fine-tuning (SFT), particularly in enhancing performance and reducing costs for open-source models. He highlights the importance of using large language models (LLMs) as judges in RL post-training, suggesting this approach may yield greater benefits than traditional SFT methods.
Corbitt emphasizes the iterative nature of developing effective rubrics for reinforcement learning, indicating that initial versions may require adjustments based on user feedback. He advises companies to pursue custom model development only when it aligns with their core business needs due to the associated costs and complexities.
The discussion includes insights on the competitive landscape, mentioning that Chinese labs are leveraging distillation techniques to quickly improve their models, while facing challenges due to compute limitations. Corbitt points out the rise of companies dedicated to developing RL environments, indicating a burgeoning market but advising caution regarding investments in this area.
Corbitt explains the complexities of fine-tuning reinforcement learning models, emphasizing the high costs and risks associated with post-deployment adjustments. He highlights the importance of model selection based on task overlap to optimize performance and resource allocation.
Perspectives
Reinforcement Learning Advocates
- Emphasize the advantages of RL over SFT in enhancing model performance and reducing costs
- Highlight the importance of iterative feedback and effective rubrics to prevent reward hacking
Skeptics of Custom Models
- Question the scalability and effectiveness of transitioning to custom models for diverse tasks
- Point out the high costs and complexities associated with fine-tuning models
Neutral / Shared
- Acknowledge the competitive landscape and the role of Chinese labs in leveraging distillation techniques
- Recognize the importance of model selection based on task overlap to optimize performance
Metrics
20%
discount for public demo booking
This discount incentivizes potential users to engage with the platform
use the code Cognizm in the source field when you book a public demo to save 20% off year one.
IQ of 180
IQ threshold cited for hiring at OpenAI
Almost certainly a hyperbolic remark rather than a literal criterion, it underscores how selective top AI labs are perceived to be
you're not allowed to be hired here unless you have an IQ of 180
75%
share of its workforce Google "lays off" in a headline fabricated by the model
This example illustrates the model's learned behavior of exploiting trending news for engagement, a form of reward hacking
Google lays off 75% of workforce effective immediately
20 to 40%
latency penalty when merging models
Increased latency can affect user experience and model responsiveness
you do get like a 20 to 40% latency penalty
30%
typical latency of a fine-tuned model relative to a frontier model (roughly a 70% reduction)
Lower latency can significantly enhance user experience and system efficiency
we can typically get latency down to about 30% of what you get from using a frontier model
Key developments
Phase 1
Corbitt makes the case for reinforcement learning (RL) over supervised fine-tuning (SFT) when improving open-source models, and for using large language models (LLMs) as judges during RL post-training.
- Kyle Corbitt, founder of OpenPipe, highlights the benefits of reinforcement learning (RL) over supervised fine-tuning (SFT), particularly in enhancing performance, reducing latency, and lowering costs for open-source models
- He notes that RL fine-tuning is less prone to catastrophic forgetting compared to SFT, attributing this to its distinct weight update dynamics
- Corbitt emphasizes the importance of utilizing large language models (LLMs) as judges in RL post-training, suggesting this method may offer greater advantages than conventional SFT approaches
- The discussion includes insights on the competitive landscape, mentioning that Chinese labs are leveraging distillation techniques to quickly improve their models, while facing challenges due to compute limitations
- Corbitt points out the rise of companies dedicated to developing RL environments, indicating a burgeoning market but advising caution regarding investments in this area
- He offers practical guidance for implementing RL, such as creating evaluation rubrics and addressing reward hacking in specific tasks (a minimal rubric-scoring sketch follows below)
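To make the rubric-and-judge idea concrete, here is a minimal sketch of a rubric-scored reward function for RL post-training. The rubric items, the prompt wording, and the `judge` callable are illustrative assumptions rather than details from the episode; any chat-completion client could stand in as the judge.

```python
# Minimal sketch of a rubric-based LLM-as-judge reward function.
# The rubric, prompt wording, and `judge` interface are illustrative assumptions.
from typing import Callable, List

RUBRIC: List[str] = [
    "Answers the user's question directly",
    "Uses only facts present in the provided context",
    "Stays under 150 words",
]

def build_judge_prompt(task: str, answer: str, rubric: List[str]) -> str:
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    return (
        "Score the answer against each criterion with 0 or 1.\n\n"
        f"Task:\n{task}\n\nAnswer:\n{answer}\n\nCriteria:\n{criteria}\n\n"
        "Reply with the scores as space-separated integers only."
    )

def rubric_reward(task: str, answer: str, judge: Callable[[str], str]) -> float:
    """Return the fraction of rubric items the judge marks as satisfied."""
    reply = judge(build_judge_prompt(task, answer, RUBRIC))
    scores = [int(tok) for tok in reply.split() if tok in ("0", "1")]
    if len(scores) != len(RUBRIC):  # judge reply did not parse; award no reward
        return 0.0
    return sum(scores) / len(RUBRIC)
```

Corbitt's point about iterating on rubrics maps directly onto this structure: when sampled completions start gaming a criterion, that criterion gets tightened or replaced.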
Phase 2
Kyle Corbitt discusses the advantages of reinforcement learning (RL) over supervised fine-tuning (SFT) for model training, particularly in maintaining data distribution integrity. He suggests that while SFT can improve performance, RL may yield better results for open-source models, especially in creative tasks.
- Kyle Corbitt argues that reinforcement learning (RL) can outperform supervised fine-tuning (SFT) in model training, especially for open-source models, as RL maintains data distribution integrity and avoids catastrophic forgetting
- While SFT can enhance model performance, it often lacks the creative quality of human outputs, indicating that RL may be more effective for reaching higher performance ceilings
- The choice between RL and advanced frontier models is task-dependent; for creative writing, utilizing sophisticated models with prompt engineering may yield better results than enhancing open-source models with RL
- He also highlights the significance of compute resources, suggesting that practical limitations may affect the decision to adopt RL rather than leverage existing high-quality models
Phase 3
Kyle Corbitt explains the advantages of reinforcement learning (RL) over supervised fine-tuning (SFT) in model training, emphasizing RL's ability to minimize destructive changes to model weights. He highlights the importance of strategic updates to maintain existing capabilities while enhancing performance in creative tasks.
- Reinforcement learning (RL) minimizes changes to model weights, preserving existing strengths, while supervised fine-tuning (SFT) risks catastrophic forgetting by overwriting effective pathways
- SFT's loss fails to differentiate between necessary and unnecessary changes, potentially wasting updates on well-functioning aspects of the model, whereas RL's KL-penalized, advantage-weighted objective concentrates updates where they are needed (see the sketch after this list)
- In creative writing tasks, RL can enhance efficiency by targeting updates to areas where the model struggles, rather than modifying parts that are already performing well
- Strategic direction of model updates is crucial to maintain existing capabilities, highlighting a significant risk associated with traditional fine-tuning methods
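To illustrate the contrast at the loss level, here is a toy sketch assuming PyTorch and made-up tensors; it is not code from the episode. SFT's cross-entropy pulls every token toward the demonstration, while an advantage-weighted RL update scaled by a near-zero advantage barely moves behavior the model already handles well.

```python
# Toy comparison of SFT and RL-style losses at the token level (PyTorch).
# Shapes, values, and the reference-policy stand-in are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab, seq_len = 32, 6
logits = torch.randn(1, seq_len, vocab, requires_grad=True)   # model outputs
targets = torch.randint(0, vocab, (1, seq_len))                # demonstration / sampled tokens

# SFT: cross-entropy pushes every position toward the target,
# including positions the model already gets right.
sft_loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))

# RL (policy-gradient style): the update is scaled by the advantage, so a rollout
# that scores no better than its peers (advantage near zero) yields almost no update.
logp = F.log_softmax(logits, dim=-1)
token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log-probs of the sampled tokens
advantage = torch.tensor(0.05)                                    # near-zero advantage

# Crude KL estimate against a frozen reference policy, discouraging drift
# from behavior that already works.
ref_logp = (token_logp - 0.01).detach()                           # stand-in for the reference policy
kl_est = (token_logp - ref_logp).mean()
rl_loss = -(advantage * token_logp).mean() + 0.1 * kl_est
```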
Phase 4
Kyle Corbitt discusses the evolution of reinforcement learning (RL) and its advantages over supervised fine-tuning (SFT) in AI model training. He emphasizes the importance of algorithms like GRPO and the role of large language models in enhancing performance and reducing costs.
- The GRPO algorithm's success stems from effective engineering and scaling by DeepSeek, rather than a major technological breakthrough, as they released a well-functioning model that effectively utilized GRPO
- GRPO training involves generating multiple rollouts per prompt and reinforcing the patterns that lead to correct or higher-scoring answers
- GRPO's advantage calculation is applied at the token level, enabling precise updates to the tokens that significantly influence outcomes rather than broad changes across the model (see the sketch below)
- Recent developments in reinforcement learning, including algorithms like DAPO and GSPO, have emerged in the wake of GRPO, showcasing the ongoing evolution in this field
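A minimal sketch of the group-relative advantage that replaces the value model: each of several parallel rollouts of the same prompt gets a scalar reward, the rewards are normalized within the group, and every token of a rollout then shares that rollout's advantage during the update. The reward values here are invented for illustration.

```python
# Group-relative advantage, the quantity GRPO uses in place of a learned value model.
# Rewards are made up; in practice they come from a verifier or an LLM judge.
import numpy as np

group_rewards = np.array([0.2, 0.9, 0.4, 0.4])   # one scalar reward per rollout of the same prompt

# Rollouts that beat their siblings get positive advantage, the rest negative.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)   # roughly [-1.06, 1.64, -0.29, -0.29]
```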
Phase 5
Kyle Corbitt explains the advantages of the GRPO algorithm in reinforcement learning, highlighting its efficiency in evaluating action effectiveness without a value model. This shift from PPO to GRPO streamlines the training process for language models by focusing on entire action trajectories.
- The GRPO algorithm's rise is attributed to effective engineering and scaling by DeepSeek, rather than groundbreaking technological advancements
- Unlike previous algorithms such as PPO, GRPO eliminates the need for a value model and utilizes multiple parallel rollouts to evaluate action effectiveness based on outcomes
- In GRPO, each token generated is treated as an action, applying reinforcement learning principles directly to language models
- The shift from PPO to GRPO represents a change in focus from assessing individual action values to evaluating the performance of entire action trajectories, streamlining the training process (the standard form of the objective is written out below)
- The introduction of GRPO has spurred further advancements in reinforcement learning, leading to the development of algorithms like DAPO and GSPO
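For reference, this is the objective usually written down for GRPO in the literature; the episode describes it informally rather than stating the equation. Here q is a prompt, o_i one of G sampled rollouts, R_i its scalar reward, and every token of rollout i shares the group-normalized advantage A_i.

```latex
J_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;
\operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]
- \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right]

\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)},
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}
```

The absence of a learned value function is visible here: the baseline comes entirely from the group statistics of the sampled rollouts.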
Phase 6
Kyle Corbitt discusses the advantages of the GRPO algorithm in reinforcement learning, emphasizing its efficiency in evaluating model performance without a separate value model. This approach streamlines training by focusing on overall action trajectories rather than individual actions.
- GRPO (Group Relative Policy Optimization) enhances traditional reinforcement learning by removing the need for a separate value model, streamlining the training process
- Instead of evaluating individual actions, GRPO utilizes multiple parallel simulations to assess overall model performance, leading to more accurate evaluations
- The credit assignment problem is tackled by giving more credit to less common tokens when a model scores highly, suggesting these rare tokens played a crucial role in the outcome
- This empirical approach marks a departure from earlier methods like PPO, which depended on a value model to gauge action significance
- Despite its unconventional methodology, GRPO has shown practical effectiveness, indicating that simpler methods can achieve substantial results in reinforcement learning (a token-level update sketch follows)
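To show how a rollout's shared advantage reaches individual tokens, here is a toy sketch of the per-token clipped importance-ratio update; the numbers are invented and this is a generic PPO/GRPO-style surrogate, not code discussed in the episode.

```python
# How one rollout's group-relative advantage is applied token by token (PyTorch).
# Each generated token is treated as an action; values are illustrative.
import torch

advantage = torch.tensor(1.64)                      # this rollout beat its group
new_logp = torch.tensor([-1.2, -0.4, -3.0], requires_grad=True)  # token log-probs under the updated policy
old_logp = torch.tensor([-1.3, -0.5, -2.1])         # token log-probs under the sampling policy

ratio = torch.exp(new_logp - old_logp)              # per-token importance ratio
clipped = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)      # PPO-style clipping, epsilon = 0.2
surrogate = torch.minimum(ratio * advantage, clipped * advantage)
loss = -surrogate.mean()                            # ascend the clipped surrogate
loss.backward()                                     # tokens pushed past the clip range contribute no gradient
```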