
Agent Skills with tag: reinforcement-learning

11 skills match this tag. Use tags to discover related Agent Skills and explore similar workflows.

fine-tuning-with-trl

Fine-tune LLMs using reinforcement learning with TRL: SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or train from human feedback. Works with Hugging Face Transformers.

fine-tuning, reinforcement-learning, rlhf, huggingface
ovachiever
81
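The DPO objective this skill covers can be sketched in plain Python. This is a minimal illustration of the preference loss, not TRL's `DPOTrainer` API; the function name and arguments are assumptions for the sketch.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is a summed token log-probability of the chosen or
    rejected response under the trained policy (logp_*) or the frozen
    reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the reference model, minus the same gap for the
    # rejected response.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin; minimized when the
    # policy widens the gap in favor of the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With identical log-probabilities everywhere the loss is log 2 (a coin-flip preference), and it shrinks as the policy's margin for the chosen response grows.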

stable-baselines3

Use this skill for reinforcement learning tasks including training RL agents (PPO, SAC, DQN, TD3, DDPG, A2C, etc.), creating custom Gym environments, implementing callbacks for monitoring and control, using vectorized environments for parallel training, and integrating with deep RL workflows. This skill should be used when users request RL algorithm implementation, agent training, environment design, or RL experimentation.

reinforcement-learning, stable-baselines3, gym-environments, rl-agent-training
ovachiever
81
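The clipped surrogate objective behind PPO, the first algorithm this skill lists, can be sketched in plain Python. This illustrates the update rule only; it is not Stable-Baselines3 code, and the function name is an assumption for the sketch.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated
    advantage of the action. PPO takes the minimum of the clipped and
    unclipped terms so the policy cannot profit from pushing the ratio
    far outside [1 - eps, 1 + eps] in a single update.
    """
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)
```

Note the asymmetry: with a positive advantage the objective caps the gain once the ratio exceeds 1 + eps, while with a negative advantage clipping keeps the penalty pessimistic.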

pufferlib

This skill should be used when working with reinforcement learning tasks including high-performance RL training, custom environment development, vectorized parallel simulation, multi-agent systems, or integration with existing RL environments (Gymnasium, PettingZoo, Atari, Procgen, etc.). Use this skill for implementing PPO training, creating PufferEnv environments, optimizing RL performance, or developing policies with CNNs/LSTMs.

reinforcement-learning, multi-agent-systems, gymnasium, ppo
ovachiever
81

openrlhf-training

High-performance RLHF framework with Ray + vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3; 2× faster than DeepSpeed-Chat thanks to its distributed architecture and GPU resource sharing.

reinforcement-learning, distributed-training, gpu-acceleration, ray
ovachiever
81

grpo-rl-training

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training.

reinforcement-learning, fine-tuning, transformers, trl
ovachiever
81
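The core idea that distinguishes GRPO from PPO is its baseline: rewards for a group of sampled completions are normalized against the group's own mean and standard deviation, so no learned value network is needed. A minimal sketch, with an illustrative function name that is not part of TRL's API:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate for one prompt's sample group.

    rewards holds one scalar reward per sampled completion; each
    advantage is that completion's reward standardized within the group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a zero std when all completions score alike.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are centered within each group, they sum to (approximately) zero: completions that beat their siblings are pushed up, the rest are pushed down.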

constitutional-ai

Anthropic's method for training harmless AI through self-improvement. Two-phase approach: supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment and reducing harmful outputs without human labels. Powers Claude's safety system.

anthropic, ai-safety, reinforcement-learning, self-critique
ovachiever
81
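The supervised phase's critique-and-revise loop can be sketched structurally. The `generate`, `critique`, and `revise` functions below are hypothetical stubs standing in for LLM calls; only the control flow illustrates the method.

```python
# Stub model calls - in the real method each of these is an LLM generation.
def generate(prompt):
    return f"draft answer to: {prompt}"

def critique(response, principle):
    # Return a criticism if the response violates the principle, else None.
    return None if "harmless" in response else f"violates: {principle}"

def revise(response, criticism):
    return response + " [revised to be harmless]"

def constitutional_revision(prompt, principles):
    """Phase 1 of Constitutional AI: generate a draft, then self-critique
    it against each constitutional principle and revise where a critique
    fires. The revised outputs become supervised fine-tuning data; phase 2
    (RLAIF) then trains a preference model from AI-generated comparisons.
    """
    response = generate(prompt)
    for principle in principles:
        criticism = critique(response, principle)
        if criticism is not None:
            response = revise(response, criticism)
    return response
```

The key property is that no human labels appear anywhere in the loop: the principles and the model's own critiques supply the training signal.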

neural

Feedback-driven learning. The soul learns from experience - what helped gets strengthened, what misled gets weakened.

feedback, learning, reinforcement-learning, adaptation
genomewalker
0

reinforcement-learning

Q-learning, DQN, PPO, A3C, policy gradient methods, multi-agent systems, and Gym environments. Use for training agents, game AI, robotics, or decision-making systems.

reinforcement-learning, openai-gym, deep-learning, multi-agent-systems
pluginagentmarketplace
21
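Q-learning, the first method this skill lists, fits in a few lines of plain Python. A minimal tabular sketch under assumed names (hashable states/actions, a dict-backed Q-table):

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])

# A defaultdict gives every unseen (state, action) pair an initial Q of 0.
q_table = defaultdict(float)
```

DQN replaces the table with a neural network over states, and policy-gradient methods (PPO, A3C) optimize the policy directly instead of a value table, but the bootstrapped-target idea above is the shared starting point.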

rlhf

Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.

reinforcement-learning, rlhf, large-language-models, reward-modeling
itsmostafa
10
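The reward-modeling step this skill introduces is usually framed with the Bradley-Terry model: the reward model assigns each response a scalar score, and the probability a human prefers one response over another is a sigmoid of the score difference. A sketch with illustrative function names:

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry preference model used in RLHF reward modeling:
    probability that the chosen response is preferred, given the scalar
    rewards the model assigns to each response."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(reward_chosen, reward_rejected):
    # Negative log-likelihood of the observed human preference; training
    # pushes the chosen response's reward above the rejected one's.
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

Equal rewards give a 50/50 preference; the loss falls as the reward gap widens in the preferred direction, which is exactly the signal the policy-optimization stage then maximizes.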

stable-baselines3

Use this skill for reinforcement learning tasks including training RL agents (PPO, SAC, DQN, TD3, DDPG, A2C, etc.), creating custom Gym environments, implementing callbacks for monitoring and control, using vectorized environments for parallel training, and integrating with deep RL workflows. This skill should be used when users request RL algorithm implementation, agent training, environment design, or RL experimentation.

python, machine-learning, reinforcement-learning, openai-gym
K-Dense-AI
3,233 · 360

pufferlib

This skill should be used when working with reinforcement learning tasks including high-performance RL training, custom environment development, vectorized parallel simulation, multi-agent systems, or integration with existing RL environments (Gymnasium, PettingZoo, Atari, Procgen, etc.). Use this skill for implementing PPO training, creating PufferEnv environments, optimizing RL performance, or developing policies with CNNs/LSTMs.

machine-learning, python, reinforcement-learning, autonomous-agent
K-Dense-AI
3,233 · 360