
agentdb-reinforcement-learning-training

Train AI agents using AgentDB's 9 reinforcement learning algorithms including Q-Learning, DQN, PPO, and Actor-Critic. Build self-learning agents, implement RL training loops with experience replay, and deploy optimized models to production.

⚡ Recommended: one-line install (60 seconds)

Copy the command below and paste it into your terminal (Mac/Linux) or PowerShell (Windows). Download → extract → install, fully automatic.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o agentdb-reinforcement-learning-training.zip https://jpskill.com/download/18732.zip && unzip -o agentdb-reinforcement-learning-training.zip && rm agentdb-reinforcement-learning-training.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/18732.zip -OutFile "$d\agentdb-reinforcement-learning-training.zip"; Expand-Archive "$d\agentdb-reinforcement-learning-training.zip" -DestinationPath $d -Force; ri "$d\agentdb-reinforcement-learning-training.zip"

When it finishes, restart Claude Code → then just ask normally, e.g. "train an RL agent", and the skill activates automatically.

💾 Manual download (if the one-line command is not for you)
  1. Click the blue button below to download agentdb-reinforcement-learning-training.zip
  2. Double-click the ZIP to extract it → this creates an agentdb-reinforcement-learning-training folder
  3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site assumes no responsibility for the content, behavior, or safety of this skill.

🎯 What this Skill can do

The description below explains what this Skill will do for you. When you ask Claude for help in this area, it activates automatically.

📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Rename the extension from .skill to .zip and extract it (macOS can auto-extract)
  3. Place the extracted folder in .claude/skills/ under your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. You don't have to say "use this Skill…" — it is invoked automatically for related requests.

See the detailed usage guide →

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 2
📖 The original SKILL.md that Claude reads

The text below is the original (in English or Chinese) that the AI (Claude) reads. A Japanese translation is being added progressively.

AgentDB Reinforcement Learning Training

Overview

Train AI agents with AgentDB's learning plugin and its 9 reinforcement learning algorithms, including Decision Transformer, Q-Learning, SARSA, Actor-Critic, PPO, and more. Build self-learning agents, implement RL training loops, and optimize agent behavior through experience.

When to Use This Skill

Use this skill when you need to:

  • Train autonomous agents that learn from experience
  • Implement reinforcement learning systems
  • Optimize agent behavior through trial and error
  • Build self-improving AI systems
  • Deploy RL agents in production environments
  • Benchmark and compare RL algorithms

Available RL Algorithms

  1. Q-Learning - Value-based, off-policy
  2. SARSA - Value-based, on-policy
  3. Deep Q-Network (DQN) - Deep RL with experience replay
  4. Actor-Critic - Policy gradient with value baseline
  5. Proximal Policy Optimization (PPO) - Trust region policy optimization
  6. Decision Transformer - Offline RL with transformers
  7. Advantage Actor-Critic (A2C) - Synchronous advantage estimation
  8. Twin Delayed DDPG (TD3) - Continuous control
  9. Soft Actor-Critic (SAC) - Maximum entropy RL
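
For intuition on the value-based entries above: Q-Learning updates a state-action value toward the Bellman target after every transition. A minimal tabular sketch (plain TypeScript, independent of the AgentDB API):

```typescript
// Tabular Q-Learning: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a)).
// SARSA would use the action actually taken in s' (on-policy) instead of the
// greedy max (off-policy); DQN replaces the table with a neural network plus
// experience replay and a target network, as configured in Phase 2.
type QTable = Map<string, number[]>; // state key -> one value per action

function qLearningUpdate(
  q: QTable,
  state: string,
  action: number,
  reward: number,
  nextState: string,
  numActions: number,
  lr = 0.1,
  gamma = 0.99
): void {
  const row = q.get(state) ?? new Array(numActions).fill(0);
  const nextRow = q.get(nextState) ?? new Array(numActions).fill(0);
  const target = reward + gamma * Math.max(...nextRow);
  row[action] += lr * (target - row[action]);
  q.set(state, row);
}
```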

SOP Framework: 5-Phase RL Training Deployment

Phase 1: Initialize Learning Environment (1-2 hours)

Objective: Set up AgentDB learning infrastructure with environment configuration

Agent: ml-developer

Steps:

1. Install AgentDB Learning Module

```bash
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
```
2. Initialize learning database

```typescript
import { AgentDB, LearningPlugin } from 'agentdb-learning';

const learningDB = new AgentDB({
  name: 'rl-training-db',
  dimensions: 512, // State embedding dimension
  learning: {
    enabled: true,
    persistExperience: true,
    replayBufferSize: 100000
  }
});

await learningDB.initialize();

// Create learning plugin
const learningPlugin = new LearningPlugin({
  database: learningDB,
  algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
  config: {
    batchSize: 64,
    learningRate: 0.001,
    discountFactor: 0.99,
    explorationRate: 1.0,
    explorationDecay: 0.995
  }
});

await learningPlugin.initialize();
```

3. Define environment

```typescript
import { Environment } from '@agentdb/environments';

const environment = new Environment({
  name: 'grid-world',
  stateSpace: {
    type: 'continuous',
    shape: [10, 10],
    bounds: [[0, 10], [0, 10]]
  },
  actionSpace: {
    type: 'discrete',
    actions: ['up', 'down', 'left', 'right']
  },
  rewardFunction: (state, action, nextState) => {
    // Distance to goal reward
    const goalDistance = Math.sqrt(
      Math.pow(nextState[0] - 9, 2) +
      Math.pow(nextState[1] - 9, 2)
    );
    return -goalDistance + (goalDistance === 0 ? 100 : 0);
  },
  terminalCondition: (state) => {
    return state[0] === 9 && state[1] === 9; // Reached goal
  }
});

await environment.initialize();
```
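
A quick smoke test of the environment contract before training (a sketch assuming the same reset/step API used in the Phase 3 training loop):

```typescript
// One manual rollout step to confirm reset/step, reward, and termination work.
const state = await environment.reset();
const { nextState, reward, done } = await environment.step('right');
console.log({ state, nextState, reward, done });
```

This covers the "Environment configured and tested" validation item below.
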
4. Set up monitoring

```typescript
const monitor = learningPlugin.createMonitor({
  metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
  logInterval: 100, // Log every 100 episodes
  saveCheckpoints: true,
  checkpointInterval: 1000
});

monitor.on('episode-complete', (episode) => {
  console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});
```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/environment', {
  name: environment.name,
  stateSpace: environment.stateSpace,
  actionSpace: environment.actionSpace,
  initialized: Date.now()
});
```

Validation:

  • Learning database initialized
  • Environment configured and tested
  • Monitor capturing metrics
  • Configuration stored in memory

Phase 2: Configure RL Algorithm (1-2 hours)

Objective: Select and configure RL algorithm for the learning task

Agent: ml-developer

Steps:

1. Select algorithm

```typescript
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
  algorithm: 'dqn',
  config: {
    networkArchitecture: {
      layers: [
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
      ]
    },
    learningRate: 0.001,
    batchSize: 64,
    replayBuffer: {
      size: 100000,
      prioritized: true,
      alpha: 0.6,
      beta: 0.4
    },
    targetNetwork: {
      updateFrequency: 1000,
      tauSync: 0.001 // Soft update
    },
    exploration: {
      initial: 1.0,
      final: 0.01,
      decay: 0.995
    },
    training: {
      startAfter: 1000, // Start training after 1000 experiences
      updateFrequency: 4
    }
  }
});

await dqnAgent.initialize();
```


2. Configure hyperparameters

```typescript
const hyperparameters = {
  // Learning parameters
  learningRate: 0.001,
  discountFactor: 0.99, // Gamma
  batchSize: 64,

  // Exploration
  epsilonStart: 1.0,
  epsilonEnd: 0.01,
  epsilonDecay: 0.995,

  // Experience replay
  replayBufferSize: 100000,
  minReplaySize: 1000,
  prioritizedReplay: true,

  // Training
  maxEpisodes: 10000,
  maxStepsPerEpisode: 1000,
  targetUpdateFrequency: 1000,

  // Evaluation
  evalFrequency: 100,
  evalEpisodes: 10
};

dqnAgent.setHyperparameters(hyperparameters);
```
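
The exploration settings above define a multiplicative schedule, epsilon_t = max(epsilonEnd, epsilonStart * epsilonDecay^t), which reaches the 0.01 floor after roughly log(0.01)/log(0.995) ≈ 919 episodes. A sketch of the implied schedule:

```typescript
// Epsilon-greedy schedule implied by epsilonStart / epsilonEnd / epsilonDecay.
function epsilonAt(episode: number): number {
  const { epsilonStart, epsilonEnd, epsilonDecay } = hyperparameters;
  return Math.max(epsilonEnd, epsilonStart * Math.pow(epsilonDecay, episode));
}

console.log(epsilonAt(0), epsilonAt(500), epsilonAt(1000)); // 1.0, ~0.082, 0.01
```
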
3. Set up experience replay

```typescript
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';

const replayBuffer = new PrioritizedReplayBuffer({
  capacity: 100000,
  alpha: 0.6,          // Prioritization exponent
  beta: 0.4,           // Importance sampling
  betaIncrement: 0.001,
  epsilon: 0.01        // Small constant for stability
});

dqnAgent.setReplayBuffer(replayBuffer);
```
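
For reference, prioritized replay (the alpha/beta parameters above) samples transition i with probability P(i) = p_i^alpha / Σ_k p_k^alpha and corrects the resulting bias with importance-sampling weights w_i = (N · P(i))^(−beta). A standalone sketch of that math (not the AgentDB implementation):

```typescript
// Sampling probabilities and normalized importance weights for prioritized
// experience replay (Schaul et al., 2016).
function priorityWeights(priorities: number[], alpha: number, beta: number) {
  const scaled = priorities.map((p) => Math.pow(p, alpha));
  const total = scaled.reduce((a, b) => a + b, 0);
  const probs = scaled.map((s) => s / total);
  const n = priorities.length;
  const raw = probs.map((p) => Math.pow(n * p, -beta));
  const maxW = Math.max(...raw);
  return { probs, weights: raw.map((w) => w / maxW) };
}
```

betaIncrement anneals beta toward 1 over training, so the bias correction becomes exact as the policy converges.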


4. Configure training loop

```typescript
const trainingConfig = {
  episodes: 10000,
  stepsPerEpisode: 1000,
  warmupSteps: 1000,
  trainFrequency: 4,
  targetUpdateFrequency: 1000,
  saveFrequency: 1000,
  evalFrequency: 100,
  earlyStoppingPatience: 500,
  earlyStoppingThreshold: 0.01
};

dqnAgent.setTrainingConfig(trainingConfig);
```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/algorithm-config', {
  algorithm: 'dqn',
  hyperparameters: hyperparameters,
  trainingConfig: trainingConfig,
  configured: Date.now()
});
```

Validation:

  • Algorithm selected and configured
  • Hyperparameters validated
  • Replay buffer initialized
  • Training config set

Phase 3: Train Agents (3-4 hours)

Objective: Execute training iterations and optimize agent behavior

Agent: safla-neural

Steps:

1. Start training loop

```typescript
async function trainAgent() {
  console.log('Starting RL training...');

  const trainingStats = {
    episodes: [] as number[],
    totalReward: [] as number[],
    episodeLength: [] as number[],
    loss: [] as number[],
    explorationRate: [] as number[]
  };

  for (let episode = 0; episode < trainingConfig.episodes; episode++) {
    let state = await environment.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let episodeLoss = 0;

    for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
      // Select action
      const action = await dqnAgent.selectAction(state, { explore: true });

      // Execute action
      const { nextState, reward, done } = await environment.step(action);

      // Store experience
      await dqnAgent.storeExperience({ state, action, reward, nextState, done });

      // Train if enough experiences
      if (dqnAgent.canTrain()) {
        const loss = await dqnAgent.train();
        episodeLoss += loss;
      }

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) break;
    }

    // Update target network
    if (episode % trainingConfig.targetUpdateFrequency === 0) {
      await dqnAgent.updateTargetNetwork();
    }

    // Decay exploration
    dqnAgent.decayExploration();

    // Log progress
    trainingStats.episodes.push(episode);
    trainingStats.totalReward.push(episodeReward);
    trainingStats.episodeLength.push(episodeLength);
    trainingStats.loss.push(episodeLoss / episodeLength);
    trainingStats.explorationRate.push(dqnAgent.getExplorationRate());

    if (episode % 100 === 0) {
      console.log(`Episode ${episode}:`, {
        reward: episodeReward.toFixed(2),
        length: episodeLength,
        loss: (episodeLoss / episodeLength).toFixed(4),
        epsilon: dqnAgent.getExplorationRate().toFixed(3)
      });
    }

    // Save checkpoint
    if (episode % trainingConfig.saveFrequency === 0) {
      await dqnAgent.save(`checkpoint-${episode}`);
    }

    // Evaluate (evaluateAgent is defined in Phase 4 and returns { meanReward, ... })
    if (episode % trainingConfig.evalFrequency === 0) {
      const evalResult = await evaluateAgent(dqnAgent, environment);
      console.log(`Evaluation at episode ${episode}: ${evalResult.meanReward.toFixed(2)}`);
    }

    // Early stopping
    if (checkEarlyStopping(trainingStats, episode)) {
      console.log('Early stopping triggered');
      break;
    }
  }

  return trainingStats;
}

const trainingStats = await trainAgent();
```
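
The loop calls checkEarlyStopping, which this document never defines. A plausible sketch, consistent with the earlyStoppingPatience and earlyStoppingThreshold values set in Phase 2 and with the checkConvergence helper below (an assumption, not AgentDB API):

```typescript
// Hypothetical helper: stop when average reward over the last `patience`
// episodes improved by less than `threshold` relative to the window before it.
function checkEarlyStopping(stats: { totalReward: number[] }, episode: number): boolean {
  const patience = trainingConfig.earlyStoppingPatience;   // 500
  const threshold = trainingConfig.earlyStoppingThreshold; // 0.01
  if (episode < patience * 2) return false;

  const recent = stats.totalReward.slice(-patience);
  const previous = stats.totalReward.slice(-patience * 2, -patience);
  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  return (recentAvg - previousAvg) / Math.abs(previousAvg || 1) < threshold;
}
```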


2. Monitor training progress

```typescript
monitor.on('training-update', (stats) => {
  // Calculate moving averages
  const window = 100;
  const recentRewards = stats.totalReward.slice(-window);
  const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;

  // Store metrics
  agentDB.memory.store('agentdb/learning/training-progress', {
    episode: stats.episodes[stats.episodes.length - 1],
    avgReward: avgReward,
    explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
    timestamp: Date.now()
  });

  // Plot learning curve (if visualization enabled)
  if (monitor.visualization) {
    monitor.plot('reward-curve', stats.episodes, stats.totalReward);
    monitor.plot('loss-curve', stats.episodes, stats.loss);
  }
});
```

3. Handle convergence

```typescript
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
  if (stats.totalReward.length < windowSize * 2) {
    return false;
  }

  const recent = stats.totalReward.slice(-windowSize);
  const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);

  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);

  return improvement < threshold;
}
```


4. Save trained model

```typescript
await dqnAgent.save('trained-agent-final', {
  includeReplayBuffer: false,
  includeOptimizer: false,
  metadata: {
    trainingStats: trainingStats,
    hyperparameters: hyperparameters,
    finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
  }
});

console.log('Training complete. Model saved.');

```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/training-results', {
  algorithm: 'dqn',
  episodes: trainingStats.episodes.length,
  finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
  converged: checkConvergence(trainingStats),
  modelPath: 'trained-agent-final',
  timestamp: Date.now()
});
```

Validation:

  • Training completed or converged
  • Reward curve shows improvement
  • Model saved successfully
  • Training stats stored

Phase 4: Validate Performance (1-2 hours)

Objective: Benchmark trained agent and validate performance

Agent: performance-benchmarker

Steps:

1. Load trained agent

```typescript
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
```
2. Run evaluation episodes

```typescript
async function evaluateAgent(agent, env, numEpisodes = 100) {
  const results = {
    rewards: [] as number[],
    episodeLengths: [] as number[],
    successRate: 0
  };

  for (let i = 0; i < numEpisodes; i++) {
    let state = await env.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let success = false;

    for (let step = 0; step < 1000; step++) {
      const action = await agent.selectAction(state, { explore: false });
      const { nextState, reward, done } = await env.step(action);

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) {
        success = env.isSuccessful(state);
        break;
      }
    }

    results.rewards.push(episodeReward);
    results.episodeLengths.push(episodeLength);
    if (success) results.successRate += 1;
  }

  results.successRate /= numEpisodes;

  return {
    meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
    stdReward: calculateStd(results.rewards),
    meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
    successRate: results.successRate,
    results: results
  };
}

const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
```
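
evaluateAgent also relies on a calculateStd helper that is not defined anywhere in the document; a minimal sketch (assumed to be the population standard deviation):

```typescript
// Hypothetical helper assumed by evaluateAgent: population standard deviation.
function calculateStd(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```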


3. Compare with baseline

```typescript
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);

// Calculate improvement
const improvement = {
  rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
  lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
  successImprovement: evalResults.successRate - randomResults.successRate
};

console.log('Improvement over random:', improvement);
```

4. Run comprehensive benchmarks

```typescript
const benchmarks = {
  performanceMetrics: {
    meanReward: evalResults.meanReward,
    stdReward: evalResults.stdReward,
    successRate: evalResults.successRate,
    meanEpisodeLength: evalResults.meanLength
  },
  algorithmComparison: {
    dqn: evalResults,
    random: randomResults,
    improvement: improvement
  },
  inferenceTiming: {
    actionSelection: 0,
    totalEpisode: 0
  }
};

// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();
for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  await trainedAgent.selectAction(state, { explore: false });
}
const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;

await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
```


Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/validation', {
  evaluated: true,
  meanReward: evalResults.meanReward,
  successRate: evalResults.successRate,
  improvement: improvement,
  timestamp: Date.now()
});
```

Validation:

  • Evaluation completed (100 episodes)
  • Mean reward exceeds threshold
  • Success rate acceptable
  • Improvement over baseline demonstrated

Phase 5: Deploy Trained Agents (1-2 hours)

Objective: Deploy trained agents to production environment

Agent: ml-developer

Steps:

1. Export production model

```typescript
await trainedAgent.export('production-agent', {
  format: 'onnx', // or 'tensorflowjs', 'pytorch'
  optimize: true,
  quantize: 'int8', // Quantization for faster inference
  includeMetadata: true
});
```
2. Create inference API

```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');

app.post('/api/predict', async (req, res) => {
  try {
    const { state } = req.body;

    const action = await productionAgent.selectAction(state, {
      explore: false,
      returnProbabilities: true
    });

    res.json({
      action: action.action,
      probabilities: action.probabilities,
      confidence: action.confidence
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RL agent API running on port 3000');
});
```
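
To exercise the endpoint once it is running, a client-side sketch (the state payload must match the environment's state space; [x, y] here assumes the grid world from Phase 1):

```typescript
// Example request against the local inference API defined above.
const response = await fetch('http://localhost:3000/api/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ state: [2, 5] }) // position in the 10x10 grid world
});

console.log(await response.json()); // { action, probabilities, confidence }
```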


3. Set up monitoring

```typescript
import { ProductionMonitor } from '@agentdb/monitoring';

const prodMonitor = new ProductionMonitor({
  agent: productionAgent,
  metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
  alerting: {
    latencyThreshold: 100, // ms
    anomalyDetection: true
  }
});

await prodMonitor.start();
```

4. Create deployment pipeline

```typescript
const deploymentPipeline = {
  stages: [
    {
      name: 'validation',
      steps: [
        'Load trained model',
        'Run validation suite',
        'Check performance metrics',
        'Verify inference speed'
      ]
    },
    {
      name: 'export',
      steps: [
        'Export to production format',
        'Optimize model',
        'Quantize weights',
        'Package artifacts'
      ]
    },
    {
      name: 'deployment',
      steps: [
        'Deploy to staging',
        'Run smoke tests',
        'Deploy to production',
        'Monitor performance'
      ]
    }
  ]
};

await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);
```


Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/production', {
  deployed: true,
  modelPath: 'production-agent',
  apiEndpoint: 'http://localhost:3000/api/predict',
  monitoring: true,
  timestamp: Date.now()
});
```

Validation:

  • Model exported successfully
  • API running and responding
  • Monitoring active
  • Deployment pipeline documented

Integration Scripts

Complete Training Script

```bash
#!/bin/bash
# train-rl-agent.sh

set -e

echo "AgentDB RL Training Script"
echo "=========================="

# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms

# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"

# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"

# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"

# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"

echo "Training complete!"

Quick Start Script

```typescript
// quickstart-rl.ts
import { setupRLTraining } from './setup';

async function quickStart() {
  console.log('Starting RL training quick setup...');

  // Setup
  const { learningDB, environment, agent } = await setupRLTraining({
    algorithm: 'dqn',
    environment: 'grid-world',
    episodes: 1000
  });

  // Train
  console.log('Training agent...');
  const stats = await agent.train(environment, {
    episodes: 1000,
    logInterval: 100
  });

  // Evaluate
  console.log('Evaluating agent...');
  const results = await agent.evaluate(environment, {
    episodes: 100
  });
  console.log('Results:', results);

  // Save
  await agent.save('quickstart-agent');
  console.log('Quick start complete!');
}

quickStart().catch(console.error);
```

Evidence-Based Success Criteria

  1. Training Convergence (Self-Consistency)

    • Reward curve stabilizes
    • Moving average improvement < 1%
    • Agent achieves consistent performance
  2. Performance Benchmarks (Quantitative)

    • Mean reward exceeds baseline by 50%
    • Success rate > 80%
    • Inference time < 10ms per action
  3. Algorithm Validation (Chain-of-Verification)

    • Hyperparameters validated
    • Exploration-exploitation balanced
    • Experience replay functioning
  4. Production Readiness (Multi-Agent Consensus)

    • Model exported successfully
    • API responds within latency threshold
    • Monitoring active and alerting
    • Deployment pipeline documented
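
The quantitative criteria above can be gated mechanically against the Phase 4 results; a sketch using the evalResults, improvement, and benchmarks objects from that phase:

```typescript
// Deployment gate derived from the quantitative success criteria.
const productionReady =
  improvement.rewardImprovement >= 0.5 &&          // >= 50% over baseline
  evalResults.successRate > 0.8 &&                 // success rate > 80%
  benchmarks.inferenceTiming.actionSelection < 10; // < 10 ms per action

console.log(productionReady ? 'Ready for production' : 'Needs more training');
```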

Additional Resources

Bundled files

Note: the files included in the ZIP. In addition to `SKILL.md` itself, the package may contain reference materials, samples, and scripts.