
agentdb-reinforcement-learning-training

Train AI agents using AgentDB's 9 reinforcement learning algorithms including Q-Learning, DQN, PPO, and Actor-Critic. Build self-learning agents, implement RL training loops with experience replay, and deploy optimized models to production.

⚡ Recommended: one-line install (60 seconds)

Copy the command below and paste it into your terminal (Mac/Linux) or PowerShell (Windows). Download → extract → install, fully automatic.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o agentdb-reinforcement-learning-training.zip https://jpskill.com/download/18732.zip && unzip -o agentdb-reinforcement-learning-training.zip && rm agentdb-reinforcement-learning-training.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/18732.zip -OutFile "$d\agentdb-reinforcement-learning-training.zip"; Expand-Archive "$d\agentdb-reinforcement-learning-training.zip" -DestinationPath $d -Force; ri "$d\agentdb-reinforcement-learning-training.zip"

When it finishes, restart Claude Code → then just ask normally, e.g. "train an RL agent", and the skill activates automatically.

💾 Manual download (if the one-line command is not for you)
  1. Click the blue button below to download agentdb-reinforcement-learning-training.zip
  2. Double-click the ZIP to extract it → this creates an agentdb-reinforcement-learning-training folder
  3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site assumes no responsibility for the content, behavior, or safety of this skill.

🎯 What this Skill can do

The description below explains what this Skill will do for you. When you ask Claude for help in this area, it activates automatically.

📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Rename the extension from .skill to .zip and extract it (macOS can auto-extract)
  3. Place the extracted folder in .claude/skills/ under your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. You don't have to say "use this Skill…" — it is invoked automatically for related requests.

See the detailed usage guide →

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 2
📖 The original SKILL.md that Claude reads

The text below is the original (in English or Chinese) that the AI (Claude) reads. A Japanese translation is being added progressively.

AgentDB Reinforcement Learning Training

Overview

Train AI agents with AgentDB's learning plugin and its 9 reinforcement learning algorithms, including Decision Transformer, Q-Learning, SARSA, Actor-Critic, PPO, and more. Build self-learning agents, implement RL training loops, and optimize agent behavior through experience.

When to Use This Skill

Use this skill when you need to:

  • Train autonomous agents that learn from experience
  • Implement reinforcement learning systems
  • Optimize agent behavior through trial and error
  • Build self-improving AI systems
  • Deploy RL agents in production environments
  • Benchmark and compare RL algorithms

Available RL Algorithms

  1. Q-Learning - Value-based, off-policy
  2. SARSA - Value-based, on-policy
  3. Deep Q-Network (DQN) - Deep RL with experience replay
  4. Actor-Critic - Policy gradient with value baseline
  5. Proximal Policy Optimization (PPO) - Trust region policy optimization
  6. Decision Transformer - Offline RL with transformers
  7. Advantage Actor-Critic (A2C) - Synchronous advantage estimation
  8. Twin Delayed DDPG (TD3) - Continuous control
  9. Soft Actor-Critic (SAC) - Maximum entropy RL
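
For intuition on the value-based entries above: Q-Learning updates a state-action value toward the Bellman target after every transition. A minimal tabular sketch (plain TypeScript, independent of the AgentDB API):

```typescript
// Tabular Q-Learning: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a)).
// SARSA would use the action actually taken in s' (on-policy) instead of the
// greedy max (off-policy); DQN replaces the table with a neural network plus
// experience replay and a target network, as configured in Phase 2.
type QTable = Map<string, number[]>; // state key -> one value per action

function qLearningUpdate(
  q: QTable,
  state: string,
  action: number,
  reward: number,
  nextState: string,
  numActions: number,
  lr = 0.1,
  gamma = 0.99
): void {
  const row = q.get(state) ?? new Array(numActions).fill(0);
  const nextRow = q.get(nextState) ?? new Array(numActions).fill(0);
  const target = reward + gamma * Math.max(...nextRow);
  row[action] += lr * (target - row[action]);
  q.set(state, row);
}
```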

SOP Framework: 5-Phase RL Training Deployment

Phase 1: Initialize Learning Environment (1-2 hours)

Objective: Set up AgentDB learning infrastructure with environment configuration

Agent: ml-developer

Steps:

1. Install AgentDB Learning Module

```bash
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
```
2. Initialize learning database

```typescript
import { AgentDB, LearningPlugin } from 'agentdb-learning';

const learningDB = new AgentDB({
  name: 'rl-training-db',
  dimensions: 512, // State embedding dimension
  learning: {
    enabled: true,
    persistExperience: true,
    replayBufferSize: 100000
  }
});

await learningDB.initialize();

// Create learning plugin
const learningPlugin = new LearningPlugin({
  database: learningDB,
  algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
  config: {
    batchSize: 64,
    learningRate: 0.001,
    discountFactor: 0.99,
    explorationRate: 1.0,
    explorationDecay: 0.995
  }
});

await learningPlugin.initialize();
```

3. Define environment

```typescript
import { Environment } from '@agentdb/environments';

const environment = new Environment({
  name: 'grid-world',
  stateSpace: {
    type: 'continuous',
    shape: [10, 10],
    bounds: [[0, 10], [0, 10]]
  },
  actionSpace: {
    type: 'discrete',
    actions: ['up', 'down', 'left', 'right']
  },
  rewardFunction: (state, action, nextState) => {
    // Distance to goal reward
    const goalDistance = Math.sqrt(
      Math.pow(nextState[0] - 9, 2) +
      Math.pow(nextState[1] - 9, 2)
    );
    return -goalDistance + (goalDistance === 0 ? 100 : 0);
  },
  terminalCondition: (state) => {
    return state[0] === 9 && state[1] === 9; // Reached goal
  }
});

await environment.initialize();
```
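
A quick smoke test of the environment contract before training (a sketch assuming the same reset/step API used in the Phase 3 training loop):

```typescript
// One manual rollout step to confirm reset/step, reward, and termination work.
const state = await environment.reset();
const { nextState, reward, done } = await environment.step('right');
console.log({ state, nextState, reward, done });
```

This covers the "Environment configured and tested" validation item below.
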
4. Set up monitoring

```typescript
const monitor = learningPlugin.createMonitor({
  metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
  logInterval: 100, // Log every 100 episodes
  saveCheckpoints: true,
  checkpointInterval: 1000
});

monitor.on('episode-complete', (episode) => {
  console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});
```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/environment', {
  name: environment.name,
  stateSpace: environment.stateSpace,
  actionSpace: environment.actionSpace,
  initialized: Date.now()
});
```

Validation:

  • Learning database initialized
  • Environment configured and tested
  • Monitor capturing metrics
  • Configuration stored in memory

Phase 2: Configure RL Algorithm (1-2 hours)

Objective: Select and configure RL algorithm for the learning task

Agent: ml-developer

Steps:

1. Select algorithm

```typescript
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
  algorithm: 'dqn',
  config: {
    networkArchitecture: {
      layers: [
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
      ]
    },
    learningRate: 0.001,
    batchSize: 64,
    replayBuffer: {
      size: 100000,
      prioritized: true,
      alpha: 0.6,
      beta: 0.4
    },
    targetNetwork: {
      updateFrequency: 1000,
      tauSync: 0.001 // Soft update
    },
    exploration: {
      initial: 1.0,
      final: 0.01,
      decay: 0.995
    },
    training: {
      startAfter: 1000, // Start training after 1000 experiences
      updateFrequency: 4
    }
  }
});

await dqnAgent.initialize();
```


2. Configure hyperparameters

```typescript
const hyperparameters = {
  // Learning parameters
  learningRate: 0.001,
  discountFactor: 0.99, // Gamma
  batchSize: 64,

  // Exploration
  epsilonStart: 1.0,
  epsilonEnd: 0.01,
  epsilonDecay: 0.995,

  // Experience replay
  replayBufferSize: 100000,
  minReplaySize: 1000,
  prioritizedReplay: true,

  // Training
  maxEpisodes: 10000,
  maxStepsPerEpisode: 1000,
  targetUpdateFrequency: 1000,

  // Evaluation
  evalFrequency: 100,
  evalEpisodes: 10
};

dqnAgent.setHyperparameters(hyperparameters);
```
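
The exploration settings above define a multiplicative schedule, epsilon_t = max(epsilonEnd, epsilonStart * epsilonDecay^t), which reaches the 0.01 floor after roughly log(0.01)/log(0.995) ≈ 919 episodes. A sketch of the implied schedule:

```typescript
// Epsilon-greedy schedule implied by epsilonStart / epsilonEnd / epsilonDecay.
function epsilonAt(episode: number): number {
  const { epsilonStart, epsilonEnd, epsilonDecay } = hyperparameters;
  return Math.max(epsilonEnd, epsilonStart * Math.pow(epsilonDecay, episode));
}

console.log(epsilonAt(0), epsilonAt(500), epsilonAt(1000)); // 1.0, ~0.082, 0.01
```
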
3. Set up experience replay

```typescript
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';

const replayBuffer = new PrioritizedReplayBuffer({
  capacity: 100000,
  alpha: 0.6,          // Prioritization exponent
  beta: 0.4,           // Importance sampling
  betaIncrement: 0.001,
  epsilon: 0.01        // Small constant for stability
});

dqnAgent.setReplayBuffer(replayBuffer);
```
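
For reference, prioritized replay (the alpha/beta parameters above) samples transition i with probability P(i) = p_i^alpha / Σ_k p_k^alpha and corrects the resulting bias with importance-sampling weights w_i = (N · P(i))^(−beta). A standalone sketch of that math (not the AgentDB implementation):

```typescript
// Sampling probabilities and normalized importance weights for prioritized
// experience replay (Schaul et al., 2016).
function priorityWeights(priorities: number[], alpha: number, beta: number) {
  const scaled = priorities.map((p) => Math.pow(p, alpha));
  const total = scaled.reduce((a, b) => a + b, 0);
  const probs = scaled.map((s) => s / total);
  const n = priorities.length;
  const raw = probs.map((p) => Math.pow(n * p, -beta));
  const maxW = Math.max(...raw);
  return { probs, weights: raw.map((w) => w / maxW) };
}
```

betaIncrement anneals beta toward 1 over training, so the bias correction becomes exact as the policy converges.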


4. Configure training loop

```typescript
const trainingConfig = {
  episodes: 10000,
  stepsPerEpisode: 1000,
  warmupSteps: 1000,
  trainFrequency: 4,
  targetUpdateFrequency: 1000,
  saveFrequency: 1000,
  evalFrequency: 100,
  earlyStoppingPatience: 500,
  earlyStoppingThreshold: 0.01
};

dqnAgent.setTrainingConfig(trainingConfig);
```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/algorithm-config', {
  algorithm: 'dqn',
  hyperparameters: hyperparameters,
  trainingConfig: trainingConfig,
  configured: Date.now()
});
```

Validation:

  • Algorithm selected and configured
  • Hyperparameters validated
  • Replay buffer initialized
  • Training config set

Phase 3: Train Agents (3-4 hours)

Objective: Execute training iterations and optimize agent behavior

Agent: safla-neural

Steps:

1. Start training loop

```typescript
async function trainAgent() {
  console.log('Starting RL training...');

  const trainingStats = {
    episodes: [] as number[],
    totalReward: [] as number[],
    episodeLength: [] as number[],
    loss: [] as number[],
    explorationRate: [] as number[]
  };

  for (let episode = 0; episode < trainingConfig.episodes; episode++) {
    let state = await environment.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let episodeLoss = 0;

    for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
      // Select action
      const action = await dqnAgent.selectAction(state, { explore: true });

      // Execute action
      const { nextState, reward, done } = await environment.step(action);

      // Store experience
      await dqnAgent.storeExperience({ state, action, reward, nextState, done });

      // Train if enough experiences
      if (dqnAgent.canTrain()) {
        const loss = await dqnAgent.train();
        episodeLoss += loss;
      }

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) break;
    }

    // Update target network
    if (episode % trainingConfig.targetUpdateFrequency === 0) {
      await dqnAgent.updateTargetNetwork();
    }

    // Decay exploration
    dqnAgent.decayExploration();

    // Log progress
    trainingStats.episodes.push(episode);
    trainingStats.totalReward.push(episodeReward);
    trainingStats.episodeLength.push(episodeLength);
    trainingStats.loss.push(episodeLoss / episodeLength);
    trainingStats.explorationRate.push(dqnAgent.getExplorationRate());

    if (episode % 100 === 0) {
      console.log(`Episode ${episode}:`, {
        reward: episodeReward.toFixed(2),
        length: episodeLength,
        loss: (episodeLoss / episodeLength).toFixed(4),
        epsilon: dqnAgent.getExplorationRate().toFixed(3)
      });
    }

    // Save checkpoint
    if (episode % trainingConfig.saveFrequency === 0) {
      await dqnAgent.save(`checkpoint-${episode}`);
    }

    // Evaluate (evaluateAgent is defined in Phase 4 and returns { meanReward, ... })
    if (episode % trainingConfig.evalFrequency === 0) {
      const evalResult = await evaluateAgent(dqnAgent, environment);
      console.log(`Evaluation at episode ${episode}: ${evalResult.meanReward.toFixed(2)}`);
    }

    // Early stopping
    if (checkEarlyStopping(trainingStats, episode)) {
      console.log('Early stopping triggered');
      break;
    }
  }

  return trainingStats;
}

const trainingStats = await trainAgent();
```
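
The loop calls checkEarlyStopping, which this document never defines. A plausible sketch, consistent with the earlyStoppingPatience and earlyStoppingThreshold values set in Phase 2 and with the checkConvergence helper below (an assumption, not AgentDB API):

```typescript
// Hypothetical helper: stop when average reward over the last `patience`
// episodes improved by less than `threshold` relative to the window before it.
function checkEarlyStopping(stats: { totalReward: number[] }, episode: number): boolean {
  const patience = trainingConfig.earlyStoppingPatience;   // 500
  const threshold = trainingConfig.earlyStoppingThreshold; // 0.01
  if (episode < patience * 2) return false;

  const recent = stats.totalReward.slice(-patience);
  const previous = stats.totalReward.slice(-patience * 2, -patience);
  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  return (recentAvg - previousAvg) / Math.abs(previousAvg || 1) < threshold;
}
```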


2. Monitor training progress

```typescript
monitor.on('training-update', (stats) => {
  // Calculate moving averages
  const window = 100;
  const recentRewards = stats.totalReward.slice(-window);
  const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;

  // Store metrics
  agentDB.memory.store('agentdb/learning/training-progress', {
    episode: stats.episodes[stats.episodes.length - 1],
    avgReward: avgReward,
    explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
    timestamp: Date.now()
  });

  // Plot learning curve (if visualization enabled)
  if (monitor.visualization) {
    monitor.plot('reward-curve', stats.episodes, stats.totalReward);
    monitor.plot('loss-curve', stats.episodes, stats.loss);
  }
});
```

3. Handle convergence

```typescript
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
  if (stats.totalReward.length < windowSize * 2) {
    return false;
  }

  const recent = stats.totalReward.slice(-windowSize);
  const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);

  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);

  return improvement < threshold;
}
```


4. Save trained model

```typescript
await dqnAgent.save('trained-agent-final', {
  includeReplayBuffer: false,
  includeOptimizer: false,
  metadata: {
    trainingStats: trainingStats,
    hyperparameters: hyperparameters,
    finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
  }
});

console.log('Training complete. Model saved.');

```

Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/training-results', {
  algorithm: 'dqn',
  episodes: trainingStats.episodes.length,
  finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
  converged: checkConvergence(trainingStats),
  modelPath: 'trained-agent-final',
  timestamp: Date.now()
});
```

Validation:

  • Training completed or converged
  • Reward curve shows improvement
  • Model saved successfully
  • Training stats stored

Phase 4: Validate Performance (1-2 hours)

Objective: Benchmark trained agent and validate performance

Agent: performance-benchmarker

Steps:

1. Load trained agent

```typescript
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
```
2. Run evaluation episodes

```typescript
async function evaluateAgent(agent, env, numEpisodes = 100) {
  const results = {
    rewards: [] as number[],
    episodeLengths: [] as number[],
    successRate: 0
  };

  for (let i = 0; i < numEpisodes; i++) {
    let state = await env.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let success = false;

    for (let step = 0; step < 1000; step++) {
      const action = await agent.selectAction(state, { explore: false });
      const { nextState, reward, done } = await env.step(action);

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) {
        success = env.isSuccessful(state);
        break;
      }
    }

    results.rewards.push(episodeReward);
    results.episodeLengths.push(episodeLength);
    if (success) results.successRate += 1;
  }

  results.successRate /= numEpisodes;

  return {
    meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
    stdReward: calculateStd(results.rewards),
    meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
    successRate: results.successRate,
    results: results
  };
}

const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
```
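
evaluateAgent also relies on a calculateStd helper that is not defined anywhere in the document; a minimal sketch (assumed to be the population standard deviation):

```typescript
// Hypothetical helper assumed by evaluateAgent: population standard deviation.
function calculateStd(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```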


3. Compare with baseline

```typescript
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);

// Calculate improvement
const improvement = {
  rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
  lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
  successImprovement: evalResults.successRate - randomResults.successRate
};

console.log('Improvement over random:', improvement);
```

4. Run comprehensive benchmarks

```typescript
const benchmarks = {
  performanceMetrics: {
    meanReward: evalResults.meanReward,
    stdReward: evalResults.stdReward,
    successRate: evalResults.successRate,
    meanEpisodeLength: evalResults.meanLength
  },
  algorithmComparison: {
    dqn: evalResults,
    random: randomResults,
    improvement: improvement
  },
  inferenceTiming: {
    actionSelection: 0,
    totalEpisode: 0
  }
};

// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();
for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  await trainedAgent.selectAction(state, { explore: false });
}
const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;

await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
```


Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/validation', {
  evaluated: true,
  meanReward: evalResults.meanReward,
  successRate: evalResults.successRate,
  improvement: improvement,
  timestamp: Date.now()
});
```

Validation:

  • Evaluation completed (100 episodes)
  • Mean reward exceeds threshold
  • Success rate acceptable
  • Improvement over baseline demonstrated

Phase 5: Deploy Trained Agents (1-2 hours)

Objective: Deploy trained agents to production environment

Agent: ml-developer

Steps:

1. Export production model

```typescript
await trainedAgent.export('production-agent', {
  format: 'onnx', // or 'tensorflowjs', 'pytorch'
  optimize: true,
  quantize: 'int8', // Quantization for faster inference
  includeMetadata: true
});
```
2. Create inference API

```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');

app.post('/api/predict', async (req, res) => {
  try {
    const { state } = req.body;

    const action = await productionAgent.selectAction(state, {
      explore: false,
      returnProbabilities: true
    });

    res.json({
      action: action.action,
      probabilities: action.probabilities,
      confidence: action.confidence
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RL agent API running on port 3000');
});
```
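
To exercise the endpoint once it is running, a client-side sketch (the state payload must match the environment's state space; [x, y] here assumes the grid world from Phase 1):

```typescript
// Example request against the local inference API defined above.
const response = await fetch('http://localhost:3000/api/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ state: [2, 5] }) // position in the 10x10 grid world
});

console.log(await response.json()); // { action, probabilities, confidence }
```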


3. Set up monitoring

```typescript
import { ProductionMonitor } from '@agentdb/monitoring';

const prodMonitor = new ProductionMonitor({
  agent: productionAgent,
  metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
  alerting: {
    latencyThreshold: 100, // ms
    anomalyDetection: true
  }
});

await prodMonitor.start();
```

4. Create deployment pipeline

```typescript
const deploymentPipeline = {
  stages: [
    {
      name: 'validation',
      steps: [
        'Load trained model',
        'Run validation suite',
        'Check performance metrics',
        'Verify inference speed'
      ]
    },
    {
      name: 'export',
      steps: [
        'Export to production format',
        'Optimize model',
        'Quantize weights',
        'Package artifacts'
      ]
    },
    {
      name: 'deployment',
      steps: [
        'Deploy to staging',
        'Run smoke tests',
        'Deploy to production',
        'Monitor performance'
      ]
    }
  ]
};

await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);
```


Memory Pattern:

```typescript
await agentDB.memory.store('agentdb/learning/production', {
  deployed: true,
  modelPath: 'production-agent',
  apiEndpoint: 'http://localhost:3000/api/predict',
  monitoring: true,
  timestamp: Date.now()
});
```

Validation:

  • Model exported successfully
  • API running and responding
  • Monitoring active
  • Deployment pipeline documented

Integration Scripts

Complete Training Script

```bash
#!/bin/bash
# train-rl-agent.sh

set -e

echo "AgentDB RL Training Script"
echo "=========================="

# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms

# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"

# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"

# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"

# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"

echo "Training complete!"

Quick Start Script

```typescript
// quickstart-rl.ts
import { setupRLTraining } from './setup';

async function quickStart() {
  console.log('Starting RL training quick setup...');

  // Setup
  const { learningDB, environment, agent } = await setupRLTraining({
    algorithm: 'dqn',
    environment: 'grid-world',
    episodes: 1000
  });

  // Train
  console.log('Training agent...');
  const stats = await agent.train(environment, {
    episodes: 1000,
    logInterval: 100
  });

  // Evaluate
  console.log('Evaluating agent...');
  const results = await agent.evaluate(environment, {
    episodes: 100
  });
  console.log('Results:', results);

  // Save
  await agent.save('quickstart-agent');
  console.log('Quick start complete!');
}

quickStart().catch(console.error);
```

Evidence-Based Success Criteria

  1. Training Convergence (Self-Consistency)

    • Reward curve stabilizes
    • Moving average improvement < 1%
    • Agent achieves consistent performance
  2. Performance Benchmarks (Quantitative)

    • Mean reward exceeds baseline by 50%
    • Success rate > 80%
    • Inference time < 10ms per action
  3. Algorithm Validation (Chain-of-Verification)

    • Hyperparameters validated
    • Exploration-exploitation balanced
    • Experience replay functioning
  4. Production Readiness (Multi-Agent Consensus)

    • Model exported successfully
    • API responds within latency threshold
    • Monitoring active and alerting
    • Deployment pipeline documented
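
The quantitative criteria above can be gated mechanically against the Phase 4 results; a sketch using the evalResults, improvement, and benchmarks objects from that phase:

```typescript
// Deployment gate derived from the quantitative success criteria.
const productionReady =
  improvement.rewardImprovement >= 0.5 &&          // >= 50% over baseline
  evalResults.successRate > 0.8 &&                 // success rate > 80%
  benchmarks.inferenceTiming.actionSelection < 10; // < 10 ms per action

console.log(productionReady ? 'Ready for production' : 'Needs more training');
```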

Additional Resources

Bundled files

Note: the files included in the ZIP. In addition to `SKILL.md` itself, the package may contain reference materials, samples, and scripts.