braintrust

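A Skill that acts as an expert in Braintrust, the evaluation and monitoring platform for AI applications. It helps developers compare models, track experiments, and measure quality metrics so that AI development can be as rigorous as traditional software testing.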

📜 Original English description (for reference)

You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.

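🇯🇵 Notes for Japanese creators

In a nutshell

Note: this commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.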

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o braintrust.zip https://jpskill.com/download/14700.zip && unzip -o braintrust.zip && rm braintrust.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/14700.zip -OutFile "$d\braintrust.zip"; Expand-Archive "$d\braintrust.zip" -DestinationPath $d -Force; ri "$d\braintrust.zip"

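When it finishes, restart Claude Code. After that, just ask in plain language, for example "Set up an eval for my chatbot", and the Skill activates automatically.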

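💾 Prefer to download manually (if the command line is difficult)

1. Press the blue button below to download braintrust.zip
2. Double-click the ZIP file to extract it; this creates a braintrust folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (for advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety.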

🎯 What this Skill can do

The description below explains what this Skill does for you. When you ask Claude for help in this area, the Skill activates automatically.

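📦 How to install (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill..."; it is invoked automatically for related requests.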

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English/Chinese source that Claude reads)

Braintrust — AI Evaluation and Observability

Core Capabilities

import { Eval } from "braintrust";
import { Factuality, ClosedQA } from "autoevals";

// The SDK reads BRAINTRUST_API_KEY from the environment, so no explicit init call is made here.
// callChatbot is your own application function; it is not part of Braintrust.

// Run evaluation
await Eval("support-chatbot", {
  data: () => [
    { input: "How do I reset my password?", expected: "Go to Settings > Security > Reset Password" },
    { input: "What's the pricing?", expected: "Plans start at $29/month" },
    { input: "I need a refund", expected: "Contact support at help@example.com" },
  ],
  task: async (input) => {
    const response = await callChatbot(input);
    return response.text;
  },
  scores: [
    // Built-in scorers (LLM-based, imported from autoevals)
    Factuality,                            // Does output match expected facts?
    ClosedQA,                              // Is the answer correct given context?
    // Custom scorer: receives a single object holding the case's output and expected value
    ({ output, expected }) => {
      // Scores 1 if any word of the expected answer appears in the output
      const containsKey = expected.toLowerCase().split(" ")
        .some(word => output.toLowerCase().includes(word));
      return { name: "keyword_match", score: containsKey ? 1 : 0 };
    },
  ],
});
// Results visible in Braintrust dashboard with diffs, regressions, improvements
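
The inline keyword_match scorer above returns 1 as soon as any single word of the expected answer appears in the output, so common words such as "at" or "to" can produce a match. As a minimal sketch of a stricter variant, the function below scores the fraction of expected keywords found in the output. The names keywordCoverage, ScorerArgs, Score, and the stop-word list are illustrative assumptions and not part of Braintrust or autoevals; the sketch also assumes a scorer is called with a single object carrying output and expected.

```typescript
// Hypothetical argument shape, assuming a scorer is called with one object.
interface ScorerArgs {
  output: string;
  expected: string;
}

// Hypothetical result shape: a named score between 0 and 1.
interface Score {
  name: string;
  score: number;
}

// Illustrative stop-word list; tune it for your own domain.
const STOP_WORDS = new Set(["a", "an", "the", "to", "at", "for", "of", "is", "in", "on"]);

// Lowercase, split on non-alphanumeric characters, and drop stop words.
function keywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

// Fraction of distinct expected keywords that also occur in the output.
function keywordCoverage({ output, expected }: ScorerArgs): Score {
  const wanted = Array.from(new Set(keywords(expected)));
  if (wanted.length === 0) {
    return { name: "keyword_coverage", score: 1 }; // nothing to match
  }
  const found = new Set(keywords(output));
  const hits = wanted.filter((word) => found.has(word)).length;
  return { name: "keyword_coverage", score: hits / wanted.length };
}
```

A function like this could sit in the scores array next to the built-in scorers. Because it returns a fraction rather than 0 or 1, partial answers show up as partial credit.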

# Python
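from braintrust import Eval
from autoevals import Factuality

# test_pairs and rag_pipeline are your own application objects; they are not part of Braintrust.
Eval(
    "rag-pipeline",
    data=lambda: [{"input": q, "expected": a} for q, a in test_pairs],
    task=lambda input: rag_pipeline.query(input),
    # A relevance scorer can be added here; check autoevals for the exact scorer name (for example AnswerRelevancy).
    scores=[Factuality],
)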

Installation

npm install braintrust autoevals
# or
pip install braintrust autoevals

Best Practices

Eval-driven development: write evals first, then iterate on prompts and models, so you measure before you optimize
Built-in scorers: use LLM-based scorers from autoevals, such as Factuality and ClosedQA, for quality scoring
Custom scorers: add domain-specific metrics and combine them with the built-in scorers for a fuller evaluation
Experiments: each eval run is recorded as an experiment; compare runs side by side in the dashboard
Production logging: use braintrust.traced() for production observability; traces appear in the same dashboard as evals
CI integration: run evals in CI and fail the build on quality regressions
Dataset management: store test datasets in Braintrust, then version them and share them across the team
A/B comparison: compare two model versions on the same dataset; statistical significance is reported
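
The CI integration and A/B comparison practices above both come down to comparing per-case scores from two runs on the same dataset. The sketch below shows that comparison in plain TypeScript with no Braintrust calls: it pairs scores by case id, counts improvements and regressions, and decides whether a build should pass. CaseScore, Comparison, compareRuns, and the maxMeanDrop tolerance are hypothetical names chosen for illustration; in practice the scores would come from your own eval runs, and the Braintrust dashboard presents its own comparison. The sketch does not attempt a statistical significance test.

```typescript
// Hypothetical record of one scored test case from an eval run.
interface CaseScore {
  id: string;
  score: number; // between 0 and 1
}

// Hypothetical summary of a baseline-versus-candidate comparison.
interface Comparison {
  compared: number;     // cases present in both runs
  improvements: number; // cases where the candidate scored higher
  regressions: number;  // cases where the candidate scored lower
  meanDelta: number;    // mean of (candidate - baseline) over shared cases
  pass: boolean;
}

// Compare two runs over the cases they share, keyed by id.
function compareRuns(
  baseline: CaseScore[],
  candidate: CaseScore[],
  maxMeanDrop = 0.02, // assumed tolerance; pick one that fits your metric's noise
): Comparison {
  const baselineById = new Map<string, number>();
  for (const c of baseline) baselineById.set(c.id, c.score);

  let compared = 0;
  let improvements = 0;
  let regressions = 0;
  let totalDelta = 0;
  for (const c of candidate) {
    const before = baselineById.get(c.id);
    if (before === undefined) continue; // skip cases missing from the baseline
    const delta = c.score - before;
    compared += 1;
    totalDelta += delta;
    if (delta > 0) improvements += 1;
    else if (delta < 0) regressions += 1;
  }

  const meanDelta = compared === 0 ? 0 : totalDelta / compared;
  // Fail when nothing could be compared or the mean score dropped past the tolerance.
  const pass = compared > 0 && meanDelta >= -maxMeanDrop;
  return { compared, improvements, regressions, meanDelta, pass };
}

// Example with made-up scores for three support-chatbot cases.
const baselineRun: CaseScore[] = [
  { id: "reset-password", score: 1 },
  { id: "pricing", score: 0.5 },
  { id: "refund", score: 1 },
];
const candidateRun: CaseScore[] = [
  { id: "reset-password", score: 1 },
  { id: "pricing", score: 1 },
  { id: "refund", score: 0.5 },
];
const summary = compareRuns(baselineRun, candidateRun);
console.log(summary);
```

In a CI job, a script along these lines could set a non-zero exit status when pass is false so that the build fails. A mean-based gate can hide a regression on one case behind an improvement on another, which is why the sketch also reports the per-case counts.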