experiment-design-planner

A Skill for designing a hypothesis-driven plan, evaluation metrics, and related details before starting a machine learning or AI experiment.

📜 Original English description (for reference)

Design hypothesis-driven ML/AI experiments before running them. Use this skill whenever the user wants to plan experiments, ablations, baselines, metrics, controls, seeds, logging, stop conditions, reviewer-proof evidence, or an experiment matrix for a paper claim before using run-experiment or writing results.

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The Skill's description shows what it can do for you. When you give Claude a request in this area, the Skill is invoked automatically.

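📦 Installation (3 steps)

1. Use the Download button on this page to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. You don't need to say "use this Skill…"; it is invoked automatically for relevant requests.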
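The manual steps above could also be scripted. The following is a minimal sketch, assuming the .skill file is an ordinary zip archive as step 2 implies; the function name `install_skill` and the fallback of using the archive's file name as the folder name are illustrative assumptions, not part of the site's instructions.

```python
import zipfile
from pathlib import Path
from typing import Optional


def install_skill(skill_file: Path, skills_dir: Optional[Path] = None) -> Path:
    """Extract a .skill archive (assumed to be a renamed .zip) into the skills folder."""
    # Default target is ~/.claude/skills/ (under %USERPROFILE% on Windows).
    target = skills_dir if skills_dir is not None else Path.home() / ".claude" / "skills"
    with zipfile.ZipFile(skill_file) as archive:
        names = archive.namelist()
        roots = {name.split("/", 1)[0] for name in names}
        # Assumption: if the archive has no single top-level folder,
        # create one named after the archive so files do not spill out.
        has_single_root = len(roots) == 1 and all("/" in name for name in names)
        dest = target if has_single_root else target / skill_file.stem
        dest.mkdir(parents=True, exist_ok=True)
        archive.extractall(dest)
    return dest
```

For example, `install_skill(Path("experiment-design-planner.skill"))` would extract into the default skills folder; the file name here is hypothetical. Claude Code still needs a restart afterwards.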

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 1

📖 Skill body (SKILL.md)

Experiment Design Planner

Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.

Use this skill when:

a user is about to run a new experiment or ablation
a paper claim needs evidence
baselines, metrics, controls, or datasets are unclear
the user is changing too many variables at once
cluster/compute time should not be wasted on ambiguous runs
reviewer-proof evidence is needed before submission

Pair this skill with:

research-project-memory when the experiment plan should become project-level evidence, risk, and action memory
run-experiment after the design is ready to execute
experiment-report-writer after results exist
paper-reviewer-simulator to stress-test whether the evidence will satisfy reviewers
baseline-selection-audit before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
figure-results-review after plotted or tabulated results exist and need claim-support review

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md

Progressive Loading

Always read references/evidence-standards.md and references/metrics-and-controls.md.
Read references/ablation-matrix.md when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
Use references/report-template.md when saving or returning a substantial experiment plan.
If the target repo has memory/, update planned evidence, experiment families, risks, and actions using research-project-memory conventions.
If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.
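The file-loading rules above can be read as a small decision function. This is only an illustrative sketch: the function name and its two boolean flags are hypothetical, and the memory/ and web-search rules are left out because they are actions rather than files to read.

```python
def references_to_read(compares_variants: bool, writes_plan: bool) -> list:
    """Return the reference files the loading rules call for."""
    # These two are always read.
    refs = [
        "references/evidence-standards.md",
        "references/metrics-and-controls.md",
    ]
    # Only needed when the plan compares variants, components, baselines, etc.
    if compares_variants:
        refs.append("references/ablation-matrix.md")
    # Only needed when a substantial plan is saved or returned.
    if writes_plan:
        refs.append("references/report-template.md")
    return refs
```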

Core Principles

Start from the claim, not the command line.
State the hypothesis before running experiments.
Use a baseline before introducing a new method.
Change one variable at a time unless the experiment is explicitly factorial.
Define controls and nuisance variables before interpreting results.
Make negative results useful by defining falsification and fallback decisions.
Design the table or figure before running the experiment.
Stop conditions matter: decide what result is enough to move on.

Step 1 - Define the Claim and Question

Extract:

paper or project claim
research question
target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
experiment mode:
- single: one controlled comparison
- ablation: component or variable isolation
- benchmark: compare methods across datasets/tasks
- theory: empirical support for a theoretical prediction
- diagnostic: understand a failure mode or surprising result

Rewrite vague goals into testable questions:

Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
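One way to keep a question testable is to force every slot in the template above to be filled in. The sketch below is an assumption about how that might look in Python; the class and field names are illustrative and are not defined by the skill.

```python
from dataclasses import dataclass
from enum import Enum


class ExperimentMode(Enum):
    SINGLE = "single"
    ABLATION = "ablation"
    BENCHMARK = "benchmark"
    THEORY = "theory"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class ExperimentQuestion:
    component: str
    metric: str
    baseline: str
    datasets: tuple
    constraint: str
    mode: ExperimentMode = ExperimentMode.SINGLE

    def render(self) -> str:
        """Fill the 'testable question' template with concrete values."""
        return (
            f"Does {self.component} improve {self.metric} over {self.baseline} "
            f"on datasets {'/'.join(self.datasets)} under {self.constraint}?"
        )


question = ExperimentQuestion(
    component="component X",
    metric="metric M",
    baseline="baseline B",
    datasets=("D1", "D2"),
    constraint="the same training budget",
    mode=ExperimentMode.ABLATION,
)
print(question.render())
```

Because every field except `mode` is required, a vague goal such as "Does our method work?" cannot be expressed without first naming a metric, a baseline, and datasets.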

Step 2 - State Hypotheses

Write:

primary hypothesis
alternative explanations
expected metric direction and rough effect size
falsification condition
decision rule

If the user cannot state a falsification condition, the experiment is not ready.
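The readiness rule above lends itself to a simple check. The following is a minimal sketch under the assumption that hypotheses are kept as structured records; the `Hypothesis` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    primary: str
    alternatives: list = field(default_factory=list)
    expected_direction: str = ""  # e.g. "accuracy goes up"
    expected_effect: str = ""     # rough effect size, in the metric's units
    falsification: str = ""
    decision_rule: str = ""

    def missing(self) -> list:
        """Names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def is_ready(self) -> bool:
        # Per the rule above: no falsification condition, no experiment.
        return bool(self.falsification.strip())
```

A record with only `primary` set reports `is_ready()` as `False`, and `missing()` lists what still has to be written before any run is launched.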

Step 3 - Define Evidence Standard

Read references/evidence-standards.md.

Decide what evidence is needed:

one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
number of datasets/tasks
number of seeds or repeats
required baselines
acceptable variance
whether statistical testing or confidence intervals are needed
whether results must support a paper claim or only guide next steps
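When seeds and confidence intervals are part of the evidence standard, it helps to decide up front how they will be computed. The sketch below uses a normal approximation from the Python standard library; that choice is an assumption made for brevity, and with only a handful of seeds a t-based interval or a bootstrap would usually be more appropriate. The function name is illustrative.

```python
import math
import statistics


def summarize_seeds(scores, confidence=0.95):
    """Return (mean, ci_low, ci_high) for one metric across seeds.

    Uses a normal approximation to the sampling distribution of the mean,
    which understates uncertainty when there are few seeds.
    """
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    mean = statistics.fmean(scores)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem
```

Calling `summarize_seeds([0.80, 0.82, 0.81, 0.79, 0.83])` on five made-up accuracy values returns the mean together with the lower and upper bounds of the interval.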

Step 4 - Choose Baselines and Controls

Identify:

primary baseline
strongest prior method or current SOTA, if relevant
simple baseline
ablation baseline
oracle or upper bound, if useful
controlled variables
nuisance variables

If no baseline exists, make the first experiment a baseline-establishment experiment.

Step 5 - Choose Metrics and Logging

Read references/metrics-and-controls.md.

For each metric, specify:

definition
direction
aggregation
split
variance reporting
failure interpretation
why it answers the question

Define required logging:

command
config path
git commit
dataset version
seed
hyperparameters
hardware/runtime
metrics
artifacts: tables, figures, checkpoints, logs
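The logging fields above could be captured at the start of each run with a small helper. This is a sketch only: the record layout, the function names, and the fallback to "unknown" when git is unavailable are assumptions, not something the skill prescribes.

```python
import platform
import subprocess
import sys
import time


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(config_path, dataset_version, seed, hyperparameters):
    """Collect the required logging fields into one JSON-serializable dict."""
    return {
        "command": " ".join(sys.argv),
        "config_path": config_path,
        "git_commit": git_commit(),
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparameters": hyperparameters,
        "hardware": platform.platform(),
        "python": platform.python_version(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "metrics": {},    # filled in as the run reports results
        "artifacts": [],  # paths to tables, figures, checkpoints, logs
    }
```

Writing the record with `json.dump` next to the run's outputs keeps the information needed for reproduction in one place.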

Step 6 - Build Run Matrix

Read references/ablation-matrix.md when there is more than one run.

Create a run table with:

run ID
changed variable
fixed controls
dataset/split
metric
seed/repeats
expected result
status
output path

Split experiments if a run changes more than one conceptual variable.
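A run table that changes one variable at a time can be generated from a baseline configuration. The following sketch assumes configurations are flat dictionaries; the function names, run ID format, and row fields are illustrative and only loosely mirror the columns listed above.

```python
def build_run_matrix(baseline, variations, seeds):
    """Build a one-variable-at-a-time run table around a baseline config."""
    runs = [{
        "run_id": "R000",
        "changed_variable": None,
        "config": dict(baseline),
        "fixed_controls": sorted(baseline),
        "seeds": list(seeds),
        "status": "planned",
    }]
    for name, values in variations.items():
        if name not in baseline:
            raise KeyError(f"{name!r} is not defined in the baseline config")
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline run
            runs.append({
                "run_id": f"R{len(runs):03d}",
                "changed_variable": name,
                "config": {**baseline, name: value},
                "fixed_controls": sorted(k for k in baseline if k != name),
                "seeds": list(seeds),
                "status": "planned",
            })
    return runs


def changed_keys(baseline, config):
    """Keys whose values differ from the baseline."""
    return [key for key in baseline if config.get(key) != baseline[key]]
```

A run whose `changed_keys` list has more than one entry is a candidate for splitting, unless the design is explicitly factorial.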

Step 7 - Define Stop Conditions and Next Decisions

Write:

what result is sufficient to support the claim
what result falsifies or weakens the claim
what result triggers another ablation
what result means stop and write/report
compute budget ceiling
deadline constraints
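Stop conditions are easier to honor when they are written down as an explicit rule before any run. The sketch below is one possible encoding, assuming a higher-is-better metric, a confidence interval on the improvement over the baseline, and a GPU-hour budget; all names and thresholds are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    SUPPORTS = "sufficient: stop and write/report"
    FALSIFIES = "claim falsified or weakened"
    ABLATE = "inconclusive: run another ablation or more seeds"
    OVER_BUDGET = "inconclusive: compute budget ceiling reached, stop"


def decide(ci_low, ci_high, min_effect, gpu_hours_used, gpu_hours_budget):
    """Map an observed improvement interval to a pre-registered decision."""
    if ci_low >= min_effect:
        return Decision.SUPPORTS   # whole interval clears the threshold
    if ci_high < min_effect:
        return Decision.FALSIFIES  # whole interval falls short
    if gpu_hours_used >= gpu_hours_budget:
        return Decision.OVER_BUDGET
    return Decision.ABLATE
```

The budget check comes last on purpose: a clear result, positive or negative, ends the experiment regardless of how much compute is left.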

Step 8 - Reviewer Risk Check

Before finalizing, ask:

Would a reviewer complain that the baseline is weak?
Is the comparison fair?
Are seeds/repeats enough?
Does the experiment isolate the claimed mechanism?
Are metrics aligned with the claim?
Is there a confounder that could explain the result?
Would a negative result still teach something?

If the answer exposes a major weakness, update the design before execution.

Step 9 - Write the Experiment Plan

Use references/report-template.md.

If saving to a project and no path is given, use:

docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md

If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:

docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md

The final plan should be runnable by run-experiment and later reportable by experiment-report-writer.
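The default file locations above follow a fixed pattern, so the path can be derived rather than typed by hand. This is a minimal sketch; the function name, the `in_code_worktree` flag, and the way the short name is turned into a lowercase hyphenated slug are assumptions.

```python
import re
from datetime import date
from pathlib import Path


def plan_path(short_name, in_code_worktree=False, today=None):
    """Build the default experiment-plan path for a given short name."""
    # Assumption: short names are normalized to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain letters or digits")
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    folder = "docs/reports" if in_code_worktree else "docs/experiments"
    return Path(folder) / f"experiment_plan_{day}_{slug}.md"
```

Passing `today` explicitly keeps the result reproducible in scripts and tests; left unset, the current date is used.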

Step 10 - Write Back to Project Memory

If the project uses research-project-memory, update:

memory/evidence-board.md: planned EVD-### items and EXP-### experiment families
memory/claim-board.md: linked claims, marking unsupported or planned claims honestly
memory/risk-board.md: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
memory/action-board.md: runnable next actions, including which experiment to launch first
relevant worktree .agent/worktree-status.md: experiment purpose and exit condition if a branch/worktree is involved

Use planned status for experiments that have not run. Do not record expected outcomes as observed evidence.

Final Sanity Check

Before finalizing:

claim and hypothesis are explicit
baseline is defined
independent variable is isolated
controls and nuisance variables are listed
metrics are tied to the question
run matrix is concrete
logging requirements are sufficient for reproduction
stop condition and decision rule are explicit
reviewer risks are stated
project memory is updated when the repo has memory/