☁️ Multi-cloud ML execution with SkyPilot

skypilot-multi-cloud-orchestration

A Skill for running ML training and batch jobs across multiple clouds with automatic cost optimization, including the use of spot instances.

📜 Original English description (reference)

Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers.

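⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, extracts it, and places it in the skills folder automatically.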

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o skypilot-multi-cloud-orchestration.zip https://jpskill.com/download/90.zip && unzip -o skypilot-multi-cloud-orchestration.zip && rm skypilot-multi-cloud-orchestration.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/90.zip -OutFile "$d\skypilot-multi-cloud-orchestration.zip"; Expand-Archive "$d\skypilot-multi-cloud-orchestration.zip" -DestinationPath $d -Force; ri "$d\skypilot-multi-cloud-orchestration.zip"

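When it finishes, restart Claude Code. The Skill then activates automatically whenever you make a related request, for example "Run this training job on spot instances with SkyPilot".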

💾 Manual download (if you would rather not use the command line)

1. Download skypilot-multi-cloud-orchestration.zip using the download button on this page
2. Double-click the ZIP file to extract it; this creates a skypilot-multi-cloud-orchestration folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The description on this page explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

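📦 How to install (3 steps)

1. Use the download button on this page to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- · macOS / Linux: ~/.claude/skills/
- · Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.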

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 3

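💬 Just ask like this: sample prompts

› Using multi-cloud ML execution with SkyPilot, show me a minimal sample configuration
› Explain the main uses and caveats of multi-cloud ML execution with SkyPilot
› Show me how to integrate multi-cloud ML execution with SkyPilot into an existing project

Paste any of these into Claude Code and the Skill activates automatically.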

📜 Original SKILL.md (the text Claude reads)

SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.

When to use SkyPilot

Use SkyPilot when you:

Run ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
Need cost optimization with automatic cloud/region selection
Run long jobs on spot instances with auto-recovery
Manage distributed multi-node training
Want a unified interface for 20+ cloud providers
Need to avoid vendor lock-in

Key features:

Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, and others (20+ providers)
Cost optimization: Automatic cheapest cloud/region selection
Spot instances: 3-6x cost savings compared with on-demand, with automatic recovery from preemptions
Distributed training: Multi-node jobs with gang scheduling
Managed jobs: Auto-recovery, checkpointing, fault tolerance
Sky Serve: Model serving with autoscaling

Use alternatives instead:

Modal: For simpler serverless GPU with Python-native API
RunPod: For single-cloud persistent pods
Kubernetes: For existing K8s infrastructure
Ray: For pure Ray-based orchestration

Quick start

Installation

pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check

Hello World

Create hello.yaml:

resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"

Launch:

sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello

Core concepts

Task YAML structure

# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws              # Optional: auto-select if omitted
  region: us-west-2       # Optional: auto-select if omitted
  accelerators: A100:4    # GPU type and count
  cpus: 8+                # Minimum CPUs
  memory: 32+             # Minimum memory (GB)
  use_spot: true          # Use spot instances
  disk_size: 256          # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py

Key commands

Command	Purpose
`sky launch`	Launch cluster and run task
`sky exec`	Run task on existing cluster
`sky status`	Show cluster status
`sky stop`	Stop cluster (preserve state)
`sky down`	Terminate cluster
`sky logs`	View task logs
`sky queue`	Show job queue
`sky jobs launch`	Launch managed job
`sky serve up`	Deploy serving endpoint

GPU configuration

Available accelerators

# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4         # AWS/GCP
accelerators: TPU-v4-8       # GCP TPUs

GPU fallbacks

resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure

Spot instances

resources:
  accelerators: A100:8
  use_spot: true
  job_recovery: FAILOVER  # Recovery strategy on preemption; applies to managed jobs (older releases named this spot_recovery)

Cluster management

Launch and execute

# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster

Autostop

resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop

# Set autostop via CLI
sky autostop mycluster -i 30 --down

Cluster status

# All clusters
sky status

# Detailed view
sky status -a

Distributed training

Multi-node setup

resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

Environment variables

Variable	Description
`SKYPILOT_NODE_RANK`	Node index (0 to num_nodes-1)
`SKYPILOT_NODE_IPS`	Newline-separated IP addresses
`SKYPILOT_NUM_NODES`	Total number of nodes
`SKYPILOT_NUM_GPUS_PER_NODE`	GPUs per node
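
The same variables can also be read from inside train.py instead of being passed on the torchrun command line. The following is a minimal sketch, assuming the variables are set as described in the table above; the `Topology` dataclass and the `read_topology` helper are hypothetical names introduced for illustration and are not part of SkyPilot.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Topology:
    """Cluster layout derived from the SKYPILOT_* variables (hypothetical type)."""
    node_rank: int
    num_nodes: int
    gpus_per_node: int
    master_addr: str

    @property
    def world_size(self) -> int:
        # Total number of GPU workers across all nodes.
        return self.num_nodes * self.gpus_per_node

    @property
    def is_head(self) -> bool:
        # Rank 0 is treated as the head node.
        return self.node_rank == 0


def read_topology(env: Mapping[str, str] = os.environ) -> Topology:
    """Build a Topology from an environment mapping (hypothetical helper)."""
    # SKYPILOT_NODE_IPS is newline-separated; like `head -n1` in the
    # torchrun example, the first entry is used as the master address.
    ips = [ip.strip() for ip in env["SKYPILOT_NODE_IPS"].splitlines() if ip.strip()]
    return Topology(
        node_rank=int(env["SKYPILOT_NODE_RANK"]),
        num_nodes=int(env["SKYPILOT_NUM_NODES"]),
        gpus_per_node=int(env["SKYPILOT_NUM_GPUS_PER_NODE"]),
        master_addr=ips[0],
    )


# Example with made-up addresses for a 4-node job with 8 GPUs per node.
example_env = {
    "SKYPILOT_NODE_RANK": "2",
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
    "SKYPILOT_NUM_NODES": "4",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
}
topo = read_topology(example_env)
print(topo.world_size, topo.master_addr, topo.is_head)
```

Passing the environment as a mapping keeps the helper easy to test locally, where the SKYPILOT_* variables are not set.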

Head-node-only execution

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi

Managed jobs

Spot recovery

# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml

Checkpointing

name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
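
What `--checkpoint-dir` and `--resume-from-latest` do is up to train.py; the YAML above only mounts the bucket at /checkpoints. A minimal sketch of the resume logic follows, assuming checkpoints are written as `step_<N>.ckpt` files. The naming scheme and the helpers `save_checkpoint`, `find_latest_checkpoint`, and `starting_step` are hypothetical and shown only to illustrate the idea.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical naming scheme: step_<N>.ckpt
CKPT_PATTERN = re.compile(r"^step_(\d+)\.ckpt$")


def save_checkpoint(ckpt_dir: Path, step: int, payload: bytes) -> Path:
    """Write one checkpoint file for the given training step."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"step_{step}.ckpt"
    path.write_bytes(payload)
    return path


def find_latest_checkpoint(ckpt_dir: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) of the highest-numbered checkpoint, or None."""
    if not ckpt_dir.is_dir():
        return None
    found = []
    for entry in ckpt_dir.iterdir():
        match = CKPT_PATTERN.match(entry.name)
        if match:
            found.append((int(match.group(1)), entry))
    # Compare by step number, not by file name, so step_1000 beats step_900.
    return max(found, key=lambda item: item[0]) if found else None


def starting_step(ckpt_dir: Path) -> int:
    """Step to start from: 0 on a fresh run, otherwise latest step + 1."""
    latest = find_latest_checkpoint(ckpt_dir)
    return 0 if latest is None else latest[0] + 1


with tempfile.TemporaryDirectory() as tmp:
    ckpts = Path(tmp) / "checkpoints"
    first_run_start = starting_step(ckpts)   # nothing saved yet
    save_checkpoint(ckpts, 900, b"weights-a")
    save_checkpoint(ckpts, 1000, b"weights-b")
    resumed_start = starting_step(ckpts)     # as if restarted after a preemption
    print(first_run_start, resumed_start)
```

Because a recovered job starts the `run` section again from the top, a training script structured this way picks up from the last saved step rather than from zero.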

Job management

# List jobs
sky jobs queue

# View logs
sky jobs logs my-job

# Cancel job
sky jobs cancel my-job

File mounts and storage

Local file sync

workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc

Cloud storage

file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT  # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY  # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED

Storage modes

Mode	Description	Best For
`MOUNT`	Stream from cloud	Large datasets, read-heavy
`COPY`	Pre-fetch to disk	Small files, random access
`MOUNT_CACHED`	Cache with async upload	Checkpoints, outputs

Sky Serve (Model Serving)

Basic service

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  ports: 8000             # Port the replicas listen on; must match --port below
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status --endpoint my-service

Autoscaling policies

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
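
To reason about what `target_qps_per_replica` implies for capacity, a simplified model helps: the desired replica count is roughly the observed QPS divided by the target, rounded up and clamped to the range from `min_replicas` to `max_replicas`. The sketch below is only that arithmetic. It is an assumption about the general shape of such a policy, not SkyPilot's autoscaler code, and it ignores `upscale_delay_seconds` and `downscale_delay_seconds`; `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(
    observed_qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified QPS-based sizing: ceil(qps / target), clamped to the bounds."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(observed_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Using the values from the policy above (target 2.0, bounds 1..10).
idle = desired_replicas(0.0, 2.0, 1, 10)      # floor at min_replicas
moderate = desired_replicas(5.0, 2.0, 1, 10)  # ceil(2.5)
burst = desired_replicas(100.0, 2.0, 1, 10)   # capped at max_replicas
print(idle, moderate, burst)
```

With these settings the service cannot absorb more than about 20 QPS at the target load per replica, which is worth checking against expected traffic before choosing `max_replicas`.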

Cost optimization

Automatic cloud selection

# SkyPilot finds cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest

# Show optimizer decision
sky launch task.yaml --dryrun
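
Conceptually, leaving the cloud unspecified asks for the cheapest candidate that can actually satisfy the request. The sketch below illustrates that idea only: it is not SkyPilot's optimizer, the `Offer` type and `cheapest_offer` helper are hypothetical, and the prices are placeholder numbers rather than real quotes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Offer:
    """One candidate placement (hypothetical type, placeholder prices)."""
    cloud: str
    region: str
    hourly_usd: float
    available: bool


def cheapest_offer(
    offers: Iterable[Offer],
    allowed_clouds: Optional[Set[str]] = None,
) -> Optional[Offer]:
    """Pick the lowest-priced available offer, optionally restricted by cloud."""
    candidates = [
        o for o in offers
        if o.available and (allowed_clouds is None or o.cloud in allowed_clouds)
    ]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# Placeholder numbers for illustration only.
offers = [
    Offer("gcp", "us-central1", 30.0, True),
    Offer("aws", "us-east-1", 32.0, True),
    Offer("azure", "eastus", 28.0, False),  # cheapest, but no capacity
]
best = cheapest_offer(offers)
best_aws_only = cheapest_offer(offers, allowed_clouds={"aws"})
print(best.cloud, best_aws_only.cloud)
```

Restricting the allowed clouds narrows the candidate set in the same way that pinning `cloud:` or listing entries under `any_of` does in the task YAML.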

Cloud preferences

resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure

Environment variables

envs:
  HF_TOKEN: null        # No default; pass the local value at launch with --env HF_TOKEN
  WANDB_API_KEY: null   # Likewise: --env WANDB_API_KEY

# Or declare them as secrets
secrets:
  HF_TOKEN: null
  WANDB_API_KEY: null

Common workflows

Workflow 1: Fine-tuning with checkpoints

name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume

Workflow 2: Hyperparameter sweep

# sweep.yaml
name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID

# Launch multiple jobs (-y skips the confirmation prompt, -d detaches instead of streaming logs)
for i in {1..10}; do
  sky jobs launch -y -d sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
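
The shell loop draws each learning rate inline, so a sweep cannot be reproduced later. The sketch below generates the same kind of log-uniform samples from a fixed seed and prints the launch commands instead of running them. The seed, the formatting of the learning rate, and the helpers `sample_learning_rates` and `build_commands` are assumptions made for illustration; only `sky jobs launch` and `--env` are taken from the loop above.

```python
import random
import shlex
from typing import List


def sample_learning_rates(
    n: int, low_exp: float = -5.0, high_exp: float = -3.0, seed: int = 0
) -> List[float]:
    """Draw n log-uniform values in [10**low_exp, 10**high_exp] from a fixed seed."""
    rng = random.Random(seed)
    return [10 ** rng.uniform(low_exp, high_exp) for _ in range(n)]


def build_commands(rates: List[float], yaml_path: str = "sweep.yaml") -> List[str]:
    """Format one launch command per sampled learning rate (not executed here)."""
    commands = []
    for run_id, lr in enumerate(rates, start=1):
        args = [
            "sky", "jobs", "launch", yaml_path,
            "--env", f"RUN_ID={run_id}",
            "--env", f"LEARNING_RATE={lr:.6g}",
        ]
        commands.append(shlex.join(args))
    return commands


rates = sample_learning_rates(10)
commands = build_commands(rates)
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review the sweep, save it alongside the results, and then pipe it to a shell once it looks right.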

Debugging

# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs
sky jobs logs my-job

Common issues

Issue	Solution
Quota exceeded	Request a quota increase or try a different region
Spot preemption	Use `sky jobs launch` for auto-recovery
Slow file sync	Use `MOUNT_CACHED` mode for outputs
GPU not available	Use `any_of` for fallback clouds

References

Advanced Usage (references/advanced-usage.md) - Multi-cloud, optimization, production patterns
Troubleshooting (references/troubleshooting.md) - Common issues and solutions

Resources

Documentation: https://docs.skypilot.co
GitHub: https://github.com/skypilot-org/skypilot
Slack: https://slack.skypilot.co
Examples: https://github.com/skypilot-org/skypilot/tree/master/examples

Bundled files

※ Files included in the ZIP. In addition to `SKILL.md` itself, reference material, samples, and scripts may be included.

📄 SKILL.md (9,641 bytes)
📎 references/advanced-usage.md (7,469 bytes)
📎 references/troubleshooting.md (10,493 bytes)