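Audio Transcription

audio-transcribe

A Skill that converts audio files to text and can add timestamps and speaker labels, which speeds up writing meeting minutes, transcribing recordings, and creating video subtitles.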

📜 Original English description (for reference)

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download → extract → place in the skills folder, all automatic.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o audio-transcribe.zip https://jpskill.com/download/10476.zip && unzip -o audio-transcribe.zip && rm audio-transcribe.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10476.zip -OutFile "$d\audio-transcribe.zip"; Expand-Archive "$d\audio-transcribe.zip" -DestinationPath $d -Force; ri "$d\audio-transcribe.zip"

When it finishes, restart Claude Code. The Skill then triggers automatically when you ask in plain language, for example "transcribe this audio file".

💾 Manual download (if you would rather not use the command line)

1. Download audio-transcribe.zip from https://jpskill.com/download/10476.zip
2. Double-click the ZIP file to extract it → this creates an audio-transcribe folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The description below explains what this Skill does for you. It triggers automatically when you ask Claude for something in this area.

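📦 Installing from the .skill file (3 steps)

1. Download the .skill file from this page
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Put the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.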
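To confirm that the folder landed in the right place, a small check like the one below may help. It is a minimal sketch: the helper name `skill_installed` is hypothetical, and it assumes the extracted `audio-transcribe` folder contains a top-level `SKILL.md`, which this page implies but does not state outright.

```python
from pathlib import Path
from typing import Optional


def skill_installed(name: str = "audio-transcribe", home: Optional[Path] = None) -> bool:
    """Return True if <home>/.claude/skills/<name>/SKILL.md exists.

    ASSUMPTION: the skill folder holds a top-level SKILL.md file.
    """
    home = Path.home() if home is None else home
    return (home / ".claude" / "skills" / name / "SKILL.md").is_file()


if __name__ == "__main__":
    print("installed" if skill_installed() else "not found")
```

`Path.home()` resolves to `~` on macOS/Linux and to `%USERPROFILE%` on Windows, so the same check covers both locations listed above.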

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English text Claude reads)

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

agent-media audio transcribe --in <path> [options]

Inputs

Option	Required	Description
`--in`	Yes	Input audio file path or URL (supports mp3, wav, m4a, ogg)
`--diarize`	No	Enable speaker identification
`--language`	No	Language code (auto-detected if not provided)
`--speakers`	No	Number of speakers hint for diarization
`--out`	No	Output path, filename or directory (default: ./)
`--provider`	No	Provider to use (local, fal, replicate)
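When the command is driven from a script, the options in the table can be assembled into an argument list. The sketch below is illustrative only: `build_transcribe_args` is a hypothetical helper, not part of agent-media, and it uses only the flags listed above. It also rejects `--diarize` with the `local` provider, since the Providers section says that combination is unsupported.

```python
from typing import List, Optional

# Provider names taken from the options table.
PROVIDERS = ("local", "fal", "replicate")


def build_transcribe_args(
    source: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    if provider is not None and provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if diarize and provider == "local":
        # The local provider does not support diarization.
        raise ValueError("diarization requires the fal or replicate provider")

    args = ["agent-media", "audio", "transcribe", "--in", source]
    if diarize:
        args.append("--diarize")
    if language:
        args += ["--language", language]
    if speakers is not None:
        args += ["--speakers", str(speakers)]
    if out:
        args += ["--out", out]
    if provider:
        args += ["--provider", provider]
    return args


if __name__ == "__main__":
    print(" ".join(build_transcribe_args("podcast.mp3", diarize=True, language="en", speakers=3)))
```

Returning a list rather than a single string keeps file names with spaces intact when the list is passed to `subprocess.run`.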

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
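Because each segment carries `start` and `end` times in seconds, the JSON above maps naturally onto subtitle cues. The following is a minimal sketch of converting it to SRT text. The function names are hypothetical, and it assumes the `speaker` key may be missing when diarization was not requested, which the source does not confirm.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(result: Dict) -> str:
    """Render result["transcription"]["segments"] as numbered SRT cues."""
    cues: List[str] = []
    for index, seg in enumerate(result["transcription"]["segments"], start=1):
        # ASSUMPTION: "speaker" may be absent; prefix the text only if present.
        speaker = seg.get("speaker")
        text = f"{speaker}: {seg['text']}" if speaker else seg["text"]
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Loading the saved file with `json.load` and passing the result to `segments_to_srt` yields text that can be written to a `.srt` file.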

Examples

Basic transcription (auto-detect language):

agent-media audio transcribe --in interview.mp3

Transcription with speaker identification:

agent-media audio transcribe --in meeting.wav --diarize

Transcription with specific language and speaker count:

agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use specific provider:

agent-media audio transcribe --in audio.wav --provider replicate
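For meeting minutes, diarized output is often easier to read once consecutive segments from the same speaker are merged into a single turn. The sketch below shows one way to do that with the segment fields from the Output section. The helper names and the `UNKNOWN` fallback label are assumptions made for illustration.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        # ASSUMPTION: fall back to a placeholder label if "speaker" is missing.
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": seg["text"]}
            )
    return turns


def format_minutes(turns: List[Dict]) -> str:
    """Format turns as one line each: [start-end] SPEAKER: text."""
    return "\n".join(
        f"[{t['start']:.1f}s-{t['end']:.1f}s] {t['speaker']}: {t['text']}" for t in turns
    )
```

The input list is left unmodified, so the original per-segment timing stays available for subtitles or search.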

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
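The two steps can be chained in a script so that the extracted file name does not have to be copied by hand. The sketch below rests on two assumptions the source does not confirm: that both commands print their JSON result to stdout, and that `audio extract` reports an `output_path` field the way `audio transcribe` does. `run_cli` and `transcribe_video` are hypothetical helper names.

```python
import json
import subprocess
from typing import Callable, Dict, List


def run_cli(argv: List[str]) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def transcribe_video(video: str, run: Callable[[List[str]], str] = run_cli) -> Dict:
    """Extract audio from a video, then transcribe the extracted file."""
    # Step 1: extract the audio track.
    # ASSUMPTION: the command prints JSON with "ok" and "output_path" on stdout.
    extracted = json.loads(
        run(["agent-media", "audio", "extract", "--in", video, "--format", "mp3"])
    )
    if not extracted.get("ok"):
        raise RuntimeError("audio extraction failed")

    # Step 2: transcribe the file produced by step 1.
    result = json.loads(
        run(["agent-media", "audio", "transcribe", "--in", extracted["output_path"]])
    )
    if not result.get("ok"):
        raise RuntimeError("transcription failed")
    return result
```

Passing the runner as a parameter lets the flow be exercised with a stub before it is pointed at the real CLI.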

Providers

local

Runs locally on CPU using Transformers.js, no API key required.

Uses the Moonshine model (5x faster than Whisper)
Models are downloaded on first use (~100MB)
Does NOT support diarization; use fal or replicate for speaker identification
You may see a mutex lock failed error. Ignore it: the output is correct if "ok": true

agent-media audio transcribe --in audio.mp3 --provider local
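Since the local provider may emit a mutex lock failed message alongside a valid result, a script should judge success by the `"ok"` field rather than by the presence of error text. The sketch below scans captured output for the first JSON object that has an `ok` key. It assumes the JSON result is printed as text and may be mixed with other lines, which the source does not spell out. The function names are hypothetical.

```python
import json
from typing import Dict


def parse_cli_output(output: str) -> Dict:
    """Return the first JSON object containing an "ok" key in mixed CLI output."""
    decoder = json.JSONDecoder()
    pos = output.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores the rest.
            obj, _ = decoder.raw_decode(output, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "ok" in obj:
            return obj
        pos = output.find("{", pos + 1)
    raise ValueError("no JSON result found in output")


def is_success(output: str) -> bool:
    """True only when the parsed result has "ok": true."""
    return parse_cli_output(output).get("ok") is True
```

With this in place, stray error lines before or after the JSON do not affect the outcome; only a missing or false `ok` does.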

fal

Requires FAL_API_KEY
Uses wizper model for fast transcription (2x faster) when diarization is disabled
Uses whisper model when diarization is enabled (native support)

replicate

Requires REPLICATE_API_TOKEN
Uses whisper-diarization model with Whisper Large V3 Turbo
Native diarization support with word-level timestamps
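The provider notes above can be turned into a small selection rule: use a hosted provider when its credential is set, and fall back to `local` only when diarization is not needed. The sketch below is one possible policy, not the CLI's own default behavior, which the source does not describe. The preference for `fal` over `replicate` when both keys are present is an arbitrary choice, and `choose_provider` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def choose_provider(diarize: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a --provider value from the available credentials (illustrative policy)."""
    env = os.environ if env is None else env
    if env.get("FAL_API_KEY"):
        return "fal"  # arbitrary preference when both keys are set
    if env.get("REPLICATE_API_TOKEN"):
        return "replicate"
    if diarize:
        # The local provider does not support diarization.
        raise RuntimeError(
            "diarization needs FAL_API_KEY or REPLICATE_API_TOKEN to be set"
        )
    return "local"


if __name__ == "__main__":
    print(choose_provider(diarize=False))
```

The returned name can be passed directly as the value of `--provider`.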