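💼 PDF Text Extractor
A Skill that extracts text from PDFs with OCR support, useful for digitizing documents, processing invoices, and analyzing content.
📜 Original English description (for reference)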
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
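🇯🇵 Notes for Japanese creators
A Skill that extracts text from PDFs with OCR support, useful for digitizing documents, processing invoices, and analyzing content.
※ This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of the Skill's actual behavior.
Copy the matching command below and paste it into Terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.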
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o pdf-text-extractor.zip https://jpskill.com/download/5193.zip && unzip -o pdf-text-extractor.zip && rm pdf-text-extractor.zip
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/5193.zip -OutFile "$d\pdf-text-extractor.zip"; Expand-Archive "$d\pdf-text-extractor.zip" -DestinationPath $d -Force; ri "$d\pdf-text-extractor.zip"
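When it finishes, restart Claude Code. After that, just ask in plain language, for example "Extract the text from this PDF", and the Skill activates automatically.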
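💾 Manual download (for those who find the commands difficult)
- 1. Press the blue button below to download pdf-text-extractor.zip
- 2. Double-click the ZIP file to extract it; this creates a pdf-text-extractor folder
- 3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
- 4. Restart Claude Code
⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.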
🎯 What this Skill can do
The description below explains what this Skill does for you. When you ask Claude for something in this area, the Skill activates automatically.
📦 How to install (3 steps)
- 1. Press the "Download" button above to get the .skill file
- 2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
- 3. Put the extracted folder in .claude/skills/ under your home folder
  - macOS / Linux: ~/.claude/skills/
  - Windows: %USERPROFILE%\.claude\skills\
Restart Claude Code and you are done. Related requests invoke the Skill automatically, even if you do not say "use this Skill…".
- Last updated: 2026-05-17
- Retrieved: 2026-05-17
- Bundled files: 2
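💬 Just say this: sample prompts
- Use Pdf Text Extractor to analyze my business and propose three improvements
- Use Pdf Text Extractor to prepare materials for next week's meeting
- Use Pdf Text Extractor to organize the current issues and turn them into an action plan
Paste any of these into Claude Code and the Skill activates automatically.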
📖 Original SKILL.md as read by Claude
This is the original text (English or Chinese) that the AI (Claude) reads. Japanese translations are being added over time.
PDF-Text-Extractor - Extract Text from PDFs
Vernox Utility Skill - Perfect for document digitization.
Overview
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
Features
✅ Text Extraction
- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based)
✅ OCR Support
- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fallback to text extraction when possible
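The text-first, OCR-second behavior listed above can be sketched as a small decision function. This is a minimal sketch, not the skill's actual implementation: `chooseMethod` and the `minChars` threshold are hypothetical names introduced for illustration, and only the `ocr` option and the 'text'/'ocr' method values come from this document.

```javascript
// Hypothetical helper: decide which extraction method applies to a page.
// `minChars` is an assumed threshold, not a documented option of the skill.
function chooseMethod(textLayer, { ocr = false, minChars = 20 } = {}) {
  const usable = (textLayer || '').trim().length >= minChars;
  if (usable) return 'text'; // an embedded text layer exists, so OCR is skipped
  return ocr ? 'ocr' : 'text'; // likely a scan: use OCR only when it is enabled
}
```

With OCR disabled, a scanned page still reports 'text' and would simply yield little or no content, which matches the "Extraction Returns Empty" symptom described under Troubleshooting.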
✅ Batch Processing
- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic
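The retry logic mentioned above is not specified further in this document. One common shape for it is a wrapper that re-runs a failing task a fixed number of times; the sketch below assumes that shape, and `withRetry`, `retries`, and `delayMs` are hypothetical names.

```javascript
// Hypothetical retry wrapper: runs `task` up to retries + 1 times.
// `task` receives the zero-based attempt number and may be sync or async.
async function withRetry(task, { retries = 2, delayMs = 0 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Optional pause before the next attempt (skipped after the last one).
      if (attempt < retries && delayMs > 0) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // every attempt failed: surface the most recent error
}
```

A caller would wrap a single-file extraction in it, for example `withRetry(() => extractText({ pdfPath: file }))`, so that a transient failure does not immediately count as a failed file.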
✅ Output Options
- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)
✅ Utility Features
- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)
Installation
clawhub install pdf-text-extractor
Quick Start
Extract Text from PDF
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});
console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
Batch Extract Multiple PDFs
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});
// extractBatch returns a summary object, not a bare array
console.log(`Extracted ${batch.successCount} PDFs`);
Extract with OCR
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
// OCR will be used (scanned document detected)
Tool Functions
extractText
Extract text content from a single PDF file.
Parameters:
- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned docs
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): 'text' or 'ocr' (extraction method)
extractBatch
Extract text from multiple PDF files at once.
Parameters:
- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
Returns:
- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Successfully extracted
- failureCount (number): Failed extractions
- errors (array): Error details for failures
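To make the return shape above concrete, here is a minimal sketch of how such a summary could be assembled from per-file outcomes. It is not the skill's implementation: `runBatch` and `extractOne` are hypothetical names, `extractOne` stands in for any single-file extractor, and keeping a `null` slot in `results` for failed files is an assumption.

```javascript
// Sketch only: `extractOne(file)` is a stand-in that resolves to an object
// with a numeric `pages` field, or rejects when extraction fails.
async function runBatch(pdfFiles, extractOne) {
  // allSettled waits for every file, so one bad PDF cannot abort the batch.
  const settled = await Promise.allSettled(
    pdfFiles.map(async (file) => extractOne(file))
  );

  const summary = {
    results: [],
    totalPages: 0,
    successCount: 0,
    failureCount: 0,
    errors: []
  };

  settled.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      summary.results.push(outcome.value);
      summary.totalPages += outcome.value.pages;
      summary.successCount += 1;
    } else {
      // Assumption: a failed file keeps its position in `results` as null.
      summary.results.push(null);
      summary.failureCount += 1;
      const reason = outcome.reason;
      summary.errors.push({
        file: pdfFiles[index],
        message: reason instanceof Error ? reason.message : String(reason)
      });
    }
  });

  return summary;
}
```

Collecting failures into `errors` instead of throwing matches the "Skip to next file in batch" behavior described under Error Handling.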
countWords
Count words in extracted text.
Parameters:
- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
Returns:
- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page
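The effect of `minWordLength` and `excludeNumbers` is easiest to see on a short string. The sketch below is an illustrative re-implementation under simple assumptions (whitespace tokenization, digits-only tokens treated as numbers); `countWordsSketch` is a hypothetical name, and the per-page fields are omitted because this document does not say how page boundaries are represented in the text.

```javascript
// Illustrative word counter; not the skill's own countWords.
function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false } = {}) {
  const tokens = text.split(/\s+/).filter(Boolean); // split on any whitespace
  const words = tokens.filter((token) => {
    if (token.length < minWordLength) return false;
    // Assumption: a "number" is digits with optional . or , separators.
    if (excludeNumbers && /^\d+([.,]\d+)*$/.test(token)) return false;
    return true;
  });
  return { wordCount: words.length, charCount: text.length };
}

const sample = 'Invoice 12345 is due on 2026';
console.log(countWordsSketch(sample).wordCount); // 4
console.log(countWordsSketch(sample, { excludeNumbers: true }).wordCount); // 2
```

Note that with the documented default of 3, short words such as "is" and "on" are not counted, so totals will be lower than a plain whitespace count.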
detectLanguage
Detect the language of extracted text.
Parameters:
- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
Returns:
- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)
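This document does not describe how detection works internally. As one possible illustration of a language code paired with a 0-100 confidence score, the sketch below counts common stopwords for the four listed languages. Everything in it (`detectLanguageSketch`, the tiny word lists, the scoring rule) is an assumption made for the example; real detectors use far larger models.

```javascript
// Toy stopword lists for the four language codes used in this document.
const STOPWORDS = {
  eng: ['the', 'and', 'of', 'to', 'is'],
  spa: ['el', 'la', 'de', 'que', 'y'],
  fra: ['le', 'la', 'et', 'les', 'des'],
  deu: ['der', 'die', 'und', 'das', 'ist']
};

// Hypothetical detector: picks the language with the most stopword hits.
function detectLanguageSketch(text, minConfidence = 0) {
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  let best = { language: null, hits: 0 };
  let totalHits = 0;
  for (const [language, words] of Object.entries(STOPWORDS)) {
    const hits = tokens.filter((token) => words.includes(token)).length;
    totalHits += hits;
    if (hits > best.hits) best = { language, hits };
  }
  // Confidence: the winner's share of all stopword hits, as a percentage.
  const confidence = totalHits === 0 ? 0 : Math.round((best.hits / totalHits) * 100);
  if (confidence < minConfidence) return { language: null, confidence };
  return { language: best.language, confidence };
}
```

Words shared between languages (such as "la" in Spanish and French) lower the confidence, which is where a `minConfidence` cut-off becomes useful.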
Use Cases
Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents
Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports
Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows
Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content
Performance
Text-Based PDFs
- Speed: ~100ms for 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10MB for typical document
OCR Processing
- Speed: ~1-3s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100MB peak during OCR
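As a rough worked example of the per-page figure above, the helper below multiplies a page count by the quoted 1-3 s range. `estimateOcrSeconds` is a hypothetical name, and the result is only a back-of-the-envelope bound, since real timing depends on hardware and scan quality.

```javascript
// Back-of-the-envelope OCR time range, using the ~1-3 s per page figure.
function estimateOcrSeconds(pages, { minPerPage = 1, maxPerPage = 3 } = {}) {
  return { min: pages * minPerPage, max: pages * maxPerPage };
}

console.log(estimateOcrSeconds(40)); // { min: 40, max: 120 }
```

By this estimate, a 40-page scan takes between 40 seconds and 2 minutes at high quality, compared with roughly 100 ms for a 10-page text-based PDF.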
Technical Details
PDF Parsing
- Uses native PDF.js library
- Extracts text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs
OCR Engine
- Tesseract.js under the hood
- Tesseract supports 100+ languages (the default configuration enables eng, spa, fra, and deu)
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy
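Confidence scoring becomes useful once it is applied as a filter. The sketch below assumes OCR output is available as a list of `{ text, confidence }` word objects with scores from 0 to 100; that shape, the function name `filterByConfidence`, and the default threshold of 60 are all assumptions for illustration, not documented behavior of the skill.

```javascript
// Hypothetical post-processing step: drop low-confidence OCR words.
function filterByConfidence(words, minConfidence = 60) {
  const kept = words.filter((word) => word.confidence >= minConfidence);
  const meanConfidence = kept.length === 0
    ? 0
    : kept.reduce((sum, word) => sum + word.confidence, 0) / kept.length;
  return {
    text: kept.map((word) => word.text).join(' '),
    meanConfidence,
    dropped: words.length - kept.length // how many words were discarded
  };
}
```

A high `dropped` count on a page is a reasonable signal to rescan at higher quality, as suggested under Error Handling.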
Dependencies
- ZERO dependencies to install separately
- Apart from the bundled libraries below, uses Node.js built-in modules only
- PDF.js included in skill
- Tesseract.js bundled
Error Handling
Invalid PDF
- Clear error message
- Suggest fix (check file format)
- Skip to next file in batch
OCR Failure
- Report confidence score
- Suggest rescan at higher quality
- Fallback to basic extraction
Memory Issues
- Stream processing for large files
- Progress reporting
- Graceful degradation
Configuration
Edit config.json:
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
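When only one or two settings need changing, it helps to see how a partial override could combine with the values shown above. The sketch below merges section by section; `DEFAULT_CONFIG` mirrors the config.json listing, while `mergeConfig` is a hypothetical helper, and whether the skill itself accepts partial config files is an assumption.

```javascript
// Defaults copied from the config.json listing in this document.
const DEFAULT_CONFIG = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 }
};

// Hypothetical helper: shallow-merge each known section with user overrides.
function mergeConfig(defaults, overrides = {}) {
  const merged = {};
  for (const section of Object.keys(defaults)) {
    merged[section] = { ...defaults[section], ...(overrides[section] || {}) };
  }
  return merged;
}

const config = mergeConfig(DEFAULT_CONFIG, { ocr: { quality: 'high' } });
console.log(config.ocr.quality); // high
console.log(config.ocr.defaultLanguage); // eng
```

Unknown top-level sections in the override are ignored, so a typo in a section name silently falls back to the defaults.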
Examples
Extract from Invoice
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."
Extract from Scanned Contract
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."
Batch Process Documents
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
Troubleshooting
OCR Not Working
- Check if PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure language matches document
- Check image quality of scan
Extraction Returns Empty
- PDF may be image-only
- OCR failed with low confidence
- Try different language setting
Slow Processing
- Large PDF takes longer
- Reduce quality for speed
- Process in smaller batches
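Processing in smaller batches amounts to splitting the file list into fixed-size groups and handling one group at a time. The sketch below shows the splitting step; `chunk` is a hypothetical helper, and the group size is up to the caller.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

console.log(chunk(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'], 2));
// [ [ 'a.pdf', 'b.pdf' ], [ 'c.pdf', 'd.pdf' ], [ 'e.pdf' ] ]
```

Each group can then be passed to `extractBatch` in turn, which limits how many documents are being processed at once.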
Tips
Best Results
- Use text-based PDFs when possible (faster, 100% accurate)
- High-quality scans for OCR (300 DPI+)
- Clean background before scanning
- Use correct language setting
Performance Optimization
- Batch processing for multiple files
- Disable OCR for text-based PDFs
- Lower OCR quality for speed when acceptable
Roadmap
- [ ] PDF/A support
- [ ] Advanced OCR pre-processing
- [ ] Table extraction from OCR
- [ ] Handwriting OCR
- [ ] PDF form field extraction
- [ ] Batch language detection
- [ ] Confidence scoring visualization
License
MIT
Extract text from PDFs. Fast, accurate, zero dependencies. 🔮
Bundled files
※ List of the files included in the ZIP. In addition to `SKILL.md` itself, it may contain reference material, samples, and scripts.
- 📄 SKILL.md (8,499 bytes)
- 📎 README.md (4,281 bytes)