jpskill.com
🛠️ Development / MCP · Community 🟡 Takes some getting used to 👤 For a wide range of users

📦 Web Scraper

web-scraper

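A Skill that automates the data collection a business needs: it efficiently extracts tables, lists, prices, and other information from websites and outputs it as CSV or JSON.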

⏱ Miscellaneous manual work: 1 day → 1 hour

📺 Watch the video first (YouTube)

▶ 【Claude Code完全入門】誰でも使える/Skills活用法/経営者こそ使うべき ↗

※ This video was chosen by the jpskill.com editorial team for reference. Its content may not exactly match the behavior of the Skill.

📜 Original description (for reference)

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

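🇯🇵 Explanation for Japanese creators

In a nutshell

A Skill that automates the data collection a business needs: it efficiently extracts tables, lists, prices, and other information from websites and outputs it as CSV or JSON.

※ This explanation was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.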

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are all automatic.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o web-scraper.zip https://jpskill.com/download/3694.zip && unzip -o web-scraper.zip && rm web-scraper.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/3694.zip -OutFile "$d\web-scraper.zip"; Expand-Archive "$d\web-scraper.zip" -DestinationPath $d -Force; ri "$d\web-scraper.zip"

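When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, for example "extract the price table from this page".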

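💾 Download manually (for those who find commands difficult)
  1. Press the blue button below to download web-scraper.zip
  2. Double-click the ZIP file to extract it; this creates a web-scraper folder
  3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety.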

🎯 What this Skill can do

Read the description below to see what this Skill does for you. It activates automatically when you ask Claude for something in this area.

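📦 How to install (3 steps)

  1. Press the "Download" button above to get the .skill file
  2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
  3. Place the extracted folder in .claude/skills/ in your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill..."; it is invoked automatically for related requests.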

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 4

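💬 Just say it like this: sample prompts

  • Show me how to use Web Scraper
  • Show me concrete examples of what Web Scraper can do
  • Walk a first-time user through Web Scraper step by step

Paste any of these into Claude Code and the Skill activates automatically.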

📜 Original SKILL.md (the English/Chinese text that Claude reads)

Web Scraper

Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export.

When to Use This Skill

  • When the user mentions "scraper" or related topics
  • When the user mentions "scraping" or related topics
  • When the user mentions "extrair dados web" or related topics
  • When the user mentions "web scraping" or related topics
  • When the user mentions "raspar dados" or related topics
  • When the user mentions "coletar dados site" or related topics

Do Not Use This Skill When

  • The task is unrelated to web scraper
  • A simpler, more specific tool can handle the request
  • The user needs general-purpose assistance without domain expertise

How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.


Capabilities

  • Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
  • Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
  • Output formats: Markdown tables (default), JSON, CSV
  • Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
  • Multi-URL: extract same structure across sources with comparison and diff
  • Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
  • Auto-escalation: WebFetch fails silently -> automatic Browser fallback
  • Data transforms: cleaning, normalization, deduplication, enrichment
  • Differential mode: detect changes between scraping runs

Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Phase 1: Clarify

Establish extraction parameters before touching any URL.

Required Parameters

| Parameter | Resolve | Default |
| --- | --- | --- |
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

Optional Parameters

| Parameter | Resolve | Default |
| --- | --- | --- |
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |

Clarification Rules

  • If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
  • If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
  • Default to Markdown table output. Mention alternatives only if relevant.
  • Accept requests in any language. Always respond in the user's language.
  • If user says "everything" or "all data", perform recon first, then present what's available and let user choose.

Discovery Mode

When user has a topic but no specific URL:

  1. Use WebSearch to find the most relevant pages
  2. Present top 3-5 URLs with descriptions
  3. Let user choose which to scrape, or scrape all
  4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract


Phase 2: Reconnaissance

Analyze the target page before extraction.

Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)

Step 2.2: Evaluate Fetch Quality

| Signal | Interpretation | Action |
| --- | --- | --- |
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
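
The signal table above can be read as a small dispatch function. The following is a minimal sketch in Python, assuming hypothetical inputs (an HTTP status code, a Content-Type header value, and the visible text of the page). The function name, the 200-character threshold, and the marker strings are illustrative assumptions, not values defined by the Skill.

```python
def evaluate_fetch(status: int, content_type: str, visible_text: str) -> str:
    """Map fetch signals to an action, loosely following the table above."""
    media_type = content_type.split(";")[0].strip().lower()
    text = visible_text.strip().lower()
    # Blocked: login wall, CAPTCHA, or a 401/403 response.
    if status in (401, 403) or "captcha" in text:
        return "Report to user"
    # A JSON or XML body suggests an API endpoint.
    if media_type in {"application/json", "application/xml", "text/xml"}:
        return "Strategy C (Bash/curl)"
    # Very little visible text or a loading marker suggests JS rendering.
    if len(text) < 200 or "loading..." in text:
        return "Strategy B (Browser)"
    return "Strategy A (WebFetch)"
```

The "content present but poorly structured" row is not covered here, because judging structure needs more than these three inputs.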

Step 2.3: Content Classification

Classify into an extraction mode:

| Mode | Indicators | Examples |
| --- | --- | --- |
| table | HTML <table>, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User specified CSS selectors or fields | Anything not matching above |

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.


Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

Decision Tree

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
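
The decision tree above can be sketched as a chain of early returns. This is an illustrative Python version; the `Recon` class and its field names are hypothetical stand-ins for whatever Phase 2 recorded.

```python
from dataclasses import dataclass


@dataclass
class Recon:
    """Hypothetical summary of Phase 2 findings (field names are assumptions)."""
    structured_data_sufficient: bool = False
    content_visible: bool = False
    needs_precise_targeting: bool = False
    js_rendering: bool = False
    api_or_download: bool = False


def choose_strategy(r: Recon) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if r.structured_data_sufficient:
        return "E"
    if r.content_visible:
        return "B" if r.needs_precise_targeting else "A"
    if r.js_rendering:
        return "B"
    if r.api_or_download:
        return "C"
    return "report access issue"
```

For example, `choose_strategy(Recon(content_visible=True))` returns `"A"`, while a page with nothing visible and no JavaScript or API signal falls through to the access report.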

Strategy A: WebFetch with AI Extraction

Best for: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)

Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.
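
One way to express the auto-escalation check in code is shown below. It is a sketch only: the function name is hypothetical, and reading "mostly empty fields" as "more than half of all values are empty" is an assumption.

```python
def should_escalate(rows: list[dict], expected_count: int) -> bool:
    """Return True when a WebFetch result looks too thin to trust."""
    # Fewer than 50% of the items that recon expected.
    if expected_count > 0 and len(rows) < 0.5 * expected_count:
        return True
    # Mostly empty fields (assumed here to mean more than half).
    values = [v for row in rows for v in row.values()]
    empty = sum(1 for v in values if v in (None, "", "N/A"))
    return bool(values) and empty > len(values) / 2
```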

Strategy B: Browser Automation

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:

  1. Get tab context: tabs_context_mcp(createIfEmpty=true) -> get tabId
  2. Navigate to URL: navigate(url=TARGET_URL, tabId=TAB)
  3. Wait for content to load: computer(action="wait", duration=3, tabId=TAB)
  4. Check for cookie/consent banners: find(query="cookie consent or accept button", tabId=TAB)
    • If found, dismiss it (prefer privacy-preserving option)
  5. Read page structure: read_page(tabId=TAB) or get_page_text(tabId=TAB)
  6. Locate target elements: find(query="[DESCRIPTION]", tabId=TAB)
  7. Extract with JavaScript for precise data via javascript_tool
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
  8. For lazy-loaded content, scroll and re-extract: computer(action="scroll", scroll_direction="down", tabId=TAB) then computer(action="wait", duration=2, tabId=TAB)

Strategy C: Bash (curl + jq)

Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.


# JSON API
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

# CSV download
curl -s "CSV_URL" -o /tmp/scraped_data.csv

# XML parsing
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"

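The XML case in Strategy C leaves the parsing step open. As one possible illustration, the sketch below converts a feed of repeated elements into a JSON array using only the standard library. The feed layout (`<item>` elements with flat child tags), the function name, and the sample data are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET


def xml_items_to_json(xml_text: str, item_tag: str = "item") -> str:
    """Convert each <item_tag> element into a dict of its child tags and text."""
    root = ET.fromstring(xml_text)
    records = [
        # Empty or whitespace-only text becomes None (JSON null).
        {child.tag: (child.text or "").strip() or None for child in item}
        for item in root.iter(item_tag)
    ]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical sample feed, for illustration only.
sample = "<feed><item><name>Widget</name><price>9.99</price></item></feed>"
print(xml_items_to_json(sample))
```

Values stay as strings here; numeric conversion is left to the Transform phase.
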
Strategy D: Hybrid

When a single strategy is insufficient, combine:

  1. WebSearch to discover relevant URLs
  2. WebFetch for initial content assessment
  3. Browser automation for JS-heavy sections
  4. Bash for post-processing (jq, python for data cleaning)

Strategy E: Structured Data Extraction

When JSON-LD, microdata, or OpenGraph is present:

  1. Use Browser javascript_tool to extract structured data:
    const scripts = document.querySelectorAll('script[type="application/ld+json"]');
    const data = Array.from(scripts).map(s => {
    try { return JSON.parse(s.textContent); } catch { return null; }
    }).filter(Boolean);
    JSON.stringify(data);
  2. This often provides cleaner, more reliable data than DOM scraping
  3. Fall back to DOM extraction only for fields not in structured data
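
When the raw HTML is already in hand (for example, fetched with curl), the same JSON-LD extraction can be approximated without a browser. The following is a sketch using Python's standard `html.parser`; the class and function names are hypothetical, and this is not how the Skill itself performs the step.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks, like the JavaScript version


def extract_jsonld(html: str) -> list:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```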

Pagination Handling

When pagination is detected and user wants multiple pages:

Page-number pagination (any strategy):

  1. Extract data from current page
  2. Identify URL pattern (e.g. ?page=N, /page/N, &offset=N)
  3. Iterate through pages up to user's max (default: 5 pages)
  4. Show progress: "Extracting page 2/5..."
  5. Concatenate all results, deduplicate if needed
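
For the `?page=N` pattern, the page URLs can be generated up front. This is a minimal sketch with the standard library; the function name and the default parameter name `page` are assumptions, and path-style patterns such as `/page/N` would need different handling.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url: str, max_pages: int = 5, param: str = "page") -> list[str]:
    """Build URLs for pages 1..max_pages by setting one query parameter."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    urls = []
    for n in range(1, max_pages + 1):
        query[param] = str(n)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls
```

Existing query parameters are preserved, so a sort or filter applied to the first page carries over to the rest.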

Infinite scroll (Browser only):

  1. Extract currently visible data
  2. Record item count
  3. Scroll down: computer(action="scroll", scroll_direction="down", tabId=TAB)
  4. Wait: computer(action="wait", duration=2, tabId=TAB)
  5. Extract newly loaded data
  6. Compare count - if no new items after 2 scrolls, stop
  7. Repeat until no new content or max iterations (default: 5)
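
The stopping rule for infinite scroll can be sketched as a loop with a "no new items" counter. In this illustrative Python version, `extract` and `scroll` are hypothetical callables standing in for the browser tools, so the logic can be followed without a browser.

```python
from typing import Callable


def collect_with_scroll(
    extract: Callable[[], list[str]],
    scroll: Callable[[], None],
    max_iterations: int = 5,
    patience: int = 2,
) -> list[str]:
    """Scroll and re-extract until nothing new appears or the limit is hit."""
    seen = list(dict.fromkeys(extract()))  # visible items, order preserved
    stale = 0
    for _ in range(max_iterations):
        scroll()
        new = [item for item in extract() if item not in seen]
        if new:
            seen.extend(dict.fromkeys(new))
            stale = 0
        else:
            stale += 1
            if stale >= patience:  # no new items after 2 scrolls: stop
                break
    return seen
```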

"Load More" button (Browser only):

  1. Extract currently visible data
  2. Find button: find(query="load more button", tabId=TAB)
  3. Click it: computer(action="left_click", ref=REF, tabId=TAB)
  4. Wait and extract new content
  5. Repeat until button disappears or max iterations reached

Phase 4: Extract

Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.

Table Mode

WebFetch prompt:

"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

List Mode

WebFetch prompt:

"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

Article Mode

WebFetch prompt:

"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

Product Mode

WebFetch prompt:

"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
  availability, description (first 200 chars), rating, reviewCount,
  specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."

Also check for JSON-LD Product schema (Strategy E) first.

Contact Mode

WebFetch prompt:

"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

FAQ Mode

WebFetch prompt:

"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

Pricing Mode

WebFetch prompt:

"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

Events Mode

WebFetch prompt:

"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

Jobs Mode

WebFetch prompt:

"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

Custom Mode

When user provides specific selectors or field descriptions:

  • Use Browser automation with javascript_tool and user's CSS selectors
  • Or use WebFetch with a prompt built from user's field descriptions
  • Always confirm extracted schema with user before proceeding to multi-URL

Multi-URL Extraction

When extracting from multiple URLs:

  1. Extract from the first URL to establish the data schema
  2. Show user the first results and confirm the schema is correct
  3. Extract from remaining URLs using the same schema
  4. Add a source column/field to every record with the origin URL
  5. Combine all results into a single output
  6. Show progress: "Extracting 3/7 URLs..."

Phase 5: Transform

Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.

Automatic Transforms (Always Apply)

| Transform | Action |
| --- | --- |
| Whitespace cleanup | Trim, collapse multiple spaces, remove \n in cells |
| HTML entity decode | &amp; -> &, &lt; -> <, &#39; -> ' |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | "" -> null (for JSON), "" -> N/A (for tables) |
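
The four automatic transforms map closely onto Python's standard library. The sketch below applies them to a single cell; the function name is hypothetical, and the order of the steps is a choice made for this example.

```python
import html
import re
import unicodedata
from typing import Optional


def clean_cell(value: Optional[str]) -> Optional[str]:
    """Apply the four automatic transforms to one extracted cell."""
    if value is None:
        return None
    text = html.unescape(value)                 # &amp; -> &, &#39; -> '
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width digits -> ASCII
    text = re.sub(r"\s+", " ", text).strip()    # trim, collapse, drop newlines
    return text or None                         # "" -> None (JSON null)
```

For table output, a `None` result would be rendered as N/A at formatting time.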

Conditional Transforms (Apply When Relevant)

| Transform | When | Action |
| --- | --- | --- |
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
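
Three of the conditional transforms are sketched below for illustration. The function names are hypothetical. The price parser assumes "," is a thousands separator and "." the decimal point, and the date parser only tries the formats listed in its default argument, so real pages may need other conventions.

```python
import re
from datetime import datetime
from typing import Optional
from urllib.parse import urljoin


def normalize_price(raw: str) -> tuple[Optional[float], Optional[str]]:
    """Split a price such as "$1,299.00" into (1299.0, "$")."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None, None
    symbol = re.search(r"[$€£¥]", raw)
    return float(match.group().replace(",", "")), symbol.group() if symbol else None


def normalize_date(raw: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no listed format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def resolve_url(page_url: str, href: str) -> str:
    """Convert a relative link to an absolute URL based on the page URL."""
    return urljoin(page_url, href)
```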

Data Enrichment (Only When Useful)

| Enrichment | When | Action |
| --- | --- | --- |
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |

Deduplication Strategy

When combining data from multiple pages or URLs:

  1. Exact match: rows with identical values in all fields -> keep first
  2. Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
  3. Report: "Removed N duplicate rows" in delivery notes
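
The three deduplication steps above can be sketched as follows. The function names are hypothetical, and the default key fields (`name`, `source`) follow the example in step 2. Counting `None` values as the measure of completeness is an assumption.

```python
def count_nulls(row: dict) -> int:
    return sum(1 for v in row.values() if v is None)


def deduplicate(rows: list[dict], key_fields: tuple[str, ...] = ("name", "source")):
    """Drop exact duplicates, then keep the most complete row per key.

    Returns (kept_rows, removed_count, notes).
    """
    exact_seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted((k, repr(v)) for k, v in row.items()))
        if fingerprint not in exact_seen:  # exact match: keep first
            exact_seen.add(fingerprint)
            unique.append(row)

    best: dict[tuple, dict] = {}
    notes = []
    for row in unique:
        key = tuple(row.get(f) for f in key_fields)
        current = best.get(key)
        if current is None:
            best[key] = row
            continue
        notes.append(f"Near match on {key}: kept the more complete row")
        if count_nulls(row) < count_nulls(current):  # fewer nulls wins
            best[key] = row
    kept = list(best.values())
    return kept, len(rows) - len(kept), notes
```

The returned count feeds the "Removed N duplicate rows" line in the delivery notes.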

Phase 6: Validate

Verify extraction quality before delivering results.

Validation Checks

| Check | Action |
| --- | --- |
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |

Confidence Rating

Assign to every extraction:

| Rating | Criteria |
| --- | --- |
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |

Always report confidence with specifics:

Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
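
A rough sketch of how the rating criteria might be computed is shown below. Several points are assumptions, because the criteria leave them open: "count slightly differs" is read as within 10% of the expected count, exactly 10% empty fields is treated as LOW, and anomaly detection is not modeled at all.

```python
def rate_confidence(rows: list[dict], expected_count: int) -> str:
    """Rate an extraction HIGH, MEDIUM, or LOW from fill rate and item count."""
    values = [v for row in rows for v in row.values()]
    if not values:
        return "LOW"
    empty_ratio = sum(1 for v in values if v in (None, "", "N/A")) / len(values)
    count_gap = abs(len(rows) - expected_count) / max(expected_count, 1)
    if empty_ratio == 0 and count_gap == 0:
        return "HIGH"
    if empty_ratio < 0.10 and count_gap <= 0.10:
        return "MEDIUM"
    return "LOW"
```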

Auto-Recovery (Try Before Reporting Issues)

| Issue | Auto-Recovery Action |
| --- | --- |
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |

Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.


Phase 7: Format and Deliver

Structure results according to user preference. See references/output-templates.md for complete formatting templates.

Delivery Envelope

ALWAYS wrap results with this metadata header:


## Extraction Results

**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV

---

[DATA HERE]

---

**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]

Markdown Table Rules

  • Left-align text columns (:---), right-align numbers (---:)
  • Consistent column widths for readability
  • Include summary row for numeric data when useful (totals, averages)
  • Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
  • Truncate long cell values to 60 chars with ... indicator
  • Use N/A for missing values, never leave cells empty
  • For multi-page results, show combined table (not per-page)

JSON Rules

  • Use camelCase for keys (e.g. productName, unitPrice)
  • Wrap in metadata envelope:
    {
      "metadata": {
        "source": "URL",
        "title": "Page Title",
        "extractedAt": "ISO-8601",
        "itemCount": 47,
        "fieldCount": 6,
        "confidence": "HIGH",
        "strategy": "A",
        "transforms": ["deduplication", "priceNormalization"],
        "notes": []
      },
      "data": [ ... ]
    }
  • Pretty-print with 2-space indentation
  • Numbers as numbers (not strings), booleans as booleans
  • null for missing values (not empty strings)
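
As an illustration of the JSON rules, the sketch below builds the metadata envelope and pretty-prints it with 2-space indentation. The helper name `build_envelope` and the sample record are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_envelope(rows, source, title, confidence, strategy, transforms=(), notes=()):
    """Wrap extracted rows in the metadata envelope described above."""
    return {
        "metadata": {
            "source": source,
            "title": title,
            "extractedAt": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "itemCount": len(rows),
            "fieldCount": len(rows[0]) if rows else 0,
            "confidence": confidence,
            "strategy": strategy,
            "transforms": list(transforms),
            "notes": list(notes),
        },
        "data": rows,
    }


# Hypothetical record: numbers stay numbers, booleans stay booleans,
# and a missing value is None, which serializes as null.
envelope = build_envelope(
    [{"productName": "Widget", "unitPrice": 9.99, "inStock": True, "rating": None}],
    source="https://example.com/products", title="Products",
    confidence="HIGH", strategy="A",
)
print(json.dumps(envelope, indent=2, ensure_ascii=False))
```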

CSV Rules

  • First row is always headers
  • Quote any field containing commas, quotes, or newlines
  • UTF-8 encoding with BOM for Excel compatibility
  • Use , as delimiter (standard)
  • Include metadata as comments: # Source: URL
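
The CSV rules can be sketched with Python's `csv` module, which handles the quoting rule by default. The function name is hypothetical and the sketch assumes at least one row. Placing the `# Source:` comment before the header row is also an assumption; not every CSV reader skips `#` lines, so some consumers may need the comment removed.

```python
import csv
import io


def to_csv_bytes(rows: list[dict], source_url: str) -> bytes:
    """Serialize rows as UTF-8 CSV with a BOM and a leading metadata comment."""
    buffer = io.StringIO(newline="")
    buffer.write(f"# Source: {source_url}\r\n")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()    # header row
    writer.writerows(rows)  # fields with commas, quotes, or newlines get quoted
    return buffer.getvalue().encode("utf-8-sig")  # utf-8-sig prepends the BOM
```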

File Output

When user requests file save:

  • Markdown: .md extension
  • JSON: .json extension
  • CSV: .csv extension
  • Confirm path before writing
  • Report full file path and item count after saving

Multi-URL Comparison Format

When comparing data across multiple sources:

  • Add Source as the first column/field
  • Use short identifiers for sources (domain name or user label)
  • Group by source or interleave based on user preference
  • Highlight differences if user asks for comparison
  • Include summary: "Best price: $X at store-b.com"

Differential Output

When user requests change detection (diff mode):

  • Compare current extraction with previous run
  • Mark new items with [NEW]
  • Mark removed items with [REMOVED]
  • Mark changed values with [WAS: old_value]
  • Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
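
A minimal sketch of diff mode follows, assuming each record has a unique key field (here the hypothetical default `name`) and that the previous run is available as a list of records. How previous runs are stored is not specified by the Skill.

```python
def diff_runs(previous: list[dict], current: list[dict], key: str = "name"):
    """Compare two runs keyed on one field and label the changes."""
    old = {row[key]: row for row in previous}
    new = {row[key]: row for row in current}
    lines = []
    added = removed = modified = 0
    for k, row in new.items():
        if k not in old:
            lines.append(f"[NEW] {row}")
            added += 1
            continue
        # Fields whose value differs from the previous run.
        changes = {f: old[k].get(f) for f, v in row.items() if old[k].get(f) != v}
        if changes:
            marks = ", ".join(f"{f}={row[f]} [WAS: {was}]" for f, was in changes.items())
            lines.append(f"{k}: {marks}")
            modified += 1
    for k, row in old.items():
        if k not in new:
            lines.append(f"[REMOVED] {row}")
            removed += 1
    summary = f"Changes since last run: +{added} new, -{removed} removed, {modified} modified"
    return lines, summary
```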

Rate Limiting

  • Maximum 1 request per 2 seconds for sequential page fetches
  • For multi-URL jobs, process sequentially with pauses
  • If a site returns 429 (Too Many Requests), stop and report to user
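
The 2-second spacing can be enforced with a small helper such as the one below. This is a sketch; the class name is hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum gap between requests (2 seconds by default)."""

    def __init__(self, min_interval: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> float:
        """Sleep if needed before the next request; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `limiter.wait()` before each sequential fetch keeps requests at least `min_interval` apart. A 429 response should still stop the job rather than trigger a retry.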

Access Respect

  • If a page blocks access (403, CAPTCHA, login wall), report to user
  • Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
  • Do NOT scrape behind authentication unless user explicitly provides access
  • Respect robots.txt directives when known
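
When the text of a site's robots.txt is already known, Python's standard `urllib.robotparser` can check a URL against it. The wrapper function below is a hypothetical convenience, and the rules in the usage note are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules `User-agent: *` and `Disallow: /private/`, a URL under `/private/` is rejected and other paths on the same site are allowed.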

Copyright

  • Do NOT reproduce large blocks of copyrighted article text
  • For articles: extract factual data, statistics, and structured info; summarize narrative content
  • Always include source attribution (http://example.com) in output

Data Scope

  • Extract ONLY what the user explicitly requested
  • Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
  • Do not store or transmit extracted data beyond what the user sees

Failure Protocol

When extraction fails or is blocked:

  1. Explain the specific reason (JS rendering, bot detection, login, etc.)
  2. Suggest alternatives (different URL, API if available, manual approach)
  3. Never retry aggressively or escalate access attempts

Quick Reference: Mode Cheat Sheet

| User Says... | Mode | Strategy | Output Default |
| --- | --- | --- | --- |
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |

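References

  • references/extraction-patterns.md: CSS selectors and JavaScript snippets (Phase 4)
  • references/data-transforms.md: transform patterns (Phase 5)
  • references/output-templates.md: complete formatting templates (Phase 7)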

Best Practices

  • Provide clear, specific context about your project and requirements
  • Review all suggestions before applying them to production code
  • Combine with other complementary skills for comprehensive analysis

Common Pitfalls

  • Using this skill for tasks outside its domain expertise
  • Applying recommendations without understanding your specific context
  • Not providing enough project context for accurate analysis

Limitations

  • Use this skill only when the task clearly matches the scope described above.
  • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
  • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

Bundled files

※ List of the files contained in the ZIP. In addition to `SKILL.md` itself, it may include reference materials, samples, and scripts.