
Cheerio

Overview

Cheerio is a fast, lightweight HTML/XML parser for Node.js that implements a jQuery-like API. Unlike Puppeteer, it does not run a browser: it parses raw HTML strings and never executes page JavaScript. That makes it far faster than browser automation (often by orders of magnitude) and well suited to scraping server-rendered pages, parsing HTML files, and transforming HTML content. Content that only appears after client-side rendering is not visible to it. Pair it with fetch or axios for web scraping, or use it standalone for HTML processing.

Instructions

Step 1: Installation

npm install cheerio

Step 2: Parse HTML and Extract Data

// parse_html.js — Load HTML and extract structured data with CSS selectors
import * as cheerio from 'cheerio'

const html = `
<html>
  <body>
    <h1>Products</h1>
    <div class="product" data-id="1">
      <h2>Widget Pro</h2>
      <span class="price">$29.99</span>
      <a href="/products/widget-pro">Details</a>
    </div>
    <div class="product" data-id="2">
      <h2>Gadget Max</h2>
      <span class="price">$49.99</span>
      <a href="/products/gadget-max">Details</a>
    </div>
  </body>
</html>`

const $ = cheerio.load(html)

// Extract all products
const products = []
$('.product').each((i, el) => {
  products.push({
    id: $(el).attr('data-id'),
    title: $(el).find('h2').text().trim(),
    price: $(el).find('.price').text().trim(),
    link: $(el).find('a').attr('href'),
  })
})

console.log(products)
// [{ id: '1', title: 'Widget Pro', price: '$29.99', link: '/products/widget-pro' }, ...]

Step 3: Web Scraping with Fetch

// scrape_site.js — Fetch a page and extract data
import * as cheerio from 'cheerio'

async function scrape(url) {
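  const response = await fetch(url)
  // fetch resolves even on 4xx/5xx responses, so check the status explicitly
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${url}`)
  }
  const html = await response.text()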
  const $ = cheerio.load(html)

  // Extract all links
  const links = []
  $('a[href]').each((i, el) => {
    links.push({
      text: $(el).text().trim(),
      href: $(el).attr('href'),
    })
  })

  // Extract meta tags
  const meta = {
    title: $('title').text(),
    description: $('meta[name="description"]').attr('content'),
    ogImage: $('meta[property="og:image"]').attr('content'),
  }

  return { links, meta }
}
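
The href values collected by scrape() are returned exactly as written in the page, so many of them will be relative paths. A small helper along these lines could resolve them against the page URL using the standard URL constructor. The name toAbsoluteUrl and the example.com addresses are hypothetical, chosen only for illustration.

```javascript
// Resolve an href found in a page against that page's own URL.
// Returns null when the URL parser rejects the combination.
function toAbsoluteUrl(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href
  } catch {
    return null
  }
}

const pageUrl = 'https://example.com/shop/index.html'

console.log(toAbsoluteUrl('/products/widget-pro', pageUrl))
// https://example.com/products/widget-pro
console.log(toAbsoluteUrl('details.html', pageUrl))
// https://example.com/shop/details.html
console.log(toAbsoluteUrl('https://other.example/page', pageUrl))
// https://other.example/page
```

Inside scrape(), the same call could be applied to each href with the url argument as the base.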

Step 4: Advanced Selectors and Traversal

// selectors.js — Complex CSS selectors and DOM traversal
import * as cheerio from 'cheerio'

// html is the sample markup defined in Step 2
const $ = cheerio.load(html)

// Attribute selectors
$('a[href^="https"]')           // links starting with https
$('img[src$=".png"]')           // PNG images
$('div[class*="product"]')      // divs with "product" in class

// Traversal
$('.product').first()            // first product
$('.product').last()             // last product
$('.product').eq(2)              // third product (0-indexed)
$('.price').parent()             // parent of each .price element
$('.product').children('h2')     // direct h2 children
$('.product').find('.price')     // descendants matching .price
$('.product').next()             // next sibling
$('.product').prev()             // previous sibling

// Filtering
$('.product').filter((i, el) => {
  const price = parseFloat($(el).find('.price').text().replace('$', ''))
  return price < 50
})

// Text and HTML
$('.product').first().text()     // all text content, flattened
$('.product').first().html()     // inner HTML

Step 5: Table Extraction

// extract_table.js — Parse HTML tables into structured data
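/**
 * Convert an HTML table to an array of objects using headers as keys.
 * Assumes the header cells are <th> elements inside <thead>.
 * @param $ Cheerio instance returned by cheerio.load()
 * @param {string} tableSelector CSS selector for the table element
 * @returns {Object[]} one object per body row
 */
function extractTable($, tableSelector) {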
  const headers = []
  $(`${tableSelector} thead th`).each((i, el) => {
    headers.push($(el).text().trim())
  })

  const rows = []
  $(`${tableSelector} tbody tr`).each((i, tr) => {
    const row = {}
    $(tr).find('td').each((j, td) => {
      row[headers[j]] = $(td).text().trim()
    })
    rows.push(row)
  })
  return rows
}

// Usage
const tableData = extractTable($, '#pricing-table')
// [{ Plan: 'Free', Price: '$0', Users: '1' }, { Plan: 'Pro', Price: '$29', Users: '10' }]

Step 6: HTML Transformation

// transform.js — Modify HTML content
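import * as cheerio from 'cheerio'

// html is the markup to transform, for example the sample from Step 2
const $ = cheerio.load(html)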

// Add class
$('.product').addClass('featured')

// Remove elements
$('.ad-banner').remove()

// Replace content
$('h1').text('Updated Title')

// Wrap elements
$('.product').wrap('<section class="product-section"></section>')

// Add attributes
$('a').attr('target', '_blank')
$('img').attr('loading', 'lazy')

// Get modified HTML
const modifiedHtml = $.html()

Examples

Example 1: Build a price monitoring scraper

User prompt: "Scrape product prices from 5 competitor websites daily and save to a CSV. The sites are server-rendered (no JavaScript needed)."

The agent will:

1. Use fetch + cheerio for each site (no browser overhead).
2. Write site-specific selectors for product name, price, and availability.
3. Parse prices into numbers and normalize the currency.
4. Append results to a CSV file with timestamps.
5. Schedule the script as a cron job for daily execution.
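
The price-parsing and CSV steps in that plan might look roughly like the sketch below. The helper names (parsePrice, csvField, toCsvRow), the record fields, and the sample values are all hypothetical. The parser also assumes "." is the decimal separator and "," is a thousands separator, which will not hold for every locale.

```javascript
// Parse a displayed price such as "$1,299.00" into a number.
// Assumes "." is the decimal separator; returns null if no number is found.
function parsePrice(text) {
  const cleaned = text.replace(/[^0-9.]/g, '')
  const value = Number.parseFloat(cleaned)
  return Number.isNaN(value) ? null : value
}

// Quote a CSV field when it contains a comma, double quote, or newline
function csvField(value) {
  const s = value == null ? '' : String(value)
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s
}

// Build one CSV line from a scraped record
function toCsvRow(record) {
  return [record.timestamp, record.site, record.name, record.price]
    .map(csvField)
    .join(',')
}

const row = toCsvRow({
  timestamp: '2024-01-01T00:00:00Z',
  site: 'example.com',
  name: 'Widget Pro, 2-pack',
  price: parsePrice('$1,299.00'),
})
console.log(row)
// 2024-01-01T00:00:00Z,example.com,"Widget Pro, 2-pack",1299
```

Each row could then be appended to the output file with fs.appendFileSync from node:fs, adding a trailing newline per record.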

Example 2: Extract and clean article content from HTML

User prompt: "I have 1,000 saved HTML pages from a blog. Extract just the article title, author, date, and body text from each, ignoring navigation, ads, and footers."