Cheerio

Fast server-side HTML parsing and data extraction with Cheerio using jQuery-like syntax

Cheerio — Web Scraping

You are an expert in Cheerio for parsing HTML and extracting structured data from web pages on the server side.

Core Philosophy

Overview

Cheerio is a fast, lightweight library that implements a subset of jQuery for server-side HTML/XML parsing in Node.js. It does not execute JavaScript or render pages — it works directly on raw HTML strings, making it extremely fast and memory-efficient for scraping static content.

Setup & Configuration

npm install cheerio

For fetching HTML, pair Cheerio with an HTTP client:

npm install axios    # or node-fetch, undici, got

Basic usage:

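const cheerio = require('cheerio');
const axios = require('axios');

// Top-level await is unavailable in CommonJS, so wrap the calls in an async function.
async function main() {
  const { data: html } = await axios.get('https://example.com');
  const $ = cheerio.load(html);
  console.log($('title').text());
}
main().catch(console.error);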

Core Patterns

Selecting and extracting text

const $ = cheerio.load(html);

// Single element
const title = $('h1').text().trim();

// Attribute extraction
const imageUrl = $('img.hero').attr('src');

// Multiple elements into an array
const links = $('a.nav-link')
  .map((i, el) => ({
    text: $(el).text().trim(),
    href: $(el).attr('href'),
  }))
  .get(); // .get() converts Cheerio object to plain array

Scraping a table

const rows = $('table.data tbody tr')
  .map((i, row) => {
    const cells = $(row).find('td');
    return {
      name: $(cells[0]).text().trim(),
      value: $(cells[1]).text().trim(),
      date: $(cells[2]).text().trim(),
    };
  })
  .get();

Navigating the DOM

// Parent, children, siblings
const parent = $('.target').parent();
const children = $('.container').children('div');
const nextSibling = $('.item').next();

// Find within a subtree
$('.product-card').each((i, card) => {
  const name = $(card).find('.name').text().trim();
  const price = $(card).find('.price').text().trim();
  console.log({ name, price });
});

// Filtering
const activeItems = $('li').filter('.active');

Handling HTML content

// Get inner HTML
const content = $('.article-body').html();

// Get outer HTML
const outer = $.html($('.article-body'));

// Modify the DOM (useful for cleaning before extraction)
$('script, style, nav, footer').remove();
const cleanText = $('body').text().trim();

Fetching and parsing in a pipeline

async function scrapeProducts(url) {
  const { data: html } = await axios.get(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)' },
    timeout: 10000,
  });

  const $ = cheerio.load(html);

  return $('.product')
    .map((i, el) => ({
      name: $(el).find('.product-name').text().trim(),
      price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
      available: $(el).find('.stock').hasClass('in-stock'),
    }))
    .get();
}

Best Practices

  • Use Cheerio for static HTML only. If the page requires JavaScript execution to render content, render it with Puppeteer or Playwright first, then feed the rendered HTML into Cheerio for parsing.
  • Remove noise before extraction. Strip <script>, <style>, <nav>, and ad containers early to simplify selectors and avoid extracting irrelevant text.
  • Use .get() after .map(). Cheerio's .map() returns a Cheerio object, not a plain array. Call .get() to convert it.
  • Prefer specific selectors. Use classes and data attributes rather than fragile positional selectors like div > div:nth-child(3).
  • Parse numbers explicitly. Always parseFloat or parseInt extracted text; do not assume the string is already numeric.
  • Keep decodeEntities at its default (true) so that HTML entities like &amp; are decoded correctly; there is no need to set it explicitly.

Common Pitfalls

  • Expecting JavaScript rendering. Cheerio parses raw HTML. If the page loads data via XHR/fetch after initial load, Cheerio will not see that content.
  • Forgetting .trim(). Extracted text often includes leading/trailing whitespace and newlines.
  • Using .text() on a collection. $('p').text() concatenates the text of all matched <p> elements. Use .each() or .map() to process them individually.
  • Selector specificity issues. A selector that works on one page version may break when the site updates its markup. Prefer data attributes and semantic selectors when available.
  • Character encoding mismatches. The decodeEntities option only affects HTML entities, not character encoding. If the page uses a non-UTF-8 encoding, fetch the raw bytes (responseType: 'arraybuffer' with axios) and decode them with the correct charset (for example with iconv-lite) before passing the string to cheerio.load().
  • Memory with large documents. Cheerio builds the entire document tree in memory, so extremely large HTML documents (50MB+) can exhaust available memory. Stream-parse with htmlparser2 directly for such cases.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
