# Cheerio
Fast server-side HTML parsing and data extraction with Cheerio's jQuery-like syntax
You are an expert in Cheerio for parsing HTML and extracting structured data from web pages on the server side.
## Overview
Cheerio is a fast, lightweight library that implements a subset of jQuery for server-side HTML/XML parsing in Node.js. It does not execute JavaScript or render pages — it works directly on raw HTML strings, making it extremely fast and memory-efficient for scraping static content.
## Setup & Configuration

```bash
npm install cheerio
```

For fetching HTML, pair Cheerio with an HTTP client:

```bash
npm install axios # or node-fetch, undici, got
```

Basic usage:

```js
const cheerio = require('cheerio');
const axios = require('axios');

const { data: html } = await axios.get('https://example.com');
const $ = cheerio.load(html);
```
## Core Patterns

### Selecting and extracting text

```js
const $ = cheerio.load(html);

// Single element
const title = $('h1').text().trim();

// Attribute extraction
const imageUrl = $('img.hero').attr('src');

// Multiple elements into an array
const links = $('a.nav-link')
  .map((i, el) => ({
    text: $(el).text().trim(),
    href: $(el).attr('href'),
  }))
  .get(); // .get() converts Cheerio object to plain array
```
### Scraping a table

```js
const rows = $('table.data tbody tr')
  .map((i, row) => {
    const cells = $(row).find('td');
    return {
      name: $(cells[0]).text().trim(),
      value: $(cells[1]).text().trim(),
      date: $(cells[2]).text().trim(),
    };
  })
  .get();
```
### Navigating the DOM

```js
// Parent, children, siblings
const parent = $('.target').parent();
const children = $('.container').children('div');
const nextSibling = $('.item').next();

// Find within a subtree
$('.product-card').each((i, card) => {
  const name = $(card).find('.name').text();
  const price = $(card).find('.price').text();
  console.log({ name, price });
});

// Filtering
const activeItems = $('li').filter('.active');
```
### Handling HTML content

```js
// Get inner HTML
const content = $('.article-body').html();

// Get outer HTML
const outer = $.html($('.article-body'));

// Modify the DOM (useful for cleaning before extraction)
$('script, style, nav, footer').remove();
const cleanText = $('body').text().trim();
```
### Fetching and parsing in a pipeline

```js
async function scrapeProducts(url) {
  const { data: html } = await axios.get(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)' },
    timeout: 10000,
  });
  const $ = cheerio.load(html);

  return $('.product')
    .map((i, el) => ({
      name: $(el).find('.product-name').text().trim(),
      price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
      available: $(el).find('.stock').hasClass('in-stock'),
    }))
    .get();
}
```
## Best Practices

- **Use Cheerio for static HTML only.** If the page requires JavaScript execution to render content, use Puppeteer or Playwright instead and feed the rendered HTML into Cheerio for parsing.
- **Remove noise before extraction.** Strip `<script>`, `<style>`, `<nav>`, and ad containers early to simplify selectors and avoid extracting irrelevant text.
- **Use `.get()` after `.map()`.** Cheerio's `.map()` returns a Cheerio object, not a plain array. Call `.get()` to convert it.
- **Prefer specific selectors.** Use classes and data attributes rather than fragile positional selectors like `div > div:nth-child(3)`.
- **Parse numbers explicitly.** Always `parseFloat` or `parseInt` extracted text; do not assume the string is already numeric.
- **Set `decodeEntities: true`** (the default) to handle HTML entities like `&amp;` correctly.
## Common Pitfalls

- **Expecting JavaScript rendering.** Cheerio parses raw HTML. If the page loads data via XHR/fetch after initial load, Cheerio will not see that content.
- **Forgetting `.trim()`.** Extracted text often includes leading/trailing whitespace and newlines.
- **Using `.text()` on a collection.** `$('p').text()` concatenates the text of all matched `<p>` elements. Use `.each()` or `.map()` to process them individually.
- **Selector specificity issues.** A selector that works on one page version may break when the site updates its markup. Prefer data attributes and semantic selectors when available.
- **Character encoding mismatches.** If the page uses a non-UTF-8 encoding, pass `{ decodeEntities: true }` and ensure the HTTP response is decoded properly (use `responseType: 'arraybuffer'` with axios and decode with `iconv-lite` if needed).
- **Memory with large documents.** While Cheerio is fast, loading extremely large HTML documents (50MB+) into memory can still cause issues. Stream-parse with `htmlparser2` directly for such cases.
## Anti-Patterns

- **Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- **Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- **Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- **Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- **Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month or to you next year.
## Related Skills

- **Anti Detection**: Ethical techniques for handling CAPTCHAs, rate limiting, and bot detection while scraping responsibly
- **Beautifulsoup**: HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
- **Data Pipeline**: Patterns for building robust scraping data pipelines with validation, deduplication, storage, and monitoring
- **Playwright Scraping**: Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
- **Puppeteer**: Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages
- **Scrapy**: Production-grade web scraping framework in Python with built-in crawling, pipelines, and middleware