
Beautiful Soup

HTML and XML parsing with Beautiful Soup in Python for flexible data extraction

Quick Summary
You are an expert in Beautiful Soup (bs4) for parsing HTML/XML documents and extracting data in Python.

## Key Points

- **Use `lxml` as the parser** for the best balance of speed and leniency. Fall back to `html5lib` only for extremely broken HTML.
- **Prefer `soup.select()` (CSS selectors)** for complex queries — they are often more readable than chained `find` calls.
- **Always use `get_text(strip=True)`** to avoid leading/trailing whitespace in extracted text.
- **Use `requests.Session`** for multi-page scrapes to persist cookies and reuse pooled connections.
- **Check for `None` before accessing attributes.** `soup.find('h2')` returns `None` if not found; accessing `['href']` on `None` raises an error.
- **Respect rate limits.** Add `time.sleep()` between requests when scraping multiple pages.
- **Parser inconsistency.** Different parsers produce different trees from the same HTML. Always specify the parser explicitly (`'lxml'`, not just relying on the default).
- **`class_` vs `class`.** `class` is a Python keyword, so Beautiful Soup uses the `class_` parameter: `soup.find('div', class_='name')`.
- **`.string` vs `.get_text()`.** `.string` returns `None` if a tag has multiple children. Always prefer `.get_text()` for reliable text extraction.
- **Encoding issues.** If the page uses a non-UTF-8 charset, set `response.encoding` before parsing: `response.encoding = response.apparent_encoding`.
- **Loading the entire page into memory.** For very large HTML documents, consider using `lxml.etree.iterparse` for streaming instead of loading everything into Beautiful Soup at once.

## Quick Example

```bash
pip install beautifulsoup4
pip install lxml          # recommended fast parser
pip install requests      # for fetching pages
```

# Beautiful Soup — Web Scraping

You are an expert in Beautiful Soup (bs4) for parsing HTML/XML documents and extracting data in Python.

## Core Philosophy

### Overview

Beautiful Soup is a Python library for pulling data out of HTML and XML documents. It sits on top of a parser (like html.parser, lxml, or html5lib) and provides Pythonic idioms for navigating, searching, and modifying the parse tree. It is forgiving with malformed markup and ideal for quick scraping scripts and data extraction tasks.
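As a minimal sketch of that workflow (the HTML fragment here is invented for illustration, and the built-in `html.parser` is used so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

# A small, deliberately unclosed fragment -- Beautiful Soup repairs it
html = "<div><h1>Title</h1><p class='lead'>Hello, <b>world</b></p>"

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; no extra install

print(soup.h1.get_text())                        # Title
print(soup.find("p", class_="lead").get_text())  # Hello, world
```

Note that the missing `</div>` is handled silently — forgiveness toward broken markup is the library's main selling point.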

## Setup & Configuration

```bash
pip install beautifulsoup4
pip install lxml          # recommended fast parser
pip install requests      # for fetching pages
```

Basic usage:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com', headers={
    'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)',
})
soup = BeautifulSoup(response.text, 'lxml')
```

Parser comparison:

| Parser | Speed | Lenient | Install |
|---|---|---|---|
| `html.parser` | Medium | Yes | Built-in |
| `lxml` | Fast | Yes | `pip install lxml` |
| `html5lib` | Slow | Most | `pip install html5lib` |
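One concrete difference worth knowing (a small illustration, using only the built-in parser so it runs anywhere): `html.parser` leaves a fragment bare, while `lxml` and `html5lib` wrap it in `<html><body>` the way a browser would, which changes how selectors like `body > p` behave:

```python
from bs4 import BeautifulSoup

fragment = "<p>standalone paragraph</p>"

# html.parser keeps the fragment as-is -- no <html>/<body> wrapper is added
soup = BeautifulSoup(fragment, "html.parser")
print(soup.find("html"))  # None
print(str(soup))          # <p>standalone paragraph</p>

# With 'lxml' or 'html5lib' the same call would instead yield a tree like
# <html><body><p>standalone paragraph</p></body></html>
```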

## Core Patterns

### Finding elements

```python
# By tag
title = soup.find('h1').get_text(strip=True)

# By CSS class
cards = soup.find_all('div', class_='product-card')

# By ID
sidebar = soup.find(id='sidebar')

# By attribute
links = soup.find_all('a', attrs={'data-type': 'external'})

# Using CSS selectors
prices = soup.select('div.product span.price')
first_item = soup.select_one('.listing:first-child')
```

### Extracting data from elements

```python
for card in soup.find_all('div', class_='product'):
    name = card.find('h2').get_text(strip=True)
    price = card.find('span', class_='price').get_text(strip=True)
    link = card.find('a')['href']    # access attribute like a dict
    image = card.find('img').get('src', '')  # .get() for safe access
    print(f'{name}: {price} — {link}')
```

### Navigating the tree

```python
# Parent
parent_div = soup.find('span', class_='price').parent

# Children (direct descendants only)
for child in soup.find('ul').children:
    if child.name == 'li':
        print(child.get_text())

# Siblings
first_item = soup.find('li', class_='first')
next_item = first_item.find_next_sibling('li')

# Descendants (all nested elements)
for tag in soup.find('div', class_='content').descendants:
    if hasattr(tag, 'name') and tag.name == 'a':
        print(tag['href'])
```

### Scraping a table into structured data

```python
def parse_table(soup, table_selector):
    table = soup.select_one(table_selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

data = parse_table(soup, 'table.results')
```
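To sanity-check the helper without fetching a live page, it can be run against an inline snippet (the helper is repeated here, and the sample table invented, so the demo is self-contained):

```python
from bs4 import BeautifulSoup

def parse_table(soup, table_selector):
    table = soup.select_one(table_selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

html = """
<table class="results">
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Widget</td><td>$9.99</td></tr>
    <tr><td>Gadget</td><td>$24.50</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data = parse_table(soup, "table.results")
print(data)
# [{'Name': 'Widget', 'Price': '$9.99'}, {'Name': 'Gadget', 'Price': '$24.50'}]
```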

### Cleaning HTML content

```python
# Remove unwanted tags
for tag in soup.find_all(['script', 'style', 'nav', 'footer']):
    tag.decompose()

# Get clean text
clean_text = soup.get_text(separator='\n', strip=True)
```

### Combining with `requests.Session` for multi-page scraping

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'})

all_items = []
url = 'https://example.com/page/1'

while url:
    resp = session.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')

    items = [tag.get_text(strip=True) for tag in soup.select('.item-title')]
    all_items.extend(items)

    next_link = soup.select_one('a.next')
    # Resolve possibly relative hrefs against the current URL,
    # and pause between requests to respect rate limits
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)
```

## Best Practices

- Use `lxml` as the parser for the best balance of speed and leniency. Fall back to `html5lib` only for extremely broken HTML.
- Prefer `soup.select()` (CSS selectors) for complex queries — they are often more readable than chained `find` calls.
- Always use `get_text(strip=True)` to avoid leading/trailing whitespace in extracted text.
- Use `requests.Session` for multi-page scrapes to persist cookies and reuse pooled connections.
- Check for `None` before accessing attributes. `soup.find('h2')` returns `None` if not found; accessing `['href']` on `None` raises an error.
- Respect rate limits. Add `time.sleep()` between requests when scraping multiple pages.
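The `None`-check advice above can be packaged as a small helper (a sketch; the name `safe_text` is ours, not part of bs4):

```python
from bs4 import BeautifulSoup

def safe_text(parent, *args, default="", **kwargs):
    """find() with the given arguments, returning stripped text or a default."""
    tag = parent.find(*args, **kwargs)
    return tag.get_text(strip=True) if tag else default

soup = BeautifulSoup("<div><h2> Hello </h2></div>", "html.parser")
print(safe_text(soup, "h2"))                 # Hello  (whitespace stripped)
print(safe_text(soup, "h3", default="n/a"))  # n/a    (missing tag, no crash)
```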

## Common Pitfalls

- **`AttributeError: 'NoneType' ...`** — The most common error. Always check that `find()` returned a result before calling methods on it: `tag = soup.find('h2'); text = tag.get_text() if tag else ''`.
- **Parser inconsistency.** Different parsers produce different trees from the same HTML. Always specify the parser explicitly (`'lxml'`, not just relying on the default).
- **`class_` vs `class`.** `class` is a Python keyword, so Beautiful Soup uses the `class_` parameter: `soup.find('div', class_='name')`.
- **`.string` vs `.get_text()`.** `.string` returns `None` if a tag has multiple children. Always prefer `.get_text()` for reliable text extraction.
- **Encoding issues.** If the page uses a non-UTF-8 charset, set `response.encoding` before parsing: `response.encoding = response.apparent_encoding`.
- **Loading the entire page into memory.** For very large HTML documents, consider using `lxml.etree.iterparse` for streaming instead of loading everything into Beautiful Soup at once.
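The `.string` pitfall is easy to reproduce (a tiny illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.find("p")

# .string gives up as soon as the tag has more than one child
print(p.string)      # None
print(p.get_text())  # Hello world
```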

## Anti-Patterns

**Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

**Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

**Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

**Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

**Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month or to you next year.
