# Beautiful Soup
HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
You are an expert in Beautiful Soup (bs4) for parsing HTML/XML documents and extracting data in Python.
## Key Points
- **Use `lxml` as the parser** for the best balance of speed and leniency. Fall back to `html5lib` only for extremely broken HTML.
- **Prefer `soup.select()` (CSS selectors)** for complex queries — they are often more readable than chained `find` calls.
- **Always use `get_text(strip=True)`** to avoid leading/trailing whitespace in extracted text.
- **Use `requests.Session`** for multi-page scrapes to persist cookies and reuse connections (connection pooling).
- **Check for `None` before accessing attributes.** `soup.find('h2')` returns `None` if not found; accessing `['href']` on `None` raises an error.
- **Respect rate limits.** Add `time.sleep()` between requests when scraping multiple pages.
- **Parser inconsistency.** Different parsers produce different trees from the same HTML. Always specify the parser explicitly (`'lxml'`, not just relying on the default).
- **`class_` vs `class`.** `class` is a Python keyword, so Beautiful Soup uses the `class_` parameter: `soup.find('div', class_='name')`.
- **`.string` vs `.get_text()`.** `.string` returns `None` if a tag has multiple children. Always prefer `.get_text()` for reliable text extraction.
- **Encoding issues.** If the page uses a non-UTF-8 charset, set `response.encoding` before parsing: `response.encoding = response.apparent_encoding`.
- **Loading the entire page into memory.** For very large HTML documents, consider using `lxml.etree.iterparse` for streaming instead of loading everything into Beautiful Soup at once.
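Several of these points combine into one defensive extraction pattern. A minimal sketch (the HTML and class names are invented for illustration; `html.parser` is used here only so the snippet runs without extra installs):

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2> Widget </h2>
  <a href="/widget">details</a>
</div>
"""

# Name the parser explicitly; html.parser needs no extra install.
soup = BeautifulSoup(html, 'html.parser')

card = soup.find('div', class_='product')

# Guard against find() returning None before touching the result.
name = card.find('h2').get_text(strip=True) if card else ''

# .get() avoids a KeyError when an attribute is missing.
link_tag = card.find('a') if card else None
link = link_tag.get('href', '') if link_tag else ''
```

The guards look verbose, but they turn the most common scraping crash (`AttributeError` on `None`) into an empty-string result you can log and skip.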
## Quick Example
```bash
pip install beautifulsoup4
pip install lxml # recommended fast parser
pip install requests # for fetching pages
```
## Core Philosophy

### Overview
Beautiful Soup is a Python library for pulling data out of HTML and XML documents. It sits on top of a parser (like html.parser, lxml, or html5lib) and provides Pythonic idioms for navigating, searching, and modifying the parse tree. It is forgiving with malformed markup and ideal for quick scraping scripts and data extraction tasks.
## Setup & Configuration

```bash
pip install beautifulsoup4
pip install lxml      # recommended fast parser
pip install requests  # for fetching pages
```
Basic usage:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com', headers={
    'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)',
})
soup = BeautifulSoup(response.text, 'lxml')
```
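If the server misdeclares its charset, `response.text` may already be decoded incorrectly. Besides setting `response.encoding = response.apparent_encoding` before reading `.text`, you can hand Beautiful Soup the raw bytes along with the documented `from_encoding` argument. A sketch with simulated bytes standing in for `response.content`:

```python
from bs4 import BeautifulSoup

# Simulated response body: Latin-1 bytes, as response.content would provide.
raw = '<p>café</p>'.encode('latin-1')

# Passing bytes plus from_encoding sidesteps a wrong charset guess.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')
text = soup.get_text(strip=True)
```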
Parser comparison:

| Parser | Speed | Lenient | Install |
|---|---|---|---|
| `html.parser` | Medium | Yes | Built-in |
| `lxml` | Fast | Yes | `pip install lxml` |
| `html5lib` | Slow | Most lenient | `pip install html5lib` |
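The choice matters because the parsers genuinely build different trees from invalid markup. A small illustration (only the built-in parser's result is shown as code, since `lxml` and `html5lib` require separate installs):

```python
from bs4 import BeautifulSoup

# Invalid fragment: an unclosed <a> followed by a stray </p>.
fragment = "<a></p>"

# The built-in parser drops the stray </p> and closes the <a>.
built_in = str(BeautifulSoup(fragment, "html.parser"))

# lxml, if installed, would wrap the result in <html><body>...</body></html>,
# and html5lib would additionally add <head>, mimicking a real browser.
```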
## Core Patterns

### Finding elements

```python
# By tag
title = soup.find('h1').get_text(strip=True)

# By CSS class
cards = soup.find_all('div', class_='product-card')

# By ID
sidebar = soup.find(id='sidebar')

# By attribute
links = soup.find_all('a', attrs={'data-type': 'external'})

# Using CSS selectors
prices = soup.select('div.product span.price')
first_item = soup.select_one('.listing:first-child')
```
### Extracting data from elements

```python
for card in soup.find_all('div', class_='product'):
    name = card.find('h2').get_text(strip=True)
    price = card.find('span', class_='price').get_text(strip=True)
    link = card.find('a')['href']            # access attributes like a dict
    image = card.find('img').get('src', '')  # .get() for safe access
    print(f'{name}: {price} — {link}')
```
### Navigating the tree

```python
# Parent
parent_div = soup.find('span', class_='price').parent

# Children (direct descendants only)
for child in soup.find('ul').children:
    if child.name == 'li':
        print(child.get_text())

# Siblings
first_item = soup.find('li', class_='first')
next_item = first_item.find_next_sibling('li')

# Descendants (all nested elements)
for tag in soup.find('div', class_='content').descendants:
    if hasattr(tag, 'name') and tag.name == 'a':
        print(tag['href'])
```
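To make the navigation calls concrete, here is a self-contained run on a tiny fragment (markup invented for the example):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="first">one</li>
  <li>two</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li', class_='first')

# .parent climbs one level up the tree.
parent_name = first.parent.name

# find_next_sibling('li') skips the whitespace text node between the items.
second_text = first.find_next_sibling('li').get_text(strip=True)
```

Note the whitespace point: with `.next_sibling` alone you would land on the newline text node, which is why the tag-filtered `find_next_sibling('li')` is usually the safer call.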
### Scraping a table into structured data

```python
def parse_table(soup, table_selector):
    table = soup.select_one(table_selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

data = parse_table(soup, 'table.results')
```
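A quick check of the helper against an inline table shows the output shape (the function is repeated here only so the snippet runs standalone; the table content is made up):

```python
from bs4 import BeautifulSoup

def parse_table(soup, table_selector):
    # Same helper as above: headers from <thead>, one dict per <tbody> row.
    table = soup.select_one(table_selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

html = """
<table class="results">
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody><tr><td>Widget</td><td>9.99</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
data = parse_table(soup, 'table.results')
# data is a list of dicts keyed by the header row
```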
### Cleaning HTML content

```python
# Remove unwanted tags
for tag in soup.find_all(['script', 'style', 'nav', 'footer']):
    tag.decompose()

# Get clean text
clean_text = soup.get_text(separator='\n', strip=True)
```
### Combining with requests.Session for multi-page scraping

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'})

all_items = []
url = 'https://example.com/page/1'
while url:
    resp = session.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    items = [tag.get_text(strip=True) for tag in soup.select('.item-title')]
    all_items.extend(items)
    next_link = soup.select_one('a.next')
    url = next_link['href'] if next_link else None
```
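Two details worth hardening in that pagination loop: `next_link['href']` is often site-relative rather than absolute, and back-to-back requests can trip rate limits. A sketch of one way to handle both (`next_page_url` is a name invented for this example):

```python
import time
from urllib.parse import urljoin

def next_page_url(current_url, href, delay=1.0):
    """Resolve a possibly-relative next-page href and pause between fetches."""
    if href is None:
        return None
    time.sleep(delay)  # be polite between page fetches
    # urljoin resolves '/page/2' or 'page/2' against the current URL.
    return urljoin(current_url, href)

# In the loop, replace the last line with:
#   url = next_page_url(url, next_link['href'] if next_link else None)
```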
## Best Practices

- Use `lxml` as the parser for the best balance of speed and leniency. Fall back to `html5lib` only for extremely broken HTML.
- Prefer `soup.select()` (CSS selectors) for complex queries; they are often more readable than chained `find` calls.
- Always use `get_text(strip=True)` to avoid leading/trailing whitespace in extracted text.
- Use `requests.Session` for multi-page scrapes to persist cookies and reuse connections.
- Check for `None` before accessing attributes. `soup.find('h2')` returns `None` if not found; accessing `['href']` on `None` raises an error.
- Respect rate limits. Add `time.sleep()` between requests when scraping multiple pages.

## Common Pitfalls

- **`AttributeError: 'NoneType' ...`** The most common error. Always check that `find()` returned a result before calling methods on it: `tag = soup.find('h2'); text = tag.get_text() if tag else ''`.
- **Parser inconsistency.** Different parsers produce different trees from the same HTML. Always specify the parser explicitly (`'lxml'`, not just relying on the default).
- **`class_` vs `class`.** `class` is a Python keyword, so Beautiful Soup uses the `class_` parameter: `soup.find('div', class_='name')`.
- **`.string` vs `.get_text()`.** `.string` returns `None` if a tag has multiple children. Always prefer `.get_text()` for reliable text extraction.
- **Encoding issues.** If the page uses a non-UTF-8 charset, set `response.encoding` before parsing: `response.encoding = response.apparent_encoding`.
- **Loading the entire page into memory.** For very large HTML documents, consider using `lxml.etree.iterparse` for streaming instead of loading everything into Beautiful Soup at once.
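The `.string` pitfall is easy to reproduce on a two-line example:

```python
from bs4 import BeautifulSoup

p = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser').p

# .string is None here: <p> has two children, a text node and <b>.
single = p.string

# .get_text() concatenates all nested text, so it always returns a string.
full = p.get_text()
```

The asymmetry is the trap: on `<p>Hello</p>` both calls agree, so code using `.string` often works in testing and then returns `None` the first time a page nests a `<b>` or `<span>` inside the tag.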
## Anti-Patterns

- **Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- **Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- **Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- **Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- **Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month or to you next year.