# Selenium

Browser-based web scraping and automation with Selenium WebDriver across multiple languages.
You are an expert in Selenium WebDriver for browser-based web scraping, form automation, and data extraction.
## Overview
Selenium WebDriver is a browser automation tool that controls real browsers (Chrome, Firefox, Edge, Safari) programmatically. It supports multiple languages (Python, Java, JavaScript, C#, Ruby) and is widely used for scraping JavaScript-heavy pages, interacting with forms, and end-to-end testing. While heavier than headless-only tools, Selenium's broad browser support and mature ecosystem make it a reliable choice.
## Setup & Configuration
Python setup:

```bash
pip install selenium webdriver-manager
```

Basic Chrome setup with automatic driver management:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless=new')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1280,800')
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)
```
Node.js setup:

```bash
npm install selenium-webdriver
```

```javascript
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

const options = new chrome.Options().addArguments('--headless=new');

// Note: top-level await requires an ES module; in CommonJS,
// wrap this in an async function.
const driver = await new Builder()
  .forBrowser('chrome')
  .setChromeOptions(options)
  .build();
```
## Core Patterns
### Basic page scraping (Python)

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://example.com/products')

# Wait for content to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.product-card')))

cards = driver.find_elements(By.CSS_SELECTOR, '.product-card')
products = []
for card in cards:
    products.append({
        'name': card.find_element(By.CSS_SELECTOR, '.title').text.strip(),
        'price': card.find_element(By.CSS_SELECTOR, '.price').text.strip(),
        'link': card.find_element(By.TAG_NAME, 'a').get_attribute('href'),
    })
```
### Interacting with forms

```python
from selenium.webdriver.common.keys import Keys

search_box = driver.find_element(By.NAME, 'q')
search_box.clear()
search_box.send_keys('web scraping')
search_box.send_keys(Keys.RETURN)

wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.results')))
```
### Handling dropdowns and selects

```python
from selenium.webdriver.support.ui import Select

dropdown = Select(driver.find_element(By.ID, 'sort-by'))
dropdown.select_by_visible_text('Price: Low to High')
```
### Scrolling and lazy-loaded content

```python
import time

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # allow lazy-loaded content to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
```
Working with iframes
# Switch to iframe
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe#content-frame')
driver.switch_to.frame(iframe)
# Extract data from inside the iframe
data = driver.find_element(By.CSS_SELECTOR, '.inner-content').text
# Switch back to main document
driver.switch_to.default_content()
### Executing JavaScript

```python
# Scroll an element into view
element = driver.find_element(By.CSS_SELECTOR, '.target')
driver.execute_script('arguments[0].scrollIntoView(true);', element)

# Extract data via JS in a single round trip
result = driver.execute_script('''
    return Array.from(document.querySelectorAll('.item')).map(el => ({
        text: el.textContent.trim(),
        href: el.querySelector('a')?.href
    }));
''')
```
### Using a context manager for cleanup

```python
from contextlib import contextmanager

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

@contextmanager
def get_driver():
    options = Options()
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=options)
    try:
        yield driver
    finally:
        driver.quit()  # always terminates the browser process

with get_driver() as driver:
    driver.get('https://example.com')
    title = driver.title
```
## Best Practices
- **Always use explicit waits** (`WebDriverWait` + `expected_conditions`) instead of `time.sleep()`. Explicit waits are faster and more reliable.
- **Use `webdriver-manager`** to handle driver binary downloads automatically instead of managing chromedriver versions manually.
- **Call `driver.quit()`** (not just `driver.close()`) to terminate the browser process. Wrap it in try/finally or a context manager.
- **Disable images and CSS** via Chrome preferences to speed up scraping when visual rendering is not needed.
- **Use `execute_script` for bulk extraction** — it is often faster to run a single JavaScript call that returns all data than to make many `find_element` calls from Python.
- **Set page load timeouts** with `driver.set_page_load_timeout(30)` to avoid hanging on slow pages.
Common Pitfalls
StaleElementReferenceException. Occurs when the DOM changes after you found an element. Re-locate elements after any page navigation or dynamic update.NoSuchElementExceptionvs timing. If content loads asynchronously,find_elementmay fail before the element exists. Always wait for elements first.- Zombie browser processes. Forgetting
driver.quit()leaves Chrome processes running. In production, add signal handlers and cleanup logic. - Headless vs headed differences. Some sites detect headless mode. Use
--headless=new(Chrome 112+) which is harder to detect than the old--headlessflag. - Session/cookie confusion. Each WebDriver instance starts a fresh session. To reuse cookies, save them with
driver.get_cookies()and restore withdriver.add_cookie(). - Thread safety. A single WebDriver instance is not thread-safe. Use separate driver instances per thread or use a process pool.
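The cookie-reuse pattern mentioned above can be sketched with two small helpers. The file path and helper names here are hypothetical; note that `add_cookie` requires the browser to already be on a page in the cookie's domain, or it raises `InvalidCookieDomainException`.

```python
import json
from pathlib import Path

COOKIE_FILE = Path('cookies.json')  # hypothetical path for this sketch

def save_cookies(driver, path=COOKIE_FILE):
    """Write the current session's cookies to disk as JSON."""
    path.write_text(json.dumps(driver.get_cookies()))

def load_cookies(driver, path=COOKIE_FILE):
    """Restore saved cookies into a live session."""
    for cookie in json.loads(path.read_text()):
        # Chrome rejects non-integer 'expiry' values after a JSON round trip
        if 'expiry' in cookie:
            cookie['expiry'] = int(cookie['expiry'])
        driver.add_cookie(cookie)

# Usage (assumes a live driver):
# driver.get('https://example.com')  # navigate first so the domain matches
# load_cookies(driver)
# driver.refresh()                   # reload with the restored session
```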
## Anti-Patterns

- **Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- **Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- **Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- **Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- **Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month, or to you next year.
Install this skill directly: `skilldb add web-scraping-skills`
## Related Skills

- **Anti Detection**: Ethical techniques for handling CAPTCHAs, rate limiting, and bot detection while scraping responsibly
- **Beautifulsoup**: HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
- **Cheerio**: Fast server-side HTML parsing and data extraction with Cheerio using jQuery-like syntax
- **Data Pipeline**: Patterns for building robust scraping data pipelines with validation, deduplication, storage, and monitoring
- **Playwright Scraping**: Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
- **Puppeteer**: Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages