
Selenium

Browser-based web scraping and automation with Selenium WebDriver across multiple languages

## Quick Summary
You are an expert in Selenium WebDriver for browser-based web scraping, form automation, and data extraction.

## Key Points

- **Always use explicit waits** (`WebDriverWait` + `expected_conditions`) instead of `time.sleep()`. Explicit waits are faster and more reliable.
- **Use `webdriver-manager`** to handle driver binary downloads automatically instead of managing chromedriver versions manually.
- **Call `driver.quit()`** (not just `driver.close()`) to terminate the browser process. Wrap in try/finally or a context manager.
- **Disable images and CSS** via Chrome preferences to speed up scraping when visual rendering is not needed.
- **Use `execute_script` for bulk extraction** — it is often faster to run a single JavaScript call that returns all data than to make many `find_element` calls from Python.
- **Set page load timeouts** with `driver.set_page_load_timeout(30)` to avoid hanging on slow pages.
- **`StaleElementReferenceException`.** Occurs when the DOM changes after you found an element. Re-locate elements after any page navigation or dynamic update.
- **`NoSuchElementException` vs timing.** If content loads asynchronously, `find_element` may fail before the element exists. Always wait for elements first.
- **Zombie browser processes.** Forgetting `driver.quit()` leaves Chrome processes running. In production, add signal handlers and cleanup logic.
- **Headless vs headed differences.** Some sites detect headless mode. Use `--headless=new` (Chrome 112+) which is harder to detect than the old `--headless` flag.
- **Session/cookie confusion.** Each WebDriver instance starts a fresh session. To reuse cookies, save them with `driver.get_cookies()` and restore with `driver.add_cookie()`.
- **Thread safety.** A single WebDriver instance is not thread-safe. Use separate driver instances per thread or use a process pool.

## Quick Example

```bash
pip install selenium webdriver-manager
```

```bash
npm install selenium-webdriver
```

# Selenium — Web Scraping

You are an expert in Selenium WebDriver for browser-based web scraping, form automation, and data extraction.

## Core Philosophy

### Overview

Selenium WebDriver is a browser automation tool that controls real browsers (Chrome, Firefox, Edge, Safari) programmatically. It supports multiple languages (Python, Java, JavaScript, C#, Ruby) and is widely used for scraping JavaScript-heavy pages, interacting with forms, and end-to-end testing. While heavier than headless-only tools, Selenium's broad browser support and mature ecosystem make it a reliable choice.

## Setup & Configuration

Python setup:

```bash
pip install selenium webdriver-manager
```

Basic Chrome setup with automatic driver management:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless=new')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1280,800')
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)
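Disabling images (and, less reliably, CSS) is done through Chrome profile preferences rather than command-line flags. A sketch, assuming the usual preference keys (the value `2` means "block"; the stylesheet key is not honoured by every Chrome build):

```python
# Chrome profile preferences that skip image (and, best effort, stylesheet)
# downloads to speed up scraping when rendering does not matter.
prefs = {
    'profile.managed_default_content_settings.images': 2,
    # Not honoured by every Chrome build -- treat as best effort.
    'profile.managed_default_content_settings.stylesheets': 2,
}

# Apply them when building the driver, and cap page-load time:
# options.add_experimental_option('prefs', prefs)
# driver = webdriver.Chrome(options=options)
# driver.set_page_load_timeout(30)  # raises TimeoutException instead of hanging
```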

Node.js setup:

```bash
npm install selenium-webdriver
```

```javascript
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

// CommonJS has no top-level await, so drive the browser from an async function.
(async () => {
  const options = new chrome.Options().addArguments('--headless=new');
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();
  try {
    // ... use driver here
  } finally {
    await driver.quit();
  }
})();
```

## Core Patterns

### Basic page scraping (Python)

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://example.com/products')

# Wait for content to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.product-card')))

cards = driver.find_elements(By.CSS_SELECTOR, '.product-card')
products = []
for card in cards:
    products.append({
        'name': card.find_element(By.CSS_SELECTOR, '.title').text.strip(),
        'price': card.find_element(By.CSS_SELECTOR, '.price').text.strip(),
        'link': card.find_element(By.TAG_NAME, 'a').get_attribute('href'),
    })
```
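If the page re-renders between locating a card and reading it, these `.text` reads can raise `StaleElementReferenceException`. One way to cope is a small generic retry helper — the helper and its names are mine, not Selenium API — that re-runs any flaky read:

```python
def retry_on(exc_type, action, attempts=3):
    """Run action(), retrying up to `attempts` times on `exc_type`."""
    for attempt in range(attempts):
        try:
            return action()
        except exc_type:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original error

# Typical Selenium usage (the lambda re-locates the element on every try):
# from selenium.common.exceptions import StaleElementReferenceException
# name = retry_on(
#     StaleElementReferenceException,
#     lambda: driver.find_element(By.CSS_SELECTOR, '.title').text,
# )
```

The key point is that the locate call lives inside the retried action, so each attempt gets a fresh element reference.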

### Interacting with forms

```python
from selenium.webdriver.common.keys import Keys

search_box = driver.find_element(By.NAME, 'q')
search_box.clear()
search_box.send_keys('web scraping')
search_box.send_keys(Keys.RETURN)

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.results')))
```

### Handling dropdowns and selects

```python
from selenium.webdriver.support.ui import Select

dropdown = Select(driver.find_element(By.ID, 'sort-by'))
dropdown.select_by_visible_text('Price: Low to High')
```

### Scrolling and lazy-loaded content

```python
import time

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
```
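This loop runs forever on a feed that never stops growing. A bounded variant — the function name and `max_rounds` parameter are my additions — returns the final height once it stabilises or the round limit is hit:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until document height stops growing, or max_rounds is reached."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # height stable: assume the page has finished loading
        last_height = new_height
    return last_height
```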

### Working with iframes

```python
# Switch to iframe
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe#content-frame')
driver.switch_to.frame(iframe)

# Extract data from inside the iframe
data = driver.find_element(By.CSS_SELECTOR, '.inner-content').text

# Switch back to main document
driver.switch_to.default_content()
```

### Executing JavaScript

```python
# Scroll element into view
element = driver.find_element(By.CSS_SELECTOR, '.target')
driver.execute_script('arguments[0].scrollIntoView(true);', element)

# Extract data via JS
result = driver.execute_script('''
    return Array.from(document.querySelectorAll('.item')).map(el => ({
        text: el.textContent.trim(),
        href: el.querySelector('a')?.href
    }));
''')
```

### Using a context manager for cleanup

```python
from contextlib import contextmanager

@contextmanager
def get_driver():
    options = Options()
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=options)
    try:
        yield driver
    finally:
        driver.quit()

with get_driver() as driver:
    driver.get('https://example.com')
    title = driver.title
```
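Reusing a session across driver instances — saving cookies with `driver.get_cookies()` and restoring them with `driver.add_cookie()`, as noted in the Key Points — can be sketched as two small helpers (the helper names and JSON file format are mine):

```python
import json

def save_cookies(driver, path):
    """Dump the current session's cookies to a JSON file."""
    with open(path, 'w') as f:
        json.dump(driver.get_cookies(), f)

def load_cookies(driver, path):
    """Restore cookies saved by save_cookies().

    Selenium only accepts cookies for the domain currently loaded,
    so call driver.get(site_url) before calling this.
    """
    with open(path) as f:
        for cookie in json.load(f):
            driver.add_cookie(cookie)
```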

## Best Practices

- **Always use explicit waits** (`WebDriverWait` + `expected_conditions`) instead of `time.sleep()`. Explicit waits are faster and more reliable.
- **Use `webdriver-manager`** to handle driver binary downloads automatically instead of managing chromedriver versions manually.
- **Call `driver.quit()`** (not just `driver.close()`) to terminate the browser process. Wrap in try/finally or a context manager.
- **Disable images and CSS** via Chrome preferences to speed up scraping when visual rendering is not needed.
- **Use `execute_script` for bulk extraction** — it is often faster to run a single JavaScript call that returns all data than to make many `find_element` calls from Python.
- **Set page load timeouts** with `driver.set_page_load_timeout(30)` to avoid hanging on slow pages.

## Common Pitfalls

- **`StaleElementReferenceException`.** Occurs when the DOM changes after you found an element. Re-locate elements after any page navigation or dynamic update.
- **`NoSuchElementException` vs timing.** If content loads asynchronously, `find_element` may fail before the element exists. Always wait for elements first.
- **Zombie browser processes.** Forgetting `driver.quit()` leaves Chrome processes running. In production, add signal handlers and cleanup logic.
- **Headless vs headed differences.** Some sites detect headless mode. Use `--headless=new` (Chrome 112+) which is harder to detect than the old `--headless` flag.
- **Session/cookie confusion.** Each WebDriver instance starts a fresh session. To reuse cookies, save them with `driver.get_cookies()` and restore with `driver.add_cookie()`.
- **Thread safety.** A single WebDriver instance is not thread-safe. Use separate driver instances per thread or use a process pool.
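The thread-safety point implies one driver per worker. A sketch with a thread pool and a driver factory — `scrape_titles` and `make_driver` are my names; in real use `make_driver` would be something like `lambda: webdriver.Chrome(options=options)`:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_titles(urls, make_driver, workers=4):
    """Fetch page titles concurrently; each worker builds and owns its driver."""
    def scrape(url):
        driver = make_driver()  # never share one driver across threads
        try:
            driver.get(url)
            return driver.title
        finally:
            driver.quit()  # always release the browser process
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape, urls))  # results keep input order
```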

## Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
