
XML

Extensible Markup Language — a verbose but powerful structured data format widely used in enterprise systems, document formats, and legacy APIs.

Quick Summary
You are a file format specialist with deep expertise in XML, including well-formedness and validation rules, namespace handling, XPath/XSLT/XQuery processing, schema languages (XSD, RELAX NG, Schematron), and DOM/SAX/StAX parsing across Python, Java, JavaScript, and Go.

## Key Points

- Every document must have exactly one root element.
- Tags are case-sensitive — `<Book>` and `<book>` are different elements.
- Every opening tag must have a matching closing tag (or be self-closing: `<br/>`).
- Attribute values must be quoted (single or double quotes).
- Five predefined entities: `&lt;` `&gt;` `&amp;` `&quot;` `&apos;`.
- `CDATA` sections contain unescaped text: `<![CDATA[raw content]]>`.
- Namespaces prevent element name collisions: `xmlns:prefix="URI"`.
- A well-formed document follows syntax rules; a valid document also conforms to a schema.
- **DTD** (Document Type Definition): Original, limited type system.
- **XML Schema (XSD)**: W3C standard with rich type system, namespaces.
- **RELAX NG**: Simpler alternative to XSD, available in compact syntax.
- **Schematron**: Rule-based validation using XPath assertions.

## Quick Example

```python
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
for book in root.findall("book"):
    print(book.find("title").text)
```

```javascript
// Browser
const parser = new DOMParser();
const doc = parser.parseFromString(xmlString, "application/xml");
// Node.js — use fast-xml-parser or xml2js
```


XML — Extensible Markup Language

Overview

XML (Extensible Markup Language) is a W3C standard markup language designed to store and transport structured data in a human-readable and machine-readable format. Released in 1998, XML became the backbone of enterprise data exchange, SOAP web services, document formats (DOCX, SVG, XHTML), and configuration systems. While JSON has replaced XML in many web API contexts, XML remains dominant in enterprise, publishing, and document-oriented domains.

Core Philosophy

XML (Extensible Markup Language) is a meta-language for defining structured document formats. Unlike JSON or CSV, which have fixed structures, XML provides the rules for creating your own markup languages with custom element names, attributes, and validation schemas. XHTML, SVG, SOAP, RSS, Atom, DOCX (Office Open XML), and hundreds of industry-specific formats are all applications of XML.

XML's design philosophy prioritizes explicitness, validation, and extensibility over brevity. Every piece of data has a name (element or attribute), every structure is explicitly opened and closed, namespaces prevent naming collisions, and schemas (XSD, RELAX NG, DTD) enable rigorous structural validation. This verbosity, XML's most criticized characteristic, is the cost of its self-describing, validatable nature.

For new data interchange formats, JSON has largely replaced XML due to its simplicity and smaller payload size. XML remains the right choice when you need namespace support, schema validation, document-centric markup (mixed content with text and elements), XSLT transformations, or when interfacing with systems that require XML (SOAP services, enterprise integrations, regulatory submissions). Do not choose XML for simplicity, because it is explicitly not simple; choose it when its power is needed.

Technical Specifications

Syntax and Structure

XML documents consist of a prolog, elements, attributes, and text content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<library xmlns="http://example.com/library">
  <!-- A collection of books -->
  <book id="1" genre="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price currency="USD">12.99</price>
  </book>
  <book id="2" genre="science">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <chapters>
      <![CDATA[Chapter titles may contain <special> characters]]>
    </chapters>
  </book>
</library>
```

Key Rules

  • Every document must have exactly one root element.
  • Tags are case-sensitive — <Book> and <book> are different elements.
  • Every opening tag must have a matching closing tag (or be self-closing: <br/>).
  • Attribute values must be quoted (single or double quotes).
  • Five predefined entities: &lt; &gt; &amp; &quot; &apos;.
  • CDATA sections contain unescaped text: <![CDATA[raw content]]>.
  • Namespaces prevent element name collisions: xmlns:prefix="URI".
  • A well-formed document follows syntax rules; a valid document also conforms to a schema.
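
The rules above can be checked mechanically, because a conforming parser rejects any document that breaks them. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of any API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if `text` parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each sample exercises one of the rules listed above.
samples = {
    "single root": "<a><b/></a>",
    "two roots": "<a/><b/>",
    "case mismatch": "<Book></book>",
    "unquoted attribute": "<a id=1/>",
    "unescaped ampersand": "<a>Tom & Jerry</a>",
    "escaped ampersand": "<a>Tom &amp; Jerry</a>",
    "CDATA section": "<a><![CDATA[1 < 2 & 3]]></a>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")

# escape() replaces &, <, and > so arbitrary text can be embedded as element content.
escaped = escape("1 < 2 & 3")
```

This only tests well-formedness. Checking validity additionally requires a schema and a validating tool such as `xmllint` or `lxml`.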

Schema Languages

  • DTD (Document Type Definition): Original, limited type system.
  • XML Schema (XSD): W3C standard with rich type system, namespaces.
  • RELAX NG: Simpler alternative to XSD, available in compact syntax.
  • Schematron: Rule-based validation using XPath assertions.

How to Work With It

Parsing

Two primary parsing models:

  • DOM: Loads entire document into memory as a tree. Good for small-to-medium documents.
  • SAX/StAX: Event-driven or streaming. Good for large documents.
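
```python
import xml.etree.ElementTree as ET

# Assumes data.xml is the sample document shown above.
tree = ET.parse("data.xml")
root = tree.getroot()
# The sample declares a default namespace, so element names must be qualified;
# "lib" is a local alias bound to the namespace URI.
ns = {"lib": "http://example.com/library"}
for book in root.findall("lib:book", ns):
    print(book.find("lib:title", ns).text)
```

```javascript
// Browser
const parser = new DOMParser();
const doc = parser.parseFromString(xmlString, "application/xml");
// Node.js: use fast-xml-parser or xml2js
```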

Creating

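```python
import xml.etree.ElementTree as ET

root = ET.Element("library")
book = ET.SubElement(root, "book", id="1")
ET.SubElement(book, "title").text = "Example"
ET.indent(root)  # pretty-print in place (Python 3.9+)
# Use an explicit byte encoding so the XML declaration matches the file contents
ET.ElementTree(root).write("out.xml", encoding="utf-8", xml_declaration=True)
```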

Querying

  • XPath: Path-based query language — //book[@genre='fiction']/title
  • XQuery: Full query language for XML databases.
  • XSLT: Transformation language to convert XML to other formats.
  • CLI: xmllint --xpath "//title" data.xml or xmlstarlet sel -t -v "//title" data.xml
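
As a rough illustration of XPath-style querying, the sketch below uses the limited XPath subset that Python's standard-library `ElementTree` supports. The inline document is an illustrative assumption (it omits the namespace to keep paths short). Full XPath 1.0 features such as numeric comparisons need a library like `lxml`.

```python
import xml.etree.ElementTree as ET

# Illustrative document (no namespace, to keep the paths short).
doc = ET.fromstring("""
<library>
  <book genre="fiction"><title>The Great Gatsby</title><year>1925</year></book>
  <book genre="science"><title>A Brief History of Time</title><year>1988</year></book>
</library>
""")

# Attribute predicate: titles of all fiction books.
fiction_titles = [t.text for t in doc.findall(".//book[@genre='fiction']/title")]

# Child-text predicate: books whose <year> text is exactly "1988".
books_1988 = doc.findall(".//book[year='1988']")

# findtext() returns the text of the first match, or the default if none.
first_title = doc.findtext("book/title", default="")

print(fiction_titles, len(books_1988), first_title)
```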

Validating

```bash
xmllint --schema schema.xsd data.xml --noout
xmlstarlet val -e -s schema.xsd data.xml
```

Common Use Cases

  • Enterprise integration: SOAP services, EDI, HL7 (healthcare), FpML (finance).
  • Document formats: OOXML (DOCX/XLSX), ODF, EPUB, DocBook, DITA.
  • Graphics: SVG vector graphics, XAML UI definitions.
  • Configuration: Maven pom.xml, Android layouts, Spring, .csproj files.
  • Data feeds: RSS, Atom, sitemaps.
  • Markup: XHTML, MathML, MusicXML.

Pros & Cons

Pros

  • Self-describing with rich metadata through attributes and namespaces.
  • Mature schema validation (XSD, RELAX NG, Schematron).
  • Powerful query/transformation (XPath, XSLT, XQuery).
  • Supports mixed content (text interleaved with elements) — ideal for documents.
  • Comments are part of the spec.
  • Well-suited for complex, hierarchical document structures.
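
To make the mixed-content point concrete, here is a small sketch of how Python's `ElementTree` represents text interleaved with elements. The sample paragraph is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<p>XML handles <em>mixed</em> content <b>well</b>.</p>")

# Text before the first child is stored on the parent's .text;
# text that follows a child's closing tag is stored on that child's .tail.
em, b = list(para)
pieces = [para.text, em.text, em.tail, b.text, b.tail]

# itertext() walks the tree in document order and yields every text chunk.
plain = "".join(para.itertext())

print(pieces)
print(plain)
```

Formats built around key-value maps have no direct equivalent of `.tail`, which is one reason document-oriented markup tends to stay in XML.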

Cons

  • Extremely verbose — high tag overhead relative to payload data.
  • Parsing is typically slower and more memory-intensive than parsing equivalent JSON.
  • Namespace handling is notoriously complex and error-prone.
  • Multiple competing schema languages create confusion.
  • No native array type — lists require repeated elements or wrapper elements.
  • Overkill for simple key-value configuration.

Compatibility

| Language | Built-in | Popular Library |
|---|---|---|
| Python | Yes | lxml, xmltodict |
| JavaScript | Browser DOM only | fast-xml-parser, xml2js |
| Java | Yes | JAXB, Dom4j, Jackson XML |
| C# | Yes | System.Xml.Linq |
| Go | Yes | encoding/xml |
| Rust | No | quick-xml, roxmltree |
| PHP | Yes | SimpleXML, DOMDocument |

MIME type: application/xml or text/xml. File extension: .xml.

Practical Usage

Parse and transform XML with Python lxml and XPath

```python
from lxml import etree

# Parse and query with XPath
tree = etree.parse("catalog.xml")
ns = {"ns": "http://example.com/catalog"}

# Find all products over $50
expensive = tree.xpath("//ns:product[ns:price > 50]", namespaces=ns)
for product in expensive:
    name = product.find("ns:name", ns).text
    price = product.find("ns:price", ns).text
    print(f"{name}: ${price}")

# Transform with XSLT
xslt = etree.parse("transform.xsl")
transform = etree.XSLT(xslt)
result = transform(tree)
print(str(result))
```

Validate XML against an XSD schema from the command line

```bash
# Validate with xmllint
xmllint --schema schema.xsd data.xml --noout
# Output: data.xml validates

# Pretty-print and reformat XML
xmllint --format messy.xml > formatted.xml

# Extract values with xmlstarlet
xmlstarlet sel -t -m "//book" -v "title" -n library.xml

# Edit XML in-place (add an attribute); -L writes the result back to the file
xmlstarlet ed -L -i "//book[1]" -t attr -n "status" -v "available" library.xml
```

Stream-process large XML files with SAX in Python

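```python
import xml.sax

class ProductHandler(xml.sax.ContentHandler):
    """Count <product> elements and print the text of each <name>."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.in_name = False
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "product":
            self.count += 1
        elif name == "name":
            self.in_name = True
            self.buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks, so buffer them.
        if self.in_name:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "name":
            print(f"Product: {''.join(self.buffer).strip()}")
            self.in_name = False

# Process a multi-GB XML file without building a tree in memory
handler = ProductHandler()
xml.sax.parse("huge_catalog.xml", handler)
print(f"Total products: {handler.count}")
```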

Anti-Patterns

Using DOM parsing for very large XML files (hundreds of MB or GB). DOM loads the entire document tree into memory, easily consuming 5-10x the file size in RAM. Use SAX (event-driven) or StAX (pull-based streaming) for large files; switch to DOM only for small documents that need random access.
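
One middle ground between DOM and SAX is incremental parsing with `ElementTree.iterparse`, which builds each element, lets you process it, and then lets you discard it. The sketch below is illustrative only: the function name `stream_product_names` and the `<catalog>`/`<product>`/`<name>` structure are assumptions, and an in-memory buffer stands in for a large file.

```python
import io
import xml.etree.ElementTree as ET

def stream_product_names(source):
    """Collect <name> text from each <product>, freeing elements as we go.

    `source` may be a file path or a binary file-like object.
    """
    names = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            names.append(elem.findtext("name", default=""))
            elem.clear()  # drop the children and text of the processed element
    return names

# Small in-memory stand-in for a large catalog file.
sample = io.BytesIO(
    b"<catalog>"
    b"<product><name>Widget</name></product>"
    b"<product><name>Gadget</name></product>"
    b"</catalog>"
)
names = stream_product_names(sample)
print(names)
```

Cleared elements remain attached to the root as empty shells, so memory still grows slowly with the number of elements. For very large inputs, also removing processed children from the root is a common refinement.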

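Leaving external entity resolution enabled when parsing untrusted XML. XML External Entity (XXE) attacks can read local files, perform SSRF, or cause denial of service. Disable DTD and external entity processing in your parser: in Python use defusedxml; in Java set XMLConstants.FEATURE_SECURE_PROCESSING and, where the parser supports it, the http://apache.org/xml/features/disallow-doctype-decl feature; in PHP before 8.0 call libxml_disable_entity_loader(true) (the function is deprecated as of PHP 8.0, where external entity loading is disabled by default).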
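
Where defusedxml is not available, one conservative approach is to refuse any document that declares a DOCTYPE at all, since entity definitions live there. The sketch below uses only the standard library. `DoctypeForbidden` and `parse_untrusted` are hypothetical names, and this is an illustration of the idea rather than a substitute for a hardened parser.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration."""

def _on_doctype(name, system_id, public_id, has_internal_subset):
    # Called by expat as soon as it sees <!DOCTYPE ...>.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

def parse_untrusted(data):
    """Parse XML bytes, refusing any document that declares a DOCTYPE."""
    checker = xml.parsers.expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _on_doctype
    checker.Parse(data, True)  # raises DoctypeForbidden or ExpatError
    return ET.fromstring(data)

xxe_payload = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE data [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
    b"<data>&xxe;</data>"
)
try:
    parse_untrusted(xxe_payload)
    blocked = False
except DoctypeForbidden:
    blocked = True

safe_root = parse_untrusted(b"<data>ok</data>")
print(blocked, safe_root.text)
```

The document is parsed twice here for clarity. That is acceptable for small inputs but wasteful for large ones.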

Hardcoding namespace prefixes instead of matching by namespace URI. Namespace prefixes are arbitrary and can change between documents. Code that searches for ns1:title will break when the same namespace is declared with a different prefix. Always match elements by their namespace URI, not by prefix.
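
A short sketch of prefix-independent matching with Python's `ElementTree`: the two documents below use the same namespace URI under different prefixes, and both are matched by URI. The `lib` alias and the helper `titles` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

LIB_NS = "http://example.com/library"

# Same namespace, declared once with a prefix and once as the default.
doc_a = ET.fromstring(
    f'<ns1:library xmlns:ns1="{LIB_NS}"><ns1:title>A</ns1:title></ns1:library>'
)
doc_b = ET.fromstring(
    f'<library xmlns="{LIB_NS}"><title>B</title></library>'
)

def titles(root):
    # Clark notation {uri}local matches on the namespace URI, never the prefix.
    return [t.text for t in root.iter(f"{{{LIB_NS}}}title")]

# Equivalent: a prefix map that is local to the query. "lib" is our own alias
# and does not need to match any prefix used inside the document.
ns = {"lib": LIB_NS}
titles_via_map = [t.text for t in doc_b.findall("lib:title", ns)]

print(titles(doc_a), titles(doc_b), titles_via_map)
```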

Using XML for simple key-value configuration when YAML, TOML, or JSON would suffice. XML's verbosity (opening tags, closing tags, attributes) makes simple configuration files 3-5x larger and harder to read than equivalent JSON or TOML. Reserve XML for documents with mixed content, complex schemas, or enterprise integration requirements.

Generating XML by string concatenation instead of using a proper serialization library. Manual string building leads to encoding errors, missing escapes for special characters (<, >, &), and malformed XML. Always use a library like lxml.etree, xml.etree.ElementTree, or equivalent to generate well-formed output.
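
The difference is easy to demonstrate. In this sketch (the sample title and attribute value are made up for illustration), naive concatenation produces a document that cannot be parsed, while `ElementTree` escapes the same text automatically and round-trips it intact.

```python
import xml.etree.ElementTree as ET

title = 'Q&A: <XML> "basics"'

# Naive concatenation: the raw & and < make the result malformed.
naive = "<book><title>" + title + "</title></book>"
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Library serialization escapes special characters in text and attributes.
book = ET.Element("book", lang='en "US"')
ET.SubElement(book, "title").text = title
serialized = ET.tostring(book, encoding="unicode")

# Parsing the serialized form recovers the original values unchanged.
round_trip = ET.fromstring(serialized)
print(naive_ok, serialized)
```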

Related Formats

  • JSON: Lighter-weight alternative for data interchange.
  • HTML: Non-strict markup language derived from SGML (XML's parent).
  • XHTML: HTML reformulated as an XML application.
  • SVG: XML-based vector graphics format.
  • YAML: Human-friendly alternative for configuration.
  • Protocol Buffers: Binary alternative for structured data exchange.
