XML
Extensible Markup Language — a verbose but powerful structured data format widely used in enterprise systems, document formats, and legacy APIs.
You are a file format specialist with deep expertise in XML, including well-formedness and validation rules, namespace handling, XPath/XSLT/XQuery processing, schema languages (XSD, RELAX NG, Schematron), and DOM/SAX/StAX parsing across Python, Java, JavaScript, and Go.
## Key Points
- Every document must have exactly one root element.
- Tags are case-sensitive — `<Book>` and `<book>` are different elements.
- Every opening tag must have a matching closing tag (or be self-closing: `<br/>`).
- Attribute values must be quoted (single or double quotes).
- Five predefined entities: `&lt;` `&gt;` `&amp;` `&quot;` `&apos;`.
- `CDATA` sections contain unescaped text: `<![CDATA[raw content]]>`.
- Namespaces prevent element name collisions: `xmlns:prefix="URI"`.
- A well-formed document follows syntax rules; a valid document also conforms to a schema.
- **DTD** (Document Type Definition): Original, limited type system.
- **XML Schema (XSD)**: W3C standard with rich type system, namespaces.
- **RELAX NG**: Simpler alternative to XSD, available in compact syntax.
- **Schematron**: Rule-based validation using XPath assertions.
## Quick Example
```python
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
for book in root.findall("book"):
    print(book.find("title").text)
```
```javascript
// Browser
const parser = new DOMParser();
const doc = parser.parseFromString(xmlString, "application/xml");
// Node.js — use fast-xml-parser or xml2js
```
XML — Extensible Markup Language
## Overview
XML (Extensible Markup Language) is a W3C standard markup language designed to store and transport structured data in a human-readable and machine-readable format. Released in 1998, XML became the backbone of enterprise data exchange, SOAP web services, document formats (DOCX, SVG, XHTML), and configuration systems. While JSON has replaced XML in many web API contexts, XML remains dominant in enterprise, publishing, and document-oriented domains.
## Core Philosophy
XML (Extensible Markup Language) is a meta-language for defining structured document formats. Unlike JSON or CSV, which have fixed structures, XML provides the rules for creating your own markup languages with custom element names, attributes, and validation schemas. XHTML, SVG, SOAP, RSS, Atom, DOCX (Office Open XML), and hundreds of industry-specific formats are all applications of XML.
XML's design philosophy prioritizes explicitness, validation, and extensibility over brevity. Every piece of data has a name (element or attribute), every structure is explicitly opened and closed, namespaces prevent naming collisions, and schemas (XSD, RELAX NG, DTD) enable rigorous structural validation. This verbosity, XML's most criticized characteristic, is the cost of its self-describing, validatable nature.
For new data interchange formats, JSON has largely replaced XML due to its simplicity and smaller payload size. XML remains the right choice when you need namespace support, schema validation, document-centric markup (mixed content with text and elements), XSLT transformations, or when interfacing with systems that require XML (SOAP services, enterprise integrations, regulatory submissions). Do not choose XML for simplicity — it is explicitly not simple — choose it when its power is needed.
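As a rough illustration of the mixed content mentioned above, the sketch below uses Python's standard `xml.etree.ElementTree`. The `<p>` fragment is an invented example, not part of any real format. ElementTree keeps the text before the first child in `.text` and the text that follows a child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: text interleaved with child elements (mixed content).
fragment = "<p>XML handles <em>mixed</em> content with <code>inline</code> markup.</p>"
p = ET.fromstring(fragment)

lead_text = p.text  # text before the first child element
# For each child: its tag, its own text, and the text that follows it.
pieces = [(child.tag, child.text, child.tail) for child in p]
plain_text = "".join(p.itertext())  # all character data, markup stripped

print(lead_text)
print(pieces)
print(plain_text)
```

Formats with a fixed key/value shape have no direct equivalent of the `.tail` text, which is one reason document-centric data tends to stay in XML.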
## Technical Specifications
### Syntax and Structure
XML documents consist of a prolog, elements, attributes, and text content:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<library xmlns="http://example.com/library">
  <!-- A collection of books -->
  <book id="1" genre="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
    <price currency="USD">12.99</price>
  </book>
  <book id="2" genre="science">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
    <chapters>
      <![CDATA[Chapter titles may contain <special> characters]]>
    </chapters>
  </book>
</library>
```
### Key Rules
- Every document must have exactly one root element.
- Tags are case-sensitive: `<Book>` and `<book>` are different elements.
- Every opening tag must have a matching closing tag (or be self-closing: `<br/>`).
- Attribute values must be quoted (single or double quotes).
- Five predefined entities: `&lt;` `&gt;` `&amp;` `&quot;` `&apos;`.
- `CDATA` sections contain unescaped text: `<![CDATA[raw content]]>`.
- Namespaces prevent element name collisions: `xmlns:prefix="URI"`.
- A well-formed document follows syntax rules; a valid document also conforms to a schema.
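Several of the rules above can be checked by handing small inputs to a parser. The following is a minimal sketch using only the Python standard library. The helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<book><title>Dune</title></book>")
case_mismatch = is_well_formed("<Book></book>")  # tags are case-sensitive
two_roots = is_well_formed("<a/><b/>")           # exactly one root element is allowed
unquoted = is_well_formed("<book id=1/>")        # attribute values must be quoted

# Reserved characters must be written as entities inside text content.
escaped = escape("Tom & Jerry <3")

print(ok, case_mismatch, two_roots, unquoted)
print(escaped)
```

This only tests well-formedness. Checking validity against a schema needs a validating tool such as `xmllint` or a library with schema support.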
### Schema Languages
- **DTD** (Document Type Definition): Original, limited type system.
- **XML Schema (XSD)**: W3C standard with rich type system, namespaces.
- **RELAX NG**: Simpler alternative to XSD, available in compact syntax.
- **Schematron**: Rule-based validation using XPath assertions.
## How to Work With It
### Parsing
Two primary parsing models:
- DOM: Loads entire document into memory as a tree. Good for small-to-medium documents.
- SAX/StAX: Event-driven or streaming. Good for large documents.
```python
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
for book in root.findall("book"):
    print(book.find("title").text)
```
```javascript
// Browser
const parser = new DOMParser();
const doc = parser.parseFromString(xmlString, "application/xml");
// Node.js — use fast-xml-parser or xml2js
```
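Between the two parsing models, Python's standard library also offers `ET.iterparse`, which builds elements incrementally so that each record can be handled and then discarded. The sketch below is one possible way to use it. The function name `iter_titles` and the in-memory sample are assumptions for illustration, and a real large input would be a file path instead.

```python
import io
import xml.etree.ElementTree as ET

def iter_titles(source):
    """Yield the <title> text of each <book>, emptying each book once handled."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            elem.clear()  # drop the book's children and text to limit memory growth

# Hypothetical in-memory sample standing in for a large file on disk.
sample = io.BytesIO(
    b"<library>"
    b"<book><title>The Great Gatsby</title></book>"
    b"<book><title>A Brief History of Time</title></book>"
    b"</library>"
)
titles = list(iter_titles(sample))
print(titles)
```

The emptied `<book>` elements stay attached to the root, so memory still grows slowly with the number of records. For most files that is acceptable, but very large inputs may need additional cleanup or a SAX handler.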
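### Creating
```python
import xml.etree.ElementTree as ET

root = ET.Element("library")
book = ET.SubElement(root, "book", id="1")
ET.SubElement(book, "title").text = "Example"
ET.indent(root)  # pretty-print in place (Python 3.9+)
# Write UTF-8 bytes with a matching XML declaration
ET.ElementTree(root).write("out.xml", encoding="utf-8", xml_declaration=True)
```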
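When the output has to live in a namespace, as in the sample library document, ElementTree expects element names in `{uri}local` form. The sketch below shows one way to do that. The prefix `lib` is an arbitrary choice made here, not something the format requires.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/library"  # namespace URI from the sample document

# Ask the serializer to use the (arbitrarily chosen) prefix "lib" for this URI.
ET.register_namespace("lib", NS)

root = ET.Element(f"{{{NS}}}library")
book = ET.SubElement(root, f"{{{NS}}}book", id="1")
ET.SubElement(book, f"{{{NS}}}title").text = "Example"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Without the `register_namespace` call the serializer still produces correct output, but with generated prefixes such as `ns0`.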
### Querying
- XPath: Path-based query language, e.g. `//book[@genre='fiction']/title`.
- XQuery: Full query language for XML databases.
- XSLT: Transformation language to convert XML to other formats.
- CLI: `xmllint --xpath "//title" data.xml` or `xmlstarlet sel -t -v "//title" data.xml`.
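For simple queries, Python's built-in ElementTree supports a limited subset of XPath. The sketch below runs the fiction-title query from the list above against an invented two-book document. Note the `.//` prefix: ElementTree paths are evaluated relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document without namespaces, to keep the paths short.
doc = ET.fromstring("""
<library>
  <book id="1" genre="fiction"><title>The Great Gatsby</title></book>
  <book id="2" genre="science"><title>A Brief History of Time</title></book>
</library>
""")

# Attribute predicates are part of the supported subset.
fiction_titles = [t.text for t in doc.findall(".//book[@genre='fiction']/title")]
book_two = doc.find("./book[@id='2']")

print(fiction_titles)
print(book_two.findtext("title"))
```

Expressions that use functions, arithmetic, or comparisons such as `price > 50` fall outside this subset and need a full XPath engine, for example the one in lxml.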
### Validating
```bash
xmllint --schema schema.xsd data.xml --noout
xmlstarlet val -e -s schema.xsd data.xml
```
## Common Use Cases
- Enterprise integration: SOAP services, EDI, HL7 (healthcare), FpML (finance).
- Document formats: OOXML (DOCX/XLSX), ODF, EPUB, DocBook, DITA.
- Graphics: SVG vector graphics, XAML UI definitions.
- Configuration: Maven `pom.xml`, Android layouts, Spring, `.csproj` files.
- Data feeds: RSS, Atom, sitemaps.
- Markup: XHTML, MathML, MusicXML.
## Pros & Cons
### Pros
- Self-describing with rich metadata through attributes and namespaces.
- Mature schema validation (XSD, RELAX NG, Schematron).
- Powerful query/transformation (XPath, XSLT, XQuery).
- Supports mixed content (text interleaved with elements) — ideal for documents.
- Comments are part of the spec.
- Well-suited for complex, hierarchical document structures.
### Cons
- Extremely verbose: high tag overhead relative to payload data.
- Parsing is typically slower and more memory-intensive than parsing equivalent JSON.
- Namespace handling is notoriously complex and error-prone.
- Multiple competing schema languages create confusion.
- No native array type — lists require repeated elements or wrapper elements.
- Overkill for simple key-value configuration.
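The point about arrays can be made concrete with a short sketch. Since XML has no list type, a consumer has to decide that repeated elements form a list. Here the `<tags>` wrapper and `<tag>` items are invented names, and `findall` is used so that a single item still comes back as a one-element list.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: a list expressed as repeated <tag> elements in a wrapper.
many = ET.fromstring(
    "<book><title>Dune</title>"
    "<tags><tag>sci-fi</tag><tag>classic</tag></tags></book>"
)
single = ET.fromstring("<book><tags><tag>only</tag></tags></book>")

# findall always returns a list, regardless of how many items are present.
many_tags = [t.text for t in many.findall("./tags/tag")]
single_tags = [t.text for t in single.findall("./tags/tag")]

print(many_tags)
print(single_tags)
```

Generic XML-to-dictionary converters often cannot tell a one-item list from a scalar, so being explicit about which elements repeat avoids surprises.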
## Compatibility
| Language | Built-in | Popular Library |
|---|---|---|
| Python | Yes | lxml, xmltodict |
| JavaScript | Browser DOM only | fast-xml-parser, xml2js |
| Java | Yes | JAXB, Dom4j, Jackson XML |
| C# | Yes | System.Xml.Linq |
| Go | Yes | encoding/xml |
| Rust | No | quick-xml, roxmltree |
| PHP | Yes | SimpleXML, DOMDocument |
MIME type: application/xml or text/xml. File extension: .xml.
## Practical Usage
### Parse and transform XML with Python lxml and XPath
```python
from lxml import etree

# Parse and query with XPath
tree = etree.parse("catalog.xml")
ns = {"ns": "http://example.com/catalog"}

# Find all products over $50
expensive = tree.xpath("//ns:product[ns:price > 50]", namespaces=ns)
for product in expensive:
    name = product.find("ns:name", ns).text
    price = product.find("ns:price", ns).text
    print(f"{name}: ${price}")

# Transform with XSLT
xslt = etree.parse("transform.xsl")
transform = etree.XSLT(xslt)
result = transform(tree)
print(str(result))
```
### Validate XML against an XSD schema from the command line
```bash
# Validate with xmllint
xmllint --schema schema.xsd data.xml --noout
# Output: data.xml validates
# Pretty-print and reformat XML
xmllint --format messy.xml > formatted.xml
# Extract values with xmlstarlet
xmlstarlet sel -t -m "//book" -v "title" -n library.xml
# Edit XML in-place (add an attribute); -L writes the result back to the file
xmlstarlet ed -L -i "//book[1]" -t attr -n "status" -v "available" library.xml
```
### Stream-process large XML files with SAX in Python
```python
import xml.sax

class ProductHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.buffer = []
        self.count = 0

    def startElement(self, name, attrs):
        if name == "product":
            self.count += 1
        elif name == "name":
            self.in_name = True
            self.buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks
        if self.in_name:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "name":
            print(f"Product: {''.join(self.buffer).strip()}")
            self.in_name = False

# Process a multi-GB XML file with roughly constant memory usage
handler = ProductHandler()
xml.sax.parse("huge_catalog.xml", handler)
print(f"Total products: {handler.count}")
```
## Anti-Patterns
Using DOM parsing for very large XML files (hundreds of MB or GB). DOM loads the entire document tree into memory, easily consuming 5-10x the file size in RAM. Use SAX (event-driven) or StAX (pull-based streaming) for large files; switch to DOM only for small documents that need random access.
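Parsing untrusted XML without disabling external entity resolution. XML External Entity (XXE) attacks can read local files, perform SSRF, or cause denial of service. Disable DTD and external entity processing whenever the input is not fully trusted: in Python use defusedxml; in Java set XMLConstants.FEATURE_SECURE_PROCESSING and, where the parser supports it, also disallow DOCTYPE declarations; in PHP versions before 8.0 call libxml_disable_entity_loader(true) (the function is deprecated as of PHP 8.0, where libxml 2.9.0 or later already leaves external entity loading off by default).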
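Where adding a dependency such as defusedxml is not an option, one conservative stdlib-only approach is to refuse any document that declares a DOCTYPE, since entity definitions live there. The following is a sketch of that idea, not a replacement for a hardened parser. `parse_untrusted` and `DoctypeForbidden` are names invented for this example.

```python
from xml.parsers import expat
import xml.etree.ElementTree as ET

class DoctypeForbidden(ValueError):
    """Raised when the input declares a DOCTYPE (hypothetical helper exception)."""

def parse_untrusted(data: bytes) -> ET.Element:
    """Reject any document containing a DOCTYPE, then parse it with ElementTree."""
    def forbid_doctype(name, sysid, pubid, has_internal_subset):
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = forbid_doctype
    checker.Parse(data, True)  # raises as soon as a DOCTYPE declaration is seen
    return ET.fromstring(data)

safe = parse_untrusted(b"<a><b>ok</b></a>")
print(safe.findtext("b"))
```

The input is parsed twice, which keeps the sketch short at the cost of some speed. Documents that legitimately need a DTD cannot be handled this way.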
Hardcoding namespace prefixes instead of matching by namespace URI. Namespace prefixes are arbitrary and can change between documents. Code that searches for ns1:title will break when the same namespace is declared with a different prefix. Always match elements by their namespace URI, not by prefix.
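A small sketch of matching by URI, using the standard library: the two documents below are invented and declare the same namespace with different prefixes (one uses a default namespace), yet a single lookup handles both. The prefix `c` exists only in the query mapping.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/catalog"  # the URI identifies the namespace, not the prefix

# Same namespace, two different ways of declaring it.
doc_a = ET.fromstring(f'<ns1:catalog xmlns:ns1="{NS}"><ns1:title>A</ns1:title></ns1:catalog>')
doc_b = ET.fromstring(f'<catalog xmlns="{NS}"><title>B</title></catalog>')

def title_of(root):
    # "c" is a local alias for the URI and need not match the document's prefix.
    return root.findtext("c:title", namespaces={"c": NS})

titles = [title_of(doc_a), title_of(doc_b)]
print(titles)
```

The same lookup can be written without any mapping as `root.findtext(f"{{{NS}}}title")`, which makes the dependence on the URI explicit.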
Using XML for simple key-value configuration when YAML, TOML, or JSON would suffice. XML's verbosity (opening tags, closing tags, attributes) makes simple configuration files 3-5x larger and harder to read than equivalent JSON or TOML. Reserve XML for documents with mixed content, complex schemas, or enterprise integration requirements.
Generating XML by string concatenation instead of using a proper serialization library. Manual string building leads to encoding errors, missing escapes for special characters (<, >, &), and malformed XML. Always use a library like lxml.etree, xml.etree.ElementTree, or equivalent to generate well-formed output.
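The difference shows up as soon as a value contains reserved characters. In this sketch the title string is an invented example. Concatenation produces text that a parser rejects, while `xml.etree.ElementTree` escapes both element text and attribute values during serialization.

```python
import xml.etree.ElementTree as ET

title = 'Tom & Jerry: "Cat <vs> Mouse"'  # hypothetical value with reserved characters

# Naive concatenation: the bare "&" and "<" make the result not well-formed.
naive = "<book><title>" + title + "</title></book>"
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# A serializer escapes text and attribute values automatically.
book = ET.Element("book", attrib={"label": title})
ET.SubElement(book, "title").text = title
safe = ET.tostring(book, encoding="unicode")
round_trip = ET.fromstring(safe)

print(naive_ok)
print(safe)
```

Parsing the serialized output returns the original string unchanged, which is a quick way to confirm that escaping was applied correctly.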
## Related Formats
- JSON: Lighter-weight alternative for data interchange.
- HTML: Non-strict markup language derived from SGML (XML's parent).
- XHTML: HTML reformulated as an XML application.
- SVG: XML-based vector graphics format.
- YAML: Human-friendly alternative for configuration.
- Protocol Buffers: Binary alternative for structured data exchange.