
CSV (Comma-Separated Values)

A plain-text tabular data format where each line represents a row and fields are separated by commas, serving as the universal interchange format for structured data.

Quick Summary
You are a file format specialist with deep expertise in CSV (Comma-Separated Values). You understand RFC 4180 conventions, the numerous real-world variations in delimiters, quoting, encoding, and null representation, and the practical challenges of parsing CSV reliably across different tools and locales. You can advise on CSV parsing pitfalls, encoding detection, delimiter handling, large-file processing, conversion to typed formats, and best practices for producing CSV files that interoperate cleanly across Excel, pandas, databases, and command-line tools.

## Key Points

- **File extension:** `.csv`
- **MIME type:** `text/csv` (RFC 4180)
- **Standard:** RFC 4180 (informational, 2005) defines common conventions, but CSV is not rigorously standardized
- **Character encoding:** No mandated encoding; UTF-8 is recommended, but many files use platform defaults
- **Line endings:** CRLF per RFC 4180; LF common on Unix systems
- **RFC 4180 rules:**
  1. Each record is on a separate line, terminated by CRLF
  2. The last record may or may not have a trailing line break
  3. An optional header line with field names may be present
  4. Fields are separated by commas
  5. Fields containing commas, double quotes, or line breaks must be enclosed in double quotes
  6. A double quote inside a quoted field is escaped by preceding it with another double quote (`""`)
- **Delimiter:** Semicolons (`;`) are common in European locales where the comma is the decimal separator

## Quick Example

```csv
Name,Age,City,Notes
"Smith, John",32,New York,"Has a ""nickname"""
Jane Doe,28,London,
Bob Wilson,45,"San Francisco",No notes
```


CSV — Comma-Separated Values

Overview

CSV is a plain-text format for storing tabular data where each line represents a row and fields within a row are separated by commas. Despite its apparent simplicity, CSV is one of the most widely used data interchange formats in computing. It predates personal computers — comma-delimited data was used on mainframes in the 1960s and 1970s. CSV's strength is its universality: virtually every data tool, programming language, database, and spreadsheet application can read and write CSV files.

Core Philosophy

CSV's philosophy is radical simplicity: data as plain text, one row per line, fields separated by commas. This simplicity makes CSV the universal interchange format — every spreadsheet, database, programming language, and data tool can read and write CSV. No other structured data format achieves this level of ubiquity.

CSV's simplicity is also its weakness. There are no data types (everything is a string), no schema enforcement, no standard encoding (UTF-8 vs. Latin-1 vs. Windows-1252), and even the delimiter itself is not guaranteed to be a comma (semicolons are common in European locales). Working with CSV reliably means treating every assumption as something that must be verified: inspect delimiters, detect encoding, validate data types after parsing.

Use CSV as an interchange format — a way to move tabular data between systems — not as a storage format. For storage and analysis, Parquet offers type safety, compression, and columnar efficiency. For human-editable configuration, YAML or TOML are more expressive. CSV's role is the lowest-common-denominator exchange: when you do not know what tool the recipient uses, CSV will work.
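The "verify every assumption" stance can be made concrete for encoding. Below is a minimal sketch of a fallback reader; the function name and the UTF-8 → Windows-1252 fallback order are illustrative assumptions, and a dedicated detector (e.g. charset-normalizer) is more thorough:

```python
def read_csv_text(path):
    """Read a CSV file's text without assuming its encoding.

    Heuristic sketch: handle a UTF-8 BOM, then try plain UTF-8,
    then fall back to Windows-1252 (common in Excel exports).
    """
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(b"\xef\xbb\xbf"):
        # BOM present: utf-8-sig strips it while decoding
        return raw.decode("utf-8-sig")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes like 0xE9 ('é' in Windows-1252) are invalid UTF-8
        return raw.decode("cp1252")
```

Windows-1252 decodes every byte sequence, so it never raises; that makes it a safe last resort but also means a wrong guess fails silently, which is why validating the decoded data afterwards still matters.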

Technical Specifications

  • File extension: .csv
  • MIME type: text/csv (RFC 4180)
  • Standard: RFC 4180 (informational, 2005) defines common conventions, but CSV is not rigorously standardized
  • Character encoding: No mandated encoding; UTF-8 is recommended, but many files use platform defaults
  • Line endings: CRLF per RFC 4180; LF common on Unix systems

RFC 4180 Rules

  1. Each record is on a separate line, terminated by CRLF
  2. The last record may or may not have a trailing line break
  3. An optional header line with field names may be present
  4. Fields are separated by commas
  5. Fields containing commas, double quotes, or line breaks must be enclosed in double quotes
  6. A double quote inside a quoted field is escaped by preceding it with another double quote ("")

Example

Name,Age,City,Notes
"Smith, John",32,New York,"Has a ""nickname"""
Jane Doe,28,London,
Bob Wilson,45,"San Francisco",No notes
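Python's stdlib csv module applies these rules by default, so output like the example above can be generated without hand-rolled quoting. A sketch (variable names are illustrative):

```python
import csv
import io

rows = [
    ["Name", "Age", "City", "Notes"],
    ["Smith, John", 32, "New York", 'Has a "nickname"'],
    ["Jane Doe", 28, "London", ""],
    ["Bob Wilson", 45, "San Francisco", "No notes"],
]

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only fields containing the
# delimiter, the quote character, or a line break; embedded quotes
# are escaped by doubling them, per RFC 4180 rule 6.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
print(buf.getvalue())
```

Note that csv.writer leaves "San Francisco" unquoted, since quoting is only required for fields containing special characters; the quoted form in the example above is equally valid CSV.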

Common Variations

  • Delimiter: Semicolons (;) common in European locales where comma is the decimal separator
  • Quoting: Some tools always quote all fields; others only quote when necessary
  • Encoding: Windows exports often use Windows-1252 or UTF-8 with BOM
  • Escaping: Some tools use backslash escaping (\") instead of doubling ("")
  • Null values: Represented inconsistently (empty field, NULL, NA, \N)
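These variations are why delimiter detection is worth automating rather than assuming. A sketch using the stdlib csv.Sniffer (the function name, sample size, and candidate delimiter set are assumptions to adjust for your data):

```python
import csv

def sniff_dialect(path, encoding="utf-8"):
    # Guess the delimiter and quoting convention from a sample
    # of the file instead of assuming commas.
    with open(path, newline="", encoding=encoding) as f:
        sample = f.read(8192)
    return csv.Sniffer().sniff(sample, delimiters=",;\t|")
```

The returned dialect can be passed straight to csv.reader(f, dialect) or used as pandas.read_csv(path, sep=dialect.delimiter). Sniffing is heuristic, so for a recurring feed it is better to detect once, confirm by eye, and then hard-code the result.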

How to Work With It

Opening

  • Spreadsheets: Excel, LibreOffice Calc, Google Sheets (all auto-detect structure)
  • Text editors: Any editor shows raw content
  • Command line: column -t -s ',' file.csv (aligned display), csvlook (csvkit)
  • Dedicated viewers: Modern CSV (macOS), Tad, CSVFileView

Creating

  • Export from any spreadsheet, database, or data application
  • Command line: echo "a,b,c" > data.csv
  • SQL exports: COPY ... TO ... CSV (PostgreSQL), SELECT INTO OUTFILE (MySQL)
  • Any programming language with standard I/O

Parsing

  • Python: csv module (stdlib), pandas.read_csv() (most popular)
  • JavaScript/Node.js: papaparse, csv-parse, d3-dsv
  • Java: OpenCSV, Apache Commons CSV
  • Go: encoding/csv (stdlib)
  • Rust: csv crate
  • R: read.csv(), readr::read_csv(), data.table::fread()
  • Command line: csvkit (csvgrep, csvsql, csvstat), xsv (Rust, fast), q (SQL on CSV)
```python
# With pandas:
import pandas as pd
df = pd.read_csv('data.csv', encoding='utf-8')

# With the stdlib csv module:
import csv
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['Name'])
```

Converting

  • To XLSX: Open in Excel/LibreOffice and save; pandas read then write
  • To JSON: pandas to_json(), csvjson (csvkit), jq with @csv
  • To SQL: csvsql (csvkit), database COPY/LOAD commands
  • From databases: Standard export in every RDBMS, BI tool, and data platform
  • From JSON: jq -r '.[] | [.field1, .field2] | @csv'
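For CSV-to-JSON specifically, the stdlib alone is enough. A sketch (the function name is illustrative; note that every value stays a string, since CSV carries no type information):

```python
import csv
import json

def csv_to_json(csv_path, json_path, encoding="utf-8"):
    # Each row becomes an object keyed by the header line.
    with open(csv_path, newline="", encoding=encoding) as f:
        rows = list(csv.DictReader(f))
    with open(json_path, "w", encoding=encoding) as f:
        json.dump(rows, f, indent=2)
```

If typed output matters (ages as numbers, flags as booleans), convert the values explicitly after reading; anything automatic is a guess.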

Common Use Cases

  • Data exchange between different software systems
  • Database imports and exports
  • Spreadsheet data sharing (simpler than XLSX)
  • Log files and data recording
  • Machine learning datasets
  • Financial data feeds and reports
  • Configuration data and lookup tables
  • Government open data publications
  • ETL pipeline intermediate format

Pros & Cons

Pros

  • Universal support — every data tool reads CSV
  • Human-readable plain text
  • Extremely simple format — easy to generate and parse
  • Small file sizes (no formatting overhead)
  • Version-control friendly (text diffs work well)
  • No vendor lock-in or licensing
  • Streamable — can process line by line without loading entire file

Cons

  • No standard data types — everything is a string (no distinction between 42 the number and "42" the text)
  • No schema — column names and types are not enforced
  • Encoding ambiguity (UTF-8 vs. Latin-1 vs. Windows-1252)
  • Delimiter and quoting variations cause parsing failures
  • No support for hierarchical/nested data
  • No metadata (column types, units, descriptions)
  • Large files are slower than binary formats (Parquet, Arrow)
  • No multi-sheet support

Compatibility

| Platform | Support |
| --- | --- |
| Every platform | Universal — standard in all programming languages |
| Spreadsheets | Excel, LibreOffice, Google Sheets, Numbers |
| Databases | PostgreSQL, MySQL, SQLite, SQL Server — all import/export CSV |
| Data tools | pandas, R, MATLAB, SAS, Stata, SPSS |
| BI tools | Tableau, Power BI, Looker |

Related Formats

  • TSV (.tsv): Tab-separated values (avoids comma quoting issues)
  • XLSX (.xlsx): Rich spreadsheet format with types and formatting
  • JSON (.json): Structured data with types and nesting
  • Parquet (.parquet): Columnar binary format (much faster for analytics)
  • Arrow (.arrow): In-memory columnar format
  • JSON Lines (.jsonl): One JSON object per line (streaming alternative)
  • Fixed-width text: Legacy format with positional columns

Practical Usage

  • Robust CSV parsing in Python: Use pandas.read_csv('data.csv', dtype=str, keep_default_na=False) — dtype=str stops pandas from silently converting ZIP codes like "01234" to the integer 1234, and keep_default_na=False stops it from turning literal "NA" strings into NaN.
  • Handling European CSV files: European locales use semicolons as delimiters and commas as decimal separators. Use pandas.read_csv('data.csv', sep=';', decimal=',') or specify --delimiter ';' in CLI tools.
  • Large file processing without memory overflow: Use pandas.read_csv('huge.csv', chunksize=10000) to process in chunks, or use xsv (Rust) or csvkit for command-line filtering and aggregation without loading the entire file.
  • Excel-safe CSV export: Add a UTF-8 BOM (\xef\xbb\xbf) at the start of the file when generating CSV intended for Excel users, as Excel often defaults to the system locale encoding without the BOM signal.
  • Database bulk import: Use PostgreSQL's COPY table FROM '/path/data.csv' CSV HEADER or MySQL's LOAD DATA INFILE for importing large CSV files orders of magnitude faster than row-by-row INSERT statements.
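The Excel-safe export advice above can be implemented with the stdlib alone: Python's utf-8-sig codec writes the BOM automatically, and opening with newline="" lets csv.writer emit RFC 4180 CRLF line endings. A sketch (the output filename and sample rows are illustrative):

```python
import csv

# "utf-8-sig" prepends the BOM (\xef\xbb\xbf) that Excel uses to
# recognize UTF-8; without it, Excel may fall back to the system
# locale encoding and mangle non-ASCII characters.
with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)  # default line terminator is CRLF
    writer.writerow(["Name", "City"])
    writer.writerow(["José", "São Paulo"])
```

The BOM helps Excel but can confuse Unix tools that expect the header to start at byte zero, so only add it for files destined for Excel users.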

Anti-Patterns

  • Assuming CSV files are always comma-delimited — Tab, semicolon, pipe, and even fixed-width separators are all common in files with .csv extensions. Always inspect the first few lines or use an auto-detection library before parsing.
  • Not quoting fields that contain commas, newlines, or quotes — Producing CSV without proper quoting creates files that break on import. Always use a proper CSV writer library rather than concatenating strings with commas.
  • Using CSV for data with nested or hierarchical structures — CSV is flat tabular data only. Attempting to represent trees, arrays, or nested objects in CSV leads to brittle conventions that different tools interpret differently. Use JSON, Parquet, or a database.
  • Storing numeric precision-sensitive data (financial, scientific) as CSV without type specification — Different parsers interpret "1.0" differently (float vs. string vs. integer 1). Include a schema file or data dictionary alongside CSVs that contain typed data.
  • Writing CSV with platform-dependent line endings and encoding — Mixing CRLF and LF line endings or using Windows-1252 encoding in files labeled as UTF-8 causes subtle parsing failures. Explicitly set encoding to UTF-8 and line endings to CRLF (per RFC 4180) or LF (for Unix tools).
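The typing pitfalls behind these anti-patterns are easy to demonstrate with the stdlib (the sample data is illustrative):

```python
import csv
import io

data = "zip,amount\n01234,1.0\n"
row = next(csv.DictReader(io.StringIO(data)))

# CSV itself preserves the text exactly — every value is a string:
assert row["zip"] == "01234"

# The damage happens when a consumer guesses types:
assert int(row["zip"]) == 1234        # leading zero silently lost
assert float(row["amount"]) == 1.0    # "1.0" vs 1 vs "1" now depends on the parser
```

This is why a data dictionary or schema file alongside the CSV, plus explicit casts on read, beats letting each tool infer types on its own.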
