CSV (Comma-Separated Values)
A plain-text tabular data format where each line represents a row and fields are separated by commas, serving as the universal interchange format for structured data.
You are a file format specialist with deep expertise in CSV (Comma-Separated Values). You understand RFC 4180 conventions, the numerous real-world variations in delimiters, quoting, encoding, and null representation, and the practical challenges of parsing CSV reliably across different tools and locales. You can advise on CSV parsing pitfalls, encoding detection, delimiter handling, large-file processing, conversion to typed formats, and best practices for producing CSV files that interoperate cleanly across Excel, pandas, databases, and command-line tools. ## Key Points - **File extension:** `.csv` - **MIME type:** `text/csv` (RFC 4180) - **Standard:** RFC 4180 (informal, 2005) defines common conventions but CSV is not rigorously standardized - **Character encoding:** No mandated encoding; UTF-8 is recommended, but many files use platform defaults - **Line endings:** CRLF per RFC 4180; LF common on Unix systems 1. Each record is on a separate line, terminated by CRLF 2. The last record may or may not have a trailing line break 3. An optional header line with field names may be present 4. Fields are separated by commas 5. Fields containing commas, double quotes, or line breaks must be enclosed in double quotes 6. A double quote inside a quoted field is escaped by preceding it with another double quote (`""`) - **Delimiter:** Semicolons (`;`) common in European locales where comma is the decimal separator ## Quick Example ```csv Name,Age,City,Notes "Smith, John",32,New York,"Has a ""nickname""" Jane Doe,28,London, Bob Wilson,45,"San Francisco",No notes ```
skilldb get file-formats-skills/CSV (Comma-Separated Values)Full skill: 163 linesYou are a file format specialist with deep expertise in CSV (Comma-Separated Values). You understand RFC 4180 conventions, the numerous real-world variations in delimiters, quoting, encoding, and null representation, and the practical challenges of parsing CSV reliably across different tools and locales. You can advise on CSV parsing pitfalls, encoding detection, delimiter handling, large-file processing, conversion to typed formats, and best practices for producing CSV files that interoperate cleanly across Excel, pandas, databases, and command-line tools.
CSV — Comma-Separated Values
Overview
CSV is a plain-text format for storing tabular data where each line represents a row and fields within a row are separated by commas. Despite its apparent simplicity, CSV is one of the most widely used data interchange formats in computing. It predates personal computers — comma-delimited data was used on mainframes in the 1960s and 1970s. CSV's strength is its universality: virtually every data tool, programming language, database, and spreadsheet application can read and write CSV files.
Core Philosophy
CSV's philosophy is radical simplicity: data as plain text, one row per line, fields separated by commas. This simplicity makes CSV the universal interchange format — every spreadsheet, database, programming language, and data tool can read and write CSV. No other structured data format achieves this level of ubiquity.
CSV's simplicity is also its weakness. There are no data types (everything is a string), no schema enforcement, no standard encoding (UTF-8 vs. Latin-1 vs. Windows-1252), and even the delimiter itself is not guaranteed to be a comma (semicolons are common in European locales). Working with CSV reliably means treating every assumption as something that must be verified: inspect delimiters, detect encoding, validate data types after parsing.
Use CSV as an interchange format — a way to move tabular data between systems — not as a storage format. For storage and analysis, Parquet offers type safety, compression, and columnar efficiency. For human-editable configuration, YAML or TOML are more expressive. CSV's role is the lowest-common-denominator exchange: when you do not know what tool the recipient uses, CSV will work.
Technical Specifications
- File extension: `.csv`
- MIME type: `text/csv` (RFC 4180)
- Standard: RFC 4180 (informal, 2005) defines common conventions, but CSV is not rigorously standardized
- Character encoding: No mandated encoding; UTF-8 is recommended, but many files use platform defaults
- Line endings: CRLF per RFC 4180; LF common on Unix systems
RFC 4180 Rules
- Each record is on a separate line, terminated by CRLF
- The last record may or may not have a trailing line break
- An optional header line with field names may be present
- Fields are separated by commas
- Fields containing commas, double quotes, or line breaks must be enclosed in double quotes
- A double quote inside a quoted field is escaped by preceding it with another double quote (`""`)
Example
Name,Age,City,Notes
"Smith, John",32,New York,"Has a ""nickname"""
Jane Doe,28,London,
Bob Wilson,45,"San Francisco",No notes
Common Variations
- Delimiter: Semicolons (`;`) common in European locales where comma is the decimal separator
- Quoting: Some tools always quote all fields; others only quote when necessary
- Encoding: Windows exports often use Windows-1252 or UTF-8 with BOM
- Escaping: Some tools use backslash escaping (`\"`) instead of doubling (`""`)
- Null values: Represented inconsistently (empty field, `NULL`, `NA`, `\N`)
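Because the delimiter cannot be assumed, sniffing it before parsing is a common defensive step. A minimal sketch with the stdlib `csv.Sniffer` (the sample data is invented):

```python
import csv

# An illustrative semicolon-delimited sample, as commonly exported in
# European locales.
sample = "name;age;city\nJane Doe;28;London\nBob Wilson;45;Paris\n"

# Restricting the sniffer to plausible candidates makes detection more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(dialect.delimiter)  # ;

rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])  # ['Jane Doe', '28', 'London']
```

Sniffing a few kilobytes from the top of the file is usually enough; do not feed the sniffer an entire large file.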
How to Work With It
Opening
- Spreadsheets: Excel, LibreOffice Calc, Google Sheets (all auto-detect structure)
- Text editors: Any editor shows raw content
- Command line:
column -t -s, file.csv(formatted display),csvlook(csvkit) - Dedicated viewers: Modern CSV (macOS), Tad, CSVFileView
Creating
- Export from any spreadsheet, database, or data application
- Command line:
echo "a,b,c" > data.csv - SQL exports:
COPY ... TO ... CSV(PostgreSQL),SELECT INTO OUTFILE(MySQL) - Any programming language with standard I/O
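In Python, producing a well-formed file comes down to using a real CSV writer rather than string concatenation. A minimal sketch (the filename and rows are illustrative):

```python
import csv

# newline='' hands line-ending control to the csv module (CRLF by default,
# per RFC 4180); the embedded comma in "Smith, John" is quoted automatically.
rows = [
    ["Name", "Age", "City"],
    ["Smith, John", 32, "New York"],
    ["Jane Doe", 28, "London"],
]

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```

The `newline=''` argument matters: without it, Python's universal newline translation can turn the writer's CRLF into CRCRLF on Windows.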
Parsing
- Python:
csvmodule (stdlib),pandas.read_csv()(most popular) - JavaScript/Node.js:
papaparse,csv-parse,d3-dsv - Java: OpenCSV, Apache Commons CSV
- Go:
encoding/csv(stdlib) - Rust:
csvcrate - R:
read.csv(),readr::read_csv(),data.table::fread() - Command line:
csvkit(csvgrep, csvsql, csvstat),xsv(Rust, fast),q(SQL on CSV)
import pandas as pd
df = pd.read_csv('data.csv', encoding='utf-8')
# With csv module:
import csv
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['Name'])
Converting
- To XLSX: Open in Excel/LibreOffice and save, or `pandas` read then write
- To JSON: `pandas` `to_json()`, `csvjson` (csvkit)
- To SQL: `csvsql` (csvkit), database COPY/LOAD commands
- From databases: Standard export in every RDBMS, BI tool, and data platform
- From JSON: `jq -r '.[] | [.field1, .field2] | @csv'`
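For the CSV-to-JSON direction, the conversion can also be sketched with only the stdlib (the CSV text is invented). Note that every value comes out as a string, since CSV carries no types:

```python
import csv
import io
import json

csv_text = "Name,Age,City\nJane Doe,28,London\nBob Wilson,45,Paris\n"

# DictReader turns each data row into a dict keyed by the header line.
records = list(csv.DictReader(io.StringIO(csv_text)))
print(json.dumps(records, indent=2))
# Ages arrive as the strings "28" and "45"; cast columns explicitly if the
# consumer expects numbers.
```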
Common Use Cases
- Data exchange between different software systems
- Database imports and exports
- Spreadsheet data sharing (simpler than XLSX)
- Log files and data recording
- Machine learning datasets
- Financial data feeds and reports
- Configuration data and lookup tables
- Government open data publications
- ETL pipeline intermediate format
Pros & Cons
Pros
- Universal support — every data tool reads CSV
- Human-readable plain text
- Extremely simple format — easy to generate and parse
- Small file sizes (no formatting overhead)
- Version-control friendly (text diffs work well)
- No vendor lock-in or licensing
- Streamable — can process line by line without loading entire file
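The streamability point is worth illustrating: a reader can aggregate a column row by row in constant memory. A sketch with a synthetic in-memory file (the same loop works unchanged on a multi-gigabyte file handle):

```python
import csv
import io

# Synthetic CSV: 1000 rows of id,value where value = 2 * id.
lines = io.StringIO("id,value\n" + "".join(f"{i},{2 * i}\n" for i in range(1000)))

# Only one row is held in memory at a time.
total = 0
for row in csv.DictReader(lines):
    total += int(row["value"])

print(total)  # 999000
```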
Cons
- No standard data types — everything is a string (no distinction between `42` the number and `"42"` the text)
- No schema — column names and types are not enforced
- Encoding ambiguity (UTF-8 vs. Latin-1 vs. Windows-1252)
- Delimiter and quoting variations cause parsing failures
- No support for hierarchical/nested data
- No metadata (column types, units, descriptions)
- Large files are slower than binary formats (Parquet, Arrow)
- No multi-sheet support
Compatibility
| Platform | Support |
|---|---|
| Every platform | Universal — standard in all programming languages |
| Spreadsheets | Excel, LibreOffice, Google Sheets, Numbers |
| Databases | PostgreSQL, MySQL, SQLite, SQL Server — all import/export CSV |
| Data tools | pandas, R, MATLAB, SAS, Stata, SPSS |
| BI tools | Tableau, Power BI, Looker |
Related Formats
- TSV (.tsv): Tab-separated values (avoids comma quoting issues)
- XLSX (.xlsx): Rich spreadsheet format with types and formatting
- JSON (.json): Structured data with types and nesting
- Parquet (.parquet): Columnar binary format (much faster for analytics)
- Arrow (.arrow): In-memory columnar format
- JSON Lines (.jsonl): One JSON object per line (streaming alternative)
- Fixed-width text: Legacy format with positional columns
Practical Usage
- Robust CSV parsing in Python: Use `pandas.read_csv('data.csv', encoding='utf-8', dtype=str, keep_default_na=False)` with `dtype=str` to stop pandas from silently converting ZIP codes like "01234" to the integer 1234, and `keep_default_na=False` to stop it from reading the literal string "NA" as NaN.
- Handling European CSV files: European locales use semicolons as delimiters and commas as decimal separators. Use `pandas.read_csv('data.csv', sep=';', decimal=',')` or specify `--delimiter ';'` in CLI tools.
- Large file processing without memory overflow: Use `pandas.read_csv('huge.csv', chunksize=10000)` to process in chunks, or use `xsv` (Rust) or `csvkit` for command-line filtering and aggregation without loading the entire file.
- Excel-safe CSV export: Add a UTF-8 BOM (`\xef\xbb\xbf`) at the start of the file when generating CSV intended for Excel users, as Excel often defaults to the system locale encoding without the BOM signal.
- Database bulk import: Use PostgreSQL's `COPY table FROM '/path/data.csv' CSV HEADER` or MySQL's `LOAD DATA INFILE` to import large CSV files orders of magnitude faster than row-by-row INSERT statements.
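The Excel-safe export advice maps to a single encoding name in Python: `'utf-8-sig'` writes the BOM for you. A minimal sketch (the filename and data are illustrative):

```python
import csv

# 'utf-8-sig' prepends the UTF-8 BOM (\xef\xbb\xbf), which signals Excel
# to decode the file as UTF-8 rather than the system locale encoding.
with open("report.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "City"])
    writer.writerow(["José", "Zürich"])

# Reading back with the same codec strips the BOM transparently.
with open("report.csv", newline="", encoding="utf-8-sig") as f:
    print(next(csv.reader(f)))  # ['Name', 'City']
```

Be aware that tools expecting plain UTF-8 may see the BOM as three junk bytes glued onto the first header name, so reserve the BOM for files destined for Excel.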
Anti-Patterns
- Assuming CSV files are always comma-delimited — Tab, semicolon, pipe, and even fixed-width separators are all common in files with .csv extensions. Always inspect the first few lines or use an auto-detection library before parsing.
- Not quoting fields that contain commas, newlines, or quotes — Producing CSV without proper quoting creates files that break on import. Always use a proper CSV writer library rather than concatenating strings with commas.
- Using CSV for data with nested or hierarchical structures — CSV is flat tabular data only. Attempting to represent trees, arrays, or nested objects in CSV leads to brittle conventions that different tools interpret differently. Use JSON, Parquet, or a database.
- Storing numeric precision-sensitive data (financial, scientific) as CSV without type specification — Different parsers interpret "1.0" differently (float vs. string vs. integer 1). Include a schema file or data dictionary alongside CSVs that contain typed data.
- Writing CSV with platform-dependent line endings and encoding — Mixing CRLF and LF line endings or using Windows-1252 encoding in files labeled as UTF-8 causes subtle parsing failures. Explicitly set encoding to UTF-8 and line endings to CRLF (per RFC 4180) or LF (for Unix tools).
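The second anti-pattern, hand-rolled string concatenation, is easy to demonstrate with an invented two-field record:

```python
import csv
import io

fields = ["Smith, John", "New York"]

# Naive concatenation: the embedded comma silently splits one field in two.
naive = ",".join(fields)
print(next(csv.reader(io.StringIO(naive))))
# ['Smith', ' John', 'New York']  -- three columns instead of two

# A real CSV writer quotes the field, preserving the two-column structure.
buf = io.StringIO()
csv.writer(buf).writerow(fields)
print(next(csv.reader(io.StringIO(buf.getvalue()))))
# ['Smith, John', 'New York']
```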