Everything you need to know about CSV
CSV (Comma-Separated Values, .csv) is the simplest serious data interchange format - rows of fields separated by commas, defined informally by RFC 4180 in 2005. Despite its apparent simplicity, CSV is genuinely tricky to get right because there's no single canonical specification, and edge cases (quotes, commas in values, line breaks in cells) trip up most parsers.
How it works under the hood
- Field separator varies. Comma is the namesake, but TSV uses tab, and locales that use a comma as the decimal separator (much of Europe) often use a semicolon as the field separator instead.
- Quoting rules. RFC 4180 says wrap fields containing commas, quotes, or newlines in double quotes; escape internal quotes by doubling them. Many tools don't follow this strictly.
- No native data types. Everything is a string. The receiver decides what's a number, what's a date, what's a boolean. This causes endless type-inference bugs.
- No nesting, no schema. CSV is flat tabular data. For trees or schema enforcement, use JSON or Parquet.
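The quoting and typing rules above can be sketched with Python's standard `csv` module, which follows RFC 4180 conventions by default. This is a minimal illustration, not a production pipeline: it round-trips a row whose fields contain a comma, a quote, and a newline, and shows that every field comes back as a string.

```python
import csv
import io

# A row whose fields contain a comma, a double quote, and an embedded newline.
row = ["Smith, John", 'He said "hi"', "line1\nline2", "42"]

# csv.writer applies RFC 4180 quoting automatically: fields containing the
# delimiter, the quote character, or a newline get wrapped in double quotes,
# and internal quotes are escaped by doubling them.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Reading it back undoes the quoting -- but note that "42" is still a str,
# not an int: CSV carries no type information.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)                               # True
print(all(isinstance(f, str) for f in decoded))     # True
```

The round trip succeeds because writer and reader agree on one dialect; hand-rolled `split(",")` parsing would break on the first field alone.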
Where you'll actually use it
- Database imports/exports
- Spreadsheet data exchange between Excel and other tools
- Bulk record updates (CRM imports, mailing lists)
- Simple log files for ad-hoc analysis
How it compares to alternatives
- CSV vs TSV: TSV uses tabs - safer because tabs rarely appear in data.
- CSV vs JSON: JSON has types and nesting; CSV is flat strings.
- CSV vs Parquet: Parquet is columnar binary - typically 10-100x faster for analytics on large datasets.
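Separator variants need no separate library: most CSV parsers, including Python's `csv` module, take the delimiter as a parameter. A small sketch, using a tab-separated line with a comma-decimal value (as a German locale might produce):

```python
import csv
import io

# Tab-separated data where the comma is part of the value, not a separator.
line = "name\tscore\nAda\t3,14\n"
rows = list(csv.reader(io.StringIO(line), delimiter="\t"))
print(rows)  # [['name', 'score'], ['Ada', '3,14']]
```

The same `delimiter=";"` swap handles semicolon-separated files; the point is that "CSV" in practice is a family of dialects selected by configuration, not distinct formats.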
Things that will trip you up
- Excel's auto-conversion silently mangles CSV data on open - it strips leading zeros, reinterprets values like 1/2 as dates, and renders long numbers in scientific notation
- Embedded newlines inside quoted fields confuse simple line-based parsers - always use a real CSV library
- BOM at file start (`UTF-8 with BOM`) breaks naive parsers - handle or strip explicitly
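The BOM pitfall is easy to reproduce. A minimal sketch using Python's `utf-8-sig` codec, which strips a leading BOM if present and is harmless if absent:

```python
import csv
import io

# A UTF-8 file that starts with a byte-order mark, then a header row.
data = b"\xef\xbb\xbfid,name\r\n1,Ada\r\n"

# Naive decoding leaves the BOM glued to the first header name,
# so later lookups for the column 'id' silently fail.
naive = next(csv.reader(io.StringIO(data.decode("utf-8"))))
print(naive[0])  # '\ufeffid'

# Decoding with 'utf-8-sig' strips the BOM explicitly.
clean = next(csv.reader(io.StringIO(data.decode("utf-8-sig"))))
print(clean)     # ['id', 'name']
```

When reading files directly, `open(path, newline="", encoding="utf-8-sig")` achieves the same thing and also satisfies the `csv` module's newline-handling requirement.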