Architectural overview for AI assistants working with this codebase.
Maintenance rule:
`AGENTS.md` is the canonical AI-agent guidance. When this file changes, update `ARCHITECTURE.md` in the same directory if the architecture index or durable engineering guidance needs to reflect the change. `CLAUDE.md` is only a compatibility shim that points here.
`single_include/csv.hpp` is intentionally non-functional and exists only as a compatibility shim.
- Do not compile against `single_include/csv.hpp`
- For single-header validation, generate `build/.../single_include_generated/csv.hpp` via the `generate_single_header` target, then compile that generated file
- For unamalgamated usage, include headers from `include/`
This guard exists to prevent stale-in-repo amalgamated headers and to force use of the canonical generated distribution.
The `CSVReader` class has two completely different implementations:

    // PATH 1: Memory-mapped I/O (MmapParser)
    CSVReader mmap_reader("filename.csv");

    // PATH 2: Stream-based (StreamParser)
    std::ifstream infile("filename.csv", std::ios::binary);
    CSVReader stream_reader(infile, format);  // format: a caller-supplied CSVFormat

Impact: Bugs can exist in one path but not the other (see issue #281). Any test validating parsing behavior must test BOTH paths using Catch2 `SECTION`.
- Worker thread reads in 10MB chunks (`CSV_CHUNK_SIZE_DEFAULT`)
- Communicates via `ThreadSafeDeque<CSVRow>`
- Exceptions propagate via `std::exception_ptr`
- Critical: Fields spanning chunk boundaries must not be corrupted
Testing requirement: Use ≥500K rows to cross 10MB boundary.
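
The exception hand-off listed above can be illustrated with a generic standard-library pattern. This is a minimal sketch, not the library's worker code: `worker_error_message` is a hypothetical helper that runs a job on a worker thread, captures anything it throws in a `std::exception_ptr`, and rethrows it on the calling thread.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `job` on a worker thread and returns the message
// of the exception it threw, or an empty string if it finished normally.
template <typename Job>
std::string worker_error_message(Job job) {
    std::exception_ptr error;  // written by the worker, read after join()

    std::thread worker([&error, &job] {
        try {
            job();
        } catch (...) {
            // An exception that escapes a thread calls std::terminate,
            // so it is captured here instead of being allowed to propagate.
            error = std::current_exception();
        }
    });
    worker.join();  // join() synchronizes, so `error` is safe to read below

    try {
        if (error) std::rethrow_exception(error);  // resurfaces on this thread
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";
}
```

A caller that passes a job throwing `std::runtime_error("bad chunk")` gets `"bad chunk"` back on its own thread, which is the behavior a consumer of an asynchronous reader relies on.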
For detailed file mapping, parser data flow, and component relationships, see ARCHITECTURE.md and include/internal/ARCHITECTURE.md.
- Don't assume one code path: Mmap and stream paths are different. Always test both.
- Don't write tiny tests: Need ≥500K rows to cross 10MB chunk boundary.
- Don't use uniform values: Each column needs distinct values to detect corruption.
- Don't ignore async: Worker thread means exceptions must use `exception_ptr`.
- Don't change one constructor: Likely affects both mmap and stream paths.
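
To make the size and distinct-value pitfalls concrete, here is a minimal sketch of a test-data generator. The names `make_distinct_csv` and `kAssumedChunkBytes` are hypothetical, and the 10 MiB constant is an assumption standing in for `CSV_CHUNK_SIZE_DEFAULT`; check the real value before relying on it.

```cpp
#include <cstddef>
#include <string>

// Assumption: 10 MiB stands in for the library's chunk size. The real value
// comes from CSV_CHUNK_SIZE_DEFAULT and may differ.
const std::size_t kAssumedChunkBytes = 10u * 1024u * 1024u;

// Hypothetical generator: every cell encodes both its column and its row,
// so a value that ends up in the wrong row or column is detectable.
std::string make_distinct_csv(std::size_t rows) {
    std::string out = "alpha,beta,gamma\n";
    for (std::size_t i = 0; i < rows; ++i) {
        const std::string n = std::to_string(i);
        out += "alpha" + n + ",beta" + n + ",gamma" + n + "\n";
    }
    return out;
}
```

With 500,000 rows this input is larger than the assumed chunk size, so at least one chunk boundary falls inside the data. A uniform file (for example, every cell set to `x`) would parse "correctly" even if fields were swapped or duplicated at that boundary.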
- `CSVReader` is non-copyable and move-enabled. Prefer explicit ownership transfer (`std::move`) or `std::unique_ptr<CSVReader>` when sharing/handing off parser ownership across APIs.
- Prefer user-friendly API constraints. Do not narrow template constraints unless required for correctness, safety, or a measured performance win. If an implementation already handles common standard-library containers/ranges correctly, keep those inputs accepted instead of over-constraining APIs for aesthetic purity.
- Opportunistic rewrites/refactors are allowed when they are safe and justified. Keep them separated from build-fix urgency where possible, and avoid bundling unrelated churn with compiler triage unless explicitly requested.
- When proposing changes that affect compile-time behavior, explain the tradeoff clearly. Call out any impact to codegen, performance, portability, and readability before applying the change.
- If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first. Provide a short justification before expanding further.
- `CSVReader::iterator` is intentionally single-pass. Do not cache all `RawCSVDataPtr` chunks to make it behave like a forward iterator; that defeats bounded-memory streaming for large CSV files. Algorithms that need multi-pass access should first materialize rows into a container such as `std::vector<CSVRow>`.
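
The ownership and single-pass guidance above can be sketched without the library itself. In this hedged example, `LineSource` and `materialize` are hypothetical stand-ins: `LineSource` mimics a move-only, single-pass reader, and `materialize` takes ownership through `std::unique_ptr` and copies rows into a vector for multi-pass use.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a single-pass, move-only reader.
class LineSource {
public:
    explicit LineSource(std::string text) : in_(std::move(text)) {}

    LineSource(const LineSource&) = delete;             // non-copyable
    LineSource& operator=(const LineSource&) = delete;
    LineSource(LineSource&&) = default;                 // move-enabled
    LineSource& operator=(LineSource&&) = default;

    // Single pass: each line is produced once and cannot be revisited.
    bool next(std::string& line) {
        return static_cast<bool>(std::getline(in_, line));
    }

private:
    std::istringstream in_;
};

// Ownership is handed over explicitly: the caller must move the pointer in.
std::vector<std::string> materialize(std::unique_ptr<LineSource> source) {
    std::vector<std::string> rows;
    std::string line;
    while (source->next(line)) rows.push_back(line);
    return rows;  // multi-pass algorithms can now run over the vector
}
```

The cost of multi-pass access (holding every row in memory) is paid only by callers that ask for it, while plain iteration stays bounded in memory.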
See tests/AGENTS.md for test strategy, checklist, and conventions.
- Use compatibility macros defined in `common.hpp` for cross-compiler or cross-standard concerns. If a suitable macro doesn't exist, consider creating one.
- Compatibility macros defined in `common.hpp` MUST be referenced only after including `common.hpp` to ensure correctness.
- Prefer compile-time control flow and assertions where possible. For example, if a branch may be safely written with `if constexpr`, then use the `IF_CONSTEXPR` macro (from `common.hpp`) to ensure C++11 compatibility while ensuring optimal control flow for C++17 and later users.
  - If this causes compiler warnings, silence the warning. Do not revert to unnecessary runtime flow.
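
As a rough illustration of the compile-time branching rule, the sketch below defines a stand-in macro. `DEMO_IF_CONSTEXPR` and `kind_of` are hypothetical names; the real `IF_CONSTEXPR` lives in `common.hpp` and its exact definition may differ from this assumption.

```cpp
#include <string>
#include <type_traits>

// Assumption: DEMO_IF_CONSTEXPR is a stand-in for the real IF_CONSTEXPR macro,
// whose exact definition lives in common.hpp and may differ.
#if __cplusplus >= 201703L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201703L)
#define DEMO_IF_CONSTEXPR if constexpr
#else
#define DEMO_IF_CONSTEXPR if
#endif

template <typename T>
std::string kind_of(const T&) {
    // Both branches are valid for every T, so the plain `if` fallback still
    // compiles before C++17; C++17 and later discard the untaken branch.
    DEMO_IF_CONSTEXPR (std::is_integral<T>::value) {
        return "integral";
    } else {
        return "other";
    }
}
```

This pattern is only safe when each branch compiles for every instantiation. Under the fallback, some compilers may warn that the condition is constant, which is the situation where the rule above asks for the warning to be silenced instead of rewriting the branch.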
- Prefer trailing underscore for private members (for example `source_`, `leftover_`). When you touch code with mixed private-member naming styles, normalize the edited region toward trailing underscores instead of introducing more leading-underscore or unsuffixed names.
- Apply the 5/2 anti-duplication rule.
  - If equivalent behavior exists in 2 or more code paths and each copy is about 5+ meaningful lines, extract a shared helper.
  - If duplication is intentionally kept, add a brief comment explaining why (for example performance, API boundary, or template constraints).
  - For behavior-sensitive duplicated logic, keep at least one regression test that exercises each path (for example mmap and stream via separate Catch2 `SECTION`s).
- If a class has both a `.hpp` and `.cpp` file, put methods inside the `.cpp` and prefix the definition with `CSV_INLINE` to ensure proper single-header compilation (the macro is `inline` in the generated single-header and empty otherwise). Exceptions:
  - Templates must stay in `.hpp` because the compiler needs the definition at instantiation time. `init_from_stream` is the standing example.
  - Trivial one-liner accessors may be unconditionally `inline` in the header when the call overhead is measurable and the body will never change.
  - Consolidation: If a `.cpp` would be under ~100 lines and the split causes excessive comment duplication between the two files, prefer a single `.hpp` with definitions marked `inline` (free functions and methods alike). Do not use `CSV_INLINE` for consolidated definitions: `CSV_INLINE` expands to empty in multi-header mode, which would produce ODR violations across TUs. Do not consolidate just for brevity; do so only when duplication is the dominant cost.
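
The header/implementation split described above can be sketched in a single listing. `DEMO_INLINE`, `DEMO_SINGLE_HEADER`, and `RowCounter` are hypothetical names used for illustration; the real macro is `CSV_INLINE`, and the condition that selects its expansion is an assumption here.

```cpp
// Assumption: DEMO_INLINE and DEMO_SINGLE_HEADER are stand-ins. The real macro
// is CSV_INLINE, and the condition that selects its expansion may differ.
#ifdef DEMO_SINGLE_HEADER
#define DEMO_INLINE inline
#else
#define DEMO_INLINE
#endif

// --- would live in row_counter.hpp (hypothetical file) ---
class RowCounter {
public:
    void add();                           // defined in the .cpp part below
    int total() const { return total_; }  // trivial accessor kept in the header

private:
    int total_ = 0;
};

// --- would live in row_counter.cpp (hypothetical file) ---
// In single-header mode this definition is `inline`, so including the
// amalgamated header from several translation units does not violate the ODR.
DEMO_INLINE void RowCounter::add() { ++total_; }
```

The same reasoning explains the consolidation exception: a definition that lives only in a header must be unconditionally `inline`, because a macro that expands to nothing in multi-header mode would leave a non-inline definition in every translation unit that includes it.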
- Prefer LF (`\n`) line endings for tracked source, test, CMake, and Markdown files. When you touch a file with mixed line endings, normalize the edited file to LF unless there is a file-specific reason not to. Avoid introducing mixed CRLF/LF endings in the same file.
- Keep preprocessor directives flush left. `#define`, `#if`, `#ifdef`, `#else`, and `#endif` should start at column 0. Code inside multi-line macros should be indented exactly as the equivalent non-macro code would be; do not add extra indentation just because it lives inside a macro body.
- Keep constructor initializer lists in declaration order. C++ initializes bases and members in declaration order, not initializer-list order. When adding or editing a constructor, order its initializer list to match the class declaration exactly so GCC/Clang `-Wreorder` stays clean and readers do not infer a false initialization dependency.
- Internal folder namespaces should match folder structure. When adding or moving files under `include/internal/`, place their contents in the matching nested namespace when practical. For example, `include/internal/speculative/` maps to `csv::internals::speculative`, and `include/internal/parser/` maps to `csv::internals::parser`. Do not churn existing files solely for this rule unless the namespace move is part of an intentional architecture cleanup.
- Do not accidentally pass large objects by value. Use `const&` for observation, `&` for mutation, and `&&` / by-value-with-an-explicit-`std::move` for ownership transfer. If passing a large object by value is intentional, make the consuming semantics obvious at the call site or add a brief comment.
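
The initializer-order and parameter-passing rules above can be seen together in a small example. `FieldBuffer` is a hypothetical class written for illustration, not a type from this codebase.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class FieldBuffer {
public:
    // `name` is taken by value and explicitly moved, so the call site decides
    // whether it pays for a copy or a move.
    FieldBuffer(std::size_t capacity, std::string name)
        : capacity_(capacity),     // 1st in the declaration, 1st here
          name_(std::move(name)),  // 2nd
          slots_(capacity_, 0) {}  // 3rd: capacity_ is already initialized

    std::size_t slot_count() const { return slots_.size(); }
    const std::string& name() const { return name_; }  // observation via const&

private:
    // Members are initialized in this order, whatever the initializer list says.
    std::size_t capacity_;
    std::string name_;
    std::vector<int> slots_;
};
```

If `slots_` were declared before `capacity_`, the same initializer list would read an uninitialized `capacity_`. Keeping the list in declaration order makes that dependency visible and keeps `-Wreorder` quiet.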
- Always update or remove incorrect comments.
- Don't reference internal functions in public API comments. Public API docs should describe user-visible behavior and contracts; internal helper/function details belong in internal docs.
- Avoid meaningless @param and @return descriptions. Do not add comments that could trivially be inferred by the function's name or other existing comments. When editing a function, remove any @param/@return descriptions that merely restate the function name or signature.
- Don't delete or simplify comments unless allowed by other rules in this section. Comments in this codebase frequently encode concurrency invariants, non-obvious design decisions, and hard-won bug context that cannot be recovered from the code alone.
- Public API docs belong on declarations in `.hpp` files. When a class has both a header and implementation file, put user-facing/Doxygen documentation on the declaration in the header. Keep the `.cpp` focused on implementation notes, concurrency invariants, performance rationale, and bug-history comments.