Skip to content

Fix repeated control fields being dropped#79

Merged
dchud merged 4 commits intomainfrom
bd-7dre-repeated-control-fields
Apr 12, 2026
Merged

Fix repeated control fields being dropped#79
dchud merged 4 commits intomainfrom
bd-7dre-repeated-control-fields

Conversation

@dchud
Copy link
Copy Markdown
Owner

@dchud dchud commented Apr 10, 2026

Summary

Fixes #77 — repeated control fields (e.g., multiple 007s) were silently dropped because control_fields used IndexMap<String, String>, causing later values to overwrite earlier ones.

  • Changed storage to IndexMap<String, Vec<String>> across all three record types (Record, AuthorityRecord, HoldingsRecord)
  • Updated all 7 serialization formats (ISO 2709, JSON, MARCJSON, MARCXML, CSV, MODS/Dublin Core, BIBFRAME) to emit all values
  • Added get_control_field_values() (Rust) and control_field_values() (PyO3) for multi-value access
  • Updated Python wrapper get_fields() to return all repeated control fields

The pymarc-compatible API surface (get_fields(), record['007'], add_field(), get_fields() with no args) all correctly handle repeated control fields now. pymarc has no get_control_field method — that's an mrrc-internal API.

22 files changed across Rust core, PyO3 bindings, and Python wrapper.

cc @acdha — this addresses the repeated 007 bug you reported. The test suite includes records with the same cr|nn ||||||aa / fb|a bnnnn pattern from your report, plus additional cases with repeated 006 and triple 007 fields. The to_unicode/permissive flags enhancement is tracked separately in #78.

Test plan

  • 10 new Python tests in test_repeated_control_fields.py covering parsing, round-trip, leader length, and pymarc parity
  • Updated Rust integration test for repeated control fields via builder
  • All 775 Rust tests pass
  • All 596 Python tests pass (including existing pymarc compatibility/compliance suites)
  • Full .cargo/check.sh passes (rustfmt, clippy, ruff, doc tests, security audit)

Bead: bd-7dre

🤖 Generated with Claude Code

Control fields (001-009) were stored in IndexMap<String, String>, causing
repeated tags like 007 to overwrite earlier values. Changed storage to
IndexMap<String, Vec<String>> to match the Vec-based pattern already used
for data fields. Updated all layers: Rust core, serializers (ISO 2709,
JSON, MARCXML, CSV), PyO3 bindings, and Python wrapper.

Added test suite with pymarc parity checks for repeated 006/007 fields
including round-trip serialization verification.

Bead: bd-7dre

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 10, 2026

Merging this PR will degrade performance by 10.46%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 59 untouched benchmarks
⏩ 16 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_pipeline_parallel_2x_10k_threaded 44.8 ms 50 ms -10.46%

Comparing bd-7dre-repeated-control-fields (6f120ea) with main (4de08b6)2

Open in CodSpeed

Footnotes

  1. 16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (6f120ea) during the generation of this report, so 4de08b6 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

CI environments don't have pymarc. Restructured tests so the 6
mrrc-only tests (parsing, round-trip, add_field) always run, while the
4 pymarc baseline/parity tests are skipped with pytest.mark.skipif.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dchud dchud self-assigned this Apr 10, 2026
Redundant with direct access to the public control_fields map.
No callers outside its own test existed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dchud
Copy link
Copy Markdown
Owner Author

dchud commented Apr 12, 2026

Walkthrough: How this PR fixes repeated control fields

Posting this as supplemental context for reviewers — a layer-by-layer explanation of the bug and fix.


The bug

MARC records can have repeated control fields (especially 006, 007). mrrc stored control fields in IndexMap<String, String>, so add_control_field() called insert() which silently overwrote earlier values. A record with two 007s would only retain the last one.

pymarc stores all fields in a flat list, so repetition is preserved naturally. mrrc's get_fields("007") returning 1 field where pymarc returned 2 was a pymarc compatibility break.

Layer 1: Rust core (src/record.rs)

Storage changed from IndexMap<String, String> to IndexMap<String, Vec<String>>. add_control_field() now uses entry().or_default().push() instead of insert(). get_control_field() is kept as a convenience for non-repeatable tags (001, 003, 005, 008), returning the first value. For repeatable tags, callers access the public control_fields map directly. control_fields_iter() flattens across all values. Same changes applied to AuthorityRecord and HoldingsRecord.

Layer 2: Serialization (7 formats)

Every serializer updated from for (tag, value) in &record.control_fields to for (tag, values) ... for value in values. Applied across ISO 2709, JSON, MARCJSON, MARCXML, CSV, Dublin Core/MODS, and BIBFRAME.

Layer 3: PyO3 bindings (src-python/src/wrappers.rs)

New control_field_values(tag) method returns all values for a tag. control_fields() updated to flatten, returning one (tag, value) tuple per value.

Layer 4: Python wrapper (mrrc/__init__.py)

get_fields() now calls control_field_values(tag) (iterating all values) instead of control_field(tag) (single value). Both the no-args and filtered code paths were updated.

Result

After the fix, record.get_fields("007") returns the same count and values as pymarc. The test suite includes direct pymarc parity assertions (skipped in CI where pymarc isn't installed).

Follow-up cleanup (d6c248b)

Removed get_control_field_values() from Record — it was a one-liner wrapping the pub field with no callers outside its own test. Keeps the API surface clean: use get_control_field() for single-value tags, access control_fields directly for repeated tags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dchud dchud merged commit dd4c671 into main Apr 12, 2026
48 checks passed
@dchud dchud deleted the bd-7dre-repeated-control-fields branch April 12, 2026 21:17
dchud added a commit that referenced this pull request Apr 15, 2026
Add MARCReader kwargs (to_unicode, permissive, recovery_mode) and
RecoveryMode documentation to python-api.md reference. Update CHANGELOG
unreleased section with PRs #79, #80, #82 and credit @acdha.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PyMARC compatibility

1 participant