diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl index ffaa93b..7326f14 100644 --- a/.beads/issues.jsonl +++ b/.beads/issues.jsonl @@ -1,4 +1,5 @@ {"id":"bd-14jh","title":"Pymarc API compatibility: comprehensive gap analysis and implementation plan","status":"closed","priority":2,"issue_type":"feature","created_at":"2026-03-22T22:42:19.365237Z","created_by":"dchud","updated_at":"2026-03-23T00:03:45.948896Z","closed_at":"2026-03-23T00:03:45.948510Z","close_reason":"Replaced by bd-NEW — found JSON schema mismatch and additional gaps during verification","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":12,"issue_id":"bd-14jh","author":"dchud","text":"Full pymarc API surface comparison (pymarc installed via uv, verified against live introspection) identified 24 gaps organized by implementation impact. Replaces bd-qjjs. Elevated to GitHub issue #65: https://github.com/dchud/mrrc/issues/65\n\n---\n\n## Tier 1: Structural / semantic divergences\n\nThese affect the fundamental object model and break the most pymarc-compatible code.\n\n### A. Separate ControlField class\n\npymarc has no ControlField — both control and data fields are `pymarc.Field`, distinguished by `is_control_field()`. mrrc exposes a separate `ControlField` class in the Python wrapper. This breaks `isinstance(f, Field)` checks and means `get_fields()` returns mixed types.\n\n**Fix:** Unify into a single Field class in the wrapper, with `is_control_field()` dispatching behavior.\n\n### B. ControlField uses .value attribute; pymarc uses .data\n\npymarc control fields store content in `.data`. mrrc's ControlField uses `.value`. Also creates a naming collision with the `Field.value()` method (Gap G).\n\n**Fix:** Rename to `.data` when unifying Field (Gap A). Provides both pymarc compat and clears the way for `value()`.\n\n### C. 
Record convenience accessors are methods; pymarc uses @property\n\nThese 17 pymarc properties are all methods in mrrc: `title`, `author`, `isbn`, `issn`, `issn_title`, `issnl`, `subjects`, `notes`, `location`, `series`, `sudoc`, `publisher`, `pubyear`, `physicaldescription`, `uniformtitle`, `addedentries`.\n\n**Fix:** Convert to `@property` in `mrrc/__init__.py`. Rust `_inner` methods stay as-is. Breaking change for `record.title()` callers — consider deprecation shim.\n\n### D. Record.__getitem__ returns None for missing tags; pymarc raises KeyError\n\npymarc `record['999']` raises `KeyError`. mrrc returns `None`. Silently changes control flow for code using try/except KeyError.\n\n**Fix:** Raise `KeyError` for missing tags. `Record.get(tag, default)` already exists for the None-returning pattern.\n\n### E. Field.__str__ returns useless default repr\n\npymarc `str(field)` returns MARC display format: `=245 10$aThe Great Gatsby$cF. Scott Fitzgerald`. mrrc returns `` because the Python wrapper has no `__str__` and `__getattr__` doesn't intercept dunder methods. The Rust `_inner.__str__` returns `Field(245)` which is also not pymarc-compatible.\n\n**Fix:** Add `__str__` and `__repr__` to the wrapper Field class matching pymarc's format.\n\n### F. Record.as_marc() / as_marc21() missing\n\nNo way to serialize a Record to ISO 2709 bytes. MARCWriter writes to files but there's no Record-level method returning `bytes`. Critical for round-trip workflows.\n\n**Fix:** Implement in Rust (pure serialization, no GIL needed), expose via PyO3, add wrapper method.\n\n---\n\n## Tier 2: Missing methods and signature mismatches\n\n### G. Field.value() missing\n\npymarc `field.value()` returns space-joined subfield values (or `.data` for control fields). mrrc has no equivalent. Depends on Gap B being resolved first (`.value` attribute → `.data`).\n\n**Fix:** Add `def value(self) -> str` to the unified Field class.\n\n### H. 
Field.format_field() missing\n\npymarc `field.format_field()` returns human-readable text without indicators or subfield codes. Used internally by Record property helpers like `title`.\n\n**Fix:** Add `def format_field(self) -> str` to Field.\n\n### I. Record.as_json(**kwargs) missing\n\npymarc `record.as_json(**kwargs)` wraps `json.dumps(record.as_dict(), **kwargs)`. mrrc only has `to_json()` with no kwargs. Keep `to_json()` in our public API.\n\n**Fix:** Add `as_json(**kwargs)` built on `as_dict()`.\n\n### J. Record.as_dict() missing\n\npymarc `record.as_dict()` returns a plain dict. `as_json()` is built on it.\n\n**Fix:** Implement `as_dict()` in the wrapper, then build `as_json()` on top.\n\n### K. parse_xml_to_array() missing\n\npymarc top-level function accepting file paths, file handles, and StringIO. mrrc has `xml_to_records()` (string-only input).\n\n**Fix:** Add `parse_xml_to_array()` to `mrrc/__init__.py`. GIL policy: file paths and string content → entirely Rust, zero GIL. Python file objects → GIL only for `.read()`, then Rust parsing. Document preferred path-based usage with explicit examples and comments explaining why, consistent with MARCReader docs.\n\n### L. add_field(*fields) signature mismatch\n\npymarc `add_field(self, *fields)` accepts multiple fields. mrrc accepts exactly one.\n\n**Fix:** Change signature to `*fields` with a loop.\n\n### M. remove_field(*fields) + remove_fields(*tags)\n\npymarc `remove_field` accepts multiple fields. mrrc accepts one (Field or str). pymarc also has `remove_fields(*tags)` for bulk removal by tag — mrrc has no equivalent.\n\n**Fix:** Change `remove_field` to accept `*fields`. Add `remove_fields(*tags)`.\n\n### N. add_ordered_field() / add_grouped_field() missing\n\npymarc has both for inserting fields in tag-sorted or tag-grouped positions.\n\n**Fix:** Implement in wrapper. `add_ordered_field` inserts maintaining tag sort order. `add_grouped_field` inserts after the last field with the same tag.\n\n### O. 
Field.add_subfield() missing pos parameter\n\npymarc: `add_subfield(self, code, value, pos=None)`. mrrc only supports append.\n\n**Fix:** Add optional `pos` parameter. May need Rust-side support for positional insert.\n\n### P. Field.as_marc() / as_marc21() missing\n\npymarc can serialize individual fields to ISO 2709 bytes.\n\n**Fix:** Implement in Rust, expose via PyO3. No GIL needed.\n\n---\n\n## Tier 3: Naming and return type mismatches\n\n### Q. Property name mismatches\n\n- pymarc `physicaldescription` → mrrc `physical_description` (extra underscore)\n- pymarc `uniformtitle` → mrrc `uniform_title` (extra underscore)\n- pymarc `addedentries` → mrrc: missing entirely (returns 700/710/711/730 fields)\n\n**Fix:** Add pymarc-compatible names as properties (Gap C). Keep mrrc names as aliases if desired.\n\n### R. pubyear return type\n\npymarc returns `Optional[str]` (e.g. `\"2020\"`). mrrc returns `Optional[int]` (`2020`). Confirmed via live test.\n\n**Fix:** Return `str` from the property wrapper to match pymarc.\n\n---\n\n## Tier 4: Lower priority\n\nThese are real gaps but unlikely to block most pymarc-compatible code.\n\n**S.** `decode_marc()` — semi-internal, populates Record from raw bytes. Uncommon in user code.\n**T.** Reader/writer variants — JSONReader, JSONWriter, MARCMakerReader, XMLWriter, TextWriter.\n**U.** Exception hierarchy — pymarc has ~15 specific exception classes. mrrc exposes generic errors.\n**V.** MARC8 character encoding support.\n**W.** Convenience functions — `map_records()`, `map_xml()`, `parse_json_to_array()`, `record_to_xml_node()`.\n**X.** Constants — `LEADER_LEN`, `DIRECTORY_ENTRY_LEN`, `END_OF_FIELD`, `END_OF_RECORD`, `SUBFIELD_INDICATOR`, etc.\n\n---\n\n## GIL policy\n\nAvoid holding the GIL whenever possible.\n\n- **as_json(), as_marc(), as_marc21(), as_dict():** In-memory, single record — no GIL concern.\n- **parse_xml_to_array():** File paths and string content → entirely Rust, zero GIL. 
Python file objects → GIL only for `.read()`, minimize window.\n- **Field.value(), format_field(), as_marc():** Pure computation, no GIL concern.\n\n## Documentation\n\nFor parse_xml_to_array() and any other new functions accepting flexible input types, document the preferred GIL-free approach with explicit examples and comments, consistent with existing MARCReader docs:\n\n```python\n# Preferred: file path — read + parse entirely in Rust, zero GIL\nrecords = mrrc.parse_xml_to_array(\"collection.xml\")\n\n# Also good: string content — parse entirely in Rust, zero GIL\nrecords = mrrc.parse_xml_to_array(xml_string)\n\n# Works but slower: Python file object requires GIL for .read()\nwith open(\"collection.xml\") as f:\n records = mrrc.parse_xml_to_array(f)\n```","created_at":"2026-03-22T22:43:05Z"},{"id":13,"issue_id":"bd-14jh","author":"dchud","text":"Correction: Gap V (MARC8 encoding) removed — mrrc has had MARC8 support for a long time. pymarc exports specific convenience names (MARC8ToUnicode, marc8_to_unicode, map_marc8_field, map_marc8_record) that mrrc doesn't re-export at the top level, but the underlying functionality is there. Not a real gap.","created_at":"2026-03-22T23:59:47Z"},{"id":14,"issue_id":"bd-14jh","author":"dchud","text":"Correction to the correction: Gap V IS a real gap. mrrc has MARC8 support in the Rust core but exports NOTHING MARC8-related at the Python level. pymarc exports 4 things:\n\n- MARC8ToUnicode class — stateful converter (G0/G1 charset tracking)\n- marc8_to_unicode(marc8, hide_utf8_warnings=False) → str — standalone conversion function\n- map_marc8_field(field) → Field — convert a single field's MARC8 data to Unicode\n- map_marc8_record(record) → Record — convert all fields in a record\n\nFix: Expose mrrc's Rust MARC8 functionality via PyO3 bindings, then add pymarc-compatible wrapper names in mrrc/__init__.py. 
The Rust implementation likely handles the actual conversion already; the gap is purely in the Python API surface.","created_at":"2026-03-23T00:01:23Z"}]} +{"id":"bd-1dr9","title":"Audit codebase for redundant or overly complex code","status":"in_progress","priority":2,"issue_type":"task","created_at":"2026-04-12T21:48:36.603192Z","created_by":"dchud","updated_at":"2026-04-12T22:03:49.539981Z","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":20,"issue_id":"bd-1dr9","author":"dchud","text":"Step through Rust core (src/) and Python layers (src-python/, mrrc/) for: redundant code paths, overly complex abstractions, dead code, duplicated logic across layers, anything that can be simplified. Do not sacrifice performance or pymarc API compatibility. Goal: easier for new devs and agents to understand. Prior example: get_control_field_values() removed in PR #79 as redundant accessor. --json","created_at":"2026-04-12T21:49:00Z"},{"id":21,"issue_id":"bd-1dr9","author":"dchud","text":"Elevated to GitHub issue #81: https://github.com/dchud/mrrc/issues/81 --json","created_at":"2026-04-12T22:03:49Z"}]} {"id":"bd-1uo8","title":"Add Field.is_control_field() for pymarc compatibility","description":"pymarc's Field.is_control_field() returns True if the field is a control field (tag 001-009). In mrrc, control fields and data fields are separate types (ControlField vs Field), so there's no is_control_field() method on either.\n\nFor pymarc compatibility, add is_control_field() to both classes in mrrc/__init__.py:\n- ControlField.is_control_field() -> returns True\n- Field.is_control_field() -> returns False\n\nThis mirrors pymarc's unified Field type where is_control_field() checks the tag. 
Since mrrc already separates the types, the return values are static per class.\n\npymarc also has Field.is_subject_field() (checks tag == '6xx') — consider adding that too for completeness, though it's less commonly used.\n\nPure Python wrapper additions — no Rust changes needed. Update _mrrc.pyi stubs accordingly.","status":"closed","priority":3,"issue_type":"feature","created_at":"2026-03-04T21:16:27.816511Z","created_by":"dchud","updated_at":"2026-03-04T21:53:01.092023Z","closed_at":"2026-03-04T21:53:01.092013Z","close_reason":"Fixed in PR #60, merged","source_repo":".","compaction_level":0,"original_size":0} {"id":"bd-264q","title":"Add __str__ and __repr__ to Python Record wrapper","description":"The Python Record wrapper class in mrrc/__init__.py does not define __str__ or __repr__. Python's data model skips __getattr__ for dunder methods, so str(rec) returns the default '' instead of delegating to the Rust PyRecord's __str__ (which returns 'Record(type=a)'). Need to add explicit __str__ and __repr__ methods that delegate to self._inner. Reported in https://github.com/dchud/mrrc/issues/57 comment.","status":"closed","priority":2,"issue_type":"bug","created_at":"2026-03-04T21:06:29.006105Z","created_by":"dchud","updated_at":"2026-03-04T21:53:01.013120Z","closed_at":"2026-03-04T21:53:01.013107Z","close_reason":"Fixed in PR #60, merged","source_repo":".","compaction_level":0,"original_size":0} {"id":"bd-3kee","title":"Add Record.get(tag) alias for pymarc compatibility","description":"pymarc's Record.get(tag, default=None) returns the first field matching a tag, or a default value. mrrc has the equivalent as Record.get_field(tag) but lacks the .get() alias.\n\nAdd a Record.get() method to the Python wrapper in mrrc/__init__.py that delegates to get_field() with an optional default parameter:\n\n def get(self, tag: str, default=None):\n return self.get_field(tag) or default\n\nThis is a pure Python wrapper addition — no Rust changes needed. 
Update _mrrc.pyi stubs accordingly.\n\npymarc source: Record.get() delegates to Record.__getitem__ with a try/except KeyError fallback.","status":"closed","priority":3,"issue_type":"feature","created_at":"2026-03-04T21:16:19.327037Z","created_by":"dchud","updated_at":"2026-03-04T21:53:01.068246Z","closed_at":"2026-03-04T21:53:01.068233Z","close_reason":"Fixed in PR #60, merged","source_repo":".","compaction_level":0,"original_size":0} @@ -12,7 +13,7 @@ {"id":"bd-vc06","title":"Add negation flag to SubfieldValueQuery","status":"closed","priority":2,"issue_type":"feature","created_at":"2026-03-18T23:57:42.630104Z","created_by":"dchud","updated_at":"2026-04-06T00:39:18.974586Z","closed_at":"2026-04-06T00:39:18.974381Z","close_reason":"Completed — merged in PR #74","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":3,"issue_id":"bd-vc06","author":"dchud","text":"## Motivation\n\nFollow-up to bd-w9cm (negation on SubfieldPatternQuery). The same use case applies to exact and partial string matching: users scanning large record sets for fields where a subfield does NOT equal or contain a value currently must filter in Python, incurring FFI overhead.\n\n## Design\n\nMirror the approach from bd-w9cm. 
SubfieldValueQuery already has a `partial: bool` flag — add `negate: bool` alongside it.\n\n### Rust core (`src/field_query.rs`)\n\n- Add `pub negate: bool` to `SubfieldValueQuery`\n- Update `new()` and `partial()` constructors to set `negate: false`\n- Add `pub fn negated(tag, subfield_code, value) -> Self` and `pub fn partial_negated(tag, subfield_code, value) -> Self` constructors\n- Update `matches()`:\n\n field.get_subfield(self.subfield_code).is_some_and(|value| {\n let matched = if self.partial {\n value.contains(self.value.as_str())\n } else {\n value == self.value.as_str()\n };\n matched != self.negate\n })\n\n### PyO3 bindings (`src-python/src/query.rs`)\n\n- Add `negate` as an optional keyword argument to `PySubfieldValueQuery::new()` (default `False`)\n- Add `#[getter] fn negate(&self) -> bool`\n- Update `__repr__` to include `negate=true` when set\n\n### Python stubs (`mrrc/_mrrc.pyi`)\n\n- Update `SubfieldValueQuery.__init__` signature: add `negate: bool = False`\n- Add `negate` property\n\n### Tests\n\n- Rust unit tests: negated exact match, negated partial match, negated with missing subfield (returns false)\n- Python tests in `tests/python/test_query_dsl.py`: negated value query via kwarg, verify property, integration with `fields_matching_value()`\n\n### Documentation\n\n- Update `docs/guides/query-dsl.md` (SubfieldValueQuery section)\n- Update `docs/tutorials/python/querying-fields.md` if SubfieldValueQuery is covered\n- Update `mrrc/_mrrc.pyi` stubs\n\n### Edge cases\n\nSame as bd-w9cm:\n- Missing subfield + negate=true → false (subfield must exist)\n- Repeating subfields: only the first value is checked (existing `get_subfield` behavior)\n\n### Sequencing\n\nImplement after bd-w9cm lands so both query types get consistent API and we can share test patterns.","created_at":"2026-03-18T23:58:00Z"},{"id":5,"issue_id":"bd-vc06","author":"dchud","text":"Elevated to GitHub issue #64: 
https://github.com/dchud/mrrc/issues/64","created_at":"2026-03-18T23:59:36Z"}]} {"id":"bd-w9cm","title":"Add negation flag to SubfieldPatternQuery","status":"closed","priority":2,"issue_type":"feature","created_at":"2026-03-18T23:53:40.703765Z","created_by":"dchud","updated_at":"2026-04-06T00:13:06.687346Z","closed_at":"2026-04-06T00:13:06.687035Z","close_reason":"Completed — merged in PR #73","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":1,"issue_id":"bd-w9cm","author":"dchud","text":"## Motivation\n\nA user scanning ~250k records needed to find ones where a subfield did NOT match a regex. Currently this requires Python-side filtering (`not re.match(...)`) which means:\n- Extra string copies across the Rust/Python FFI boundary for every record\n- GIL overhead on each field value that crosses into Python\n- Inability to short-circuit inside Rust iteration\n\nAdding a `negate` flag pushes the inversion into Rust so non-matching fields never cross the FFI boundary.\n\n## Design\n\nFollow the existing `partial` flag pattern on `SubfieldValueQuery` — add a `negate: bool` field defaulting to `false`.\n\n### Rust core (`src/field_query.rs`)\n\n- Add `pub negate: bool` to `SubfieldPatternQuery`\n- Update `new()` to set `negate: false`\n- Add `pub fn negated(tag, subfield_code, pattern) -> Result` constructor (sets `negate: true`)\n- Update `matches()`: XOR the regex result with `self.negate` so that when negated, a match means \"subfield exists but does NOT match the pattern\"\n\n### PyO3 bindings (`src-python/src/query.rs`)\n\n- Add `negate` as an optional keyword argument to `PySubfieldPatternQuery::new()` (default `False`)\n- Add `#[getter] fn negate(&self) -> bool`\n- Update `__repr__` to include `negate=true` when set\n\n### Python stubs (`mrrc/_mrrc.pyi`)\n\n- Update `SubfieldPatternQuery.__init__` signature: add `negate: bool = False`\n- Add `negate` property\n\n### Tests\n\n- Rust unit tests in `src/field_query.rs`: negated match, 
negated non-match, negated with missing subfield (should return false — subfield must exist)\n- Python tests in `tests/python/test_query_dsl.py`: negated pattern query via kwarg, verify `negate` property, integration with `fields_matching_pattern()`\n\n### Documentation\n\n- Update `docs/guides/query-dsl.md` with negation examples\n- Update `docs/tutorials/python/querying-fields.md` if SubfieldPatternQuery is covered there\n- Update API reference docs if they document SubfieldPatternQuery parameters\n\n## Edge case: missing subfield\n\nWhen `negate=true` and the subfield doesn't exist on a field, `matches()` should return `false`. Negation means \"the subfield exists but its value doesn't match\" — not \"the subfield is absent.\" This matches user intent (filtering by value, not by presence).\n\n## Scope\n\nThis bead covers the `negate` flag only. A broader query composition framework (NotQuery, AndQuery, OrQuery) is a separate concern.","created_at":"2026-03-18T23:53:58Z"},{"id":2,"issue_id":"bd-w9cm","author":"dchud","text":"## Revision: additional detail and edge cases\n\n### Repeating subfields\n\n`get_subfield()` returns only the **first** subfield with the given code. MARC fields can have repeating subfields (e.g., multiple `$a` in a 650). With `negate=true`, the query checks whether the first value does NOT match — it does not scan all values with that code. This is consistent with the existing non-negated behavior but worth documenting explicitly since negation makes the semantics more consequential. 
If users need \"no subfield with this code matches,\" they can combine with Python-level filtering or we can address multi-value matching in a future enhancement.\n\n### matches() implementation detail\n\nThe cleanest implementation uses `!=` rather than a literal XOR:\n\n field.get_subfield(self.subfield_code)\n .is_some_and(|value| self.pattern.is_match(value) != self.negate)\n\nThis handles all cases correctly:\n- subfield exists, pattern matches, negate=false → true (correct)\n- subfield exists, pattern doesn't match, negate=false → false (correct)\n- subfield exists, pattern matches, negate=true → false (correct: match excluded)\n- subfield exists, pattern doesn't match, negate=true → true (correct: non-match included)\n- subfield missing, any negate value → false (is_some_and short-circuits)\n\nNo extra branching needed for the missing-subfield edge case.\n\n### Follow-up\n\nThe same `negate` flag would be useful on `SubfieldValueQuery` (same user scenario, exact/partial matching). 
That should be a separate bead to keep scope tight.\n\n### Documentation locations confirmed\n\n- `docs/guides/query-dsl.md` — SubfieldPatternQuery section at line 108, examples at line 228\n- `docs/tutorials/python/querying-fields.md` — two usage examples at lines 80, 85\n- `mrrc/_mrrc.pyi` — stub at line 295\n- No separate API reference docs beyond the stubs; mkdocs-material may auto-generate from docstrings","created_at":"2026-03-18T23:56:36Z"},{"id":4,"issue_id":"bd-w9cm","author":"dchud","text":"Elevated to GitHub issue #63: https://github.com/dchud/mrrc/issues/63","created_at":"2026-03-18T23:59:36Z"}]} {"id":"bd-wifu","title":"Pymarc API compatibility: verified gap analysis and implementation plan","status":"closed","priority":2,"issue_type":"feature","created_at":"2026-03-23T00:03:50.574638Z","created_by":"dchud","updated_at":"2026-04-05T21:33:52.664164Z","closed_at":"2026-04-05T21:33:52.663939Z","close_reason":"Superseded by bd-sgwi / GitHub #71 — clean ordered plan replaces accumulated analysis","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":15,"issue_id":"bd-wifu","author":"dchud","text":"Verified pymarc API gap analysis. Every gap confirmed via live introspection of pymarc and mrrc. Replaces bd-qjjs and bd-14jh. Elevated to GitHub issue #65: https://github.com/dchud/mrrc/issues/65\n\n---\n\n## Tier 1: Structural / semantic divergences\n\nThese affect the fundamental object model and break the most pymarc-compatible code.\n\n### A. Separate ControlField class\n\npymarc has NO ControlField — both control and data fields are `pymarc.Field`, distinguished by `is_control_field()`. mrrc exposes a separate `ControlField` class. Breaks `isinstance(f, Field)` checks; `get_fields()` returns mixed types in mrrc vs uniform Field in pymarc.\n\n**Fix:** Unify into a single Field class in the wrapper, with `is_control_field()` dispatching behavior internally.\n\n### B. 
Control field content: mrrc `.value` attribute vs pymarc `.data` attribute\n\npymarc control fields store content in `.data`. mrrc's ControlField uses `.value`. Code doing `field.data` on a control field fails. Also creates a naming collision with `Field.value()` method (Gap G).\n\n**Fix:** Use `.data` when unifying Field (Gap A). Provides pymarc compat and clears the way for `value()` method.\n\n### C. Record convenience accessors are methods; pymarc uses @property\n\nAll 17 verified as properties in pymarc, methods in mrrc: `title`, `author`, `isbn`, `issn`, `issn_title`, `issnl`, `subjects`, `notes`, `location`, `series`, `sudoc`, `publisher`, `pubyear`. Plus 3 missing entirely from mrrc: `physicaldescription`, `uniformtitle`, `addedentries` (see Gap Q).\n\n**Fix:** Convert to `@property` in `mrrc/__init__.py`. Rust `_inner` methods stay as-is. Breaking change for `record.title()` callers — consider deprecation shim.\n\n### D. Record.__getitem__ returns None for missing tags; pymarc raises KeyError\n\nVerified: pymarc `record['999']` raises `KeyError`. mrrc returns `None`. Silently changes control flow for code using try/except KeyError. Note: `Record.get(tag, default)` already exists in both and behaves the same (returns default).\n\n**Fix:** Raise `KeyError` for missing tags in `__getitem__`.\n\n### E. Field.__str__ returns useless default repr\n\nVerified: pymarc `str(field)` → `'=245 10$aThe Great Gatsby$cF. Scott Fitzgerald'`. mrrc `str(field)` → `''`. The Python wrapper has no `__str__`; `__getattr__` doesn't intercept dunder methods. The Rust `_inner.__str__` returns `'Field(245)'` which is also not pymarc-compatible.\n\n**Fix:** Add `__str__` and `__repr__` to the wrapper Field class matching pymarc's MARC display format.\n\n### F. Record.as_marc() / as_marc21() missing\n\nVerified: pymarc `record.as_marc()` returns `bytes` (ISO 2709). mrrc has no equivalent — `MARCWriter.write_record()` writes to a file, returns `None`. 
Critical for round-trip workflows.\n\n**Fix:** Implement in Rust (pure serialization, no GIL needed), expose via PyO3, add wrapper method. `as_marc21()` should be an alias.\n\n### G. JSON serialization format mismatch\n\nVerified structural differences between pymarc `as_json()` and mrrc `to_json()`:\n\n**pymarc** (follows MARC-in-JSON / code4lib proposal):\n```json\n{\"leader\": \"...\", \"fields\": [{\"245\": {\"ind1\": \"1\", \"ind2\": \"0\", \"subfields\": [{\"a\": \"Title\"}, {\"c\": \"Author\"}]}}]}\n```\n\n**mrrc**:\n```json\n[{\"leader\": \"...\"}, {\"245\": {\"ind1\": \"1\", \"ind2\": \"0\", \"subfields\": {\"a\": \"Title\", \"c\": \"Author\"}}}]\n```\n\nTwo differences: (1) Top-level: pymarc uses `{leader, fields}` object, mrrc uses flat array. (2) Subfields: pymarc uses array of single-key objects (preserves order, allows duplicate codes), mrrc uses single object (loses duplicate subfield codes).\n\n**Fix:** `as_json()` and `as_dict()` must produce pymarc-compatible format. Keep `to_json()` as mrrc's native format. This likely needs Rust-side work or a Python-side reformatter.\n\n---\n\n## Tier 2: Missing methods and signature mismatches\n\n### H. Field.value() missing\n\nVerified: pymarc `field.value()` returns space-joined subfield values (or `.data` for control fields). mrrc Field has no `value` attribute or method. Depends on Gap B (renaming `.value` → `.data` on control fields).\n\n**Fix:** Add `def value(self) -> str` to the unified Field class.\n\n### I. Field.format_field() missing\n\nVerified: pymarc `field.format_field()` returns human-readable text without indicators or subfield codes. For the test data, `value()` and `format_field()` return the same string, but they can differ for fields with special formatting.\n\n**Fix:** Add `def format_field(self) -> str` to Field.\n\n### J. Record.as_json(**kwargs) and Record.as_dict() missing\n\nVerified: pymarc has both. 
`as_dict()` returns a plain dict; `as_json(**kwargs)` wraps `json.dumps(as_dict(), **kwargs)`. mrrc has neither. Note: `as_json()` must produce the pymarc-compatible JSON schema (Gap G), not mrrc's native format.\n\n**Fix:** Implement `as_dict()` producing pymarc-compatible structure, then `as_json(**kwargs)` on top. Keep mrrc's `to_json()` as its own format.\n\n### K. parse_xml_to_array() missing\n\nVerified: pymarc `parse_xml_to_array(xml_file, strict=False, normalize_form=None)` accepts file paths, open file handles, and StringIO. mrrc has `xml_to_records(str)` only.\n\n**Fix:** Add `parse_xml_to_array()` to `mrrc/__init__.py`. GIL policy: file paths and string content → entirely Rust, zero GIL. Python file objects → GIL only for `.read()`, minimize window. Document preferred path-based usage with explicit examples and comments explaining why, consistent with MARCReader docs:\n\n```python\n# Preferred: file path — read + parse entirely in Rust, zero GIL\nrecords = mrrc.parse_xml_to_array(\"collection.xml\")\n\n# Also good: string content — parse entirely in Rust, zero GIL\nrecords = mrrc.parse_xml_to_array(xml_string)\n\n# Works but slower: Python file object requires GIL for .read()\nwith open(\"collection.xml\") as f:\n records = mrrc.parse_xml_to_array(f)\n```\n\n### L. add_field(*fields) signature mismatch\n\nVerified: pymarc `add_field(self, *fields)`. mrrc `add_field(self, field: Field) -> None`.\n\n**Fix:** Change signature to accept `*fields` with a loop.\n\n### M. remove_field(*fields) + remove_fields(*tags)\n\nVerified: pymarc `remove_field(self, *fields) -> None` accepts multiple. mrrc `remove_field(self, field: Union[Field, str]) -> List[Field]` accepts one. pymarc also has `remove_fields(self, *tags) -> None` for bulk removal by tag string — mrrc has no equivalent. Also note: pymarc returns None; mrrc returns the removed fields.\n\n**Fix:** Change `remove_field` to accept `*fields`. Add `remove_fields(*tags)`. 
Consider whether to keep mrrc's return value (non-breaking addition).\n\n### N. add_ordered_field() / add_grouped_field() missing\n\nVerified: both exist in pymarc, neither in mrrc.\n\n**Fix:** Implement in wrapper. `add_ordered_field` inserts maintaining tag sort order. `add_grouped_field` inserts after the last field with the same tag.\n\n### O. Field.add_subfield() missing pos parameter\n\nVerified: pymarc `add_subfield(self, code, value, pos=None)`. mrrc `add_subfield(self, code, value)`.\n\n**Fix:** Add optional `pos` parameter. May need Rust-side support for positional insert.\n\n### P. Field.as_marc() / as_marc21() missing\n\nVerified: pymarc Field has both, mrrc Field has neither.\n\n**Fix:** Implement in Rust, expose via PyO3. No GIL needed. `as_marc21` as alias.\n\n### Q. Field.linkage_occurrence_num() missing\n\nVerified: pymarc `field.linkage_occurrence_num() -> Optional[str]` extracts the occurrence number from subfield $6 linkage.\n\n**Fix:** Implement in wrapper or Rust.\n\n---\n\n## Tier 3: Naming and return type mismatches\n\n### R. Property name mismatches\n\nVerified:\n- pymarc `physicaldescription` → mrrc has `physical_description` (underscore) but NOT `physicaldescription`\n- pymarc `uniformtitle` → mrrc has `uniform_title` (underscore) but NOT `uniformtitle`\n- pymarc `addedentries` → mrrc: missing entirely (returns 700/710/711/730 fields)\n\n**Fix:** Add pymarc-compatible names as `@property` (part of Gap C). Keep mrrc underscore names as aliases.\n\n### S. pubyear return type\n\nVerified: mrrc returns `Optional[int]` (e.g. `2020`). pymarc returns `Optional[str]` (e.g. `\"2020\"`).\n\n**Fix:** Return `str` from the property wrapper.\n\n### T. Field.convert_legacy_subfields() missing\n\nVerified: pymarc has `Field.convert_legacy_subfields(subfields: list[str]) -> list[Subfield]` as a classmethod. Converts old `['a', 'value', 'b', 'value']` format to Subfield namedtuples.\n\n**Fix:** Add classmethod to Field wrapper. 
Low priority — only needed for migrating old pymarc code.\n\n---\n\n## Tier 4: Lower priority\n\nThese are real gaps but unlikely to block most pymarc-compatible code.\n\n### U. Record.as_marc() / as_marc21() (already covered as Gap F)\n\n### V. decode_marc()\n\nVerified missing. Semi-internal — populates Record from raw bytes. Uncommon in user code but used by some advanced pymarc scripts.\n\n**Fix:** Could wrap mrrc's MARCReader to parse a single record from bytes.\n\n### W. Reader/writer variants\n\nVerified all missing: JSONReader, JSONWriter, MARCMakerReader, XMLWriter, TextWriter, Reader (base), Writer (base).\n\n**Fix:** Implement as needed. JSONReader/JSONWriter most likely to be used.\n\n### X. Exception hierarchy\n\nVerified: pymarc has 10 specific exception classes (`BadSubfieldCodeWarning`, `BaseAddressInvalid`, `BaseAddressNotFound`, `EndOfRecordNotFound`, `FatalReaderError`, `FieldNotFound`, `PymarcException`, `RecordDirectoryInvalid`, `RecordLeaderInvalid`, `RecordLengthInvalid`). mrrc exports zero exception classes.\n\n**Fix:** Define exception hierarchy in Python wrapper. Map Rust errors to specific exception types.\n\n### Y. MARC8 Python API\n\nVerified: mrrc has MARC8 support in Rust core but exports nothing at the Python level. pymarc exports: `MARC8ToUnicode` class, `marc8_to_unicode()` function, `map_marc8_field()`, `map_marc8_record()`.\n\n**Fix:** Expose Rust MARC8 functionality via PyO3, add pymarc-compatible wrapper names.\n\n### Z. Convenience functions\n\nVerified all missing: `map_records()`, `map_xml()`, `parse_json_to_array()`, `record_to_xml_node()`, `normalize_subfield_code()`.\n\n**Fix:** Implement as wrappers. `parse_json_to_array()` most likely to be used.\n\n### AA. Constants\n\nVerified all missing: `LEADER_LEN`, `DIRECTORY_ENTRY_LEN`, `END_OF_FIELD`, `END_OF_RECORD`, `SUBFIELD_INDICATOR`, `MARC_XML_NS`, `MARC_XML_SCHEMA`.\n\n**Fix:** Export from `mrrc/__init__.py`. These are just integer/string constants.\n\n### BB. 
Record instance attributes\n\npymarc Record has `force_utf8`, `pos`, `to_unicode` as instance attributes (member descriptors). These are used during decode and by some user code. mrrc has none of these.\n\n**Fix:** Low priority. `pos` (file position) could be useful; `force_utf8`/`to_unicode` are decode-time flags.\n\n---\n\n## GIL policy\n\nAvoid holding the GIL whenever possible.\n\n- **as_json(), as_dict(), as_marc(), as_marc21():** In-memory, single record — no GIL concern.\n- **parse_xml_to_array():** File paths and string content → entirely Rust, zero GIL. Python file objects → GIL only for `.read()`, minimize window.\n- **Field.value(), format_field(), as_marc():** Pure computation, no GIL concern.","created_at":"2026-03-23T00:05:04Z"}]} -{"id":"bd-y331","title":"Add pymarc-compatible to_unicode and permissive flags to MARCReader","status":"in_progress","priority":3,"issue_type":"feature","created_at":"2026-04-10T01:47:10.886181Z","created_by":"dchud","updated_at":"2026-04-12T21:37:14.182755Z","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":19,"issue_id":"bd-y331","author":"dchud","text":"Elevated to GitHub issue #78: https://github.com/dchud/mrrc/issues/78","created_at":"2026-04-10T01:48:11Z"}]} +{"id":"bd-y331","title":"Add pymarc-compatible to_unicode and permissive flags to MARCReader","status":"closed","priority":3,"issue_type":"feature","created_at":"2026-04-10T01:47:10.886181Z","created_by":"dchud","updated_at":"2026-04-12T21:59:31.638324Z","closed_at":"2026-04-12T21:59:31.638179Z","close_reason":"Fixed in PR #80, merged to main","source_repo":".","compaction_level":0,"original_size":0,"comments":[{"id":19,"issue_id":"bd-y331","author":"dchud","text":"Elevated to GitHub issue #78: https://github.com/dchud/mrrc/issues/78","created_at":"2026-04-10T01:48:11Z"}]} {"id":"mrrc-086","title":"Add parallel benchmarking suite 
(rayon/threading/multiprocessing)","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-31T17:02:57.787595Z","updated_at":"2026-01-06T16:11:40.252865Z","closed_at":"2026-01-06T16:11:40.252865Z","close_reason":"Satisfied by Phase F - Comprehensive benchmarking suite with Criterion.rs, speedup curves (2.04x @ 2-thread, 3.20x @ 4-thread), fixtures (1k/10k/100k records), and pymrrc vs Rust efficiency comparison (92%) completed.","source_repo":".","compaction_level":0,"original_size":0} {"id":"mrrc-08k","title":"Phase 3: Linked field navigation and specialized queries","description":"Phase 3 of Advanced Field Query DSL (see design/FIELD_QUERY_DSL.md).\n\nImplement specialized query patterns for linked fields, authority control, and format-specific queries.\n\nTasks:\n1. Linked field navigation (880 field linkage)\n - Implement subfield 6 parsing for linkage occurrence numbers\n - Methods:\n * field.get_linked_field(record) -> Option (880 partner)\n * record.get_all_linked_fields(field) -> Vec\n * record.get_field_pairs(tag) -> Vec<(Field, Option)> (original + 880)\n - Support for bidirectional lookups\n\n2. Authority control helpers (for authority records)\n - Methods for finding related authority headings\n - Navigate 4XX (see from) and 5XX (see also) tracings\n - Helper: find_related_headings(heading: &Field) -> Vec\n - Extract authority relationship information from 7XX fields\n\n3. Format-specific query traits\n - BibliographicQueries trait (for Record)\n - AuthorityQueries trait (for AuthorityRecord)\n - HoldingsQueries trait (for HoldingsRecord)\n - Each provides domain-specific helper methods\n\n4. Comprehensive testing\n - Test 880 field linkage with real MARC-8 encoded data\n - Test authority record navigation patterns\n - Test complex record type-specific queries\n \nDependencies: Requires Phase 1 (mrrc-9n8) and Phase 2 (mrrc-131) completed","notes":"Phase 3 Status: Linked field navigation COMPLETED (28 tests). 
Authority control helpers TODO. Format-specific traits TODO.","status":"closed","priority":4,"issue_type":"task","created_at":"2025-12-26T17:13:18.416250Z","updated_at":"2025-12-28T16:25:38.021558Z","closed_at":"2025-12-28T16:25:38.021558Z","close_reason":"All three sub-features completed: linked field navigation, authority control helpers, format-specific query traits. 282 library tests passing.","source_repo":".","compaction_level":0,"original_size":0} {"id":"mrrc-0jm","title":"Update check.sh to include mrrc-python Clippy and test checks","status":"closed","priority":1,"issue_type":"task","created_at":"2026-01-06T16:50:13.684695Z","created_by":"dchud","updated_at":"2026-01-06T16:59:17.545907Z","closed_at":"2026-01-06T16:59:17.545907Z","close_reason":"Completed","source_repo":".","compaction_level":0,"original_size":0} diff --git a/mrrc/__init__.py b/mrrc/__init__.py index 3826074..ef2f972 100644 --- a/mrrc/__init__.py +++ b/mrrc/__init__.py @@ -64,6 +64,26 @@ def _is_control_tag(tag: str) -> bool: return tag < '010' and tag.isdigit() +def _wrap_field(rust_field) -> 'Field': + """Wrap a Rust _Field in a Python Field wrapper.""" + wrapper = Field.__new__(Field) + wrapper._data = None + wrapper._inner = rust_field + return wrapper + + +def _wrap_record(rust_record) -> 'Record': + """Wrap a Rust record in a Python Record wrapper, attaching its Leader.""" + wrapper = Record(None) + wrapper._inner = rust_record + leader = Leader() + leader._rust_leader = rust_record.leader() + leader._parent_record = wrapper + wrapper._leader = leader + wrapper._leader_modified = False + return wrapper + + # Control field tags for enumeration (when we need to iterate all possible control fields) _CONTROL_TAGS = ('001', '002', '003', '004', '005', '006', '007', '008', '009') @@ -309,32 +317,13 @@ def get_subfields(self, *codes: str) -> List[str]: return result def delete_subfield(self, code: str) -> Optional[str]: - """Delete first subfield with given code and return its value.""" - try: - values = self._inner.subfields_by_code(code) - if not values: - return None - - deleted_value = values[0] - - # Find and remove the first occurrence - subfields = self._inner.subfields() - code_char = code[0] if code else '' + """Delete first subfield with given code and return its value. 
- # Create a new list without the first occurrence - new_subfields = [] - found = False - for sf in subfields: - if sf.code == code_char and not found: - found = True - continue - new_subfields.append(sf) - - # Unfortunately, we can't easily replace subfields in the current API - # This is a limitation we'll note - return deleted_value - except Exception: - return None + Matches pymarc's ``Field.delete_subfield()`` behavior: removes the + first subfield with the given code and returns its value, or ``None`` + if no subfield with that code exists. + """ + return self._inner.delete_subfield(code) def subfields_as_dict(self) -> dict: """Return subfields as dictionary mapping code to list of values.""" @@ -832,10 +821,7 @@ def __getitem__(self, tag: str) -> 'Field': # For data fields, return Field wrapper field = self._inner.get_field(tag) if field: - wrapper = Field.__new__(Field) - wrapper._data = None - wrapper._inner = field - return wrapper + return _wrap_field(field) raise KeyError(tag) def get_fields(self, *tags: str) -> List['Field']: @@ -852,10 +838,7 @@ def get_fields(self, *tags: str) -> List['Field']: for value in self._inner.control_field_values(tag): result.append(Field(tag, data=value)) for field in self._inner.fields(): - wrapper = Field.__new__(Field) - wrapper._data = None - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) else: # Return fields for specified tags for tag in tags: @@ -864,10 +847,7 @@ def get_fields(self, *tags: str) -> List['Field']: result.append(Field(tag, data=value)) else: for field in self._inner.get_fields(tag): - wrapper = Field.__new__(Field) - wrapper._data = None - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result @@ -890,9 +870,7 @@ def get_field(self, tag: str) -> Optional['Field']: """Get first field with given tag.""" field = self._inner.get_field(tag) if field: - wrapper = Field(field.tag, field.indicator1, field.indicator2) - 
wrapper._inner = field - return wrapper + return _wrap_field(field) return None def remove_field(self, *fields: Union['Field', str]) -> None: @@ -968,10 +946,7 @@ def fields(self) -> List['Field']: result.append(Field(tag, data=value)) # Include data fields for field in self._inner.fields(): - wrapper = Field.__new__(Field) - wrapper._data = None - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result @property @@ -1198,9 +1173,7 @@ def fields_by_indicator( """ result = [] for field in self._inner.fields_by_indicator(tag, indicator1=indicator1, indicator2=indicator2): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def fields_in_range(self, start_tag: str, end_tag: str) -> List['Field']: @@ -1224,9 +1197,7 @@ def fields_in_range(self, start_tag: str, end_tag: str) -> List['Field']: """ result = [] for field in self._inner.fields_in_range(start_tag, end_tag): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def fields_matching(self, query: 'FieldQuery') -> List['Field']: @@ -1249,9 +1220,7 @@ def fields_matching(self, query: 'FieldQuery') -> List['Field']: """ result = [] for field in self._inner.fields_matching(query): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def fields_matching_range(self, query: 'TagRangeQuery') -> List['Field']: @@ -1273,9 +1242,7 @@ def fields_matching_range(self, query: 'TagRangeQuery') -> List['Field']: """ result = [] for field in self._inner.fields_matching_range(query): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def 
fields_matching_pattern(self, query: 'SubfieldPatternQuery') -> List['Field']: @@ -1297,9 +1264,7 @@ def fields_matching_pattern(self, query: 'SubfieldPatternQuery') -> List['Field' """ result = [] for field in self._inner.fields_matching_pattern(query): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def fields_matching_value(self, query: 'SubfieldValueQuery') -> List['Field']: @@ -1325,9 +1290,7 @@ def fields_matching_value(self, query: 'SubfieldValueQuery') -> List['Field']: """ result = [] for field in self._inner.fields_matching_value(query): - wrapper = Field(field.tag, field.indicator1, field.indicator2) - wrapper._inner = field - result.append(wrapper) + result.append(_wrap_field(field)) return result def to_json(self) -> str: @@ -1513,15 +1476,7 @@ def __next__(self) -> Optional[Record]: if self._permissive: return None raise - wrapper = Record(None) - wrapper._inner = record - # Create a Leader wrapper from the Rust record's leader - leader = Leader() - leader._rust_leader = record.leader() - leader._parent_record = wrapper - wrapper._leader = leader - wrapper._leader_modified = False - return wrapper + return _wrap_record(record) @property def backend_type(self) -> str: diff --git a/src-python/src/authority_readers.rs b/src-python/src/authority_readers.rs index 0360ada..6bbb568 100644 --- a/src-python/src/authority_readers.rs +++ b/src-python/src/authority_readers.rs @@ -1,11 +1,10 @@ // Python wrapper for AuthorityMarcReader with multi-backend support // -// This module implements the Python interface for Authority record reading, -// with automatic backend detection (RustFile, CursorBackend, PythonFile). -// -// Authority records are specialized MARC records (Type Z) used in library -// systems for maintaining authorized headings (names, subjects, etc.). 
+// Uses shared helpers from reader_helpers.rs for source detection and +// Python file I/O. Only type-specific logic (AuthorityMarcReader, PyAuthorityRecord) +// lives here. +use crate::reader_helpers; use crate::wrappers::PyAuthorityRecord; use mrrc::authority_reader::AuthorityMarcReader; use mrrc::recovery::RecoveryMode; @@ -15,18 +14,8 @@ use std::io::Cursor; /// Internal enum for Authority reader backends enum AuthorityReaderBackend { - /// Direct file I/O via std::fs::File - /// Input: str path or pathlib.Path RustFile(AuthorityMarcReader), - - /// In-memory reads from bytes via std::io::Cursor - /// Input: bytes or bytearray - /// Enables thread-safe parallel parsing without Python interaction CursorBackend(AuthorityMarcReader>>), - - /// Python file-like object (fallback for custom types) - /// Input: Any object with .read() method - /// Requires GIL for each read operation PythonFile(Py), } @@ -37,33 +26,6 @@ enum AuthorityReaderBackend { /// - File paths → RustFile (parallel-safe) /// - Bytes/BytesIO → CursorBackend (parallel-safe) /// - Python file objects → PythonFile (requires GIL) -/// -/// # Examples -/// -/// Reading from a file path (parallel-safe): -/// ```python -/// from mrrc import AuthorityMARCReader -/// -/// # RustFile backend automatically selected -/// with AuthorityMARCReader('authorities.mrc') as reader: -/// for record in reader: -/// print(f"Heading: {record.heading_text()}") -/// ``` -/// -/// Reading from bytes (parallel-safe): -/// ```python -/// reader = AuthorityMARCReader(authority_bytes) -/// for record in reader: -/// print(f"Type: {record.record_type}") -/// ``` -/// -/// Reading from Python file object: -/// ```python -/// with open('authorities.mrc', 'rb') as f: -/// reader = AuthorityMARCReader(f) -/// for record in reader: -/// process(record) -/// ``` #[pyclass(name = "AuthorityMARCReader")] pub struct PyAuthorityMARCReader { backend: Option, @@ -80,117 +42,39 @@ impl PyAuthorityMARCReader { #[new] #[pyo3(signature = 
(source, recovery_mode = "strict"))] pub fn new(source: &Bound<'_, PyAny>, recovery_mode: &str) -> PyResult { - let rec_mode = match recovery_mode { - "lenient" => RecoveryMode::Lenient, - "permissive" => RecoveryMode::Permissive, - _ => RecoveryMode::Strict, - }; + let rec_mode = reader_helpers::parse_recovery_mode(recovery_mode)?; - // Try str path first - if let Ok(path_str) = source.extract::() { - return match File::open(&path_str) { - Ok(file) => { - let reader = AuthorityMarcReader::new(file).with_recovery_mode(rec_mode); - Ok(PyAuthorityMARCReader { - backend: Some(AuthorityReaderBackend::RustFile(reader)), - recovery_mode: rec_mode, - }) - }, - Err(e) if e.kind() == std::io::ErrorKind::NotFound => { - Err(pyo3::exceptions::PyFileNotFoundError::new_err(format!( - "No such file or directory: '{}'", - path_str - ))) - }, - Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => { - Err(pyo3::exceptions::PyPermissionError::new_err(format!( - "Permission denied: '{}'", - path_str - ))) - }, - Err(e) => Err(pyo3::exceptions::PyIOError::new_err(format!( - "Failed to open file '{}': {}", - path_str, e - ))), - }; - } - - // Try pathlib.Path - let fspath_method = source.getattr("__fspath__"); - if let Ok(method) = fspath_method { - if method.is_callable() { - if let Ok(path_obj) = method.call0() { - if let Ok(path_str) = path_obj.extract::() { - return match File::open(&path_str) { - Ok(file) => { - let reader = - AuthorityMarcReader::new(file).with_recovery_mode(rec_mode); - Ok(PyAuthorityMARCReader { - backend: Some(AuthorityReaderBackend::RustFile(reader)), - recovery_mode: rec_mode, - }) - }, - Err(e) if e.kind() == std::io::ErrorKind::NotFound => { - Err(pyo3::exceptions::PyFileNotFoundError::new_err(format!( - "No such file or directory: '{}'", - path_str - ))) - }, - Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => { - Err(pyo3::exceptions::PyPermissionError::new_err(format!( - "Permission denied: '{}'", - path_str - ))) - }, - Err(e) => 
Err(pyo3::exceptions::PyIOError::new_err(format!( - "Failed to open file '{}': {}", - path_str, e - ))), - }; - } - } - } + // Try file path (str or pathlib.Path) + if let Some(file) = reader_helpers::try_open_as_path(source)? { + let reader = AuthorityMarcReader::new(file).with_recovery_mode(rec_mode); + return Ok(PyAuthorityMARCReader { + backend: Some(AuthorityReaderBackend::RustFile(reader)), + recovery_mode: rec_mode, + }); } // Try bytes/bytearray - if let Ok(bytes_data) = source.extract::>() { - let cursor = Cursor::new(bytes_data); - let reader = AuthorityMarcReader::new(cursor).with_recovery_mode(rec_mode); + if let Some(bytes) = reader_helpers::try_extract_bytes(source)? { + let reader = AuthorityMarcReader::new(Cursor::new(bytes)).with_recovery_mode(rec_mode); return Ok(PyAuthorityMARCReader { backend: Some(AuthorityReaderBackend::CursorBackend(reader)), recovery_mode: rec_mode, }); } - // Try file-like object with .read() method - let read_method = source.getattr("read"); - if let Ok(method) = read_method { - if method.is_callable() { - return Ok(PyAuthorityMARCReader { - backend: Some(AuthorityReaderBackend::PythonFile(source.clone().unbind())), - recovery_mode: rec_mode, - }); - } + // Try Python file object + if let Some(py_obj) = reader_helpers::try_as_python_file(source)? { + return Ok(PyAuthorityMARCReader { + backend: Some(AuthorityReaderBackend::PythonFile(py_obj)), + recovery_mode: rec_mode, + }); } - // Unknown type - let type_name = source.get_type().name()?; - Err(pyo3::exceptions::PyTypeError::new_err(format!( - "Unsupported input type: {}. 
Supported types: str (file path), pathlib.Path, \ - bytes, bytearray, or file-like object (with .read() method)", - type_name - ))) + reader_helpers::unsupported_source_error(source)?; + unreachable!() } /// Read the next Authority record - /// - /// # Returns - /// * `AuthorityRecord` on success - /// * `None` if end of file reached - /// - /// # Raises - /// * `ValueError` - If record is invalid or malformed - /// * `IOError` - If read operation fails pub fn read_record(&mut self) -> PyResult> { if self.backend.is_none() { return Err(pyo3::exceptions::PyStopIteration::new_err( @@ -231,112 +115,36 @@ impl PyAuthorityMARCReader { }, AuthorityReaderBackend::PythonFile(py_obj) => { let py = unsafe { Python::assume_attached() }; - { - let obj = py_obj.bind(py); - let read_method = obj.getattr("read").map_err(|e| { - pyo3::exceptions::PyIOError::new_err(format!( - "Object missing .read() method: {}", - e - )) - })?; - - // Read leader (24 bytes) - let leader_result = read_method.call1((24usize,)).map_err(|e| { - pyo3::exceptions::PyIOError::new_err(format!( - "Failed to read leader: {}", - e - )) - })?; - - let leader: Vec = leader_result.extract().map_err(|_| { - pyo3::exceptions::PyValueError::new_err("Leader must be bytes") - })?; - - if leader.is_empty() { - return Ok(None); - } - - if leader.len() != 24 { - return Err(pyo3::exceptions::PyValueError::new_err(format!( - "Incomplete leader: expected 24 bytes, got {}", - leader.len() - ))); - } - - // Parse record length - let record_length_str = String::from_utf8_lossy(&leader[0..5]); - let record_length: usize = record_length_str.trim().parse().map_err(|_| { - pyo3::exceptions::PyValueError::new_err(format!( - "Invalid record length in leader: '{}'", - record_length_str - )) - })?; - - if record_length < 24 { - return Err(pyo3::exceptions::PyValueError::new_err(format!( - "Record length too small: {} (minimum 24)", - record_length - ))); - } - - // Read remainder - let record_data_bytes = - 
read_method.call1((record_length - 24,)).map_err(|e| { - pyo3::exceptions::PyIOError::new_err(format!( - "Failed to read record data: {}", + let bound = py_obj.bind(py); + match reader_helpers::read_record_bytes_from_python_file(bound)? { + None => Ok(None), + Some(bytes) => { + let cursor = Cursor::new(bytes); + let mut parser = + AuthorityMarcReader::new(cursor).with_recovery_mode(self.recovery_mode); + match parser.read_record() { + Ok(Some(record)) => { + self.backend = Some(AuthorityReaderBackend::PythonFile(py_obj)); + Ok(Some(PyAuthorityRecord { inner: record })) + }, + Ok(None) => Ok(None), + Err(e) => Err(pyo3::exceptions::PyValueError::new_err(format!( + "Failed to parse record: {}", e - )) - })?; - - let record_data: Vec = record_data_bytes.extract().map_err(|_| { - pyo3::exceptions::PyValueError::new_err("Record data must be bytes") - })?; - - if record_data.len() != record_length - 24 { - return Err(pyo3::exceptions::PyValueError::new_err(format!( - "Truncated record: expected {} bytes, got {}", - record_length - 24, - record_data.len() - ))); - } - - // Assemble and parse - let mut complete_record = Vec::with_capacity(record_length); - complete_record.extend_from_slice(&leader); - complete_record.extend_from_slice(&record_data); - - let cursor = std::io::Cursor::new(complete_record); - let mut parser = - AuthorityMarcReader::new(cursor).with_recovery_mode(self.recovery_mode); - - match parser.read_record() { - Ok(Some(record)) => Ok(Some(PyAuthorityRecord { inner: record })), - Ok(None) => Ok(None), - Err(e) => Err(pyo3::exceptions::PyValueError::new_err(format!( - "Failed to parse record: {}", - e - ))), - } + ))), + } + }, } }, }; - // If we hit EOF (Ok(None)), don't restore the backend - match &result { - Ok(None) => {}, // EOF, backend stays None - Err(_) => {}, // Error, backend already re-set above - Ok(Some(_)) => {}, // Success, backend already re-set above - } - result } - /// Iterator protocol support pub fn __iter__(slf: PyRef<'_, Self>) -> 
PyRef<'_, Self> { slf } - /// Iterator next pub fn __next__(mut slf: PyRefMut<'_, Self>) -> PyResult { match slf.read_record()? { Some(record) => Ok(record), @@ -344,12 +152,10 @@ impl PyAuthorityMARCReader { } } - /// Context manager support pub fn __enter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> { slf } - /// Context manager cleanup pub fn __exit__( mut slf: PyRefMut<'_, Self>, _exc_type: &Bound<'_, PyAny>, @@ -360,7 +166,6 @@ impl PyAuthorityMARCReader { Ok(false) } - /// Get string representation pub fn __repr__(&self) -> String { match &self.backend { None => "AuthorityMARCReader(closed)".to_string(), diff --git a/src-python/src/holdings_readers.rs b/src-python/src/holdings_readers.rs index c02b946..8a0644e 100644 --- a/src-python/src/holdings_readers.rs +++ b/src-python/src/holdings_readers.rs @@ -1,11 +1,10 @@ // Python wrapper for HoldingsMarcReader with multi-backend support // -// This module implements the Python interface for Holdings record reading, -// with automatic backend detection (RustFile, CursorBackend, PythonFile). -// -// Holdings records are specialized MARC records (Type x/y/v/u) used in library -// systems for maintaining inventory and location information. +// Uses shared helpers from reader_helpers.rs for source detection and +// Python file I/O. Only type-specific logic (HoldingsMarcReader, PyHoldingsRecord) +// lives here. 
+use crate::reader_helpers; use crate::wrappers::PyHoldingsRecord; use mrrc::holdings_reader::HoldingsMarcReader; use mrrc::recovery::RecoveryMode; @@ -15,55 +14,18 @@ use std::io::Cursor; /// Internal enum for Holdings reader backends enum HoldingsReaderBackend { - /// Direct file I/O via std::fs::File - /// Input: str path or pathlib.Path RustFile(HoldingsMarcReader), - - /// In-memory reads from bytes via std::io::Cursor - /// Input: bytes or bytearray - /// Enables thread-safe parallel parsing without Python interaction CursorBackend(HoldingsMarcReader>>), - - /// Python file-like object (fallback for custom types) - /// Input: Any object with .read() method - /// Requires GIL for each read operation PythonFile(Py), } /// Python wrapper for HoldingsMarcReader /// -/// Reads MARC Holdings records (Type x/y/v/u) from different sources with automatic +/// Reads MARC Holdings records from different sources with automatic /// backend selection: /// - File paths → RustFile (parallel-safe) /// - Bytes/BytesIO → CursorBackend (parallel-safe) /// - Python file objects → PythonFile (requires GIL) -/// -/// # Examples -/// -/// Reading from a file path (parallel-safe): -/// ```python -/// from mrrc import HoldingsMARCReader -/// -/// # RustFile backend automatically selected -/// with HoldingsMARCReader('holdings.mrc') as reader: -/// for record in reader: -/// print(f"Locations: {len(record.locations())}") -/// ``` -/// -/// Reading from bytes (parallel-safe): -/// ```python -/// reader = HoldingsMARCReader(holdings_bytes) -/// for record in reader: -/// print(f"Type: {record.record_type}") -/// ``` -/// -/// Reading from Python file object: -/// ```python -/// with open('holdings.mrc', 'rb') as f: -/// reader = HoldingsMARCReader(f) -/// for record in reader: -/// process(record) -/// ``` #[pyclass(name = "HoldingsMARCReader")] pub struct PyHoldingsMARCReader { backend: Option, @@ -80,117 +42,39 @@ impl PyHoldingsMARCReader { #[new] #[pyo3(signature = (source, 
recovery_mode = "strict"))] pub fn new(source: &Bound<'_, PyAny>, recovery_mode: &str) -> PyResult { - let rec_mode = match recovery_mode { - "lenient" => RecoveryMode::Lenient, - "permissive" => RecoveryMode::Permissive, - _ => RecoveryMode::Strict, - }; + let rec_mode = reader_helpers::parse_recovery_mode(recovery_mode)?; - // Try str path first - if let Ok(path_str) = source.extract::() { - return match File::open(&path_str) { - Ok(file) => { - let reader = HoldingsMarcReader::new(file).with_recovery_mode(rec_mode); - Ok(PyHoldingsMARCReader { - backend: Some(HoldingsReaderBackend::RustFile(reader)), - recovery_mode: rec_mode, - }) - }, - Err(e) if e.kind() == std::io::ErrorKind::NotFound => { - Err(pyo3::exceptions::PyFileNotFoundError::new_err(format!( - "No such file or directory: '{}'", - path_str - ))) - }, - Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => { - Err(pyo3::exceptions::PyPermissionError::new_err(format!( - "Permission denied: '{}'", - path_str - ))) - }, - Err(e) => Err(pyo3::exceptions::PyIOError::new_err(format!( - "Failed to open file '{}': {}", - path_str, e - ))), - }; - } - - // Try pathlib.Path - let fspath_method = source.getattr("__fspath__"); - if let Ok(method) = fspath_method { - if method.is_callable() { - if let Ok(path_obj) = method.call0() { - if let Ok(path_str) = path_obj.extract::() { - return match File::open(&path_str) { - Ok(file) => { - let reader = - HoldingsMarcReader::new(file).with_recovery_mode(rec_mode); - Ok(PyHoldingsMARCReader { - backend: Some(HoldingsReaderBackend::RustFile(reader)), - recovery_mode: rec_mode, - }) - }, - Err(e) if e.kind() == std::io::ErrorKind::NotFound => { - Err(pyo3::exceptions::PyFileNotFoundError::new_err(format!( - "No such file or directory: '{}'", - path_str - ))) - }, - Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => { - Err(pyo3::exceptions::PyPermissionError::new_err(format!( - "Permission denied: '{}'", - path_str - ))) - }, - Err(e) => 
Err(pyo3::exceptions::PyIOError::new_err(format!( - "Failed to open file '{}': {}", - path_str, e - ))), - }; - } - } - } + // Try file path (str or pathlib.Path) + if let Some(file) = reader_helpers::try_open_as_path(source)? { + let reader = HoldingsMarcReader::new(file).with_recovery_mode(rec_mode); + return Ok(PyHoldingsMARCReader { + backend: Some(HoldingsReaderBackend::RustFile(reader)), + recovery_mode: rec_mode, + }); } // Try bytes/bytearray - if let Ok(bytes_data) = source.extract::>() { - let cursor = Cursor::new(bytes_data); - let reader = HoldingsMarcReader::new(cursor).with_recovery_mode(rec_mode); + if let Some(bytes) = reader_helpers::try_extract_bytes(source)? { + let reader = HoldingsMarcReader::new(Cursor::new(bytes)).with_recovery_mode(rec_mode); return Ok(PyHoldingsMARCReader { backend: Some(HoldingsReaderBackend::CursorBackend(reader)), recovery_mode: rec_mode, }); } - // Try file-like object with .read() method - let read_method = source.getattr("read"); - if let Ok(method) = read_method { - if method.is_callable() { - return Ok(PyHoldingsMARCReader { - backend: Some(HoldingsReaderBackend::PythonFile(source.clone().unbind())), - recovery_mode: rec_mode, - }); - } + // Try Python file object + if let Some(py_obj) = reader_helpers::try_as_python_file(source)? { + return Ok(PyHoldingsMARCReader { + backend: Some(HoldingsReaderBackend::PythonFile(py_obj)), + recovery_mode: rec_mode, + }); } - // Unknown type - let type_name = source.get_type().name()?; - Err(pyo3::exceptions::PyTypeError::new_err(format!( - "Unsupported input type: {}. 
Supported types: str (file path), pathlib.Path, \
-                 bytes, bytearray, or file-like object (with .read() method)",
-                type_name
-            )))
+        reader_helpers::unsupported_source_error(source)?;
+        unreachable!()
     }
 
     /// Read the next Holdings record
-    ///
-    /// # Returns
-    /// * `HoldingsRecord` on success
-    /// * `None` if end of file reached
-    ///
-    /// # Raises
-    /// * `ValueError` - If record is invalid or malformed
-    /// * `IOError` - If read operation fails
     pub fn read_record(&mut self) -> PyResult<Option<PyHoldingsRecord>> {
         if self.backend.is_none() {
             return Err(pyo3::exceptions::PyStopIteration::new_err(
@@ -231,112 +115,36 @@ impl PyHoldingsMARCReader {
             },
             HoldingsReaderBackend::PythonFile(py_obj) => {
                 let py = unsafe { Python::assume_attached() };
-                {
-                    let obj = py_obj.bind(py);
-                    let read_method = obj.getattr("read").map_err(|e| {
-                        pyo3::exceptions::PyIOError::new_err(format!(
-                            "Object missing .read() method: {}",
-                            e
-                        ))
-                    })?;
-
-                    // Read leader (24 bytes)
-                    let leader_result = read_method.call1((24usize,)).map_err(|e| {
-                        pyo3::exceptions::PyIOError::new_err(format!(
-                            "Failed to read leader: {}",
-                            e
-                        ))
-                    })?;
-
-                    let leader: Vec<u8> = leader_result.extract().map_err(|_| {
-                        pyo3::exceptions::PyValueError::new_err("Leader must be bytes")
-                    })?;
-
-                    if leader.is_empty() {
-                        return Ok(None);
-                    }
-
-                    if leader.len() != 24 {
-                        return Err(pyo3::exceptions::PyValueError::new_err(format!(
-                            "Incomplete leader: expected 24 bytes, got {}",
-                            leader.len()
-                        )));
-                    }
-
-                    // Parse record length
-                    let record_length_str = String::from_utf8_lossy(&leader[0..5]);
-                    let record_length: usize = record_length_str.trim().parse().map_err(|_| {
-                        pyo3::exceptions::PyValueError::new_err(format!(
-                            "Invalid record length in leader: '{}'",
-                            record_length_str
-                        ))
-                    })?;
-
-                    if record_length < 24 {
-                        return Err(pyo3::exceptions::PyValueError::new_err(format!(
-                            "Record length too small: {} (minimum 24)",
-                            record_length
-                        )));
-                    }
-
-                    // Read remainder
-                    let record_data_bytes =
-                        read_method.call1((record_length - 24,)).map_err(|e| {
-                            pyo3::exceptions::PyIOError::new_err(format!(
-                                "Failed to read record data: {}",
+                let bound = py_obj.bind(py);
+                match reader_helpers::read_record_bytes_from_python_file(bound)? {
+                    None => Ok(None),
+                    Some(bytes) => {
+                        let cursor = Cursor::new(bytes);
+                        let mut parser =
+                            HoldingsMarcReader::new(cursor).with_recovery_mode(self.recovery_mode);
+                        match parser.read_record() {
+                            Ok(Some(record)) => {
+                                self.backend = Some(HoldingsReaderBackend::PythonFile(py_obj));
+                                Ok(Some(PyHoldingsRecord { inner: record }))
+                            },
+                            Ok(None) => Ok(None),
+                            Err(e) => Err(pyo3::exceptions::PyValueError::new_err(format!(
+                                "Failed to parse record: {}",
                                 e
-                            ))
-                        })?;
-
-                    let record_data: Vec<u8> = record_data_bytes.extract().map_err(|_| {
-                        pyo3::exceptions::PyValueError::new_err("Record data must be bytes")
-                    })?;
-
-                    if record_data.len() != record_length - 24 {
-                        return Err(pyo3::exceptions::PyValueError::new_err(format!(
-                            "Truncated record: expected {} bytes, got {}",
-                            record_length - 24,
-                            record_data.len()
-                        )));
-                    }
-
-                    // Assemble and parse
-                    let mut complete_record = Vec::with_capacity(record_length);
-                    complete_record.extend_from_slice(&leader);
-                    complete_record.extend_from_slice(&record_data);
-
-                    let cursor = std::io::Cursor::new(complete_record);
-                    let mut parser =
-                        HoldingsMarcReader::new(cursor).with_recovery_mode(self.recovery_mode);
-
-                    match parser.read_record() {
-                        Ok(Some(record)) => Ok(Some(PyHoldingsRecord { inner: record })),
-                        Ok(None) => Ok(None),
-                        Err(e) => Err(pyo3::exceptions::PyValueError::new_err(format!(
-                            "Failed to parse record: {}",
-                            e
-                        ))),
-                    }
+                            ))),
+                        }
+                    },
                 }
             },
         };
 
-        // If we hit EOF (Ok(None)), don't restore the backend
-        match &result {
-            Ok(None) => {},    // EOF, backend stays None
-            Err(_) => {},      // Error, backend already re-set above
-            Ok(Some(_)) => {}, // Success, backend already re-set above
-        }
-
         result
     }
 
-    /// Iterator protocol support
     pub fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
         slf
     }
 
-    /// Iterator next
     pub fn __next__(mut slf: PyRefMut<'_, Self>) -> PyResult<PyHoldingsRecord> {
         match slf.read_record()? {
             Some(record) => Ok(record),
@@ -344,12 +152,10 @@ impl PyHoldingsMARCReader {
         }
     }
 
-    /// Context manager support
     pub fn __enter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
         slf
     }
 
-    /// Context manager cleanup
     pub fn __exit__(
         mut slf: PyRefMut<'_, Self>,
        _exc_type: &Bound<'_, PyAny>,
@@ -360,7 +166,6 @@ impl PyHoldingsMARCReader {
         Ok(false)
     }
 
-    /// Get string representation
     pub fn __repr__(&self) -> String {
         match &self.backend {
             None => "HoldingsMARCReader(closed)".to_string(),
diff --git a/src-python/src/lib.rs b/src-python/src/lib.rs
index ceaa841..f0c5bad 100644
--- a/src-python/src/lib.rs
+++ b/src-python/src/lib.rs
@@ -15,6 +15,7 @@ mod parse_error;
 mod producer_consumer_pipeline_wrapper;
 mod query;
 mod rayon_parser_pool_wrapper;
+mod reader_helpers;
 mod readers;
 mod unified_reader;
 mod wrappers;
diff --git a/src-python/src/reader_helpers.rs b/src-python/src/reader_helpers.rs
new file mode 100644
index 0000000..dfd296e
--- /dev/null
+++ b/src-python/src/reader_helpers.rs
@@ -0,0 +1,170 @@
+//! Shared helpers for MARC reader backends (authority, holdings).
+//!
+//! Eliminates duplication between authority_readers.rs and holdings_readers.rs
+//! for common operations: recovery mode parsing, source file opening, and
+//! reading raw record bytes from Python file objects.
+
+use mrrc::recovery::RecoveryMode;
+use pyo3::prelude::*;
+use std::fs::File;
+
+/// Parse a recovery_mode string into a RecoveryMode enum.
+///
+/// Returns `PyValueError` for invalid values.
+pub fn parse_recovery_mode(mode: &str) -> PyResult<RecoveryMode> {
+    match mode {
+        "lenient" => Ok(RecoveryMode::Lenient),
+        "permissive" => Ok(RecoveryMode::Permissive),
+        "strict" => Ok(RecoveryMode::Strict),
+        _ => Err(pyo3::exceptions::PyValueError::new_err(format!(
+            "Invalid recovery_mode '{}': must be 'strict', 'lenient', or 'permissive'",
+            mode
+        ))),
+    }
+}
+
+/// Try to open a source as a file path (str or pathlib.Path).
+///
+/// Returns `Ok(Some(file))` if the source is a path, `Ok(None)` if it's not
+/// a path type, or `Err` for file I/O errors.
+pub fn try_open_as_path(source: &Bound<'_, PyAny>) -> PyResult<Option<File>> {
+    // Try str path first
+    if let Ok(path_str) = source.extract::<String>() {
+        return open_file_path(&path_str).map(Some);
+    }
+
+    // Try pathlib.Path via __fspath__
+    if let Ok(method) = source.getattr("__fspath__") {
+        if method.is_callable() {
+            if let Ok(path_obj) = method.call0() {
+                if let Ok(path_str) = path_obj.extract::<String>() {
+                    return open_file_path(&path_str).map(Some);
+                }
+            }
+        }
+    }
+
+    Ok(None)
+}
+
+/// Open a file path with proper Python error mapping.
+fn open_file_path(path: &str) -> PyResult<File> {
+    match File::open(path) {
+        Ok(file) => Ok(file),
+        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
+            Err(pyo3::exceptions::PyFileNotFoundError::new_err(format!(
+                "No such file or directory: '{}'",
+                path
+            )))
+        },
+        Err(e) if e.kind() == std::io::ErrorKind::PermissionDenied => Err(
+            pyo3::exceptions::PyPermissionError::new_err(format!("Permission denied: '{}'", path)),
+        ),
+        Err(e) => Err(pyo3::exceptions::PyIOError::new_err(format!(
+            "Failed to open file '{}': {}",
+            path, e
+        ))),
+    }
+}
+
+/// Try to extract bytes/bytearray from a source.
+///
+/// Returns `Ok(Some(bytes))` if the source is bytes, `Ok(None)` otherwise.
+pub fn try_extract_bytes(source: &Bound<'_, PyAny>) -> PyResult<Option<Vec<u8>>> {
+    match source.extract::<Vec<u8>>() {
+        Ok(bytes) => Ok(Some(bytes)),
+        Err(_) => Ok(None),
+    }
+}
+
+/// Try to get a Python file-like object (has .read() method).
+///
+/// Returns `Ok(Some(obj))` if the source has a callable .read(), `Ok(None)` otherwise.
+pub fn try_as_python_file(source: &Bound<'_, PyAny>) -> PyResult<Option<Py<PyAny>>> {
+    if let Ok(method) = source.getattr("read") {
+        if method.is_callable() {
+            return Ok(Some(source.clone().unbind()));
+        }
+    }
+    Ok(None)
+}
+
+/// Return a type error for unsupported input types.
+pub fn unsupported_source_error(source: &Bound<'_, PyAny>) -> PyResult<()> {
+    let type_name = source.get_type().name()?;
+    Err(pyo3::exceptions::PyTypeError::new_err(format!(
+        "Unsupported input type: {}. Supported types: str (file path), pathlib.Path, \
+         bytes, bytearray, or file-like object (with .read() method)",
+        type_name
+    )))
+}
+
+/// Read one complete MARC record as raw bytes from a Python file object.
+///
+/// Returns `Ok(Some(bytes))` for a complete record, `Ok(None)` at EOF,
+/// or `Err` for malformed/truncated data.
+pub fn read_record_bytes_from_python_file(py_obj: &Bound<'_, PyAny>) -> PyResult<Option<Vec<u8>>> {
+    let read_method = py_obj.getattr("read").map_err(|e| {
+        pyo3::exceptions::PyIOError::new_err(format!("Object missing .read() method: {}", e))
+    })?;
+
+    // Read leader (24 bytes)
+    let leader_result = read_method.call1((24usize,)).map_err(|e| {
+        pyo3::exceptions::PyIOError::new_err(format!("Failed to read leader: {}", e))
+    })?;
+
+    let leader: Vec<u8> = leader_result
+        .extract()
+        .map_err(|_| pyo3::exceptions::PyValueError::new_err("Leader must be bytes"))?;
+
+    if leader.is_empty() {
+        return Ok(None); // EOF
+    }
+
+    if leader.len() != 24 {
+        return Err(pyo3::exceptions::PyValueError::new_err(format!(
+            "Incomplete leader: expected 24 bytes, got {}",
+            leader.len()
+        )));
+    }
+
+    // Parse record length from leader
+    let record_length_str = String::from_utf8_lossy(&leader[0..5]);
+    let record_length: usize = record_length_str.trim().parse().map_err(|_| {
+        pyo3::exceptions::PyValueError::new_err(format!(
+            "Invalid record length in leader: '{}'",
+            record_length_str
+        ))
+    })?;
+
+    if record_length < 24 {
+        return Err(pyo3::exceptions::PyValueError::new_err(format!(
+            "Record length too small: {} (minimum 24)",
+            record_length
+        )));
+    }
+
+    // Read remainder
+    let record_data_bytes = read_method.call1((record_length - 24,)).map_err(|e| {
+        pyo3::exceptions::PyIOError::new_err(format!("Failed to read record data: {}", e))
+    })?;
+
+    let record_data: Vec<u8> = record_data_bytes
+        .extract()
+        .map_err(|_| pyo3::exceptions::PyValueError::new_err("Record data must be bytes"))?;
+
+    if record_data.len() != record_length - 24 {
+        return Err(pyo3::exceptions::PyValueError::new_err(format!(
+            "Truncated record: expected {} bytes, got {}",
+            record_length - 24,
+            record_data.len()
+        )));
+    }
+
+    // Assemble complete record
+    let mut complete_record = Vec::with_capacity(record_length);
+    complete_record.extend_from_slice(&leader);
+    complete_record.extend_from_slice(&record_data);
+
+    Ok(Some(complete_record))
+}
diff --git a/src-python/src/wrappers.rs b/src-python/src/wrappers.rs
index 4b39c44..6989e50 100644
--- a/src-python/src/wrappers.rs
+++ b/src-python/src/wrappers.rs
@@ -560,6 +560,17 @@ impl PyField {
             .collect())
     }
 
+    /// Delete the first subfield with a given code, returning its value
+    pub fn delete_subfield(&mut self, code: &str) -> PyResult<Option<String>> {
+        if code.is_empty() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "Subfield code cannot be empty",
+            ));
+        }
+        let code_char = code.chars().next().unwrap();
+        Ok(self.inner.delete_subfield(code_char))
+    }
+
     fn __repr__(&self) -> String {
         format!(
             "",
diff --git a/src/authority_record.rs b/src/authority_record.rs
index 5dbf9e8..d656bb6 100644
--- a/src/authority_record.rs
+++ b/src/authority_record.rs
@@ -7,6 +7,7 @@
 use crate::leader::Leader;
 use crate::marc_record::MarcRecord;
 use crate::record::Field;
+use crate::record_helpers::control_field_char_at;
 use indexmap::IndexMap;
 use serde::{Deserialize, Serialize};
@@ -261,41 +262,29 @@ impl AuthorityRecord {
     /// Get kind of record from 008/09
     #[must_use]
     pub fn kind_of_record(&self) -> Option<KindOfRecord> {
-        self.get_control_field("008").and_then(|field| {
-            if field.len() > 9 {
-                match field.chars().nth(9) {
-                    Some('a') => Some(KindOfRecord::EstablishedHeading),
-                    Some('b') => Some(KindOfRecord::ReferenceUntracted),
-                    Some('c') => Some(KindOfRecord::ReferenceTraced),
-                    Some('d') => Some(KindOfRecord::Subdivision),
-                    Some('e') => Some(KindOfRecord::NodeLabel),
-                    Some('f') => Some(KindOfRecord::EstablishedHeadingAndSubdivision),
-                    Some('g') => Some(KindOfRecord::ReferenceAndSubdivision),
-                    _ => None,
-                }
-            } else {
-                None
-            }
-        })
+        match control_field_char_at(self, "008", 9)? {
+            'a' => Some(KindOfRecord::EstablishedHeading),
+            'b' => Some(KindOfRecord::ReferenceUntracted),
+            'c' => Some(KindOfRecord::ReferenceTraced),
+            'd' => Some(KindOfRecord::Subdivision),
+            'e' => Some(KindOfRecord::NodeLabel),
+            'f' => Some(KindOfRecord::EstablishedHeadingAndSubdivision),
+            'g' => Some(KindOfRecord::ReferenceAndSubdivision),
+            _ => None,
+        }
     }
 
     /// Get level of establishment from 008/33
     #[must_use]
     pub fn level_of_establishment(&self) -> Option<LevelOfEstablishment> {
-        self.get_control_field("008").and_then(|field| {
-            if field.len() > 33 {
-                match field.chars().nth(33) {
-                    Some('a') => Some(LevelOfEstablishment::FullyEstablished),
-                    Some('b') => Some(LevelOfEstablishment::Memorandum),
-                    Some('c') => Some(LevelOfEstablishment::Provisional),
-                    Some('d') => Some(LevelOfEstablishment::Preliminary),
-                    Some('n') => Some(LevelOfEstablishment::NotApplicable),
-                    _ => None,
-                }
-            } else {
-                None
-            }
-        })
+        match control_field_char_at(self, "008", 33)? {
+            'a' => Some(LevelOfEstablishment::FullyEstablished),
+            'b' => Some(LevelOfEstablishment::Memorandum),
+            'c' => Some(LevelOfEstablishment::Provisional),
+            'd' => Some(LevelOfEstablishment::Preliminary),
+            'n' => Some(LevelOfEstablishment::NotApplicable),
+            _ => None,
+        }
     }
 
     /// Check if this is an established heading
diff --git a/src/holdings_record.rs b/src/holdings_record.rs
index a3a7a58..5c399a6 100644
--- a/src/holdings_record.rs
+++ b/src/holdings_record.rs
@@ -7,6 +7,7 @@
 use crate::leader::Leader;
 use crate::marc_record::MarcRecord;
 use crate::record::Field;
+use crate::record_helpers::control_field_char_at;
 use indexmap::IndexMap;
 use serde::{Deserialize, Serialize};
@@ -298,58 +299,40 @@ impl HoldingsRecord {
     /// Get acquisition status from 008/06
     #[must_use]
     pub fn acquisition_status(&self) -> Option<AcquisitionStatus> {
-        self.get_control_field("008").and_then(|field| {
-            if field.len() > 6 {
-                match field.chars().nth(6) {
-                    Some('0') => Some(AcquisitionStatus::Other),
-                    Some('1') => Some(AcquisitionStatus::ReceivedAndComplete),
-                    Some('2') => Some(AcquisitionStatus::OnOrder),
-                    Some('3') => Some(AcquisitionStatus::ReceivedAndIncomplete),
-                    _ => None,
-                }
-            } else {
-                None
-            }
-        })
+        match control_field_char_at(self, "008", 6)? {
+            '0' => Some(AcquisitionStatus::Other),
+            '1' => Some(AcquisitionStatus::ReceivedAndComplete),
+            '2' => Some(AcquisitionStatus::OnOrder),
+            '3' => Some(AcquisitionStatus::ReceivedAndIncomplete),
+            _ => None,
+        }
     }
 
     /// Get method of acquisition from 008/07
     #[must_use]
     pub fn method_of_acquisition(&self) -> Option<MethodOfAcquisition> {
-        self.get_control_field("008").and_then(|field| {
-            if field.len() > 7 {
-                match field.chars().nth(7) {
-                    Some('u') => Some(MethodOfAcquisition::Unknown),
-                    Some('f') => Some(MethodOfAcquisition::Free),
-                    Some('g') => Some(MethodOfAcquisition::Gift),
-                    Some('l') => Some(MethodOfAcquisition::LegalDeposit),
-                    Some('m') => Some(MethodOfAcquisition::Membership),
-                    Some('p') => Some(MethodOfAcquisition::Purchase),
-                    Some('e') => Some(MethodOfAcquisition::Exchange),
-                    _ => None,
-                }
-            } else {
-                None
-            }
-        })
+        match control_field_char_at(self, "008", 7)? {
+            'u' => Some(MethodOfAcquisition::Unknown),
+            'f' => Some(MethodOfAcquisition::Free),
+            'g' => Some(MethodOfAcquisition::Gift),
+            'l' => Some(MethodOfAcquisition::LegalDeposit),
+            'm' => Some(MethodOfAcquisition::Membership),
+            'p' => Some(MethodOfAcquisition::Purchase),
+            'e' => Some(MethodOfAcquisition::Exchange),
+            _ => None,
+        }
     }
 
     /// Get completeness of holdings from 008/16
     #[must_use]
     pub fn completeness(&self) -> Option<Completeness> {
-        self.get_control_field("008").and_then(|field| {
-            if field.len() > 16 {
-                match field.chars().nth(16) {
-                    Some('1') => Some(Completeness::Complete),
-                    Some('2') => Some(Completeness::Incomplete),
-                    Some('3') => Some(Completeness::Scattered),
-                    Some('4') => Some(Completeness::NotApplicable),
-                    _ => None,
-                }
-            } else {
-                None
-            }
-        })
+        match control_field_char_at(self, "008", 16)? {
+            '1' => Some(Completeness::Complete),
+            '2' => Some(Completeness::Incomplete),
+            '3' => Some(Completeness::Scattered),
+            '4' => Some(Completeness::NotApplicable),
+            _ => None,
+        }
     }
 
     /// Check if this is a serial holdings record
diff --git a/src/record.rs b/src/record.rs
index c063a5c..38280cf 100644
--- a/src/record.rs
+++ b/src/record.rs
@@ -1182,6 +1182,17 @@ impl Field {
         self.subfields.iter_mut().filter(move |sf| sf.code == code)
     }
 
+    /// Remove the first subfield with a given code
+    ///
+    /// Returns the removed subfield's value, or `None` if no match.
+    pub fn delete_subfield(&mut self, code: char) -> Option<String> {
+        if let Some(pos) = self.subfields.iter().position(|sf| sf.code == code) {
+            Some(self.subfields.remove(pos).value)
+        } else {
+            None
+        }
+    }
+
     /// Remove all subfields with a given code
     ///
     /// Returns the removed subfields.
diff --git a/src/record_helpers.rs b/src/record_helpers.rs
index 6aabda5..5c596e2 100644
--- a/src/record_helpers.rs
+++ b/src/record_helpers.rs
@@ -19,6 +19,22 @@
 use crate::bibliographic_helpers::PublicationInfo;
 use crate::marc_record::MarcRecord;
 
+/// Extract a single character at a given position from a control field.
+///
+/// Returns `None` if the field doesn't exist or is too short.
+/// This avoids the repeated `get_control_field(tag).and_then(|f| if f.len() > pos ...)`
+/// pattern used throughout authority and holdings record parsing.
+pub fn control_field_char_at<T: MarcRecord>(
+    record: &T,
+    tag: &str,
+    position: usize,
+) -> Option<char> {
+    record
+        .get_control_field(tag)
+        .filter(|f| f.len() > position)
+        .and_then(|f| f.chars().nth(position))
+}
+
 /// MARC 6XX subject tags matching pymarc's `subjects()` coverage.
 ///
 /// Includes standard subject fields (600-662) and local subject fields (690-699)
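The record-framing logic that `read_record_bytes_from_python_file` centralizes follows the ISO 2709 convention: the first 5 bytes of the 24-byte leader hold the total record length as ASCII digits, so one read of the leader plus one read of `record_length - 24` bytes yields exactly one record. The following is a minimal pure-Python sketch of that framing for illustration only; it is not the mrrc API, and the function name and test bytes are invented here:

```python
import io

def read_record_bytes(fh):
    """Read one MARC record's raw bytes from a binary stream.

    Returns None at clean EOF; raises ValueError on malformed input,
    mirroring the error cases in the Rust helper.
    """
    leader = fh.read(24)
    if not leader:
        return None  # clean EOF before a new record
    if len(leader) != 24:
        raise ValueError(f"Incomplete leader: expected 24 bytes, got {len(leader)}")
    try:
        record_length = int(leader[0:5])  # leader/00-04: total record length
    except ValueError:
        raise ValueError(f"Invalid record length in leader: {leader[0:5]!r}")
    if record_length < 24:
        raise ValueError(f"Record length too small: {record_length} (minimum 24)")
    rest = fh.read(record_length - 24)
    if len(rest) != record_length - 24:
        raise ValueError(
            f"Truncated record: expected {record_length - 24} bytes, got {len(rest)}"
        )
    return leader + rest

# Hypothetical 30-byte "record": length 00030 in leader positions 0-4.
raw = b"00030" + b" " * 19 + b"FIELDS"
buf = io.BytesIO(raw)
assert read_record_bytes(buf) == raw
assert read_record_bytes(buf) is None  # second call hits EOF
```

Note the deliberate asymmetry, shared with the Rust version: an empty read means EOF (iteration ends), while a short read mid-record is an error, so truncated files fail loudly instead of silently dropping the last record.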