4 changes: 2 additions & 2 deletions .beads/issues.jsonl

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/getting-started/quickstart-python.md
@@ -20,6 +20,8 @@ for record in MARCReader("records.mrc"):

This approach uses pure Rust I/O and releases Python's GIL, enabling multi-threaded speedups.

For files with malformed records, pass `permissive=True` so the reader yields `None` for each record that fails to parse, or pass `recovery_mode="lenient"` to salvage valid fields from damaged records. See the [error handling guide](../guides/migration-from-pymarc.md#error-handling) for details.

## Access Fields

```python
52 changes: 52 additions & 0 deletions docs/guides/migration-from-pymarc.md
@@ -92,6 +92,8 @@ with open('output.mrc', 'wb') as f:
| Operation | pymarc | mrrc | Same? |
|-----------|--------|------|-------|
| Create reader | `MARCReader(file_obj)` | `MARCReader('path.mrc')` (recommended) or `MARCReader(file_obj)` | Enhanced |
| Permissive mode | `MARCReader(f, permissive=True)` | `MARCReader(f, permissive=True)` | **Same** |
| Unicode flag | `MARCReader(f, to_unicode=True)` | `MARCReader(f, to_unicode=True)` | **Same** |
| Read record | `reader.next()` or `next(reader)` | `next(reader)` | **Same** |
| Write record | `writer.write(record)` | `writer.write(record)` | **Same** |
| Iterate | `for record in reader:` | `for record in reader:` | **Same** |
@@ -338,6 +340,56 @@ from mrrc import MrrcException, MarcError
- [ ] Update writers to use context managers: `with mrrc.MARCWriter(f) as w:` (better resource management)
- [ ] Use `record.as_marc()`, `record.as_json()`, `record.as_dict()` for serialization

## Error Handling

### Permissive Mode (pymarc-compatible)

pymarc's `permissive=True` flag yields `None` for records that fail to parse,
letting callers skip bad records and keep processing. mrrc supports the same
flag with identical behavior:

```python
# Works the same in both pymarc and mrrc
for record in mrrc.MARCReader('records.mrc', permissive=True):
    if record is None:
        continue  # skip malformed record
    print(record.title)
```
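
The skip-on-failure contract can be tried out without mrrc installed. The sketch below is a stand-in, not mrrc's implementation: `PermissiveIter`, `parse_record`, and `read_all` are hypothetical names, and the "records" are plain strings in which an empty string plays the role of a malformed record.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def parse_record(raw: str) -> str:
    """Hypothetical parser: rejects empty input, otherwise upper-cases it."""
    if not raw:
        raise ValueError("malformed record")
    return raw.upper()


class PermissiveIter:
    """Hypothetical iterator that yields None when parsing a record fails."""

    def __init__(self, raws: Iterable[str], parse: Callable[[str], str],
                 permissive: bool = False) -> None:
        self._raws: Iterator[str] = iter(raws)
        self._parse = parse
        self._permissive = permissive

    def __iter__(self) -> "PermissiveIter":
        return self

    def __next__(self) -> Optional[str]:
        # Fetch outside the try block so StopIteration still ends the loop.
        raw = next(self._raws)
        try:
            return self._parse(raw)
        except Exception:
            if self._permissive:
                return None  # the caller decides what to do with a bad record
            raise


def read_all(raws: Iterable[str]) -> Tuple[List[str], int]:
    """Collect the parsed records and count the ones that were skipped."""
    good: List[str] = []
    skipped = 0
    for record in PermissiveIter(raws, parse_record, permissive=True):
        if record is None:
            skipped += 1
            continue
        good.append(record)
    return good, skipped
```

Counting the `None` results, as `read_all` does, is one way to report how many records a batch job skipped instead of dropping them silently.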

### to_unicode Flag

pymarc's `to_unicode=True` (the default) converts MARC-8 encoded records to
UTF-8. mrrc always converts MARC-8 to UTF-8 automatically — the conversion
happens in the Rust parsing layer and cannot be disabled. The `to_unicode`
kwarg is accepted for compatibility so existing scripts work unchanged.
Passing `to_unicode=False` emits a warning but has no effect.
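
Scripts that run with warnings promoted to errors may want to check for this warning before migrating. The sketch below reproduces only the warning, using the standard `warnings` module; `accept_to_unicode` and `warnings_for` are hypothetical helpers for illustration and are not part of mrrc.

```python
import warnings
from typing import List


def accept_to_unicode(to_unicode: bool = True) -> None:
    """Hypothetical helper: warn when the caller asks to disable conversion."""
    if not to_unicode:
        warnings.warn(
            "mrrc always converts MARC-8 to UTF-8; to_unicode=False has no effect",
            stacklevel=2,
        )


def warnings_for(to_unicode: bool) -> List[str]:
    """Return the warning messages emitted for a given flag value."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        accept_to_unicode(to_unicode)
    return [str(w.message) for w in caught]
```

Calling `warnings_for(False)` returns a single message, while `warnings_for(True)` returns an empty list.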

### Recovery Mode (mrrc-specific)

mrrc also offers a `recovery_mode` kwarg that goes beyond pymarc's
permissive mode. Instead of skipping bad records entirely, recovery mode
attempts to salvage valid fields from damaged records:

```python
# Attempt to recover partial data from malformed records
reader = mrrc.MARCReader('records.mrc', recovery_mode='lenient')
for record in reader:
    print(f"Got {len(record.get_fields())} fields")

# Even more lenient — accept partial data
reader = mrrc.MARCReader('records.mrc', recovery_mode='permissive')
```

Recovery modes:
- `"strict"` (default) — raise on any malformation
- `"lenient"` — attempt to recover, salvage valid fields
- `"permissive"` — very lenient, accept partial data

Note: `permissive=True` cannot be combined with a `recovery_mode` other than
`"strict"`. Doing so raises `ValueError`, because the two options represent
different error-handling strategies. Use `permissive=True` for pymarc-compatible
"skip bad records" behavior, or `recovery_mode` for mrrc's "salvage what you can" approach.
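
The option check can be sketched as a small standalone function. This is an illustration under stated assumptions, not mrrc's code: `choose_strategy` and the `"skip"` label it returns are hypothetical, and the check order (combination first, then mode name) is an assumption.

```python
def choose_strategy(permissive: bool = False, recovery_mode: str = "strict") -> str:
    """Hypothetical helper: validate the two options and name the strategy."""
    # Assumed order: the combination check runs before the mode name is validated.
    if permissive and recovery_mode != "strict":
        raise ValueError(
            "Cannot combine permissive=True with recovery_mode other than 'strict'"
        )
    if recovery_mode not in ("strict", "lenient", "permissive"):
        raise ValueError(f"Invalid recovery_mode {recovery_mode!r}")
    if permissive:
        return "skip"  # hypothetical label: bad records surface as None
    return recovery_mode  # "strict" raises; the other modes try to salvage fields
```

For example, `choose_strategy(permissive=True)` returns `"skip"`, while `choose_strategy(permissive=True, recovery_mode="lenient")` raises `ValueError`.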

## Known Differences from pymarc

1. **Record constructor**: `mrrc.Record()` works (defaults to `Leader()`), or pass explicit `mrrc.Record(mrrc.Leader())`
27 changes: 27 additions & 0 deletions docs/tutorials/python/reading-records.md
@@ -128,6 +128,33 @@ except FileNotFoundError:
print("File not found")
```

### Permissive Mode

When processing files that may contain malformed records, use
`permissive=True` to skip bad records instead of raising errors.
This matches pymarc's `permissive` flag behavior:

```python
for record in MARCReader("records.mrc", permissive=True):
    if record is None:
        continue  # skip malformed record
    print(record.title)
```

### Recovery Mode

For more control over error handling, use `recovery_mode` to attempt
salvaging valid fields from damaged records:

```python
# Attempt to recover partial data
for record in MARCReader("records.mrc", recovery_mode="lenient"):
    print(f"Got {len(record.get_fields())} fields")
```

See the [migration guide](../../guides/migration-from-pymarc.md#error-handling)
for details on the differences between `permissive` and `recovery_mode`.

## Complete Example

This example analyzes a MARC file to summarize the collection by language and material type:
56 changes: 46 additions & 10 deletions mrrc/__init__.py
@@ -1464,19 +1464,55 @@ def __hash__(self) -> int:


 class MARCReader:
-    """MARC Reader wrapper."""
-
-    def __init__(self, file_obj):
+    """MARC Reader wrapper.
+
+    Args:
+        file_obj: File path (str), pathlib.Path, bytes/bytearray, or file-like object.
+        to_unicode: Accepted for pymarc compatibility. mrrc always converts
+            MARC-8 to UTF-8; passing ``False`` emits a warning.
+        permissive: When ``True``, yields ``None`` for records that fail to
+            parse instead of raising, matching pymarc's ``permissive`` behavior.
+        recovery_mode: mrrc-native error handling: ``"strict"`` (default, raise
+            on errors), ``"lenient"`` (attempt to salvage valid fields), or
+            ``"permissive"`` (very lenient, accept partial data). Cannot be
+            combined with ``permissive=True``.
+    """
+
+    def __init__(self, file_obj, to_unicode: bool = True, permissive: bool = False,
+                 recovery_mode: str = "strict"):
         """Create a new MARC reader."""
-        self._inner = _MARCReader(file_obj)
-
+        if not to_unicode:
+            import warnings
+            warnings.warn(
+                "mrrc always converts MARC-8 to UTF-8; to_unicode=False has no effect",
+                stacklevel=2,
+            )
+        if permissive and recovery_mode != "strict":
+            raise ValueError(
+                "Cannot combine permissive=True with recovery_mode other than "
+                "'strict' — they represent different error-handling strategies"
+            )
+        self._permissive = permissive
+        self._inner = _MARCReader(file_obj, recovery_mode=recovery_mode)
+
     def __iter__(self):
         """Iterate over records."""
         return self
-
-    def __next__(self) -> Record:
-        """Get next record."""
-        record = next(self._inner)
+
+    def __next__(self) -> Optional[Record]:
+        """Get next record.
+
+        When ``permissive=True``, returns ``None`` for records that fail
+        to parse instead of raising, matching pymarc behavior.
+        """
+        try:
+            record = next(self._inner)
+        except StopIteration:
+            raise
+        except Exception:
+            if self._permissive:
+                return None
+            raise
         wrapper = Record(None)
         wrapper._inner = record
         # Create a Leader wrapper from the Rust record's leader
@@ -1486,7 +1522,7 @@ def __next__(self) -> Record:
         wrapper._leader = leader
         wrapper._leader_modified = False
         return wrapper
-
+
     @property
     def backend_type(self) -> str:
         """The backend type: ``"rust_file"``, ``"cursor"``, or ``"python_file"``."""
2 changes: 1 addition & 1 deletion mrrc/_mrrc.pyi
@@ -366,7 +366,7 @@ class MARCReader:
         futures = [executor.submit(process_file, f) for f in file_list]
         results = [f.result() for f in futures]
     """
-    def __init__(self, file: Any) -> None: ...
+    def __init__(self, file: Any, *, recovery_mode: str = "strict") -> None: ...
     def __repr__(self) -> str: ...
     def __iter__(self) -> Iterator[Record]: ...
     def __next__(self) -> Record:
28 changes: 24 additions & 4 deletions src-python/src/readers.rs
@@ -9,7 +9,7 @@ use crate::batched_reader::BatchedMarcReader;
 use crate::batched_unified_reader::BatchedUnifiedReader;
 use crate::parse_error::ParseError;
 use crate::wrappers::PyRecord;
-use mrrc::MarcReader;
+use mrrc::{MarcReader, RecoveryMode};
 use pyo3::prelude::*;
 use smallvec::SmallVec;

@@ -59,6 +59,8 @@ enum ReaderType {
 pub struct PyMARCReader {
     /// Reader backend (Python-based or unified multi-backend)
     reader: Option<ReaderType>,
+    /// Recovery mode for handling malformed records
+    recovery_mode: RecoveryMode,
 }

 #[pymethods]
@@ -73,12 +75,27 @@ impl PyMARCReader {
     ///
     /// # Arguments
     /// * `source` - File path (str), pathlib.Path, bytes/bytearray, or file-like object
+    /// * `recovery_mode` - Error handling mode: 'strict' (default), 'lenient', 'permissive'
     #[new]
-    pub fn new(source: &Bound<'_, PyAny>) -> PyResult<Self> {
+    #[pyo3(signature = (source, recovery_mode = "strict"))]
+    pub fn new(source: &Bound<'_, PyAny>, recovery_mode: &str) -> PyResult<Self> {
+        let rec_mode = match recovery_mode {
+            "lenient" => RecoveryMode::Lenient,
+            "permissive" => RecoveryMode::Permissive,
+            "strict" => RecoveryMode::Strict,
+            _ => {
+                return Err(pyo3::exceptions::PyValueError::new_err(format!(
+                    "Invalid recovery_mode '{}': must be 'strict', 'lenient', or 'permissive'",
+                    recovery_mode
+                )));
+            },
+        };
+
         // Try unified reader first (handles file paths and bytes)
         match BatchedUnifiedReader::new(source) {
             Ok(unified_reader) => Ok(PyMARCReader {
                 reader: Some(ReaderType::Unified(unified_reader)),
+                recovery_mode: rec_mode,
             }),
             Err(_) => {
                 // Fall back to legacy Python file wrapper
@@ -87,6 +104,7 @@ impl PyMARCReader {
                 let batched_reader = BatchedMarcReader::new(file_obj);
                 Ok(PyMARCReader {
                     reader: Some(ReaderType::Python(batched_reader)),
+                    recovery_mode: rec_mode,
                 })
             },
         }
@@ -122,11 +140,12 @@ impl PyMARCReader {

         // Step 2: Parse bytes (GIL released)
         // CRITICAL FIX: Use Python::detach() which properly releases the GIL.
+        let rec_mode = self.recovery_mode;
         let record = py
             .detach(|| {
                 // Create a cursor from owned bytes
                 let cursor = std::io::Cursor::new(bytes_owned.to_vec());
-                let mut parser = MarcReader::new(cursor);
+                let mut parser = MarcReader::new(cursor).with_recovery_mode(rec_mode);

                 // Parse the single record
                 parser.read_record().map_err(|e| {
@@ -229,13 +248,14 @@ impl PyMARCReader {
         // We defer conversion to PyErr until AFTER detach() returns (GIL re-acquired).
         // This is required because PyErr construction needs the GIL.
         // NOTE: Use detach() which properly releases GIL
+        let rec_mode = slf.recovery_mode;
         let parse_result: Result<Option<mrrc::Record>, crate::parse_error::ParseError> =
             py.detach(|| {
                 // This closure runs WITHOUT the GIL held
                 // All data here is owned (no Python references)
                 // Return Rust errors only; defer PyErr conversion to Phase 3
                 let cursor = std::io::Cursor::new(record_bytes_owned.to_vec());
-                let mut parser = MarcReader::new(cursor);
+                let mut parser = MarcReader::new(cursor).with_recovery_mode(rec_mode);

                 // Parse the single record from bytes
                 // Return ParseError (Rust type), not PyErr