
Extract common author/date/identifier patterns into mixins #2461

Closed
SatoryKono wants to merge 2 commits into main from claude/unify-transformer-logic-Ujcsk

@SatoryKono
Owner

Summary

Consolidates repeated author normalization, date handling, and identifier validation patterns across CrossRef, OpenAlex, PubMed, and SemanticScholar transformers into three reusable mixins. Reduces code duplication and improves maintainability by centralizing these shared concerns.

Changes

  • New mixins (src/bioetl/application/pipelines/common/):

    • AuthorTransformMixin: Consolidates _normalize_author_block() (wraps 3-call author/affiliation normalization pattern) and _hash_author_pii_details() (generalizes PII hashing for author details)
    • DateTransformMixin: Provides _validate_publication_year(), _normalize_publication_date(), and _prefer_date() helpers
    • IdentifierTransformMixin: Provides _validate_doi(), _validate_pmid(), and _build_metadata_block() helpers
  • Updated transformers to use mixins:

    • BasePublicationTransformer: Now inherits from all three mixins
    • CrossRefPublicationTransformer: Replaced _hash_author_details() with mixin method; uses _normalize_author_block(), _validate_doi(), _validate_publication_year(), _prefer_date(), _build_metadata_block()
    • OpenAlexPublicationTransformer: Uses _validate_doi(), _normalize_author_block(), _validate_publication_year(), _build_metadata_block()
    • PubMedPublicationTransformer: Uses _validate_pmid(), _validate_doi(), _normalize_author_block(), _build_metadata_block()
    • SemanticScholarPublicationTransformer: Uses _validate_doi(), _validate_pmid(), _normalize_author_block(), _build_metadata_block()
  • Removed unused imports: Eliminated direct imports of DOI, PubMedId, PublicationYear from transformers (now accessed via mixins)

  • Simplified docstrings: Removed redundant inline comments explaining obvious patterns (e.g., regex patterns, metadata field purposes)

  • Updated tests: Modified test_date_parsing.py to call _prefer_date() instead of _compute_publication_date()

  • Updated architecture exemptions: Removed PubMed transformer file size exemption (reduced from 920 to ~750 lines via refactoring)
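
To make the author-related change above concrete, here is a minimal sketch of what `AuthorTransformMixin` might look like. Only the two method names come from this PR. The three `normalize_*` helpers are named in the commit message, but their bodies here are hypothetical stand-ins, as are the `pii_fields` parameter and the choice of SHA-256.

```python
import hashlib
from typing import Any, Dict, List, Tuple

Author = Dict[str, Any]


# Hypothetical stand-ins for the three normalization calls named in this PR.
def normalize_author_list(raw: Any) -> List[Author]:
    # Accept a single author dict or a list; drop entries that are not dicts.
    items = raw if isinstance(raw, list) else [raw]
    return [dict(item) for item in items if isinstance(item, dict)]


def normalize_author_keys(authors: List[Author]) -> List[Author]:
    # Lowercase keys so every source agrees on field names.
    return [{str(k).lower(): v for k, v in a.items()} for a in authors]


def normalize_affiliations(authors: List[Author]) -> List[Author]:
    # Coerce a missing or scalar affiliation into a list of stripped strings.
    result = []
    for author in authors:
        aff = author.get("affiliation") or []
        aff = aff if isinstance(aff, list) else [aff]
        result.append({**author, "affiliation": [str(x).strip() for x in aff]})
    return result


class AuthorTransformMixin:
    def _normalize_author_block(self, raw: Any) -> List[Author]:
        # One call replaces the three-step pattern repeated in each transformer.
        return normalize_affiliations(
            normalize_author_keys(normalize_author_list(raw))
        )

    def _hash_author_pii_details(
        self, authors: List[Author], pii_fields: Tuple[str, ...] = ("email",)
    ) -> List[Author]:
        # Replace non-empty PII values with SHA-256 hex digests (assumed scheme).
        hashed = []
        for author in authors:
            copy = dict(author)
            for field in pii_fields:
                if copy.get(field):
                    value = str(copy[field]).encode("utf-8")
                    copy[field] = hashlib.sha256(value).hexdigest()
            hashed.append(copy)
        return hashed
```

With a shape like this, a transformer would call `self._normalize_author_block(record.get("author"))` once instead of chaining the three helpers itself.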

Type

  • Refactoring (no functional changes)

Affected layers

  • Application

Test plan

  • Existing unit tests pass (refactored test_date_parsing.py to use new mixin method)
  • Architecture tests pass (no new import boundary violations; dependency map updated)
  • Type annotations preserved on all mixin methods

Checklist

  • No new import boundary violations (ARCH-001)
  • Type annotations on all public mixin methods (TYPE-001)
  • Tests updated for renamed method (_prefer_date)
  • Removed unused imports from transformers

https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB

claude added 2 commits March 4, 2026 22:18
…, IdentifierTransform mixins

Extract common patterns from 4 publication transformers (CrossRef, OpenAlex,
SemanticScholar, PubMed) into 3 shared mixins in common/ package:

- AuthorTransformMixin: _normalize_author_block() consolidates the repeated
  normalize_author_list + normalize_author_keys + normalize_affiliations
  pipeline; _hash_author_pii_details() generalises CrossRef's PII hashing
- DateTransformMixin: _validate_publication_year(), _normalize_publication_date(),
  _prefer_date() replace per-transformer date handling
- IdentifierTransformMixin: _validate_doi(), _validate_pmid(),
  _build_metadata_block() consolidate repeated ID validation and metadata fields

BasePublicationTransformer now inherits all 3 mixins, making them available
to all publication transformers via MRO.
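
A minimal sketch of how the helpers reach a concrete transformer through the method resolution order. The class and method names are taken from this PR; the method bodies, the order of the base classes, and the `transform()` method are assumptions.

```python
from typing import Any, Dict, List, Optional


class AuthorTransformMixin:
    def _normalize_author_block(self, raw: Any) -> List[Any]:
        # Placeholder body: treat a missing author list as empty.
        return list(raw or [])


class DateTransformMixin:
    pass


class IdentifierTransformMixin:
    def _validate_pmid(self, raw: Any) -> Optional[int]:
        # Assumed rule: a PMID is a positive ASCII integer, else None.
        text = str(raw).strip() if raw is not None else ""
        if text.isascii() and text.isdigit() and int(text) > 0:
            return int(text)
        return None


class BasePublicationTransformer(
    AuthorTransformMixin, DateTransformMixin, IdentifierTransformMixin
):
    pass


class PubMedPublicationTransformer(BasePublicationTransformer):
    def transform(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Both helpers are found on the mixins by walking the MRO.
        return {
            "pmid": self._validate_pmid(record.get("pmid")),
            "authors": self._normalize_author_block(record.get("authors")),
        }


mro_names = [cls.__name__ for cls in PubMedPublicationTransformer.__mro__]
```

Here `mro_names` lists the concrete transformer first, then the base class, then the three mixins in declaration order, then `object`, which is the lookup order Python uses for `self._validate_pmid`.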

Results:
- 6 transformer files: 2300 → 2036 lines (-264 lines, -11.5%)
- CrossRef: 415 → 311 (-25%), OpenAlex: 339 → 296 (-13%)
- SemanticScholar: 324 → 299 (-8%), PubMed: 579 → 487 (-16%)
- PubMed now under 500-line layer limit (exemption removed)
- CrossRef now under 300-class-size limit (exemption removed)
- base_transformer.py unchanged at 634 lines
- All 2906 tests pass including 10 snapshot tests
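
The reduction figures above can be rechecked with a few lines of arithmetic. The line counts are copied from this commit message; the `reduction` helper is illustrative only.

```python
# (before, after) line counts as reported in the commit message.
line_counts = {
    "CrossRef": (415, 311),
    "OpenAlex": (339, 296),
    "SemanticScholar": (324, 299),
    "PubMed": (579, 487),
    "6 transformer files": (2300, 2036),
}


def reduction(before: int, after: int) -> tuple:
    # Return lines removed and the percentage reduction, rounded to 0.1.
    removed = before - after
    return removed, round(100 * removed / before, 1)


summary = {name: reduction(*pair) for name, pair in line_counts.items()}

# Lines removed across the four per-source transformers.
per_source_removed = sum(
    summary[name][0]
    for name in ("CrossRef", "OpenAlex", "SemanticScholar", "PubMed")
)

for name, (removed, percent) in summary.items():
    print(f"{name}: -{removed} lines (-{percent}%)")
```

The four per-source reductions add up to the full 264 lines, which is consistent with base_transformer.py being reported as unchanged.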

https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB
- Resolve 3 merge conflicts in crossref, openalex, pubmed transformers
- Fix JsonDict import in pipeline_config.py (Pydantic model resolution)
- Fix JsonDict import in semanticscholar/fallback.py (runtime NameError)
- Fix AsyncIterator import in uniprot/filtering_adapter_mixin.py
- Fix cast import in _quarterly_targets_validation.py
- Fix Any budget violations (move # Any: comments to same line)
- Update architecture metric exemptions for grown domain files
- Update _compute_publication_date -> _prefer_date in crossref date tests
- Apply ruff format and import ordering fixes

https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB
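
The commit message does not say what caused the `JsonDict` NameError. One common cause of this class of error, shown here purely as an assumption, is a name that is defined only under `typing.TYPE_CHECKING` but used at runtime. The names `CheckOnlyJsonDict`, `uses_check_only_name`, `uses_runtime_name`, and `call_outcome` are hypothetical.

```python
from typing import TYPE_CHECKING, Any, Callable, Dict

if TYPE_CHECKING:
    # Skipped at runtime, so this name exists only for static type checkers.
    CheckOnlyJsonDict = Dict[str, Any]

# Defined unconditionally, so it is also available at runtime.
JsonDict = Dict[str, Any]


def uses_check_only_name() -> object:
    # Raises NameError when called: the name was never bound at runtime.
    return CheckOnlyJsonDict


def uses_runtime_name() -> object:
    return JsonDict


def call_outcome(func: Callable[[], object]) -> str:
    # Report whether calling func hits a NameError.
    try:
        func()
    except NameError:
        return "NameError"
    return "ok"


outcomes = (call_outcome(uses_check_only_name), call_outcome(uses_runtime_name))
```

Libraries that evaluate annotations at runtime, such as Pydantic, need the referenced names to be importable when the model is built, which may be why the pipeline_config.py fix is described as a model resolution issue.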
@SatoryKono SatoryKono force-pushed the claude/unify-transformer-logic-Ujcsk branch from 7311e7a to 7d4f84a on March 5, 2026 00:07
@SatoryKono
Owner Author

Closing: stale branch, content outdated or superseded by main.

@SatoryKono SatoryKono closed this Mar 8, 2026
@SatoryKono SatoryKono deleted the claude/unify-transformer-logic-Ujcsk branch March 8, 2026 11:02
