Extract common author/date/identifier patterns into mixins #2461
Closed
SatoryKono wants to merge 2 commits into main
…, IdentifierTransform mixins

Extract common patterns from 4 publication transformers (CrossRef, OpenAlex, SemanticScholar, PubMed) into 3 shared mixins in common/ package:
- AuthorTransformMixin: _normalize_author_block() consolidates the repeated normalize_author_list + normalize_author_keys + normalize_affiliations pipeline; _hash_author_pii_details() generalises CrossRef's PII hashing
- DateTransformMixin: _validate_publication_year(), _normalize_publication_date(), _prefer_date() replace per-transformer date handling
- IdentifierTransformMixin: _validate_doi(), _validate_pmid(), _build_metadata_block() consolidate repeated ID validation and metadata fields

BasePublicationTransformer now inherits all 3 mixins, making them available to all publication transformers via MRO.

Results:
- 6 transformer files: 2300 → 2036 lines (-264 lines, -11.5%)
- CrossRef: 415 → 311 (-25%), OpenAlex: 339 → 296 (-13%)
- SemanticScholar: 324 → 299 (-8%), PubMed: 579 → 487 (-16%)
- PubMed now under 500-line layer limit (exemption removed)
- CrossRef now under 300-class-size limit (exemption removed)
- base_transformer.py unchanged at 634 lines
- All 2906 tests pass including 10 snapshot tests

https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB
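For readers unfamiliar with the pattern, here is a minimal sketch of how helpers defined on mixins become available to every transformer through the MRO. The class and method names come from the commit message; the method bodies, the DOI regex, the year range, and the `transform()` signature are assumptions made for illustration and are not the code in this PR.

```python
import re
from typing import Any, Optional

# Hypothetical, deliberately simple DOI pattern (assumption, not the PR's rule).
_DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


class AuthorTransformMixin:
    """Placeholder; author helpers are omitted from this sketch."""


class DateTransformMixin:
    def _validate_publication_year(self, value: Any) -> Optional[int]:
        """Return the year as an int if it parses and looks plausible, else None."""
        try:
            year = int(value)
        except (TypeError, ValueError, OverflowError):
            return None
        # Assumed plausibility window; the real bounds are not shown in the PR.
        return year if 1500 <= year <= 2100 else None


class IdentifierTransformMixin:
    def _validate_doi(self, value: Any) -> Optional[str]:
        """Strip a resolver prefix, lowercase, and check the DOI shape."""
        if not isinstance(value, str):
            return None
        doi = value.strip().lower()
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if doi.startswith(prefix):
                doi = doi[len(prefix):]
                break
        return doi if _DOI_RE.match(doi) else None

    def _validate_pmid(self, value: Any) -> Optional[str]:
        """Accept positive, ASCII-digit-only identifiers."""
        text = str(value).strip() if value is not None else ""
        if text.isascii() and text.isdigit() and int(text) > 0:
            return text
        return None


class BasePublicationTransformer(
    AuthorTransformMixin, DateTransformMixin, IdentifierTransformMixin
):
    """Stand-in for the real base class, which inherits all three mixins."""


class CrossRefPublicationTransformer(BasePublicationTransformer):
    def transform(self, record: dict) -> dict:
        # Helpers resolve through the MRO; no per-transformer copies needed.
        return {
            "doi": self._validate_doi(record.get("DOI")),
            "publication_year": self._validate_publication_year(record.get("year")),
        }
```

Because the mixins sit on the base class, a subclass only calls `self._validate_doi(...)` and similar helpers; Python's method resolution order finds the shared implementation.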
- Resolve 3 merge conflicts in crossref, openalex, pubmed transformers
- Fix JsonDict import in pipeline_config.py (Pydantic model resolution)
- Fix JsonDict import in semanticscholar/fallback.py (runtime NameError)
- Fix AsyncIterator import in uniprot/filtering_adapter_mixin.py
- Fix cast import in _quarterly_targets_validation.py
- Fix Any budget violations (move # Any: comments to same line)
- Update architecture metric exemptions for grown domain files
- Update _compute_publication_date -> _prefer_date in crossref date tests
- Apply ruff format and import ordering fixes

https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB
7311e7a to
7d4f84a
Closing: stale branch, content outdated or superseded by main.
Summary
Consolidates repeated author normalization, date handling, and identifier validation patterns across CrossRef, OpenAlex, PubMed, and SemanticScholar transformers into three reusable mixins. Reduces code duplication and improves maintainability by centralizing these shared concerns.
Changes
- New mixins (src/bioetl/application/pipelines/common/):
  - AuthorTransformMixin: Consolidates _normalize_author_block() (wraps 3-call author/affiliation normalization pattern) and _hash_author_pii_details() (generalizes PII hashing for author details)
  - DateTransformMixin: Provides _validate_publication_year(), _normalize_publication_date(), and _prefer_date() helpers
  - IdentifierTransformMixin: Provides _validate_doi(), _validate_pmid(), and _build_metadata_block() helpers
- Updated transformers to use mixins:
  - BasePublicationTransformer: Now inherits from all three mixins
  - CrossRefPublicationTransformer: Replaced _hash_author_details() with mixin method; uses _normalize_author_block(), _validate_doi(), _validate_publication_year(), _prefer_date(), _build_metadata_block()
  - OpenAlexPublicationTransformer: Uses _validate_doi(), _normalize_author_block(), _validate_publication_year(), _build_metadata_block()
  - PubMedPublicationTransformer: Uses _validate_pmid(), _validate_doi(), _normalize_author_block(), _build_metadata_block()
  - SemanticScholarPublicationTransformer: Uses _validate_doi(), _validate_pmid(), _normalize_author_block(), _build_metadata_block()
- Removed unused imports: Eliminated direct imports of DOI, PubMedId, PublicationYear from transformers (now accessed via mixins)
- Simplified docstrings: Removed redundant inline comments explaining obvious patterns (e.g., regex patterns, metadata field purposes)
- Updated tests: Modified test_date_parsing.py to call _prefer_date() instead of _compute_publication_date()
- Updated architecture exemptions: Removed PubMed transformer file size exemption (reduced from 920 to ~750 lines via refactoring)
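As an illustration of the author-side consolidation described above, the sketch below wraps a three-call normalization pipeline in a single `_normalize_author_block()` and shows one possible shape for `_hash_author_pii_details()`. The function and method names come from this PR; their signatures, the stand-in bodies, the `pii_fields` parameter, and the use of plain SHA-256 are all assumptions for illustration, not the project's actual implementation.

```python
import hashlib
from typing import Any


def normalize_author_list(authors: Any) -> list:
    """Hypothetical stand-in: keep only dict entries from a list input."""
    if not isinstance(authors, list):
        return []
    return [dict(a) for a in authors if isinstance(a, dict)]


def normalize_author_keys(authors: list) -> list:
    """Hypothetical stand-in: lowercase every key."""
    return [{str(k).lower(): v for k, v in a.items()} for a in authors]


def normalize_affiliations(authors: list) -> list:
    """Hypothetical stand-in: strip and de-duplicate affiliations, keeping order."""
    result = []
    for author in authors:
        raw = author.get("affiliations")
        if not isinstance(raw, list):
            raw = []
        cleaned = list(
            dict.fromkeys(s.strip() for s in raw if isinstance(s, str) and s.strip())
        )
        result.append({**author, "affiliations": cleaned})
    return result


class AuthorTransformMixin:
    def _normalize_author_block(self, raw_authors: Any) -> list:
        # One call replaces the three-step sequence repeated in each transformer.
        authors = normalize_author_list(raw_authors)
        authors = normalize_author_keys(authors)
        return normalize_affiliations(authors)

    def _hash_author_pii_details(self, authors: list, pii_fields=("email",)) -> list:
        # Replace each non-empty PII string with its SHA-256 hex digest.
        # Unsalted hashing is used here only to keep the sketch short.
        hashed = []
        for author in authors:
            entry = dict(author)
            for field in pii_fields:
                value = entry.get(field)
                if isinstance(value, str) and value:
                    entry[field] = hashlib.sha256(value.encode("utf-8")).hexdigest()
            hashed.append(entry)
        return hashed
```

With a wrapper like this, each transformer passes its raw author payload to `self._normalize_author_block(...)` instead of chaining the three calls itself, which is roughly where the line-count reduction reported in this PR would come from.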
Type
Affected layers
Test plan
… test_date_parsing.py to use new mixin method)
Checklist
… _prefer_date)
https://claude.ai/code/session_01748YsDt7YJUKNpqyxK99tB