Skip to content

Add Markpub content parser#9

Open
pfefferle wants to merge 15 commits intotrunkfrom
add/markpub-parser
Open

Add Markpub content parser#9
pfefferle wants to merge 15 commits intotrunkfrom
add/markpub-parser

Conversation

@pfefferle
Copy link
Copy Markdown
Member

Proposed changes:

  • Add Markpub parser (at.markpub.markdown) that walks Gutenberg blocks via parse_blocks() and converts each to GFM markdown.
  • Register Markpub as the default content parser via the atmosphere_content_parser filter.
  • Add tests for all supported block types and filter integration.

Stacked on #8. Merge that first.

Other information:

  • Have you written new tests for your changes, if applicable?

Testing instructions:

  • Verify ?atproto preview on a published post returns valid JSON with a content field containing at.markpub.markdown data.
  • Verify block content converts to markdown correctly (headings, links, images, lists, code blocks).
  • Verify classic editor content falls back to inline HTML-to-markdown conversion.
  • Verify atmosphere_content_parser filter with __return_null disables the content field.
  • Run npm run env-test — all tests pass.

Changelog entry

  • Automatically create a changelog entry from the details below.
Changelog Entry Details

Significance

  • Patch
  • Minor
  • Major

Type

  • Added - for new features
  • Changed - for changes in existing functionality
  • Deprecated - for soon-to-be removed features
  • Removed - for now removed features
  • Fixed - for any bug fixes
  • Security - in case of vulnerabilities

Message

Add rich content support for standard.site documents using the Markpub format.

@pfefferle pfefferle added the enhancement New feature label Mar 23, 2026
@pfefferle pfefferle self-assigned this Mar 23, 2026
@github-actions github-actions Bot added [Feature] Content Parser Content parser for AT Protocol [Tests] Includes Tests PR includes test changes labels Mar 23, 2026
@pfefferle pfefferle force-pushed the add/markpub-parser branch 3 times, most recently from f3d5710 to 71a30f5 Compare March 23, 2026 17:15
Base automatically changed from add/content-parser-interface to trunk March 23, 2026 17:34
@github-actions github-actions Bot added the [Feature] Transformer AT Protocol record transformers label Mar 23, 2026
Ship a Markpub parser (at.markpub.markdown) as the default content
parser, registered via the atmosphere_content_parser filter.

- Add Markpub class that walks Gutenberg blocks via parse_blocks()
  and converts each to GFM markdown.
- Register Markpub as the default parser in Atmosphere::init().
- Add tests for all supported block types and filter integration.
- Use a separate counter for ordered lists so numbering stays
  sequential when empty items are skipped.
- Percent-encode parentheses in link URLs to prevent breaking
  markdown link syntax.
@pfefferle pfefferle force-pushed the add/markpub-parser branch from 6a2aa7d to 920dbea Compare March 23, 2026 17:46
Copy link
Copy Markdown
Contributor

@kraftbj kraftbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the parser structure is clean and the namespaced-prefix conventions look good throughout. Left a few inline notes on some edge cases I spotted while reading; none of them block the overall approach.

Comment thread includes/content-parser/class-markpub.php Outdated
Comment thread includes/content-parser/class-markpub.php Outdated
Comment thread includes/content-parser/class-markpub.php
Comment thread includes/content-parser/class-markpub.php Outdated
Comment thread includes/content-parser/class-markpub.php Outdated
- Widen Content_Parser::parse return type to ?array; Markpub returns
  null when the trimmed markdown is empty so Document omits the
  content field instead of shipping an empty record. Document skips
  the atmosphere_document_content filter on null to keep the filter
  contract strictly array-typed.
- Drop 'table' from advertised extensions; core/table currently falls
  through to fallback and is stripped by wp_strip_all_tags, so the
  advertised capability didn't match emitted output.
- Reorder the figcaption extraction so the closing-tag strip runs
  before wp_strip_all_tags. Otherwise sibling content after
  </figcaption> (e.g. a trailing <p> inside the same <figure>) would
  bleed into the caption text.
- Wrap preg_replace / preg_replace_callback calls in safe_replace
  helpers that fall back to the original input on PCRE failure and
  emit a wp_trigger_error. Without this, a pathological input could
  erase the post body silently when preg_replace returns null and
  that null cascades through subsequent string ops.
- Fix the phpcs:ignore rule on unused $post parameters to reference
  the sniff that actually fires (Generic.CodeAnalysis.UnusedFunction
  Parameter alongside VariableAnalysis), and document why $post is
  declared but unused.
@pfefferle
Copy link
Copy Markdown
Member Author

Thanks for the review! All five points addressed in 8991043:

L36 phpcs:ignore rule + $post comment — both Generic.CodeAnalysis.UnusedFunctionParameter and VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable actually fire on this parameter, so the ignore now lists both. Added a docblock note explaining that $post is required by the Content_Parser contract even though Markpub doesn't use it.

L65 advertised table extension — dropped from the extensions array. No core/table handler yet; the advertised capability should match what's actually emitted. Can re-add when a table transform lands.

L66 empty-markdown null sentinel — widened Content_Parser::parse return type to ?array. Markpub returns null when the trimmed markdown is empty; Document::get_content bails early on null and skips the atmosphere_document_content filter (keeping callbacks' array $content signature honest). New tests cover a post of only core/spacer blocks, literal empty post content, and the Document-side null-handling path (asserting the filter is not invoked).

L164 figcaption sibling-bleed — took your suggested order swap: preg_replace the surrounding figcaption markers first, then wp_strip_all_tags. Added a regression test with a trailing <p> inside the same <figure>; without the fix it fails with "Real captionShould not appear in caption" concatenated.

L367 preg_replace PCRE guard — added safe_replace / safe_replace_callback helpers that fall back to the unchanged subject and wp_trigger_error when preg_replace returns null. Applied to all inline conversions and the figcaption chain. Didn't TDD this one directly — reliably tripping PCRE limits in a unit test is awkward — but the new call sites are uniform and the helper is small enough to read at a glance.

@pfefferle pfefferle requested a review from kraftbj April 22, 2026 06:35
kraftbj
kraftbj previously approved these changes Apr 22, 2026
paragraph(), fallback(), and the classic-content branch checked
empty( $html ) on raw innerHTML. For content like <p>   </p> the
wrapper is non-empty, so these handlers fell through to
inline_html_to_markdown() which trims to "". The empty string
survived into the parts array and produced a spurious "\n\n"
separator before the next block in mixed-content posts.

Match the pattern heading() already uses: convert first, return null
when the result is empty.
kraftbj
kraftbj previously approved these changes Apr 22, 2026
@pfefferle
Copy link
Copy Markdown
Member Author

@kraftbj we should maybe find a more generic way, to also transform into the other standard.site content formats.

@kraftbj
Copy link
Copy Markdown
Contributor

kraftbj commented Apr 29, 2026

Agreed — the single-parser filter and one-NSID-per-implementation shape will get awkward as soon as a second standard.site format lands. Filed #45 to nail down the interface before that happens (registry vs per-NSID filter vs canonical-IR projection — open for discussion).

For this PR, let's move forward with the current shape and iterate. Markpub is the only parser today, so the registry refactor doesn't block anything user-facing, and #45 keeps the design conversation in its own thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature [Feature] Content Parser Content parser for AT Protocol [Feature] Transformer AT Protocol record transformers [Tests] Includes Tests PR includes test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants