Skip to content

fix(closure,HTML::Parser,open): unblock HTML-Tree 5.07 tests#559

Merged
fglock merged 6 commits intomasterfrom
feature/html-element-fixes
Apr 25, 2026
Merged

fix(closure,HTML::Parser,open): unblock HTML-Tree 5.07 tests#559
fglock merged 6 commits intomasterfrom
feature/html-element-fixes

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 25, 2026

Summary (WIP)

Fixes the three highest-impact bugs uncovered while investigating
./jcpan -t HTML::Element (HTML-Tree 5.07). Plan + remaining
work live in dev/modules/html_element.md.

Before this PR: 10 of 23 upstream test files fail.
After this PR: 3 of 23 fail. The remaining failures are:

  • refloop.t — weak-ref destruction timing (Phase 4, being handled
    on another branch — explicitly out of scope here);
  • whitespace.t — appears to hang on \xA0 input (separate issue);
  • split.t — entity-handling diff (separate issue).

Bug 1 — For3Node continue block lost from closure capture

VariableCollectorVisitor.visit(For3Node) walked initialization,
condition, increment, and body — but not continueBlock.
The selective-capture optimisation in SubroutineParser used the
visitor's output to decide which outer lexicals to attach to a sub's
closure; anything referenced only inside a continue {} block
was filtered out. The lazy compiler then failed at first call with

Global symbol "$nillio" requires explicit package name

For1Node already traversed continueBlock correctly (lines 165-167);
only the For3Node visitor was wrong. Fix is one block.

Discovered via HTML::Element::look_down, which ends in:

while (defined($this = shift @pile)) { ... }
continue { unshift @pile, ..., @{ ... || $nillio }; }

where $nillio is a file-scoped my variable (see HTML/Element.pm
line 74 vs 2023).

Bug 2 — HTML::Parser tokens / tokenN / tokenpos argspecs

HTMLParser.java::buildArgs had no case for tokens, tokenpos,
or tokenN for N > 0. The default branch silently emitted an empty
scalar, which broke the default comment handler that
HTML/Parser.pm itself installs:

$self->handler(comment => sub {
    my ($self, $tokens) = @_;
    for (@$tokens) { $self->comment($_) }    # $tokens was ""
}, "self,tokens");

Implemented per HTML::Parser docs:

Event tokens value
start [tagname, k1, v1, k2, v2, ...] in attrseq order
end [tagname]
text/dtext [text]
comment [comment_body]
declaration [declaration_body]
process [pi_body]

tokenN is tokens->[N]. tokenpos parallels tokens with
[0, 0] placeholder offsets (we don't track byte offsets — same
status as the existing offset / offset_end cases).

Bug 3 — literal-string filehandle name in open(...)

PrototypeArgs.handleTypeGlobArgument routed bareword filehandles
through FileHandle.parseBarewordHandle, which autovivifies
*main::FH and yields a typeglob reference. A constant-string first
argument fell through to the "Bare scalars" branch and was passed
in scalar context; open then tried to write the new IO into the
read-only string literal, producing
Modification of a read-only value attempted.

Now: when the first argument is a StringNode whose value is a
syntactically valid identifier (or Pkg::Identifier), treat it the
same as a bareword. open(my $fh, "<", $path) and other lvalue
paths are unaffected.

Discovered via HTML-Tree 5.07 t/oldparse.t:

open( "INFILE", "$TestInput" ) || die "$!";
binmode INFILE;
$HTML = <INFILE>;

Test plan

  • make green on master (1m 11s)
  • New regression tests:
    • src/test/resources/unit/closure.t — 2 cases for the
      continue-block capture bug
    • src/test/resources/unit/html_parser_tokens.t — 20 assertions
      for tokens / tokenN / tokenpos
    • src/test/resources/unit/open_string_filehandle.t — 5
      assertions for 2-arg / 3-arg / package-qualified / lvalue
  • HTML-Tree 5.07 test suite, before -> after:
    | file | before | after |
    | --- | --- | --- |
    | attributes.t | 0/3 | 3/3 |
    | children.t | 0/3 | 3/3 |
    | comment.t | 0/1 | 1/1 |
    | construct_tree.t | 2/75 | 75/75 |
    | oldparse.t | 0/16 | 16/16 |
    | parse.t | 7/44 | 44/44 |
    | parsefile.t | 0/4 | 4/4 |
    | split.t | 1/444 | 417/444 (separate entity-handling issue) |
    | refloop.t | 3/8 (aborted) | 4/8 (Phase 4, out of scope) |
    | whitespace.t | 0/6 (aborted) | 4/6 then hangs on \xA0 |

Out of scope / follow-ups

  • Phase 4 (-weak refloop) — being addressed on a separate
    branch; do NOT implement here.
  • split.t entity diff — output writes &amp; literals
    through some paths instead of decoded entities. Separate PR.
  • whitespace.t hang on \xA0 — likely an HTML::TreeBuilder
    whitespace-folding loop; needs its own investigation.

Generated with Devin

fglock and others added 4 commits April 25, 2026 16:57
- dev/modules/devel_declare.md: design for Java-backed Devel::Declare +
  B::Hooks::OP::Check, including phased rollout and the lexer changes
  needed for set_linestr emulation.
- dev/modules/html_element.md: detailed plan for fixing
  jcpan -t HTML::Element. Four root causes identified, two with
  immediate fixes in scope (continue-block closure capture; HTML::Parser
  tokens argspec).

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
….. }`

VariableCollectorVisitor.visit(For3Node) walked initialization,
condition, increment, and body — but not continueBlock. The
selective-capture optimisation in SubroutineParser used the visitor's
output to decide which outer lexicals to attach to a sub's closure;
anything referenced *only* inside a continue block was filtered out.
The lazy compiler then failed at first call with

    Global symbol "$nillio" requires explicit package name

For1Node already traversed continueBlock correctly; only the
For3Node visitor was wrong.

Discovered while running `jcpan -t HTML::Element` (HTML-Tree 5.07):
HTML::Element::look_down ends in
    while (defined($this = shift @pile)) { ... }
    continue { unshift @pile, ..., @{ ... || $nillio }; }
where $nillio is a file-scoped `my` variable. Four upstream tests
now pass that were previously aborting with the parser error
(attributes.t, children.t, refloop.t, whitespace.t).

Adds two regression tests to src/test/resources/unit/closure.t
covering both the anon-sub and named-sub paths.

See dev/modules/html_element.md for the full investigation and the
plan for the remaining HTML-Tree failures (HTML::Parser tokens
argspec; 2-arg open with literal-string filehandle; weak-ref refloop
which is being addressed on a separate branch).

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
HTMLParser.java::buildArgs handled tagname/attr/attrseq/text/dtext/
is_cdata/self/event/offset/length/line/token0 (PI-only) but had no
case for `tokens`, `tokenpos`, or `tokenN` for N>0. The default
branch silently emitted an empty scalar.

That broke the comment handler that HTML/Parser.pm itself installs
when api_version < 3:

    $self->handler(comment => sub {
        my ($self, $tokens) = @_;
        for (@$tokens) { $self->comment($_) }
    }, "self,tokens");

…with "Can't use string ('') as an ARRAY ref while strict refs in
use" — five upstream HTML-Tree tests aborted on this (comment.t,
parse.t, parsefile.t, construct_tree.t, split.t).

`tokens` produces:
  start       => [tagname, k1, v1, k2, v2, ...] (in attrseq order)
  end         => [tagname]
  text/dtext  => [text]
  comment     => [comment_body]
  declaration => [declaration_body]
  process     => [pi_body]

`tokenN` (N >= 0) returns tokens->[N], or "" if out of range.
`token0` keeps its existing PI fast-path for processing-instruction
events.

`tokenpos` returns a parallel arrayref of [start, end] byte-offset
pairs. We don't track byte offsets yet (already noted at the
existing `offset` / `offset_end` cases, which return 0), so the
returned pairs are all [0, 0]. This is enough for callers that just
iterate the array, and matches the existing approximation we make
for `offset`.

After this fix + the For3Node continue-block fix in the previous
commit, jcpan -t HTML::Element goes from 10/23 failing files to
just 4 (oldparse.t — 2-arg open with literal-string filehandle;
refloop.t — weak-ref destruction timing; whitespace.t — appears to
hang on \xA0 input, separate issue; split.t — entity-handling diff).
The remaining bugs are tracked in dev/modules/html_element.md
Phases 3 and 4 (Phase 4 is being addressed on a different branch).

New regression test: src/test/resources/unit/html_parser_tokens.t
covers comment, start, end, tokenN, and tokenpos argspecs.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
PrototypeArgs.handleTypeGlobArgument routed bareword filehandles
(open(FH, ...)) through FileHandle.parseBarewordHandle, which
autovivifies *main::FH and produces a typeglob reference. A
constant-string first argument (open("FH", ...)) fell through to
the "Bare scalars" branch and was passed in scalar context; open
then tried to write the new IO into the read-only string literal,
producing

    Modification of a read-only value attempted

Now: when the first argument is a StringNode whose value is a
syntactically valid identifier (or `Pkg::Identifier`), treat it the
same as a bareword. The non-identifier case still falls through to
the scalar path, so `open(my $fh, "<", $path)` and friends are
unaffected.

Discovered via HTML-Tree 5.07 t/oldparse.t which uses

    open( "INFILE", "$TestInput" ) || die "$!";
    binmode INFILE;
    $HTML = <INFILE>;

After this fix t/oldparse.t goes from 0/16 to 16/16. Combined with
the previous two commits on this branch, jcpan -t HTML::Element
now leaves only 3 failing files (refloop.t — weak-ref work tracked
on another branch; whitespace.t — hangs on \xA0 input;
split.t — entity-handling diff).

New regression test src/test/resources/unit/open_string_filehandle.t
covers the 2-arg, 3-arg, package-qualified, and lvalue-scalar paths.

See dev/modules/html_element.md (Phase 3, now marked completed).

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock changed the title [WIP] fix(closure,HTML::Parser): unblock HTML-Tree 5.07 tests fix(closure,HTML::Parser,open): unblock HTML-Tree 5.07 tests Apr 25, 2026
@fglock fglock marked this pull request as ready for review April 25, 2026 15:16
fglock and others added 2 commits April 25, 2026 18:17
…regression)

The previous commit on this branch routed any literal-string argument
in a `*` prototype slot through FileHandle.parseBarewordHandle. That
was too aggressive — the `*` prototype is generic; user-defined
`sub foo (*) { }` must pass a literal string through as a SCALAR,
matching real Perl. Doing the typeglob conversion unconditionally
broke 4 tests in perl5_t/t/comp/proto.t:

    sub star  (*&)  { ... }   # star "FOO" / star("FOO")
    sub star2 (**&) { ... }   # star2 "FOO", "BAR" / star2("FOO","BAR")

…where the test expects $_[0] eq 'FOO' (a plain string), not a glob
ref.

The right discriminator is "Perl built-in vs. user-defined sub":
real Perl performs the indirect-glob lookup for built-ins like
open/close/binmode/fileno/eof/select/etc. but not for user subs.
Built-ins are exactly those with a registered prototype in
ParserTables.CORE_PROTOTYPES, so we can tell them apart by checking
the current operator name against that map.

This also fixes (incidentally) similar paper cuts that were
silent-no-ops before: `close "FOO"`, `fileno "FOO"`, `binmode "FOO"`,
`eof "FOO"` now look up the typeglob correctly, matching real Perl.

Regression test extended in src/test/resources/unit/open_string_filehandle.t:
- close/fileno via string FH name match the bareword form (3 cases)
- user-defined sub with `*` prototype still receives a SCALAR (2 cases)

After this fix:
  perl5_t/t/comp/proto.t                197/216 (back to baseline)
  HTML-Tree t/oldparse.t                16/16   (Phase 3 preserved)
  src/test/resources/unit/open_string_filehandle.t  10/10
  make                                  green

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
HTML::Parser's `.pm` is shipped via CPAN, not bundled with PerlOnJava
(only the Java-XS shim in HTMLParser.java lives in this repo). So
`use HTML::Parser` in a unit test fails in CI, where no CPAN modules
are installed in @inc.

The `tokens` / `tokenN` / `tokenpos` argspec fix is already covered by
the live `jcpan -t HTML::Element` run (HTML-Tree's t/comment.t,
t/parse.t, t/parsefile.t, t/construct_tree.t, t/split.t exercise
exactly this code path). No need for a duplicate in-tree unit test
that relies on a CPAN module to be present.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock merged commit efbfd34 into master Apr 25, 2026
2 checks passed
@fglock fglock deleted the feature/html-element-fixes branch April 25, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant