
Need a way to always receive a tag position in bytes #100

@PavelFil

HTML code

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv='content-type' content='text/html; charset=windows-1251'>
        <title>кириллица кириллица кириллица кириллица кириллица кириллица</title>
    </head>
    <body>
        <img src='/biz-globus-sea-32X32.webp'>
    </body>
</html>

PHP code

$body = file_get_contents('./index.html');
var_dump(strlen($body));
$html = \duzun\hQuery::fromHTML($body);
$imageNodes = $html->find('img') ?? [];
foreach($imageNodes as $pos => $imageNode) {
    var_dump($pos);
}

Output

int(298)
int(329) <- the position exceeds the page length

This happens because of <meta http-equiv='content-type' content='text/html; charset=windows-1251'>. The library tries to count the position in characters instead of bytes.
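To illustrate why a declared charset can shift positions, here is a minimal sketch. It assumes the mbstring extension is available and that the source file is saved as UTF-8. It does not reproduce hQuery's internals, which this issue does not confirm. It only shows that the same tag sits at different byte offsets before and after a Windows-1251 to UTF-8 conversion.

```php
<?php
// A short document written as a UTF-8 literal, then converted to Windows-1251
// to stand in for the file on disk.
$utf8   = "<title>кириллица</title><img src='/a.webp'>";
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

// Byte offset of the tag in the original single-byte document.
$posInOriginal = strpos($cp1251, '<img');              // 24

// Byte offset of the same tag after conversion to UTF-8:
// each of the 9 Cyrillic letters now takes 2 bytes.
$posInConverted = strpos($utf8, '<img');               // 33

// Character offset in the UTF-8 version.
$charPos = mb_strpos($utf8, '<img', 0, 'UTF-8');       // 24

var_dump(strlen($cp1251), $posInOriginal, $posInConverted, $charPos);
```

An offset computed on the converted text cannot safely be applied to the raw bytes read from disk, which is how a position larger than the file size can appear.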

Using multibyte character positions is a bad idea because of emoji. There needs to be a way to disable the use of encoding data.
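The emoji concern can be shown with a small sketch, again assuming mbstring is available and the source file is UTF-8. A single emoji occupies 4 bytes but counts as one character, so byte and character offsets diverge, and only the byte offset works with substr() on the raw string.

```php
<?php
$html = "<p>😀</p><img src='/a.webp'>";

// Byte offset: 3 bytes for <p>, 4 for the emoji, 4 for </p>.
$bytePos = strpos($html, '<img');                  // 11

// Character offset: the emoji counts as a single character.
$charPos = mb_strpos($html, '<img', 0, 'UTF-8');   // 8

// Slicing the raw bytes with the byte offset finds the tag...
$byByte = substr($html, $bytePos, 4);              // "<img"

// ...while the character offset lands inside the closing </p>.
$byChar = substr($html, $charPos, 4);              // "/p><"

var_dump($bytePos, $charPos, $byByte, $byChar);
```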

At the moment I'm using this workaround to always receive the tag position in bytes:

// Blank out charset declarations with same-length runs of spaces so that byte offsets stay unchanged.
$patterns = [
    "/<meta[^>]*http-equiv=('|\")content-type('|\")[^>]*>/Ui",
    "/<meta[^>]*charset=('|\")[^'\"]+('|\")[^>]*>/Ui",
];
$body = preg_replace_callback($patterns, function($matches) {
    return str_repeat(' ', strlen($matches[0]));
}, $body);
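As a quick sanity check of this workaround, the sketch below wraps the same two patterns in a helper and confirms that the document length and the tag's byte offset are unchanged. The function name blankCharsetMeta and the sample document are hypothetical and used only for illustration. They are not part of hQuery.

```php
<?php
// Hypothetical helper: replaces charset-declaring <meta> tags with spaces
// of equal byte length, so every later byte offset is preserved.
function blankCharsetMeta(string $body): string
{
    $patterns = [
        "/<meta[^>]*http-equiv=('|\")content-type('|\")[^>]*>/Ui",
        "/<meta[^>]*charset=('|\")[^'\"]+('|\")[^>]*>/Ui",
    ];
    $result = preg_replace_callback(
        $patterns,
        static function (array $m): string {
            return str_repeat(' ', strlen($m[0]));
        },
        $body
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $body;
}

$body  = "<head><meta charset='windows-1251'></head><body><img src='/a.webp'></body>";
$clean = blankCharsetMeta($body);

var_dump(strlen($clean) === strlen($body));                   // same length
var_dump(strpos($clean, '<img') === strpos($body, '<img'));   // same byte offset
var_dump(strpos($clean, 'charset'));                          // declaration is gone
```

Padding with spaces instead of deleting the tag is the key design choice here: removing the tag outright would shift every subsequent offset.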
