Description
HTML code
<!DOCTYPE html>
<html>
<head>
<meta http-equiv='content-type' content='text/html; charset=windows-1251'>
<title>кириллица кириллица кириллица кириллица кириллица кириллица</title>
</head>
<body>
<img src='/biz-globus-sea-32X32.webp'>
</body>
</html>
PHP code
$body = file_get_contents('./index.html');
var_dump(strlen($body));
$html = \duzun\hQuery::fromHTML($body);
$imageNodes = $html->find('img') ?? [];
foreach ($imageNodes as $pos => $imageNode) {
    var_dump($pos);
}
Output
int(298)
int(329) <- the position is beyond the page length
This happens because of <meta http-equiv='content-type' content='text/html; charset=windows-1251'>. The library tries to count the position in characters.
Using multibyte character positions is a bad idea because of emoji. There needs to be a way to disable the use of encoding data.
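To illustrate why the two kinds of position diverge, here is a minimal sketch that is independent of hQuery. It assumes the `mbstring` extension is loaded and that the script file is saved as UTF-8; the sample string is made up for the example.

```php
<?php
// Sketch: byte offsets vs. character offsets in a UTF-8 string.
// Cyrillic letters take 2 bytes each in UTF-8, the emoji takes 4.
$text = "кириллица 😀 <img src='/a.webp'>";

$bytePos = strpos($text, '<img');                // offset counted in bytes
$charPos = mb_strpos($text, '<img', 0, 'UTF-8'); // offset counted in characters

var_dump($bytePos); // int(24)
var_dump($charPos); // int(12)

// Only the byte offset can be used with byte-oriented functions such as substr().
var_dump(substr($text, $bytePos, 4)); // string(4) "<img"
```

The two offsets agree only when every character before the tag is single-byte, so any position applied to the raw file contents has to be a byte offset.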
At the moment I'm using this hard-coded workaround to always get the tag position in bytes:
$body = preg_replace_callback(["/<meta[^>]*http-equiv=('|\")content-type('|\")[^>]*>/Ui", "/<meta[^>]*charset=('|\")[^'\"]+('|\")[^>]*>/Ui"], function($matches) {
    $repeat = strlen($matches[0]);
    return str_repeat(' ', $repeat);
}, $body);
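The workaround can be sanity-checked on its own, without hQuery. In this sketch the helper name `blankCharsetMeta` and the sample document are hypothetical; the two patterns are the ones from the workaround above. It only shows that blanking the `<meta>` tag with spaces leaves the byte length, and therefore every later byte offset, unchanged.

```php
<?php
// Hypothetical helper (not part of hQuery): replace charset declarations
// with spaces so the byte length of the document stays the same.
function blankCharsetMeta(string $html): string
{
    $patterns = [
        "/<meta[^>]*http-equiv=('|\")content-type('|\")[^>]*>/Ui",
        "/<meta[^>]*charset=('|\")[^'\"]+('|\")[^>]*>/Ui",
    ];
    $result = preg_replace_callback($patterns, function (array $m): string {
        // Same number of bytes as the matched tag, but only spaces.
        return str_repeat(' ', strlen($m[0]));
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Made-up sample document for the check.
$body = "<head><meta http-equiv='content-type' content='text/html; charset=windows-1251'></head>"
      . "<body><img src='/a.webp'></body>";
$clean = blankCharsetMeta($body);

var_dump(strlen($clean) === strlen($body));                 // bool(true)
var_dump(strpos($clean, '<img') === strpos($body, '<img')); // bool(true)
var_dump(strpos($clean, '<meta'));                          // bool(false)
```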