Description
Hi, thanks for the great work on ScreenCoder!
While examining the Screen-10K dataset, I noticed a systematic mismatch between the screenshots and their ground-truth code in how images are handled:
Screenshots are captured from original URLs and contain real images (photos, icons, logos, etc.).
Ground truth HTML is processed by webpage2html, which replaces all image resources (`<img>` tags, CSS background-image, favicon, font files, etc.) with `placeholder.png`:
```css
.bg{background-image:url(placeholder.png)}
```
Since `placeholder.png` itself is not bundled with the dataset, rendering the ground truth HTML produces broken images, which means:
The visual appearance of the rendered ground truth code does not match the input screenshot
Any pixel-level or CLIP-based similarity metric between the screenshot and the rendered code output would be penalized by this discrepancy
Models trained on this data learn a lossy mapping: they see real images in the input but are never expected to reproduce them in the output
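To make the scope of the discrepancy concrete, here is a minimal sketch for counting placeholder references in a ground-truth HTML string. The regex and the sample markup are illustrative assumptions, not the dataset's exact format; a real audit would run this over every HTML file in the release:

```python
import re

def placeholder_refs(html: str) -> list[str]:
    """Find resource URLs rewritten to placeholder.png.

    Matches attribute references (src=..., href=...) and CSS url(...)
    references. This is a rough heuristic, not a full HTML/CSS parser.
    """
    pattern = re.compile(
        r'(?:src|href)\s*=\s*["\']?([^"\'\s>]*placeholder\.png)'
        r'|url\(\s*["\']?([^"\')\s]*placeholder\.png)',
        re.IGNORECASE,
    )
    # findall returns (attr_match, css_match) tuples; keep the non-empty one.
    return [a or b for a, b in pattern.findall(html)]

# Hypothetical fragment mimicking webpage2html output:
sample = (
    '<img src="placeholder.png">'
    '<style>.bg{background-image:url(placeholder.png)}</style>'
    '<link rel="icon" href="placeholder.png">'
)
print(len(placeholder_refs(sample)))  # 3
```

Every reference this finds is a spot where the rendered ground truth diverges visually from the input screenshot.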
Questions:
Is this intentional? If so, is the training objective purely focused on layout/structure reproduction, deliberately ignoring image content?
Were the filtering metrics (used to select the 10K subset from 50K) computed against the original URL rendering or against the placeholder-substituted HTML rendering?
Would it be possible to release `placeholder.png` alongside the dataset, or to document the expected rendering behavior?
Thanks!