Description
Hi, thanks for the great work on ScreenCoder!
While examining the Screen-10K dataset, I noticed a systematic mismatch between the screenshots and their ground-truth code in how images are handled:
Screenshots are captured from original URLs and contain real images (photos, icons, logos, etc.).
Ground truth HTML is processed by webpage2html, which replaces all image resources (`<img>` tags, CSS background-image, favicon, font files, etc.) with `placeholder.png`:
```css
.bg{background-image:url(placeholder.png)}
```
Since `placeholder.png` itself is not bundled with the dataset, rendering the ground truth HTML produces broken images, which means:
The visual appearance of the rendered ground truth code does not match the input screenshot
Any pixel-level or CLIP-based similarity metric between the screenshot and the rendered code output would be penalized by this discrepancy
Models trained on this data learn a lossy mapping: they see real images in the input but are never expected to reproduce them in the output
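To make the scope of the discrepancy concrete, here is a minimal sketch for counting placeholder references in a ground-truth HTML string. The regex and the sample markup are illustrative assumptions, not the dataset's exact format; a real audit would run this over every HTML file in the release:

```python
import re

def placeholder_refs(html: str) -> list[str]:
    """Find resource URLs rewritten to placeholder.png.

    Matches attribute references (src=..., href=...) and CSS url(...)
    references. This is a rough heuristic, not a full HTML/CSS parser.
    """
    pattern = re.compile(
        r'(?:src|href)\s*=\s*["\']?([^"\'\s>]*placeholder\.png)'
        r'|url\(\s*["\']?([^"\')\s]*placeholder\.png)',
        re.IGNORECASE,
    )
    # findall returns (attr_match, css_match) tuples; keep the non-empty one.
    return [a or b for a, b in pattern.findall(html)]

# Hypothetical fragment mimicking webpage2html output:
sample = (
    '<img src="placeholder.png">'
    '<style>.bg{background-image:url(placeholder.png)}</style>'
    '<link rel="icon" href="placeholder.png">'
)
print(len(placeholder_refs(sample)))  # 3
```

Every reference this finds is a spot where the rendered ground truth diverges visually from the input screenshot.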
Questions:
Is this intentional? If so, is the training objective purely focused on layout/structure reproduction, deliberately ignoring image content?
Were the filtering metrics (used to select the 10K subset from 50K) computed against the original URL rendering or against the placeholder-substituted HTML rendering?
Would it be possible to release `placeholder.png` alongside the dataset, or to document the expected rendering behavior?
Thanks!