Commit 60aea08

wikirby and claude committed
Article-mode oEmbed extraction for video pages + V1 payload parity
V1's article-mode flow on video pages (YouTube, Vimeo) produced a save payload with an embedded video iframe, citation, title/author caption, and `<meta name="AutoPageTagsCodes" content="Article" />` / `<meta name="AutoPageTags" content="Article" />` tags that OneNote's page renderer uses to recognize the result as an article-style clip with playable embeds.

V2 shipped without any of that machinery -- the article mode just ran Readability over the YouTube DOM, which strips iframes and produces a text-only result with no player and no description. Users reported the regression on YouTube specifically; the same gap applied to Vimeo as well.

The oEmbed standard provides exactly the shape we need (iframe `html`, title, author_name, thumbnail_url, dimensions) without any provider-specific scraping. Both YouTube and Vimeo publish CORS-enabled oEmbed endpoints that the chrome-extension origin can fetch directly under our existing `<all_urls>` host_permissions.

Changes:

- New `src/scripts/contentCapture/oembedExtractor.ts` -- thin module with a provider table (YouTube + Vimeo only, matching V1's SupportedVideoDomains), hostname-pattern matching, fetch + JSON parse, and a small `sanitizeProviderHtml` helper that strips script-execution surfaces from provider-supplied HTML.
- `extractArticle` in renderer now tries oEmbed first; on no-match or fetch failure it falls through to the existing Readability path with zero behavior change.
- Preview vs save are decoupled:
  - Preview shows the `thumbnail_url` at the same 600x338 (16:9) box the saved iframe uses, with title / "author . provider" attribution, page description (og:description fallback chain same as bookmark mode), and a CSS-only play-glyph overlay when `type === "video"`. No iframe in preview because the renderer's `preview-frame` is sandboxed (allow-same-origin) and the YouTube/Vimeo player can't run JS inside it -- which is why earlier attempts produced a broken "Unable to execute JavaScript" placeholder.
  - Save uses the provider's iframe HTML (sanitized), with dimensions normalized to 600x338 and `data-original-src=<pageUrl>` injected -- the marker OneNote's renderer uses to recognize and render the embedded player on the saved page, matching V1's YoutubeVideoExtractor behavior exactly.
- PageMetadata plumbing: renderer threads a `pageMetadata` map through the save port message; worker's `buildPage` iterates and emits `<meta name="K" content="V" />` for each entry. Mirrors V1's `OneNoteApi.OneNotePage.getPageMetadataAsHtml` behavior. Article mode (both oEmbed and Readability paths) populates `AutoPageTagsCodes=Article`, `AutoPageTags=Article`, plus title/author/siteName (oEmbed) or title/description/author/siteName/publishedTime (Readability, matching V1 augmentationHelper).
- `buildPage` HTML output realigned to V1 `OneNoteApi.OneNotePage.getEntireOnml` shape: no `<!DOCTYPE>`, `<html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>>` (no quotes around lang -- matches V1 output literally), locale via `chrome.i18n.getUILanguage()`. Same change applied to the parallel `distHtml` builder for distributed-PDF saves so all save paths emit the same shape.
- Bookmark thumbnail size fallback restored: `imageToDataUrl` initial-encode is PNG (good for icons/logos), with iterative JPEG-quality step-down when the encoded data URL exceeds the OneNote API per-MIME-part limit (~2MB minus padding). Matches V1's deleted `DomUtils.adjustImageQualityIfNecessary` behavior including the 0.1 step size. Surfaced because the user was hitting "400 Maximum request size exceeded" on bookmark-mode saves of YouTube pages whose 1280x720 og:image PNG-encoded to ~2.5MB.

Provider scope is intentionally narrow (YouTube + Vimeo only) to match V1's effective surface and avoid accidentally enabling capture on sites V1 never supported. V1 also handled Khan Academy via regex scrape for embedded YouTube IDs in lesson-page HTML; that markup likely no longer matches modern Khan Academy pages and is skipped here per maintainer direction.

Verified manually: YouTube watch page and Vimeo video page produce saved OneNote pages with the embedded player, title/author caption, and og:description text below; non-matching domains fall through to Readability with no regression; bookmark mode on YouTube saves successfully without the 400 limit error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
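
The bookmark thumbnail fallback described in the commit message (PNG first, then JPEG with a 0.1 quality step-down until the payload fits) might look roughly like the sketch below. This is not the committed `imageToDataUrl` code. `Encoder` and `encodeWithinLimit` are hypothetical names, the starting JPEG quality of 1.0 is an assumption, and the size limit is left to the caller because the exact value is not shown in this diff. The encoder is injected so the loop can be exercised without a canvas.

```typescript
// Hypothetical encoder signature. In the extension this would likely wrap
// canvas.toDataURL(mime, quality); it is injected here to keep the loop testable.
type Encoder = (mime: "image/png" | "image/jpeg", quality?: number) => string;

// Returns the first encoding whose length fits within maxLength, or null if
// even the lowest JPEG quality is still too large.
function encodeWithinLimit(encode: Encoder, maxLength: number, step: number = 0.1): string | null {
    if (step <= 0) { throw new Error("step must be positive"); }

    // First attempt: PNG, which suits icons and logos.
    const png = encode("image/png");
    if (png.length <= maxLength) { return png; }

    // Fallback: JPEG, lowering quality by `step` each round.
    // Assumption: the step-down starts at quality 1.0.
    for (let k = 0; ; k++) {
        // Round to two decimals so repeated subtraction does not drift.
        const quality = Math.round((1 - k * step) * 100) / 100;
        if (quality <= 0) { return null; }
        const jpeg = encode("image/jpeg", quality);
        if (jpeg.length <= maxLength) { return jpeg; }
    }
}
```

Returning `null` rather than an oversized data URL leaves the caller free to drop the thumbnail instead of triggering the 400 response mentioned above.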
1 parent 9cb7fa7 commit 60aea08

3 files changed

Lines changed: 380 additions & 17 deletions

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
+/**
+ * oEmbed-based article extraction for rich-media pages (video, slideshare,
+ * soundcloud, etc.). When a page URL matches a known oEmbed provider, fetch
+ * the provider's structured embed payload. Returns the raw response data so
+ * the renderer can compose distinct HTML for preview (clean static thumbnail)
+ * and save (iframe embed picked up by OneNote's page renderer).
+ *
+ * Provider list mirrors the canonical OneNote-supported set. Each entry is
+ * { name, endpoint, hostPattern } where hostPattern is either a bare
+ * hostname (matched as suffix), a "host/path" prefix, or a partial hostname
+ * ending in "." (matched as prefix).
+ *
+ * Returns null on no-match or fetch failure; callers should fall back to
+ * Readability.
+ */
+
+interface OEmbedProvider {
+    name: string;
+    endpoint: string;
+    hostPattern: string;
+}
+
+export interface OEmbedData {
+    type: string; // "video" | "photo" | "link" | "rich"
+    html?: string; // present for video / rich
+    url?: string; // present for photo
+    width?: number;
+    height?: number;
+    title?: string;
+    author_name?: string;
+    thumbnail_url?: string;
+    provider_name?: string;
+    pageUrl: string; // echo of the page URL we matched against
+}
+
+// Provider set matches V1's video extractor support (YouTube + Vimeo).
+// V1 also had KhanAcademy in its SupportedVideoDomains, but Khan Academy
+// doesn't publish an oEmbed endpoint -- their V1 extractor was just
+// scanning Khan Academy pages for embedded YouTube iframes, which our
+// YouTube provider already covers when those iframes are present.
+const PROVIDERS: OEmbedProvider[] = [
+    { name: "YouTube", endpoint: "https://www.youtube.com/oembed", hostPattern: "youtube.com" },
+    { name: "YouTube", endpoint: "https://www.youtube.com/oembed", hostPattern: "youtu.be" },
+    { name: "Vimeo", endpoint: "https://vimeo.com/api/oembed.json", hostPattern: "vimeo.com" },
+];
+
+function matchProvider(url: string): OEmbedProvider | null {
+    let parsed: URL;
+    try {
+        parsed = new URL(url);
+    } catch (e) {
+        return null;
+    }
+    const host = parsed.hostname.toLowerCase();
+    const hostAndPath = (host + parsed.pathname).toLowerCase();
+
+    for (const provider of PROVIDERS) {
+        const pattern = provider.hostPattern.toLowerCase();
+
+        if (pattern.indexOf("/") !== -1) {
+            if (hostAndPath === pattern
+                || hostAndPath.indexOf(pattern) === 0
+                || hostAndPath.indexOf("." + pattern) !== -1) {
+                return provider;
+            }
+        } else if (pattern.charAt(pattern.length - 1) === ".") {
+            if (host.indexOf(pattern) === 0) {
+                return provider;
+            }
+        } else {
+            if (host === pattern || host.indexOf("." + pattern) === host.length - pattern.length - 1) {
+                return provider;
+            }
+        }
+    }
+    return null;
+}
+
+/**
+ * Strip executable surfaces from provider-supplied HTML while preserving the
+ * iframes/anchors/images that carry the actual embed. Belt-and-suspenders:
+ * the renderer's preview iframe is sandboxed (allow-same-origin), and
+ * OneNote sanitizes server-side on save.
+ */
+export function sanitizeProviderHtml(html: string): string {
+    const doc = new DOMParser().parseFromString(html, "text/html");
+
+    const removable = doc.querySelectorAll("script, object, embed, link, style, meta");
+    for (let i = removable.length - 1; i >= 0; i--) {
+        const el = removable[i];
+        if (el.parentNode) { el.parentNode.removeChild(el); }
+    }
+
+    const all = doc.querySelectorAll("*");
+    for (let i = 0; i < all.length; i++) {
+        const el = all[i] as HTMLElement;
+        const attrs = el.attributes;
+        for (let j = attrs.length - 1; j >= 0; j--) {
+            const name = attrs[j].name.toLowerCase();
+            const value = attrs[j].value;
+            if (name.indexOf("on") === 0) {
+                el.removeAttribute(attrs[j].name);
+            } else if ((name === "href" || name === "src") && /^\s*javascript:/i.test(value)) {
+                el.removeAttribute(attrs[j].name);
+            }
+        }
+    }
+
+    return doc.body ? doc.body.innerHTML : "";
+}
+
+/**
+ * Entry point. Returns raw oEmbed response data on success, null on
+ * no-match or any failure (caller should fall back to Readability).
+ */
+export async function tryOEmbed(pageUrl: string): Promise<OEmbedData | null> {
+    if (!pageUrl) { return null; }
+
+    const provider = matchProvider(pageUrl);
+    if (!provider) { return null; }
+
+    const endpoint = provider.endpoint
+        + "?url=" + encodeURIComponent(pageUrl)
+        + "&format=json&maxwidth=600";
+
+    try {
+        const resp = await fetch(endpoint);
+        if (!resp.ok) { return null; }
+        const data = await resp.json() as Partial<OEmbedData>;
+        // Only video / rich / photo types produce embeddable content.
+        // "link" type carries metadata only; let Readability handle the page text.
+        if (data.type !== "video" && data.type !== "rich" && data.type !== "photo") {
+            return null;
+        }
+        // Ensure provider_name is set even when the response omits it -- some
+        // providers leave it blank but our match guarantees we know who it is.
+        if (!data.provider_name) {
+            data.provider_name = provider.name;
+        }
+        data.pageUrl = pageUrl;
+        return data as OEmbedData;
+    } catch (e) {
+        return null;
+    }
+}
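
A note on the bare-hostname branch of `matchProvider` in the file above: it compares `host.indexOf("." + pattern)` against `host.length - pattern.length - 1`. When the host does not contain the pattern but happens to have the same length, both sides evaluate to -1 and the comparison succeeds. For example, `example.com` and `youtube.com` are both 11 characters. The sketch below reproduces that comparison next to one possible stricter suffix check. Both function names are hypothetical and neither is part of the commit.

```typescript
// The committed bare-hostname comparison, reproduced for illustration only.
function committedSuffixMatch(host: string, pattern: string): boolean {
    return host === pattern
        || host.indexOf("." + pattern) === host.length - pattern.length - 1;
}

// Hypothetical alternative: true only when `host` equals `pattern` or ends
// with "." + pattern, i.e. it is the domain itself or one of its subdomains.
function hostMatchesSuffix(host: string, pattern: string): boolean {
    const h = host.toLowerCase();
    const p = pattern.toLowerCase();
    if (h === p) { return true; }
    const suffix = "." + p;
    // The length guard keeps a bare ".youtube.com" from counting as a subdomain.
    return h.length > suffix.length && h.slice(-suffix.length) === suffix;
}

// Same-length, unrelated host: the committed comparison accepts it,
// while the stricter check rejects it.
const looseResult = committedSuffixMatch("example.com", "youtube.com");
const strictResult = hostMatchesSuffix("example.com", "youtube.com");
```

A suffix check of this kind also keeps look-alike hosts such as `notyoutube.com` out, since the match must start at a label boundary.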

src/scripts/extensions/webExtensionBase/webExtensionWorker.ts

Lines changed: 23 additions & 5 deletions
@@ -679,6 +679,7 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 let saveAnnotation = msg.annotation || "";
 let saveSectionId = msg.sectionId || "";
 let saveUrl = msg.url || "";
+let savePageMetadata: { [key: string]: string } | undefined = msg.pageMetadata;
 
 // Ensure fresh token before save (matches old clipper.tsx ensureFreshUserBeforeClip)
 workerSelf.auth.updateUserInfoData(workerSelf.clientInfo.get().clipperId, UpdateReason.TokenRefreshForPendingClip).then(() => {
@@ -697,7 +698,10 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
     return;
 }
 
-// Build OneNote page content based on mode
+// Build OneNote page content based on mode. Output shape mirrors V1
+// OneNotePage.getEntireOnml: `<html xmlns lang>` (no DOCTYPE, no
+// quotes around lang), `<head>` with title + created meta + one
+// `<meta>` per PageMetadata entry.
 let buildPage = (bodyOnml: string, imageParts: { name: string; blob: Blob; type: string }[]) => {
     let boundary = "OneNoteRendererBoundary" + Date.now();
     let now = new Date();
@@ -710,9 +714,21 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 if (parseInt(offsetMins, 10) < 10) { offsetMins = "0" + offsetMins; }
 let createdTime = offsetSign + offsetHours + ":" + offsetMins;
 let fontStyle = "font-size: 16px; font-family: Verdana;";
-let presentationHtml = "<!DOCTYPE html><html><head>"
-    + "<title>" + saveTitle.replace(/</g, "&lt;").replace(/>/g, "&gt;") + "</title>"
+let locale = (typeof chrome !== "undefined" && chrome.i18n && chrome.i18n.getUILanguage) ? chrome.i18n.getUILanguage() : "en";
+let metaTags = "";
+if (savePageMetadata) {
+    for (let key in savePageMetadata) {
+        if (Object.prototype.hasOwnProperty.call(savePageMetadata, key)) {
+            metaTags += "<meta name=\"" + escapeAttr(key)
+                + "\" content=\"" + escapeAttr(savePageMetadata[key]) + "\" />";
+        }
+    }
+}
+let presentationHtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=" + locale + ">"
+    + "<head>"
+    + "<title>" + escapeHtml(saveTitle) + "</title>"
     + "<meta name=\"created\" content=\"" + createdTime + " \">"
+    + metaTags
     + "</head><body>";
 if (saveAnnotation) {
     let escaped = saveAnnotation.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
@@ -861,8 +877,10 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 if (parseInt(oM, 10) < 10) { oM = "0" + oM; }
 let ct = offsetSign2 + oH + ":" + oM;
 let fStyle = "font-size: 16px; font-family: Verdana;";
-let distHtml = "<!DOCTYPE html><html><head>"
-    + "<title>" + pageTitle.replace(/</g, "&lt;").replace(/>/g, "&gt;") + "</title>"
+let distLocale = (typeof chrome !== "undefined" && chrome.i18n && chrome.i18n.getUILanguage) ? chrome.i18n.getUILanguage() : "en";
+let distHtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=" + distLocale + ">"
+    + "<head>"
+    + "<title>" + escapeHtml(pageTitle) + "</title>"
     + "<meta name=\"created\" content=\"" + ct + " \">"
     + "</head><body>";
 if (pageIdx === 0 && saveAnnotation) {
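
The worker hunks above call `escapeAttr` and `escapeHtml`, whose definitions are not part of the lines shown. As a minimal sketch of the PageMetadata step in isolation, the block below emits one `<meta>` tag per entry using a stand-in `escapeAttr`. The escaping rules here are an assumption, not the committed helper, and `buildMetaTags` is a hypothetical name for what `buildPage` does inline.

```typescript
// Stand-in for the committed escapeAttr helper, which this diff does not show.
// Assumption: it escapes the characters that can break a double-quoted attribute.
function escapeAttr(value: string): string {
    return value
        .replace(/&/g, "&amp;")
        .replace(/"/g, "&quot;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Hypothetical wrapper around the loop that buildPage runs inline:
// one <meta name="K" content="V" /> per own property of the map.
function buildMetaTags(pageMetadata?: { [key: string]: string }): string {
    let metaTags = "";
    if (!pageMetadata) { return metaTags; }
    for (const key in pageMetadata) {
        if (Object.prototype.hasOwnProperty.call(pageMetadata, key)) {
            metaTags += "<meta name=\"" + escapeAttr(key)
                + "\" content=\"" + escapeAttr(pageMetadata[key]) + "\" />";
        }
    }
    return metaTags;
}

// Article mode populates these two keys according to the commit message.
const articleTags = buildMetaTags({ AutoPageTagsCodes: "Article", AutoPageTags: "Article" });
```

Escaping `&` first matters in a helper like this, because otherwise the entities produced by the later replacements would be escaped a second time.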
