Article-mode oEmbed extraction for video pages + V1 payload parity

wikirby · claude · wikirby · commit 60aea080c281 · 2026-05-14T01:47:59.000-05:00
V1's article-mode flow on video pages (YouTube, Vimeo) produced a save payload with an embedded video iframe, citation, title/author caption, and `<meta name="AutoPageTagsCodes" content="Article" />` / `<meta name="AutoPageTags" content="Article" />` tags that OneNote's page renderer uses to recognize the result as an article-style clip with playable embeds. V2 shipped without any of that machinery -- the article mode just ran Readability over the YouTube DOM, which strips iframes and produces a text-only result with no player and no description. Users reported the regression on YouTube specifically; the same gap applied to Vimeo.

The oEmbed standard provides exactly the shape we need (iframe `html`, title, author_name, thumbnail_url, dimensions) without any provider-specific scraping. Both YouTube and Vimeo publish CORS-enabled oEmbed endpoints that the chrome-extension origin can fetch directly under our existing `<all_urls>` host_permissions.

Changes:

- New `src/scripts/contentCapture/oembedExtractor.ts` -- thin module with a provider table (YouTube + Vimeo only, matching V1's SupportedVideoDomains), hostname-pattern matching, fetch + JSON parse, and a small `sanitizeProviderHtml` helper that strips script-execution surfaces from provider-supplied HTML.
- `extractArticle` in renderer now tries oEmbed first; on no-match or fetch failure it falls through to the existing Readability path with zero behavior change.
- Preview and save are decoupled:
  - Preview shows the `thumbnail_url` at the same 600x338 (16:9) box the saved iframe uses, with title / "author . provider" attribution, page description (og:description fallback chain same as bookmark mode), and a CSS-only play-glyph overlay when `type === "video"`. No iframe in preview because the renderer's `preview-frame` is sandboxed (allow-same-origin) and the YouTube/Vimeo player can't run JS inside it -- which is why earlier attempts produced a broken "Unable to execute JavaScript" placeholder.
  - Save uses the provider's iframe HTML (sanitized), with dimensions normalized to 600x338 and `data-original-src=<pageUrl>` injected. That attribute is the marker OneNote's renderer uses to recognize and render the embedded player on the saved page, matching V1's YoutubeVideoExtractor behavior exactly.
- PageMetadata plumbing: renderer threads a `pageMetadata` map through the save port message; worker's `buildPage` iterates and emits `<meta name="K" content="V" />` for each entry. Mirrors V1's `OneNoteApi.OneNotePage.getPageMetadataAsHtml` behavior. Article mode (both oEmbed and Readability paths) populates `AutoPageTagsCodes=Article`, `AutoPageTags=Article`, plus title/author/siteName (oEmbed) or title/description/author/siteName/publishedTime (Readability, matching V1 augmentationHelper).
- `buildPage` HTML output realigned to V1 `OneNoteApi.OneNotePage.getEntireOnml` shape: no `<!DOCTYPE>`, `<html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>>` (no quotes around lang -- matches V1 output literally), locale via `chrome.i18n.getUILanguage()`. Same change applied to the parallel `distHtml` builder for distributed-PDF saves so all save paths emit the same shape.
- Bookmark thumbnail size fallback restored: `imageToDataUrl` initial-encode is PNG (good for icons/logos), with iterative JPEG-quality step-down when the encoded data URL exceeds the OneNote API per-MIME-part limit (~2MB minus padding). Matches V1's deleted `DomUtils.adjustImageQualityIfNecessary` behavior including the 0.1 step size. Surfaced because the user was hitting "400 Maximum request size exceeded" on bookmark-mode saves of YouTube pages whose 1280x720 og:image PNG-encoded to ~2.5MB.

Provider scope is intentionally narrow (YouTube + Vimeo only) to match V1's effective surface and avoid accidentally enabling capture on sites V1 never supported. V1 also handled Khan Academy via regex scrape for embedded YouTube IDs in lesson-page HTML; that markup likely no longer matches modern Khan Academy pages and is skipped here per maintainer direction.

Verified manually: YouTube watch page and Vimeo video page produce saved OneNote pages with the embedded player, title/author caption, and og:description text below; non-matching domains fall through to Readability with no regression; bookmark mode on YouTube saves successfully without the 400 limit error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
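
The thumbnail fallback described above (PNG first, then JPEG with quality stepped down by 0.1 until the data URL fits) can be sketched roughly as follows. This is an illustration only, not the code in `imageToDataUrl`: the `Encoder` callback, the `encodeWithinLimit` name, the exact byte budget, and the 0.9 starting quality are all assumptions, since the message only states "~2MB minus padding" and the 0.1 step size.

```typescript
// Hypothetical encoder callback: returns a data URL for the given MIME type
// and (for JPEG) quality. In a browser this could wrap canvas.toDataURL().
type Encoder = (mimeType: "image/png" | "image/jpeg", quality?: number) => string;

// Assumed budget; the commit message only says "~2MB minus padding".
const MAX_DATA_URL_LENGTH = 2 * 1024 * 1024 - 16 * 1024;

function encodeWithinLimit(encode: Encoder, maxLength: number = MAX_DATA_URL_LENGTH): string {
	// First attempt: lossless PNG, which suits icons and logos.
	let dataUrl = encode("image/png");

	// Step JPEG quality down by 0.1 (0.9, 0.8, ... 0.1) until the result fits.
	// An integer counter avoids floating-point drift in the quality value.
	for (let step = 9; dataUrl.length > maxLength && step >= 1; step--) {
		dataUrl = encode("image/jpeg", step / 10);
	}

	// If even quality 0.1 is too large, the last attempt is returned as-is;
	// a caller may prefer to drop the thumbnail in that case.
	return dataUrl;
}
```

In a browser context the encoder would typically be something like `(mime, q) => canvas.toDataURL(mime, q)`. Passing it in as a callback keeps the step-down loop testable without a canvas.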
diff --git a/src/scripts/contentCapture/oembedExtractor.ts b/src/scripts/contentCapture/oembedExtractor.ts
@@ -0,0 +1,145 @@
+/**
+ * oEmbed-based article extraction for video pages (currently YouTube and
+ * Vimeo only). When a page URL matches a known oEmbed provider, fetch
+ * the provider's structured embed payload. Returns the raw response data so
+ * the renderer can compose distinct HTML for preview (clean static thumbnail)
+ * and save (iframe embed picked up by OneNote's page renderer).
+ *
+ * Provider list mirrors the canonical OneNote-supported set. Each entry is
+ * { name, endpoint, hostPattern } where hostPattern is either a bare
+ * hostname (matched as suffix), a "host/path" prefix, or a partial hostname
+ * ending in "." (matched as prefix).
+ *
+ * Returns null on no-match or fetch failure; callers should fall back to
+ * Readability.
+ */
+
+interface OEmbedProvider {
+	name: string;
+	endpoint: string;
+	hostPattern: string;
+}
+
+export interface OEmbedData {
+	type: string;                 // "video" | "photo" | "link" | "rich"
+	html?: string;                // present for video / rich
+	url?: string;                 // present for photo
+	width?: number;
+	height?: number;
+	title?: string;
+	author_name?: string;
+	thumbnail_url?: string;
+	provider_name?: string;
+	pageUrl: string;              // echo of the page URL we matched against
+}
+
+// Provider set matches V1's video extractor support (YouTube + Vimeo).
+// V1 also had KhanAcademy in its SupportedVideoDomains, but Khan Academy
+// doesn't publish an oEmbed endpoint -- the V1 extractor scanned Khan
+// Academy pages for embedded YouTube IDs. Matching here is by page URL
+// only, so Khan Academy pages are not covered and fall through to Readability.
+const PROVIDERS: OEmbedProvider[] = [
+	{ name: "YouTube", endpoint: "https://www.youtube.com/oembed", hostPattern: "youtube.com" },
+	{ name: "YouTube", endpoint: "https://www.youtube.com/oembed", hostPattern: "youtu.be" },
+	{ name: "Vimeo", endpoint: "https://vimeo.com/api/oembed.json", hostPattern: "vimeo.com" },
+];
+
+function matchProvider(url: string): OEmbedProvider | null {
+	let parsed: URL;
+	try {
+		parsed = new URL(url);
+	} catch (e) {
+		return null;
+	}
+	const host = parsed.hostname.toLowerCase();
+	const hostAndPath = (host + parsed.pathname).toLowerCase();
+
+	for (const provider of PROVIDERS) {
+		const pattern = provider.hostPattern.toLowerCase();
+
+		if (pattern.indexOf("/") !== -1) {
+			if (hostAndPath === pattern
+				|| hostAndPath.indexOf(pattern) === 0
+				|| hostAndPath.indexOf("." + pattern) !== -1) {
+				return provider;
+			}
+		} else if (pattern.charAt(pattern.length - 1) === ".") {
+			if (host.indexOf(pattern) === 0) {
+				return provider;
+			}
+		} else {
+			if (host === pattern || host.slice(-(pattern.length + 1)) === "." + pattern) {
+				return provider;
+			}
+		}
+	}
+	return null;
+}
+
+/**
+ * Strip executable surfaces from provider-supplied HTML while preserving the
+ * iframes/anchors/images that carry the actual embed. Belt-and-suspenders:
+ * the renderer's preview iframe is sandboxed (allow-same-origin), and
+ * OneNote sanitizes server-side on save.
+ */
+export function sanitizeProviderHtml(html: string): string {
+	const doc = new DOMParser().parseFromString(html, "text/html");
+
+	const removable = doc.querySelectorAll("script, object, embed, link, style, meta");
+	for (let i = removable.length - 1; i >= 0; i--) {
+		const el = removable[i];
+		if (el.parentNode) { el.parentNode.removeChild(el); }
+	}
+
+	const all = doc.querySelectorAll("*");
+	for (let i = 0; i < all.length; i++) {
+		const el = all[i] as HTMLElement;
+		const attrs = el.attributes;
+		for (let j = attrs.length - 1; j >= 0; j--) {
+			const name = attrs[j].name.toLowerCase();
+			const value = attrs[j].value;
+			if (name.indexOf("on") === 0) {
+				el.removeAttribute(attrs[j].name);
+			} else if ((name === "href" || name === "src") && /^\s*javascript:/i.test(value)) {
+				el.removeAttribute(attrs[j].name);
+			}
+		}
+	}
+
+	return doc.body ? doc.body.innerHTML : "";
+}
+
+/**
+ * Entry point. Returns raw oEmbed response data on success, null on
+ * no-match or any failure (caller should fall back to Readability).
+ */
+export async function tryOEmbed(pageUrl: string): Promise<OEmbedData | null> {
+	if (!pageUrl) { return null; }
+
+	const provider = matchProvider(pageUrl);
+	if (!provider) { return null; }
+
+	const endpoint = provider.endpoint
+		+ "?url=" + encodeURIComponent(pageUrl)
+		+ "&format=json&maxwidth=600";
+
+	try {
+		const resp = await fetch(endpoint);
+		if (!resp.ok) { return null; }
+		const data = await resp.json() as Partial<OEmbedData>;
+		// Only video / rich / photo types produce embeddable content.
+		// "link" type carries metadata only; let Readability handle the page text.
+		if (data.type !== "video" && data.type !== "rich" && data.type !== "photo") {
+			return null;
+		}
+		// Ensure provider_name is set even when the response omits it -- some
+		// providers leave it blank but our match guarantees we know who it is.
+		if (!data.provider_name) {
+			data.provider_name = provider.name;
+		}
+		data.pageUrl = pageUrl;
+		return data as OEmbedData;
+	} catch (e) {
+		return null;
+	}
+}
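
The header comment describes the bare-hostname pattern as a suffix match. As a standalone illustration of that rule (not the module's own code), the sketch below compares the tail of the host against `"." + pattern`, so a match always lands on a label boundary. `hostMatchesSuffix` is a hypothetical name. Hosts with the same length as the pattern (for example `example.org` against `youtube.com`) are worth keeping in any test table, because index arithmetic on a not-found `indexOf` result of -1 is easy to get wrong in exactly that case.

```typescript
// Hypothetical helper: true when `host` equals `pattern` or is a subdomain of it.
function hostMatchesSuffix(host: string, pattern: string): boolean {
	const h = host.toLowerCase();
	const p = pattern.toLowerCase();
	if (h === p) { return true; }

	// Compare the tail of the host against "." + pattern so that
	// "notyoutube.com" does not match "youtube.com".
	const suffix = "." + p;
	return h.length > suffix.length && h.slice(-suffix.length) === suffix;
}

console.log(hostMatchesSuffix("www.youtube.com", "youtube.com")); // true
console.log(hostMatchesSuffix("notyoutube.com", "youtube.com"));  // false
console.log(hostMatchesSuffix("example.org", "youtube.com"));     // false
```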
diff --git a/src/scripts/extensions/webExtensionBase/webExtensionWorker.ts b/src/scripts/extensions/webExtensionBase/webExtensionWorker.ts
@@ -679,6 +679,7 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 					let saveAnnotation = msg.annotation || "";
 					let saveSectionId = msg.sectionId || "";
 					let saveUrl = msg.url || "";
+					let savePageMetadata: { [key: string]: string } | undefined = msg.pageMetadata;
 
 					// Ensure fresh token before save (matches old clipper.tsx ensureFreshUserBeforeClip)
 					workerSelf.auth.updateUserInfoData(workerSelf.clientInfo.get().clipperId, UpdateReason.TokenRefreshForPendingClip).then(() => {
@@ -697,7 +698,10 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 								return;
 							}
 
-							// Build OneNote page content based on mode
+							// Build OneNote page content based on mode. Output shape mirrors V1
+							// OneNotePage.getEntireOnml: `<html xmlns lang>` (no DOCTYPE, no
+							// quotes around lang), `<head>` with title + created meta + one
+							// `<meta>` per PageMetadata entry.
 							let buildPage = (bodyOnml: string, imageParts: { name: string; blob: Blob; type: string }[]) => {
 								let boundary = "OneNoteRendererBoundary" + Date.now();
 								let now = new Date();
@@ -710,9 +714,21 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 								if (parseInt(offsetMins, 10) < 10) { offsetMins = "0" + offsetMins; }
 								let createdTime = offsetSign + offsetHours + ":" + offsetMins;
 								let fontStyle = "font-size: 16px; font-family: Verdana;";
-								let presentationHtml = "<!DOCTYPE html><html><head>"
-									+ "<title>" + saveTitle.replace(/</g, "&lt;").replace(/>/g, "&gt;") + "</title>"
+								let locale = (typeof chrome !== "undefined" && chrome.i18n && chrome.i18n.getUILanguage) ? chrome.i18n.getUILanguage() : "en";
+								let metaTags = "";
+								if (savePageMetadata) {
+									for (let key in savePageMetadata) {
+										if (Object.prototype.hasOwnProperty.call(savePageMetadata, key)) {
+											metaTags += "<meta name=\"" + escapeAttr(key)
+												+ "\" content=\"" + escapeAttr(savePageMetadata[key]) + "\" />";
+										}
+									}
+								}
+								let presentationHtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=" + locale + ">"
+									+ "<head>"
+									+ "<title>" + escapeHtml(saveTitle) + "</title>"
 									+ "<meta name=\"created\" content=\"" + createdTime + " \">"
+									+ metaTags
 									+ "</head><body>";
 								if (saveAnnotation) {
 									let escaped = saveAnnotation.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
@@ -861,8 +877,10 @@ export class WebExtensionWorker extends ExtensionWorkerBase<W3CTab, number> {
 											if (parseInt(oM, 10) < 10) { oM = "0" + oM; }
 											let ct = offsetSign2 + oH + ":" + oM;
 											let fStyle = "font-size: 16px; font-family: Verdana;";
-											let distHtml = "<!DOCTYPE html><html><head>"
-												+ "<title>" + pageTitle.replace(/</g, "&lt;").replace(/>/g, "&gt;") + "</title>"
+											let distLocale = (typeof chrome !== "undefined" && chrome.i18n && chrome.i18n.getUILanguage) ? chrome.i18n.getUILanguage() : "en";
+											let distHtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=" + distLocale + ">"
+												+ "<head>"
+												+ "<title>" + escapeHtml(pageTitle) + "</title>"
 												+ "<meta name=\"created\" content=\"" + ct + " \">"
 												+ "</head><body>";
 											if (pageIdx === 0 && saveAnnotation) {
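
For reference, the `pageMetadata` to `<meta>` step in `buildPage` can be exercised in isolation with a sketch like the one below. The worker's `escapeAttr` is not part of this diff, so `escapeAttrSketch` is a hypothetical stand-in whose escaping rules are an assumption, and `buildMetaTags` is likewise an illustrative name rather than a function in the codebase.

```typescript
// Hypothetical stand-in for the worker's escapeAttr (not shown in this diff).
function escapeAttrSketch(value: string): string {
	return value
		.replace(/&/g, "&amp;")
		.replace(/"/g, "&quot;")
		.replace(/</g, "&lt;")
		.replace(/>/g, "&gt;");
}

// Emits one self-closed <meta> per own enumerable key, in insertion order.
function buildMetaTags(metadata?: { [key: string]: string }): string {
	let out = "";
	if (!metadata) { return out; }
	for (const key of Object.keys(metadata)) {
		out += "<meta name=\"" + escapeAttrSketch(key)
			+ "\" content=\"" + escapeAttrSketch(metadata[key]) + "\" />";
	}
	return out;
}

console.log(buildMetaTags({ AutoPageTagsCodes: "Article", AutoPageTags: "Article" }));
// <meta name="AutoPageTagsCodes" content="Article" /><meta name="AutoPageTags" content="Article" />
```

`Object.keys` visits only own enumerable properties, which is the same set the `for...in` plus `hasOwnProperty` guard in the diff selects.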
diff --git a/src/scripts/renderer.ts b/src/scripts/renderer.ts