37 changes: 36 additions & 1 deletion Article-Generation.md
@@ -448,10 +448,45 @@ The HTML article is a pure projection. If the analysis is weak, the article will
| File | Responsibility |
|---|---|
| [`scripts/aggregate-analysis.ts`](scripts/aggregate-analysis.ts) | CLI wrapper for aggregating one folder or all folders. |
-| [`scripts/render-lib/aggregator.ts`](scripts/render-lib/aggregator.ts) | Deterministic logic for ordering, reader-guide insertion, cleaning, linking and front matter. |
+| [`scripts/render-lib/aggregator/aggregate.ts`](scripts/render-lib/aggregator/aggregate.ts) | Slim orchestrator: reads artifacts, delegates to leaf modules, returns `AggregationResult`. |
+| [`scripts/render-lib/aggregator/interfaces.ts`](scripts/render-lib/aggregator/interfaces.ts) | Shared pipeline interfaces (`PipelineResult`, `ReadStageInput`, `WriteStageOutput`, etc.). |
+| [`scripts/render-lib/aggregator/pipeline.ts`](scripts/render-lib/aggregator/pipeline.ts) | Composable pipeline orchestrator (`runArticlePipeline`). |
+| [`scripts/render-lib/aggregator/cleaning/`](scripts/render-lib/aggregator/cleaning/) | Body cleaning: admin-bylines, pass-two, process-meta, structural, heading-demotion, link-rewriting, deduplication. |
+| [`scripts/render-lib/aggregator/seo/`](scripts/render-lib/aggregator/seo/) | Title and description extraction for SEO metadata. |
+| [`scripts/render-lib/aggregator/order.ts`](scripts/render-lib/aggregator/order.ts) | Canonical narrative order (`AGGREGATION_ORDER`). |
+| [`scripts/render-lib/aggregator/frontmatter.ts`](scripts/render-lib/aggregator/frontmatter.ts) | YAML front-matter assembly and escape helpers. |
+| [`scripts/render-lib/aggregator/reader-guide.ts`](scripts/render-lib/aggregator/reader-guide.ts) | Reader Intelligence Guide table generation. |
+| [`scripts/render-lib/aggregator/per-document.ts`](scripts/render-lib/aggregator/per-document.ts) | Per-document `documents/` expansion. |
+| [`scripts/render-lib/aggregator/sources-appendix.ts`](scripts/render-lib/aggregator/sources-appendix.ts) | Article Sources appendix generation. |
| [`scripts/render-lib/url-helpers.ts`](scripts/render-lib/url-helpers.ts) | GitHub blob/tree URL construction. |
| [`scripts/render-lib/constants.ts`](scripts/render-lib/constants.ts) | Shared paths, base URLs and language constants. |

### Pipeline architecture (bounded contexts)

```
scripts/render-lib/aggregator/
├── interfaces.ts           # Shared pipeline types (PipelineResult, ReadStageInput, etc.)
├── pipeline.ts             # Composable pipeline orchestrator (runArticlePipeline)
├── aggregate.ts            # Core orchestrator (aggregateAnalysis)
├── order.ts                # Canonical narrative order
├── frontmatter.ts          # YAML front-matter + escape helpers
├── reader-guide.ts         # Reader Intelligence Guide
├── reader-guide-i18n.ts    # 14-language i18n for Reader Guide
├── per-document.ts         # documents/ expansion
├── sources-appendix.ts     # Article Sources appendix
├── cleaning/
│   ├── structural.ts       # cleanArtifactBody orchestrator
│   ├── admin-bylines.ts    # Admin-byline paragraph stripping
│   ├── pass-two.ts         # AI self-audit section stripping
│   ├── process-meta.ts     # Process-metadata line stripping
│   ├── heading-demotion.ts # Heading level demotion (## → ###)
│   ├── link-rewriting.ts   # Relative → GitHub blob URL rewriting
│   └── deduplication.ts    # Adjacent-line and footer-block dedup
└── seo/
    ├── title.ts            # Article title extraction + cleaning
    └── description.ts      # BLUF / first-paragraph description
```

### Aggregation command

```bash
17 changes: 13 additions & 4 deletions scripts/render-lib/aggregator/aggregate.ts
@@ -43,9 +43,10 @@ import { cleanArticleTitle, readFirstHeading, titleFromBluf } from './seo/title.
import { buildSourcesAppendix } from './sources-appendix.js';

 /**
- * Inputs to {@link aggregateAnalysis}. All four fields are required;
- * the absolute path is used for filesystem reads, the repo-relative
- * path is used to build GitHub source URLs.
+ * Inputs to {@link aggregateAnalysis}. All four required fields provide
+ * the filesystem and metadata context; the optional config fields allow
+ * callers (e.g. `runArticlePipeline`) to override front-matter values
+ * without forking the aggregation logic.
  */
 export interface AggregationInput {
   /** Absolute path to `analysis/daily/$DATE/$SUBFOLDER`. */
@@ -56,6 +57,12 @@ export interface AggregationInput {
   readonly date: string;
   /** `$SUBFOLDER` (e.g. `propositions`). */
   readonly subfolder: string;
+  /** Override the `generated_at` front-matter field (ISO-8601). Defaults to `new Date().toISOString()`. */
+  readonly generated_at?: string;
+  /** Language code injected into front-matter (defaults to `'en'`). */
+  readonly language?: string;
+  /** Layout template injected into front-matter (defaults to `'article'`). */
+  readonly layout?: string;
 }

/**
@@ -217,7 +224,9 @@ export function aggregateAnalysis(input: AggregationInput): AggregationResult {
     date,
     subfolder,
     source_folder: subfolderRepoRelPath,
-    generated_at: new Date().toISOString(),
+    generated_at: input.generated_at ?? new Date().toISOString(),
+    language: input.language,
+    layout: input.layout,
   });

const body = sections.join('\n\n');
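
The hunk above forwards `input.language` and `input.layout` unchanged, so the documented defaults (`'en'` and `'article'`) are presumably applied wherever the front matter is assembled, which this diff does not show. A minimal sketch of resolving such optional overrides in one place follows. The names `FrontMatterOverrides`, `ResolvedFrontMatter` and `resolveFrontMatter` are hypothetical and do not appear in the PR, and the injectable clock is an assumption added only to make the result deterministic.

```typescript
// Hypothetical shapes for illustration. Only the three field names and their
// documented defaults are taken from the AggregationInput doc comments.
interface FrontMatterOverrides {
  readonly generated_at?: string;
  readonly language?: string;
  readonly layout?: string;
}

interface ResolvedFrontMatter {
  readonly generated_at: string;
  readonly language: string;
  readonly layout: string;
}

// Fill every missing override with its documented default.
// `now` is an assumed test seam, not part of the PR.
function resolveFrontMatter(
  overrides: FrontMatterOverrides,
  now: () => Date = () => new Date(),
): ResolvedFrontMatter {
  return {
    generated_at: overrides.generated_at ?? now().toISOString(),
    language: overrides.language ?? 'en',
    layout: overrides.layout ?? 'article',
  };
}

// A caller that pins the timestamp and language but keeps the default layout.
const pinned = resolveFrontMatter({
  generated_at: '2025-01-01T00:00:00.000Z',
  language: 'sv',
});

// A caller that overrides nothing; the clock is fixed at the Unix epoch.
const defaults = resolveFrontMatter({}, () => new Date(0));

console.log(pinned, defaults);
```

Using `??` rather than `||` keeps an explicitly passed empty string instead of replacing it with the default; whether that is desirable depends on how the front-matter writer treats empty values.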
124 changes: 124 additions & 0 deletions scripts/render-lib/aggregator/cleaning/deduplication.ts
@@ -0,0 +1,124 @@
/**
* @module Infrastructure/RenderLib/Aggregator/Cleaning/Deduplication
* @category Intelligence Operations / Supporting Infrastructure
* @name Adjacent-line and footer-block deduplication
*
* @description
* Defensive cleaning for AI-authored artifacts that paste classification
* rows, ISMS footers or metadata sentinels more than once. Two functions:
*
* 1. {@link dedupeAdjacentDuplicateLines} — collapses identical adjacent
* non-blank lines (fence-aware, idempotent).
* 2. {@link collapseRepeatedFooterBlocks} — collapses repeated ISMS /
* classification / provenance footer lines to their first occurrence.
*
* Extracted from `structural.ts` to maintain the ≤200 LOC single-
* responsibility constraint.
*
* @author Hack23 AB (Infrastructure Team)
* @license Apache-2.0
*/

/**
* Collapse identical adjacent non-blank lines that appear two-or-more
* times in a row. Defensive cleaning for the common AI-authored failure
* mode where a classification row, ISMS footer or metadata sentinel is
* pasted twice into the same artifact body.
*
* Lines inside fenced code blocks are preserved verbatim — duplication
* inside a code block may be intentional (e.g. config snippets). Blank
* lines are not deduplicated; they participate as paragraph separators
* and are handled later by the `\n{3,}` collapse step.
*
* Stable on already-deduped inputs: the function is idempotent —
* applying it twice yields the same result.
*/
export function dedupeAdjacentDuplicateLines(body: string): string {
  const lines = body.split('\n');
  const out: string[] = [];
  let inFence = false;
  let prevNonBlank: string | null = null;
  for (const line of lines) {
    if (/^\s{0,3}(?:```|~~~)/.test(line)) {
      inFence = !inFence;
      out.push(line);
      prevNonBlank = null;
      continue;
    }
    if (inFence) {
      out.push(line);
      prevNonBlank = null;
      continue;
    }
    if (line.trim() === '') {
      out.push(line);
      // Blank lines reset the adjacency window; duplicates separated
      // by blank lines are a different concern (handled by
      // `collapseRepeatedFooterBlocks`).
      prevNonBlank = null;
      continue;
    }
    if (prevNonBlank !== null && line === prevNonBlank) {
      // Skip the duplicate.
      continue;
    }
    out.push(line);
    prevNonBlank = line;
  }
  return out.join('\n');
}

/**
* Footer-block markers that templates and AI agents have historically
* emitted at the end of every artifact (sometimes twice). The aggregator
* already strips a curated set of trailing administrative blocks (see
* {@link cleanArtifactBody}); this function catches the *intra-body*
* duplicates — when an ISMS / classification / GDPR provenance line
* appears two-or-more times in the same artifact body, only the first
* occurrence is kept.
*
* A "footer block" is a single line (post-trim) that starts with a bold
* marker (`**`) or an italic marker (`*` or `_`) followed,
* case-insensitively, by one of: `ISMS`, `Classified under`, `GDPR`,
* `Hack23 ISMS`, `Article-Generation contract` or `Provenance`.
* Examples: `**ISMS …`, `**Provenance**`, `_Classified under …`,
* `*Classified under …`.
*
* Lines inside fenced code blocks are preserved verbatim. Subsequent
* occurrences of the *exact same* footer line are removed (along with a
* single trailing blank line so the surrounding paragraph spacing is
* preserved).
*/
export function collapseRepeatedFooterBlocks(body: string): string {
  const FOOTER_LINE = /^\s*(?:\*\*|[*_])\s*(?:ISMS\b|Classified\s+under\b|GDPR\b|Hack23\s+ISMS\b|Article-Generation\s+contract\b|Provenance\b)/i;
  const lines = body.split('\n');
  const seen = new Set<string>();
  const out: string[] = [];
  let inFence = false;
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i]!;
    if (/^\s{0,3}(?:```|~~~)/.test(line)) {
      inFence = !inFence;
      out.push(line);
      continue;
    }
    if (inFence) {
      out.push(line);
      continue;
    }
    const trimmed = line.trim();
    if (FOOTER_LINE.test(trimmed)) {
      if (seen.has(trimmed)) {
        // Skip this duplicated footer line. Also swallow a single
        // trailing blank line so we don't leave a stranded gap.
        if (i + 1 < lines.length && lines[i + 1]!.trim() === '') {
          i += 1;
        }
        continue;
      }
      seen.add(trimmed);
    }
    out.push(line);
  }
  return out.join('\n');
}
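
To make the division of labour between the two passes above concrete, here is a small worked example. It is a sketch under stated assumptions: `dedupeAdjacent` and `collapseFooters` are condensed restatements of the two functions so the block runs on its own, the fence regex is written with `{3}` quantifiers but matches the same lines, and running the adjacent-line pass first is an assumed order, since `structural.ts` is not part of this diff.

```typescript
// Same fence test as in the diff, written with quantifiers.
const FENCE = /^\s{0,3}(?:`{3}|~{3})/;
const FOOTER_LINE =
  /^(?:\*\*|[*_])\s*(?:ISMS\b|Classified\s+under\b|GDPR\b|Hack23\s+ISMS\b|Article-Generation\s+contract\b|Provenance\b)/i;

// Condensed restatement of dedupeAdjacentDuplicateLines.
function dedupeAdjacent(body: string): string {
  const out: string[] = [];
  let inFence = false;
  let prev: string | null = null;
  for (const line of body.split('\n')) {
    const isFence = FENCE.test(line);
    if (isFence) inFence = !inFence;
    if (isFence || inFence || line.trim() === '') {
      out.push(line);
      prev = null; // fences and blank lines reset the adjacency window
      continue;
    }
    if (line === prev) continue; // drop the adjacent duplicate
    out.push(line);
    prev = line;
  }
  return out.join('\n');
}

// Condensed restatement of collapseRepeatedFooterBlocks.
function collapseFooters(body: string): string {
  const lines = body.split('\n');
  const seen = new Set<string>();
  const out: string[] = [];
  let inFence = false;
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i] ?? '';
    if (FENCE.test(line)) {
      inFence = !inFence;
    } else if (!inFence) {
      const trimmed = line.trim();
      if (FOOTER_LINE.test(trimmed)) {
        if (seen.has(trimmed)) {
          const next = lines[i + 1];
          if (next !== undefined && next.trim() === '') i += 1; // swallow one blank
          continue;
        }
        seen.add(trimmed);
      }
    }
    out.push(line);
  }
  return out.join('\n');
}

const sample = [
  '**ISMS**: public',
  '**ISMS**: public', // adjacent duplicate: removed by the first pass
  '',
  'Body paragraph.',
  '',
  '~~~',
  'retry: 3',
  'retry: 3', // inside a fence: kept by both passes
  '~~~',
  '',
  '**ISMS**: public', // separated by blank lines: removed by the second pass
  '',
  'Closing paragraph.',
].join('\n');

const cleaned = collapseFooters(dedupeAdjacent(sample));
console.log(cleaned);
```

Only one `**ISMS**: public` line survives, and the repeated `retry: 3` inside the fence is untouched. Like the functions in the diff, this sketch flips fence state on any line that opens with three backticks or three tildes, so a tilde line inside a backtick fence would end the fence early; tracking the opening marker would be one way to tighten that if it ever matters.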
54 changes: 54 additions & 0 deletions scripts/render-lib/aggregator/cleaning/heading-demotion.ts
@@ -0,0 +1,54 @@
/**
* @module Infrastructure/RenderLib/Aggregator/Cleaning/HeadingDemotion
* @category Intelligence Operations / Supporting Infrastructure
* @name Heading demotion inside aggregated artifact bodies
*
* @description
* Demote ATX headings by one level inside an artifact body — `##` → `###`,
* `###` → `####`, …, capped at `######`. The aggregator wraps each
* artifact under its own injected `## <title>`, so without this the
* rendered article outline ends up flat (every artifact's internal H2s
* become siblings of the wrapper H2). Indentation, fenced code blocks
* and table contents are not affected — only line-anchored ATX headings
* are matched.
*
* Headings inside fenced code blocks are explicitly excluded by
* tracking fence state line-by-line.
*
* Extracted from `structural.ts` to maintain the ≤200 LOC single-
* responsibility constraint.
*
* @author Hack23 AB (Infrastructure Team)
* @license Apache-2.0
*/

/**
* Demote ATX headings by one level inside an artifact body — `##` → `###`,
* `###` → `####`, …, capped at `######`. The aggregator wraps each
* artifact under its own injected `## <Section title>`, so without this the
* rendered article outline ends up flat (every artifact's internal H2s
* become siblings of the wrapper H2).
*
* Headings inside fenced code blocks are explicitly excluded by
* tracking fence state line-by-line.
*/
export function demoteHeadings(body: string): string {
  const lines = body.split('\n');
  let inFence = false;
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i]!;
    // Track entry/exit of triple-backtick or triple-tilde fenced code.
    if (/^\s{0,3}(?:```|~~~)/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const m = line.match(/^(#{1,6})(\s+\S)/);
    if (!m) continue;
    const current = m[1]!.length;
    if (current >= 6) continue; // already at H6, can't demote further
    if (current === 1) continue; // H1 already stripped by upstream regex; defensive
    lines[i] = '#'.repeat(current + 1) + line.slice(current);
  }
  return lines.join('\n');
}
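
The edge cases handled by `demoteHeadings` are easiest to see on a small input. The block below is a sketch, not the PR's code: `demoteOnce` is a condensed restatement that matches only H2 to H5 (the levels the original actually changes) and prepends one `#`, and the sample lines are invented for illustration.

```typescript
// Same fence test as in the diff, written with quantifiers.
const FENCE_LINE = /^\s{0,3}(?:`{3}|~{3})/;

// Condensed restatement: H2..H5 gain one '#'; H1, H6, fenced lines and
// non-heading lines are returned unchanged.
function demoteOnce(body: string): string {
  const lines = body.split('\n');
  let inFence = false;
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i] ?? '';
    if (FENCE_LINE.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    // Two to five hashes, then whitespace, then heading text.
    if (/^#{2,5}\s+\S/.test(line)) lines[i] = '#' + line;
  }
  return lines.join('\n');
}

const headingSample = [
  '# Title', // H1: left alone
  '## Findings', // H2 becomes H3
  '##### Detail', // H5 becomes H6
  '###### Deepest', // H6: cannot go deeper
  '~~~',
  '## inside fence', // fenced: left alone
  '~~~',
  '#hashtag', // no space after '#': not an ATX heading
].join('\n');

const demoted = demoteOnce(headingSample);
console.log(demoted);
```

Because H5 and H6 both end up at H6, the transformation is not reversible, and applying it twice demotes H2 to H4; it is meant to run exactly once per artifact body.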
44 changes: 44 additions & 0 deletions scripts/render-lib/aggregator/cleaning/link-rewriting.ts
@@ -0,0 +1,44 @@
/**
* @module Infrastructure/RenderLib/Aggregator/Cleaning/LinkRewriting
* @category Intelligence Operations / Supporting Infrastructure
* @name Relative link → absolute GitHub blob URL rewriting
*
* @description
* Rewrites every relative `[label](path.md)` link in aggregated markdown
* to an absolute GitHub blob URL. The rendered HTML lives at a different
* path than the source artifacts, so every link must be auditable back to
* GitHub. Leaves absolute `http(s)://…` links, fragment-only links and
* `mailto:` links untouched.
*
* Extracted from `structural.ts` to maintain the ≤200 LOC single-
* responsibility constraint.
*
* @author Hack23 AB (Infrastructure Team)
* @license Apache-2.0
*/

import path from 'path';

import { GITHUB_BLOB } from '../../constants.js';

/**
* Rewrite relative `[label](path.md)` links in the aggregated markdown to
* absolute GitHub blob URLs — the rendered HTML lives at a different path
* than the source artifacts, so every link must be auditable back to
* GitHub. Leaves absolute `http(s)://…` links, fragment-only links and
* `mailto:` links untouched.
*/
export function rewriteRelativeLinks(body: string, subfolderRepoRelPath: string): string {
  return body.replace(
    /\]\((?!https?:\/\/|#|mailto:)([^)]+)\)/g,
    (_match, target: string) => {
      const [pathPart, anchor] = target.split('#', 2) as [string, string | undefined];
      if (!pathPart) return `](${target})`;
      const resolved = path.posix.normalize(
        path.posix.join(subfolderRepoRelPath, pathPart),
      );
      const href = `${GITHUB_BLOB}/${resolved}` + (anchor ? `#${anchor}` : '');
      return `](${href})`;
    },
  );
}
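
A short worked example shows which link targets the rewrite touches and how `../` segments resolve. This is a sketch under stated assumptions: the `GITHUB_BLOB` value is a placeholder, because the real constant lives in `constants.ts` and is not shown in this diff, and the folder path and link targets are invented sample data. The function body follows the diff, with a named `posix` import so the block stands alone.

```typescript
import { posix } from 'node:path';

// Placeholder only; the real base URL comes from `constants.ts`.
const GITHUB_BLOB = 'https://github.com/example-org/example-repo/blob/main';

function rewriteRelativeLinks(body: string, subfolderRepoRelPath: string): string {
  return body.replace(
    // Match "](target)" unless the target is absolute http(s), a bare
    // fragment, or a mailto: link.
    /\]\((?!https?:\/\/|#|mailto:)([^)]+)\)/g,
    (_match: string, target: string) => {
      const [pathPart, anchor] = target.split('#', 2);
      if (!pathPart) return `](${target})`;
      // Resolve the link relative to the artifact folder, collapsing "../".
      const resolved = posix.normalize(posix.join(subfolderRepoRelPath, pathPart));
      const suffix = anchor ? '#' + anchor : '';
      return `](${GITHUB_BLOB}/${resolved}${suffix})`;
    },
  );
}

// Invented sample folder and links.
const folder = 'analysis/daily/2025-01-01/propositions';
const linkSample = [
  '[Summary](summary.md)',
  '[Risks](../shared/risks.md#top)',
  '[Spec](https://example.org/spec)',
  '[Jump](#section)',
  '[Mail](mailto:team@example.org)',
].join('\n');

const rewritten = rewriteRelativeLinks(linkSample, folder).split('\n');
console.log(rewritten.join('\n'));
```

The first two links become blob URLs under the artifact folder (the second one steps up one directory and keeps its `#top` anchor); the last three pass through unchanged. Targets that use other schemes, start with `/`, or carry a Markdown link title are not excluded by the lookahead, so they would also be treated as relative paths.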