MsShawnP · May 16, 2026
diff --git a/...ata-quality-diagnostic-v2-requirements.md → ...ata-quality-diagnostic-v2-requirements.md b/...ata-quality-diagnostic-v2-requirements.md → ...ata-quality-diagnostic-v2-requirements.md
diff --git a/...1-feat-data-quality-diagnostic-v2-plan.md → ...1-feat-data-quality-diagnostic-v2-plan.md b/...1-feat-data-quality-diagnostic-v2-plan.md → ...1-feat-data-quality-diagnostic-v2-plan.md
diff --git a/...ation-data-quality-pipeline-2026-05-14.md → ...ation-data-quality-pipeline-2026-05-14.md b/...ation-data-quality-pipeline-2026-05-14.md → ...ation-data-quality-pipeline-2026-05-14.md
diff --git a/AUDIT.md b/AUDIT.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,51 @@
+# Changelog
+
+All notable changes to datascope are documented here.
+
+## [2.2.0] — 2026-05-15
+
+### Added
+- Parquet input support (`pip install datascope-dq[parquet]`)
+- HTML report output (`--format html`)
+- Annotated Excel output (`--format annotated-excel`) — highlights problem cells in the source file
+- Missing-value pattern analyzer (detects high null rates and distribution)
+- `--max-rows` safety guard for large datasets
+- `pip-audit` step in CI workflow
+- Regex pre-filter for CSV datetime inference (10x speedup on text-heavy files)
+
+### Changed
+- Report branding: PDF title page, versioned footers, HTML favicon and meta tags
+- JSON output includes `generator` field for provenance
+
+## [2.1.0] — 2026-05-15
+
+### Added
+- JSON output format (`--format json`) for pipeline integration
+- `--verbose` and `--quiet` CLI flags
+- GitHub Actions CI (pytest + ruff + pip-audit)
+- PyPI publishing as `datascope-dq`
+- `--format both` for PDF + JSON together
+
+### Changed
+- Promoted FindingType sub-types to first-class enum values
+- Complete `pyproject.toml` metadata for PyPI
+
+## [2.0.0] — 2026-05-14
+
+### Added
+- Complete v2 rewrite: cell-level type detection architecture
+- 5 analyzers: type consistency, sentinels, leading zeros, mixed dates, cardinality
+- Severity classification by downstream impact (Critical / Warning / Info)
+- Plain-English narrative templates ("assumption vs. reality" framing)
+- Professional PDF report via reportlab
+- CSV loader with raw-string type inference
+- Excel loader via openpyxl with per-cell type preservation
+
+### Removed
+- v1 scoring system (numeric scores replaced by severity + narrative)
+- `scorer.py` monolith
+- `--strict-types` flag (cell-level detection is always on)
+
+## [1.0.0] — 2026-03-13
+
+Initial release as "field-story-scorer." Single-file tool that scored data quality on a numeric scale. Excel-only input, landscape PDF output.
diff --git a/DECISIONS.md b/DECISIONS.md
@@ -23,3 +23,9 @@ pyarrow is large (~200MB installed). Making it a core dependency would bloat ins
 ## 2026-05-15: PEP 639 license format — drop legacy classifier
 
 Modern setuptools (isolated build env) rejects the `License :: OSI Approved :: MIT License` classifier when `license = "MIT"` is also present. Removed the classifier, keeping only the PEP 639 `license` string field. Future classifiers should not include license entries.
+
+## 2026-05-16: Stay in the file-audit niche; do not compete with pipeline tools
+
+- **Why:** GX owns rules, Pandera owns schemas, Soda owns databases, ydata owns stats. datascope's moat is cell-level detection + professional narrative reports for non-technical readers. Competing on their turf dilutes the positioning.
+- **Scope:** All future feature decisions for datascope
+- **Do not:** Add custom validation rules, database connectors, statistical profiling, Polars backend, drift detection, or web UI/SaaS
diff --git a/HANDOFF.md b/HANDOFF.md
@@ -1,5 +1,17 @@
 # Handoff
 
+## 2026-05-16 12:00 — Audit Round 2 + Demo-Proof
+
+**Started from:** v2.2.0 shipped. Preparing to share demo links with first prospects.
+
+**Did:** Full 4-phase audit for client-readiness. Executed Moves 1-3: fixed --sheet crash, grammar bugs, stale help text; added branding to PDF/HTML/JSON; added badges, CHANGELOG, sample outputs, moved dev docs to .dev/. PR #6 created and pushed.
+
+**State:** PR #6 open with 3 commits. Tool is demo-safe — no crashes, branded reports, professional repo. 283 tests pass, lint clean. Move 4 (Growth) remains future work.
+
+**Next:** Merge PR #6 to main. Then either (a) start Move 4 growth work (landing page, demo video) or (b) rename local folder to "datascope" and begin client outreach. Tool is ready.
+
+---
+
 ## Session — 2026-05-15 (Move 4 + tag v2.2.0)
 
 **Phase:** Execution complete — all 25 sub-tasks done

diff --git a/PLAN.md b/PLAN.md
@@ -1,9 +1,47 @@
 # datascope — Improvement Plan
 
-Derived from full project audit (2026-05-15). See AUDIT.md for rationale.
+Derived from project audits (2026-05-15, 2026-05-16). See AUDIT.md for rationale.
 
 Tier: Medium
-Current focus: All moves complete. Improvement plan fully executed.
+Current focus: Move 1 DEMO-PROOF — fix demo-killers before sharing with prospects.
+
+---
+
+## Decomposition: Move 1 DEMO-PROOF
+
+Goal: A prospect can run `datascope` on any file during a live call without encountering crashes, grammar errors, or stale documentation.
+
+### Steps
+
+- [x] A1: Catch invalid `--sheet` with friendly error messages
+    - Depends on: none
+    - Change: Validate the sheet argument in `loaders/excel.py` before the lookup: range-check numeric indexes and check names against `wb.sheetnames`. Raise `ValueError` with a message listing available sheets.
+    - Done when: `datascope samples/input/sample_mixed_types.xlsx --sheet 99` and `--sheet NonExistent` both print an "Error: ..." message listing the available sheets to stderr and exit 1 (no traceback).
+
+- [x] A2: Fix singular/plural grammar in narrative templates
+    - Depends on: none
+    - Change: In `findings/templates.py`, fix `type_inconsistency()` — "X value were" → conditional "was"/"were"; in `sentinel_value()` — "(N times)" → conditional "(1 time)"/"(N times)". Audit all 7 template functions for similar issues.
+    - Done when: `datascope` on a file with exactly 1 minority-type value outputs "1 str value was found"; sentinel with count=1 outputs "(1 time)"; `python -m pytest tests/` still passes.
+
+- [x] A3: Update argparse description to include Parquet
+    - Depends on: none
+    - Change: In `cli.py` lines 30-33, update description from ".xlsx or .csv" to ".xlsx, .csv, or .parquet".
+    - Done when: `datascope --help` output mentions all three formats.
+
+- [x] A4: Fix numpy dependency in `generate_sample.py`
+    - Depends on: none
+    - Change: Add `numpy>=1.24.0` to `[project.optional-dependencies] dev` in `pyproject.toml` (`generate_sample.py` is a dev-time tool, so numpy is not needed at runtime). Update `requirements-dev.txt` if it exists.
+    - Done when: `pip install -e ".[dev]" && python generate_sample.py` succeeds; numpy is NOT in `[project.dependencies]`.
+
+- [x] A5: Warn when `--sheet` is passed for non-Excel files
+    - Depends on: A1 (sheet error handling should be in place first)
+    - Change: In `cli.py`, after resolving `ext`, if `ext != ".xlsx"` and `args.sheet is not None`, print a warning to stderr: "Warning: --sheet is ignored for {ext} files."
+    - Done when: `datascope somefile.csv --sheet Revenue` prints the warning to stderr but still runs successfully.
+
+- [x] A6: Integration verification
+    - Depends on: A1-A5 all complete
+    - Change: Run the full test suite, run all 5 output formats on both sample files, and test the error paths manually.
+    - Done when: `python -m pytest tests/` passes (283+ tests); all manual edge cases produce friendly output (no tracebacks).
 
 ---
 

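The "Done when" criteria for A1 and A5 above (friendly message on stderr, exit code 1 for a bad sheet, warning but success for non-Excel input) could be checked with something like the sketch below. It is a stand-in, not datascope's actual `cli.py`: `fake_load`, `run`, and `capture` are hypothetical names, the loader is stubbed so no sample files are needed, and dropping the ignored `--sheet` value before loading is a choice made only for this sketch.

```python
import argparse
import contextlib
import io
import sys
from pathlib import Path


def fake_load(path, sheet=None):
    """Hypothetical stand-in for the real loader: the workbook has one sheet."""
    if sheet not in (None, "Sheet1"):
        raise ValueError(f"Sheet '{sheet}' not found. Available sheets: Sheet1")
    return {"path": str(path), "sheet": sheet}


def run(argv):
    """Mirror the A1/A5 behaviour, returning an exit code instead of exiting."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("input")
    parser.add_argument("--sheet", default=None)
    args = parser.parse_args(argv)

    ext = Path(args.input).suffix.lower()
    sheet = args.sheet
    if sheet is not None and ext != ".xlsx":
        print(f"Warning: --sheet is ignored for {ext} files.", file=sys.stderr)
        sheet = None  # sketch-only choice: drop the ignored value

    try:
        fake_load(Path(args.input), sheet=sheet)
    except ValueError as exc:
        # Friendly one-line error instead of a traceback.
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    return 0


def capture(argv):
    """Run the sketch and return (exit_code, stderr_text)."""
    buf = io.StringIO()
    with contextlib.redirect_stderr(buf):
        code = run(argv)
    return code, buf.getvalue()


print(capture(["book.xlsx", "--sheet", "NonExistent"]))
print(capture(["data.csv", "--sheet", "Revenue"]))
```

Returning the exit code rather than calling `sys.exit` keeps the error path easy to assert on in a unit test; the real entry point can still exit with that code.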
diff --git a/README.md b/README.md
@@ -1,5 +1,10 @@
 # datascope
 
+[![PyPI version](https://img.shields.io/pypi/v/datascope-dq)](https://pypi.org/project/datascope-dq/)
+[![CI](https://github.com/MsShawnP/datascope/actions/workflows/ci.yml/badge.svg)](https://github.com/MsShawnP/datascope/actions/workflows/ci.yml)
+[![Python 3.10+](https://img.shields.io/pypi/pyversions/datascope-dq)](https://pypi.org/project/datascope-dq/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
+
 Data created upstream — by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types — silently breaks systems downstream. A product code with letters where EDI expects numbers. Fifteen "N/A" strings buried in 500 numeric rows that pandas silently drops, skewing every calculation by 3%.
 
 datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report.

diff --git a/datascope/cli.py b/datascope/cli.py
@@ -27,8 +27,8 @@ def _build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
         prog="datascope",
         description=(
-            "Analyse a tabular dataset (.xlsx or .csv) for data-quality issues\n"
-            "and generate a professional PDF diagnostic report."
+            "Analyse a tabular dataset (.xlsx, .csv, or .parquet) for data-quality\n"
+            "issues and generate a professional diagnostic report."
         ),
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
@@ -155,6 +155,7 @@ def _format_summary(findings: list, source_metadata: dict, output_path: Path) ->
 
 def _write_json(findings: list, source_metadata: dict, output_path: Path) -> None:
     """Write findings as structured JSON."""
+    from datascope import __version__
 
     counts: dict[str, int] = {"critical": 0, "warning": 0, "info": 0, "total": 0}
     for f in findings:
@@ -163,6 +164,7 @@ def _write_json(findings: list, source_metadata: dict, output_path: Path) -> Non
             counts["total"] += 1
 
     payload = {
+        "generator": {"tool": "datascope", "version": __version__},
         "source": dict(source_metadata),
         "summary": counts,
         "findings": [
@@ -220,7 +222,18 @@ def main(argv: list[str] | None = None) -> None:
     from datascope.loaders import load
 
     sheet = _parse_sheet(args.sheet)
-    result = load(input_path, sheet=sheet)
+
+    if args.sheet is not None and ext != ".xlsx":
+        print(
+            f"Warning: --sheet is ignored for {ext} files.",
+            file=sys.stderr,
+        )
+
+    try:
+        result = load(input_path, sheet=sheet)
+    except ValueError as exc:
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)
 
     # --- size guard -----------------------------------------------------
     rows, cols = result.dataframe.shape

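A downstream consumer might use the new `generator` block to record which tool version produced a report. The sketch below is illustrative only: the top-level keys (`generator`, `source`, `summary`, `findings`) come from the payload in `_write_json`, while `produced_by` and the sample values are hypothetical.

```python
import json


def produced_by(report_text):
    """Return (tool, version) from a JSON report.

    Falls back to "unknown" for reports written before the
    "generator" block existed.
    """
    payload = json.loads(report_text)
    generator = payload.get("generator", {})
    return generator.get("tool", "unknown"), generator.get("version", "unknown")


# Hypothetical payload shaped like the dict built in _write_json.
sample = json.dumps({
    "generator": {"tool": "datascope", "version": "2.2.0"},
    "source": {"filename": "example.csv"},
    "summary": {"critical": 0, "warning": 1, "info": 0, "total": 1},
    "findings": [],
})

print(produced_by(sample))
print(produced_by('{"summary": {}}'))
```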
diff --git a/datascope/findings/templates.py b/datascope/findings/templates.py
@@ -67,12 +67,14 @@ def type_inconsistency(field_name: str, evidence: dict[str, Any]) -> dict[str, s
 
     minority_desc = " and ".join(minority_parts) if minority_parts else "other types"
     example_str = _join_examples(all_examples)
+    total_minority = sum(mt.get("count", 0) for mt in minority_types)
+    verb = "was" if total_minority == 1 else "were"
 
     assumption = (
         f"Column '{field_name}' appears to be purely {majority}."
     )
     reality = (
-        f"However, {minority_desc} were found among {total} non-null values "
+        f"However, {minority_desc} {verb} found among {total} non-null values "
         f"(the majority type covers {_pct(majority_pct)}). "
         f"Examples of unexpected values: {example_str}."
     )
@@ -121,7 +123,7 @@ def sentinel_value(field_name: str, evidence: dict[str, Any]) -> dict[str, str]:
 
     sentinel_examples = [s.get("value", "?") for s in sentinels]
     sentinel_counts = [
-        f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} times)"
+        f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} {'time' if s.get('count', 0) == 1 else 'times'})"
         for s in sentinels
     ]
     sentinel_desc = ", ".join(sentinel_counts) if sentinel_counts else "unknown sentinel values"
@@ -266,7 +268,7 @@ def near_constant(field_name: str, evidence: dict[str, Any]) -> dict[str, str]:
     top_values = evidence.get("top_values", [])
 
     top_parts = [
-        f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} times)"
+        f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} {'time' if tv.get('count', 0) == 1 else 'times'})"
         for tv in top_values[:3]
     ]
     top_desc = ", ".join(top_parts) if top_parts else "(no values)"

diff --git a/datascope/loaders/excel.py b/datascope/loaders/excel.py
@@ -41,8 +41,22 @@ def load_excel(path: Path, sheet: str | int = 0) -> LoaderResult:
 
     try:
         if isinstance(sheet, int):
+            if sheet < 0 or sheet >= len(wb.worksheets):
+                available = [ws.title for ws in wb.worksheets]
+                wb.close()
+                raise ValueError(
+                    f"Sheet index {sheet} is out of range. "
+                    f"Available sheets: {', '.join(available)}"
+                )
             ws = wb.worksheets[sheet]
         else:
+            if sheet not in wb.sheetnames:
+                available = wb.sheetnames
+                wb.close()
+                raise ValueError(
+                    f"Sheet '{sheet}' not found. "
+                    f"Available sheets: {', '.join(available)}"
+                )
             ws = wb[sheet]
 
         sheet_title = ws.title

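The validation added above can also be expressed as a pure function over a list of sheet names, which makes the error paths testable without opening a workbook. This is a minimal sketch under that assumption: `resolve_sheet_name` is a hypothetical helper, and it treats index positions as positions in the name list, which may not match how the loader maps indexes to `wb.worksheets`.

```python
from typing import List, Union


def resolve_sheet_name(sheet: Union[str, int], sheetnames: List[str]) -> str:
    """Return the sheet title for an index or name, or raise ValueError."""
    available = ", ".join(sheetnames)
    if isinstance(sheet, int):
        # Reject negative indexes too, rather than letting them wrap around.
        if sheet < 0 or sheet >= len(sheetnames):
            raise ValueError(
                f"Sheet index {sheet} is out of range. "
                f"Available sheets: {available}"
            )
        return sheetnames[sheet]
    if sheet not in sheetnames:
        raise ValueError(
            f"Sheet '{sheet}' not found. Available sheets: {available}"
        )
    return sheet


names = ["Revenue", "Costs"]
print(resolve_sheet_name(1, names))
print(resolve_sheet_name("Revenue", names))
```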
diff --git a/datascope/reports/html.py b/datascope/reports/html.py
@@ -171,11 +171,16 @@ def write_html(
         </tr>
         """
 
+    from datascope import __version__
+
     page = f"""<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="utf-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
+<meta name="description" content="Data quality diagnostic report for {_e(filename)} — generated by datascope v{__version__}">
+<meta name="generator" content="datascope v{__version__}">
+<link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><circle cx='50' cy='50' r='40' fill='%231F3864'/><text x='50' y='62' font-size='40' text-anchor='middle' fill='white' font-family='sans-serif' font-weight='bold'>d</text></svg>">
 <title>datascope diagnostic — {_e(filename)}</title>
 <style>
   * {{ margin: 0; padding: 0; box-sizing: border-box; }}
@@ -237,7 +242,8 @@ def write_html(
   </table>
 
   <div class="footer">
-    Generated by datascope &middot; {now} &middot; {total} finding{'s' if total != 1 else ''}
+    Generated by datascope v{__version__} &middot; {now} &middot; {total} finding{'s' if total != 1 else ''}
+    <br>pip install datascope-dq
   </div>
 </div>
 </body>

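The new `<meta name="description">` tag interpolates the filename into an attribute value, so the escaping has to cover quotes as well as angle brackets. The sketch below shows that with the standard library's `html.escape`. The body of `_e` is not shown in this diff, so whether it already behaves this way is an assumption; `meta_description` is a hypothetical helper and its wording only approximates the real tag.

```python
import html


def meta_description(filename: str, version: str) -> str:
    """Build a description meta tag with the filename escaped for attribute use."""
    # quote=True (the default) also escapes " and ', which matters inside content="...".
    safe = html.escape(filename, quote=True)
    return (
        f'<meta name="description" content="Data quality diagnostic report '
        f'for {safe}, generated by datascope v{version}">'
    )


print(meta_description('q1 "final" <draft>.xlsx', "2.2.0"))
```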
diff --git a/datascope/reports/pdf.py b/datascope/reports/pdf.py
@@ -186,7 +186,19 @@ def _build_title_page(
     counts: dict[Severity, int],
 ) -> None:
     """Append the title page flowables to *story*."""
-    story.append(Spacer(1, 1.5 * inch))
+    from datascope import __version__
+
+    story.append(Spacer(1, 1.2 * inch))
+
+    # Product branding
+    brand_style = ParagraphStyle(
+        "ds_brand", parent=styles["subtitle"],
+        fontSize=11, textColor=NAVY, fontName="Helvetica-Bold",
+        spaceAfter=2,
+    )
+    story.append(Paragraph("datascope", brand_style))
+    story.append(Spacer(1, 0.05 * inch))
+
     story.append(Paragraph("Data Quality Diagnostic Report", styles["title"]))
     story.append(Spacer(1, 0.1 * inch))
 
@@ -195,7 +207,8 @@ def _build_title_page(
 
     date_str = datetime.date.today().strftime("%B %d, %Y")
     story.append(Paragraph(date_str, styles["subtitle"]))
-    story.append(Spacer(1, 0.3 * inch))
+    story.append(Paragraph(f"v{__version__}", styles["caption"]))
+    story.append(Spacer(1, 0.25 * inch))
 
     story.append(HRFlowable(width="60%", thickness=2, color=NAVY))
     story.append(Spacer(1, 0.3 * inch))
@@ -599,6 +612,8 @@ def write_pdf(
 
     filename = source_metadata.get("filename", "Unknown source")
 
+    from datascope import __version__
+
     def _on_later_pages(canvas, doc):
         canvas.saveState()
         canvas.setFont("Helvetica", 8)
@@ -607,6 +622,10 @@ def _on_later_pages(canvas, doc):
             0.5 * inch, letter[1] - 0.4 * inch,
             f"datascope diagnostic — {filename}",
         )
+        canvas.drawString(
+            0.5 * inch, 0.35 * inch,
+            f"datascope v{__version__} — pip install datascope-dq",
+        )
         canvas.drawRightString(
             letter[0] - 0.5 * inch, 0.35 * inch,
             f"Page {canvas.getPageNumber()}",

diff --git a/pyproject.toml b/pyproject.toml
@@ -37,6 +37,7 @@ parquet = [
 dev = [
     "pytest>=7.0",
     "ruff>=0.4.0",
+    "numpy>=1.24.0",
 ]
 
 [project.urls]

diff --git a/samples/README.md b/samples/README.md
@@ -11,7 +11,10 @@ samples/
 │   └── sample_mixed_types.xlsx
 └── output/                 # Diagnostic reports produced by datascope
     ├── sample_sales_diagnostic.pdf
-    └── sample_mixed_types_diagnostic.pdf
+    ├── sample_sales_diagnostic.html
+    ├── sample_mixed_types_diagnostic.pdf
+    ├── sample_mixed_types_diagnostic.html
+    └── sample_mixed_types_annotated.xlsx
 ```
 
 ## Inputs
@@ -23,11 +26,11 @@ samples/
 
 ## Outputs
 
-Each input file produces a PDF diagnostic report with:
+Each input file produces diagnostic reports in multiple formats:
 
-- **Executive summary** — overall health assessment, finding counts by severity
-- **Findings by severity** — each finding as an assumption-vs-reality card with impact, fix, and prevention rule
-- **Field inventory** — summary table of all columns with detected issue types
+- **PDF** — professional report with executive summary, severity-coded finding cards, and field inventory
+- **HTML** — self-contained web page with the same structure (open in any browser)
+- **Annotated Excel** — source data with problem columns highlighted + a Findings summary sheet
 
 ## Regenerating
 

diff --git a/samples/output/sample_mixed_types_annotated.xlsx b/samples/output/sample_mixed_types_annotated.xlsx