543 changes: 285 additions & 258 deletions AUDIT.md

Large diffs are not rendered by default.

51 changes: 51 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,51 @@
# Changelog

All notable changes to datascope are documented here.

## [2.2.0] — 2026-05-15

### Added
- Parquet input support (`pip install datascope-dq[parquet]`)
- HTML report output (`--format html`)
- Annotated Excel output (`--format annotated-excel`) — highlights problem cells in the source file
- Missing-value pattern analyzer (detects high null rates and how missing values are distributed)
- `--max-rows` safety guard for large datasets
- `pip-audit` step in CI workflow
- Regex pre-filter for CSV datetime inference (10x speedup on text-heavy files)
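
The changelog does not show how the regex pre-filter works. As a rough sketch of the general technique, assuming ISO-style `YYYY-MM-DD` dates only, a cheap compiled pattern can reject free text before the slower `datetime.strptime` call runs. The names `_DATE_SHAPE` and `parse_date_or_none` are hypothetical and are not taken from datascope:

```python
from __future__ import annotations

import re
from datetime import datetime

# Hypothetical pre-filter: the exact shape an ISO date string must have.
_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date_or_none(text: str) -> datetime | None:
    """Parse an ISO-style date, skipping the parser for obvious non-dates."""
    candidate = text.strip()
    if _DATE_SHAPE.fullmatch(candidate) is None:
        return None  # free text never reaches strptime
    try:
        return datetime.strptime(candidate, "%Y-%m-%d")
    except ValueError:
        return None  # right shape, impossible date (e.g. month 13)
```

On text-heavy columns most cells fail the shape check, so the parser and its exception handling run only on plausible candidates. The 10x figure is the changelog's own claim and is not reproduced by this sketch.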

### Changed
- Report branding: PDF title page, versioned footers, HTML favicon and meta tags
- JSON output includes `generator` field for provenance

## [2.1.0] — 2026-05-15

### Added
- JSON output format (`--format json`) for pipeline integration
- `--verbose` and `--quiet` CLI flags
- GitHub Actions CI (pytest + ruff + pip-audit)
- PyPI publishing as `datascope-dq`
- `--format both` for PDF + JSON together

### Changed
- Promoted FindingType sub-types to first-class enum values
- Complete `pyproject.toml` metadata for PyPI

## [2.0.0] — 2026-05-14

### Added
- Complete v2 rewrite: cell-level type detection architecture
- 5 analyzers: type consistency, sentinels, leading zeros, mixed dates, cardinality
- Severity classification by downstream impact (Critical / Warning / Info)
- Plain-English narrative templates ("assumption vs. reality" framing)
- Professional PDF report via reportlab
- CSV loader with raw-string type inference
- Excel loader via openpyxl with per-cell type preservation

### Removed
- v1 scoring system (numeric scores replaced by severity + narrative)
- `scorer.py` monolith
- `--strict-types` flag (cell-level detection is always on)

## [1.0.0] — 2026-03-13

Initial release as "field-story-scorer." Single-file tool that scored data quality on a numeric scale. Excel-only input, landscape PDF output.
6 changes: 6 additions & 0 deletions DECISIONS.md
@@ -23,3 +23,9 @@ pyarrow is large (~200MB installed). Making it a core dependency would bloat ins
## 2026-05-15: PEP 639 license format — drop legacy classifier

Modern setuptools (isolated build env) rejects the `License :: OSI Approved :: MIT License` classifier when `license = "MIT"` is also present. Removed the classifier, keeping only the PEP 639 `license` string field. Future classifiers should not include license entries.

## 2026-05-16: Stay in the file-audit niche; do not compete with pipeline tools

- **Why:** Great Expectations (GX) owns rules, Pandera owns schemas, Soda owns databases, and ydata-profiling owns stats. datascope's moat is cell-level detection plus professional narrative reports for non-technical readers. Competing on their turf dilutes the positioning.
- **Scope:** All future feature decisions for datascope
- **Do not:** Add custom validation rules, database connectors, statistical profiling, Polars backend, drift detection, or web UI/SaaS
12 changes: 12 additions & 0 deletions HANDOFF.md
@@ -1,5 +1,17 @@
# Handoff

## 2026-05-16 12:00 — Audit Round 2 + Demo-Proof

**Started from:** v2.2.0 shipped. Preparing to share demo links with first prospects.

**Did:** Full 4-phase audit for client-readiness. Executed Moves 1-3: fixed --sheet crash, grammar bugs, stale help text; added branding to PDF/HTML/JSON; added badges, CHANGELOG, sample outputs, moved dev docs to .dev/. PR #6 created and pushed.

**State:** PR #6 open with 3 commits. Tool is demo-safe — no crashes, branded reports, professional repo. 283 tests pass, lint clean. Move 4 (Growth) remains future work.

**Next:** Merge PR #6 to main. Then either (a) start Move 4 growth work (landing page, demo video) or (b) rename local folder to "datascope" and begin client outreach. Tool is ready.

---

## Session — 2026-05-15 (Move 4 + tag v2.2.0)

**Phase:** Execution complete — all 25 sub-tasks done
42 changes: 40 additions & 2 deletions PLAN.md
@@ -1,9 +1,47 @@
# datascope — Improvement Plan

-Derived from full project audit (2026-05-15). See AUDIT.md for rationale.
+Derived from project audits (2026-05-15, 2026-05-16). See AUDIT.md for rationale.

Tier: Medium
-Current focus: All moves complete. Improvement plan fully executed.
+Current focus: Move 1 DEMO-PROOF — fix demo-killers before sharing with prospects.

---

## Decomposition: Move 1 DEMO-PROOF

Goal: A prospect can run `datascope` on any file during a live call without encountering crashes, grammar errors, or stale documentation.

### Steps

- [x] A1: Catch invalid `--sheet` with friendly error messages
- Depends on: none
- Change: Wrap the sheet lookup in `loaders/excel.py` with try/except for `IndexError` (numeric) and `KeyError` (named). Raise `ValueError` with a message listing available sheets.
- Done when: `datascope samples/input/sample_mixed_types.xlsx --sheet 99` and `--sheet NonExistent` both print "Error: Sheet not found..." to stderr and exit 1 (no traceback).

- [x] A2: Fix singular/plural grammar in narrative templates
- Depends on: none
- Change: In `findings/templates.py`, fix `type_inconsistency()` — "X value were" → conditional "was"/"were"; in `sentinel_value()` — "(N times)" → conditional "(1 time)"/"(N times)". Audit all 7 template functions for similar issues.
- Done when: `datascope` on a file with exactly 1 minority-type value outputs "1 str value was found"; sentinel with count=1 outputs "(1 time)"; `python -m pytest tests/` still passes.

- [x] A3: Update argparse description to include Parquet
- Depends on: none
- Change: In `cli.py` lines 30-33, update description from ".xlsx or .csv" to ".xlsx, .csv, or .parquet".
- Done when: `datascope --help` output mentions all three formats.

- [x] A4: Fix numpy dependency in `generate_sample.py`
- Depends on: none
- Change: Add `numpy>=1.24.0` to `[project.optional-dependencies] dev` in `pyproject.toml` (`generate_sample.py` is a dev-time tool, so numpy is not needed at runtime). Update `requirements-dev.txt` if it exists.
- Done when: `pip install -e ".[dev]" && python generate_sample.py` succeeds; numpy is NOT in `[project.dependencies]`.

- [x] A5: Warn when `--sheet` is passed for non-Excel files
- Depends on: A1 (sheet error handling should be in place first)
- Change: In `cli.py`, after resolving `ext`, if `ext != ".xlsx"` and `args.sheet is not None`, print a warning to stderr: "Warning: --sheet is ignored for {ext} files."
- Done when: `datascope somefile.csv --sheet Revenue` prints the warning to stderr but still runs successfully.

- [x] A6: Integration verification
- Depends on: A1-A5 all complete
- Run full test suite, run all 5 output formats on both sample files, test the error paths manually.
- Done when: `python -m pytest tests/` passes (283+ tests); all manual edge cases produce friendly output (no tracebacks).
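
The A1 and A5 acceptance criteria can be checked without a real workbook. The following is a minimal sketch, not datascope's actual code: `run_load` is a hypothetical helper, and the loader is passed in as a plain callable so that the warning and the friendly error path can be exercised in isolation.

```python
from __future__ import annotations

import sys
from typing import Callable, TextIO


def run_load(
    ext: str,
    sheet: str | None,
    loader: Callable[[], object],
    err: TextIO = sys.stderr,
) -> int:
    """Return a process exit code: 0 on success, 1 on a friendly load error."""
    # A5: --sheet only makes sense for Excel input, so warn and carry on.
    if sheet is not None and ext != ".xlsx":
        print(f"Warning: --sheet is ignored for {ext} files.", file=err)
    # A1: a bad sheet surfaces as ValueError; report it without a traceback.
    try:
        loader()
    except ValueError as exc:
        print(f"Error: {exc}", file=err)
        return 1
    return 0
```

Returning an exit code instead of calling `sys.exit` directly is a choice made for this sketch only; it keeps the function easy to assert on in a unit test.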

---

5 changes: 5 additions & 0 deletions README.md
@@ -1,5 +1,10 @@
# datascope

[![PyPI version](https://img.shields.io/pypi/v/datascope-dq)](https://pypi.org/project/datascope-dq/)
[![CI](https://github.com/MsShawnP/datascope/actions/workflows/ci.yml/badge.svg)](https://github.com/MsShawnP/datascope/actions/workflows/ci.yml)
[![Python 3.10+](https://img.shields.io/pypi/pyversions/datascope-dq)](https://pypi.org/project/datascope-dq/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

Data created upstream — by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types — silently breaks systems downstream. A product code with letters where EDI expects numbers. Fifteen "N/A" strings buried in 500 numeric rows that pandas silently drops, skewing every calculation by 3%.

datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report.
19 changes: 16 additions & 3 deletions datascope/cli.py
@@ -27,8 +27,8 @@ def _build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
         prog="datascope",
         description=(
-            "Analyse a tabular dataset (.xlsx or .csv) for data-quality issues\n"
-            "and generate a professional PDF diagnostic report."
+            "Analyse a tabular dataset (.xlsx, .csv, or .parquet) for data-quality\n"
+            "issues and generate a professional diagnostic report."
         ),
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
@@ -155,6 +155,7 @@ def _format_summary(findings: list, source_metadata: dict, output_path: Path) ->

 def _write_json(findings: list, source_metadata: dict, output_path: Path) -> None:
     """Write findings as structured JSON."""
+    from datascope import __version__

     counts: dict[str, int] = {"critical": 0, "warning": 0, "info": 0, "total": 0}
     for f in findings:
@@ -163,6 +164,7 @@ def _write_json(findings: list, source_metadata: dict, output_path: Path) -> Non
         counts["total"] += 1

     payload = {
+        "generator": {"tool": "datascope", "version": __version__},
         "source": dict(source_metadata),
         "summary": counts,
         "findings": [
@@ -220,7 +222,18 @@ def main(argv: list[str] | None = None) -> None:
     from datascope.loaders import load

     sheet = _parse_sheet(args.sheet)
-    result = load(input_path, sheet=sheet)
+
+    if args.sheet is not None and ext != ".xlsx":
+        print(
+            f"Warning: --sheet is ignored for {ext} files.",
+            file=sys.stderr,
+        )
+
+    try:
+        result = load(input_path, sheet=sheet)
+    except ValueError as exc:
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)

     # --- size guard -----------------------------------------------------
     rows, cols = result.dataframe.shape
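
The new `generator` field lets a downstream consumer check which tool and version produced a report. A minimal reader sketch follows; `report_generator` is a hypothetical helper, and the fallback to `"unknown"` is an assumption that covers reports written before the field was added:

```python
import json


def report_generator(report_text: str) -> tuple[str, str]:
    """Return (tool, version) from a JSON report, or 'unknown' when absent."""
    payload = json.loads(report_text)
    # Older reports have no "generator" key, so fall back to an empty mapping.
    generator = payload.get("generator") or {}
    return generator.get("tool", "unknown"), generator.get("version", "unknown")
```

A pipeline could use the returned version to decide whether a stored report needs regenerating after an upgrade.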
8 changes: 5 additions & 3 deletions datascope/findings/templates.py
@@ -67,12 +67,14 @@ def type_inconsistency(field_name: str, evidence: dict[str, Any]) -> dict[str, s

     minority_desc = " and ".join(minority_parts) if minority_parts else "other types"
     example_str = _join_examples(all_examples)
+    total_minority = sum(mt.get("count", 0) for mt in minority_types)
+    verb = "was" if total_minority == 1 else "were"

     assumption = (
         f"Column '{field_name}' appears to be purely {majority}."
     )
     reality = (
-        f"However, {minority_desc} were found among {total} non-null values "
+        f"However, {minority_desc} {verb} found among {total} non-null values "
         f"(the majority type covers {_pct(majority_pct)}). "
         f"Examples of unexpected values: {example_str}."
     )
@@ -121,7 +123,7 @@ def sentinel_value(field_name: str, evidence: dict[str, Any]) -> dict[str, str]:

     sentinel_examples = [s.get("value", "?") for s in sentinels]
     sentinel_counts = [
-        f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} times)"
+        f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} {'time' if s.get('count', 0) == 1 else 'times'})"
         for s in sentinels
     ]
     sentinel_desc = ", ".join(sentinel_counts) if sentinel_counts else "unknown sentinel values"
@@ -266,7 +268,7 @@ def near_constant(field_name: str, evidence: dict[str, Any]) -> dict[str, str]:
     top_values = evidence.get("top_values", [])

     top_parts = [
-        f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} times)"
+        f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} {'time' if tv.get('count', 0) == 1 else 'times'})"
         for tv in top_values[:3]
     ]
     top_desc = ", ".join(top_parts) if top_parts else "(no values)"
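
The same singular/plural conditional now appears inline in two templates. One possible follow-up, shown here only as a sketch, is a small shared helper. `count_phrase` is a hypothetical name and does not exist in `templates.py`:

```python
from __future__ import annotations


def count_phrase(count: int, singular: str, plural: str | None = None) -> str:
    """Format a count with a noun that agrees in number, e.g. '1 time'."""
    # Default plural is the singular plus "s"; pass one for irregular nouns.
    word = singular if count == 1 else (plural or singular + "s")
    return f"{count} {word}"
```

With such a helper, the sentinel and near-constant templates could both write `count_phrase(n, "time")` inside the parentheses, which keeps the f-strings short and puts the grammar rule in one place.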
14 changes: 14 additions & 0 deletions datascope/loaders/excel.py
@@ -41,8 +41,22 @@ def load_excel(path: Path, sheet: str | int = 0) -> LoaderResult:

     try:
         if isinstance(sheet, int):
+            if sheet < 0 or sheet >= len(wb.worksheets):
+                available = [ws.title for ws in wb.worksheets]
+                wb.close()
+                raise ValueError(
+                    f"Sheet index {sheet} is out of range. "
+                    f"Available sheets: {', '.join(available)}"
+                )
             ws = wb.worksheets[sheet]
         else:
+            if sheet not in wb.sheetnames:
+                available = wb.sheetnames
+                wb.close()
+                raise ValueError(
+                    f"Sheet '{sheet}' not found. "
+                    f"Available sheets: {', '.join(available)}"
+                )
             ws = wb[sheet]

         sheet_title = ws.title
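
The validation in this hunk depends only on the list of sheet names, so it can be sketched and tested without openpyxl. `resolve_sheet` below is a hypothetical helper, not part of `loaders/excel.py`; it mirrors the two error messages from the hunk and returns the resolved sheet title:

```python
from __future__ import annotations


def resolve_sheet(sheet_names: list[str], sheet: str | int) -> str:
    """Return the title of the requested sheet or raise a friendly ValueError."""
    available = ", ".join(sheet_names)
    if isinstance(sheet, int):
        # Reject negative indexes too, so -1 does not silently pick the last sheet.
        if sheet < 0 or sheet >= len(sheet_names):
            raise ValueError(
                f"Sheet index {sheet} is out of range. Available sheets: {available}"
            )
        return sheet_names[sheet]
    if sheet not in sheet_names:
        raise ValueError(f"Sheet '{sheet}' not found. Available sheets: {available}")
    return sheet
```

Keeping the check free of workbook handles would also mean the workbook is closed in one place only, though whether that suits the loader's existing `try` block is not visible in this diff.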
8 changes: 7 additions & 1 deletion datascope/reports/html.py
@@ -171,11 +171,16 @@ def write_html(
 </tr>
 """

+    from datascope import __version__
+
     page = f"""<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="utf-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
+<meta name="description" content="Data quality diagnostic report for {_e(filename)} — generated by datascope v{__version__}">
+<meta name="generator" content="datascope v{__version__}">
+<link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><circle cx='50' cy='50' r='40' fill='%231F3864'/><text x='50' y='62' font-size='40' text-anchor='middle' fill='white' font-family='sans-serif' font-weight='bold'>d</text></svg>">
 <title>datascope diagnostic — {_e(filename)}</title>
 <style>
 * {{ margin: 0; padding: 0; box-sizing: border-box; }}
@@ -237,7 +242,8 @@ def write_html(
 </table>

 <div class="footer">
-Generated by datascope &middot; {now} &middot; {total} finding{'s' if total != 1 else ''}
+Generated by datascope v{__version__} &middot; {now} &middot; {total} finding{'s' if total != 1 else ''}
+<br>pip install datascope-dq
 </div>
 </div>
 </body>
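
The inline favicon hand-encodes only the `#` in the fill colour as `%23` and relies on single quotes inside a double-quoted attribute. An alternative, sketched here with the standard library and not taken from the PR, is to percent-encode the whole SVG so the data URI is safe regardless of which quote style the surrounding attribute uses. `svg_data_uri` is a hypothetical helper:

```python
from urllib.parse import quote


def svg_data_uri(svg: str) -> str:
    """Build a data URI for an inline SVG with every reserved character encoded."""
    # safe="" also encodes "/" so nothing but unreserved characters is left raw.
    return "data:image/svg+xml," + quote(svg, safe="")
```

The fully encoded form is longer and harder to read in the template than the hand-written version, which may be why the PR keeps the literal SVG.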
23 changes: 21 additions & 2 deletions datascope/reports/pdf.py
@@ -186,7 +186,19 @@ def _build_title_page(
     counts: dict[Severity, int],
 ) -> None:
     """Append the title page flowables to *story*."""
-    story.append(Spacer(1, 1.5 * inch))
+    from datascope import __version__
+
+    story.append(Spacer(1, 1.2 * inch))
+
+    # Product branding
+    brand_style = ParagraphStyle(
+        "ds_brand", parent=styles["subtitle"],
+        fontSize=11, textColor=NAVY, fontName="Helvetica-Bold",
+        spaceAfter=2,
+    )
+    story.append(Paragraph("datascope", brand_style))
+    story.append(Spacer(1, 0.05 * inch))
+
     story.append(Paragraph("Data Quality Diagnostic Report", styles["title"]))
     story.append(Spacer(1, 0.1 * inch))

@@ -195,7 +207,8 @@ def _build_title_page(

     date_str = datetime.date.today().strftime("%B %d, %Y")
     story.append(Paragraph(date_str, styles["subtitle"]))
-    story.append(Spacer(1, 0.3 * inch))
+    story.append(Paragraph(f"v{__version__}", styles["caption"]))
+    story.append(Spacer(1, 0.25 * inch))

     story.append(HRFlowable(width="60%", thickness=2, color=NAVY))
     story.append(Spacer(1, 0.3 * inch))
@@ -599,6 +612,8 @@ def write_pdf(

     filename = source_metadata.get("filename", "Unknown source")

+    from datascope import __version__
+
     def _on_later_pages(canvas, doc):
         canvas.saveState()
         canvas.setFont("Helvetica", 8)
@@ -607,6 +622,10 @@ def _on_later_pages(canvas, doc):
             0.5 * inch, letter[1] - 0.4 * inch,
             f"datascope diagnostic — {filename}",
         )
+        canvas.drawString(
+            0.5 * inch, 0.35 * inch,
+            f"datascope v{__version__} — pip install datascope-dq",
+        )
         canvas.drawRightString(
             letter[0] - 0.5 * inch, 0.35 * inch,
             f"Page {canvas.getPageNumber()}",
1 change: 1 addition & 0 deletions pyproject.toml
@@ -37,6 +37,7 @@ parquet = [
 dev = [
     "pytest>=7.0",
     "ruff>=0.4.0",
+    "numpy>=1.24.0",
 ]

 [project.urls]
13 changes: 8 additions & 5 deletions samples/README.md
@@ -11,7 +11,10 @@ samples/
 │   └── sample_mixed_types.xlsx
 └── output/ # Diagnostic reports produced by datascope
     ├── sample_sales_diagnostic.pdf
-    └── sample_mixed_types_diagnostic.pdf
+    ├── sample_sales_diagnostic.html
+    ├── sample_mixed_types_diagnostic.pdf
+    ├── sample_mixed_types_diagnostic.html
+    └── sample_mixed_types_annotated.xlsx
 ```

 ## Inputs
@@ -23,11 +26,11 @@ samples/

 ## Outputs

-Each input file produces a PDF diagnostic report with:
+Each input file produces diagnostic reports in multiple formats:

-- **Executive summary** — overall health assessment, finding counts by severity
-- **Findings by severity** — each finding as an assumption-vs-reality card with impact, fix, and prevention rule
-- **Field inventory** — summary table of all columns with detected issue types
+- **PDF** — professional report with executive summary, severity-coded finding cards, and field inventory
+- **HTML** — self-contained web page with the same structure (open in any browser)
+- **Annotated Excel** — source data with problem columns highlighted + a Findings summary sheet

 ## Regenerating

Binary file not shown.