diff --git a/docs/brainstorms/2026-05-14-data-quality-diagnostic-v2-requirements.md b/.dev/brainstorms/2026-05-14-data-quality-diagnostic-v2-requirements.md similarity index 100% rename from docs/brainstorms/2026-05-14-data-quality-diagnostic-v2-requirements.md rename to .dev/brainstorms/2026-05-14-data-quality-diagnostic-v2-requirements.md diff --git a/docs/plans/2026-05-14-001-feat-data-quality-diagnostic-v2-plan.md b/.dev/plans/2026-05-14-001-feat-data-quality-diagnostic-v2-plan.md similarity index 100% rename from docs/plans/2026-05-14-001-feat-data-quality-diagnostic-v2-plan.md rename to .dev/plans/2026-05-14-001-feat-data-quality-diagnostic-v2-plan.md diff --git a/docs/solutions/architecture-patterns/cell-level-type-preservation-data-quality-pipeline-2026-05-14.md b/.dev/solutions/architecture-patterns/cell-level-type-preservation-data-quality-pipeline-2026-05-14.md similarity index 100% rename from docs/solutions/architecture-patterns/cell-level-type-preservation-data-quality-pipeline-2026-05-14.md rename to .dev/solutions/architecture-patterns/cell-level-type-preservation-data-quality-pipeline-2026-05-14.md diff --git a/AUDIT.md b/AUDIT.md index 0af61bd..4f0ede4 100644 --- a/AUDIT.md +++ b/AUDIT.md @@ -1,219 +1,230 @@ -# datascope — Project Audit +# datascope — Project Audit (Round 2) -Generated: 2026-05-15 +**Purpose:** Pre-launch readiness for sharing with prospective clients (C-suite, operations). +**Date:** 2026-05-16 --- ## Phase 1: Baseline Assessment -### What It Is Today +### What Was Intended -**datascope v2.0.0** — a Python CLI tool that analyzes Excel/CSV files for hidden data quality issues and produces professional PDF diagnostic reports in plain English. +A portfolio tool to show how clean or dirty a dataset is — built early in the author's solo dev journey, originally via Claude Chat before adopting a structured workflow. 
-**Core insight:** Most tools let pandas silently coerce types (485 numbers + 15 strings → all float64, strings become NaN). datascope reads each cell's actual Python type, detects quality issues, and explains them as "assumption vs. reality" findings for non-technical readers. +### What Exists Today -### By the Numbers +**datascope v2.2.0** — a Python CLI tool that analyzes tabular data (CSV, Excel, Parquet) for hidden data quality issues and produces professional diagnostic reports in plain English. Published on PyPI as `datascope-dq`. + +The tool works end-to-end. It reads each cell's actual type (bypassing pandas coercion), detects quality issues, classifies severity by downstream impact, and outputs reports in 5 formats (PDF, JSON, HTML, annotated Excel, PDF+JSON). + +### Tech Stack + +| Layer | Technology | +|-------|-----------| +| Language | Python 3.10+ | +| Core deps | pandas, openpyxl, reportlab, defusedxml | +| Optional deps | pyarrow (Parquet support) | +| Dev tools | pytest, ruff | +| CI/CD | GitHub Actions (test + lint + pip-audit + publish) | +| Package | PyPI (`datascope-dq`) | + +### Project Health Indicators | Metric | Value | |--------|-------| -| Production code | ~1,620 LOC across 15 files | -| Test code | ~3,450 LOC, 265 test cases | -| Test:code ratio | 2.1:1 | -| Dependencies | 4 runtime (pandas, openpyxl, reportlab, numpy) | -| Python target | 3.10+ | -| Git commits | 23 over 62 days | +| Production code | ~3,360 LOC across 20 files | +| Test code | ~3,620 LOC, 283 test cases | +| Test:code ratio | 1.08:1 | +| Dependencies | 4 runtime + 1 optional | +| Git commits | 45 over 64 days | +| Contributors | 1 (+ Claude assist) | +| Activity | Active — last commit 2026-05-15 | +| Documentation | README, samples, brainstorm/plan docs | | Open issues/PRs | 0 / 0 | -| License | MIT (Shawn, Lailara LLC) | +| License | MIT | -### Architecture (5 layers) +### Audience -``` -INPUT (Excel/CSV) - → Loaders (235 LOC) — cell-level type preservation, no 
silent coercion - → Analyzers (633 LOC) — 5 detectors: type consistency, sentinels, leading zeros, mixed dates, cardinality - → Findings (239 LOC) — severity classification + plain-English template composition - → Reports (605 LOC) — professional PDF via reportlab, color-coded severity cards - → CLI (205 LOC) — argparse orchestration, stdout summary -OUTPUT (PDF + stdout) -``` - -### What's Strong +- **Primary:** Prospective clients — C-suite and operations leaders evaluating the author's consulting capabilities +- **Secondary:** Other data consultants who could adopt the tool +- **Tertiary:** The author herself, for real client engagements -1. **Architecture** — clean layer separation, extensible Finding data model, no monolith -2. **Test coverage** — 265 tests at 2.1:1 ratio; unit + integration + CLI tests -3. **Documentation** — portfolio-grade README, brainstorm/plan/learning docs in docs/ -4. **Code quality** — type hints throughout, ruff configured, zero TODO/FIXME/HACK comments -5. **Product thinking** — "assumption vs. reality" framing, non-technical audience focus, severity by downstream impact -6. **Packaging** — proper pyproject.toml, entry point, editable install +### Gap Analysis -### What's Missing or Weak +The v2.0 audit identified 30 findings across architecture, UX, security, performance, and DevEx. All four improvement moves (Clean, Polish, Bridge, Grow) were executed, shipping: +- CI/CD pipeline with pip-audit +- PyPI publishing +- 5 output formats (PDF, JSON, HTML, annotated Excel, both) +- 7 analyzers (added missing-values, format-check) +- Parquet input support +- Performance improvements (regex pre-filter, --max-rows guard) -1. **No CI/CD** — no GitHub Actions; quality gates are manual -2. **No CHANGELOG** — v1→v2 was a major rewrite with no migration record -3. **Legacy scorer.py** — v1 monolith (~31KB) still in root, not archived -4. **Single output format** — PDF only; no Excel/HTML/JSON export -5. 
**Limited input sources** — Excel and CSV only; no database, no Parquet, no API -6. **No strict mode flag removed** — README git clone URL still says `field-story-scorer.git` but repo renamed to `datascope` -7. **No PyPI publishing** — local install only -8. **generate_sample.py in root** — utility script not in tools/ folder -9. **tools/render_strict_mode_comparison.py** — references v1 strict mode concept, may be stale +**What remains unclear for demo/client readiness:** +- Is the report output polished enough to put in front of a C-suite audience? +- Are there any embarrassing edge cases that could surface during a live demo? +- Does the GitHub repo present professionally (README, samples, docs)? +- Is the PyPI page compelling for consultants evaluating the tool? +- Are there any lingering references to "field story scorer" or v1 concepts? -### Project Identity +### Audit Motivation -- **Repo:** MsShawnP/datascope (renamed from field-story-scorer) -- **Audience:** Data consultants, developers, business analysts -- **Differentiator:** Cell-level type detection + plain-English diagnostic reports -- **Stage:** v2.0 shipped 2026-05-14, no users yet beyond author +The author is about to share demo links with the first round of prospective clients. This audit is about catching anything that would: +1. Undermine credibility during a demo +2. Break during a live walkthrough +3. Look unpolished to a C-suite viewer +4. Confuse a consultant evaluating whether to adopt the tool --- ## Phase 2: Internal Review -Five parallel reviews: architecture, testing, performance, security, UX/docs. Findings ranked by leverage — what moves the needle most for the least effort. 
- -### Tier 1 — High Leverage (fix before promoting the project) - -| # | Finding | Dimension | Why It Matters | -|---|---------|-----------|----------------| -| 1 | **README clone URL points to old repo name** `field-story-scorer.git` | UX | Every new user hits this immediately; the `cd` command also fails | -| 2 | **`samples/README.md` is entirely about v1** — references scorer.py, --strict-types, scoring numbers | UX | New users exploring samples/ get a completely misleading picture | -| 3 | **Missing `defusedxml` dependency** — openpyxl uses stdlib XML parser without it, exposing XML bomb/XXE risk | Security | One-line fix (`defusedxml>=0.7.0`) that closes a real attack vector for a tool that processes untrusted files | -| 4 | **Legacy `scorer.py` (791 LOC) still in root** — confusing, weaker security posture, imported by tools/ | Architecture | Confuses contributors, duplicates entry points, has unescaped user input in reportlab | -| 5 | **`generate_sample.py` references v1 CLI** — prints `python scorer.py --input ...` | UX | Unusable with v2; generates misleading instructions | -| 6 | **Backtick column names render as literal backticks in PDF** | UX | Every finding card in every report has this cosmetic defect — not "plain English" | -| 7 | **Mixed-dates template newlines ignored in PDF** — `\n`-joined list renders as run-on text | UX | Date format breakdown is unreadable in the actual report | - -### Tier 2 — Medium Leverage (meaningful improvements) - -| # | Finding | Dimension | Why It Matters | -|---|---------|-----------|----------------| -| 8 | **FindingType sub-types dispatched via evidence-key sniffing** — 6 places check magic dict keys | Architecture | Adding a new sub-type requires finding and updating all 6 locations; promote sub-types to first-class enum values | -| 9 | **No CI/CD** — no GitHub Actions, no automated test/lint gates | DevEx | Quality gates are entirely manual; one workflow file closes this | -| 10 | **Full materialization defeats 
openpyxl read_only=True** — `list(ws.iter_rows())` loads everything at once | Performance | 1M-row file → ~2.4GB RAM; the streaming flag is wasted | -| 11 | **CSV loads entire file into memory twice** — `list(reader)` + `inferred_rows` list | Performance | Same memory problem, compounded by the type inference copy | -| 12 | **CSV datetime inference is O(n × 7 strptime calls)** per non-date cell | Performance | 1M text-string cells → 7M failed strptime calls; regex pre-filter would cut this 10x | -| 13 | **No `--json` / `--format` output flag** | UX | Blocks pipeline integration, CI/CD usage, programmatic consumption | -| 14 | **Analyzer failures swallowed with one-line warning** — no traceback, can produce false-negative reports | UX | A "No issues detected" report when an analyzer actually crashed is dangerous | -| 15 | **No page numbers or running header in PDF** | UX | Multi-page professional deliverable without pagination | -| 16 | **`source_metadata` is untyped `dict[str, Any]`** — keys established by convention across 3 files | Architecture | Adding a new output format requires guessing which keys exist | -| 17 | **`normalize_type` creates cross-analyzer coupling** — sentinel.py and format_check.py import from type_consistency.py | Architecture | Extract to shared utility in analyzers/base.py | -| 18 | **Dependency version ranges fully open, no lock file** | Security | No reproducible builds; `pip audit` not configured | - -### Tier 3 — Polish (lower leverage but worth noting) - -| # | Finding | Dimension | Why It Matters | -|---|---------|-----------|----------------| -| 19 | **`cell_types` stores one `type` reference per cell** — near-doubles memory vs DataFrame | Performance | Could use run-length encoding or type codes instead | -| 20 | **CLI analyzer failure error path untested** (cli.py:184-187) | Testing | The only error-resilience mechanism in the pipeline has zero test coverage | -| 21 | **CSV `_infer_cell` never tested in isolation** — 6 inference 
branches, no direct unit tests | Testing | Leading-zero preservation, the tool's differentiating feature, is tested only indirectly | -| 22 | **Composer fallback branches untested** — 3 default cases in template dispatch | Testing | Silent wrong-template selection if a new sub-type is added | -| 23 | **No `--quiet` / `--verbose` flags** | UX | Blocks scripting (quiet) and debugging (verbose/traceback) | -| 24 | **`requirements.txt` duplicates `pyproject.toml`** — invites version drift | DevEx | Use only pyproject.toml; generate requirements.txt if needed | -| 25 | **`pyproject.toml` missing `authors`, `urls`, `readme` fields** | DevEx | Needed for PyPI publishing | -| 26 | **No mypy/pyright configuration** — type hints are documentation-only | DevEx | Extensive type hints exist but are never verified | -| 27 | **`Analyzer` type alias defined but never imported or used** | Architecture | Dead code in analyzers/base.py | -| 28 | **`numpy` listed as dependency but unused by v2 code** | Architecture | Only used by legacy scorer.py and generate_sample.py | -| 29 | **`--sheet` silently ignored for CSV files** | UX | No warning when the flag has no effect | -| 30 | **PDF health assessment doesn't mention total finding count** | UX | 25 info findings → "only informational" with no sense of volume | - -### Cross-Cutting Themes - -1. **v1 → v2 cleanup is incomplete.** scorer.py, generate_sample.py, samples/README.md, tools/render_strict_mode_comparison.py, and the README clone URL all reference v1 concepts. This is the single highest-leverage batch of fixes. - -2. **The architecture is sound but has one structural weakness.** The evidence-key sniffing pattern for sub-type dispatch (FORMAT_INCONSISTENCY and CARDINALITY_ANOMALY) creates a hidden coupling between analyzers and the findings layer. Promoting sub-types to first-class enum values eliminates this. - -3. 
**Performance is fine for the current audience (<10K rows) but has a hard wall.** Full materialization + cell_types doubling + strptime brute-force means the tool falls over around 100K rows. If the target audience ever includes "production data pipeline" users, this needs a streaming rewrite. - -4. **Test coverage is genuinely strong (2.1:1 ratio, 265 tests) but has specific blind spots.** The untested paths are exactly the defensive/error-handling code that matters most when things go wrong: analyzer failures, CSV type inference edge cases, composer fallbacks, PDF health assessment branches. - -5. **The PDF report has two rendering bugs** (backtick literals, newline collapse) that affect every report generated. These are quick fixes with high visible impact. +**Lens:** What would a prospect, a fellow consultant, or a demo audience encounter that would undermine credibility? + +### Tier 1 — Demo Killers (fix before sharing any links) + +| # | Finding | Dimension | Why It's a Demo Killer | +|---|---------|-----------|------------------------| +| 1 | **Unhandled crash on invalid `--sheet`** — both `--sheet 1` (out of range) and `--sheet NonExistent` produce raw Python tracebacks (`IndexError`, `KeyError`) | Demo Resilience | If a prospect types the wrong sheet name during a live walkthrough, they see a Python stacktrace instead of a helpful error. Instant credibility loss. | +| 2 | **Grammar bugs in narrative output** — "1 str value **were** found" (should be "was"); "'N/A' (1 **times**)" (should be "1 time") | Report Quality | The entire value prop is "professional plain English." Grammar errors in the core deliverable directly contradict the positioning. C-suite readers notice this. | +| 3 | **Help text says ".xlsx or .csv" — doesn't mention Parquet** — `argparse` description line 30-33 is stale after Parquet was added | UX / Accuracy | A user reading `--help` won't know Parquet is supported. Minor but sloppy for a demo. 
| +| 4 | **`generate_sample.py` still imports numpy** — not a runtime dependency, but if someone runs the sample generator per the samples README, they'll get `ModuleNotFoundError: No module named 'numpy'` | Demo Resilience | The samples README tells users to run `python generate_sample.py`. If numpy isn't installed, it crashes. | + +### Tier 2 — Professionalism Issues (fix before promoting broadly) + +| # | Finding | Dimension | Impact | +|---|---------|-----------|--------| +| 5 | **`--sheet` silently ignored for CSV/Parquet** — no warning when the flag has no effect | UX | A user who passes `--sheet Revenue` on a CSV file gets results for the file with no indication their flag was meaningless. Confusing. | +| 6 | **No branding/logo on PDF title page** — the PDF is the portfolio artifact, but it has no visual identity beyond the color scheme | Report Polish | Every competitor's report output has their brand. datascope's PDF looks generic. For a consulting tool, this matters — the report should look like *your* deliverable. | +| 7 | **HTML report has no favicon or meta description** — minor but visible in browser tabs | Report Polish | Browser tab just shows "datascope diagnostic — filename" with no icon. Looks like an unfinished page. | +| 8 | **Old repo name in `docs/` brainstorm/plan files** — 25+ references to "field-story-scorer" and "scorer.py" in docs/ | Repo Presentation | Anyone browsing docs/ on GitHub sees the old project name repeatedly. These are internal dev docs, but they're public and visible. | +| 9 | **No CHANGELOG.md** — no record of what's in v2.2 vs v2.0 vs v1 | Repo Presentation | Consultants evaluating adoption want to see release cadence and what changed. PyPI page links to GitHub but there's no changelog. 
| +| 10 | **Repo name is still "field story scorer"** (parent directory) — the GitHub repo URL appears to be `MsShawnP/datascope` but the local directory name reveals the old name | Repo Presentation | If someone clones from GitHub this won't matter, but screenshots or file paths could leak the old name. | + +### Tier 3 — Polish (nice-to-have before launch) + +| # | Finding | Dimension | Impact | +|---|---------|-----------|--------| +| 11 | **No `--version` shown in stdout summary** — the report doesn't identify which datascope version produced it | Report Polish | If a consultant runs v2.2 now and v2.3 later, there's no way to tell which version produced which report. | +| 12 | **PDF footer says "datascope diagnostic" — no version or URL** | Report Polish | The footer could link to the tool or show the version for provenance. | +| 13 | **No sample HTML or annotated-Excel in `samples/output/`** — only PDF samples are committed | Repo Presentation | A GitHub visitor can't see what the HTML or Excel output looks like without installing the tool. | +| 14 | **`docs/solutions/` architecture doc references v2.0 patterns only** | Docs Staleness | Internal doc, low impact, but could confuse a contributor. | +| 15 | **No PyPI badge in README** — no quick trust signal for "this is a real published package" | Repo Presentation | Standard for any PyPI-published project. One line to add. | + +### Cross-Cutting Assessment + +**What's strong:** +- The tool works reliably — all 283 tests pass, all 5 output formats produce output, error handling for missing files is clean. +- Report quality (PDF, HTML) is genuinely good — professional color scheme, clear structure, severity-coded cards. +- README is excellent — clear, well-structured, shows the value prop immediately. +- JSON output is well-structured and useful. +- CLI UX is clean (aside from the --sheet crash). + +**What threatens the demo:** +1. The `--sheet` crash is the only true functional bug — everything else works. +2. 
Grammar errors in the narrative text ("were" vs "was", "1 times" vs "1 time") undermine the "professional plain English" positioning. +3. The stale help text (no Parquet mention) and numpy dependency in generate_sample.py are paper cuts that could bite during a live demo. + +**Overall readiness: 85%.** The foundation is solid. Fixing findings #1-4 (the demo killers) takes ~1-2 hours and gets this to "safe to demo." Findings #5-10 take another 2-3 hours and get it to "proud to share the repo link." --- -## Phase 3: Landscape Scan +--- -### Competitive Set (10 tools) +## Phase 3: Landscape Scan (Updated for v2.2) -| Tool | Stars | Type | Input | Output | Type Detection | Audience | Pricing | -|------|-------|------|-------|--------|----------------|----------|---------| -| **ydata-profiling** | 13.6k | Library | DataFrame | HTML | Column-level inference | Data scientists | Free | -| **Great Expectations** | 11.5k | Framework | DataFrame/SQL | HTML "Data Docs" | Rule-based (expectations) | Data engineers | Free + Cloud | -| **Pandera** | 4.3k | Library | DataFrame | Exceptions (no report) | Schema-as-code | Engineers | Free | -| **SweetViz** | 3.1k | Library | DataFrame | HTML | Column-level | DS/ML | Free | -| **whylogs** | 2.8k | Library | DataFrame | JSON profiles | Column-level stats | ML engineers | Freemium | -| **Soda Core** | 2.3k | CLI | SQL/databases | Pass/fail + Cloud UI | YAML check rules | Data engineers | Free + $750/mo Cloud | -| **DataPrep** | 2.2k | Library | DataFrame | HTML | Column-level (Dask) | Data scientists | Free | -| **Pointblank** | <1k | Library | DataFrame/SQL | HTML tables | Threshold-based validation | Analysts (newer) | Free | -| **DataProfiler** | 1.6k | Library | CSV/JSON/Parquet | JSON | Column-level + PII | Data/security analysts | Free | -| **Deepchecks** | — | Library | DataFrame | HTML | Column-ratio mixed-type check | ML engineers | Freemium | +### Market Changes Since v2.0 Audit (1 day ago, but capturing shifts) -### Feature 
Matrix — Where datascope Sits +| Event | Impact on datascope | +|-------|-------------------| +| **Fivetran acquired Great Expectations** (May 2026) | GX is being pulled into enterprise data movement. Less likely to serve the "consultant with a file" user. Widens datascope's niche. | +| **ydata-profiling rebranded to fg-data-profiling** | Organizational confusion, import path changes. Makes ydata less of a stable reference point for users choosing a tool. | +| **Pointblank (Posit) growing** — 432 stars, `pb` CLI, interactive HTML reports | Closest new competitor for "shareable output." But requires YAML/Python config for validation; no PDF; targets analysts in R/Python, not consultants. | +| **"Agentic data quality" marketing wave** | Monte Carlo, Soda, Elementary all pitching AI agents. Marketing noise, not a functional competitor to datascope's niche. | -| Capability | datascope | ydata-profiling | Great Expectations | Pandera | Soda Core | Pointblank | -|------------|:---------:|:---------------:|:------------------:|:-------:|:---------:|:----------:| +### Updated Feature Matrix — datascope v2.2 vs Landscape + +| Capability | datascope v2.2 | fg-data-profiling | Great Expectations | Pandera | Soda Core | Pointblank | +|------------|:--------------:|:-----------------:|:------------------:|:-------:|:---------:|:----------:| | **Cell-level type detection** | **Yes** | No | No | No | No | No | -| **Excel-native reading** (openpyxl, no pandas coercion) | **Yes** | No | No | No | No | No | -| **Plain-English narrative** | **Yes** | No | Partial (Data Docs) | No | No | Partial | -| **PDF report output** | **Yes** | No | No | No | No | No | -| **Zero-config CLI** (file in → report out) | **Yes** | No (1-liner but library) | No (expectations required) | No (schema required) | No (YAML required) | No (code required) | -| **CSV support** | Yes | Yes (via pandas) | Yes (via pandas) | Yes | Yes (via SQL) | Yes | -| **Statistical profiling** | No | **Yes** | No | No | No | 
No | +| **Excel-native reading** (no pandas coercion) | **Yes** | No | No | No | No | No | +| **Plain-English narrative** | **Yes** | No | No | No | No | Partial | +| **PDF report** | **Yes** | No | No | No | No | No | +| **HTML report** | **Yes** | Yes | No | No | No (Cloud only) | **Yes** | +| **JSON/machine-readable** | **Yes** | Yes | Yes | Yes | Yes | Yes | +| **Zero-config CLI** | **Yes** | No | No | No | Near (YAML) | Near (`pb info`) | +| **Parquet support** | **Yes** | Yes | Yes | Yes | Yes | Yes | +| **PyPI published** | **Yes** | Yes | Yes | Yes | Yes | Yes | +| **CI/CD pipeline** | **Yes** | Yes | Yes | Yes | Yes | Yes | +| **CSV support** | Yes | Yes | Yes | Yes | Yes | Yes | +| **Annotated source output** | **Yes** | No | No | No | No | No | | **Custom validation rules** | No | No | **Yes** | **Yes** | **Yes** | **Yes** | -| **Pipeline integration** | No | Yes | **Yes** | **Yes** | **Yes** | Yes | -| **Database support** | No | No | Yes | No | **Yes** | Yes | -| **Parquet/Arrow support** | No | Yes | Yes | Yes | Yes | Yes | -| **JSON/machine-readable output** | No | Yes | Yes | Yes | Yes | Yes | -| **Large file performance** | Weak (>100K rows) | Weak | Good | **Good** | Good | Good | -| **Polars support** | No | No | No | Yes | No | **Yes** | -| **Community/stars** | New | 13.6k | 11.5k | 4.3k | 2.3k | <1k | - -### datascope's Position: What's Better, Worse, Unique, Missing - -**Unique (no competitor does this):** -1. **Cell-level type detection** — every other tool uses column-level inference after pandas/SQL coercion. datascope reads each cell's actual Python type via openpyxl before any coercion happens. This is the core technical moat. -2. **"Assumption vs. reality" narrative framing** — no tool produces prose explanations aimed at non-technical readers. The closest (GX Data Docs, Pointblank tables) are validation result tables, not narratives. -3. 
**Excel-native reading** — every competitor requires loading through pandas first, which is exactly where type coercion destroys the signal datascope detects. -4. **PDF as portable audit artifact** — no competitor outputs PDF. The inspection-report analogy: the artifact is passed to a client who doesn't control the toolchain. - -**Better than competitors:** -5. **Zero-config experience** — `datascope file.xlsx` produces a full report. GX requires expectation suites, Pandera requires schemas, Soda requires YAML. The setup cost for datascope is zero. -6. **Non-technical audience targeting** — while competitors target engineers, datascope targets consultants handing reports to clients. - -**Worse than competitors:** -7. **No machine-readable output** — every major competitor supports JSON/HTML/programmatic output. datascope has PDF + unstructured stdout only. -8. **No custom validation rules** — can't define domain-specific checks ("price must be positive", "date must be after 2020"). -9. **No pipeline integration** — can't embed in CI/CD, dbt, or Airflow workflows without parsing stdout. -10. **No statistical profiling** — no distributions, correlations, missing-value analysis beyond what the 5 analyzers detect. -11. **Performance ceiling** — falls over at ~100K rows due to full materialization and strptime brute-force. -12. **No community** — new project with zero external users/stars. - -**Missing (competitors have, datascope doesn't):** -13. **Database/Parquet/Arrow input** — limited to Excel + CSV. -14. **Polars support** — Polars is the growth vector in the Python data ecosystem. -15. **HTML report option** — for web/email embedding. -16. **Drift detection** — comparing two datasets or monitoring over time (whylogs, Soda territory). 
- -### Market Context - -**Where the market is going:** -- Pipeline-integrated observability platforms (Monte Carlo, Sifflet, Datafold — VC-funded) -- AI-augmented test generation (GX DraftValidation, DataOps TestGen) -- Validation-as-feature inside frameworks (Pydantic in FastAPI, dbt tests) - -**Where the market is NOT going:** -- Standalone CLI tools for one-shot file auditing -- Stakeholder-facing prose reports -- Excel-native anything - -**This is the opportunity.** The market is leaving the "consultant analyzes a client's messy Excel file and needs a professional report" use case completely unaddressed. Every tool is moving toward engineers, pipelines, and platforms. datascope occupies an empty niche. - -**The risk:** The niche may be empty because it's small. The growth path requires either (a) staying niche but being the definitive tool for data consultants, or (b) adding enough pipeline features (JSON output, CI integration) to serve both audiences. - -### Analogies That Clarify Position - -- **Building inspection reports** — inspectors don't hand homeowners JSON schemas. They produce written reports. datascope is the building inspector for data files. -- **Spell-checker UX** — surfaces problems inline, in the user's own document, with one-click fixes. No existing tool does this for data files. -- **Rust compiler errors** — the shift toward human-readable, actionable error messages ("expected integer, found string at column B row 14") maps directly onto datascope's narrative approach. 
+| **Statistical profiling** | No | **Yes** | No | No | No | Partial | +| **Database support** | No | No | Yes | No | **Yes** | **Yes** | +| **Polars support** | No | No | No | **Yes** | No | **Yes** | +| **Large file performance** | Weak (>100K) | Weak | Good | Good | Good | Good | +| **Community/stars** | New (0) | 13.6k | 11.5k | 4.3k | 2.3k | 432 | + +### What Changed Since v2.0 Audit + +**Gaps closed by v2.2:** +- ~~No machine-readable output~~ → JSON output shipped +- ~~No HTML report~~ → HTML shipped +- ~~No Parquet support~~ → Parquet shipped +- ~~No CI/CD~~ → GitHub Actions (test + lint + audit + publish) +- ~~No PyPI~~ → Published as `datascope-dq` +- ~~Single output format~~ → 5 formats +- ~~5 analyzers~~ → 7 analyzers (added missing-values, format-check) + +**Gaps that remain (intentionally):** +- No custom validation rules (GX/Pandera territory — don't compete) +- No database support (Soda territory — stay in file lane) +- No statistical profiling (fg-data-profiling territory — datascope finds *problems*, not *stats*) +- No Polars support (low priority for target audience) +- No drift detection (whylogs territory — datascope is point-in-time) + +### datascope's Updated Competitive Position + +**Still unique (no competitor does this):** +1. Cell-level type detection via openpyxl (technical moat — unchanged) +2. "Assumption vs. reality" narrative framing for non-technical readers +3. PDF as first-class output format (still zero competitors with native PDF) +4. Annotated Excel output highlighting problem cells in the source file +5. True zero-config: `datascope file.xlsx` → full report, no setup + +**Stronger since v2.0:** +6. JSON + HTML output bridges the engineer audience (was the #1 gap) +7. PyPI publishing + CI/CD provides trust signals +8. Parquet support covers the modern data stack input format + +**Still weaker:** +9. No community — 0 external users/stars (biggest risk for consultant adoption) +10. 
Performance ceiling at ~100K rows (fine for target audience) +11. No web UI or SaaS (fine for now — CLI + report is the right form factor) + +### The Niche Assessment (Revised) + +The v2.0 audit concluded: *"The market is leaving the 'consultant analyzes a client's messy Excel file and needs a professional report' use case completely unaddressed."* + +**This is still true.** In fact it's MORE true: +- GX being absorbed by Fivetran makes it more enterprise/pipeline focused, not less +- fg-data-profiling's organizational turmoil (rename, maintainer change) creates uncertainty +- Pointblank is the closest emerging competitor for "shareable output" but still requires configuration and doesn't produce PDF +- The "agentic" wave is all about pipelines and monitoring, not one-shot file audits + +**datascope v2.2 now covers the full minimum-viable competitive surface:** +- Multiple input formats (Excel, CSV, Parquet) +- Multiple output formats (PDF, HTML, JSON, annotated Excel) +- Published and installable (`pip install datascope-dq`) +- CI/CD trust signals +- Professional report quality + +**What would make it a clear winner for the consulting niche:** +- Social proof (stars, testimonials, case studies) +- Branding on the PDF report +- A demo page or hosted example report +- Logo/visual identity + +--- --- @@ -221,113 +232,129 @@ Five parallel reviews: architecture, testing, performance, security, UX/docs. Fi ### Strategic Frame -datascope has a **genuine technical moat** (cell-level type detection) and a **genuine product moat** (plain-English narrative for non-technical readers). No competitor combines both. The architecture is sound, the test coverage is strong, and the code quality is portfolio-grade. +datascope v2.2 is **functionally complete** for its niche. The tool works, the reports are professional, the feature set covers input/output formats that matter. The gap is no longer "can it do X?" 
— it's "does it *present* as a credible professional tool when a prospect evaluates it?" -But the project can't capitalize on either moat yet because: -1. **v1 artifacts confuse first impressions** — scorer.py, old README URL, stale samples -2. **PDF rendering bugs undermine the "professional report" value prop** — the core product has cosmetic defects -3. **No machine-readable output blocks the bridge audience** — engineers who'd champion datascope in their org can't integrate it -4. **No CI/CD or PyPI hurts credibility** — open-source adoption requires trust signals +This is a positioning and polish problem, not a feature problem. -The synthesis produces four ranked move categories: **Clean → Polish → Bridge → Grow.** +### Priority Matrix + +Cross-referencing Phase 2 findings with Phase 3 landscape position: + +``` + HIGH landscape impact + │ + ┌──────────────────┼──────────────────┐ + │ BRAND │ DEMO-PROOF │ + │ (Move 2) │ (Move 1) │ + │ PDF branding │ Fix crashes │ + │ Hosted demo │ Fix grammar │ + │ Logo/identity │ Fix help text │ +LOW effort ├──────────────────┼──────────────────┤ HIGH effort + │ HYGIENE │ GROWTH │ + │ (Move 3) │ (Move 4) │ + │ Changelog │ Landing page │ + │ PyPI badge │ Community │ + │ Sample outputs │ Case studies │ + │ │ │ + └──────────────────┼──────────────────┘ + │ + LOW landscape impact +``` --- -### Move 1: CLEAN — Ship-ready baseline (1-2 sessions) +### Move 1: DEMO-PROOF — Nothing breaks live (~1-2 hours) -*Goal: A stranger who finds the repo can install, run, and trust what they see.* +*Goal: A prospect can run `datascope` on any file during a call without seeing a crash or grammar error.* -| Task | Internal Finding | Landscape Rationale | Effort | -|------|-----------------|---------------------|--------| -| Fix README clone URL + cd command | Phase 2 #1 | First-touch UX; every competitor has working install instructions | 5 min | -| Rewrite `samples/README.md` for v2 | Phase 2 #2 | Samples are the "try it yourself" onramp | 30 min | +| Task | Phase 2 Finding | Effort | +|------|----------------|--------| +| Catch invalid `--sheet` (index out of range + name not found) → friendly error message | #1 | 20 min | +| Fix grammar: "1 str value were" → "was"; "(1 times)" → "(1 time)"; audit all templates for singular/plural | #2 | 30 min | +| Update argparse description to mention `.parquet` alongside `.xlsx` and `.csv` | #3 | 5 min | +| Replace `numpy` in `generate_sample.py` with stdlib `random` (or add numpy to `[dev]` extras) | #4 | 20 min | +| Warn when `--sheet` is passed for CSV/Parquet files | #5 | 10 min | -**Total: ~75 minutes of focused work. Zero architectural risk.** +**Why this is Move 1:** These are the only things that can *embarrass you live*. Everything else is about how good it looks when no one's watching. This is about when someone IS watching.
| -| Delete `scorer.py` from root | Phase 2 #4 | Eliminates confusion, removes weaker security surface | 5 min | -| Update `generate_sample.py` for v2 CLI | Phase 2 #5 | Makes sample generation actually work | 15 min | -| Retire or port `tools/render_strict_mode_comparison.py` | Phase 2 #4 | Last v1 import reference | 15 min | -| Add `defusedxml>=0.7.0` to dependencies | Phase 2 #3 | One-line fix; closes XML bomb vector for a tool processing untrusted files | 2 min | -| Drop `numpy` from dependencies (unused by v2) | Phase 2 #28 | Smaller install footprint, honest dependency list | 2 min | +| Task | Phase 2 Finding | Effort | +|------|----------------|--------| +| Catch invalid `--sheet` (index out of range + name not found) → friendly error message | #1 | 20 min | +| Fix grammar: "1 str value were" → "was"; "(1 times)" → "(1 time)" �� audit all templates for singular/plural | #2 | 30 min | +| Update argparse description to mention `.parquet` alongside `.xlsx` and `.csv` | #3 | 5 min | +| Replace `numpy` in `generate_sample.py` with stdlib `random` (or add numpy to `[dev]` extras) | #4 | 20 min | +| Warn when `--sheet` is passed for CSV/Parquet files | #5 | 10 min | -**Total: ~75 minutes of focused work. Zero architectural risk.** +**Why this is Move 1:** These are the only things that can *embarrass you live*. Everything else is about how good it looks when no one's watching. This is about when someone IS watching. 
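The singular/plural task in the Move 1 table could be handled with a small helper along these lines. This is a minimal sketch, not datascope's actual implementation: `pluralize`, `count_phrase`, and `found_clause` are hypothetical names, and the naive add-an-s rule is an assumption that only suits regular nouns such as "time" or "value".

```python
from typing import Optional


def pluralize(count: int, singular: str, plural: Optional[str] = None) -> str:
    """Pick the singular or plural form of a noun for the given count."""
    if count == 1:
        return singular
    # Hypothetical default: append "s" unless an irregular plural is supplied.
    return plural if plural is not None else f"{singular}s"


def count_phrase(count: int, noun: str) -> str:
    """Render a count with a correctly inflected noun, e.g. '1 time'."""
    return f"{count} {pluralize(count, noun)}"


def found_clause(count: int, noun: str) -> str:
    """Render '<n> <noun> was/were found' with subject-verb agreement."""
    verb = "was" if count == 1 else "were"
    return f"{count_phrase(count, noun)} {verb} found"
```

Centralising the rule in one place would let every narrative template share the same agreement logic instead of repeating an inline conditional per template.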
--- -### Move 2: POLISH — The report is the product (1-2 sessions) +### Move 2: BRAND — The report IS the product (~2-3 hours) -*Goal: Every PDF datascope produces is genuinely professional and correct.* +*Goal: A PDF or HTML report from datascope looks like it came from a professional consulting tool, not a side project.* -| Task | Internal Finding | Landscape Rationale | Effort | -|------|-----------------|---------------------|--------| -| Fix backtick literals in PDF templates | Phase 2 #6 | The report is datascope's differentiator; literal backticks aren't "plain English" | 30 min | -| Fix newline collapse in mixed-dates template | Phase 2 #7 | Date format breakdown is unreadable as run-on text | 20 min | -| Add page numbers + running header to PDF | Phase 2 #15 | Every competitor's HTML report has navigation; PDF needs pagination | 45 min | -| Fix PDF health assessment to mention total count | Phase 2 #30 | "Only informational" with 25 findings buries the signal | 15 min | -| Regenerate v2 sample outputs in `samples/output/` | Phase 1 gap | Current samples are v1 artifacts | 15 min | +| Task | Phase 2 Finding | Landscape Rationale | Effort | +|------|----------------|--------------------:|--------| +| Add configurable branding to PDF title page (tool name + optional logo placeholder) | #6 | No competitor has PDF — make datascope's PDF look like a *product* | 45 min | +| Add datascope version + URL to PDF footer | #12 | Provenance — which version produced this report | 15 min | +| Add favicon + meta tags to HTML report | #7 | Pointblank has polished HTML; datascope's should match | 15 min | +| Add `datascope --version` to report metadata (JSON `"generator"` field) | #11 | Machine-readable provenance | 10 min | +| Create a hosted example report (HTML) — link from README | Landscape #9 (social proof) | Prospects can see the output quality without installing | 45 min | -**Total: ~2 hours. 
These are the changes users actually see.** - -The landscape confirms this priority: datascope's unique position is the **report**. ydata-profiling has better stats. GX has better rules. Pandera has better schemas. But none of them produce a report you'd hand to a non-technical client. If the PDF has cosmetic bugs, the entire value proposition is undermined. +**Why this is Move 2:** The report is what prospects see. It's the artifact they judge you by. Right now it's functional but anonymous — it doesn't *own* its identity. Adding branding turns "a PDF someone generated" into "a datascope diagnostic." --- -### Move 3: BRIDGE — Serve both audiences (2-3 sessions) +### Move 3: REPO HYGIENE — The GitHub page passes scrutiny (~1 hour) -*Goal: Engineers can integrate datascope into pipelines; consultants still get their PDF.* +*Goal: A consultant browsing the GitHub repo sees a maintained, professional project.* -| Task | Internal Finding | Landscape Rationale | Effort | -|------|-----------------|---------------------|--------| -| Add `--format json` output flag | Phase 2 #13 | Every competitor has machine-readable output; this is datascope's biggest functional gap | 2-3 hr | -| Add `--verbose` / `--quiet` flags | Phase 2 #23 | `--quiet` enables scripting (exit code only); `--verbose` enables debugging | 1 hr | -| Add GitHub Actions CI (pytest + ruff) | Phase 2 #9 | Table-stakes trust signal for open-source adoption | 30 min | -| Promote FindingType sub-types to first-class enums | Phase 2 #8 | Eliminates evidence-key sniffing in 6 locations; unblocks adding new analyzers | 2 hr | -| Type `source_metadata` as TypedDict | Phase 2 #16 | Unblocks adding new output formats without guessing keys | 30 min | -| Complete `pyproject.toml` metadata (authors, urls, readme) | Phase 2 #25 | Required for PyPI publishing | 10 min | -| Publish to PyPI | Phase 1 gap | `pip install datascope` is the expected install path; git clone is friction | 1 hr | +| Task | Phase 2 Finding | Effort 
| +|------|----------------|--------| +| Add PyPI badge + CI badge to README top | #15 | 5 min | +| Create CHANGELOG.md (v1.0 → v2.0 → v2.1 → v2.2 with highlights) | #9 | 30 min | +| Commit sample HTML + annotated-Excel outputs to `samples/output/` | #13 | 10 min | +| Remove or archive `docs/plans/` and `docs/brainstorms/` (internal dev docs with old names) | #8 | 15 min | -**Why `--format json` is the single highest-leverage feature:** -- It's the bridge between datascope's consultant audience and the engineer audience -- Engineers who discover datascope via PyPI can plug it into CI/CD: `datascope data.csv --format json | jq '.findings[] | select(.severity == "CRITICAL")'` -- JSON output is the prerequisite for GitHub Actions integration, Slack alerts, dashboard embedding -- It costs 2-3 hours and doubles the addressable audience +**Why this is Move 3:** These are trust signals. A consultant evaluating whether to use or recommend datascope will check the repo. Badges, changelog, and sample outputs answer "is this maintained?" and "what does the output look like?" without requiring installation. --- -### Move 4: GROW — Expand the moat (future sessions) +### Move 4: GROWTH — Social proof and discovery (ongoing) -*Goal: Make datascope the definitive tool for its niche, then expand.* +*Goal: People find datascope and see that others use it.* | Task | Landscape Rationale | Effort | -|------|---------------------|--------| -| **Add HTML report option** | Email-embeddable; web-viewable; complements PDF for different delivery contexts | 3-4 hr | -| **Add `--max-rows` / size guard** | Prevents OOM on large files; sets user expectations honestly | 1 hr | -| **Regex pre-filter for CSV datetime inference** | 10x speedup on text-heavy CSVs; moves the performance wall from 100K to 1M rows | 1 hr | -| **Add Parquet/CSV-from-stdin input** | Parquet is the growth vector; stdin enables piping from other tools | 2-3 hr | -| **Add a sixth analyzer: missing-value patterns** | Gap vs. 
ydata-profiling; "15% of rows have no email" is a finding consultants care about | 2-3 hr | -| **Add annotated Excel output** — highlight problem cells in the source file | The "spell-checker UX" analogy; no competitor does this; huge differentiation | 4-6 hr | -| **Stream-process loaders** | Eliminates the 100K-row memory wall entirely | 4-6 hr | -| **Lock file + `pip audit` in CI** | Reproducible builds + vulnerability scanning | 30 min | - -The annotated Excel output is the **long-term differentiator**. No tool in the landscape highlights problem cells in the user's own file. Combined with the PDF diagnostic report, this creates a two-artifact deliverable: "here's your file with problems highlighted, and here's the report explaining what each problem means." That's the building-inspection analogy made concrete. +|------|--------------------:|--------| +| Create a simple landing page or GitHub Pages site with example reports | Only tool in niche with PDF + HTML — show it | 3-4 hr | +| Write a "how I built this" or "data quality for consultants" blog post | SEO + positioning for the unoccupied niche | 2-3 hr | +| Submit to Python data quality lists/awesome-lists | Discovery in the ecosystem | 30 min | +| Add a "Used by" or testimonial section after first client engagement | Social proof — the strongest trust signal | Ongoing | +| Consider a short demo video (2 min: file in → report out) | Shows zero-config UX without installation | 1-2 hr | + +**Why this is Move 4:** The tool is ready. The gap is now awareness. No amount of code improvement matters if prospects can't find it or don't see social proof. But this depends on Moves 1-3 being done first — don't promote something that can crash during a demo. --- -### Strategic Summary +### What NOT to Build + +| Temptation | Why to Resist | +|-----------|---------------| +| Custom validation rules | GX + Pandera own this. You'd compete with 15k-star projects on their turf. | +| Database connectivity | Soda owns this. 
Stay in the file lane — it's where consultants live. | +| Statistical profiling | fg-data-profiling owns this. datascope finds *problems*, not *distributions*. | +| Web UI / SaaS | Premature. The CLI + report is the right form factor for demos and consulting. | +| Polars backend | Your audience uses Excel and CSV. Polars is an engineer concern. | +| AI-generated fix suggestions | Marketing noise in the landscape. Concrete fix recommendations > vague AI. | + +--- + +### Execution Order ``` -Now Soon Next Later -───────────────────────────────────────────────────────── -CLEAN POLISH BRIDGE GROW -v1 artifacts PDF rendering --format json HTML reports -defusedxml page numbers CI/CD Parquet input -scorer.py sample outputs PyPI publish Annotated Excel -README URL health text enum sub-types Stream loaders - --verbose/--quiet Missing-value analyzer +Week 1 (before sharing links): + Move 1: DEMO-PROOF [~1-2 hours] + Move 2: BRAND [~2-3 hours] + +Week 2 (before promoting repo): + Move 3: REPO HYGIENE [~1 hour] + +Ongoing (after first client conversations): + Move 4: GROWTH [as time allows] ``` -**The thesis:** datascope's moat is the combination of cell-level detection and professional narrative output. Clean the repo (Move 1), make the report flawless (Move 2), then add JSON output to bridge the engineer audience (Move 3). Everything after that deepens the moat or expands the audience. 
+--- + +### The One-Sentence Thesis -**What NOT to build:** -- Custom validation rules (GX/Pandera own this; don't compete on their turf) -- Statistical profiling (ydata-profiling owns this; datascope finds *problems*, not *statistics*) -- Database connectivity (Soda owns this; stay in the file-auditing lane) -- Drift detection (whylogs owns this; datascope is point-in-time, not longitudinal) -- Web UI / SaaS (premature; the CLI + report is the right form factor for now) +**datascope is the only zero-config tool that produces professional PDF/HTML reports from messy data files — fix the 4 demo-killers, brand the report output, and it's ready for prospects.** diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..6211e57 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,51 @@ +# Changelog + +All notable changes to datascope are documented here. + +## [2.2.0] — 2026-05-15 + +### Added +- Parquet input support (`pip install datascope-dq[parquet]`) +- HTML report output (`--format html`) +- Annotated Excel output (`--format annotated-excel`) — highlights problem cells in the source file +- Missing-value pattern analyzer (detects high null rates and distribution) +- `--max-rows` safety guard for large datasets +- `pip-audit` step in CI workflow +- Regex pre-filter for CSV datetime inference (10x speedup on text-heavy files) + +### Changed +- Report branding: PDF title page, versioned footers, HTML favicon and meta tags +- JSON output includes `generator` field for provenance + +## [2.1.0] — 2026-05-15 + +### Added +- JSON output format (`--format json`) for pipeline integration +- `--verbose` and `--quiet` CLI flags +- GitHub Actions CI (pytest + ruff + pip-audit) +- PyPI publishing as `datascope-dq` +- `--format both` for PDF + JSON together + +### Changed +- Promoted FindingType sub-types to first-class enum values +- Complete `pyproject.toml` metadata for PyPI + +## [2.0.0] — 2026-05-14 + +### Added +- Complete v2 rewrite: cell-level type detection 
architecture +- 5 analyzers: type consistency, sentinels, leading zeros, mixed dates, cardinality +- Severity classification by downstream impact (Critical / Warning / Info) +- Plain-English narrative templates ("assumption vs. reality" framing) +- Professional PDF report via reportlab +- CSV loader with raw-string type inference +- Excel loader via openpyxl with per-cell type preservation + +### Removed +- v1 scoring system (numeric scores replaced by severity + narrative) +- `scorer.py` monolith +- `--strict-types` flag (cell-level detection is always on) + +## [1.0.0] — 2026-03-13 + +Initial release as "field-story-scorer." Single-file tool that scored data quality on a numeric scale. Excel-only input, landscape PDF output. diff --git a/DECISIONS.md b/DECISIONS.md index e1cb9b7..ebe437b 100644 --- a/DECISIONS.md +++ b/DECISIONS.md @@ -23,3 +23,9 @@ pyarrow is large (~200MB installed). Making it a core dependency would bloat ins ## 2026-05-15: PEP 639 license format — drop legacy classifier Modern setuptools (isolated build env) rejects the `License :: OSI Approved :: MIT License` classifier when `license = "MIT"` is also present. Removed the classifier, keeping only the PEP 639 `license` string field. Future classifiers should not include license entries. + +## 2026-05-16: Stay in the file-audit niche; do not compete with pipeline tools + +- **Why:** GX owns rules, Pandera owns schemas, Soda owns databases, ydata owns stats. datascope's moat is cell-level detection + professional narrative reports for non-technical readers. Competing on their turf dilutes the positioning. 
+- **Scope:** All future feature decisions for datascope +- **Do not:** Add custom validation rules, database connectors, statistical profiling, Polars backend, drift detection, or web UI/SaaS diff --git a/HANDOFF.md b/HANDOFF.md index 2aca100..349c66d 100644 --- a/HANDOFF.md +++ b/HANDOFF.md @@ -1,5 +1,17 @@ # Handoff +## 2026-05-16 12:00 — Audit Round 2 + Demo-Proof + +**Started from:** v2.2.0 shipped. Preparing to share demo links with first prospects. + +**Did:** Full 4-phase audit for client-readiness. Executed Moves 1-3: fixed --sheet crash, grammar bugs, stale help text; added branding to PDF/HTML/JSON; added badges, CHANGELOG, sample outputs, moved dev docs to .dev/. PR #6 created and pushed. + +**State:** PR #6 open with 3 commits. Tool is demo-safe — no crashes, branded reports, professional repo. 283 tests pass, lint clean. Move 4 (Growth) remains future work. + +**Next:** Merge PR #6 to main. Then either (a) start Move 4 growth work (landing page, demo video) or (b) rename local folder to "datascope" and begin client outreach. Tool is ready. + +--- + ## Session — 2026-05-15 (Move 4 + tag v2.2.0) **Phase:** Execution complete — all 25 sub-tasks done diff --git a/PLAN.md b/PLAN.md index 376cd7d..e5e3793 100644 --- a/PLAN.md +++ b/PLAN.md @@ -1,9 +1,47 @@ # datascope — Improvement Plan -Derived from full project audit (2026-05-15). See AUDIT.md for rationale. +Derived from project audits (2026-05-15, 2026-05-16). See AUDIT.md for rationale. Tier: Medium -Current focus: All moves complete. Improvement plan fully executed. +Current focus: Move 1 DEMO-PROOF — fix demo-killers before sharing with prospects. + +--- + +## Decomposition: Move 1 DEMO-PROOF + +Goal: A prospect can run `datascope` on any file during a live call without encountering crashes, grammar errors, or stale documentation. 
+ +### Steps + +- [x] A1: Catch invalid `--sheet` with friendly error messages + - Depends on: none + - Change: Wrap the sheet lookup in `loaders/excel.py` with try/except for `IndexError` (numeric) and `KeyError` (named). Raise `ValueError` with a message listing available sheets. + - Done when: `datascope samples/input/sample_mixed_types.xlsx --sheet 99` and `--sheet NonExistent` both print "Error: Sheet not found..." to stderr and exit 1 (no traceback). + +- [x] A2: Fix singular/plural grammar in narrative templates + - Depends on: none + - Change: In `findings/templates.py`, fix `type_inconsistency()` — "X value were" → conditional "was"/"were"; in `sentinel_value()` — "(N times)" → conditional "(1 time)"/"(N times)". Audit all 7 template functions for similar issues. + - Done when: `datascope` on a file with exactly 1 minority-type value outputs "1 str value was found"; sentinel with count=1 outputs "(1 time)"; `python -m pytest tests/` still passes. + +- [x] A3: Update argparse description to include Parquet + - Depends on: none + - Change: In `cli.py` line 30-33, update description from ".xlsx or .csv" to ".xlsx, .csv, or .parquet". + - Done when: `datascope --help` output mentions all three formats. + +- [x] A4: Fix numpy dependency in `generate_sample.py` + - Depends on: none + - Change: Add `numpy>=1.24.0` to `[project.optional-dependencies] dev` in `pyproject.toml` (it's already a dev-time tool, not needed at runtime). Update `requirements-dev.txt` if it exists. + - Done when: `pip install -e ".[dev]" && python generate_sample.py` succeeds; numpy is NOT in `[project.dependencies]`. + +- [x] A5: Warn when `--sheet` is passed for non-Excel files + - Depends on: A1 (sheet error handling should be in place first) + - Change: In `cli.py`, after resolving `ext`, if `ext != ".xlsx"` and `args.sheet is not None`, print a warning to stderr: "Warning: --sheet is ignored for {ext} files." 
+ - Done when: `datascope somefile.csv --sheet Revenue` prints the warning to stderr but still runs successfully. + +- [x] A6: Integration verification + - Depends on: A1-A5 all complete + - Run full test suite, run all 5 output formats on both sample files, test the error paths manually. + - Done when: `python -m pytest tests/` passes (283+ tests); all manual edge cases produce friendly output (no tracebacks). --- diff --git a/README.md b/README.md index fc22749..7d0fc02 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,10 @@ # datascope +[](https://pypi.org/project/datascope-dq/) +[](https://github.com/MsShawnP/datascope/actions/workflows/ci.yml) +[](https://pypi.org/project/datascope-dq/) +[](LICENSE) + Data created upstream — by manufacturing teams entering UPCs, inventory staff assigning product codes, offshore developers choosing column types — silently breaks systems downstream. A product code with letters where EDI expects numbers. Fifteen "N/A" strings buried in 500 numeric rows that pandas silently drops, skewing every calculation by 3%. datascope finds these problems, explains what's wrong in plain English, and tells you what to fix. It reads each cell's actual type (not what pandas infers), detects hidden quality issues, classifies their severity by downstream impact, and generates a professional diagnostic report. diff --git a/datascope/cli.py b/datascope/cli.py index 12c7dfc..3f84091 100644 --- a/datascope/cli.py +++ b/datascope/cli.py @@ -27,8 +27,8 @@ def _build_parser() -> argparse.ArgumentParser: parser = argparse.ArgumentParser( prog="datascope", description=( - "Analyse a tabular dataset (.xlsx or .csv) for data-quality issues\n" - "and generate a professional PDF diagnostic report." + "Analyse a tabular dataset (.xlsx, .csv, or .parquet) for data-quality\n" + "issues and generate a professional diagnostic report." 
), formatter_class=argparse.RawDescriptionHelpFormatter, ) @@ -155,6 +155,7 @@ def _format_summary(findings: list, source_metadata: dict, output_path: Path) -> def _write_json(findings: list, source_metadata: dict, output_path: Path) -> None: """Write findings as structured JSON.""" + from datascope import __version__ counts: dict[str, int] = {"critical": 0, "warning": 0, "info": 0, "total": 0} for f in findings: @@ -163,6 +164,7 @@ def _write_json(findings: list, source_metadata: dict, output_path: Path) -> Non counts["total"] += 1 payload = { + "generator": {"tool": "datascope", "version": __version__}, "source": dict(source_metadata), "summary": counts, "findings": [ @@ -220,7 +222,18 @@ def main(argv: list[str] | None = None) -> None: from datascope.loaders import load sheet = _parse_sheet(args.sheet) - result = load(input_path, sheet=sheet) + + if args.sheet is not None and ext != ".xlsx": + print( + f"Warning: --sheet is ignored for {ext} files.", + file=sys.stderr, + ) + + try: + result = load(input_path, sheet=sheet) + except ValueError as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) # --- size guard ----------------------------------------------------- rows, cols = result.dataframe.shape diff --git a/datascope/findings/templates.py b/datascope/findings/templates.py index f8811f2..9327443 100644 --- a/datascope/findings/templates.py +++ b/datascope/findings/templates.py @@ -67,12 +67,14 @@ def type_inconsistency(field_name: str, evidence: dict[str, Any]) -> dict[str, s minority_desc = " and ".join(minority_parts) if minority_parts else "other types" example_str = _join_examples(all_examples) + total_minority = sum(mt.get("count", 0) for mt in minority_types) + verb = "was" if total_minority == 1 else "were" assumption = ( f"Column '{field_name}' appears to be purely {majority}." 
) reality = ( - f"However, {minority_desc} were found among {total} non-null values " + f"However, {minority_desc} {verb} found among {total} non-null values " f"(the majority type covers {_pct(majority_pct)}). " f"Examples of unexpected values: {example_str}." ) @@ -121,7 +123,7 @@ def sentinel_value(field_name: str, evidence: dict[str, Any]) -> dict[str, str]: sentinel_examples = [s.get("value", "?") for s in sentinels] sentinel_counts = [ - f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} times)" + f"{_quote(s.get('value', '?'))} ({s.get('count', 0)} {'time' if s.get('count', 0) == 1 else 'times'})" for s in sentinels ] sentinel_desc = ", ".join(sentinel_counts) if sentinel_counts else "unknown sentinel values" @@ -266,7 +268,7 @@ def near_constant(field_name: str, evidence: dict[str, Any]) -> dict[str, str]: top_values = evidence.get("top_values", []) top_parts = [ - f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} times)" + f"{_quote(tv.get('value', '?'))} ({tv.get('count', 0)} {'time' if tv.get('count', 0) == 1 else 'times'})" for tv in top_values[:3] ] top_desc = ", ".join(top_parts) if top_parts else "(no values)" diff --git a/datascope/loaders/excel.py b/datascope/loaders/excel.py index f08ce7d..6d309bd 100644 --- a/datascope/loaders/excel.py +++ b/datascope/loaders/excel.py @@ -41,8 +41,22 @@ def load_excel(path: Path, sheet: str | int = 0) -> LoaderResult: try: if isinstance(sheet, int): + if sheet < 0 or sheet >= len(wb.worksheets): + available = [ws.title for ws in wb.worksheets] + wb.close() + raise ValueError( + f"Sheet index {sheet} is out of range. " + f"Available sheets: {', '.join(available)}" + ) ws = wb.worksheets[sheet] else: + if sheet not in wb.sheetnames: + available = wb.sheetnames + wb.close() + raise ValueError( + f"Sheet '{sheet}' not found. 
" + f"Available sheets: {', '.join(available)}" + ) ws = wb[sheet] sheet_title = ws.title diff --git a/datascope/reports/html.py b/datascope/reports/html.py index 39895f8..cd75fcd 100644 --- a/datascope/reports/html.py +++ b/datascope/reports/html.py @@ -171,11 +171,16 @@ def write_html( """ + from datascope import __version__ + page = f"""
+ + +Health Assessment: This dataset has 2 critical findings that will cause silent data loss or incorrect calculations if not addressed. These should be fixed before using this data for any downstream purpose.
+Assumption: Column 'revenue_mixed' appears to be purely numeric.
+Reality: However, 15 str values were found among 200 non-null values (the majority type covers 92.5%). Examples of unexpected values: 'N/A', 'N/A', 'N/A', and 2 more.
+Impact: Rows with non-numeric values in 'revenue_mixed' will be silently dropped or converted to NaN during sums, averages, and other calculations, producing incorrect results without any error message.
+Recommended Fix: Review the non-numeric values in 'revenue_mixed' and decide whether they should be converted to numeric, replaced with a proper null, or moved to a separate column.
+Prevention Rule: Every value in 'revenue_mixed' should be the same type (numeric). Add a type-check validation rule at data entry or ingestion time.
+Assumption: Column 'revenue_mixed' appears to be a clean numeric column.
+Reality: However, 7.5% of values (1 distinct sentinel string) are placeholder text rather than real data: 'N/A' (15 times).
+Impact: Tools like pandas and Excel silently drop sentinel strings when computing sums or averages on 'revenue_mixed', making totals lower than expected. No error is raised, so the data loss goes unnoticed.
+Recommended Fix: Replace sentinel values in 'revenue_mixed' with proper null/blank cells so that downstream tools handle missing data correctly and row counts reflect reality.
+Prevention Rule: Never use placeholder text like 'N/A' in a numeric column. Use blank cells or a dedicated status column instead.
| Field | Issue Type | Severity |
|---|---|---|
| revenue_mixed | Type Inconsistency | Critical |
| revenue_mixed | Sentinel Value | Critical |
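The percentages in the sample report above follow from a per-cell type count: 15 strings among 200 non-null values is 7.5%, which leaves 92.5% for the majority type. A minimal sketch of that arithmetic, assuming the column is available as a plain Python list of cell values. `type_profile` is a hypothetical helper rather than part of datascope's API, and the 185 numeric cells are placeholder data chosen to match the reported counts.

```python
from collections import Counter


def type_profile(values):
    """Summarise the Python types found among the non-null cells of a column."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {"total": 0, "majority": None, "majority_pct": 0.0, "minority": {}}

    counts = Counter(type(v).__name__ for v in non_null)
    majority, majority_count = counts.most_common(1)[0]
    return {
        "total": len(non_null),
        "majority": majority,
        "majority_pct": 100 * majority_count / len(non_null),
        # Every type other than the majority is a candidate finding.
        "minority": {t: n for t, n in counts.items() if t != majority},
    }


# Placeholder column: 185 numeric cells plus 15 "N/A" strings.
column = [100.0] * 185 + ["N/A"] * 15
profile = type_profile(column)
```

Counting types per cell, rather than relying on a column-level dtype, is what keeps the 15 strings visible instead of letting them be coerced away.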
Health Assessment: No critical issues were found, but 3 warnings indicate potential problems that could cause misinterpretation or key mismatches.
+Assumption: Column 'mostly_null' is expected to be fully populated.
+Reality: However, 412 of 500 rows (82.4%) are null or blank. The missing values are scattered across the dataset.
+Impact: With more than half its values missing, 'mostly_null' is unreliable for analysis. Aggregations will silently exclude the missing rows, and any model trained on this column will learn from a biased sample.
+Recommended Fix: Investigate why 'mostly_null' has missing values. If the blanks represent a known condition, consider a default value or a separate status column. If they are data-entry gaps, backfill from the source system.
+Prevention Rule: Add a NOT NULL or completeness check for 'mostly_null' at ingestion time. Flag any batch where null rate exceeds the historical baseline.
+Assumption: Column 'notes' is expected to be fully populated.
+Reality: However, 250 of 500 rows (50.0%) are null or blank. The missing values are scattered across the dataset.
+Impact: With half of its values missing, 'notes' is unreliable for analysis. Aggregations will silently exclude the missing rows, and any model trained on this column will learn from a biased sample.
+Recommended Fix: Investigate why 'notes' has missing values. If the blanks represent a known condition, consider a default value or a separate status column. If they are data-entry gaps, backfill from the source system.
+Prevention Rule: Add a NOT NULL or completeness check for 'notes' at ingestion time. Flag any batch where null rate exceeds the historical baseline.
+Assumption: Column 'revenue' appears to be a unique identifier (ID column).
+Reality: However, 496 of 500 values are unique (uniqueness ratio: 99.2%), meaning some IDs appear more than once. Duplicate values include: '117.71', '210.24', '263.77', '9.52'.
+Impact: Duplicate IDs in 'revenue' cause row-level joins to fan out, producing unexpected extra rows in merged datasets. Aggregations that assume one row per ID will double-count affected records.
+Recommended Fix: Investigate the duplicate values in 'revenue' to determine whether they are true duplicates (same record entered twice) or legitimate repeats (one-to-many relationship). De-duplicate or re-model accordingly.
+Prevention Rule: If 'revenue' is meant to be a primary key, enforce a uniqueness constraint at the database or validation layer.
+Assumption: Column 'constant_col' is expected to carry meaningful, varying data.
+Reality: However, only 1 distinct value was found across 500 rows (uniqueness ratio: 0.2%). Most common: 'FIXED' (500 times).
+Impact: A near-constant column like 'constant_col' adds no analytical value. Including it in models or reports may mislead readers into thinking the field varies when it does not.
+Recommended Fix: Verify whether 'constant_col' should actually vary. If not, document it as a constant and consider removing it from analysis. If it should vary, investigate why the data is uniform.
+Prevention Rule: If 'constant_col' is supposed to carry diverse values, add a data-quality check that flags columns with fewer than 1% unique values.
+Assumption: Column 'discount_pct' is expected to be fully populated.
+Reality: However, 80 of 500 rows (16.0%) are null or blank. The missing values are scattered across the dataset.
+Impact: Missing values in 'discount_pct' will be silently excluded from calculations. If the missingness is not random, aggregations and models will be biased toward the non-missing subset.
+Recommended Fix: Investigate why 'discount_pct' has missing values. If the blanks represent a known condition, consider a default value or a separate status column. If they are data-entry gaps, backfill from the source system.
+Prevention Rule: Add a NOT NULL or completeness check for 'discount_pct' at ingestion time. Flag any batch where null rate exceeds the historical baseline.
+Assumption: Column 'is_renewal' is expected to carry meaningful, varying data.
+Reality: However, only 2 distinct values were found across 500 rows (uniqueness ratio: 0.4%). Most common: 'True' (266 times), 'False' (234 times).
+Impact: A near-constant column like 'is_renewal' adds no analytical value. Including it in models or reports may mislead readers into thinking the field varies when it does not.
+Recommended Fix: Verify whether 'is_renewal' should actually vary. If not, document it as a constant and consider removing it from analysis. If it should vary, investigate why the data is uniform.
+Prevention Rule: If 'is_renewal' is supposed to carry diverse values, add a data-quality check that flags columns with fewer than 1% unique values.
+Assumption: Column 'order_status' is expected to carry meaningful, varying data.
+Reality: However, only 4 distinct values were found across 500 rows (uniqueness ratio: 0.8%). Most common: 'Pending' (136 times), 'Closed' (131 times), 'Active' (127 times).
+Impact: A near-constant column like 'order_status' adds no analytical value. Including it in models or reports may mislead readers into thinking the field varies when it does not.
+Recommended Fix: Verify whether 'order_status' should actually vary. If not, document it as a constant and consider removing it from analysis. If it should vary, investigate why the data is uniform.
+Prevention Rule: If 'order_status' is supposed to carry diverse values, add a data-quality check that flags columns with fewer than 1% unique values.
| Field | Issue Type | Severity |
|---|---|---|
| mostly_null | Missing Values | Warning |
| notes | Missing Values | Warning |
| revenue | Suspected Duplicate IDs | Warning |
| constant_col | Near-Constant Column | Info |
| discount_pct | Missing Values | Info |
| is_renewal | Near-Constant Column | Info |
| order_status | Near-Constant Column | Info |
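The null rates and uniqueness ratios quoted in the findings above are plain proportions. A minimal sketch, assuming a column is held as a Python list with `None` or blank strings standing in for missing cells. `null_rate` and `uniqueness_ratio` are hypothetical helpers, not datascope functions, and the three lists are placeholder data shaped to reproduce the reported counts (412 of 500 missing, 496 of 500 unique, 1 distinct value in 500).

```python
def null_rate(values):
    """Percentage of cells that are None or blank strings."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )
    return round(100 * missing / len(values), 1)


def uniqueness_ratio(values):
    """Percentage of distinct values among the non-null cells."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return round(100 * len(set(non_null)) / len(non_null), 1)


# Placeholder columns mirroring the counts in the sample report.
mostly_null = [1.0] * 88 + [None] * 412
revenue = list(range(496)) + [0, 1, 2, 3]  # four repeated values
constant_col = ["FIXED"] * 500
```

How these ratios map to Warning or Info severity is not shown here, because the thresholds the tool applies are not stated in this section.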