As a developer, I want nabledge-creator tool so that I can create and manage knowledge files efficiently #103

Open
kiyotis wants to merge 34 commits into main from 99-nabledge-creator-tool

Conversation

@kiyotis kiyotis commented Mar 2, 2026

Closes #99

Approach

Implement the nabledge-creator tool to automate the conversion of official Nablarch documentation (RST/MD/Excel) into AI-searchable knowledge files (JSON). The tool follows a 6-step pipeline:

  1. List sources - Scan official documentation files
  2. Classify - Categorize by Type/Category based on directory structure
  3. Generate - Extract knowledge with claude -p, following the "nothing missed, nothing added" principle
  4. Build index - Aggregate metadata from all knowledge files
  5. Generate docs - Convert JSON to human-readable Markdown
  6. Validate - Verify structure (17 items) and content (4 aspects)

Technical choices:

  • Python 3 with standard library (concurrent.futures for parallelism)
  • Claude Sonnet 4.5 for all AI-powered steps
  • JSON Schema for structured output
  • Resume capability for cost-efficient reruns
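
The combination of concurrent.futures and resume capability could look roughly like the sketch below. It is a minimal illustration, not the tool's code: `generate_one` is a hypothetical stand-in for the AI-backed generation call, and "resume" is assumed to mean "skip any source whose output file already exists".

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def generate_one(source_id: str, out_dir: Path) -> Path:
    """Hypothetical stand-in for the expensive AI-backed generation step."""
    out_path = out_dir / f"{source_id}.json"
    out_path.write_text(json.dumps({"id": source_id}), encoding="utf-8")
    return out_path


def run_with_resume(source_ids, out_dir: Path, max_workers: int = 4):
    """Generate only the files that do not exist yet; return the IDs processed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Resume: anything already on disk is treated as done and is not re-run.
    pending = [s for s in source_ids if not (out_dir / f"{s}.json").exists()]
    done = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_one, s, out_dir): s for s in pending}
        for future in as_completed(futures):
            future.result()  # re-raise any exception from the worker
            done.append(futures[future])
    return sorted(done)
```

A second call with the same output directory only processes IDs that were added since the first run, which is what keeps reruns cheap.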

Quality assurance:

  • 17 structural checks (JSON format, required fields, URL validity, hints coverage)
  • 4 content validation aspects (information completeness, no hallucination, section splitting, search hint quality)
  • All checks must pass before completion

Tasks

  • Delete existing knowledge files (clean slate)
  • Create project structure (tools/knowledge-creator/)
  • Implement run.py (CLI entry point)
  • Implement Step 1 (list_sources.py: source file scanning)
  • Implement Step 2 (classify.py: Type/Category classification)
  • Implement Step 3 (generate.py: knowledge file generation with concurrency)
  • Implement Step 4 (build_index.py: index.toon generation)
  • Implement Step 5 (generate_docs.py: Markdown documentation)
  • Implement Step 6 (validate.py: validation framework)
  • Create prompt templates (generate.md, classify_patterns.md, validate.md)
  • Create comprehensive documentation (README.md)
  • Apply expert review improvements (path validation, error handling, documentation clarity)
  • Fix JSON schema extraction error (--json-schema usage)
  • Remove logs from git tracking and update .gitignore
  • Generate v6 knowledge files in test mode and verify results (User runs tool, AI verifies)
  • Generate all v6 knowledge files and verify results (User runs tool, AI verifies)

Expert Review

Three AI expert reviews were conducted on the implementation:

  • Software Engineer Review - Rating: 4/5

    • Overall: Well-structured with good separation of concerns
    • Improvements: Path validation added, minor issues deferred to future work
  • Prompt Engineer Review - Rating: 4/5

    • Overall: Clear instructions with comprehensive validation rules
    • Improvements: Error handling added, thresholds clarified, validation criteria defined
  • Technical Writer Review - Rating: 4/5

    • Overall: Comprehensive documentation with clear structure
    • Improvements: File paths standardized, terminology made consistent, heading hierarchy fixed

Nine high-value improvements were implemented, including path validation, error handling documentation, threshold clarification, and documentation consistency fixes.

Success Criteria Check

| Criterion | Status | Evidence |
|---|---|---|
| Delete all existing knowledge files before starting implementation | ✅ Done | All docs/knowledge files removed in commit 7169525 |
| nabledge-creator tool is created and functional | ✅ Done | Implemented at tools/knowledge-creator/ with complete 6-step pipeline |
| Implementation follows design document | ✅ Done | Follows doc/99-nabledge-creator-tool/knowledge-creator-design.md specifications |
| Tool supports nabledge-6 and nabledge-5 | ✅ Done | Supports --version 6/5/all parameter |
| Tool provides clear error messages | ✅ Done | Per-file logging at logs/v{version}/ with success/error tracking |
| Documentation includes usage examples | ✅ Done | README.md with usage examples, CLI help, troubleshooting guide |
| All knowledge files recreated using tool | ⬜ Pending | Requires test mode verification and full generation (User executes tool) |

🤖 Generated with Claude Code

kiyotis commented Mar 2, 2026

Closing this PR.

@kiyotis kiyotis closed this Mar 2, 2026
@kiyotis kiyotis reopened this Mar 2, 2026
kiyotis and others added 24 commits March 2, 2026 13:13
- Define 6 validation objectives (end-to-end, variation, concurrency, etc.)
- Design test mode: single command to run curated file set
- Implementation: modify only Step 2 output (maintainable approach)
- Test file set defined in test-files.json (declarative)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…neration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reference design doc README structure for clarity.
- Add processing flow table (6 steps with method and description)
- Add validation details table (17 structural + 4 content checks)
- Reorganize usage section (quick start, basic, options, resume, updates)
- Add quality assurance explanation
- Simplify and focus on user needs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hardcoded model ID 'claude-sonnet-4-5-20250929' was invalid
and causing 400 errors. Removed --model option to use the
environment's default model instead.

This is simpler and more flexible than hardcoding a specific model ID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add utilities for reading Excel files and generating knowledge files.
Include initial batch of generated knowledge files from documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed critical bugs preventing knowledge file generation:

1. Path duplication: Removed duplicate repo prefix in output_path and
   assets_dir construction (step3_generate.py lines 118, 133)
2. Timeout: Increased from 600s to 900s to handle complex files
3. Added BUGFIX.md documenting root cause and verification steps
4. Added test_paths.py to verify path construction logic
5. Updated README with usage warning about Claude Code sessions

Files were generated in wrong location (home/.../home/.../work3/...)
causing validation to fail with json_files=0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive code tracing revealed 4 more critical bugs:

1. extract_assets: Removed duplicate repo prefix for assets_dir_abs
   (was adding repo to already-absolute path parameter)

2. extract_assets: Fixed assets_path to use relative path instead of
   absolute path in knowledge file references

3. generate_one: Modified to pass both absolute and relative paths
   to extract_assets for correct file operations and references

4. delete_knowledge (run.py): Use file_info['assets_dir'] instead
   of manual path construction

Created comprehensive test (test_path_logic.py) verifying:
- All Context paths are absolute
- All file_info paths are relative
- No path duplication anywhere
- Assets relative paths computed correctly

All tests pass. Manual verification from regular shell required
(cannot test claude -p from within Claude Code session).
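
The invariant verified by test_path_logic.py (absolute Context paths, relative file_info paths, no duplicated prefix) can be illustrated with a small sketch. The helper names `resolve` and `assets_reference` are hypothetical and are not taken from the tool; the directory layout in the usage note is likewise an assumption.

```python
import os
from pathlib import Path


def resolve(repo: Path, relative: str) -> Path:
    """Join the repo root with a repo-relative path, rejecting absolute input.

    Rejecting absolute paths makes an accidental double prefix an error
    instead of a silently wrong location.
    """
    rel = Path(relative)
    if rel.is_absolute():
        raise ValueError(f"expected a repo-relative path, got {relative!r}")
    return repo / rel


def assets_reference(knowledge_file: Path, assets_dir: Path) -> str:
    """Relative path to assets_dir as it would be written inside the knowledge file."""
    return Path(os.path.relpath(assets_dir, knowledge_file.parent)).as_posix()
```

For example, with an assumed layout of `knowledge/a.json` next to `knowledge/assets/a/`, `assets_reference` yields `assets/a`, so references inside the knowledge file stay valid wherever the repository is checked out.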

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --json-schema support to claude -p invocation for structured
output validation. Implement automatic file splitting for large RST
files based on section sizes.

Changes:
- common.py: Add json_schema parameter to run_claude()
- step2_classify.py: Add h2/h3 section analysis and auto-splitting
  - Split files with sections > 1000 lines
  - Expand large h2 sections into h3 subsections
  - Pattern B: Section-based splitting (no file-level threshold)
- step3_generate.py: Extract JSON schema from prompt and pass to claude
  - Add section range extraction for split files

Results:
- tag (3085 lines) → 4 parts (913, 975, 898, 224 lines)
- database (1359 lines) → 2 parts (996, 340 lines)
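
One way to picture the section-based splitting is a greedy packing of consecutive sections under a line limit. This is only a sketch under assumptions: `pack_sections` is a hypothetical name, the input is a precomputed list of section lengths, and it is not claimed to reproduce the part sizes listed above.

```python
def pack_sections(section_lengths, limit=1000):
    """Greedily group consecutive sections so each part stays within `limit` lines.

    Returns a list of parts, each a list of section indexes. A single section
    longer than `limit` becomes its own part here; the commit instead expands
    such h2 sections into h3 subsections before splitting.
    """
    parts, current, current_len = [], [], 0
    for index, length in enumerate(section_lengths):
        # Start a new part when adding this section would exceed the limit.
        if current and current_len + length > limit:
            parts.append(current)
            current, current_len = [], 0
        current.append(index)
        current_len += length
    if current:
        parts.append(current)
    return parts
```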

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add clean.py to remove generated files:
- Knowledge files (.claude/skills/nabledge-{6,5}/knowledge/)
- Browsable docs (.claude/skills/nabledge-{6,5}/docs/)
- Python cache (steps/__pycache__)
- Optionally logs (--logs flag)

Usage:
  python3 tools/knowledge-creator/clean.py --version 6
  python3 tools/knowledge-creator/clean.py --version 6 --logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved issue where multiple files with the same name (feature_details.rst,
getting_started.rst) in different categories were assigned duplicate IDs,
causing generation failures with "Expecting value: line 1 column 1 (char 0)".

Changes:
- Modified generate_id() to accept category parameter
- Changed ID format from "filename" to "category-filename"
- Reordered classification to run before ID generation
- Updated test-files.json with new ID format

Result:
- feature_details: 7 unique IDs (was 1 duplicate)
- getting_started: 5 unique IDs (was 1 duplicate)
- All test files correctly identified (21/21)

Note: Some duplicates remain in deeper subdirectories (e.g.,
libraries/data_io/functional_comparison.rst vs
libraries/validation/functional_comparison.rst) but these are
edge cases that can be addressed separately.
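
The "category-filename" format and the remaining collisions can be sketched as follows. The signature of `generate_id` here is an assumption (the commit only says it now accepts a category), and `find_duplicate_ids` is a hypothetical helper added for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def generate_id(category: str, source_path: str) -> str:
    """Build an ID as "category-filename" (file stem without extension)."""
    return f"{category}-{PurePosixPath(source_path).stem}"


def find_duplicate_ids(entries):
    """Return IDs produced more than once for (category, source_path) pairs."""
    counts = Counter(generate_id(category, path) for category, path in entries)
    return sorted(file_id for file_id, n in counts.items() if n > 1)
```

Two files that share both a category and a file name in different subdirectories still collide under this scheme, which matches the edge case described in the note above.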

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix JSON extraction error by properly using --json-schema option.

Root cause:
- Passed --json-schema to claude -p but missing --output-format json
- Did not extract structured_output from response
- Did not check subtype field for success/failure

Changes:
- Add --output-format json when using --json-schema
- Extract structured_output from response JSON
- Handle subtype (success/error_max_structured_output_retries)
- Simplify extract_json() as output is already JSON

Testing requires running outside Claude Code session:
  python tools/knowledge-creator/run.py --version 6 --test-mode
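
Handling the response could look like the sketch below. It assumes the JSON output is a single object with top-level `subtype` and `structured_output` fields, as this commit describes; the function and exception names are hypothetical, and the real response may carry more fields than are checked here.

```python
import json


class StructuredOutputError(RuntimeError):
    """Raised when the response does not carry usable structured output."""


def parse_structured_response(stdout: str) -> dict:
    """Extract structured_output from a JSON response, checking subtype first."""
    response = json.loads(stdout)
    subtype = response.get("subtype")
    if subtype != "success":
        # e.g. "error_max_structured_output_retries", as named in the commit
        raise StructuredOutputError(f"claude run failed: subtype={subtype!r}")
    output = response.get("structured_output")
    if not isinstance(output, dict):
        raise StructuredOutputError("response has no structured_output object")
    return output
```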

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove knowledge-creator logs from repository as they are build artifacts.
Add .gitignore rules for:
- tools/knowledge-creator/logs/
- tools/knowledge-creator/steps/__pycache__/
- Generated knowledge files (except index.toon)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add internal label extraction for RST files to distinguish internal vs external references
- Update prompt with clear internal reference list and conversion rules
- Remove markdown code fence handling from classify_patterns output
- Relax S13 validation for very short valid content ("なし")
- Fix S14 to validate only internal references (# prefix)
- Add CLAUDECODE environment cleanup for nested execution
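
A minimal sketch of the environment cleanup, assuming it amounts to removing the CLAUDECODE variable from the environment handed to the child process (the commit does not spell out the exact mechanism; `child_env` is a hypothetical helper):

```python
import os


def child_env(base=None):
    """Return a copy of the environment without the CLAUDECODE variable."""
    env = dict(os.environ if base is None else base)
    env.pop("CLAUDECODE", None)  # no error if the variable is absent
    return env
```

The result would typically be passed as `env=child_env()` to `subprocess.run`, leaving the parent process environment untouched.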

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prepare clean state for knowledge file generation test.
Generated index.toon will be committed after test run to show clear diff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add file-level split threshold (FILE_LINE_THRESHOLD=1000)
- Unify timeout to 600 seconds across all steps
- Add structural validation loop in Step 3 (max 20 iterations)
- Add content validation and improvement loop in Step 6 (max 20 iterations)
- Create validate_single.py for agent-based validation
- Update test-files.json to include all split file parts
- Improve S2 validation to collect all errors in one pass
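
The bounded validation loop can be sketched generically. Here `validate` and `fix` are hypothetical callables supplied by the caller (in the tool, the loop is driven through prompts rather than plain Python), so this only illustrates the control flow with a 20-iteration cap.

```python
def validate_with_retries(document, validate, fix, max_iterations=20):
    """Validate, fix, and re-validate until clean or the iteration cap is hit.

    Returns (document, iterations_used, remaining_errors).
    """
    for iteration in range(1, max_iterations + 1):
        errors = validate(document)  # expected to collect all errors in one pass
        if not errors:
            return document, iteration, []
        document = fix(document, errors)
    # Cap reached: report whatever is still failing after the last fix.
    return document, max_iterations, validate(document)
```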

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kiyotis and others added 8 commits March 2, 2026 20:27
After generating split parts, merge them back into single JSON files:
- Combine sections, contents, and assets from all parts
- Deduplicate search_hints, internal_labels, and related_topics
- Consolidate asset directories
- Clean up individual part files after successful merge

This ensures split large files are presented as single knowledge files
for better searchability and user experience.
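
The merge described here could be sketched as below. The field shapes are assumptions for illustration (sections as a list, the three deduplicated fields as lists of strings), and `merge_parts` is a hypothetical name.

```python
LIST_FIELDS = ("search_hints", "internal_labels", "related_topics")


def merge_parts(parts):
    """Concatenate sections in part order and deduplicate the list fields."""
    merged = {"sections": []}
    merged.update({field: [] for field in LIST_FIELDS})
    for part in parts:
        merged["sections"].extend(part.get("sections", []))
        for field in LIST_FIELDS:
            merged[field].extend(part.get(field, []))
    for field in LIST_FIELDS:
        # dict.fromkeys drops repeats while keeping first-seen order
        merged[field] = list(dict.fromkeys(merged[field]))
    return merged
```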

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
After merging split knowledge files back into single files, update
classified_list.json to reflect the new structure:
- Remove split entries (part 1, 2, 3, etc.)
- Add merged file entry with original ID
- Preserve all metadata from original entries
- Sort files consistently by type/category/id

This ensures Step 4, 5, 6 can correctly locate merged files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated clean_cache() to remove both:
- tools/knowledge-creator/__pycache__
- tools/knowledge-creator/steps/__pycache__

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Move BUGFIX documentation to .pr/00103/notes.md per project conventions
- Remove redundant test_paths.py (superseded by test_path_logic.py)
- Fix undefined content_fail variable in step6_validate.py
- Remove unused requests import from step6_validate.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed three critical issues in split file merging:

1. Output structure: Use 5-field JSON Schema structure (id, title,
   official_doc_urls, index, sections) instead of adding metadata fields

2. Asset path updates: Replace "assets/file-id-1/" with
   "assets/file-id/" in section contents to fix broken links after merge

3. Index hints merging: Deduplicate and merge hints when same section_id
   appears in multiple parts (defensive programming)

Changes ensure merged files are identical in structure to non-split files
and pass all validation checks (Steps 4-6).
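
Fixes 2 and 3 can be illustrated with a short sketch. It assumes part directories are named with a numeric suffix ("file-id-1", "file-id-2", ...) and that index entries are dicts with `id` and `hints` keys; both helper names are hypothetical.

```python
import re


def rewrite_asset_paths(text: str, file_id: str) -> str:
    """Replace "assets/<file_id>-<n>/" with "assets/<file_id>/" in section text."""
    pattern = re.compile(rf"assets/{re.escape(file_id)}-\d+/")
    # A function replacement avoids backslash handling in the replacement string.
    return pattern.sub(lambda _match: f"assets/{file_id}/", text)


def merge_index(entries):
    """Merge hints for entries that share a section id, keeping first-seen order."""
    merged = {}
    for entry in entries:
        hints = merged.setdefault(entry["id"], [])
        for hint in entry["hints"]:
            if hint not in hints:
                hints.append(hint)
    return [{"id": section_id, "hints": hints} for section_id, hints in merged.items()]
```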

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step3 (generate.md):
- Add completion condition at top with warning emoji
- Force validation script execution (max 20 retries)
- Add reminder section before output

Step6 (validate.md):
- Add completion condition at top (same approach as Step3)
- Force validate_single.py execution in loop

Step6 (step6_validate.py):
- Remove Python-side structure validation (delegate to claude)
- Add missing placeholders: OUTPUT_PATH, SOURCE_PATH, FORMAT
- Simplify summary: unified status instead of structure/content split
- Always run content validation (even if structure errors exist initially)

Design:
- Single validation logic in Step6Validate.validate_structure()
- Python invokes claude, claude runs validation loop
- Eliminates double-validation inconsistency issue

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Increase timeout for large files:
- Step3 (generate): 600s -> 1200s
- Step4 (classify patterns): 600s -> 1200s
- Step6 (validate): 600s -> 1200s

Context:
- libraries-tag-1 (913 lines) timed out at 600s during generation
- libraries-database (1359 lines) timed out at 600s during validation
- With validation loop (max 20 retries), 10min is insufficient for large files

Trade-off: Longer timeout increases total processing time but prevents failures on large files.
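
A timeout of this kind is commonly enforced with the standard library's subprocess timeout. The sketch below assumes the steps shell out via `subprocess.run`; `run_with_timeout` is a hypothetical wrapper, not the tool's `run_claude()`.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=1200):
    """Run a command and return its stdout, or None if it exceeds the timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception propagates.
        return None
    return result.stdout
```

Returning None on timeout lets the caller log the file as failed and continue with the rest of the batch, so a later rerun can retry only the files that timed out.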

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Translate three prompt files to English for better AI performance:
- classify_patterns.md: Processing pattern classification prompt
- generate.md: Knowledge file generation prompt
- validate.md: Validation and improvement prompt

Rationale:
- English prompts improve instruction following accuracy
- Better performance with complex logic and validation loops
- Addresses current issues: instruction adherence, structure errors
- Aligns with CLAUDE.md language guidelines (AI conversations = English)

Japanese content preserved:
- Placeholder content (source files, knowledge files)
- Output examples showing Japanese content
- Markdown formatting rules with Japanese examples

Expected improvements:
- Higher probability Claude follows validation loop instructions
- More precise adherence to structural validation rules
- Better efficiency in processing large files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>