As a developer, I want a nabledge-creator tool so that I can create and manage knowledge files efficiently (#103)
…edesign Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closing this PR.
- Define 6 validation objectives (end-to-end, variation, concurrency, etc.)
- Design test mode: single command to run curated file set
- Implementation: modify only Step 2 output (maintainable approach)
- Test file set defined in test-files.json (declarative)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…neration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reference design doc README structure for clarity.
- Add processing flow table (6 steps with method and description)
- Add validation details table (17 structural + 4 content checks)
- Reorganize usage section (quick start, basic, options, resume, updates)
- Add quality assurance explanation
- Simplify and focus on user needs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hardcoded model ID 'claude-sonnet-4-5-20250929' was invalid and caused 400 errors. Removed the --model option so the environment's default model is used instead; this is simpler and more flexible than hardcoding a specific model ID. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add utilities for reading Excel files and generating knowledge files. Include initial batch of generated knowledge files from documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed critical bugs preventing knowledge file generation:
1. Path duplication: removed duplicate repo prefix in output_path and assets_dir construction (step3_generate.py lines 118, 133)
2. Timeout: increased from 600s to 900s to handle complex files
3. Added BUGFIX.md documenting root cause and verification steps
4. Added test_paths.py to verify path construction logic
5. Updated README with usage warning about Claude Code sessions
Files were generated in the wrong location (home/.../home/.../work3/...), causing validation to fail with json_files=0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
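The path-duplication bug above follows a common pattern: a repo prefix joined onto a path that already carries it. A minimal sketch of that failure mode, with a hypothetical repo prefix and function names (not the tool's actual code):

```python
import os

REPO = "home/user/work3/repo"  # hypothetical repo prefix (relative string)

def output_path_buggy(filename: str) -> str:
    # Bug: the repo prefix is added twice -- once when the path is stored,
    # and again at join time -- producing home/.../home/.../work3/... paths.
    stored = os.path.join(REPO, "knowledge", filename)  # prefix added at store time
    return os.path.join(REPO, stored)                   # prefix added again

def output_path_fixed(filename: str) -> str:
    # Fix: keep exactly one place responsible for prepending the repo prefix.
    return os.path.join(REPO, "knowledge", filename)

print(output_path_buggy("db.json"))
# home/user/work3/repo/home/user/work3/repo/knowledge/db.json
print(output_path_fixed("db.json"))
# home/user/work3/repo/knowledge/db.json
```

The fix in the commit follows the second shape: each path is made absolute in a single, well-defined place.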
Comprehensive code tracing revealed 4 more critical bugs:
1. extract_assets: removed duplicate repo prefix for assets_dir_abs (was adding the repo to an already-absolute path parameter)
2. extract_assets: fixed assets_path to use a relative path instead of an absolute path in knowledge file references
3. generate_one: modified to pass both absolute and relative paths to extract_assets for correct file operations and references
4. delete_knowledge (run.py): use file_info['assets_dir'] instead of manual path construction
Created comprehensive test (test_path_logic.py) verifying:
- All Context paths are absolute
- All file_info paths are relative
- No path duplication anywhere
- Assets relative paths computed correctly
All tests pass. Manual verification from a regular shell is required (claude -p cannot be tested from within a Claude Code session).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --json-schema support to the claude -p invocation for structured output validation. Implement automatic file splitting for large RST files based on section sizes.
Changes:
- common.py: add json_schema parameter to run_claude()
- step2_classify.py: add h2/h3 section analysis and auto-splitting
  - Split files with sections > 1000 lines
  - Expand large h2 sections into h3 subsections
  - Pattern B: section-based splitting (no file-level threshold)
- step3_generate.py: extract JSON schema from the prompt and pass it to claude
  - Add section range extraction for split files
Results:
- tag (3085 lines) → 4 parts (913, 975, 898, 224 lines)
- database (1359 lines) → 2 parts (996, 340 lines)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
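The section-based splitting above can be sketched as a greedy packing of h2/h3 sections into parts that stay under the line threshold. This is a simplified illustration (the real step also expands oversized h2 sections into h3 subsections); the threshold value is the one stated in the commit message:

```python
SECTION_LINE_THRESHOLD = 1000  # threshold from the commit message

def split_sections(sections: list[tuple[str, int]]) -> list[list[str]]:
    """Greedily pack (title, line_count) sections into parts whose total
    line count stays at or under the threshold."""
    parts: list[list[str]] = []
    current: list[str] = []
    current_lines = 0
    for title, lines in sections:
        # Start a new part when adding this section would exceed the cap.
        if current and current_lines + lines > SECTION_LINE_THRESHOLD:
            parts.append(current)
            current, current_lines = [], 0
        current.append(title)
        current_lines += lines
    if current:
        parts.append(current)
    return parts

# e.g. the 3085-line file above: each section already near the cap,
# so each lands in its own part -- 4 parts total.
parts = split_sections([("a", 913), ("b", 975), ("c", 898), ("d", 224)])
print(len(parts))  # 4
```

With the commit's two examples, this greedy scheme reproduces the reported part counts (4 parts for tag, 2 for database).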
Add clean.py to remove generated files:
- Knowledge files (.claude/skills/nabledge-{6,5}/knowledge/)
- Browsable docs (.claude/skills/nabledge-{6,5}/docs/)
- Python cache (steps/__pycache__)
- Optionally logs (--logs flag)
Usage:
python3 tools/knowledge-creator/clean.py --version 6
python3 tools/knowledge-creator/clean.py --version 6 --logs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved an issue where multiple files with the same name (feature_details.rst, getting_started.rst) in different categories were assigned duplicate IDs, causing generation failures with "Expecting value: line 1 column 1 (char 0)".
Changes:
- Modified generate_id() to accept a category parameter
- Changed ID format from "filename" to "category-filename"
- Reordered classification to run before ID generation
- Updated test-files.json with the new ID format
Result:
- feature_details: 7 unique IDs (was 1 duplicate)
- getting_started: 5 unique IDs (was 1 duplicate)
- All test files correctly identified (21/21)
Note: some duplicates remain in deeper subdirectories (e.g., libraries/data_io/functional_comparison.rst vs libraries/validation/functional_comparison.rst), but these are edge cases that can be addressed separately.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
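The ID change above amounts to prefixing the category onto the filename stem so that same-named files in different categories stay distinct. A minimal sketch (the category names below are illustrative, not necessarily the tool's real categories):

```python
def generate_id(filename: str, category: str) -> str:
    """Build a knowledge-file ID in "category-filename" format.
    Prefixing the category keeps same-named files in different
    categories from colliding (sketch of the fix, not the exact code)."""
    stem = filename.rsplit(".", 1)[0]  # drop the .rst/.md extension
    return f"{category}-{stem}"

# Before the fix, both of these collapsed to the single ID "feature_details".
print(generate_id("feature_details.rst", "handlers"))   # handlers-feature_details
print(generate_id("feature_details.rst", "libraries"))  # libraries-feature_details
```

Note that this is exactly why classification was reordered to run first: the category must be known before the ID can be generated.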
Fix JSON extraction error by properly using the --json-schema option.
Root cause:
- Passed --json-schema to claude -p but omitted --output-format json
- Did not extract structured_output from the response
- Did not check the subtype field for success/failure
Changes:
- Add --output-format json when using --json-schema
- Extract structured_output from the response JSON
- Handle subtype (success / error_max_structured_output_retries)
- Simplify extract_json() since the output is already JSON
Testing requires running outside a Claude Code session:
python tools/knowledge-creator/run.py --version 6 --test-mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
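The extraction fix boils down to parsing the --output-format json envelope, checking the subtype field, and pulling out structured_output. A sketch using the field names the commit describes (treat the envelope shape as an assumption, not a spec):

```python
import json

def extract_structured_output(raw: str) -> dict:
    """Parse a `claude -p ... --output-format json` response and return
    the structured_output payload, failing fast on error subtypes."""
    response = json.loads(raw)
    subtype = response.get("subtype")
    if subtype != "success":
        # e.g. error_max_structured_output_retries
        raise RuntimeError(f"claude returned subtype={subtype!r}")
    return response["structured_output"]

raw = '{"subtype": "success", "structured_output": {"id": "libraries-database"}}'
print(extract_structured_output(raw))  # {'id': 'libraries-database'}
```

Because the payload arrives as already-validated JSON, the previous regex-style extract_json() logic can be reduced to this single parse, which is the simplification the commit mentions.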
Remove knowledge-creator logs from the repository, as they are build artifacts.
Add .gitignore rules for:
- tools/knowledge-creator/logs/
- tools/knowledge-creator/steps/__pycache__/
- Generated knowledge files (except index.toon)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add internal label extraction for RST files to distinguish internal vs external references
- Update prompt with clear internal reference list and conversion rules
- Remove markdown code fence handling from classify_patterns output
- Relax S13 validation for very short valid content ("なし", i.e. "none")
- Fix S14 to validate only internal references (# prefix)
- Add CLAUDECODE environment cleanup for nested execution
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prepare clean state for knowledge file generation test. Generated index.toon will be committed after test run to show clear diff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add file-level split threshold (FILE_LINE_THRESHOLD=1000)
- Unify timeout to 600 seconds across all steps
- Add structural validation loop in Step 3 (max 20 iterations)
- Add content validation and improvement loop in Step 6 (max 20 iterations)
- Create validate_single.py for agent-based validation
- Update test-files.json to include all split file parts
- Improve S2 validation to collect all errors in one pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
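The Step 3 and Step 6 loops above share one shape: run the validator, stop when no errors remain, otherwise improve and retry up to a fixed cap. A minimal sketch of that shape (the real steps invoke claude and validate_single.py rather than local callables):

```python
from typing import Callable

MAX_ITERATIONS = 20  # iteration cap from the commit message

def validation_loop(validate: Callable[[], list[str]],
                    improve: Callable[[list[str]], None]) -> bool:
    """Run validate/improve cycles until validation is clean or the
    iteration cap is reached. Returns True on success."""
    for _ in range(MAX_ITERATIONS):
        errors = validate()
        if not errors:
            return True
        improve(errors)  # in the tool, this is another claude invocation
    return False

# Toy usage: a document whose two errors are fixed one pass at a time.
state = {"errors": ["S2: bad heading", "S13: empty content"]}
ok = validation_loop(lambda: list(state["errors"]),
                     lambda errs: state["errors"].pop())
print(ok)  # True
```

The cap matters for the timeout discussion later in this PR: with up to 20 iterations, a single file's wall-clock budget is the per-call timeout multiplied by the retry count.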
After generating split parts, merge them back into single JSON files:
- Combine sections, contents, and assets from all parts
- Deduplicate search_hints, internal_labels, and related_topics
- Consolidate asset directories
- Clean up individual part files after a successful merge
This ensures split large files are presented as single knowledge files for better searchability and user experience.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
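The combine-and-deduplicate step can be sketched as follows, using the field names from the commit message (the real knowledge-file schema has more fields; this shows only the merge mechanics, with order-preserving dedup):

```python
def merge_parts(parts: list[dict]) -> dict:
    """Merge split knowledge-file parts: concatenate sections in order,
    deduplicate hint lists while preserving first-seen order."""
    merged: dict = {"sections": [], "search_hints": [], "related_topics": []}
    seen_hints: set[str] = set()
    seen_topics: set[str] = set()
    for part in parts:
        merged["sections"].extend(part.get("sections", []))
        for hint in part.get("search_hints", []):
            if hint not in seen_hints:
                seen_hints.add(hint)
                merged["search_hints"].append(hint)
        for topic in part.get("related_topics", []):
            if topic not in seen_topics:
                seen_topics.add(topic)
                merged["related_topics"].append(topic)
    return merged

merged = merge_parts([
    {"sections": ["s1"], "search_hints": ["tag", "custom tag"]},
    {"sections": ["s2"], "search_hints": ["tag", "jsp"]},
])
print(merged["search_hints"])  # ['tag', 'custom tag', 'jsp']
```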
After merging split knowledge files back into single files, update classified_list.json to reflect the new structure:
- Remove split entries (part 1, 2, 3, etc.)
- Add a merged file entry with the original ID
- Preserve all metadata from the original entries
- Sort files consistently by type/category/id
This ensures Steps 4, 5, and 6 can correctly locate merged files.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated clean_cache() to remove both:
- tools/knowledge-creator/__pycache__
- tools/knowledge-creator/steps/__pycache__
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Move BUGFIX documentation to .pr/00103/notes.md per project conventions
- Remove redundant test_paths.py (superseded by test_path_logic.py)
- Fix undefined content_fail variable in step6_validate.py
- Remove unused requests import from step6_validate.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed three critical issues in split file merging:
1. Output structure: use the 5-field JSON Schema structure (id, title, official_doc_urls, index, sections) instead of adding metadata fields
2. Asset path updates: replace "assets/file-id-1/" with "assets/file-id/" in section contents to fix broken links after merge
3. Index hints merging: deduplicate and merge hints when the same section_id appears in multiple parts (defensive programming)
These changes ensure merged files are identical in structure to non-split files and pass all validation checks (Steps 4-6).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
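The asset-path rewrite in point 2 is a small substitution over section contents: per-part directories like assets/&lt;id&gt;-1/ must point back at the consolidated assets/&lt;id&gt;/ directory. A sketch (the commit describes a plain replace; a regex covers all part numbers at once):

```python
import re

def fix_asset_paths(content: str, file_id: str) -> str:
    """Rewrite per-part asset paths such as assets/<file_id>-1/ back to
    assets/<file_id>/ after merging split parts (illustrative sketch)."""
    return re.sub(rf"assets/{re.escape(file_id)}-\d+/",
                  f"assets/{file_id}/", content)

text = "![diagram](assets/libraries-tag-1/flow.png)"
print(fix_asset_paths(text, "libraries-tag"))
# ![diagram](assets/libraries-tag/flow.png)
```

Without this rewrite, merged content would keep referencing part directories that the merge step deletes, which is exactly the broken-link failure the commit describes.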
Step 3 (generate.md):
- Add completion condition at top with warning emoji
- Force validation script execution (max 20 retries)
- Add reminder section before output
Step 6 (validate.md):
- Add completion condition at top (same approach as Step 3)
- Force validate_single.py execution in a loop
Step 6 (step6_validate.py):
- Remove Python-side structure validation (delegate to claude)
- Add missing placeholders: OUTPUT_PATH, SOURCE_PATH, FORMAT
- Simplify summary: unified status instead of structure/content split
- Always run content validation (even if structure errors exist initially)
Design:
- Single validation logic in Step6Validate.validate_structure()
- Python invokes claude; claude runs the validation loop
- Eliminates the double-validation inconsistency issue
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Increase timeout for large files:
- Step 3 (generate): 600s -> 1200s
- Step 4 (classify patterns): 600s -> 1200s
- Step 6 (validate): 600s -> 1200s
Context:
- libraries-tag-1 (913 lines) timed out at 600s during generation
- libraries-database (1359 lines) timed out at 600s during validation
- With the validation loop (max 20 retries), 10 minutes is insufficient for large files
Trade-off: a longer timeout increases total processing time but prevents failures on large files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Translate three prompt files to English for better AI performance:
- classify_patterns.md: processing pattern classification prompt
- generate.md: knowledge file generation prompt
- validate.md: validation and improvement prompt
Rationale:
- English prompts improve instruction-following accuracy
- Better performance with complex logic and validation loops
- Addresses current issues: instruction adherence, structure errors
- Aligns with CLAUDE.md language guidelines (AI conversations = English)
Japanese content preserved:
- Placeholder content (source files, knowledge files)
- Output examples showing Japanese content
- Markdown formatting rules with Japanese examples
Expected improvements:
- Higher probability that Claude follows validation loop instructions
- More precise adherence to structural validation rules
- Better efficiency in processing large files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes #99
Approach
Implement the nabledge-creator tool to automate the conversion of Nablarch official documentation (RST/MD/Excel) into AI-searchable knowledge files (JSON). The tool follows a 6-step pipeline design:
Technical choices:
Quality assurance:
Tasks
Expert Review
Three AI expert reviews were conducted on the implementation:
Software Engineer Review - Rating: 4/5
Prompt Engineer Review - Rating: 4/5
Technical Writer Review - Rating: 4/5
Nine high-value improvements were implemented, including path validation, error handling documentation, threshold clarification, and documentation consistency fixes.
Success Criteria Check
- tools/knowledge-creator/ with complete 6-step pipeline
- doc/99-nabledge-creator-tool/knowledge-creator-design.md specifications
- --version 6/5/all parameter
- logs/v{version}/ with success/error tracking

🤖 Generated with Claude Code