【PaddlePaddle Hackathon 10】Add Doc2Prototype OpenVINO demo #548

Open
Dryoung95 wants to merge 8 commits into openvinotoolkit:master from Dryoung95:doc2prototype-mvp

Dryoung95 commented May 15, 2026

Description

This PR adds a Doc2Prototype MVP demo for the Intel/OpenVINO PaddlePaddle Hackathon 10 task.

The demo implements a reproducible document-understanding-to-downstream-processing workflow:

document / diagram image -> PaddleOCR-VL deployed with OpenVINO -> structured JSON -> downstream Agent/Coder workflow -> generated prototype artifact -> visual report

Primary scenarios:

  • API documentation image -> endpoint JSON -> generated FastAPI skeleton
  • Flowchart image -> node/edge JSON -> generated Mermaid diagram
  • Technical/reference document image -> structured sections -> Markdown summary
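
To make the first scenario concrete, here is a minimal sketch of how endpoint JSON could be rendered into a FastAPI skeleton with a deterministic template. The `method`, `path`, and `summary` keys and the `render_fastapi_skeleton` helper are assumptions for illustration; the actual schema and generator in this PR may differ.

```python
import re


def render_fastapi_skeleton(endpoints):
    """Render FastAPI source code (as a string) from a list of endpoint dicts.

    Assumed (hypothetical) endpoint shape:
    {"method": "GET", "path": "/users/{user_id}", "summary": "..."}
    """
    lines = ["from fastapi import FastAPI", "", "app = FastAPI()", ""]
    for ep in endpoints:
        method = ep["method"].lower()
        path = ep["path"]
        # Path parameters such as {user_id} become function arguments.
        params = re.findall(r"{(\w+)}", path)
        # Build a valid Python identifier from the method and the path.
        name = re.sub(r"\W+", "_", f"{method} {path}").strip("_")
        args = ", ".join(f"{p}: str" for p in params)
        lines += [
            f'@app.{method}("{path}")',
            f"def {name}({args}):",
            f"    # {ep.get('summary', 'TODO')}",
            '    return {"status": "not implemented"}',
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [{"method": "GET", "path": "/users/{user_id}", "summary": "Fetch a user"}]
    print(render_fastapi_skeleton(demo))
```

Because the template is pure string formatting, the same input always yields the same file, which is what makes this path suitable as a fast smoke test.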

Requirement Alignment

  • Adds a standalone demo under demos/doc2prototype_demo.
  • Uses OpenVINO IR deployment for PaddleOCR-VL inference.
  • Shows the complete handoff from document/visual understanding to downstream intelligent processing.
  • Recommends OpenVINO GenAI for the real local Coder model path.
  • Keeps deterministic generation as the fast reviewer smoke path.
  • Provides source code, README, pinned dependencies, model preparation commands, example inputs, tracked smoke outputs, screenshots, and run metadata.

Downstream Agent / Coder Flow

The structured output is passed through downstream_agent.py:

  • PlannerAgent builds the downstream generation plan from structured JSON.
  • GeneratorAgent creates the downstream artifact.
  • ReviewAgent checks whether the generated artifact covers extracted endpoints, nodes, edges, or document sections.
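
The three-stage handoff described above can be sketched roughly as follows. This is not the code in downstream_agent.py: the class bodies, the `Plan` and `Review` containers, and the JSON keys (`endpoints`, `nodes`, `sections`) are assumptions chosen to illustrate the plan, generate, review sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    task: str
    targets: list  # identifiers the generated artifact must cover


@dataclass
class Review:
    status: str  # "pass" or "needs_attention"
    missing: list = field(default_factory=list)


class PlannerAgent:
    def plan(self, structured):
        # Choose the task from whichever key the structured JSON carries.
        if "endpoints" in structured:
            return Plan("api_doc", [e["path"] for e in structured["endpoints"]])
        if "nodes" in structured:
            return Plan("flowchart", [n["id"] for n in structured["nodes"]])
        return Plan("document", [s["title"] for s in structured.get("sections", [])])


class GeneratorAgent:
    def generate(self, plan):
        # Stand-in generator: emits one line per target.
        return "\n".join(f"# {plan.task}: {t}" for t in plan.targets)


class ReviewAgent:
    def review(self, plan, artifact):
        # A target counts as covered if its identifier appears in the artifact.
        missing = [t for t in plan.targets if t not in artifact]
        if missing or not plan.targets:
            return Review("needs_attention", missing)
        return Review("pass")


if __name__ == "__main__":
    structured = {"endpoints": [{"path": "/users"}, {"path": "/items"}]}
    plan = PlannerAgent().plan(structured)
    artifact = GeneratorAgent().generate(plan)
    print(ReviewAgent().review(plan, artifact))
```

Keeping the review step independent of the generator means the same coverage check applies whether the artifact came from a template or from a Coder model.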

Coder backend policy:

  • Recommended real local Coder path: --code-model-backend openvino with OpenVINO/Qwen2.5-Coder-0.5B-Instruct-int4-ov.
  • Fast default reviewer path: deterministic template backend, no second model download required.
  • Fallback/comparison path: HuggingFace backend with --code-model-backend hf.
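
For readers unfamiliar with OpenVINO GenAI, the recommended path might look something like the sketch below. It assumes the `openvino_genai` package is installed and the model directory has already been downloaded; `build_coder_prompt` and `generate_with_openvino` are hypothetical helper names, not functions from this PR's code_generator.py.

```python
def build_coder_prompt(endpoints):
    """Turn extracted endpoints into a plain-text instruction for the Coder model."""
    listing = "\n".join(f"- {e['method']} {e['path']}" for e in endpoints)
    return (
        "Write a FastAPI application skeleton with one route per endpoint.\n"
        "Endpoints:\n"
        f"{listing}\n"
        "Return only Python code."
    )


def generate_with_openvino(prompt, model_dir, device="CPU", max_new_tokens=768):
    """Run the prompt through an OpenVINO GenAI LLM pipeline."""
    # Imported lazily so the deterministic path does not require the package.
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline(model_dir, device)
    return str(pipe.generate(prompt, max_new_tokens=max_new_tokens))


if __name__ == "__main__":
    print(build_coder_prompt([{"method": "GET", "path": "/users"}]))
```

The lazy import keeps the template backend usable on machines where the second model and its runtime are not installed.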

Verified OpenVINO Coder path:

python -c "from code_generator import download_openvino_code_model; print(download_openvino_code_model())"
python main.py examples/api_doc_sample.png --task api_doc --device CPU --code-model-path _models/OpenVINO/Qwen2.5-Coder-0.5B-Instruct-int4-ov --code-model-backend openvino --code-max-new-tokens 768 --output-dir outputs/mvp_api_image_ov_coder

Result:

| Scenario | Parser backend | Coder backend | Device | Structured output | Agent review | Total |
| --- | --- | --- | --- | --- | --- | --- |
| API image + OpenVINO Coder | PaddleOCR-VL OpenVINO | OpenVINO GenAI Coder | CPU | 5 endpoints | pass | 27.162 s |

Effect Display

API document image -> deterministic FastAPI skeleton:

[Screenshot: Doc2Prototype API report]

This image shows examples/api_doc_sample.png parsed by PaddleOCR-VL with OpenVINO. The structured JSON contains five endpoints. The downstream Agent checks that all five extracted endpoints are represented in the generated FastAPI skeleton. The timing chart separates model load, OpenVINO inference, structure extraction, and generation time. The layout overlay marks detected text regions, and the heatmap shows text-density concentration.
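
A per-stage timing breakdown like the one in the report can be collected with a small context manager. The sketch below is an assumption about how such timing could be gathered; `StageTimer` and the stage names are hypothetical and not taken from the demo source.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raises.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    @property
    def total(self):
        return sum(self.seconds.values())


if __name__ == "__main__":
    timer = StageTimer()
    for name in ("model_load", "inference", "structure_extraction", "generation"):
        with timer.stage(name):
            time.sleep(0.01)  # placeholder for the real work
    for name, secs in timer.seconds.items():
        print(f"{name}: {secs:.3f} s")
    print(f"total: {timer.total:.3f} s")
```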

API document image -> OpenVINO Coder model:

[Screenshot: Doc2Prototype API OpenVINO Coder report]

This image uses the same API input but switches downstream generation to OpenVINO/Qwen2.5-Coder-0.5B-Instruct-int4-ov through --code-model-backend openvino. A correct run should show OpenVINO: True, Backend: OpenVINO Coder model inside agent workflow, five extracted endpoints, and Agent review status pass.

Flowchart image -> Mermaid diagram:

[Screenshot: Doc2Prototype flowchart report]

This image shows examples/flowchart_sample.png parsed by PaddleOCR-VL with OpenVINO. The structured JSON contains six nodes and five directed edges. The downstream Agent generates a Mermaid diagram, and the review passes when all extracted nodes are represented in the generated diagram.
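
As an illustration of the flowchart path, the following sketch renders node/edge JSON into Mermaid and checks node coverage. The `id`, `label`, `source`, and `target` keys and both helper functions are assumptions; the demo's actual schema and review logic may differ.

```python
def render_mermaid(nodes, edges):
    """Render a top-down Mermaid flowchart from node and edge dicts."""
    lines = ["flowchart TD"]
    for node in nodes:
        lines.append(f'    {node["id"]}["{node["label"]}"]')
    for edge in edges:
        lines.append(f'    {edge["source"]} --> {edge["target"]}')
    return "\n".join(lines)


def missing_nodes(nodes, diagram):
    """Return ids of nodes that have no declaration in the diagram text."""
    return [n["id"] for n in nodes if f'{n["id"]}[' not in diagram]


if __name__ == "__main__":
    nodes = [{"id": "A", "label": "Start"}, {"id": "B", "label": "End"}]
    edges = [{"source": "A", "target": "B"}]
    diagram = render_mermaid(nodes, edges)
    print(diagram)
    print("missing:", missing_nodes(nodes, diagram))
```

An empty list from `missing_nodes` corresponds to the case where the review would pass.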

Reproduction

From demos/doc2prototype_demo:

python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
python prepare_model.py --device CPU
python scripts/make_sample_images.py
python main.py examples/api_doc_sample.png --task api_doc --device CPU --output-dir outputs/mvp_api_image_smoke
python main.py examples/flowchart_sample.png --task flowchart --device CPU --output-dir outputs/mvp_flow_image_smoke

Expected smoke results:

| Scenario | OpenVINO | Device | Structured output | Downstream artifact | Agent review |
| --- | --- | --- | --- | --- | --- |
| API document image | yes | CPU | 5 endpoints | generated_api.py | pass |
| Flowchart image | yes | CPU | 6 nodes / 5 edges | generated_flowchart.mmd | pass |

Intel Hardware Scope

Validated on a local Core Ultra / RTX 5070 Ti laptop environment, but the benchmark and recommendation focus remains Intel/OpenVINO:

  • CPU: primary reproducible path for this PR.
  • GPU.0: Intel iGPU path, available for optional Intel GPU validation.
  • NPU and AUTO: visible but currently limited by the stateful/dynamic-shape LLM path in the PaddleOCR-VL export, so they are documented as limitations rather than successful benchmarks.
  • GPU.1: NVIDIA dGPU on this machine; not used as a project highlight or primary benchmark for the Intel/OpenVINO task.
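
To check which devices a given machine exposes, something like the sketch below could be used. `Core().available_devices` and the `FULL_DEVICE_NAME` property are part of the OpenVINO Python API; the `pick_intel_devices` filter is a hypothetical helper and assumes the full device name contains the vendor string.

```python
def pick_intel_devices(available, full_names):
    """Keep CPU plus any GPU.x device whose full name mentions Intel."""
    picked = []
    for dev in available:
        if dev == "CPU":
            picked.append(dev)
        elif dev.startswith("GPU") and "intel" in full_names.get(dev, "").lower():
            picked.append(dev)
    return picked


def query_openvino_devices():
    """Return (device ids, {device id: full name}) as reported by OpenVINO."""
    import openvino as ov  # imported lazily; requires the openvino package

    core = ov.Core()
    devices = list(core.available_devices)
    names = {d: str(core.get_property(d, "FULL_DEVICE_NAME")) for d in devices}
    return devices, names


if __name__ == "__main__":
    devices, names = query_openvino_devices()
    for dev in devices:
        print(dev, "->", names[dev])
    print("Intel targets:", pick_intel_devices(devices, names))
```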

If final validation needs to match the provided GMK Intel Core Ultra mini PC more closely, I can migrate the same branch and commands back to that device and report CPU / Intel iGPU / NPU behavior there.

Artifacts

Tracked smoke reports:

  • demos/doc2prototype_demo/outputs/mvp_api_image_smoke/visual_report.html
  • demos/doc2prototype_demo/outputs/mvp_api_image_smoke/agent_review.md
  • demos/doc2prototype_demo/outputs/mvp_flow_image_smoke/visual_report.html
  • demos/doc2prototype_demo/outputs/mvp_flow_image_smoke/agent_review.md

Screenshots:

  • demos/doc2prototype_demo/assets/doc2prototype_api_report.png
  • demos/doc2prototype_demo/assets/doc2prototype_api_openvino_coder_report.png
  • demos/doc2prototype_demo/assets/doc2prototype_flowchart_report.png

@Dryoung95
Author

This PR is submitted for the PaddlePaddle Hackathon 10 Intel advanced task.

It adds the Doc2Prototype MVP demo:

  • PaddleOCR-VL exported to OpenVINO IR
  • API document and flowchart parsing
  • structured JSON output
  • FastAPI / Mermaid prototype generation
  • static visual reports with timing, layout overlay, and text heatmap

Validated locally with the commands listed in the PR description.

Dryoung95 changed the title from "Add Doc2Prototype OpenVINO demo" to "【PaddlePaddle Hackathon 10】Add Doc2Prototype OpenVINO demo" on May 22, 2026
@uvv-01
Contributor

uvv-01 commented May 29, 2026

@Dryoung95 This looks really interesting. I had one question though: what happens if the input image is blurry, low quality, or OCR is not able to extract much text from it? Does the demo handle such cases gracefully, or show any specific error message? It might be helpful to mention these scenarios in the documentation as well since new users may run into them while testing.

@Dryoung95
Author

Dryoung95 commented May 29, 2026

Thanks, good point. The demo should not crash in that case. If OCR extracts very little text, it still writes the normal outputs, but structured.json may have empty fields and agent_review.md will show needs_attention instead of pass. I added a short README note for blurry / low-quality inputs in 536a439.

@uvv-01
Contributor

uvv-01 commented May 29, 2026

@Dryoung95 Thanks for the clarification and for adding the README note.

I was also wondering, would it make sense to show a warning in the console output or visual report when very little text is extracted? I feel that could help users quickly understand that the input quality might be affecting the results, especially when testing the demo for the first time.

@Dryoung95
Author

Yes, agreed. I added this in 3fff826: weak OCR runs now print [mvp] warning: lines in the CLI, and visual_report.html shows a Warnings section. I also verified a low-text probe still writes the normal artifacts and marks the review as needs_attention.
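
For readers who want a feel for this behavior, here is a rough sketch of a low-text check. It is not the code from 3fff826: the `MIN_TEXT_CHARS` threshold, the function names, and the warning wording are assumptions; only the `[mvp] warning:` prefix and the `pass` / `needs_attention` statuses come from the discussion above.

```python
MIN_TEXT_CHARS = 40  # hypothetical threshold, not taken from the demo


def assess_ocr_output(text_lines, structured_items):
    """Return (review status, warnings) for one OCR run."""
    warnings = []
    total_chars = sum(len(line.strip()) for line in text_lines)
    if total_chars < MIN_TEXT_CHARS:
        warnings.append(
            f"only {total_chars} characters of text extracted; "
            "the input may be blurry or low quality"
        )
    if not structured_items:
        warnings.append("no structured items extracted")
    status = "needs_attention" if warnings else "pass"
    return status, warnings


def print_warnings(warnings):
    """Print each warning with the CLI prefix mentioned in the thread."""
    for message in warnings:
        print(f"[mvp] warning: {message}")


if __name__ == "__main__":
    status, warnings = assess_ocr_output(["ab"], [])
    print_warnings(warnings)
    print("review status:", status)
```

The run still returns normally in the weak case, so the usual artifacts can be written alongside the warnings.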
