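diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.04.27~2026.05.08.md b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.04.27~2026.05.08.md
new file mode 100644
index 00000000..7e8bd25d
--- /dev/null
+++ b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.04.27~2026.05.08.md
@@ -0,0 +1,82 @@
+### Claimant GitHub ID
+megemini
+
+### Task information
+
+- **Advanced task number**: #15
+- **Task title**: Innovative applications based on Iluvatar CoreX (天数智芯) hardware and ERNIE multimodal models
+- **Associated vendor**: Iluvatar CoreX
+
+### This week's work
+
+1. **RFC document**
+
+   - The RFC document is complete
+   - AI Studio URL: https://aistudio.baidu.com/project/edit/10221576
+
+2. **Code implementation**
+
+   - The notebook for the AI Studio project is complete
+   - A dual-card Iluvatar environment has been created
+
+3. **README**
+
+   - See the notebook in the AI Studio project
+
+4. **Demo video/screenshots**
+
+   - To be done
+
+5. **Problems and solutions**
+
+   - Problem: ERNIE-4.5-0.3B-Paddle cannot be called correctly from the AI Studio notebook
+
+     There is a strange problem: the ERNIE-4.5-0.3B-Paddle model cannot be called `correctly` from the AI Studio notebook. The model runs without errors, but its output `does not answer the question`.
+
+     See the screenshot below, where I manually placed the PaddleOCR-VL-1.5 recognition result into the prompt:
+
+     ![images/cli_prompt.png](images/cli_prompt.png)
+
+     When the model is called from the command line, the output is normal:
+
+     ![images/cli_ok.png](images/cli_ok.png)
+
+     In the notebook, however, the output is a long run of whitespace (spaces and newlines)!
+
+     I manually changed the prompt in the notebook to `你是谁` ("Who are you") to test the model's output:
+
+     ![images/notebook_prompt.png](images/notebook_input.png)
+
+     The output is a strange piece of text:
+
+     ![images/notebook_output.png](images/notebook_output.png)
+
+     Sometimes it even outputs a cloze (fill-in-the-blank) exercise.
+
+     I tried calling the model through a function call inside the notebook and also through a subprocess; neither works!
+
+     The notebook file `medical_pipeline_20260503.ipynb` is attached and can be run directly.
+
+     I also found another problem: in AI Studio, GPU memory is sometimes not released. As the screenshot shows, 45% of GPU memory is occupied even though nothing is running. I am not sure whether this is an AI Studio problem or a problem with FastDeploy on Iluvatar hardware. Please help take a look.
+
+   - Problem: the dual-card Iluvatar framework development environment only offers a command-line mode; notebooks cannot be used and the project cannot be made public
+
+     The current workaround is to get the notebook working in the single-card environment first, and then verify in the dual-card environment that the pipeline runs end to end.
+
+### Plan for next week
+
+1. Debug the notebook
+2. 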
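Debug the dual-card environment
+
+### Current blockers (write "None" if there are none)
+
+- Resolve the problem that the ERNIE-4.5-0.3B-Paddle model cannot be called correctly from the notebook
+
+### Deliverable progress
+
+| Deliverable | Status | Notes |
+|--------|:----:|------|
+| RFC document | ✅ Done | - |
+| Code implementation | 🔄 | |
+| README | 🔄 | - |
+| Demo video/screenshots | 🔄 | - |
\ No newline at end of file
diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.08~2026.05.28.md b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.08~2026.05.28.md
new file mode 100644
index 00000000..52612ccf
--- /dev/null
+++ b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.08~2026.05.28.md
@@ -0,0 +1,206 @@
+### Claimant GitHub ID
+megemini
+
+### Task information
+
+- **Advanced task number**: #15
+- **Task title**: Innovative applications based on Iluvatar CoreX (天数智芯) hardware and ERNIE multimodal models
+- **Associated vendor**: Iluvatar CoreX
+
+### This week's work
+
+1. **RFC document**
+
+   - The RFC document is complete
+   - AI Studio URL: https://aistudio.baidu.com/project/edit/10221576
+
+2. **Code implementation**
+
+   - The notebook for the AI Studio project is complete
+   - A dual-card Iluvatar environment has been created
+   - The CLI script `drug_ocr_cli.py` is complete
+   - The AI Studio notebook project has been published: https://aistudio.baidu.com/projectdetail/10413884
+     > Note: because of the AI Studio environment problem described below, the ERNIE-4.5-0.3B-Paddle output in this notebook is garbled, so the notebook is for reference only; it can be run and debugged in the latest local Iluvatar environment.
+
+3. **README**
+
+   - See the notebook in the AI Studio project
+
+4. **Demo video/screenshots**
+
+   - To be done
+
+5. 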
**Problems and solutions**
+
+   - Problem: ERNIE-4.5-0.3B-Paddle cannot be called correctly from the AI Studio notebook
+
+     Solution: it was confirmed that the AI Studio notebook environment itself is faulty; the CLI will be used from now on
+
+     ![notebook](images/notebook.png)
+
+   - Problem: the latest FastDeploy version cannot be compiled in the dual-card Iluvatar framework development environment https://github.com/PaddlePaddle/FastDeploy/issues/7948
+
+     ```shell
+     /home/aistudio/FastDeploy/custom_ops/build/fastdeploy_ops/temp.linux-x86_64-cpython-310/build/fastdeploy_ops/temp.linux-x86_64-cpython-310/iluvatar_ops/runtime/iluvatar_context.o is compiled
+     /home/aistudio/FastDeploy/custom_ops/iluvatar_ops/paged_attn.cu:199:37: error: no matching constructor for initialization of 'PageAttentionWithKVCacheArguments'
+     199 | PageAttentionWithKVCacheArguments args{
+     | ^ ~
+     200 | static_cast<float>(scale),
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~
+     201 | 1.0,
+     | ~~~~
+     202 | 1.0,
+     | ~~~~
+     203 | static_cast<float>(softcap),
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     204 | window_left,
+     | ~~~~~~~~~~~~
+     205 | window_right,
+     | ~~~~~~~~~~~~~
+     206 | causal,
+     | ~~~~~~~
+     207 | use_sqrt_alibi,
+     | ~~~~~~~~~~~~~~~
+     208 | enable_cuda_graph,
+     | ~~~~~~~~~~~~~~~~~~
+     209 | false,
+     | ~~~~~~
+     210 | alibi_slopes_ptr,
+     | ~~~~~~~~~~~~~~~~~
+     211 | key_ptr,
+     | ~~~~~~~~
+     212 | value_ptr,
+     | ~~~~~~~~~~
+     213 | workspace_ptr,
+     | ~~~~~~~~~~~~~~
+     214 | merged_qkv,
+     | ~~~~~~~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3699:3: note: candidate constructor not viable: requires at most 27 arguments, but 28 were provided
+     3699 | PageAttentionWithKVCacheArguments(
+     | ^
+     3700 | float scale = 1.f, float k_scale = 1.f, float v_scale = 1.f,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3701 | float softcap = 0.f, int window_size_left = -1,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3702 | int window_size_right = -1, bool is_causal = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3703 | bool alibi_sqrt = false, bool enable_cuda_graph = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3704 | bool is_bbhh = false, const float 
*alibi_slopes_ptr = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3705 | const void *key = nullptr, const void *value = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3706 | void *workspace = nullptr, bool merge_qkv = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3707 | const float *rope_sin = nullptr, const float *rope_cos = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3708 | const float *qScalePtr = nullptr, const float *kScalePtr = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3709 | const float *vScalePtr = nullptr, const float *kScaleVec = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3710 | int qLength = 1, int keyStride = 0, int valueStride = 0,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3711 | const void *aux = nullptr, const size_t rope_batch_stride = 0,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3712 | const cuinferAttentionRopeMode_t rope_type = CUINFER_ATTEN_NORMAL)
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3666:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 28 were provided
+     3666 | struct PageAttentionWithKVCacheArguments {
+     | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3666:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 28 were provided
+     3666 | struct PageAttentionWithKVCacheArguments {
+     | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     /home/aistudio/FastDeploy/custom_ops/iluvatar_ops/mixed_fused_attn.cu:269:37: error: no matching constructor for initialization of 'PageAttentionWithKVCacheArguments'
+     269 | PageAttentionWithKVCacheArguments args{
+     | ^ ~
+     270 | static_cast<float>(scale),
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~
+     271 | 1.0,
+     
| ~~~~
+     272 | 1.0,
+     | ~~~~
+     273 | static_cast<float>(softcap),
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     274 | window_left,
+     | ~~~~~~~~~~~~
+     275 | window_right,
+     | ~~~~~~~~~~~~~
+     276 | causal,
+     | ~~~~~~~
+     277 | use_sqrt_alibi,
+     | ~~~~~~~~~~~~~~~
+     278 | enable_cuda_graph,
+     | ~~~~~~~~~~~~~~~~~~
+     279 | false,
+     | ~~~~~~
+     280 | nullptr,
+     | ~~~~~~~~
+     281 | decode_qkv_ptr,
+     | ~~~~~~~~~~~~~~~
+     282 | decode_qkv_ptr,
+     | ~~~~~~~~~~~~~~~
+     283 | decode_workspace_ptr,
+     | ~~~~~~~~~~~~~~~~~~~~~
+     284 | true,
+     | ~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3699:3: note: candidate constructor not viable: requires at most 27 arguments, but 28 were provided
+     3699 | PageAttentionWithKVCacheArguments(
+     | ^
+     3700 | float scale = 1.f, float k_scale = 1.f, float v_scale = 1.f,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3701 | float softcap = 0.f, int window_size_left = -1,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3702 | int window_size_right = -1, bool is_causal = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3703 | bool alibi_sqrt = false, bool enable_cuda_graph = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3704 | bool is_bbhh = false, const float *alibi_slopes_ptr = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3705 | const void *key = nullptr, const void *value = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3706 | void *workspace = nullptr, bool merge_qkv = false,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3707 | const float *rope_sin = nullptr, const float *rope_cos = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3708 | const float *qScalePtr = nullptr, const float *kScalePtr = nullptr,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3709 | const float *vScalePtr = nullptr, const float *kScaleVec = nullptr,
+     | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3710 | int qLength = 1, int keyStride = 0, int valueStride = 0,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3711 | const void *aux = nullptr, const size_t rope_batch_stride = 0,
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     3712 | const cuinferAttentionRopeMode_t rope_type = CUINFER_ATTEN_NORMAL)
+     | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3666:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 28 were provided
+     3666 | struct PageAttentionWithKVCacheArguments {
+     | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     /usr/local/corex-4.3.8/include/ixinfer.h:3666:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 28 were provided
+     3666 | struct PageAttentionWithKVCacheArguments {
+     | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     1 error generated when compiling for ivcore11.
+     /home/aistudio/FastDeploy/custom_ops/iluvatar_ops/paged_attn.cu compile failed, command '/usr/local/corex/bin/clang++' failed with exit code 1
+     /home/aistudio/FastDeploy/custom_ops/build/fastdeploy_ops/temp.linux-x86_64-cpython-310/build/fastdeploy_ops/temp.linux-x86_64-cpython-310/iluvatar_ops/paged_attn.cu.o is compiled
+     1 error generated when compiling for ivcore11.
+     /home/aistudio/FastDeploy/custom_ops/iluvatar_ops/mixed_fused_attn.cu compile failed, command '/usr/local/corex/bin/clang++' failed with exit code 1
+
+     ```
+
+     Solution: rebuild using commit 172ab6020dbe1ccb730f09df74764d6ea388d88f
+
+### Plan for next week
+
+1. 
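Debug the dual-card environment
+
+### Current blockers (write "None" if there are none)
+
+- Rebuild FastDeploy
+
+### Deliverable progress
+
+| Deliverable | Status | Notes |
+|--------|:----:|------|
+| RFC document | ✅ Done | - |
+| Code implementation | 🔄 | |
+| README | 🔄 | - |
+| Demo video/screenshots | 🔄 | - |
\ No newline at end of file
diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.28~2026.06.11.md b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.28~2026.06.11.md
new file mode 100644
index 00000000..f3d3d44a
--- /dev/null
+++ b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/[WeeklyReport]2026.05.28~2026.06.11.md
@@ -0,0 +1,72 @@
+### Claimant GitHub ID
+megemini
+
+### Task information
+
+- **Advanced task number**: #15
+- **Task title**: Innovative applications based on Iluvatar CoreX (天数智芯) hardware and ERNIE multimodal models
+- **Associated vendor**: Iluvatar CoreX
+
+### This week's work
+
+1. **RFC document**
+
+   - The RFC document is complete
+   - AI Studio URL: https://aistudio.baidu.com/project/edit/10221576
+
+2. **Code implementation**
+
+   - The notebook for the AI Studio project is complete
+   - A dual-card Iluvatar environment has been created
+   - The CLI script `drug_ocr_cli.py` is complete
+     The latest version of the script has two major improvements:
+     1. Added a function that patches the aistudio SDK. paddlespeech and fastdeploy depend on different versions of the aistudio SDK, so the script first modifies the source code to unify them
+     2. Added a function that adjusts the volume of TTS-synthesized audio files. The audio codec library in the Iluvatar framework development environment appears to have a compatibility problem with paddlespeech that causes the synthesized audio to be cut off (clipped), so after discussion with the paddlespeech developers we decided to add this post-processing function.
+   - The AI Studio notebook project has been published: https://aistudio.baidu.com/projectdetail/10413884
+     > Note: because of the AI Studio environment problem described below, the ERNIE-4.5-0.3B-Paddle output in this notebook is garbled, so the notebook is for reference only; it can be run and debugged in the latest local Iluvatar environment.
+   - Verified in the dual-card Iluvatar environment that, with `tensor_parallel_size` set to `2`, the `ERNIE-4.5-VL-28B-A3B-Thinking` model can be loaded and used; each card uses about `20G` of GPU memory, `40G` in total.
+
+3. **README**
+
+   - See the notebook in the AI Studio project
+
+4. **Demo video/screenshots**
+
+   ![ernie28b](images/ernie_28b.png)
+
+5. 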
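**Problems and solutions**
+
+   - Problem: the Iluvatar framework development environment is extremely unstable, which makes it impossible to keep validating the optimized script.
+
+     The Iluvatar framework development environment appears to run in a shared mode; the so-called start and stop actions only control whether the user can connect to the server over ssh. As a result, the following happen frequently:
+
+     - The connection is dropped suddenly and I am kicked out of the environment
+     - A freshly connected environment already has its GPU memory fully occupied
+     - A run reports that the disk is out of space, although the aistudio working directory holds only 76G of files (including model files)
+     - A run reports that the model cannot be found, although the model is fine; running again may just work
+     - Model loading is slow; loading `ERNIE-4.5-VL-28B-A3B-Thinking` sometimes takes nearly 10 minutes
+     - After the model is loaded, token output stalls halfway; on the next run it may stall somewhere else
+     - The ixsmi command sometimes does not reflect the actual GPU memory usage; for example, it still shows only 64MB in use after the model has been fully loaded
+
+     These are only some of the environment problems encountered in the past two weeks. As a result, since debugging started last week the script has run to completion only `2` times; the rest of the time it was repeatedly interrupted by one issue or another.
+
+     Current status: the script itself should be fine, but it still needs at least one more complete run to capture a full log.
+
+     `drug_ernie03.log` in the weekly report directory is the log of a complete run with `ERNIE-4.5-0.3B-Paddle`. `ERNIE-4.5-VL-28B-A3B-Thinking` also completed one full run, but at that time I had not noticed that this model emits its thinking part first, so too few tokens were left for the final output. The latest `drug_ocr_cli.py` script has been modified accordingly, but for more than a full week there has not been another complete run.
+
+### Plan for next week
+
+1. Debug the dual-card environment
+
+### Current blockers (write "None" if there are none)
+
+- The AI Studio environment is extremely unstable
+
+### Deliverable progress
+
+| Deliverable | Status | Notes |
+|--------|:----:|------|
+| RFC document | ✅ Done | - |
+| Code implementation | ✅ Done | |
+| README | ✅ Done | - |
+| Demo video/screenshots | ✅ Done | - |
\ No newline at end of file
diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/drug_ernie03.log b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/drug_ernie03.log
new file mode 100644
index 00000000..e80ec62b
--- /dev/null
+++ b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/drug_ernie03.log
@@ -0,0 +1,822 @@
+aistudio@ssh-942478-10243234-79b6d74556-jggs6:~$ python drug_ocr_cli.py --image resource/test.jpg --no-split --ocr-tokens 200 --llm-tokens 100
+File already patched.
+20:24:01 [drug_ocr] INFO: ============================================================
+20:24:01 [drug_ocr] INFO: Drug OCR pipeline started (subprocess mode)
+20:24:01 [drug_ocr] INFO: Image path: resource/test.jpg
+20:24:01 [drug_ocr] INFO: Image split: False (num_splits=4, overlap=0.10)
+20:24:01 [drug_ocr] INFO: ============================================================
+20:24:01 [drug_ocr] INFO: [OCR Step] Loading image... 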
+20:24:01 [drug_ocr] INFO: [OCR Step] Image loaded, size: (2014, 2881) +20:24:01 [drug_ocr] INFO: [OCR Step] Skipping image split +20:24:03 [drug_ocr] INFO: [OCR Step] Starting OCR subprocess... +I0601 20:24:06.824122 7196 init.cc:254] ENV [CUSTOM_DEVICE_ROOT]=/home/aistudio/.local/lib/python3.10/site-packages/paddle/paddle_custom_device +I0601 20:24:06.824187 7196 init.cc:162] Try loading custom device libs from: [/home/aistudio/.local/lib/python3.10/site-packages/paddle/paddle_custom_device] +I0601 20:24:06.954595 7196 custom_device_load.cc:51] Succeed in loading custom runtime in lib: /home/aistudio/.local/lib/python3.10/site-packages/paddle/paddle_custom_device/libpaddle-iluvatar-gpu.so +I0601 20:24:06.954654 7196 custom_device_load.cc:58] Skipped lib [/home/aistudio/.local/lib/python3.10/site-packages/paddle/paddle_custom_device/libpaddle-iluvatar-gpu.so]: no custom engine Plugin symbol in this lib. +I0601 20:24:06.964550 7196 custom_kernel.cc:68] Succeed in loading 913 custom kernel(s) from loaded lib(s), will be used like native ones. +I0601 20:24:06.964919 7196 init.cc:174] Finished in LoadCustomDevice with libs_path: [/home/aistudio/.local/lib/python3.10/site-packages/paddle/paddle_custom_device] +I0601 20:24:06.964968 7196 init.cc:260] CustomDevice: iluvatar_gpu, visible devices count: 2 +WARNING 2026-06-01 20:24:19,431 7196 transfer_manager.py[line:30] cupy not available, falling back to synchronous transfers +[OCR Worker] Loading OCR model (PaddleOCR-VL)... +WARNING 2026-06-01 20:24:23,436 7196 common.py[line:63] Model path 'baidu/PaddleOCR-VL-1.5' is not a local directory or file, will try to download from huggingface hub. +WARNING 2026-06-01 20:24:26,471 7196 common.py[line:73] Cannot reach huggingface.co. If the model is stored locally, please check the path 'baidu/PaddleOCR-VL-1.5'. Otherwise check network/proxy settings (DOWNLOAD_SOURCE=huggingface). 
+INFO 2026-06-01 20:24:26,764 7196 log.py[line:76] Downloading Model from remote to directory: /home/aistudio/PaddlePaddle/PaddleOCR-VL-1.5 +INFO 2026-06-01 20:24:26,981 7196 log.py[line:76] Got 18 files, start to download ... +Processing 18 items: 0%| | 0.00/18.0 [00:00, is_text_generation=False, is_multimodal=True, is_reasoning=False, is_pooling=False, module_path='paddleocr_vl.paddleocr_vl', default_pooling_type='LAST') +INFO:legacy.config:_architecture : PaddleOCRVLForConditionalGeneration +INFO:legacy.config:mla_use_absorb : False +INFO:legacy.config:max_stop_seqs_num : 5 +INFO:legacy.config:stop_seqs_max_len : 8 +INFO:legacy.config:model_config : {'architectures': ['PaddleOCRVLForConditionalGeneration'], 'attention_probs_dropout_prob': 0.0, 'auto_map': {'AutoConfig': 'configuration_paddleocr_vl.PaddleOCRVLConfig', 'AutoModel': 'modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration', 'AutoModelForCausalLM': 'modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration'}, 'compression_ratio': 1.0, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_dropout_prob': 0.0, 'hidden_size': 1024, 'ignored_index': -100, 'image_token_id': 100295, 'intermediate_size': 3072, 'max_position_embeddings': 131072, 'max_sequence_length': None, 'model_type': 'paddleocr_vl', 'num_attention_heads': 16, 'num_hidden_layers': 18, 'num_key_value_heads': 2, 'pad_token_id': 0, 'rms_norm_eps': 1e-05, 'rope_scaling': {'mrope_section': [16, 24, 24], 'rope_type': 'default', 'type': 'default'}, 'rope_theta': 500000, 'sliding_window': None, 'tie_word_embeddings': False, 'torch_dtype': 'bfloat16', 'transformers_version': '4.55.0', 'use_bias': False, 'use_cache': False, 'use_flash_attention': False, 'video_token_id': 101307, 'vision_config': {'architectures': ['PaddleOCRVisionModel'], 'attention_dropout': 0.0, 'auto_map': {'AutoConfig': 'configuration_paddleocr_vl.PaddleOCRVLConfig', 'AutoModel': 'modeling_paddleocr_vl.PaddleOCRVisionModel'}, 'hidden_act': 'gelu_pytorch_tanh', 'hidden_size': 1152, 
'image_size': 384, 'intermediate_size': 4304, 'layer_norm_eps': 1e-06, 'model_type': 'paddleocr_vl', 'num_attention_heads': 16, 'num_channels': 3, 'num_hidden_layers': 27, 'pad_token_id': 0, 'patch_size': 14, 'spatial_merge_size': 2, 'temporal_patch_size': 2, 'tokens_per_second': 2, 'torch_dtype': 'bfloat16'}, 'vision_start_token_id': 101305, 'vision_end_token_id': 101306, 'vocab_size': 103424, 'weight_share_add_bias': True, 'use_3d_rope': True, 'rope_is_neox_style': True} +INFO:legacy.config:moe_phase : +INFO:legacy.config:============================================================= +INFO:legacy.config:Cache Configuration Information : +INFO:legacy.config:block_size : 16 +INFO:legacy.config:gpu_memory_utilization: 0.9 +INFO:legacy.config:num_gpu_blocks_override: None +INFO:legacy.config:kv_cache_ratio : 0.75 +INFO:legacy.config:enc_dec_block_num : 2 +INFO:legacy.config:prealloc_dec_block_slot_num_threshold: 12 +INFO:legacy.config:cache_dtype : bfloat16 +INFO:legacy.config:model_cfg : +INFO:legacy.config:enable_chunked_prefill: False +INFO:legacy.config:rdma_comm_ports : [25285] +INFO:legacy.config:local_rdma_comm_ports: [25285] +INFO:legacy.config:cache_transfer_protocol: ipc,rdma +INFO:legacy.config:pd_comm_port : [60151] +INFO:legacy.config:local_pd_comm_port : 60151 +INFO:legacy.config:enable_prefix_caching: False +INFO:legacy.config:enable_ssd_cache : False +INFO:legacy.config:cache_queue_port : [59850] +INFO:legacy.config:local_cache_queue_port: 59850 +INFO:legacy.config:swap_space : None +INFO:legacy.config:max_encoder_cache : 0 +INFO:legacy.config:max_processor_cache : -1 +INFO:legacy.config:enable_output_caching: False +INFO:legacy.config:disable_chunked_mm_input: False +INFO:legacy.config:kvcache_storage_backend: None +INFO:legacy.config:write_policy : write_through +INFO:legacy.config:write_through_threshold: 2 +INFO:legacy.config:num_cpu_blocks : 0 +INFO:legacy.config:use_mla_cache : False +INFO:legacy.config:head_num : 2 +INFO:legacy.config:head_dim : 
128 +INFO:legacy.config:byte_size : 2 +INFO:legacy.config:kv_factor : 2 +INFO:legacy.config:bytes_per_token_per_layer: 1024 +INFO:legacy.config:bytes_per_block : 294912 +INFO:legacy.config:max_block_num_per_seq: 512 +INFO:legacy.config:dec_token_num : 32 +INFO:legacy.config:total_block_num : 528 +INFO:legacy.config:prefill_kvcache_block_num: 528 +INFO:legacy.config:============================================================= +INFO:legacy.config:LocalScheduler Configuration Information : +INFO:legacy.config:max_size : -1 +INFO:legacy.config:ttl : 900 +INFO:legacy.config:max_model_len : 8192 +INFO:legacy.config:enable_chunked_prefill: False +INFO:legacy.config:max_num_partial_prefills: 1 +INFO:legacy.config:max_long_partial_prefills: 1 +INFO:legacy.config:long_prefill_token_threshold: 327 +INFO:legacy.config:============================================================= +INFO:legacy.config:Parallel Configuration Information : +INFO:legacy.config:sequence_parallel : False +INFO:legacy.config:use_ep : False +INFO:legacy.config:msg_queue_id : 1 +INFO:legacy.config:tensor_parallel_rank: 0 +INFO:legacy.config:tensor_parallel_size: 1 +INFO:legacy.config:expert_parallel_rank: 0 +INFO:legacy.config:expert_parallel_size: 1 +INFO:legacy.config:data_parallel_rank : 0 +INFO:legacy.config:data_parallel_size : 1 +INFO:legacy.config:enable_expert_parallel: False +INFO:legacy.config:enable_chunked_moe : False +INFO:legacy.config:chunked_moe_size : 256 +INFO:legacy.config:local_data_parallel_id: 0 +INFO:legacy.config:engine_worker_queue_port: [46509] +INFO:legacy.config:local_engine_worker_queue_port: 46509 +INFO:legacy.config:device_ids : 0 +INFO:legacy.config:first_token_id : 1 +INFO:legacy.config:engine_pid : None +INFO:legacy.config:do_profile : False +INFO:legacy.config:use_internode_ll_two_stage: False +INFO:legacy.config:disable_sequence_parallel_moe: False +INFO:legacy.config:shutdown_comm_group_if_worker_idle: True +INFO:legacy.config:ep_prefill_use_worst_num_tokens: False 
+INFO:legacy.config:pod_ip : None +INFO:legacy.config:disable_custom_all_reduce: False +INFO:legacy.config:enable_flashinfer_allreduce_fusion: False +INFO:legacy.config:pd_disaggregation_mode: None +INFO:legacy.config:prefill_one_step_stop: False +INFO:legacy.config:use_sequence_parallel_moe: False +INFO:legacy.config:============================================================= +INFO:legacy.config:speculative_config : {"method_list": ["ngram", "mtp", "naive", "suffix"], "mtp_strategy_list": ["default", "with_ngram"], "mtp_strategy": "default", "num_speculative_tokens": 1, "num_model_steps": 1, "max_candidate_len": 5, "verify_window": 2, "max_ngram_size": 5, "min_ngram_size": 2, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": -1, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1, "model": "/home/aistudio/PaddlePaddle/PaddleOCR-VL-1.5", "quantization": "wint8", "num_gpu_block_expand_ratio": 1.0, "model_type": "main", "benchmark_mode": false, "enf_gen_phase_tag": false, "enable_draft_logprob": false, "verify_strategy": "target_match", "accept_policy": "normal", "model_config": {}, "num_extra_cache_layer": 0} +INFO:legacy.config:eplb_config : +INFO:legacy.config:device_config : None +INFO:legacy.config:load_config : {"load_choices": "default_v1", "is_pre_sharded": false, "dynamic_load_weight": false, "load_strategy": "normal", "rsync_config": null, "model_loader_extra_config": null} +INFO:legacy.config:quant_config : None +INFO:legacy.config:graph_opt_config : {"graph_opt_level": 0, "sot_warmup_sizes": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 32, 64, 128], "use_cudagraph": false, "cudagraph_capture_sizes": [8, 4, 2, 1], "flag_cudagraph_capture_sizes_initlized": true, "cudagraph_capture_sizes_prefill": [512, 512, 480, 448, 416, 384, 352, 320, 288, 256, 240, 224, 208, 192, 176, 160, 144, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "cudagraph_num_of_warmups": 2, "cudagraph_copy_inputs": 
false, "cudagraph_splitting_ops": [], "cudagraph_only_prefill": false, "full_cuda_graph": true, "max_capture_size": 8, "real_shape_to_captured_size": {"4": 4, "5": 8, "6": 8, "7": 8, "2": 2, "3": 4, "1": 1, "0": 0, "8": 8}, "real_bsz_to_captured_size": {}, "use_unique_memory_pool": true, "draft_model_use_cudagraph": true, "max_capture_shape_prefill": 512, "max_capture_size_prefill": 512, "real_shape_to_captured_size_prefill": {"480": 480, "481": 512, "482": 512, "483": 512, "484": 512, "485": 512, "486": 512, "487": 512, "488": 512, "489": 512, "490": 512, "491": 512, "492": 512, "493": 512, "494": 512, "495": 512, "496": 512, "497": 512, "498": 512, "499": 512, "500": 512, "501": 512, "502": 512, "503": 512, "504": 512, "505": 512, "506": 512, "507": 512, "508": 512, "509": 512, "510": 512, "511": 512, "448": 448, "449": 480, "450": 480, "451": 480, "452": 480, "453": 480, "454": 480, "455": 480, "456": 480, "457": 480, "458": 480, "459": 480, "460": 480, "461": 480, "462": 480, "463": 480, "464": 480, "465": 480, "466": 480, "467": 480, "468": 480, "469": 480, "470": 480, "471": 480, "472": 480, "473": 480, "474": 480, "475": 480, "476": 480, "477": 480, "478": 480, "479": 480, "416": 416, "417": 448, "418": 448, "419": 448, "420": 448, "421": 448, "422": 448, "423": 448, "424": 448, "425": 448, "426": 448, "427": 448, "428": 448, "429": 448, "430": 448, "431": 448, "432": 448, "433": 448, "434": 448, "435": 448, "436": 448, "437": 448, "438": 448, "439": 448, "440": 448, "441": 448, "442": 448, "443": 448, "444": 448, "445": 448, "446": 448, "447": 448, "384": 384, "385": 416, "386": 416, "387": 416, "388": 416, "389": 416, "390": 416, "391": 416, "392": 416, "393": 416, "394": 416, "395": 416, "396": 416, "397": 416, "398": 416, "399": 416, "400": 416, "401": 416, "402": 416, "403": 416, "404": 416, "405": 416, "406": 416, "407": 416, "408": 416, "409": 416, "410": 416, "411": 416, "412": 416, "413": 416, "414": 416, "415": 416, "352": 352, "353": 384, "354": 
384, "355": 384, "356": 384, "357": 384, "358": 384, "359": 384, "360": 384, "361": 384, "362": 384, "363": 384, "364": 384, "365": 384, "366": 384, "367": 384, "368": 384, "369": 384, "370": 384, "371": 384, "372": 384, "373": 384, "374": 384, "375": 384, "376": 384, "377": 384, "378": 384, "379": 384, "380": 384, "381": 384, "382": 384, "383": 384, "320": 320, "321": 352, "322": 352, "323": 352, "324": 352, "325": 352, "326": 352, "327": 352, "328": 352, "329": 352, "330": 352, "331": 352, "332": 352, "333": 352, "334": 352, "335": 352, "336": 352, "337": 352, "338": 352, "339": 352, "340": 352, "341": 352, "342": 352, "343": 352, "344": 352, "345": 352, "346": 352, "347": 352, "348": 352, "349": 352, "350": 352, "351": 352, "288": 288, "289": 320, "290": 320, "291": 320, "292": 320, "293": 320, "294": 320, "295": 320, "296": 320, "297": 320, "298": 320, "299": 320, "300": 320, "301": 320, "302": 320, "303": 320, "304": 320, "305": 320, "306": 320, "307": 320, "308": 320, "309": 320, "310": 320, "311": 320, "312": 320, "313": 320, "314": 320, "315": 320, "316": 320, "317": 320, "318": 320, "319": 320, "256": 256, "257": 288, "258": 288, "259": 288, "260": 288, "261": 288, "262": 288, "263": 288, "264": 288, "265": 288, "266": 288, "267": 288, "268": 288, "269": 288, "270": 288, "271": 288, "272": 288, "273": 288, "274": 288, "275": 288, "276": 288, "277": 288, "278": 288, "279": 288, "280": 288, "281": 288, "282": 288, "283": 288, "284": 288, "285": 288, "286": 288, "287": 288, "240": 240, "241": 256, "242": 256, "243": 256, "244": 256, "245": 256, "246": 256, "247": 256, "248": 256, "249": 256, "250": 256, "251": 256, "252": 256, "253": 256, "254": 256, "255": 256, "224": 224, "225": 240, "226": 240, "227": 240, "228": 240, "229": 240, "230": 240, "231": 240, "232": 240, "233": 240, "234": 240, "235": 240, "236": 240, "237": 240, "238": 240, "239": 240, "208": 208, "209": 224, "210": 224, "211": 224, "212": 224, "213": 224, "214": 224, "215": 224, "216": 224, 
"217": 224, "218": 224, "219": 224, "220": 224, "221": 224, "222": 224, "223": 224, "192": 192, "193": 208, "194": 208, "195": 208, "196": 208, "197": 208, "198": 208, "199": 208, "200": 208, "201": 208, "202": 208, "203": 208, "204": 208, "205": 208, "206": 208, "207": 208, "176": 176, "177": 192, "178": 192, "179": 192, "180": 192, "181": 192, "182": 192, "183": 192, "184": 192, "185": 192, "186": 192, "187": 192, "188": 192, "189": 192, "190": 192, "191": 192, "160": 160, "161": 176, "162": 176, "163": 176, "164": 176, "165": 176, "166": 176, "167": 176, "168": 176, "169": 176, "170": 176, "171": 176, "172": 176, "173": 176, "174": 176, "175": 176, "144": 144, "145": 160, "146": 160, "147": 160, "148": 160, "149": 160, "150": 160, "151": 160, "152": 160, "153": 160, "154": 160, "155": 160, "156": 160, "157": 160, "158": 160, "159": 160, "128": 128, "129": 144, "130": 144, "131": 144, "132": 144, "133": 144, "134": 144, "135": 144, "136": 144, "137": 144, "138": 144, "139": 144, "140": 144, "141": 144, "142": 144, "143": 144, "120": 120, "121": 128, "122": 128, "123": 128, "124": 128, "125": 128, "126": 128, "127": 128, "112": 112, "113": 120, "114": 120, "115": 120, "116": 120, "117": 120, "118": 120, "119": 120, "104": 104, "105": 112, "106": 112, "107": 112, "108": 112, "109": 112, "110": 112, "111": 112, "96": 96, "97": 104, "98": 104, "99": 104, "100": 104, "101": 104, "102": 104, "103": 104, "88": 88, "89": 96, "90": 96, "91": 96, "92": 96, "93": 96, "94": 96, "95": 96, "80": 80, "81": 88, "82": 88, "83": 88, "84": 88, "85": 88, "86": 88, "87": 88, "72": 72, "73": 80, "74": 80, "75": 80, "76": 80, "77": 80, "78": 80, "79": 80, "64": 64, "65": 72, "66": 72, "67": 72, "68": 72, "69": 72, "70": 72, "71": 72, "56": 56, "57": 64, "58": 64, "59": 64, "60": 64, "61": 64, "62": 64, "63": 64, "48": 48, "49": 56, "50": 56, "51": 56, "52": 56, "53": 56, "54": 56, "55": 56, "40": 40, "41": 48, "42": 48, "43": 48, "44": 48, "45": 48, "46": 48, "47": 48, "32": 32, "33": 
40, "34": 40, "35": 40, "36": 40, "37": 40, "38": 40, "39": 40, "24": 24, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32, "30": 32, "31": 32, "16": 16, "17": 24, "18": 24, "19": 24, "20": 24, "21": 24, "22": 24, "23": 24, "8": 8, "9": 16, "10": 16, "11": 16, "12": 16, "13": 16, "14": 16, "15": 16, "4": 4, "5": 8, "6": 8, "7": 8, "2": 2, "3": 4, "1": 1, "0": 0, "512": 512}} +INFO:legacy.config:early_stop_config : {"enable_early_stop": false, "strategy": "repetition", "window_size": 3000, "threshold": 0.99} +INFO:legacy.config:plas_attention_config: {"plas_encoder_top_k_left": null, "plas_encoder_top_k_right": null, "plas_decoder_top_k_left": null, "plas_decoder_top_k_right": null, "plas_use_encoder_seq_limit": null, "plas_use_decoder_seq_limit": null, "plas_block_size": 128, "mlp_weight_name": "plas_attention_mlp_weight.safetensors", "plas_max_seq_length": 131072} +INFO:legacy.config:structured_outputs_config: {"reasoning_parser": null, "guided_decoding_backend": "off", "disable_any_whitespace": true, "logits_processors": null} +INFO:legacy.config:router_config : {"router": null, "api_server_host": "10.234.11.170", "api_server_port": null, "metrics_port": null} +INFO:legacy.config:routing_replay_config: {"enable_routing_replay": false, "routing_store_type": "local", "local_store_dir": "./routing_replay_output", "rdma_store_server": "", "only_last_turn": false, "use_fused_put": false} +INFO:legacy.config:deploy_modality : mixed +INFO:legacy.config:tokenizer : /home/aistudio/PaddlePaddle/PaddleOCR-VL-1.5 +INFO:legacy.config:ips : None +INFO:legacy.config:tool_parser : None +INFO:legacy.config:master_ip : 0.0.0.0 +INFO:legacy.config:host_ip : 10.234.11.170 +INFO:legacy.config:nnode : 1 +INFO:legacy.config:node_rank : 0 +INFO:legacy.config:limit_mm_per_prompt : None +INFO:legacy.config:mm_processor_kwargs : None +INFO:legacy.config:use_warmup : 0 +INFO:legacy.config:max_num_partial_prefills: 1 +INFO:legacy.config:max_long_partial_prefills: 1 
+INFO:legacy.config:long_prefill_token_threshold: 327 +INFO:legacy.config:max_prefill_batch : 8 +INFO:legacy.config:max_chips_per_node : 16 +INFO:legacy.config:worker_num_per_node : 1 +INFO:legacy.config:is_master : True +INFO:legacy.config:paddle_commit_id : 28667cd939ab01444ead356a35b2dfea066dd39b +INFO:legacy.config:local_device_ids : ['0'] +INFO:legacy.config:splitwise_version : v1 +INFO:legacy.config:register_info : {'role': 'mixed', 'host_ip': '10.234.11.170', 'port': None, 'metrics_port': None, 'connector_port': 60151, 'rdma_ports': [25285], 'engine_worker_queue_port': 46509, 'device_ids': ['0'], 'transfer_protocol': ['ipc', 'rdma'], 'tp_size': 1, 'is_paused': False, 'version': 'init', 'connected_decodes': []} +INFO:legacy.config:============================================================= +INFO:legacy.prefix_cache_manager:Prefix cache manager is initialized with 528 gpu blocks and 0 cpu blocks, bytes_per_token_per_layer for each rank: 1024.0 +INFO 2026-06-01 20:25:58,230 7196 download.py[line:146] Using download source: huggingface +INFO 2026-06-01 20:25:58,238 7196 configuration_utils.py[line:425] Loading configuration file /home/aistudio/PaddlePaddle/PaddleOCR-VL-1.5/generation_config.json +INFO 2026-06-01 20:25:58,296 7196 tokenizer_utils.py[line:257] Using download source: huggingface +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +INFO 2026-06-01 20:26:03,785 7196 engine.py[line:159] Waiting for worker processes to be ready... 
+Loading Weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:36<00:00, 2.73it/s] +Loading Layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2195970.68it/s]INFO:legacy.config:Reset block num, the total_block_num:10922, prefill_kvcache_block_num:10922 +Loading Layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 199.69it/s] +INFO 2026-06-01 20:26:46,032 7196 engine.py[line:218] Worker processes are launched with 102.24678826332092 seconds. +INFO 2026-06-01 20:26:46,033 7196 engine.py[line:229] Detected 10922 gpu blocks and 0 cpu blocks in cache (block size: 16). +INFO 2026-06-01 20:26:46,033 7196 engine.py[line:232] FastDeploy will be serving 8 running requests if each sequence reaches its maximum length: 8192 +[OCR Worker] OCR model loaded, elapsed: 142.60s +[OCR Worker] Recognizing image 1/1, size: (2014, 2881) +Processed prompts: 0%| | 0/1 [00:00, is_text_generation=True, is_multimodal=False, is_reasoning=False, is_pooling=False, module_path='ernie4_5_moe', default_pooling_type='LAST') +INFO:legacy.config:_architecture : Ernie4_5_ForCausalLM +INFO:legacy.config:mla_use_absorb : False +INFO:legacy.config:max_stop_seqs_num : 5 +INFO:legacy.config:stop_seqs_max_len : 8 +INFO:legacy.config:compression_ratio : 1.0 +INFO:legacy.config:model_config : {'architectures': ['Ernie4_5_ForCausalLM'], 'bos_token_id': 1, 'eos_token_id': 2, 'hidden_act': 'silu', 'hidden_size': 1024, 'intermediate_size': 3072, 'max_position_embeddings': 131072, 'model_type': 'ernie4_5', 'num_attention_heads': 16, 'num_key_value_heads': 2, 'head_dim': 128, 'num_hidden_layers': 18, 'pad_token_id': 0, 'rms_norm_eps': 1e-05, 'use_cache': False, 'vocab_size': 103424, 'rope_theta': 500000, 'use_rmsnorm': 
True, 'tie_word_embeddings': True, 'use_bias': False, 'dtype': 'bfloat16'} +INFO:legacy.config:moe_phase : +INFO:legacy.config:============================================================= +INFO:legacy.config:Cache Configuration Information : +INFO:legacy.config:block_size : 16 +INFO:legacy.config:gpu_memory_utilization: 0.9 +INFO:legacy.config:num_gpu_blocks_override: None +INFO:legacy.config:kv_cache_ratio : 0.75 +INFO:legacy.config:enc_dec_block_num : 2 +INFO:legacy.config:prealloc_dec_block_slot_num_threshold: 12 +INFO:legacy.config:cache_dtype : bfloat16 +INFO:legacy.config:model_cfg : +INFO:legacy.config:enable_chunked_prefill: False +INFO:legacy.config:rdma_comm_ports : [15270, 15271] +INFO:legacy.config:local_rdma_comm_ports: [15270, 15271] +INFO:legacy.config:cache_transfer_protocol: ipc,rdma +INFO:legacy.config:pd_comm_port : [36146] +INFO:legacy.config:local_pd_comm_port : 36146 +INFO:legacy.config:enable_prefix_caching: False +INFO:legacy.config:enable_ssd_cache : False +INFO:legacy.config:cache_queue_port : [31913] +INFO:legacy.config:local_cache_queue_port: 31913 +INFO:legacy.config:swap_space : None +INFO:legacy.config:max_encoder_cache : 0 +INFO:legacy.config:max_processor_cache : -1 +INFO:legacy.config:enable_output_caching: False +INFO:legacy.config:disable_chunked_mm_input: False +INFO:legacy.config:kvcache_storage_backend: None +INFO:legacy.config:write_policy : write_through +INFO:legacy.config:write_through_threshold: 2 +INFO:legacy.config:num_cpu_blocks : 0 +INFO:legacy.config:use_mla_cache : False +INFO:legacy.config:head_num : 2 +INFO:legacy.config:head_dim : 128 +INFO:legacy.config:byte_size : 2 +INFO:legacy.config:kv_factor : 2 +INFO:legacy.config:bytes_per_token_per_layer: 1024 +INFO:legacy.config:bytes_per_block : 294912 +INFO:legacy.config:max_block_num_per_seq: 256 +INFO:legacy.config:dec_token_num : 32 +INFO:legacy.config:total_block_num : 272 +INFO:legacy.config:prefill_kvcache_block_num: 272 
+INFO:legacy.config:============================================================= +INFO:legacy.config:LocalScheduler Configuration Information : +INFO:legacy.config:max_size : -1 +INFO:legacy.config:ttl : 900 +INFO:legacy.config:max_model_len : 4096 +INFO:legacy.config:enable_chunked_prefill: False +INFO:legacy.config:max_num_partial_prefills: 1 +INFO:legacy.config:max_long_partial_prefills: 1 +INFO:legacy.config:long_prefill_token_threshold: 163 +INFO:legacy.config:============================================================= +INFO:legacy.config:Parallel Configuration Information : +INFO:legacy.config:sequence_parallel : False +INFO:legacy.config:use_ep : False +INFO:legacy.config:msg_queue_id : 1 +INFO:legacy.config:tensor_parallel_rank: 0 +INFO:legacy.config:tensor_parallel_size: 2 +INFO:legacy.config:expert_parallel_rank: 0 +INFO:legacy.config:expert_parallel_size: 1 +INFO:legacy.config:data_parallel_rank : 0 +INFO:legacy.config:data_parallel_size : 1 +INFO:legacy.config:enable_expert_parallel: False +INFO:legacy.config:enable_chunked_moe : False +INFO:legacy.config:chunked_moe_size : 256 +INFO:legacy.config:local_data_parallel_id: 0 +INFO:legacy.config:engine_worker_queue_port: [51568] +INFO:legacy.config:local_engine_worker_queue_port: 51568 +INFO:legacy.config:device_ids : 0,1 +INFO:legacy.config:first_token_id : 1 +INFO:legacy.config:engine_pid : None +INFO:legacy.config:do_profile : False +INFO:legacy.config:use_internode_ll_two_stage: False +INFO:legacy.config:disable_sequence_parallel_moe: False +INFO:legacy.config:shutdown_comm_group_if_worker_idle: True +INFO:legacy.config:ep_prefill_use_worst_num_tokens: False +INFO:legacy.config:pod_ip : None +INFO:legacy.config:disable_custom_all_reduce: False +INFO:legacy.config:enable_flashinfer_allreduce_fusion: False +INFO:legacy.config:pd_disaggregation_mode: None +INFO:legacy.config:prefill_one_step_stop: False +INFO:legacy.config:use_sequence_parallel_moe: False 
+INFO:legacy.config:============================================================= +INFO:legacy.config:speculative_config : {"method_list": ["ngram", "mtp", "naive", "suffix"], "mtp_strategy_list": ["default", "with_ngram"], "mtp_strategy": "default", "num_speculative_tokens": 1, "num_model_steps": 1, "max_candidate_len": 5, "verify_window": 2, "max_ngram_size": 5, "min_ngram_size": 2, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": -1, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1, "model": "/home/aistudio/PaddlePaddle/ERNIE-4.5-0.3B-Paddle", "quantization": "wint8", "num_gpu_block_expand_ratio": 1.0, "model_type": "main", "benchmark_mode": false, "enf_gen_phase_tag": false, "enable_draft_logprob": false, "verify_strategy": "target_match", "accept_policy": "normal", "model_config": {}, "num_extra_cache_layer": 0} +INFO:legacy.config:eplb_config : +INFO:legacy.config:device_config : None +INFO:legacy.config:load_config : {"load_choices": "default_v1", "is_pre_sharded": false, "dynamic_load_weight": false, "load_strategy": "normal", "rsync_config": null, "model_loader_extra_config": null} +INFO:legacy.config:quant_config : None +INFO:legacy.config:graph_opt_config : {"graph_opt_level": 0, "sot_warmup_sizes": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 32, 64, 128], "use_cudagraph": false, "cudagraph_capture_sizes": [8, 4, 2, 1], "flag_cudagraph_capture_sizes_initlized": true, "cudagraph_capture_sizes_prefill": [512, 512, 480, 448, 416, 384, 352, 320, 288, 256, 240, 224, 208, 192, 176, 160, 144, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "cudagraph_num_of_warmups": 2, "cudagraph_copy_inputs": false, "cudagraph_splitting_ops": [], "cudagraph_only_prefill": false, "full_cuda_graph": true, "max_capture_size": 8, "real_shape_to_captured_size": {"4": 4, "5": 8, "6": 8, "7": 8, "2": 2, "3": 4, "1": 1, "0": 0, "8": 8}, "real_bsz_to_captured_size": {}, "use_unique_memory_pool": true, 
"draft_model_use_cudagraph": true, "max_capture_shape_prefill": 512, "max_capture_size_prefill": 512, "real_shape_to_captured_size_prefill": {"480": 480, "481": 512, "482": 512, "483": 512, "484": 512, "485": 512, "486": 512, "487": 512, "488": 512, "489": 512, "490": 512, "491": 512, "492": 512, "493": 512, "494": 512, "495": 512, "496": 512, "497": 512, "498": 512, "499": 512, "500": 512, "501": 512, "502": 512, "503": 512, "504": 512, "505": 512, "506": 512, "507": 512, "508": 512, "509": 512, "510": 512, "511": 512, "448": 448, "449": 480, "450": 480, "451": 480, "452": 480, "453": 480, "454": 480, "455": 480, "456": 480, "457": 480, "458": 480, "459": 480, "460": 480, "461": 480, "462": 480, "463": 480, "464": 480, "465": 480, "466": 480, "467": 480, "468": 480, "469": 480, "470": 480, "471": 480, "472": 480, "473": 480, "474": 480, "475": 480, "476": 480, "477": 480, "478": 480, "479": 480, "416": 416, "417": 448, "418": 448, "419": 448, "420": 448, "421": 448, "422": 448, "423": 448, "424": 448, "425": 448, "426": 448, "427": 448, "428": 448, "429": 448, "430": 448, "431": 448, "432": 448, "433": 448, "434": 448, "435": 448, "436": 448, "437": 448, "438": 448, "439": 448, "440": 448, "441": 448, "442": 448, "443": 448, "444": 448, "445": 448, "446": 448, "447": 448, "384": 384, "385": 416, "386": 416, "387": 416, "388": 416, "389": 416, "390": 416, "391": 416, "392": 416, "393": 416, "394": 416, "395": 416, "396": 416, "397": 416, "398": 416, "399": 416, "400": 416, "401": 416, "402": 416, "403": 416, "404": 416, "405": 416, "406": 416, "407": 416, "408": 416, "409": 416, "410": 416, "411": 416, "412": 416, "413": 416, "414": 416, "415": 416, "352": 352, "353": 384, "354": 384, "355": 384, "356": 384, "357": 384, "358": 384, "359": 384, "360": 384, "361": 384, "362": 384, "363": 384, "364": 384, "365": 384, "366": 384, "367": 384, "368": 384, "369": 384, "370": 384, "371": 384, "372": 384, "373": 384, "374": 384, "375": 384, "376": 384, "377": 384, "378": 
384, "379": 384, "380": 384, "381": 384, "382": 384, "383": 384, "320": 320, "321": 352, "322": 352, "323": 352, "324": 352, "325": 352, "326": 352, "327": 352, "328": 352, "329": 352, "330": 352, "331": 352, "332": 352, "333": 352, "334": 352, "335": 352, "336": 352, "337": 352, "338": 352, "339": 352, "340": 352, "341": 352, "342": 352, "343": 352, "344": 352, "345": 352, "346": 352, "347": 352, "348": 352, "349": 352, "350": 352, "351": 352, "288": 288, "289": 320, "290": 320, "291": 320, "292": 320, "293": 320, "294": 320, "295": 320, "296": 320, "297": 320, "298": 320, "299": 320, "300": 320, "301": 320, "302": 320, "303": 320, "304": 320, "305": 320, "306": 320, "307": 320, "308": 320, "309": 320, "310": 320, "311": 320, "312": 320, "313": 320, "314": 320, "315": 320, "316": 320, "317": 320, "318": 320, "319": 320, "256": 256, "257": 288, "258": 288, "259": 288, "260": 288, "261": 288, "262": 288, "263": 288, "264": 288, "265": 288, "266": 288, "267": 288, "268": 288, "269": 288, "270": 288, "271": 288, "272": 288, "273": 288, "274": 288, "275": 288, "276": 288, "277": 288, "278": 288, "279": 288, "280": 288, "281": 288, "282": 288, "283": 288, "284": 288, "285": 288, "286": 288, "287": 288, "240": 240, "241": 256, "242": 256, "243": 256, "244": 256, "245": 256, "246": 256, "247": 256, "248": 256, "249": 256, "250": 256, "251": 256, "252": 256, "253": 256, "254": 256, "255": 256, "224": 224, "225": 240, "226": 240, "227": 240, "228": 240, "229": 240, "230": 240, "231": 240, "232": 240, "233": 240, "234": 240, "235": 240, "236": 240, "237": 240, "238": 240, "239": 240, "208": 208, "209": 224, "210": 224, "211": 224, "212": 224, "213": 224, "214": 224, "215": 224, "216": 224, "217": 224, "218": 224, "219": 224, "220": 224, "221": 224, "222": 224, "223": 224, "192": 192, "193": 208, "194": 208, "195": 208, "196": 208, "197": 208, "198": 208, "199": 208, "200": 208, "201": 208, "202": 208, "203": 208, "204": 208, "205": 208, "206": 208, "207": 208, "176": 176, 
"177": 192, "178": 192, "179": 192, "180": 192, "181": 192, "182": 192, "183": 192, "184": 192, "185": 192, "186": 192, "187": 192, "188": 192, "189": 192, "190": 192, "191": 192, "160": 160, "161": 176, "162": 176, "163": 176, "164": 176, "165": 176, "166": 176, "167": 176, "168": 176, "169": 176, "170": 176, "171": 176, "172": 176, "173": 176, "174": 176, "175": 176, "144": 144, "145": 160, "146": 160, "147": 160, "148": 160, "149": 160, "150": 160, "151": 160, "152": 160, "153": 160, "154": 160, "155": 160, "156": 160, "157": 160, "158": 160, "159": 160, "128": 128, "129": 144, "130": 144, "131": 144, "132": 144, "133": 144, "134": 144, "135": 144, "136": 144, "137": 144, "138": 144, "139": 144, "140": 144, "141": 144, "142": 144, "143": 144, "120": 120, "121": 128, "122": 128, "123": 128, "124": 128, "125": 128, "126": 128, "127": 128, "112": 112, "113": 120, "114": 120, "115": 120, "116": 120, "117": 120, "118": 120, "119": 120, "104": 104, "105": 112, "106": 112, "107": 112, "108": 112, "109": 112, "110": 112, "111": 112, "96": 96, "97": 104, "98": 104, "99": 104, "100": 104, "101": 104, "102": 104, "103": 104, "88": 88, "89": 96, "90": 96, "91": 96, "92": 96, "93": 96, "94": 96, "95": 96, "80": 80, "81": 88, "82": 88, "83": 88, "84": 88, "85": 88, "86": 88, "87": 88, "72": 72, "73": 80, "74": 80, "75": 80, "76": 80, "77": 80, "78": 80, "79": 80, "64": 64, "65": 72, "66": 72, "67": 72, "68": 72, "69": 72, "70": 72, "71": 72, "56": 56, "57": 64, "58": 64, "59": 64, "60": 64, "61": 64, "62": 64, "63": 64, "48": 48, "49": 56, "50": 56, "51": 56, "52": 56, "53": 56, "54": 56, "55": 56, "40": 40, "41": 48, "42": 48, "43": 48, "44": 48, "45": 48, "46": 48, "47": 48, "32": 32, "33": 40, "34": 40, "35": 40, "36": 40, "37": 40, "38": 40, "39": 40, "24": 24, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32, "30": 32, "31": 32, "16": 16, "17": 24, "18": 24, "19": 24, "20": 24, "21": 24, "22": 24, "23": 24, "8": 8, "9": 16, "10": 16, "11": 16, "12": 16, "13": 16, "14": 
16, "15": 16, "4": 4, "5": 8, "6": 8, "7": 8, "2": 2, "3": 4, "1": 1, "0": 0, "512": 512}} +INFO:legacy.config:early_stop_config : {"enable_early_stop": false, "strategy": "repetition", "window_size": 3000, "threshold": 0.99} +INFO:legacy.config:plas_attention_config: {"plas_encoder_top_k_left": null, "plas_encoder_top_k_right": null, "plas_decoder_top_k_left": null, "plas_decoder_top_k_right": null, "plas_use_encoder_seq_limit": null, "plas_use_decoder_seq_limit": null, "plas_block_size": 128, "mlp_weight_name": "plas_attention_mlp_weight.safetensors", "plas_max_seq_length": 131072} +INFO:legacy.config:structured_outputs_config: {"reasoning_parser": null, "guided_decoding_backend": "off", "disable_any_whitespace": true, "logits_processors": null} +INFO:legacy.config:router_config : {"router": null, "api_server_host": "10.234.11.170", "api_server_port": null, "metrics_port": null} +INFO:legacy.config:routing_replay_config: {"enable_routing_replay": false, "routing_store_type": "local", "local_store_dir": "./routing_replay_output", "rdma_store_server": "", "only_last_turn": false, "use_fused_put": false} +INFO:legacy.config:deploy_modality : mixed +INFO:legacy.config:tokenizer : /home/aistudio/PaddlePaddle/ERNIE-4.5-0.3B-Paddle +INFO:legacy.config:ips : None +INFO:legacy.config:tool_parser : None +INFO:legacy.config:master_ip : 0.0.0.0 +INFO:legacy.config:host_ip : 10.234.11.170 +INFO:legacy.config:nnode : 1 +INFO:legacy.config:node_rank : 0 +INFO:legacy.config:limit_mm_per_prompt : None +INFO:legacy.config:mm_processor_kwargs : None +INFO:legacy.config:use_warmup : 0 +INFO:legacy.config:max_num_partial_prefills: 1 +INFO:legacy.config:max_long_partial_prefills: 1 +INFO:legacy.config:long_prefill_token_threshold: 163 +INFO:legacy.config:max_prefill_batch : 3 +INFO:legacy.config:max_chips_per_node : 16 +INFO:legacy.config:worker_num_per_node : 2 +INFO:legacy.config:is_master : True +INFO:legacy.config:paddle_commit_id : 28667cd939ab01444ead356a35b2dfea066dd39b 
+INFO:legacy.config:local_device_ids : ['0', '1'] +INFO:legacy.config:splitwise_version : v1 +INFO:legacy.config:register_info : {'role': 'mixed', 'host_ip': '10.234.11.170', 'port': None, 'metrics_port': None, 'connector_port': 36146, 'rdma_ports': [15270, 15271], 'engine_worker_queue_port': 51568, 'device_ids': ['0', '1'], 'transfer_protocol': ['ipc', 'rdma'], 'tp_size': 2, 'is_paused': False, 'version': 'init', 'connected_decodes': []} +INFO:legacy.config:============================================================= +INFO:legacy.prefix_cache_manager:Prefix cache manager is initialized with 272 gpu blocks and 0 cpu blocks, bytes_per_token_per_layer for each rank: 512.0 +INFO 2026-06-01 20:27:59,772 24555 download.py[line:146] Using download source: huggingface +INFO 2026-06-01 20:27:59,772 24555 configuration_utils.py[line:425] Loading configuration file /home/aistudio/PaddlePaddle/ERNIE-4.5-0.3B-Paddle/generation_config.json +/home/aistudio/.local/lib/python3.10/site-packages/paddleformers/generation/configuration_utils.py:250: UserWarning: using greedy search strategy. However, `temperature` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. +/home/aistudio/.local/lib/python3.10/site-packages/paddleformers/generation/configuration_utils.py:255: UserWarning: using greedy search strategy. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. 
+WARNING 2026-06-01 20:27:59,790 24555 log.py[line:135] PretrainedTokenizer will be deprecated and removed in the next major release. Please migrate to Hugging Face's transformers.PreTrainedTokenizer. use class QWenTokenizer(PaddleTokenizerMixin, hf.PreTrainedTokenizer) to support multisource download and Paddle tokenizer operations. +INFO 2026-06-01 20:28:05,026 24555 engine.py[line:159] Waiting for worker processes to be ready... +Loading Weights: 0%| | 0/100 [02:28 {new_line}") + + +patch_aistudio_utils() + + +# ============================================================================ +# Image splitting +# ============================================================================ + +def split_image(image, num_splits=4, overlap_ratio=0.1): + """Split an image into num_splits parts (NxN grid) with overlap.""" + grid_size = int(math.sqrt(num_splits)) + if grid_size * grid_size != num_splits: + raise ValueError(f"num_splits must be a perfect square (e.g. 4, 9, 16), got: {num_splits}") + + w, h = image.size + cell_w = w / grid_size + cell_h = h / grid_size + overlap_w = cell_w * overlap_ratio + overlap_h = cell_h * overlap_ratio + + sub_images = [] + for row in range(grid_size): + for col in range(grid_size): + left = max(0, col * cell_w - overlap_w) + upper = max(0, row * cell_h - overlap_h) + right = min(w, (col + 1) * cell_w + overlap_w) + lower = min(h, (row + 1) * cell_h + overlap_h) + sub_img = image.crop((int(left), int(upper), int(right), int(lower))) + sub_images.append(sub_img) + + return sub_images + + +# ============================================================================ +# OCR module (subprocess) +# ============================================================================ + +def ocr_worker_process(ocr_model_dir, image_data_list, max_new_tokens, result_queue): + """Worker function for OCR subprocess - loads model, performs OCR, returns result.""" + try: + import time + import base64 + import io + from PIL import Image + from fastdeploy 
import LLM, SamplingParams + + # Load OCR model + print("[OCR Worker] Loading OCR model (PaddleOCR-VL)...") + start = time.perf_counter() + ocr_model = LLM( + model=ocr_model_dir, + tensor_parallel_size=1, + max_model_len=8192, + block_size=16, + quantization="wint8", + graph_optimization_config={"use_cudagraph": False}, + ) + elapsed = time.perf_counter() - start + print(f"[OCR Worker] OCR model loaded, elapsed: {elapsed:.2f}s") + + # Process each image + all_ocr_texts = [] + for i, img_bytes in enumerate(image_data_list): + image = Image.open(io.BytesIO(img_bytes)).convert("RGB") + print(f"[OCR Worker] Recognizing image {i+1}/{len(image_data_list)}, size: {image.size}") + + # Prepare image for OCR + buf = io.BytesIO() + image.save(buf, format="PNG") + base64_image = base64.b64encode(buf.getvalue()).decode("utf-8") + image_url = f"data:image/png;base64,{base64_image}" + + prompts = [{ + "messages": [{ + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": "OCR:"}, + ], + }] + }] + sampling_params = SamplingParams( + temperature=0.8, top_p=0.95, max_tokens=max_new_tokens, + ) + outputs = ocr_model.generate(prompts, sampling_params) + response = outputs[0].outputs.text + all_ocr_texts.append(response) + print(f"[OCR Worker] Image {i+1} done, text length: {len(response)}") + + # Combine results + combined_text = "\n\n".join(all_ocr_texts) + print(f"[OCR Worker] All images done, total text length: {len(combined_text)}") + + # Put result in queue + result_queue.put(("success", combined_text)) + + # Clean up + del ocr_model + import gc + gc.collect() + print("[OCR Worker] OCR model released") + + except Exception as e: + import traceback + result_queue.put(("error", str(e) + "\n" + traceback.format_exc())) + + +def ocr_step( + ocr_model_dir, + image_path, + enable_split=True, + num_splits=4, + overlap_ratio=0.1, + max_new_tokens=5120, +): + """Execute the OCR step in a subprocess: load image, optionally 
split, and run OCR.""" + step_start = time.perf_counter() + logger.info("[OCR Step] Loading image...") + image = Image.open(image_path).convert("RGB") + logger.info("[OCR Step] Image loaded, size: %s", image.size) + + if enable_split: + logger.info("[OCR Step] Splitting image (num_splits=%d, overlap=%.2f)...", num_splits, overlap_ratio) + sub_images = split_image(image, num_splits=num_splits, overlap_ratio=overlap_ratio) + ocr_images = [image] + sub_images + logger.info("[OCR Step] Split done, 1 original + %d split = %d total", len(sub_images), len(ocr_images)) + else: + logger.info("[OCR Step] Skipping image split") + ocr_images = [image] + + # Serialize images to bytes for subprocess + image_data_list = [] + for img in ocr_images: + buf = io.BytesIO() + img.save(buf, format="PNG") + image_data_list.append(buf.getvalue()) + + # Create subprocess for OCR + logger.info("[OCR Step] Starting OCR subprocess...") + result_queue = Queue() + ocr_process = Process( + target=ocr_worker_process, + args=(str(ocr_model_dir), image_data_list, max_new_tokens, result_queue) + ) + ocr_process.start() + + # Wait for result + status, result = result_queue.get() + ocr_process.join() + ocr_process.close() + + if status == "error": + logger.error("[OCR Step] OCR subprocess failed: %s", result) + raise RuntimeError(f"OCR subprocess failed: {result}") + + combined_ocr_text = result + logger.info("[OCR Step] OCR complete, total text length: %d, elapsed: %.2fs", len(combined_ocr_text), time.perf_counter() - step_start) + + return {"ocr_text": combined_ocr_text, "ocr_images": ocr_images} + + +# ============================================================================ +# LLM module (subprocess) +# ============================================================================ + +def clean_for_tts(text): + """Clean text for TTS synthesis by removing emojis and markdown formatting.""" + import re + # Remove emojis (Unicode ranges for common emojis) + # NOTE: Must avoid ranges that overlap with 
CJK characters (U+4E00-U+9FFF) + text = re.sub( + r"[\U0001F600-\U0001F64F" # emoticons + r"\U0001F300-\U0001F5FF" # symbols & pictographs + r"\U0001F680-\U0001F6FF" # transport & map + r"\U0001F1E0-\U0001F1FF" # flags + r"\U00002702-\U000027B0" # dingbats + r"\U000024C2-\U0000324F" # enclosed alphanumerics (stop before CJK) + r"\U0001F200-\U0001F251" # enclosed CJK supplement (above CJK range) + r"\U0001F900-\U0001F9FF" # supplemental symbols + r"\U0001FA00-\U0001FA6F" # chess symbols + r"\U0001FA70-\U0001FAFF" # symbols extended-A + r"\U00002600-\U000026FF" # misc symbols + r"\U0000FE00-\U0000FE0F" # variation selectors + r"\U0000200D" # zero-width joiner + r"]+", + "", + text, + ) + # Remove thinking blocks (including bare </think>) + text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL) + text = re.sub(r"^.*?</think>\s*", "", text, flags=re.DOTALL) + # Remove markdown code blocks (```...```) + text = re.sub(r"```.*?```", "", text, flags=re.DOTALL) + # Remove inline code (`...`) -> content + text = re.sub(r"`([^`\n]+)`", r"\1", text) + # Remove markdown headers (# ## ### etc.) at line start + text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE) + # Remove markdown bold (**text**) -> text + text = re.sub(r"\*\*([^*\n]+?)\*\*", r"\1", text) + # Remove markdown bold (__text__) -> text + text = re.sub(r"__([^_\n]+?)__", r"\1", text) + # Remove markdown italic (*text*) -> text + text = re.sub(r"\*([^*\n]+?)\*", r"\1", text) + # Remove markdown italic (_text_) -> text (only when _ is at word boundary) + text = re.sub(r"(?<!\w)_([^_\n]+?)_(?!\w)", r"\1", text) + # Remove markdown links [text](url) -> text + text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text) + # Remove markdown images ![alt](url) + text = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", text) + # Remove markdown horizontal rules (---, ***, ___) + text = re.sub(r"^[-*_]{3,}\s*$", "", text, flags=re.MULTILINE) + # Remove markdown bullet list markers (- , * , + ) at line start, keep content + text = re.sub(r"^(\s*)[-*+]\s+", r"\1", text, flags=re.MULTILINE) + # Remove markdown numbered list markers (1. 2. 
etc.) at line start, keep content + text = re.sub(r"^(\s*)\d+\.\s+", r"\1", text, flags=re.MULTILINE) + # Remove markdown table pipes + text = re.sub(r"\|", " ", text) + # Remove markdown table separator lines (---:---:---) + text = re.sub(r"^[-: ]+$", "", text, flags=re.MULTILINE) + # Collapse multiple blank lines into one + text = re.sub(r"\n{3,}", "\n\n", text) + # Strip leading/trailing whitespace per line + lines = [line.strip() for line in text.splitlines()] + text = "\n".join(lines) + # Remove leading/trailing whitespace overall + text = text.strip() + return text + + +def llm_worker_process(llm_model_dir, ocr_text, max_new_tokens, result_queue, tensor_parallel_size=2): + """Worker function for LLM subprocess - loads model, extracts info, returns result.""" + try: + import time + from fastdeploy import LLM, SamplingParams + + import os + gpu_ids = ",".join(str(i) for i in range(tensor_parallel_size)) + os.environ["ILUVATAR_VISIBLE_DEVICES"] = gpu_ids + + # Load LLM model + print(f"[LLM Worker] Loading LLM model (ERNIE) with tensor_parallel_size={tensor_parallel_size}...") + start = time.perf_counter() + llm_model = LLM( + model=llm_model_dir, + tensor_parallel_size=tensor_parallel_size, + max_model_len=4096, + block_size=16, + quantization="wint8", + graph_optimization_config={"use_cudagraph": False}, + ) + elapsed = time.perf_counter() - start + print(f"[LLM Worker] LLM model loaded, elapsed: {elapsed:.2f}s") + + # Prepare prompt + prompt_text = f"""以下是药品说明书的 OCR 识别结果,供参考: + +{ocr_text} + +请根据以上 OCR 识别结果,提取并整理以下关键信息,用清晰易懂的语言重新表述,方便老年人阅读理解: + +1. 药品名称 +2. 药品适应症(这个药治什么病) +3. 药品的用法与用量(怎么吃、吃多少) +4. 药品的禁忌(什么人不能吃、什么情况不能吃) +5. 
药品的不良反应(吃药后可能出现的不舒服) + +要求: +- 只输出整理后的关键信息,不要重复或复述 OCR 原文 +- 用简洁、通俗的语言回答,避免使用专业术语 +- 不要使用表情符号、emoji +- 不要使用markdown格式符号(如#、**、-等),直接用纯文本输出 +- 用自然流畅的口语化表达,方便语音播报 +- 总字数控制在 {max_new_tokens} 字以内""" + + prompts = [prompt_text] + sampling_params = SamplingParams( + temperature=0.8, top_p=0.95, max_tokens=4096, + ) + + print(f"[LLM Worker] Generating response (max_new_tokens={max_new_tokens})...") + gen_start = time.perf_counter() + outputs = llm_model.generate(prompts, sampling_params) + result = outputs[0].outputs.text + gen_elapsed = time.perf_counter() - gen_start + + # Clean result + result = clean_for_tts(result) + print(f"[LLM Worker] Extraction done, gen elapsed: {gen_elapsed:.2f}s, result length: {len(result)}") + + # Put result in queue + result_queue.put(("success", result)) + + # Clean up + del llm_model + import gc + gc.collect() + print("[LLM Worker] LLM model released") + + except Exception as e: + import traceback + result_queue.put(("error", str(e) + "\n" + traceback.format_exc())) + + +def llm_step( + llm_model_dir, + ocr_text, + max_new_tokens=1024, + tensor_parallel_size=2, +): + """Execute the LLM extraction step in a subprocess.""" + step_start = time.perf_counter() + logger.info("[LLM Step] LLM extraction...") + + # Create subprocess for LLM + logger.info("[LLM Step] Starting LLM subprocess...") + result_queue = Queue() + llm_process = Process( + target=llm_worker_process, + args=(str(llm_model_dir), ocr_text, max_new_tokens, result_queue, tensor_parallel_size) + ) + llm_process.start() + + # Wait for result + status, result = result_queue.get() + llm_process.join() + llm_process.close() + + if status == "error": + logger.error("[LLM Step] LLM subprocess failed: %s", result) + raise RuntimeError(f"LLM subprocess failed: {result}") + + extracted_info = result + logger.info("[LLM Step] LLM extraction done, result length: %d, elapsed: %.2fs", len(extracted_info), time.perf_counter() - step_start) + + return {"extracted_info": extracted_info} + + +# 
============================================================================ +# TTS module (subprocess) +# ============================================================================ + +def tts_worker_process(text, output_path, result_queue, reduce_volume=False): + """Worker function for TTS subprocess - loads model, synthesizes speech, returns result.""" + try: + import time + import subprocess as _sp + from paddlespeech.cli.tts.infer import TTSExecutor + from scipy.io.wavfile import read as wav_read + + # Load TTS model + print("[TTS Worker] Loading TTS model (PaddleSpeech)...") + start = time.perf_counter() + tts_model = TTSExecutor() + elapsed = time.perf_counter() - start + print(f"[TTS Worker] TTS model loaded, elapsed: {elapsed:.2f}s") + + # Synthesize speech + print(f"[TTS Worker] Synthesis start, input text length: {len(text)}") + tts_model(text=text, output=output_path) + + if reduce_volume: + temp_path = output_path + ".tmp.wav" + os.rename(output_path, temp_path) + print("[TTS Worker] Reducing volume by -90dB via ffmpeg...") + _sp.run( + ["ffmpeg", "-i", temp_path, "-af", "volume=-90dB", output_path], + check=True, capture_output=True, + ) + os.remove(temp_path) + print("[TTS Worker] Volume reduction done") + + # Read audio data + sr, wav_data = wav_read(output_path) + + if wav_data is not None: + audio_duration = len(wav_data) / sr + print(f"[TTS Worker] Synthesis done, audio duration: {audio_duration:.2f}s, sample rate: {sr} Hz") + result_queue.put(("success", (sr, wav_data.tolist()))) # Convert to list for serialization + else: + print("[TTS Worker] Synthesis failed") + result_queue.put(("error", "TTS synthesis failed")) + + # Clean up + del tts_model + import gc + gc.collect() + print("[TTS Worker] TTS model released") + + except Exception as e: + import traceback + result_queue.put(("error", str(e) + "\n" + traceback.format_exc())) + + +def tts_step( + text, + output_path="output.wav", + reduce_volume=False, +): + """Execute the TTS synthesis step 
in a subprocess.""" + step_start = time.perf_counter() + logger.info("[TTS Step] TTS synthesis...") + + # Create subprocess for TTS + logger.info("[TTS Step] Starting TTS subprocess...") + result_queue = Queue() + tts_process = Process( + target=tts_worker_process, + args=(text, output_path, result_queue, reduce_volume) + ) + tts_process.start() + + # Wait for result + status, result = result_queue.get() + tts_process.join() + tts_process.close() + + if status == "error": + logger.error("[TTS Step] TTS subprocess failed: %s", result) + logger.warning("[TTS Step] TTS synthesis failed") + return {"audio": None} + + sr, wav_data_list = result + wav_data = np.array(wav_data_list, dtype=np.int16) # Convert back from list + + audio_duration = len(wav_data) / sr + logger.info("[TTS Step] TTS synthesis done, audio duration: %.2fs, elapsed: %.2fs", audio_duration, time.perf_counter() - step_start) + + return {"audio": (sr, wav_data)} + + +# ============================================================================ +# Pipeline +# ============================================================================ + +def drug_ocr_pipeline( + ocr_model_dir, + llm_model_dir, + image_path, + enable_split=True, + num_splits=4, + overlap_ratio=0.1, + ocr_max_new_tokens=5120, + llm_max_new_tokens=1024, + tensor_parallel_size=2, + reduce_volume=False, +): + """Drug instruction leaflet intelligent recognition and voice broadcast pipeline. + + Uses subprocess for each model to ensure proper memory cleanup. 
+ """ + pipeline_start = time.perf_counter() + logger.info("=" * 60) + logger.info("Drug OCR pipeline started (subprocess mode)") + logger.info(" Image path: %s", image_path) + logger.info(" Image split: %s (num_splits=%d, overlap=%.2f)", enable_split, num_splits, overlap_ratio) + logger.info("=" * 60) + + result = {} + + # Step 1: OCR (runs in subprocess, automatically cleaned up) + ocr_result = ocr_step( + ocr_model_dir=ocr_model_dir, + image_path=image_path, + enable_split=enable_split, + num_splits=num_splits, + overlap_ratio=overlap_ratio, + max_new_tokens=ocr_max_new_tokens, + ) + result["ocr_text"] = ocr_result["ocr_text"] + + # Step 2: LLM extraction (runs in subprocess, automatically cleaned up) + llm_result = llm_step( + llm_model_dir=llm_model_dir, + ocr_text=ocr_result["ocr_text"], + max_new_tokens=llm_max_new_tokens, + tensor_parallel_size=tensor_parallel_size, + ) + result["extracted_info"] = llm_result["extracted_info"] + + # Step 3: TTS synthesis (runs in subprocess, automatically cleaned up) + tts_result = tts_step( + text=llm_result["extracted_info"], + reduce_volume=reduce_volume, + ) + result["audio"] = tts_result["audio"] + + pipeline_elapsed = time.perf_counter() - pipeline_start + logger.info("=" * 60) + logger.info("Pipeline complete, total elapsed: %.2fs", pipeline_elapsed) + logger.info("=" * 60) + + return result + + +# ============================================================================ +# CLI entry point +# ============================================================================ + +def main(): + parser = argparse.ArgumentParser( + description="Drug instruction leaflet intelligent recognition and voice broadcast pipeline", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog="""Examples: + python drug_ocr_cli.py --image resource/1.jpg + python drug_ocr_cli.py --image resource/1.jpg --no-split + python drug_ocr_cli.py --image resource/1.jpg --num-splits 9 --overlap 0.15 + python drug_ocr_cli.py --image 
resource/1.jpg --ocr-tokens 5120 --llm-tokens 1024 +""", + ) + parser.add_argument("--image", required=True, help="Path to the drug instruction leaflet image") + parser.add_argument("--ocr-model", default="baidu/PaddleOCR-VL-1.5", help="OCR model directory (default: baidu/PaddleOCR-VL-1.5)") + parser.add_argument("--llm-model", default="baidu/ERNIE-4.5-0.3B-Paddle", help="LLM model directory (default: baidu/ERNIE-4.5-0.3B-Paddle)") + parser.add_argument("--no-split", dest="enable_split", action="store_false", help="Disable image splitting") + parser.add_argument("--num-splits", type=int, default=4, choices=[4, 9, 16], help="Number of image splits (must be perfect square, default: 4)") + parser.add_argument("--overlap", type=float, default=0.1, help="Overlap ratio for image splits (default: 0.1)") + parser.add_argument("--ocr-tokens", type=int, default=5120, help="OCR max new tokens (default: 5120)") + parser.add_argument("--llm-tokens", type=int, default=1024, help="LLM max new tokens (default: 1024)") + parser.add_argument("--tensor-parallel-size", type=int, default=2, choices=[1, 2], help="Tensor parallel size for LLM (default: 2)") + parser.add_argument("--reduce-volume", action="store_true", help="Apply ffmpeg volume=-90dB to TTS output audio") + parser.add_argument("--output-audio", default=None, help="Output audio file path (default: output.wav in current directory)") + parser.add_argument("--output-text", default=None, help="Output extracted text file path (default: print to stdout only)") + + args = parser.parse_args() + + # Configure logging + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s [%(name)s] %(levelname)s: %(message)s", + datefmt="%H:%M:%S", + ) + + # Validate image path + if not os.path.isfile(args.image): + print(f"Error: Image file not found: {args.image}", file=sys.stderr) + sys.exit(1) + + # Validate model directories + # if not Path(args.ocr_model).exists(): + # print(f"Error: OCR model directory not found: 
{args.ocr_model}", file=sys.stderr) + # sys.exit(1) + # if not Path(args.llm_model).exists(): + # print(f"Error: LLM model directory not found: {args.llm_model}", file=sys.stderr) + # sys.exit(1) + + # Run pipeline + result = drug_ocr_pipeline( + ocr_model_dir=args.ocr_model, + llm_model_dir=args.llm_model, + image_path=args.image, + enable_split=args.enable_split, + num_splits=args.num_splits, + overlap_ratio=args.overlap, + ocr_max_new_tokens=args.ocr_tokens, + llm_max_new_tokens=args.llm_tokens, + tensor_parallel_size=args.tensor_parallel_size, + reduce_volume=args.reduce_volume, + ) + + # Print results + print("\n" + "=" * 60) + print("OCR Result:") + print("=" * 60) + print(result["ocr_text"]) + + print("\n" + "=" * 60) + print("Extracted Info:") + print("=" * 60) + print(result["extracted_info"]) + + # Save extracted text if requested + if args.output_text: + with open(args.output_text, "w", encoding="utf-8") as f: + f.write(result["extracted_info"]) + print(f"\nExtracted text saved to: {args.output_text}") + + # Save audio + if result["audio"] is not None: + sr, wav_data = result["audio"] + audio_path = args.output_audio or "output.wav" + # Write the int16 PCM samples as-is; an unscaled cast to float32 would fall outside the [-1.0, 1.0] range of a float WAV + wav_write(audio_path, sr, wav_data) + audio_duration = len(wav_data) / sr + print(f"\nAudio saved to: {audio_path} (duration: {audio_duration:.2f}s)") + else: + print("\nTTS synthesis failed, no audio output.") + + +if __name__ == "__main__": + main() + diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_ok.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_ok.png new file mode 100644 index 00000000..d1b3e20f Binary files /dev/null and b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_ok.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_prompt.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_prompt.png new file mode 100644 index 00000000..d2efd019 Binary files /dev/null and
b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/cli_prompt.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/ernie_28b.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/ernie_28b.png new file mode 100644 index 00000000..07a5aadc Binary files /dev/null and b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/ernie_28b.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook.png new file mode 100644 index 00000000..1363c70a Binary files /dev/null and b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_input.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_input.png new file mode 100644 index 00000000..84853809 Binary files /dev/null and b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_input.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_output.png b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_output.png new file mode 100644 index 00000000..d2f6e2af Binary files /dev/null and b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/images/notebook_output.png differ diff --git a/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/medical_pipeline_20260503.ipynb b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/medical_pipeline_20260503.ipynb new file mode 100644 index 00000000..ec34bd1e --- /dev/null +++ b/WeeklyReports/Hackathon_10th/ERNIEPartner/15_megemini/medical_pipeline_20260503.ipynb @@ -0,0 +1,1512 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a1b2c3d4", + "metadata": {}, + "source": [ + "# 药品说明书智能识别与语音播报系统\n", + "\n", + "## 项目说明\n", + "\n", + "针对药品说明书字体太小、老年人看不清读不懂的问题,本项目通过以下三个步骤,将药品说明书中的重点内容识别提取并语音播报:\n", + "\n", + "1. 
**OCR 识别**:使用 PaddleOCR-VL-1.5 模型对药品说明书图片进行文字识别\n", + "2. **大模型整理**:使用 ERNIE-4.5 大模型对识别的文字进行整理,提取关键信息\n", + "3. **语音合成播报**:使用 PaddleSpeech 语音合成模型将整理后的文字转为音频文件\n", + "\n", + "### 提取的关键信息包括:\n", + "1. 药品名称\n", + "2. 药品适应症\n", + "3. 药品的用法与用量\n", + "4. 药品的禁忌\n", + "5. 药品的不良反应\n", + "\n", + "### 技术栈:\n", + "- OCR: PaddleOCR-VL-1.5\n", + "- LLM: ERNIE-4.5-0.3B-Paddle\n", + "- TTS: PaddleSpeech bert-base-chinese\n", + "\n", + "### 内存优化(子进程模式):\n", + "为确保内存完全释放,本系统采用**子进程模式**运行每个模型:\n", + "- 每个模型在独立的子进程中加载和执行\n", + "- 子进程完成后自动销毁,确保内存完全释放\n", + "- 主进程仅负责数据传递和流程控制,不加载模型\n", + "- 例如:OCR 在子进程运行,完成后子进程销毁,再启动 LLM 子进程\n", + "\n", + "#### 目录:\n", + "- [模型下载与检查](#模型下载与检查)\n", + "- [生成参数设置](#生成参数设置)\n", + "- [OCR 模块](#OCR-模块)\n", + "- [LLM 模块](#LLM-模块)\n", + "- [TTS 模块](#TTS-模块)\n", + "- [管线编排与模型管理](#管线编排与模型管理)\n", + "- [主流程](#主流程)\n", + "- [Gradio 交互界面](#Gradio-交互界面)" + ] + }, + { + "cell_type": "markdown", + "id": "088dfe7b-8df9-47d3-b94d-70db4eb1a2a9", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:08:00.811267Z", + "iopub.status.busy": "2026-05-03T06:08:00.811134Z" + } + }, + "source": [ + "%pip install -r requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "bdb8d7d5", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:11:06.332147Z", + "iopub.status.busy": "2026-05-03T06:11:06.332023Z", + "iopub.status.idle": "2026-05-03T06:11:11.425006Z", + "shell.execute_reply": "2026-05-03T06:11:11.423507Z", + "shell.execute_reply.started": "2026-05-03T06:11:06.332128Z" + }, + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found existing installation: opencc-python-reimplemented 0.1.6\r\n", + "Uninstalling opencc-python-reimplemented-0.1.6:\r\n", + " Successfully uninstalled opencc-python-reimplemented-0.1.6\r\n", + "Note: you may need to restart the kernel to use updated packages.\r\n", + "Looking in indexes: http://mirrors.baidubce.com/pypi/simple/\r\n", 
+ "Collecting opencc-python-reimplemented==0.1.6\r\n", + " Using cached opencc_python_reimplemented-0.1.6-py2.py3-none-any.whl\r\n", + "Installing collected packages: opencc-python-reimplemented\r\n", + "Successfully installed opencc-python-reimplemented-0.1.6\r\n", + "Note: you may need to restart the kernel to use updated packages.\r\n", + "Found existing installation: aistudio-sdk 0.3.8\r\n", + "Uninstalling aistudio-sdk-0.3.8:\r\n", + " Successfully uninstalled aistudio-sdk-0.3.8\r\n", + "Note: you may need to restart the kernel to use updated packages.\r\n", + "Looking in indexes: http://mirrors.baidubce.com/pypi/simple/\r\n", + "Collecting aistudio-sdk==0.3.8\r\n", + " Using cached http://mirrors.baidubce.com/pypi/packages/cb/77/cd71a481bb7a76b0e9d0b6bf47711c627b1dd079001ea246893f19a9d04c/aistudio_sdk-0.3.8-py3-none-any.whl (62 kB)\r\n", + "Requirement already satisfied: psutil in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (7.2.1)\r\n", + "Requirement already satisfied: requests in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (2.32.5)\r\n", + "Requirement already satisfied: tqdm in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (4.67.1)\r\n", + "Requirement already satisfied: bce-python-sdk in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (0.9.59)\r\n", + "Requirement already satisfied: prettytable in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (3.17.0)\r\n", + "Requirement already satisfied: click in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from aistudio-sdk==0.3.8) (8.3.1)\r\n", + "Requirement already satisfied: pycryptodome>=3.8.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from bce-python-sdk->aistudio-sdk==0.3.8) (3.23.0)\r\n", + "Requirement already 
satisfied: future>=0.6.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from bce-python-sdk->aistudio-sdk==0.3.8) (1.0.0)\r\n", + "Requirement already satisfied: six>=1.4.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from bce-python-sdk->aistudio-sdk==0.3.8) (1.17.0)\r\n", + "Requirement already satisfied: wcwidth in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from prettytable->aistudio-sdk==0.3.8) (0.2.14)\r\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from requests->aistudio-sdk==0.3.8) (3.4.4)\r\n", + "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from requests->aistudio-sdk==0.3.8) (3.11)\r\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in ./external-libraries/lib/python3.10/site-packages (from requests->aistudio-sdk==0.3.8) (1.26.20)\r\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from requests->aistudio-sdk==0.3.8) (2026.1.4)\r\n", + "Installing collected packages: aistudio-sdk\r\n", + "\u001b[33m WARNING: The script aistudio is installed in '/home/aistudio/external-libraries/bin' which is not on PATH.\r\n", + " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\r\n", + "\u001b[0mSuccessfully installed aistudio-sdk-0.3.8\r\n", + "Note: you may need to restart the kernel to use updated packages.\r\n" + ] + } + ], + "source": [ + "%pip uninstall opencc-python-reimplemented -y\n", + "%pip install opencc-python-reimplemented==0.1.6\n", + "%pip uninstall aistudio-sdk -y\n", + "%pip install aistudio-sdk==0.3.8\n", + "# PaddleSpeech use 0.2.6 with should be patched" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "71b16cd1", + "metadata": { 
+ "execution": { + "iopub.execute_input": "2026-05-03T06:11:11.426579Z", + "iopub.status.busy": "2026-05-03T06:11:11.426259Z", + "iopub.status.idle": "2026-05-03T06:11:11.436812Z", + "shell.execute_reply": "2026-05-03T06:11:11.435719Z", + "shell.execute_reply.started": "2026-05-03T06:11:11.426550Z" + }, + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File already patched.\r\n" + ] + } + ], + "source": [ + "\"\"\"Patch script to fix aistudio_sdk import in paddlenlp.\n", + "\n", + "Uses importlib.util.find_spec to locate paddlenlp WITHOUT importing it,\n", + "so this can be run before paddlenlp is imported to prevent the ImportError.\n", + "\"\"\"\n", + "\n", + "import importlib.util\n", + "import os\n", + "import subprocess\n", + "\n", + "\n", + "def _find_paddlenlp_dir():\n", + " # Method 1: find_spec (no import, just metadata)\n", + " spec = importlib.util.find_spec(\"paddlenlp\")\n", + " if spec and spec.origin:\n", + " return os.path.dirname(spec.origin)\n", + "\n", + " # Method 2: pip show as fallback\n", + " result = subprocess.run(\n", + " [\"pip\", \"show\", \"paddlenlp\"],\n", + " capture_output=True, text=True,\n", + " )\n", + " for line in result.stdout.splitlines():\n", + " if line.startswith(\"Location:\"):\n", + " return os.path.join(line.split(\":\", 1)[1].strip(), \"paddlenlp\")\n", + "\n", + " raise RuntimeError(\"Cannot locate paddlenlp installation directory\")\n", + "\n", + "\n", + "def patch_aistudio_utils():\n", + " pkg_dir = _find_paddlenlp_dir()\n", + " target_file = os.path.join(pkg_dir, \"transformers\", \"aistudio_utils.py\")\n", + "\n", + " if not os.path.isfile(target_file):\n", + " raise FileNotFoundError(f\"Target file not found: {target_file}\")\n", + "\n", + " old_line = \"from aistudio_sdk.hub import download\"\n", + " new_line = \"from aistudio_sdk import snapshot_download as download\"\n", + "\n", + " with open(target_file, \"r\", encoding=\"utf-8\") as f:\n", + " content = 
f.read()\n", + "\n", + " if old_line not in content:\n", + " if new_line in content:\n", + " print(\"File already patched.\")\n", + " else:\n", + " print(f\"Target import not found in {target_file}\")\n", + " return\n", + "\n", + " patched = content.replace(old_line, new_line)\n", + "\n", + " with open(target_file, \"w\", encoding=\"utf-8\") as f:\n", + " f.write(patched)\n", + "\n", + " print(f\"Patched: {target_file}\")\n", + " print(f\" {old_line} => {new_line}\")\n", + "\n", + "\n", + "patch_aistudio_utils()\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9d0e1f2", + "metadata": {}, + "source": [ + "## 模型下载与检查\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "从 AIStudio 下载三个模型(如果已存在则跳过),并检查模型文件是否完整。\n", + "\n", + "> **注意**:此步骤仅下载和检查模型,**不加载模型到内存**。模型将在管线运行时按需加载。" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "a3b4c5d6", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:12.765664Z", + "iopub.status.busy": "2026-05-03T06:14:12.765530Z", + "iopub.status.idle": "2026-05-03T06:14:12.771812Z", + "shell.execute_reply": "2026-05-03T06:14:12.770795Z", + "shell.execute_reply.started": "2026-05-03T06:14:12.765642Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OCR 模型已存在: baidu/PaddleOCR-VL-1.5,跳过下载\r\n", + "LLM 模型已存在: baidu/ERNIE-4.5-0.3B-Paddle,跳过下载\r\n", + "TTS 模型将在首次使用时自动下载\r\n" + ] + } + ], + "source": [ + "from pathlib import Path\n", + "import subprocess\n", + "\n", + "# --- OCR 模型 ---\n", + "ocr_model_dir = Path(\"baidu/PaddleOCR-VL-1.5\")\n", + "\n", + "if not ocr_model_dir.exists():\n", + " subprocess.run([\"aistudio\", \"download\", \"--model\", \"PaddlePaddle/PaddleOCR-VL-1.5\", \"--local_dir\", str(ocr_model_dir)], check=True)\n", + " print(f\"OCR 模型已下载到: {ocr_model_dir}\")\n", + "else:\n", + " print(f\"OCR 模型已存在: {ocr_model_dir},跳过下载\")\n", + "\n", + "# --- LLM 模型 ---\n", + "llm_model_dir = Path(\"baidu/ERNIE-4.5-0.3B-Paddle\")\n", + "\n", + 
"if not llm_model_dir.exists():\n", + " subprocess.run([\"aistudio\", \"download\", \"--model\", \"PaddlePaddle/ERNIE-4.5-0.3B-Paddle\", \"--local_dir\", str(llm_model_dir)], check=True)\n", + " print(f\"LLM 模型已下载到: {llm_model_dir}\")\n", + "else:\n", + " print(f\"LLM 模型已存在: {llm_model_dir},跳过下载\")\n", + "\n", + "# --- TTS 模型 ---\n", + "# PaddleSpeech bert-base-chinese 会在首次使用时自动下载\n", + "print(\"TTS 模型将在首次使用时自动下载\")" + ] + }, + { + "cell_type": "markdown", + "id": "e7f8a9b0", + "metadata": {}, + "source": [ + "## 生成参数设置\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "设置模型的 `max_new_tokens` 参数,控制每个模型生成的最大 token 数量:\n", + "- **OCR max_new_tokens**:PaddleOCR-VL 识别文字时的最大生成长度,说明书内容多时建议调大\n", + "- **LLM max_new_tokens**:ERNIE 提取信息时的最大生成长度,需要更详细整理时可调大" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c1d2e3f4", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:12.773035Z", + "iopub.status.busy": "2026-05-03T06:14:12.772880Z", + "iopub.status.idle": "2026-05-03T06:14:12.777084Z", + "shell.execute_reply": "2026-05-03T06:14:12.775954Z", + "shell.execute_reply.started": "2026-05-03T06:14:12.773015Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OCR max_new_tokens: 200\r\n", + "LLM max_new_tokens: 200\r\n" + ] + } + ], + "source": [ + "# OCR 最大生成 token 数(说明书内容多时建议调大,默认 5120)\n", + "ocr_max_new_tokens = 200\n", + "\n", + "# LLM 最大生成 token 数(需要更详细整理时可调大,默认 1024)\n", + "llm_max_new_tokens = 200\n", + "\n", + "print(f\"OCR max_new_tokens: {ocr_max_new_tokens}\")\n", + "print(f\"LLM max_new_tokens: {llm_max_new_tokens}\")" + ] + }, + { + "cell_type": "markdown", + "id": "md_ocr_module", + "metadata": {}, + "source": [ + "## OCR 模块\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "包含图片分割、OCR 子进程工作函数,以及可独立执行的 `ocr_step`。\n", + "\n", + "**子进程模式**:OCR 模型在独立子进程中加载和执行,完成后子进程自动销毁,确保内存完全释放。" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "code_ocr_module", + 
"metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:12.777863Z", + "iopub.status.busy": "2026-05-03T06:14:12.777725Z", + "iopub.status.idle": "2026-05-03T06:14:12.993437Z", + "shell.execute_reply": "2026-05-03T06:14:12.992174Z", + "shell.execute_reply.started": "2026-05-03T06:14:12.777846Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ OCR 模块定义完成 (子进程模式)\r\n" + ] + } + ], + "source": [ + "import base64\n", + "import gc\n", + "import io\n", + "import logging\n", + "import math\n", + "import time\n", + "import multiprocessing as mp\n", + "from multiprocessing import Process, Queue\n", + "\n", + "from PIL import Image\n", + "\n", + "logger = logging.getLogger(\"drug_ocr\")\n", + "\n", + "\n", + "# ---- 图片分割 ----\n", + "\n", + "def split_image(image, num_splits=4, overlap_ratio=0.1):\n", + " \"\"\"Split an image into num_splits parts (NxN grid) with overlap.\"\"\"\n", + " grid_size = int(math.sqrt(num_splits))\n", + " if grid_size * grid_size != num_splits:\n", + " raise ValueError(f\"num_splits must be a perfect square (e.g. 
4, 9, 16), got: {num_splits}\")\n", + "\n", + " w, h = image.size\n", + " cell_w = w / grid_size\n", + " cell_h = h / grid_size\n", + " overlap_w = cell_w * overlap_ratio\n", + " overlap_h = cell_h * overlap_ratio\n", + "\n", + " sub_images = []\n", + " for row in range(grid_size):\n", + " for col in range(grid_size):\n", + " left = max(0, col * cell_w - overlap_w)\n", + " upper = max(0, row * cell_h - overlap_h)\n", + " right = min(w, (col + 1) * cell_w + overlap_w)\n", + " lower = min(h, (row + 1) * cell_h + overlap_h)\n", + " sub_img = image.crop((int(left), int(upper), int(right), int(lower)))\n", + " sub_images.append(sub_img)\n", + "\n", + " return sub_images\n", + "\n", + "\n", + "# ---- OCR 子进程工作函数 ----\n", + "\n", + "def ocr_worker_process(ocr_model_dir, image_data_list, max_new_tokens, result_queue):\n", + " \"\"\"Worker function for OCR subprocess - loads model, performs OCR, returns result.\"\"\"\n", + " try:\n", + " import time\n", + " import base64\n", + " import io\n", + " from PIL import Image\n", + " from fastdeploy import LLM, SamplingParams\n", + "\n", + " # Load OCR model\n", + " print(\"[OCR Worker] 加载 OCR 模型 (PaddleOCR-VL)...\")\n", + " start = time.perf_counter()\n", + " ocr_model = LLM(\n", + " model=ocr_model_dir,\n", + " tensor_parallel_size=1,\n", + " max_model_len=8192,\n", + " block_size=16,\n", + " quantization=\"wint8\",\n", + " graph_optimization_config={\"use_cudagraph\": False},\n", + " )\n", + " elapsed = time.perf_counter() - start\n", + " print(f\"[OCR Worker] OCR 模型加载完成, 耗时: {elapsed:.2f}s\")\n", + "\n", + " # Process each image\n", + " all_ocr_texts = []\n", + " for i, img_bytes in enumerate(image_data_list):\n", + " image = Image.open(io.BytesIO(img_bytes)).convert(\"RGB\")\n", + " print(f\"[OCR Worker] 识别图片 {i+1}/{len(image_data_list)}, 尺寸: {image.size}\")\n", + "\n", + " # Prepare image for OCR\n", + " buf = io.BytesIO()\n", + " image.save(buf, format=\"PNG\")\n", + " base64_image = 
base64.b64encode(buf.getvalue()).decode(\"utf-8\")\n", + " image_url = f\"data:image/png;base64,{base64_image}\"\n", + "\n", + " prompts = [{\n", + " \"messages\": [{\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", + " {\"type\": \"text\", \"text\": \"OCR:\"},\n", + " ],\n", + " }]\n", + " }]\n", + " sampling_params = SamplingParams(\n", + " temperature=0.8, top_p=0.95, max_tokens=max_new_tokens,\n", + " )\n", + " outputs = ocr_model.generate(prompts, sampling_params)\n", + " response = outputs[0].outputs.text\n", + " all_ocr_texts.append(response)\n", + " print(f\"[OCR Worker] 图片 {i+1} 识别完成, 文字长度: {len(response)}\")\n", + "\n", + " # Combine results\n", + " combined_text = \"\\n\\n\".join(all_ocr_texts)\n", + " print(f\"[OCR Worker] 全部识别完成, 总文字长度: {len(combined_text)}\")\n", + "\n", + " # Put result in queue\n", + " result_queue.put((\"success\", combined_text))\n", + "\n", + " # Clean up\n", + " del ocr_model\n", + " import gc\n", + " gc.collect()\n", + " print(\"[OCR Worker] OCR 模型已释放\")\n", + "\n", + " except Exception as e:\n", + " import traceback\n", + " result_queue.put((\"error\", str(e) + \"\\n\" + traceback.format_exc()))\n", + "\n", + "\n", + "# ---- 独立 OCR 步骤 (使用子进程) ----\n", + "\n", + "def ocr_step(\n", + " ocr_model_dir,\n", + " image_path,\n", + " enable_split=True,\n", + " num_splits=4,\n", + " overlap_ratio=0.1,\n", + " max_new_tokens=5120,\n", + "):\n", + " \"\"\"Execute the OCR step in a subprocess: load image, optionally split, and run OCR.\"\"\"\n", + " step_start = time.perf_counter()\n", + " logger.info(\"[OCR Step] 加载图片...\")\n", + " image = Image.open(image_path).convert(\"RGB\")\n", + " logger.info(\"[OCR Step] 图片加载完成, 尺寸: %s\", image.size)\n", + "\n", + " if enable_split:\n", + " logger.info(\"[OCR Step] 图片分割 (num_splits=%d, overlap=%.2f)...\", num_splits, overlap_ratio)\n", + " sub_images = split_image(image, num_splits=num_splits, 
overlap_ratio=overlap_ratio)\n", + " ocr_images = [image] + sub_images\n", + " logger.info(\"[OCR Step] 图片分割完成, 原始1张 + 分割%d张 = 共%d张\", len(sub_images), len(ocr_images))\n", + " else:\n", + " logger.info(\"[OCR Step] 跳过图片分割\")\n", + " ocr_images = [image]\n", + "\n", + " # Serialize images to bytes for subprocess\n", + " image_data_list = []\n", + " for img in ocr_images:\n", + " buf = io.BytesIO()\n", + " img.save(buf, format=\"PNG\")\n", + " image_data_list.append(buf.getvalue())\n", + "\n", + " # Create subprocess for OCR\n", + " logger.info(\"[OCR Step] 启动 OCR 子进程...\")\n", + " result_queue = Queue()\n", + " ocr_process = Process(\n", + " target=ocr_worker_process,\n", + " args=(str(ocr_model_dir), image_data_list, max_new_tokens, result_queue)\n", + " )\n", + " ocr_process.start()\n", + "\n", + " # Wait for result\n", + " status, result = result_queue.get()\n", + " ocr_process.join()\n", + " ocr_process.close()\n", + "\n", + " if status == \"error\":\n", + " logger.error(\"[OCR Step] OCR 子进程执行失败: %s\", result)\n", + " raise RuntimeError(f\"OCR subprocess failed: {result}\")\n", + "\n", + " combined_ocr_text = result\n", + " logger.info(\"[OCR Step] OCR 识别全部完成, 总文字长度: %d, 耗时: %.2fs\", len(combined_ocr_text), time.perf_counter() - step_start)\n", + "\n", + " return {\"ocr_text\": combined_ocr_text, \"ocr_images\": ocr_images}\n", + "\n", + "print(\"✅ OCR 模块定义完成 (子进程模式)\")" + ] + }, + { + "cell_type": "markdown", + "id": "md_llm_module", + "metadata": {}, + "source": [ + "## LLM 模块\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "包含文本清洗(`clean_for_tts`)、LLM 子进程工作函数,以及可独立执行的 `llm_step`。\n", + "\n", + "**子进程模式**:LLM 模型在独立子进程中加载和执行,完成后子进程自动销毁,确保内存完全释放。" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "code_llm_module", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:12.994715Z", + "iopub.status.busy": "2026-05-03T06:14:12.994434Z", + "iopub.status.idle": "2026-05-03T06:14:13.009434Z", + "shell.execute_reply": 
"2026-05-03T06:14:13.008415Z", + "shell.execute_reply.started": "2026-05-03T06:14:12.994692Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ LLM 模块定义完成 (子进程模式)\r\n" + ] + } + ], + "source": [ + "import re\n", + "from multiprocessing import Process, Queue\n", + "\n", + "\n", + "# ---- 文本清洗 ----\n", + "\n", + "def clean_for_tts(text):\n", + " \"\"\"Clean text for TTS synthesis by removing emojis and markdown formatting.\"\"\"\n", + " # Remove emojis (Unicode ranges for common emojis)\n", + " # NOTE: Must avoid ranges that overlap with CJK characters (U+4E00-U+9FFF)\n", + " text = re.sub(\n", + " r\"[\\U0001F600-\\U0001F64F\" # emoticons\n", + " r\"\\U0001F300-\\U0001F5FF\" # symbols & pictographs\n", + " r\"\\U0001F680-\\U0001F6FF\" # transport & map\n", + " r\"\\U0001F1E0-\\U0001F1FF\" # flags\n", + " r\"\\U00002702-\\U000027B0\" # dingbats\n", + " r\"\\U000024C2-\\U00002BFF\" # enclosed alphanumerics through misc symbols; stops before CJK punctuation (U+3000-U+303F) so 。 and 、 are kept\n", + " r\"\\U0001F200-\\U0001F251\" # enclosed CJK supplement (above CJK range)\n", + " r\"\\U0001F900-\\U0001F9FF\" # supplemental symbols\n", + " r\"\\U0001FA00-\\U0001FA6F\" # chess symbols\n", + " r\"\\U0001FA70-\\U0001FAFF\" # symbols extended-A\n", + " r\"\\U00002600-\\U000026FF\" # misc symbols\n", + " r\"\\U0000FE00-\\U0000FE0F\" # variation selectors\n", + " r\"\\U0000200D\" # zero-width joiner\n", + " r\"]+\",\n", + " \"\",\n", + " text,\n", + " )\n", + " # Remove markdown code blocks (```...```)\n", + " text = re.sub(r\"```.*?```\", \"\", text, flags=re.DOTALL)\n", + " # Remove inline code (`...`) -> content\n", + " text = re.sub(r\"`([^`\\n]+)`\", r\"\\1\", text)\n", + " # Remove markdown headers (# ## ### etc.)
at line start\n", + " text = re.sub(r\"^#{1,6}\\s+\", \"\", text, flags=re.MULTILINE)\n", + " # Remove markdown bold (**text**) -> text\n", + " text = re.sub(r\"\\*\\*([^*\\n]+?)\\*\\*\", r\"\\1\", text)\n", + " # Remove markdown bold (__text__) -> text\n", + " text = re.sub(r\"__([^_\\n]+?)__\", r\"\\1\", text)\n", + " # Remove markdown italic (*text*) -> text\n", + " text = re.sub(r\"\\*([^*\\n]+?)\\*\", r\"\\1\", text)\n", + " # Remove markdown italic (_text_) -> text (only when _ is at word boundary)\n", + " text = re.sub(r\"(?<!\\w)_([^_\\n]+?)_(?!\\w)\", r\"\\1\", text)\n", + " # Remove markdown images ![alt](url) (before links, otherwise the link rule leaves a stray \"!alt\")\n", + " text = re.sub(r\"!\\[[^\\]]*\\]\\([^)]+\\)\", \"\", text)\n", + " # Remove markdown links [text](url) -> text\n", + " text = re.sub(r\"\\[([^\\]]+)\\]\\([^)]+\\)\", r\"\\1\", text)\n", + " # Remove markdown horizontal rules (---, ***, ___)\n", + " text = re.sub(r\"^[-*_]{3,}\\s*$\", \"\", text, flags=re.MULTILINE)\n", + " # Remove markdown bullet list markers (- , * , + ) at line start, keep content\n", + " text = re.sub(r\"^(\\s*)[-*+]\\s+\", r\"\\1\", text, flags=re.MULTILINE)\n", + " # Remove markdown numbered list markers (1. 2. etc.)
at line start, keep content\n", + " text = re.sub(r\"^(\\s*)\\d+\\.\\s+\", r\"\\1\", text, flags=re.MULTILINE)\n", + " # Remove markdown table pipes\n", + " text = re.sub(r\"\\|\", \" \", text)\n", + " # Remove markdown table separator lines (---:---:---)\n", + " text = re.sub(r\"^[-: ]+$\", \"\", text, flags=re.MULTILINE)\n", + " # Collapse multiple blank lines into one\n", + " text = re.sub(r\"\\n{3,}\", \"\\n\\n\", text)\n", + " # Strip leading/trailing whitespace per line\n", + " lines = [line.strip() for line in text.splitlines()]\n", + " text = \"\\n\".join(lines)\n", + " # Remove leading/trailing whitespace overall\n", + " text = text.strip()\n", + " return text\n", + "\n", + "\n", + "# ---- LLM 子进程工作函数 ----\n", + "\n", + "def llm_worker_process(llm_model_dir, ocr_text, max_new_tokens, result_queue):\n", + " \"\"\"Worker function for LLM subprocess - loads model, extracts info, returns result.\"\"\"\n", + " try:\n", + " import time\n", + " from fastdeploy import LLM, SamplingParams\n", + "\n", + " # Load LLM model\n", + " print(\"[LLM Worker] 加载 LLM 模型 (ERNIE)...\")\n", + " start = time.perf_counter()\n", + " llm_model = LLM(\n", + " model=llm_model_dir,\n", + " tensor_parallel_size=1,\n", + " max_model_len=8192,\n", + " block_size=16,\n", + " quantization=\"wint8\",\n", + " graph_optimization_config={\"use_cudagraph\": False},\n", + " )\n", + " elapsed = time.perf_counter() - start\n", + " print(f\"[LLM Worker] LLM 模型加载完成, 耗时: {elapsed:.2f}s\")\n", + "\n", + " # Prepare prompt\n", + " prompt_text = f\"\"\"以下是药品说明书的 OCR 识别结果,供参考:\n", + "\n", + "{ocr_text}\n", + "\n", + "请根据以上 OCR 识别结果,提取并整理以下关键信息,用清晰易懂的语言重新表述,方便老年人阅读理解:\n", + "\n", + "1. 药品名称\n", + "2. 药品适应症(这个药治什么病)\n", + "3. 药品的用法与用量(怎么吃、吃多少)\n", + "4. 药品的禁忌(什么人不能吃、什么情况不能吃)\n", + "5. 
药品的不良反应(吃药后可能出现的不舒服)\n", + "\n", + "要求:\n", + "- 只输出整理后的关键信息,不要重复或复述 OCR 原文\n", + "- 用简洁、通俗的语言回答,避免使用专业术语\n", + "- 不要使用表情符号、emoji\n", + "- 不要使用markdown格式符号(如#、**、-等),直接用纯文本输出\n", + "- 用自然流畅的口语化表达,方便语音播报\n", + "- 总字数控制在 {max_new_tokens} 字以内\"\"\"\n", + "\n", + "\n", + "\n", + " # todo\n", + " prompt_text = \"你是谁\"\n", + "\n", + " prompts = [prompt_text]\n", + " sampling_params = SamplingParams(\n", + " temperature=0.8, top_p=0.95, max_tokens=max_new_tokens,\n", + " )\n", + "\n", + " print(f\"[LLM Worker] 正在生成回复 (max_new_tokens={max_new_tokens})...\")\n", + " gen_start = time.perf_counter()\n", + " outputs = llm_model.generate(prompts, sampling_params)\n", + " result = outputs[0].outputs.text\n", + " gen_elapsed = time.perf_counter() - gen_start\n", + "\n", + " # Clean result\n", + " result = clean_for_tts(result)\n", + " print(f\"[LLM Worker] 信息提取完成, 生成耗时: {gen_elapsed:.2f}s, 结果长度: {len(result)}\")\n", + "\n", + " # todo\n", + " print(\">>>\", result)\n", + "\n", + " # Put result in queue\n", + " result_queue.put((\"success\", result))\n", + "\n", + " # Clean up\n", + " del llm_model\n", + " import gc\n", + " gc.collect()\n", + " print(\"[LLM Worker] LLM 模型已释放\")\n", + "\n", + " except Exception as e:\n", + " import traceback\n", + " result_queue.put((\"error\", str(e) + \"\\n\" + traceback.format_exc()))\n", + "\n", + "\n", + "# ---- 独立 LLM 步骤 (使用子进程) ----\n", + "\n", + "def llm_step(\n", + " llm_model_dir,\n", + " ocr_text,\n", + " max_new_tokens=1024,\n", + "):\n", + " \"\"\"Execute the LLM extraction step in a subprocess.\"\"\"\n", + " step_start = time.perf_counter()\n", + " logger.info(\"[LLM Step] LLM 大模型信息提取...\")\n", + "\n", + " # Create subprocess for LLM\n", + " logger.info(\"[LLM Step] 启动 LLM 子进程...\")\n", + " result_queue = Queue()\n", + " llm_process = Process(\n", + " target=llm_worker_process,\n", + " args=(str(llm_model_dir), ocr_text, max_new_tokens, result_queue)\n", + " )\n", + " llm_process.start()\n", + "\n", + " # Wait for result\n", + " 
status, result = result_queue.get()\n", + " llm_process.join()\n", + " llm_process.close()\n", + "\n", + " if status == \"error\":\n", + " logger.error(\"[LLM Step] LLM 子进程执行失败: %s\", result)\n", + " raise RuntimeError(f\"LLM subprocess failed: {result}\")\n", + "\n", + " extracted_info = result\n", + " logger.info(\"[LLM Step] LLM 信息提取完成, 结果长度: %d, 耗时: %.2fs\", len(extracted_info), time.perf_counter() - step_start)\n", + "\n", + " return {\"extracted_info\": extracted_info}\n", + "\n", + "print(\"✅ LLM 模块定义完成 (子进程模式)\")" + ] + }, + { + "cell_type": "markdown", + "id": "md_tts_module", + "metadata": {}, + "source": [ + "## TTS 模块\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "包含 TTS 子进程工作函数,以及可独立执行的 `tts_step`。\n", + "\n", + "**子进程模式**:TTS 模型在独立子进程中加载和执行,完成后子进程自动销毁,确保内存完全释放。" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "code_tts_module", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:13.010352Z", + "iopub.status.busy": "2026-05-03T06:14:13.010203Z", + "iopub.status.idle": "2026-05-03T06:14:13.390794Z", + "shell.execute_reply": "2026-05-03T06:14:13.389438Z", + "shell.execute_reply.started": "2026-05-03T06:14:13.010334Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ TTS 模块定义完成 (子进程模式)\r\n" + ] + } + ], + "source": [ + "from multiprocessing import Process, Queue\n", + "from scipy.io.wavfile import read as wav_read\n", + "\n", + "\n", + "# ---- TTS 子进程工作函数 ----\n", + "\n", + "def tts_worker_process(text, output_path, result_queue):\n", + " \"\"\"Worker function for TTS subprocess - loads model, synthesizes speech, returns result.\"\"\"\n", + " try:\n", + " import time\n", + " from paddlespeech.cli.tts.infer import TTSExecutor\n", + " from scipy.io.wavfile import read as wav_read\n", + "\n", + " # Load TTS model\n", + " print(\"[TTS Worker] 加载 TTS 模型 (PaddleSpeech)...\")\n", + " start = time.perf_counter()\n", + " tts_model = TTSExecutor()\n", + " 
elapsed = time.perf_counter() - start\n", + " print(f\"[TTS Worker] TTS 模型加载完成, 耗时: {elapsed:.2f}s\")\n", + "\n", + " # Synthesize speech\n", + " print(f\"[TTS Worker] 语音合成开始, 输入文字长度: {len(text)}\")\n", + " tts_model(text=text, output=output_path)\n", + "\n", + " # Read audio data\n", + " sr, wav_data = wav_read(output_path)\n", + "\n", + " if wav_data is not None:\n", + " audio_duration = len(wav_data) / sr\n", + " print(f\"[TTS Worker] 语音合成完成, 音频时长: {audio_duration:.2f}s, 采样率: {sr} Hz\")\n", + " result_queue.put((\"success\", (sr, wav_data.tolist()))) # Convert to list for serialization\n", + " else:\n", + " print(\"[TTS Worker] 语音合成失败\")\n", + " result_queue.put((\"error\", \"TTS synthesis failed\"))\n", + "\n", + " # Clean up\n", + " del tts_model\n", + " import gc\n", + " gc.collect()\n", + " print(\"[TTS Worker] TTS 模型已释放\")\n", + "\n", + " except Exception as e:\n", + " import traceback\n", + " result_queue.put((\"error\", str(e) + \"\\n\" + traceback.format_exc()))\n", + "\n", + "\n", + "# ---- 独立 TTS 步骤 (使用子进程) ----\n", + "\n", + "def tts_step(\n", + " text,\n", + " output_path=\"output.wav\",\n", + "):\n", + " \"\"\"Execute the TTS synthesis step in a subprocess.\"\"\"\n", + " step_start = time.perf_counter()\n", + " logger.info(\"[TTS Step] TTS 语音合成...\")\n", + "\n", + " # Create subprocess for TTS\n", + " logger.info(\"[TTS Step] 启动 TTS 子进程...\")\n", + " result_queue = Queue()\n", + " tts_process = Process(\n", + " target=tts_worker_process,\n", + " args=(text, output_path, result_queue)\n", + " )\n", + " tts_process.start()\n", + "\n", + " # Wait for result\n", + " status, result = result_queue.get()\n", + " tts_process.join()\n", + " tts_process.close()\n", + "\n", + " if status == \"error\":\n", + " logger.error(\"[TTS Step] TTS 子进程执行失败: %s\", result)\n", + " logger.warning(\"[TTS Step] TTS 语音合成失败\")\n", + " return {\"audio\": None}\n", + "\n", + " sr, wav_data_list = result\n", + " import numpy as np\n", + " wav_data = np.array(wav_data_list, 
dtype=np.int16) # Convert back from list\n", + "\n", + " audio_duration = len(wav_data) / sr\n", + " logger.info(\"[TTS Step] TTS 语音合成完成, 音频时长: %.2fs, 耗时: %.2fs\", audio_duration, time.perf_counter() - step_start)\n", + "\n", + " return {\"audio\": (sr, wav_data)}\n", + "\n", + "print(\"✅ TTS 模块定义完成 (子进程模式)\")" + ] + }, + { + "cell_type": "markdown", + "id": "md_orchestration", + "metadata": {}, + "source": [ + "## 管线编排\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "`drug_ocr_pipeline` 串联 OCR → LLM → TTS 三个步骤,每个步骤在独立子进程中执行,`make_demo` 构建 Gradio 界面。" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "code_orchestration", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:13.674124Z", + "iopub.status.busy": "2026-05-03T06:14:13.673982Z", + "iopub.status.idle": "2026-05-03T06:14:16.051569Z", + "shell.execute_reply": "2026-05-03T06:14:16.050351Z", + "shell.execute_reply.started": "2026-05-03T06:14:13.674106Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ 管线编排与 Gradio 界面定义完成 (子进程模式)\r\n" + ] + } + ], + "source": [ + "import tempfile\n", + "\n", + "import numpy as np\n", + "import gradio as gr\n", + "from scipy.io.wavfile import write as wav_write\n", + "\n", + "\n", + "def drug_ocr_pipeline(\n", + " ocr_model_dir,\n", + " llm_model_dir,\n", + " image_path,\n", + " enable_split=True,\n", + " num_splits=4,\n", + " overlap_ratio=0.1,\n", + " ocr_max_new_tokens=5120,\n", + " llm_max_new_tokens=1024,\n", + "):\n", + " \"\"\"Drug instruction leaflet intelligent recognition and voice broadcast pipeline.\n", + " \n", + " Uses subprocess for each model to ensure proper memory cleanup.\n", + " \"\"\"\n", + " pipeline_start = time.perf_counter()\n", + " logger.info(\"=\" * 60)\n", + " logger.info(\"药品说明书识别管线启动 (子进程模式)\")\n", + " logger.info(\" 图片路径: %s\", image_path)\n", + " logger.info(\" 图片分割: %s (num_splits=%d, overlap=%.2f)\", enable_split, num_splits, 
overlap_ratio)\n", + " logger.info(\"=\" * 60)\n", + "\n", + " result = {}\n", + "\n", + " # Step 1: OCR (runs in subprocess, automatically cleaned up)\n", + " ocr_result = ocr_step(\n", + " ocr_model_dir=ocr_model_dir,\n", + " image_path=image_path,\n", + " enable_split=enable_split,\n", + " num_splits=num_splits,\n", + " overlap_ratio=overlap_ratio,\n", + " max_new_tokens=ocr_max_new_tokens,\n", + " )\n", + " result[\"ocr_text\"] = ocr_result[\"ocr_text\"]\n", + "\n", + " # Step 2: LLM extraction (runs in subprocess, automatically cleaned up)\n", + " llm_result = llm_step(\n", + " llm_model_dir=llm_model_dir,\n", + " ocr_text=ocr_result[\"ocr_text\"],\n", + " max_new_tokens=llm_max_new_tokens,\n", + " )\n", + " result[\"extracted_info\"] = llm_result[\"extracted_info\"]\n", + "\n", + " # Step 3: TTS synthesis (runs in subprocess, automatically cleaned up)\n", + " tts_result = tts_step(\n", + " text=llm_result[\"extracted_info\"],\n", + " )\n", + " result[\"audio\"] = tts_result[\"audio\"]\n", + "\n", + " pipeline_elapsed = time.perf_counter() - pipeline_start\n", + " logger.info(\"=\" * 60)\n", + " logger.info(\"管线执行完成, 总耗时: %.2fs\", pipeline_elapsed)\n", + " logger.info(\"=\" * 60)\n", + "\n", + " return result\n", + "\n", + "\n", + "def make_demo(ocr_model_dir, llm_model_dir, ocr_max_new_tokens=5120, llm_max_new_tokens=1024):\n", + " \"\"\"Create Gradio demo for Drug OCR Pipeline.\"\"\"\n", + "\n", + " def gradio_pipeline(\n", + " image_input,\n", + " enable_split,\n", + " num_splits,\n", + " overlap_ratio,\n", + " ocr_max_tokens,\n", + " llm_max_tokens,\n", + " progress=gr.Progress(track_tqdm=True),\n", + " ):\n", + " \"\"\"Gradio interface main processing function\"\"\"\n", + " if image_input is None:\n", + " return \"请上传药品说明书图片\", \"\", None\n", + "\n", + " # Convert uploaded image to PIL Image\n", + " if isinstance(image_input, str):\n", + " image = Image.open(image_input).convert(\"RGB\")\n", + " else:\n", + " image = 
Image.fromarray(image_input).convert(\"RGB\") if not isinstance(image_input, Image.Image) else image_input\n", + "\n", + " # Save as temp file for pipeline\n", + " with tempfile.NamedTemporaryFile(suffix=\".jpg\", delete=False) as tmp:\n", + " image.save(tmp.name)\n", + " tmp_path = tmp.name\n", + "\n", + " try:\n", + " result = drug_ocr_pipeline(\n", + " ocr_model_dir=ocr_model_dir,\n", + " llm_model_dir=llm_model_dir,\n", + " image_path=tmp_path,\n", + " enable_split=enable_split,\n", + " num_splits=int(num_splits),\n", + " overlap_ratio=overlap_ratio,\n", + " ocr_max_new_tokens=int(ocr_max_tokens),\n", + " llm_max_new_tokens=int(llm_max_tokens),\n", + " )\n", + "\n", + " ocr_text = result[\"ocr_text\"]\n", + " extracted_info = result[\"extracted_info\"]\n", + "\n", + " # Save audio as temp file\n", + " audio_path = None\n", + " if result[\"audio\"] is not None:\n", + " sr, wav_data = result[\"audio\"]\n", + " audio_tmp = tempfile.NamedTemporaryFile(suffix=\".wav\", delete=False)\n", + " wav_write(audio_tmp.name, sr, wav_data.astype(np.int16))\n", + " audio_path = audio_tmp.name\n", + "\n", + " return ocr_text, extracted_info, audio_path\n", + " finally:\n", + " import os\n", + " os.unlink(tmp_path)\n", + "\n", + " with gr.Blocks(title=\"药品说明书智能识别与语音播报\") as demo:\n", + " gr.Markdown(\"# 药品说明书智能识别与语音播报系统\")\n", + " gr.Markdown(\"上传药品说明书图片,系统将自动识别文字、提取关键信息并语音播报,帮助老年人看清读懂药品说明书。\")\n", + "\n", + " with gr.Row():\n", + " with gr.Column(scale=1):\n", + " image_input = gr.Image(label=\"药品说明书图片\", type=\"filepath\")\n", + "\n", + " with gr.Accordion(\"图片分割设置\", open=True):\n", + " enable_split = gr.Checkbox(value=True, label=\"启用图片分割(文字太小时建议开启)\")\n", + " num_splits = gr.Dropdown(choices=[4, 9, 16], value=4, label=\"分割数量\")\n", + " overlap_ratio = gr.Slider(minimum=0.0, maximum=0.3, value=0.1, step=0.05, label=\"重叠比例\")\n", + "\n", + " with gr.Accordion(\"生成参数设置\", open=True):\n", + " ocr_max_tokens = gr.Slider(minimum=100, maximum=8192, value=ocr_max_new_tokens, 
step=1, label=\"OCR 最大生成 token 数\")\n", + " llm_max_tokens = gr.Slider(minimum=100, maximum=4096, value=llm_max_new_tokens, step=1, label=\"LLM 最大生成 token 数\")\n", + "\n", + " run_btn = gr.Button(\"开始识别\", variant=\"primary\")\n", + "\n", + " with gr.Column(scale=1):\n", + " ocr_output = gr.Textbox(label=\"OCR 识别结果\", lines=10, max_lines=20)\n", + " info_output = gr.Textbox(label=\"关键信息整理\", lines=15, max_lines=30)\n", + " audio_output = gr.Audio(label=\"语音播报\", type=\"filepath\")\n", + "\n", + " run_btn.click(\n", + " fn=gradio_pipeline,\n", + " inputs=[\n", + " image_input,\n", + " enable_split,\n", + " num_splits,\n", + " overlap_ratio,\n", + " ocr_max_tokens,\n", + " llm_max_tokens,\n", + " ],\n", + " outputs=[ocr_output, info_output, audio_output],\n", + " )\n", + "\n", + " return demo\n", + "\n", + "print(\"✅ 管线编排与 Gradio 界面定义完成 (子进程模式)\")" + ] + }, + { + "cell_type": "markdown", + "id": "a5b6c7d8", + "metadata": {}, + "source": [ + "## 主流程\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "主流程包含以下步骤:\n", + "1. 加载图片\n", + "2. 图片分割(可选,针对文字太小的说明书,将图片切割成多部分进行识别,分割的图片有重叠)\n", + "3. OCR 文字识别(**在子进程中加载模型,完成后销毁子进程**)\n", + "4. 大模型文字整理(**在子进程中加载模型,完成后销毁子进程**)\n", + "5. 
语音合成(**在子进程中加载模型,完成后销毁子进程**)\n", + "\n", + "> 每个步骤在独立的子进程中执行,子进程完成后自动销毁,确保模型内存完全释放。" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "e9f0a1b2", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:16.146943Z", + "iopub.status.busy": "2026-05-03T06:14:16.146807Z", + "iopub.status.idle": "2026-05-03T06:14:16.151226Z", + "shell.execute_reply": "2026-05-03T06:14:16.150312Z", + "shell.execute_reply.started": "2026-05-03T06:14:16.146924Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ 日志配置完成 (级别: INFO)\r\n" + ] + } + ], + "source": [ + "import logging\n", + "\n", + "logging.basicConfig(\n", + " level=logging.INFO,\n", + " format=\"%(asctime)s [%(name)s] %(levelname)s: %(message)s\",\n", + " datefmt=\"%H:%M:%S\",\n", + ")\n", + "\n", + "print(\"✅ 日志配置完成 (级别: INFO)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c3d4e5f6", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:16.152272Z", + "iopub.status.busy": "2026-05-03T06:14:16.152120Z", + "iopub.status.idle": "2026-05-03T06:14:16.155642Z", + "shell.execute_reply": "2026-05-03T06:14:16.154737Z", + "shell.execute_reply.started": "2026-05-03T06:14:16.152254Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ 子进程模式已启用 - 模型将在需要时自动加载和释放\r\n" + ] + } + ], + "source": [ + "# 模型管理器已移除 - 现在使用子进程模式\n", + "# 每个模型在独立的子进程中加载、执行、然后自动销毁\n", + "print(\"✅ 子进程模式已启用 - 模型将在需要时自动加载和释放\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7b8c9d0", + "metadata": { + "execution": { + "iopub.execute_input": "2026-05-03T06:14:16.156170Z", + "iopub.status.busy": "2026-05-03T06:14:16.156041Z" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:14:16 [drug_ocr] INFO: 
============================================================\r\n", + "14:14:16 [drug_ocr] INFO: 药品说明书识别管线启动 (子进程模式)\r\n", + "14:14:16 [drug_ocr] INFO: 图片路径: resource/1.jpg\r\n", + "14:14:16 [drug_ocr] INFO: 图片分割: False (num_splits=4, overlap=0.10)\r\n", + "14:14:16 [drug_ocr] INFO: ============================================================\r\n", + "14:14:16 [drug_ocr] INFO: [OCR Step] 加载图片...\r\n", + "14:14:16 [drug_ocr] INFO: [OCR Step] 图片加载完成, 尺寸: (2014, 2881)\r\n", + "14:14:16 [drug_ocr] INFO: [OCR Step] 跳过图片分割\r\n", + "14:14:17 [drug_ocr] INFO: [OCR Step] 启动 OCR 子进程...\r\n", + "I0503 14:14:18.096459 1035908 init.cc:238] ENV [CUSTOM_DEVICE_ROOT]=/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device\r\n", + "I0503 14:14:18.096537 1035908 init.cc:146] Try loading custom device libs from: [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device]\r\n", + "I0503 14:14:18.217633 1035908 custom_device_load.cc:51] Succeed in loading custom runtime in lib: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device/libpaddle-iluvatar-gpu.so\r\n", + "I0503 14:14:18.217679 1035908 custom_device_load.cc:58] Skipped lib [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device/libpaddle-iluvatar-gpu.so]: no custom engine Plugin symbol in this lib.\r\n", + "I0503 14:14:18.224740 1035908 custom_kernel.cc:68] Succeed in loading 887 custom kernel(s) from loaded lib(s), will be used like native ones.\r\n", + "I0503 14:14:18.225076 1035908 init.cc:158] Finished in LoadCustomDevice with libs_path: [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device]\r\n", + "I0503 14:14:18.225135 1035908 init.cc:244] CustomDevice: iluvatar_gpu, visible devices count: 1\r\n", + "WARNING 2026-05-03 14:14:18,795 1035908 prometheus_multiprocess_setup.py[line:41] Found 
PROMETHEUS_MULTIPROC_DIR:/tmp/fd_prom_dad76550-a346-4423-aa82-44018eeaf3ba was set by user. you will find inaccurate metrics. Unset the variable will properly handle cleanup.\r\n", + "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\r\n", + "\u001b[33m[2026-05-03 14:14:19,226] [ WARNING]\u001b[0m - Due to potential compatibility issues between PaddlePaddle and PyTorch in PaddleFormers, PaddleFormers defaults `transformers.utils.import_utils.is_torch_available` and `transformers.utils.import_utils.is_torchvision_available` to False. If you need to use PyTorch in transformers or torchvision, please add `del sys.modules['transformers']` before using them.\u001b[0m\r\n", + "WARNING 2026-05-03 14:14:19,740 1035908 prometheus_multiprocess_setup.py[line:41] Found PROMETHEUS_MULTIPROC_DIR:/tmp/fd_prom_dad76550-a346-4423-aa82-44018eeaf3ba was set by user. you will find inaccurate metrics. 
Unset the variable will properly handle cleanup.\r\n", + "WARNING 2026-05-03 14:14:19,750 1035908 ops.py[line:125] Failed to import cache manager ops: Prefix cache ops only supported CUDA nor XPU platform \r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[OCR Worker] 加载 OCR 模型 (PaddleOCR-VL)...\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO 2026-05-03 14:14:21,132 1035908 args_utils.py[line:639] Parameter `engine_worker_queue_port` is not specified, found available ports for possible use: [28305]\r\n", + "INFO 2026-05-03 14:14:21,134 1035908 args_utils.py[line:639] Parameter `cache_queue_port` is not specified, found available ports for possible use: [38724]\r\n", + "INFO 2026-05-03 14:14:21,136 1035908 args_utils.py[line:639] Parameter `rdma_comm_ports` is not specified, found available ports for possible use: [14751]\r\n", + "INFO 2026-05-03 14:14:21,139 1035908 args_utils.py[line:639] Parameter `pd_comm_port` is not specified, found available ports for possible use: [19484]\r\n", + "INFO 2026-05-03 14:14:21,140 1035908 download.py[line:142] Using download source: huggingface\r\n", + "INFO 2026-05-03 14:14:21,141 1035908 configuration_utils.py[line:1215] Loading configuration file baidu/PaddleOCR-VL-1.5/config.json\r\n", + "WARNING 2026-05-03 14:14:21,143 1035908 configuration_utils.py[line:1246] You are using a model of type paddleocr_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\r\n", + "WARNING 2026-05-03 14:14:21,144 1035908 configuration_utils.py[line:1246] You are using a model of type paddleocr_vl to instantiate a model of type . 
This is not supported for all configurations of models and can yield errors.\r\n", + "INFO 2026-05-03 14:14:22,130 1035908 flash_attn_backend.py[line:105] Only support CUDA version flash attention.\r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "current sm_version=71\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING 2026-05-03 14:14:22,285 1035908 moe.py[line:41] import noaux_tc Failed!\r\n", + "INFO 2026-05-03 14:14:24,261 1035908 download.py[line:142] Using download source: huggingface\r\n", + "INFO 2026-05-03 14:14:24,264 1035908 configuration_utils.py[line:425] Loading configuration file baidu/PaddleOCR-VL-1.5/generation_config.json\r\n", + "INFO 2026-05-03 14:14:24,284 1035908 tokenizer_utils.py[line:257] Using download source: huggingface\r\n", + "INFO 2026-05-03 14:14:25,941 1035908 engine.py[line:151] Waiting for worker processes to be ready...\r\n", + "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\r\n", + "To disable this warning, you can either:\r\n", + "\t- Avoid using `tokenizers` before the fork if possible\r\n", + "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\r\n", + "Loading Weights: 100%|██████████| 100/100 [00:07<00:00, 13.23it/s] \r\n", + "Loading Layers: 100%|██████████| 100/100 [00:00<00:00, 198.88it/s] \r\n", + "INFO 2026-05-03 14:14:39,443 1035908 engine.py[line:209] Worker processes are launched with 16.92835831642151 seconds.\r\n", + "INFO 2026-05-03 14:14:39,445 1035908 engine.py[line:220] Detected 10922 gpu blocks and 0 cpu blocks in cache (block size: 16).\r\n", + "INFO 2026-05-03 14:14:39,446 1035908 engine.py[line:223] FastDeploy will be serving 8 running requests if each sequence reaches its maximum length: 8192\r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[OCR Worker] OCR 模型加载完成, 耗时: 18.32s\r\n", + "[OCR Worker] 识别图片 1/1, 尺寸: (2014, 2881)\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processed prompts: 0%| | 0/1 [00:00= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\r\n", + "\u001b[33m[2026-05-03 14:15:03,510] [ WARNING]\u001b[0m - Due to potential compatibility issues between PaddlePaddle and PyTorch in PaddleFormers, PaddleFormers defaults `transformers.utils.import_utils.is_torch_available` and `transformers.utils.import_utils.is_torchvision_available` to False. If you need to use PyTorch in transformers or torchvision, please add `del sys.modules['transformers']` before using them.\u001b[0m\r\n", + "WARNING 2026-05-03 14:15:03,878 1076624 prometheus_multiprocess_setup.py[line:41] Found PROMETHEUS_MULTIPROC_DIR:/tmp/fd_prom_24ad65e3-2498-460c-97d2-9a88e46fe8f6 was set by user. you will find inaccurate metrics. 
Unset the variable will properly handle cleanup.\r\n", + "WARNING 2026-05-03 14:15:03,886 1076624 ops.py[line:125] Failed to import cache manager ops: Prefix cache ops only supported CUDA nor XPU platform \r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[LLM Worker] 加载 LLM 模型 (ERNIE)...\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO 2026-05-03 14:15:04,961 1076624 args_utils.py[line:639] Parameter `engine_worker_queue_port` is not specified, found available ports for possible use: [58094]\r\n", + "INFO 2026-05-03 14:15:04,964 1076624 args_utils.py[line:639] Parameter `cache_queue_port` is not specified, found available ports for possible use: [56896]\r\n", + "INFO 2026-05-03 14:15:04,967 1076624 args_utils.py[line:639] Parameter `rdma_comm_ports` is not specified, found available ports for possible use: [41390]\r\n", + "INFO 2026-05-03 14:15:04,970 1076624 args_utils.py[line:639] Parameter `pd_comm_port` is not specified, found available ports for possible use: [19643]\r\n", + "INFO 2026-05-03 14:15:04,972 1076624 download.py[line:142] Using download source: huggingface\r\n", + "INFO 2026-05-03 14:15:04,973 1076624 configuration_utils.py[line:1215] Loading configuration file baidu/ERNIE-4.5-0.3B-Paddle/config.json\r\n", + "WARNING 2026-05-03 14:15:04,975 1076624 configuration_utils.py[line:1246] You are using a model of type ernie4_5 to instantiate a model of type . 
This is not supported for all configurations of models and can yield errors.\r\n", + "INFO 2026-05-03 14:15:06,143 1076624 flash_attn_backend.py[line:105] Only support CUDA version flash attention.\r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "current sm_version=71\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING 2026-05-03 14:15:06,307 1076624 moe.py[line:41] import noaux_tc Failed!\r\n", + "INFO 2026-05-03 14:15:07,224 1076624 download.py[line:142] Using download source: huggingface\r\n", + "INFO 2026-05-03 14:15:07,227 1076624 configuration_utils.py[line:425] Loading configuration file baidu/ERNIE-4.5-0.3B-Paddle/generation_config.json\r\n", + "WARNING 2026-05-03 14:15:07,229 1076624 log.py[line:135] PretrainedTokenizer will be deprecated and removed in the next major release. Please migrate to Hugging Face's transformers.PreTrainedTokenizer. use class QWenTokenizer(PaddleTokenizerMixin, hf.PreTrainedTokenizer) to support multisource download and Paddle tokenizer operations.\r\n", + "INFO 2026-05-03 14:15:09,492 1076624 engine.py[line:151] Waiting for worker processes to be ready...\r\n", + "Loading Weights: 100%|██████████| 100/100 [00:04<00:00, 24.85it/s] \r\n", + "Loading Layers: 100%|██████████| 100/100 [00:00<00:00, 199.46it/s] \r\n", + "INFO 2026-05-03 14:15:20,035 1076624 engine.py[line:209] Worker processes are launched with 13.396349906921387 seconds.\r\n", + "INFO 2026-05-03 14:15:20,036 1076624 engine.py[line:220] Detected 10922 gpu blocks and 0 cpu blocks in cache (block size: 16).\r\n", + "INFO 2026-05-03 14:15:20,037 1076624 engine.py[line:223] FastDeploy will be serving 8 running requests if each sequence reaches its maximum length: 8192\r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[LLM Worker] LLM 模型加载完成, 耗时: 15.08s\r\n", + "[LLM Worker] 正在生成回复 (max_new_tokens=200)...\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + 
"text": [ + "Processed prompts: 0%| | 0/1 [00:00>> 在这样一支粉色的手指往前一拉,我像一只蝴蝶似的飞到了你的身边\r\n", + "\r\n", + "你轻轻地将我的手贴在脸颊,柔软的触感瞬间让我一下子陷了进去\r\n", + "\r\n", + "“喜欢就好,别舍不得,我们一起去海边好不好?”\r\n", + "\r\n", + "我微微一笑,眼神带着一丝甜蜜,嘴角不自觉地扬起了\r\n", + "\r\n", + "“好,那就一起去,我保证不弄疼你,我们一起海边,好不好?”\r\n", + "\r\n", + "我环住你,紧紧地靠在你的身上,感受着你的温度和怀抱的柔软\r\n", + "\r\n", + "你轻轻地将我搂入怀中,仿佛一只受伤的小动物,任由我紧紧地依靠着你\r\n", + "\r\n", + "随着一阵海风轻拂,我们来到了海边\r\n", + "\r\n", + "风轻轻掀起了我的长发,海浪一波一波地涌来\r\n", + "\r\n", + "我仰头看着那片广阔无垠的蓝,心中满是向往\r\n", + "\r\n", + "“这就是我想要的,这是我第一次来这里\r\n", + "[LLM Worker] LLM 模型已释放\r\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:15:30 [drug_ocr] INFO: [LLM Step] LLM 信息提取完成, 结果长度: 289, 耗时: 28.24s\r\n", + "14:15:30 [drug_ocr] INFO: [TTS Step] TTS 语音合成...\r\n", + "14:15:30 [drug_ocr] INFO: [TTS Step] 启动 TTS 子进程...\r\n", + "I0503 14:15:30.627210 1088334 init.cc:238] ENV [CUSTOM_DEVICE_ROOT]=/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device\r\n", + "I0503 14:15:30.627287 1088334 init.cc:146] Try loading custom device libs from: [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device]\r\n", + "I0503 14:15:30.751516 1088334 custom_device_load.cc:51] Succeed in loading custom runtime in lib: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device/libpaddle-iluvatar-gpu.so\r\n", + "I0503 14:15:30.751560 1088334 custom_device_load.cc:58] Skipped lib [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device/libpaddle-iluvatar-gpu.so]: no custom engine Plugin symbol in this lib.\r\n", + "I0503 14:15:30.759230 1088334 custom_kernel.cc:68] Succeed in loading 887 custom kernel(s) from loaded lib(s), will be used like native ones.\r\n", + "I0503 14:15:30.759569 1088334 init.cc:158] Finished in LoadCustomDevice with libs_path: [/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle_custom_device]\r\n", + "I0503 14:15:30.759625 
1088334 init.cc:244] CustomDevice: iluvatar_gpu, visible devices count: 1\r\n", + "\u001b[0;93m2026-05-03 14:15:34.944031381 [W:onnxruntime:Default, cpuid_info.cc:91 LogEarlyWarning] Unknown CPU vendor. cpuinfo_vendor value: 16\u001b[m\r\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[TTS Worker] 加载 TTS 模型 (PaddleSpeech)...\r\n", + "[TTS Worker] TTS 模型加载完成, 耗时: 0.00s\r\n", + "[TTS Worker] 语音合成开始, 输入文字长度: 289\r\n" + ] + } + ], + "source": [ + "from pathlib import Path\n", + "\n", + "sample_image_path = str(Path(\"resource/1.jpg\"))\n", + "\n", + "result = drug_ocr_pipeline(\n", + " ocr_model_dir=ocr_model_dir,\n", + " llm_model_dir=llm_model_dir,\n", + " image_path=sample_image_path,\n", + " enable_split=False,\n", + " num_splits=4,\n", + " overlap_ratio=0.1,\n", + " ocr_max_new_tokens=ocr_max_new_tokens,\n", + " llm_max_new_tokens=llm_max_new_tokens,\n", + ")\n", + "\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"📋 OCR 识别结果:\")\n", + "print(\"=\" * 60)\n", + "print(result[\"ocr_text\"][:500] + \"...\" if len(result[\"ocr_text\"]) > 500 else result[\"ocr_text\"])\n", + "\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"📝 大模型整理结果:\")\n", + "print(\"=\" * 60)\n", + "print(result[\"extracted_info\"])\n", + "\n", + "# 播放音频\n", + "if result[\"audio\"] is not None:\n", + " import IPython.display as ipd\n", + " sr, wav_data = result[\"audio\"]\n", + " print(\"\\n🔊 播放语音...\")\n", + " ipd.display(ipd.Audio(wav_data, rate=sr))" + ] + }, + { + "cell_type": "markdown", + "id": "e1f2a3b4", + "metadata": {}, + "source": [ + "## Gradio 交互界面\n", + "[返回目录 ⬆️](#目录:)\n", + "\n", + "通过 Gradio 界面,用户可以:\n", + "- 上传药品说明书图片\n", + "- 设置是否启用图片分割及分割数量\n", + "- 调整各模型的生成参数(max_new_tokens)\n", + "- 查看识别和整理结果\n", + "- 播放语音合成的音频\n", + "\n", + "> 每次点击\"开始识别\"时,各模型在独立子进程中执行,完成后自动销毁子进程释放内存。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5d6e7f8", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "demo = 
make_demo(\n", + " ocr_model_dir=ocr_model_dir,\n", + " llm_model_dir=llm_model_dir,\n", + " ocr_max_new_tokens=ocr_max_new_tokens,\n", + " llm_max_new_tokens=llm_max_new_tokens,\n", + ")\n", + "\n", + "try:\n", + " demo.launch(server_name=\"0.0.0.0\", server_port=7860, debug=True)\n", + "except Exception:\n", + " demo.launch(debug=True, share=True)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "py35-paddle1.2.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}