
feat: add single-operator subgraph dataset generation script #635

Merged
lixinqi merged 7 commits into PaddlePaddle:develop from ywh555hhh:develop
Feb 9, 2026
@ywh555hhh
Contributor

PR Category

Feature Enhancement

Summary

This PR adds a new shell script generate_single_op_dataset.sh to support the generation of single-operator subgraphs. This is a critical step for building the single-op dataset used in kernel benchmarking.

Key Changes

  • New Script: generate_single_op_dataset.sh
  • Workflow:
    1. Generation: Uses multiprocessing to extract single-op subgraphs from models defined in the input list.
    2. Renaming: Standardizes graph variable names to ensure consistent hashing.
    3. Deduplication: Removes structurally identical subgraphs.
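
The deduplication step can be illustrated with a minimal sketch. It assumes each subgraph lives in its own directory containing a `model.py` (as the `find` command quoted later in this review suggests), that paths contain no whitespace, and that GNU coreutils and findutils are available. The function name `list_unique_subgraphs` is hypothetical, and a byte-level content hash stands in for the structural comparison the PR performs after renaming; this is not the script's actual implementation.

```shell
# Hypothetical helper (not part of the PR): print one directory per distinct
# model.py content hash, keeping the first directory in sorted order.
# Assumption: after variable renaming, identical subgraphs have byte-identical
# model.py files, so a content hash is enough to spot duplicates.
list_unique_subgraphs() {
    local root="$1"
    find "$root" -type f -name "model.py" \
        | sort \
        | xargs -r sha256sum \
        | awk '!seen[$1]++ { print $2 }' \
        | xargs -r -n1 dirname
}
```

Sorting before hashing makes the surviving representative of each duplicate group deterministic across runs.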

Test Plan

Ran the script on 10 sample models (small scale).

  • Command: bash generate_single_op_dataset.sh
  • Results:
    • Input: ~2088 raw subgraphs.
    • Output: 206 unique subgraphs.
    • Directory structure validated.
    • Log files confirm correct handling of individual model failures without breaking the pipeline.
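
One way to get the failure isolation described in the results is sketched below. The function name `process_models`, the `failed_models.txt` file, and the per-model log naming are all assumptions made for illustration; the PR's script may organize its logs differently.

```shell
# Hypothetical sketch: run a handler command once per model, write each
# model's output to its own log, and record failures without aborting.
# Prints the number of failed models on stdout.
process_models() {
    local model_list="$1" log_dir="$2" handler="$3"
    local failed=0 model
    mkdir -p "$log_dir"
    while IFS= read -r model; do
        # Skip blank lines in the list.
        [ -n "$model" ] || continue
        # stdin is redirected so the handler cannot consume the model list.
        if ! "$handler" "$model" < /dev/null \
                > "${log_dir}/$(basename "$model").log" 2>&1; then
            echo "$model" >> "${log_dir}/failed_models.txt"
            failed=$((failed + 1))
        fi
    done < "$model_list"
    echo "$failed"
}
```

Because the handler runs inside an `if` condition, a non-zero exit status is recorded rather than terminating the loop, even when the caller uses `set -e`.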

Checklist

  • Script executable permissions set.
  • Warning block added for hardcoded paths.
  • Verified deduplication logic.

This commit introduces `generate_single_op_dataset.sh` to automate the workflow for generating single-operator subgraph datasets.
@paddle-bot

paddle-bot bot commented Feb 5, 2026

Thanks for your contribution!

# Virtual Environment Python Executable Path
PYTHON_EXEC="/workspace/venv_graphnet/bin/python3"
# Project Root Directory
GRAPH_NET_ROOT="/workspace/GraphNet"
Collaborator

Please use generate_subgraph_dataset.sh as a reference; none of these paths should be hardcoded.
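
A minimal sketch of what non-hardcoded paths could look like follows. It is not taken from generate_subgraph_dataset.sh, whose approach is not shown in this conversation; it assumes, purely for illustration, that the script sits one directory below the project root and that `python3` is on `PATH`. Both values can be overridden from the environment.

```shell
# Directory containing this script, resolved to an absolute path.
SCRIPT_DIR="$(cd "$(dirname -- "$0")" && pwd)"

# Assumption: the project root is the parent of the script directory.
# An existing GRAPH_NET_ROOT in the environment takes precedence.
GRAPH_NET_ROOT="${GRAPH_NET_ROOT:-$(cd "${SCRIPT_DIR}/.." && pwd)}"

# Assumption: fall back to whichever python3 is found on PATH at run time.
PYTHON_EXEC="${PYTHON_EXEC:-python3}"

export GRAPH_NET_ROOT PYTHON_EXEC
```

With this pattern a user on a different machine can run, for example, `PYTHON_EXEC=/path/to/venv/bin/python3 bash generate_single_op_dataset.sh` without editing the script.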

"model_path_prefix": PROJECT_ROOT,
"output_dir": workspace
})
run_stage_cmd(env, PROJECT_ROOT, [
Collaborator

Just launch the shell commands directly; there is no need to go through Python.

Refactor script for dynamic path detection and improved error handling. Added logging and workspace setup enhancements.
Collaborator

@Xreki left a comment

Put the script under the graph_net/tools directory.

LOG_DIR="${WORKSPACE}/logs" # New: Dedicated log directory

export PYTHONPATH="${GRAPH_NET_ROOT}:${PYTHONPATH}"
export GRAPH_NET_ROOT PYTHON_EXEC WORKSPACE OP_NAMES_DIR RANGES_DIR RAW_SUBGRAPH_DIR RESUME LOG_DIR
Collaborator

This is redundant. If L21 succeeds, L54 is unnecessary. Also, what is L55 for?

# ==============================================================================
# Core Logic: Single Model Processing (V3: Strict Error Checking)
# ==============================================================================
process_single_model() {
Collaborator

Every step runs over the models in batch, so this is not single-model processing.

EOF
)"

run_step "OpNames" "$cmd_s1" || { rm -f "${tmp_list}"; return 1; }
Collaborator

Don't wrap things in a run_step function; it makes the whole script more complicated. Just execute the commands directly.

split -l ${lines_per_gpu} -d "${WORKSPACE}/clean_list.txt" "${WORKSPACE}/gpu_chunk_"

# 3. Parallel Execution
for (( i=0; i<NUM_GPUS; i++ )); do
Collaborator

The script does not need to add parallel execution. If parallelism is to be supported, it should be added uniformly at the apply_sample_passmodel_path_handler level.
Even if it is added in the script, the two nested loops at L179 and L188 are unnecessary. Every processing step accepts a model_path_list, so split the full model_list into NUM_GPUS parts and let each GPU process one list.
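
The single-loop layout suggested here can be sketched as follows, assuming GNU `split` (as the quoted diff already does). The function name `run_per_gpu` and the `handler` argument are hypothetical stand-ins for whatever command processes a model_path_list; using `CUDA_VISIBLE_DEVICES` to pin each chunk to one GPU is also an assumption.

```shell
# Hypothetical sketch: split the full model list into at most NUM_GPUS chunks
# and start one background job per chunk, each pinned to a single GPU.
run_per_gpu() {
    local model_list="$1" num_gpus="$2" work_dir="$3" handler="$4"
    local total lines_per_gpu chunk pid gpu=0 pids="" status=0

    total="$(wc -l < "$model_list")"
    [ "$total" -gt 0 ] || return 0

    # Ceiling division so that no more than num_gpus chunks are produced.
    lines_per_gpu=$(( (total + num_gpus - 1) / num_gpus ))
    split -l "$lines_per_gpu" -d "$model_list" "${work_dir}/gpu_chunk_"

    # One loop: chunk i is handed, as a whole list, to GPU i.
    for chunk in "${work_dir}"/gpu_chunk_*; do
        CUDA_VISIBLE_DEVICES="$gpu" "$handler" "$chunk" &
        pids="$pids $!"
        gpu=$((gpu + 1))
    done

    # Wait for every job and report failure if any of them failed.
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

Each job receives a complete list file, so the handler itself stays unaware of how many GPUs are in use.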

exit 1
fi

grep -v "^#" "${MODEL_LIST}" | grep -v "^$" > "${WORKSPACE}/clean_list.txt"
Collaborator

Is there a problem with the original model_list?

find ${RAW_SUBGRAPH_DIR} -name "model.py" \
| xargs dirname \
| xargs realpath --relative-to=${RAW_SUBGRAPH_DIR} \
> "${WORKSPACE}/raw_list.txt"
Collaborator

The file name should be more meaningful.

@lixinqi merged commit c0fd47d into PaddlePaddle:develop on Feb 9, 2026
3 checks passed

3 participants