Commit 4e92bd3

Add Excel LLM benchmarking tool with comprehensive run tracking and reporting
- Implemented `ExcelLLMRunRecord` and `ExcelLLMSummary` data classes for structured run data.
- Created backend protocols for MCP and ZCP Excel backends, including asynchronous context management.
- Developed functions for running benchmark cases, managing tool calls, and evaluating results.
- Added support for progress tracking and checkpointing during benchmarking runs.
- Introduced markdown and JSON report generation for summarizing benchmark results.
- Implemented utility functions for tool argument parsing and duplicate tool call evaluation.
1 parent f9e3ca9 commit 4e92bd3

66 files changed

Lines changed: 12697 additions & 486 deletions
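
The commit message names the run-tracking pieces, but this view hides their source. The sketch below is a hypothetical illustration only and is not the commit's code. Every field name is an assumption, as are the helpers `parse_tool_arguments`, `count_duplicate_tool_calls`, `summarize`, `render_markdown` and `render_json`. The sketch shows how per-run records, duplicate-call counting, and Markdown/JSON summaries could fit together. It treats a duplicate as an exact repeat of a (tool name, arguments) pair.

```python
from __future__ import annotations

import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Any

ToolCall = tuple[str, dict[str, Any]]  # (tool name, parsed arguments)


# Hypothetical fields: the commit does not show the real attributes.
@dataclass
class ExcelLLMRunRecord:
    case_id: str
    backend: str  # assumed labels such as "mcp" or "zcp"
    passed: bool
    tool_calls: list[ToolCall] = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0


@dataclass
class ExcelLLMSummary:
    backend: str
    runs: int
    passed: int
    duplicate_tool_calls: int
    total_tokens: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.runs if self.runs else 0.0


def parse_tool_arguments(raw: str | None) -> dict[str, Any]:
    """Decode a model-emitted JSON argument string; malformed or non-object input becomes {}."""
    if not raw:
        return {}
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return value if isinstance(value, dict) else {}


def count_duplicate_tool_calls(calls: list[ToolCall]) -> int:
    """Count calls that exactly repeat an earlier (name, arguments) pair."""
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
    seen = Counter((name, json.dumps(args, sort_keys=True)) for name, args in calls)
    return sum(n - 1 for n in seen.values())


def summarize(backend: str, records: list[ExcelLLMRunRecord]) -> ExcelLLMSummary:
    mine = [r for r in records if r.backend == backend]
    return ExcelLLMSummary(
        backend=backend,
        runs=len(mine),
        passed=sum(r.passed for r in mine),
        duplicate_tool_calls=sum(count_duplicate_tool_calls(r.tool_calls) for r in mine),
        total_tokens=sum(r.prompt_tokens + r.completion_tokens for r in mine),
    )


def render_markdown(summaries: list[ExcelLLMSummary]) -> str:
    rows = [
        "| backend | runs | pass rate | duplicate calls | tokens |",
        "|---|---|---|---|---|",
    ]
    for s in summaries:
        rows.append(
            f"| {s.backend} | {s.runs} | {s.pass_rate:.0%} "
            f"| {s.duplicate_tool_calls} | {s.total_tokens} |"
        )
    return "\n".join(rows)


def render_json(summaries: list[ExcelLLMSummary]) -> str:
    # asdict() skips properties, so pass_rate is added explicitly.
    return json.dumps(
        [{**asdict(s), "pass_rate": s.pass_rate} for s in summaries], indent=2
    )
```

Counting only exact repeats is the simplest definition. A real evaluator might also flag near-duplicates (for example, overlapping ranges), which this sketch does not attempt.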

.env.example

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+OPENAI_API_KEY=replace-with-your-provider-key
+OPENAI_BASE_URL=https://api.deepseek.com
+OPENAI_MODEL=deepseek-chat
+
+# Optional aliases for provider-specific scripts.
+DEEPSEEK_API_KEY=
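
The template points the OpenAI-style variables at DeepSeek's OpenAI-compatible endpoint and keeps `DEEPSEEK_API_KEY` as an optional alias. The sketch below shows how a script might resolve these variables. It makes three assumptions:

- The variables are already exported into the process environment; the code does not parse `.env` itself.
- The alias is used only when `OPENAI_API_KEY` is empty or still holds the placeholder.
- The fallback base URL and model simply mirror the template.

`LLMSettings` and `resolve_llm_settings` are hypothetical names.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass

PLACEHOLDER = "replace-with-your-provider-key"  # value shipped in .env.example


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model: str


def _read(env: Mapping[str, str], name: str) -> str:
    """Return a stripped value, treating missing, blank, or placeholder values as ''."""
    value = env.get(name, "").strip()
    return "" if value == PLACEHOLDER else value


def resolve_llm_settings(env: Mapping[str, str] | None = None) -> LLMSettings:
    env = os.environ if env is None else env
    # Hypothetical precedence: OPENAI_API_KEY first, DEEPSEEK_API_KEY as the alias.
    api_key = _read(env, "OPENAI_API_KEY") or _read(env, "DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY (or DEEPSEEK_API_KEY) before running")
    return LLMSettings(
        api_key=api_key,
        # Fallbacks mirror the defaults in .env.example.
        base_url=_read(env, "OPENAI_BASE_URL") or "https://api.deepseek.com",
        model=_read(env, "OPENAI_MODEL") or "deepseek-chat",
    )
```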

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+* @jiayuqi7813

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+---
+name: Bug report
+about: Report a defect in the SDK, runtime, compatibility layer, or examples
+title: "[Bug] "
+labels: bug
+assignees: ""
+---
+
+## Summary
+
+## Environment
+
+- Python version:
+- Package install mode:
+- Transport:
+
+## Reproduction
+
+## Expected behavior
+
+## Actual behavior
+
+## Logs or artifacts

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+---
+name: Feature request
+about: Propose a protocol, SDK, docs, or benchmark improvement
+title: "[Feature] "
+labels: enhancement
+assignees: ""
+---
+
+## Problem
+
+## Proposed solution
+
+## Compatibility impact
+
+## Additional context

.github/dependabot.yml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"

.github/pull_request_template.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+## Summary
+
+## Validation
+
+- [ ] `python -m ruff check src tests examples tools`
+- [ ] `PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python -m pytest -q`
+
+## Compatibility Notes
+
+## Benchmark / Docs Impact

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+name: benchmark-smoke
+
+on:
+  pull_request:
+  workflow_dispatch:
+
+jobs:
+  smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Install
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev,openai,mcp]"
+      - name: Benchmark regression tests
+        env:
+          PYTEST_DISABLE_PLUGIN_AUTOLOAD: "1"
+        run: python -m pytest -q tests/test_benchmark_harness.py tests/test_excel_llm_benchmark.py
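
This job runs `tests/test_excel_llm_benchmark.py` without configuring any provider key. The commit message describes backend protocols for the MCP and ZCP Excel backends with asynchronous context management. One plausible way for such tests to stay offline is to use an in-memory fake that satisfies the same protocol.

The sketch assumes a hypothetical `ExcelBackend` protocol with a `call_tool` method, plus hypothetical `FakeExcelBackend` and `run_case`. None of these names are taken from the repository, and a scripted sequence of calls stands in for the LLM tool loop.

```python
from __future__ import annotations

import asyncio
from types import TracebackType
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ExcelBackend(Protocol):
    """Hypothetical shape shared by MCP- and ZCP-style backends."""

    async def __aenter__(self) -> ExcelBackend: ...

    async def __aexit__(
        self,
        exc_type: type[BaseException] | None,
        exc: BaseException | None,
        tb: TracebackType | None,
    ) -> None: ...

    async def call_tool(self, name: str, arguments: dict[str, Any]) -> Any: ...


class FakeExcelBackend:
    """In-memory stand-in: cells live in a dict, so no server or API key is needed."""

    def __init__(self, cells: dict[str, float]) -> None:
        self.cells = dict(cells)
        self.open = False
        self.calls: list[str] = []

    async def __aenter__(self) -> FakeExcelBackend:
        self.open = True  # a real backend would open its session or connection here
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.open = False  # ...and release it here, even if the case raised

    async def call_tool(self, name: str, arguments: dict[str, Any]) -> Any:
        if not self.open:
            raise RuntimeError("backend used outside 'async with'")
        self.calls.append(name)
        if name == "read_cell":
            return self.cells.get(arguments["cell"])
        if name == "write_cell":
            self.cells[arguments["cell"]] = arguments["value"]
            return None
        raise ValueError(f"unknown tool: {name}")


async def run_case(backend: ExcelBackend) -> float:
    """Scripted 'model' that writes A1 + A2 into A3, standing in for an LLM tool loop."""
    async with backend as b:
        a1 = await b.call_tool("read_cell", {"cell": "A1"})
        a2 = await b.call_tool("read_cell", {"cell": "A2"})
        await b.call_tool("write_cell", {"cell": "A3", "value": a1 + a2})
        return await b.call_tool("read_cell", {"cell": "A3"})
```

Because `run_case` only depends on the protocol, the same function could in principle drive a real MCP or ZCP backend. Tests can then assert on the fake's recorded calls and final cell state.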

.github/workflows/ci.yml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: sdk-ci
+
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+      - "codex/**"
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev,openai,mcp]"
+      - name: Ruff
+        run: python -m ruff check src tests examples tools
+      - name: Pytest
+        env:
+          PYTEST_DISABLE_PLUGIN_AUTOLOAD: "1"
+        run: python -m pytest -q
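
The pull request template asks contributors to run the same Ruff and pytest commands as this workflow. A small helper along the lines below could replay them locally; `CI_STEPS`, `ci_env` and `run_ci_locally` are hypothetical names.

- The sketch uses `sys.executable` instead of a bare `python`, so the steps run under the interpreter that launched it.
- Like CI, it sets `PYTEST_DISABLE_PLUGIN_AUTOLOAD=1`. This stops pytest from auto-loading third-party plugins registered through entry points.

```python
from __future__ import annotations

import os
import subprocess
import sys

# Same commands as the Ruff and Pytest steps of the sdk-ci workflow.
CI_STEPS: list[list[str]] = [
    [sys.executable, "-m", "ruff", "check", "src", "tests", "examples", "tools"],
    [sys.executable, "-m", "pytest", "-q"],
]


def ci_env(base: dict[str, str] | None = None) -> dict[str, str]:
    """Copy the environment and disable pytest plugin autoloading, as the workflow does."""
    env = dict(os.environ if base is None else base)
    env["PYTEST_DISABLE_PLUGIN_AUTOLOAD"] = "1"
    return env


def run_ci_locally(dry_run: bool = False) -> int:
    """Run each step in order, stopping at the first failure; return its exit code."""
    env = ci_env()
    for cmd in CI_STEPS:
        print("$", " ".join(cmd))
        if dry_run:
            continue
        returncode = subprocess.run(cmd, env=env, check=False).returncode
        if returncode != 0:
            return returncode
    return 0
```

`run_ci_locally(dry_run=True)` only prints the commands. Otherwise the helper stops at the first failing step and returns that step's exit code, much as the job stops at a failing step.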

.github/workflows/secret-scan.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+name: secret-scan
+
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+      - "codex/**"
+
+jobs:
+  gitleaks:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: gitleaks/gitleaks-action@v2
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,9 @@
 
 __pycache__/
 *.py[cod]
+*.egg-info/
 
 node_modules/
 .next/
+benchmark_reports/checkpoints/
+benchmark_reports/**/checkpoints/
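
The new ignore rules keep `checkpoints/` directories under `benchmark_reports/` out of version control, which fits the checkpointing mentioned in the commit message. Below is a minimal sketch of resumable checkpointing. It assumes one JSON file per run that maps completed case IDs to their results. The file layout and the `checkpoint_path`, `save_checkpoint`, `load_checkpoint` and `pending_cases` helpers are hypothetical. Each write goes to a temporary file in the same directory and is swapped in with `os.replace`, so an interrupted run never leaves a half-written checkpoint.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path
from typing import Any


def checkpoint_path(report_dir: Path, run_id: str) -> Path:
    # With report_dir inside benchmark_reports/, this lands in an ignored checkpoints/ dir.
    return report_dir / "checkpoints" / f"{run_id}.json"


def save_checkpoint(path: Path, completed: dict[str, Any]) -> None:
    """Atomically persist completed case results (case_id -> result)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(completed, fh, indent=2, sort_keys=True)
        os.replace(tmp, path)  # atomic rename within one filesystem
    except BaseException:
        Path(tmp).unlink(missing_ok=True)  # don't leave stray temp files behind
        raise


def load_checkpoint(path: Path) -> dict[str, Any]:
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def pending_cases(all_cases: list[str], completed: dict[str, Any]) -> list[str]:
    """Cases still to run, in original order, so a resumed run skips finished work."""
    return [case for case in all_cases if case not in completed]
```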
