Hi, I’m Peng Fei, a graduate student at Shenzhen University working on multimodal large models, speech large models, and vision-language-action (VLA) systems.
I use this GitHub account as a public notebook for small, runnable research-engineering projects: clean data schemas, transparent evaluation scripts, and prototypes that help me understand model behavior before scaling things up.
- Multimodal large models — image-text/audio-text understanding, instruction following, and evaluation design
- Speech large models — ASR-oriented workflows, spoken dialogue evaluation, and audio instruction following
- Vision-Language-Action systems — action schemas, grounding, simulated evaluation, and robotics-oriented interfaces
- Reproducible ML tooling — small benchmarks, dataset cards, CLI-first experiments, and readable reports
| Project | What it explores |
|---|---|
| audio-scene-caption-lab | A small sandbox for audio, speech, and visual scene captioning workflows with lightweight metrics and report generation. |
| vla-action-grounding-playground | A toy environment for instruction-to-action grounding, action schema design, and VLA-style evaluation traces. |
Python · PyTorch · Transformers · NumPy · Jupyter · Linux · Git · LaTeX
I am especially interested in projects that are easy to run, easy to inspect, and honest about their limitations. A good experiment should leave a readable trail.
- How to evaluate multimodal reasoning beyond single-number accuracy
- Speech and audio benchmarks that expose real failure modes
- Action representations for VLA agents in simulated tasks
- Better experiment organization for small research teams
Most repositories here are learning-oriented prototypes rather than production systems. I try to keep each README clear about what the project can and cannot do.