This is a multi-agent system built with LangGraph to tackle the GAIA benchmark.
This project serves as my final assignment for the Hugging Face Agents Course, specifically the Unit 4 Hands-on.
Current Performance: 70% (14/20) on Level 1
- Issues:
- 3 tasks: Missing multimodal video evaluation (haven't implemented this yet).
- 1 task: Chess game image parsing (still trying to figure out a fix for this).
- 2 tasks: Random formatting bugs (e.g., truncating "Fresh basil" to "basi", mixing up the before/after pitcher order, or dropping "freshly squeezed" from lemon juice).
The system uses a Supervisor/Orchestrator pattern. A lightweight, fast LLM acts as the router to classify the prompt, then hands the task off to specialized sub-agents powered by a heavier reasoning model.
- Orchestrator: Qwen/Qwen2.5-7B-Instruct
- Sub-Agents & Finalizer: Gemini-2.5-Flash
- Traces: LangSmith
┌────────────┐     ┌───────────────┐
│  classify  │────►│ researcher    │──┐
│  (router)  │────►│ mathematician │──┤
│            │────►│ file_analyst  │──┼──► [finalizer] ──► final_answer
│            │────►│ generalist    │──┘
└────────────┘     └───────────────┘
- Researcher: Built for deep web searches (Tavily) and fact retrieval (Wiki).
- Mathematician: Handles math problems using a calculator tool and dynamic Python execution.
- File Analyst: Triggered whenever a file is attached. Reads files, runs Python data scripts (like pandas for Excel), and parses audio/images.
- Generalist: The fallback agent for multi-step reasoning that doesn't fit cleanly into one bucket.
- Answer Extraction (Finalizer): Aligns the model output with the exact answer format the evaluation benchmark expects.
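Below is a minimal sketch of how this routing graph could be wired with LangGraph. The node names follow the diagram, but the state shape and node bodies are simplified stand-ins, not the repo's actual code:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class AgentState(TypedDict):
    question: str
    route: str
    answer: str


def classify(state: AgentState) -> dict:
    # Stand-in router: the real one is a Qwen2.5-7B-Instruct call that
    # returns one of the four sub-agent names.
    return {"route": "researcher"}


def make_agent(name: str):
    def node(state: AgentState) -> dict:
        # Stand-in: each real sub-agent is a Gemini-2.5-Flash ReAct loop.
        return {"answer": f"[{name}] draft answer to: {state['question']}"}
    return node


def finalizer(state: AgentState) -> dict:
    # Normalizes the draft into the benchmark's short-answer format.
    return {"answer": state["answer"].strip()}


builder = StateGraph(AgentState)
builder.add_node("classify", classify)
builder.add_node("finalizer", finalizer)
for name in ("researcher", "mathematician", "file_analyst", "generalist"):
    builder.add_node(name, make_agent(name))
    builder.add_edge(name, "finalizer")
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", lambda s: s["route"])
builder.add_edge("finalizer", END)

graph = builder.compile()
print(graph.invoke({"question": "What is the population of Tokyo?"}))
```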
- tavily_search: High-fidelity web search with raw content extraction (raw_content, advanced mode, extract).
- wiki_search: Knowledge retrieval formatted in clean Markdown (search → page → Jina AI).
- calculator: Rapid evaluation of mathematical expressions.
- run_python: Environment for generating and executing dynamic scripts.
- read_file: Direct ingestion and parsing of local file data.
- execute_python: Execution of predefined Python scripts and local assets.
- analyze_image: Vision processing for image analysis (Gemini 2.5 Flash).
- transcribe_audio: Neural speech-to-text processing (Gemini 2.5 Flash).
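As an illustration, here is roughly how a tool like calculator can be declared with LangChain's @tool decorator; this is a hedged sketch, not the repo's implementation, and each sub-agent would receive its toolbox via bind_tools:

```python
import math

from langchain_core.tools import tool


@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression, e.g. "sqrt(16) + 2**3"."""
    # Expose only math-module names to eval; no builtins at all.
    allowed = {k: getattr(math, k) for k in dir(math) if not k.startswith("_")}
    return str(eval(expression, {"__builtins__": {}}, allowed))
```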
Check out how the different sub-agents handle various tasks.
I tweaked the standard setup to fix a lot of the common formatting, context limit, and hallucination issues you usually see in these benchmarks:
- Wiki Extraction via Jina AI: Standard Wikipedia search/loaders usually have formatting issues or aggressively cut off content. I routed Wiki searches through Jina AI to grab the entire page in clean Markdown.
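A minimal sketch of that idea, assuming Jina AI's public Reader endpoint (prefix the target URL with https://r.jina.ai/ and it returns the page as Markdown); fetch_wiki_markdown is a hypothetical helper name, not the repo's:

```python
import requests


def fetch_wiki_markdown(page_url: str) -> str:
    """Fetch any page as clean Markdown via Jina AI's Reader endpoint."""
    resp = requests.get(f"https://r.jina.ai/{page_url}", timeout=30)
    resp.raise_for_status()
    return resp.text


md = fetch_wiki_markdown("https://en.wikipedia.org/wiki/Giganotosaurus")
```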
- Aggressive Token Reduction: Getting entire Wiki pages is great, but it eats the context window. I wrote a custom text cleaner (_clean_wiki_content) that strips out Jina metadata, image markdown, "See also"/References sections, and inline link noise while preserving the actual text and tables. Result: cut token usage by ~50-65% per search.
Examples:
- 1928 Summer Olympics: 17,220 → 5,891 tokens (65% reduction)
- Giganotosaurus: 13,114 → 4,442 tokens (66% reduction)
- Mercedes Sosa: 15,818 → 7,950 tokens (49% reduction)
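A rough sketch of what such a cleaner can look like; the real _clean_wiki_content in the repo likely covers more cases (Jina metadata headers, table handling, etc.):

```python
import re


def _clean_wiki_content(text: str) -> str:
    # Cut everything from "See also"/"References"/"External links" onward.
    text = re.split(
        r"\n#{1,6}\s*(?:See also|References|External links)\b", text
    )[0]
    # Drop image markdown entirely: ![alt](url)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Collapse inline links [text](url) -> text, keeping the anchor text.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Squash the blank-line runs left behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```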
- Tavily Deep Search: Upgraded the Tavily tool to use search_depth="advanced" and include_raw_content to pull full page text instead of just snippets.
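With the tavily-python client, that upgrade amounts to two parameters on the search call (the query below is just an example):

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")
response = client.search(
    "1928 Summer Olympics athletes by country",  # example query
    search_depth="advanced",      # deeper crawl than the default "basic"
    include_raw_content=True,     # full page text instead of snippets
)
for result in response["results"]:
    print(result["url"], len(result.get("raw_content") or ""))
```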
- Model Upgrade: Swapped out Gemini Lite for Gemini-2.5-Flash across all sub-agents to significantly cut down on hallucinations during complex reasoning.
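Assuming the sub-agents are built on langchain-google-genai, the swap is a one-line model change (the exact wiring in the repo may differ):

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Shared reasoning model for all sub-agents and the finalizer.
sub_agent_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
```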
- Dual-mode Python subprocess execution with timeout handling:
  - Split Python execution into two tools: execute_python (for running existing local files) and run_python (for running scripts generated on-the-fly by the LLM).
  - Added auto-stripping for markdown backticks so generated Python code runs smoothly without syntax errors.
  - Added execution timeout to avoid hanging processes.
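A condensed sketch of the idea, reusing the two tool names from above; _strip_fences and the exact signatures are hypothetical:

```python
import re
import subprocess
import sys
import tempfile


def _strip_fences(code: str) -> str:
    """Remove the markdown fences LLMs tend to wrap generated code in."""
    return re.sub(r"^```(?:python)?\s*|\s*```$", "", code.strip())


def execute_python(path: str, timeout: int = 30) -> str:
    """Run an existing local script in a subprocess, with a hard timeout."""
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout or result.stderr
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout}s"


def run_python(code: str, timeout: int = 30) -> str:
    """Write LLM-generated code to a temp file, then reuse execute_python."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(_strip_fences(code))
    return execute_python(f.name, timeout=timeout)
```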
You'll need API keys for Google (Gemini), Hugging Face (Qwen routing), and Tavily (Search):

```bash
export HF_TOKEN="hf_your_token_here"
export GOOGLE_API_KEY="AIzaSy_your_key_here"
export TAVILY_API_KEY="tvly-your_key_here"
```

Then install the required dependencies (LangGraph, LangChain, Google GenAI, etc.).
You can hit the entry point (main.py) in a few different ways:
- Single Query (CLI)

  ```bash
  python main.py -q "What is the population of Tokyo?"
  ```

- Query with an Attached File

  ```bash
  python main.py -q "Calculate the sum of the revenue column." -f "/path/to/financials.xlsx"
  ```

- Interactive Mode (REPL)

  ```bash
  python main.py -i
  ```

  (To attach a file in chat, append file:<path> to your message.)

To see the ReAct loop thinking step-by-step and watch tool execution outputs, use the -v flag:

```bash
python main.py -q "Who won the 1928 olympics?" -v
```






