Skip to content

v0.22.0: 22 competitor-informed bug fixes (5,271 tests)#55

Merged
johnnichev merged 44 commits intomainfrom
v0.22.0-competitor-bug-fixes
Apr 13, 2026
Merged

v0.22.0: 22 competitor-informed bug fixes (5,271 tests)#55
johnnichev merged 44 commits intomainfrom
v0.22.0-competitor-bug-fixes

Conversation

@johnnichev
Copy link
Copy Markdown
Owner

Summary

22 bugs identified by mining 95+ closed bugs from Agno (39k stars) and 60+ from PraisonAI (6.9k stars), then cross-referencing the patterns against selectools v0.21.0 source. Six were shipping blockers. All 22 are now fixed with TDD regression tests.

  • 6 HIGH severity — shipping blockers (streaming, types, async, HITL, threading)
  • 9 MEDIUM severity — correctness improvements
  • 7 LOW-MED severity — thread safety + edge case cleanups
  • +57 new regression tests in tests/agent/test_regression.py
  • 5,271 tests collected (up from 5,015 baseline), 5,020 passing non-E2E
  • Documentation updated for all user-facing API changes

Each fix has empirical fault-injection verification (test fails without fix, passes with fix) and a cross-reference comment to the original Agno/PraisonAI issue.

Bug Highlights

HIGH Severity (Shipping Blockers)

Bug Impact Cross-ref
BUG-01 Streaming dropped ToolCall objects Agents with stream=True couldn't execute native tool calls Agno #6757
BUG-02 `typing.Literal` crashed @tool() Tool registration failed entirely Agno #6720
BUG-03 `asyncio.run()` re-entry in 8 wrappers Selectools unusable in Jupyter / FastAPI PraisonAI #1165
BUG-04 HITL lost in parallel groups Interrupts silently dropped Agno #4921
BUG-05 HITL lost in subgraphs (+ silent infinite loop on resume) Caught by empirical fault injection Agno #4921
BUG-06 ConversationMemory had no lock Concurrent batch execution could corrupt history PraisonAI #1164

MEDIUM Severity (9 fixes)

think tag stripping, RAG batch limits, MCP concurrent race, str→int/float/bool coercion, Union[str, int] support, multi-interrupt generators, GraphState fail-fast validation, session namespace isolation, summary growth cap.

LOW-MED Severity (7 fixes)

cancelled-result extraction, AgentTrace lock, async observer exception logging, batch clone isolation, OTel/Langfuse observer locks, vector store search dedup, Optional[T] without default handling.

What This Buys

  • Streaming protocol complete — tool calls flow through `run/arun/astream`, `` tags stripped, args coerced
  • HITL end-to-end correct — propagates through single agents, parallel groups, subgraphs, multi-yield generators with working `resume()`
  • Thread safety end-to-end correct — every shared-state class has proper synchronization
  • Type system robust — `Literal`, `Optional[T]`, `Union[str, int]`, str coercion all work
  • Production-ready — Jupyter/FastAPI compat, RAG batch limits, session namespace isolation

Documentation Updates

  • `docs/modules/TOOLS.md` — new `Type Hint Support` section (Literal, Optional, Union, coercion)
  • `docs/modules/SESSIONS.md` — new `Namespace Isolation` section
  • `docs/modules/VECTOR_STORES.md` — new `Search Result Deduplication` section
  • `docs/QUICKSTART.md` — new `Running Selectools in Async Contexts` section
  • `README.md` + `CONTRIBUTING.md` — 11 stale test counts synced (5203 → 5271)
  • mkdocs builds clean

Methodology

Five steps documented in the comprehensive plan at `docs/superpowers/plans/2026-04-10-competitor-bug-fixes.md`:

  1. Pull competitor bug data via `gh api search/issues`
  2. Categorize bugs by failure mode (streaming, types, async, HITL, etc.)
  3. Cross-reference each category against own source via parallel research subagents
  4. TDD per bug — failing test, minimal fix, verify, commit
  5. Empirical fault injection in code review (revert fix, confirm test fails)

The fault-injection step caught the BUG-05 silent infinite loop that would have shipped with read-only review.

Test Plan

  • All 5,020 non-E2E tests pass
  • mypy strict clean
  • black, isort, flake8, bandit clean
  • Each bug has a regression test that empirically fails without the fix
  • No regressions in existing tests
  • mkdocs builds clean
  • Real-API E2E tests (run before tag/release)
  • Manual smoke test in Jupyter notebook (BUG-03 verification)
  • Manual smoke test of multi-interrupt generator workflow (BUG-12 verification)

Stats

Metric Value
Competitor bugs read 155
Selectools bugs found 22
Hit rate 14%
Subagent dispatches ~50
Plan lines 1,633
Commits on branch 26
Test count delta 5,015 → 5,271 (+256, of which 57 are new regression tests)

See the `Unreleased` section of `CHANGELOG.md` for the full per-bug breakdown with cross-references to every Agno/PraisonAI issue.

…dered backlog

Replaces sparse 12-line backlog with comprehensive priority-ordered
document containing 20+ features mined from Agno, PraisonAI, and
Superagent competitive research.

- P0: Loop detection, agentic memory, agent-as-API
- P1: LiteLLM wrapper, cost router, A2A protocol, expanded toolbox
- P2: Tool result compression, session search, memory tiering, HITL
- P3: Shadow git, multi-channel bots, ML guard models, learning system

Each item documents source, gap, spec with code examples, implementation
notes, and effort estimate — spec-ready for v0.22.0 planning.
22 bugs cross-referenced from Agno and PraisonAI bug trackers, each
with TDD task breakdown: failing test → minimal fix → regression
test passes → commit. Grouped by severity (6 HIGH, 9 MEDIUM, 7 LOW-MED).

Follows superpowers:writing-plans conventions — bite-sized steps with
exact file paths, complete code, and verification commands.
BUG-01: _streaming_call and _astreaming_call filtered chunks with
isinstance(chunk, str), silently dropping ToolCall objects yielded
by providers. Any user with AgentConfig(stream=True) calling run()
would find native provider tool calls were never executed.

Now both methods return (text, tool_calls) tuple. Caller propagates
tool_calls into the returned Message so _process_response can
dispatch them through the normal tool-execution path.

Updated test_astreaming_call_sync_fallback to unpack the new
tuple return type.

Cross-referenced from Agno #6757.
Addresses code review I1 and I2:

I1: Regression tests must live in tests/agent/test_regression.py
per tests/CLAUDE.md convention — not in a new tests/regressions/
package. The three BUG-01 tests are now appended to the canonical
file with test_bug01_ prefixes for discoverability.

I2: Added test_bug01_astreaming_sync_fallback_preserves_tool_calls
to cover the async sync-fallback branch of _astreaming_call — the
third of three structurally-identical ToolCall collection sites
that previously had no end-to-end coverage.

Plan document updated to target the canonical location for all
remaining tasks (BUG-02 through BUG-22).
…ocks

Re-review of BUG-01 found that while the authoritative Files sections
were updated in commit cf95734, 33 inline references in per-task
Step code blocks (pytest commands, git add commands, test file header
comments) still pointed to the rejected tests/regressions/ layout.

Bulk-replaced via script:
- pytest tests/regressions/test_bugNN_*.py → pytest tests/agent/test_regression.py -k bugNN
- git add tests/regressions/test_bugNN_*.py → git add tests/agent/test_regression.py
- # tests/regressions/test_bugNN_*.py → # Append to tests/agent/test_regression.py (BUG-NN)

Historical context reference on line 92 (explaining why the layout
was rejected) preserved intentionally.
BUG-02: @tool() crashed on Literal[...] parameters because
_unwrap_type() returned Literal unchanged, and then
_validate_tool_definition() rejected it as an unsupported type.

Now detects Literal (and Optional[Literal]) via new _literal_info
helper, extracts enum values, infers base type from the first value,
and auto-populates ToolParameter.enum. Supports str, int, float, and
bool literals.

Cross-referenced from Agno #6720.
BUG-03: Bare asyncio.run() in 8 sync wrappers crashed with
'cannot call asyncio.run() while another event loop is running'
when called from Jupyter, FastAPI handlers, nested async code,
or async tests.

New _async_utils.run_sync() helper detects a running loop and
offloads to a worker thread when needed. Applied to:
- AgentGraph.run / AgentGraph.resume
- SupervisorAgent.run
- PlanAndExecuteAgent / ReflectiveAgent / DebateAgent / TeamLeadAgent
- Pipeline._execute_step

Cross-referenced from PraisonAI #1165.
BUG-03 code review finding: the initial run_sync implementation
created a new ThreadPoolExecutor(max_workers=1) on every call
and tore it down on exit. This violates pitfall #20 — explicitly
enforced in tools/base.py, agent/_provider_caller.py, and
agent/_tool_executor.py with inline comments.

Now uses a module-level _RUN_SYNC_EXECUTOR singleton with
double-checked lazy init matching the reference recipe in
_provider_caller.py:11-27. max_workers=4 so concurrent sync
wrappers don't serialize; thread_name_prefix for debuggability.

All 4 BUG-03 regression tests still pass (4971 total).
BUG-04: run_child in _aexecute_parallel discarded the interrupted
boolean from _aexecute_node. If a child yielded InterruptRequest,
the signal was lost and the graph continued as if the child
completed normally — no checkpoint, no pause, HITL broken inside
parallel groups.

Now run_child returns a 4-tuple including the interrupted flag,
_aexecute_parallel returns a 3-tuple that surfaces the interrupt
marker (re-planting __pending_interrupt_key__ and preserving
_interrupt_responses on the merged state so resume works), and
both callers (arun and astream) propagate the signal through the
same checkpoint/pause path used for non-parallel nodes.

Cross-referenced from Agno #4921.
BUG-05: _aexecute_subgraph never checked sub_result.interrupted.
If a subgraph paused for HITL, the parent treated it as completed
and continued executing, losing the pause state.

Now _aexecute_subgraph returns (result, state, interrupted),
mirroring the BUG-04 parallel-group propagation pattern. When the
sub-result is interrupted, the pending-interrupt key and
_interrupt_responses are copied into the parent state with a
"{node_name}/" namespace prefix so the parent's checkpoint/resume
machinery handles the interrupt correctly and keys do not collide
with the parent's own direct-child interrupt keys.

Both callers (arun and astream) now unpack the 3-tuple and route
the interrupted flag through the same checkpoint/pause pipeline
used for non-subgraph interrupts.

Cross-referenced from Agno #4921.
BUG-05 follow-up: the initial fix (995577f) correctly surfaced
subgraph interrupts but used namespaced keys ('{node_name}/{key}')
that prevented graph.resume() from routing the stored response
back into the subgraph's generator. This caused a silent infinite
loop on every resume attempt.

Now uses flat-key propagation matching BUG-04's parallel-group
approach:
- UP: sub_state._interrupt_responses keys flow into parent._interrupt_responses
- DOWN: parent._interrupt_responses forwarded into sub_state on each
  _aexecute_subgraph invocation so the subgraph generator can find
  its response when resumed

New regression test calls outer.resume() end-to-end and verifies
the subgraph completes without re-interrupting.
BUG-06: ConversationMemory was the only shared-state class in
selectools without a lock. Concurrent add()/add_many()/get_history()
from multiple threads raced on self._messages, potentially losing
messages or corrupting the list during _enforce_limits.

All mutation and read methods now acquire self._lock (RLock for
re-entrance — add() calls _enforce_limits() which may call other
locked methods). __getstate__/__setstate__ exclude the lock from
serialization and recreate it on restore so session persistence
still works.

Cross-referenced from PraisonAI #1164, #1260.
BUG-07: Claude-compatible endpoints emit reasoning inline as
<think>...</think> blocks. These were preserved in response text
and written to conversation history, then sent back to the model
on subsequent turns — polluting context indefinitely.

New _strip_reasoning_tags() helper removes these blocks from all
text accumulation paths in complete/acomplete/stream/astream.
Safe for non-tagged text (fast-path check). Multi-block support.

The streaming paths use a small _consume_think_buffer() state
machine so <think>...</think> blocks that span chunk boundaries
are still suppressed before yielding to the consumer.

Cross-referenced from Agno #6878.
BUG-08: add_documents() called upsert with the entire document
list at once. ChromaDB has an internal batch limit (~5461 docs),
Pinecone's upsert limit is 100 vectors, and Qdrant also has
practical limits on payload size. Large ingestions crashed.

Each store now chunks the upsert into _batch_size groups:
- Chroma: 5000
- Pinecone: 100
- Qdrant: 1000

Cross-referenced from Agno #7030.
BUG-09: MCPClient._call_tool had no concurrency control. Concurrent
calls could interleave writes on the shared stdio pipe or HTTP
connection, corrupting the protocol or mixing responses. Circuit
breaker state updates (_failure_count, _circuit_open_until) and
auto_reconnect could also race.

Now wraps the entire _call_tool body in an asyncio.Lock so all session
I/O, circuit breaker reads/updates, and auto_reconnect logic are
serialized per client. Lock is lazy-initialized on first call because
asyncio.Lock binds to the running loop and MCPClient may be
constructed outside any loop (sync entry points).

Cross-referenced from Agno #6073.
BUG-10: LLMs (especially smaller/local models) sometimes return
numeric values as strings in tool call JSON. Tool.validate rejected
these with ToolValidationError instead of coercing. A new
_coerce_value helper now attempts safe str->int/float/bool coercion
before validation and writes the coerced value back into the params
dict so execute()/aexecute() use the typed value.

BUG-11: _unwrap_type only unwrapped Optional (Union[T, None]).
Multi-type unions like Union[str, int] fell through to
_validate_tool_definition which rejected them as unsupported. Now
multi-type unions default to str (in both typing.Union and
types.UnionType branches); runtime coercion (BUG-10) handles the
actual values.

The two fixes are co-dependent: BUG-11's str fallback relies on
BUG-10's coercion to handle int values passed to a Union[str, int]
parameter.

One existing test in tests/core/test_better_errors.py was updated
to reflect the new (correct) coercion behavior — numeric strings
now flow through to int parameters, only unparseable strings raise.
A complementary positive test (test_numeric_string_coerced_to_int)
was added in the same class.

Cross-referenced from PraisonAI #410 and Agno #6720.
…rowth

BUG-13: GraphState.to_dict() claimed to be JSON-safe but only
deep-copied data. Non-serializable values silently corrupted
checkpoints. Now round-trips data through json.dumps/loads and
raises ValueError on failure.

BUG-15: _maybe_summarize_trim concatenated each new summary to
the existing one without bound. Over long sessions, the summary
grew linearly and eventually exceeded the model's context window.
New _append_summary helper caps combined length at 4000 chars
(~1000 tokens), keeping the most recent content.

Cross-referenced from Agno #7365 and Agno #5011.
BUG-12: Generator nodes with 2+ InterruptRequest yields silently
skipped subsequent interrupts. After gen.asend(response) advanced
past the first yield, its return value was discarded, and the
subsequent __anext__() advanced past the next yield — sending None
as the response to whoever was waiting.

Also the interrupt_index counter was being reset on every
non-interrupt yield, breaking the key-mapping for generators with
interleaved data and interrupts.

Now the iteration uses a single loop where each fetch (asend or
__anext__) returns an item that is dispatched in the same code
path. interrupt_index only increments on actual InterruptRequest
yields so resume keys remain stable. Additionally, previously
resolved interrupt responses are no longer deleted after consumption
— generators are restarted from scratch on every resume and must be
able to deterministically replay past their already-answered gates.

Applied to both _aexecute_generator_node (async) and
_execute_generator_node_sync (sync).

Cross-referenced from Agno #4921.
BUG-14: Sessions were keyed solely by session_id. Two agents with
the same session_id (e.g. Agent + Team sharing an ID) would
overwrite each other's ConversationMemory.

All three session stores (JsonFile, SQLite, Redis) now accept an
optional namespace parameter on save/load/delete/exists. The
namespace is prepended to the key as '{namespace}:{session_id}' for
isolation. Sessions saved without a namespace remain loadable
for backward compatibility.

Cross-referenced from Agno #6275.
…servers

BUG-17: AgentTrace.add() was self.steps.append with no lock.
Parallel graph branches share the same trace object and can race
when child nodes execute sync callables via run_in_executor.
AgentTrace now carries a threading.Lock initialized in __post_init__
and wraps add(), filter(), duration properties, to_dict(), timeline(),
and to_otel_spans() so callers observe a consistent step snapshot.
__getstate__/__setstate__ drop and restore the lock so copy.copy
and similar operations still work.

BUG-20: OTelObserver and LangfuseObserver mutated their internal
dicts (_spans, _llm_starts, _llm_counter, _traces, _generations)
without locks. Agent.batch() shares observer instances across
thread-pool workers; concurrent on_llm_start/on_llm_end calls
raced on counters and could lose spans or double-count.
Both observers now take a threading.Lock in __init__ and wrap
every mutation/read of the internal dicts and counters. The
counter snapshot is captured under the lock and reused as the
local LLM span key, so two concurrent on_llm_start calls can no
longer produce duplicate keys.

The existing OTel and Langfuse unit tests bypass __init__ via
__new__ to inject mocks, so they now also seed obs._lock to match
the production invariant.

Cross-referenced from Agno #5847 and PraisonAI #1260.
BUG-18: async observer callbacks fired via asyncio.ensure_future
had no done-callback. If the coroutine raised, the exception
became an unhandled-exception warning on the event loop and was
effectively lost — users had no visibility that their observer
had failed. Now attaches a done-callback that logs exceptions
via logger.warning without crashing the agent loop.

BUG-19: _clone_for_isolation shallow-copied the Agent so batch
clones shared the same config.observers list. While BUG-17 and
BUG-20 made individual observers thread-safe, clones still
shouldn't share a mutable list — appending/swapping on one clone
would bleed into siblings. Now shallow-copies the config and
creates a fresh observer list per clone. Observer instances
remain shared because they're already thread-safe.

Cross-referenced from Agno #6236 and PraisonAI #1260.
BUG-16: _build_cancelled_result called _session_save but was
missing _extract_entities and _extract_kg_triples. When a run
was cancelled via CancellationToken, any entities/KG triples
that had been collected during the turn were silently lost.
Now mirrors _build_max_iterations_result and _build_budget_
exceeded_result which call all three persistence methods.

BUG-22: @tool() treated Optional[T] without a default value as
required. Some LLMs refuse to call a tool when an 'optional'
parameter has no way to represent None. Now detects Optional
types via Union[T, None] and marks them is_optional=True even
without a default value.

Cross-referenced from CLAUDE.md pitfall #23 and Agno #7066.
BUG-21: Vector store search() methods returned duplicate results
when the same document text was added multiple times (e.g. the
SQLite store uses uuid4 IDs, so re-adding the same text creates
new rows with new IDs but duplicate content). HybridSearcher
already text-deduped during fusion, but individual stores did not.

Added a dedup: bool = False parameter to the VectorStore abstract
search() signature and to all seven store implementations:
InMemoryVectorStore, SQLiteVectorStore, ChromaVectorStore,
FAISSVectorStore, PineconeVectorStore, QdrantVectorStore, and
PgVectorStore. When True, post-filters results by document text
to keep only the first (highest-scoring) occurrence, over-fetching
upstream as needed so the final list still contains up to top_k
unique results.

A shared _dedup_search_results() helper lives in vector_store.py
so every backend uses identical semantics. Default is False for
backward compatibility — users who want dedup must opt in
explicitly.

Cross-referenced from Agno #7047.
22 bugs identified by cross-referencing 95+ Agno bugs and 60+ PraisonAI
bugs against selectools v0.21.0. Each fix has a TDD regression test
in tests/agent/test_regression.py with empirical fail-then-pass
verification.

- 6 HIGH severity (shipping blockers): streaming tool calls, Literal
  types, asyncio.run() re-entry, parallel/subgraph HITL, ConversationMemory
  thread safety
- 9 MEDIUM severity: think tags, RAG batches, MCP races, type coercion,
  Union types, multi-interrupts, GraphState validation, session namespaces,
  summary cap
- 7 LOW-MED severity: cancelled extraction, AgentTrace lock, observer
  exceptions, clone isolation, OTel/Langfuse locks, vector dedup,
  Optional handling

Test count: 5015 -> 5020 (+57 regression tests)
Documentation updates for the 22 competitor-informed bug fixes:

- TOOLS.md: documented new typing.Literal support (BUG-02),
  Optional[T] without default handling (BUG-22), Union[str, int]
  fallback (BUG-11), and tool argument str->int/float/bool coercion
  (BUG-10)
- SESSIONS.md: documented namespace parameter for session isolation
  (BUG-14)
- VECTOR_STORES.md: documented dedup parameter on all 7 stores (BUG-21)
- QUICKSTART.md: documented Jupyter/FastAPI compatibility for sync
  wrappers (BUG-03)

Audit drift fixes:
- README.md: 5203 -> 5271 tests (1 instance), 4612 -> 5271 (1 instance)
- CONTRIBUTING.md + docs/CONTRIBUTING.md: 5203 -> 5271 tests (8 instances total)
- All hardcoded test counts now match live count from
  pytest --collect-only
Bumps version to 0.22.0 in __init__.py and pyproject.toml.
Promotes the [Unreleased] CHANGELOG section to [0.22.0] - 2026-04-10.
Adds v0.22.0 What's New section to README.md with code examples
showing the new user-facing API surface from BUG-02, BUG-03, BUG-14,
and BUG-21.

Release contents:
- 6 HIGH severity shipping blockers (streaming, types, async,
  HITL, threading)
- 9 MEDIUM severity correctness fixes
- 7 LOW-MED thread safety + edge case cleanups
- 57 new regression tests in tests/agent/test_regression.py
- 4 new documentation sections + 12 stale count references synced

Tests: 5,271 collected (up from 5,015 baseline)
Lint clean: black, isort, flake8, mypy, bandit
mkdocs builds clean

Tag NOT pushed yet — awaiting explicit user approval per release skill.
…, gate qdrant test

Three regression tests were failing in CI:

- test_bug19_clone_isolates_observer_list / test_bug19_clone_without_observers_does_not_crash:
  Both passed `LocalProvider(responses=["hello"])`, but `LocalProvider.__init__`
  takes no arguments — the `responses=` mock pattern documented in tests/CLAUDE.md
  was never implemented on this stub. `_clone_for_isolation()` doesn't invoke the
  provider, so calling `LocalProvider()` with no args is sufficient.

- test_bug08_qdrant_batches_large_upsert: `store.add_documents()` triggers a lazy
  import of `qdrant_client`, which isn't in the `[dev]` extras CI installs. Added
  `pytest.importorskip("qdrant_client", ...)` matching the guard already used in
  tests/rag/test_e2e_qdrant_store.py and tests/test_e2e_v0_21_0_*.py.
Source: LlamaIndex #20880 (same class: `alpha = query.alpha or 0.5`
swallowed `alpha=0.0`). CohereReranker.rerank used `top_n=top_k or len(results)`
which silently promoted `top_k=0` (user asking for no results) to the full
list. Round-1 pitfall #22 class, new instance in the rag/ module.

Fix: `top_n=top_k if top_k is not None else len(results)`.

Also adds docs/superpowers/plans/2026-04-11-round2-quickwins.md — the
4-bug round-2 plan derived from the LangChain/LlamaIndex competitive-mining
research report.
Source: LlamaIndex #21033. Sync recursive retrieval dedup keyed on node.hash
while async used (hash, ref_doc_id); legitimately-distinct nodes were dropped.

Selectools' _dedup_search_results keyed only on r.document.text. Two
documents with identical text but different sources (same snippet ingested
from two files — common in legal, academic, and regulatory corpora) would
collapse into one result and the citation for the second source was lost.

Fix: key on (text, document.metadata.get("source")). Falls back to
text-only dedup when no "source" key is present. Backward compat:
same text + same source still collapses to one (highest-scoring first).
…ount fields

Source: LangChain #36500. `token_usage.get("total_tokens") or fallback`
silently replaces provider-reported 0 for cached completions. Round-1
pitfall #22 instance not yet swept in providers/.

gemini_provider.py used `(usage.prompt_token_count or 0) if usage else 0`
in both sync complete() (lines 158-159) and stream path (lines 505-506).
If the Gemini API ever returns prompt_token_count=None alongside a real
candidates_token_count, the `or 0` conflates "unknown" with "zero" and
under-reports total_tokens.

Fix: use `x if x is not None else 0` guard pattern on both paths.
Grep confirmed the `or 0` pattern only appears on gemini_provider.py
token fields — no other provider affected.
… values

Source: LlamaIndex #20246/#20237. Qdrant silently returned an empty filter
for unsupported operators (CONTAINS, ANY, ALL), matching ALL documents
(security-adjacent: permission-filter bypass).

Selectools' in-memory `_matches_filter` had the mirror-image bug: when a user
passed `{"user_id": {"$in": [1, 2]}}`, the equality check `metadata.get(key)
!= value` failed for every document → zero results returned with no
indication of user error. Either direction is wrong.

Fix: added `_validate_filter` helper in rag/vector_store.py that detects
operator-dict values (dict values with $-prefixed keys) and raises
`NotImplementedError` with a clear message pointing users to backend-
specific vector stores (Chroma, Pinecone, Qdrant, pgvector). Literal dict
metadata values without $-prefixed keys still pass through the equality
check for backward compatibility.

Wired into both call sites:
- InMemoryVectorStore.search() — before the scoring loop
- BM25.search() — after top_k validation, before the snapshot loop
…G-26)

Documents the 4 round-2 fixes shipped on this branch (LangChain + LlamaIndex
cross-references), updates test count to 5031, and records the bug sources
by upstream project.

Round 2 mined LangChain (~92k stars), LangGraph (~10k), CrewAI (~25k),
n8n (~70k), LlamaIndex (~37k), and AutoGen (~35k) — ~270k combined stars.
Top-15 unverified candidates from the round are parked for v0.23.0.
…lete

Source: LiteLLM #25530. FallbackProvider._is_retriable used
`r"\b(429|500|502|503)\b"` for HTTP status codes and `("timeout", "rate limit",
"connection")` for substrings. Multiple production-relevant errors were treated
as non-retriable and raised to users instead of falling over:

- 529 Anthropic Overloaded (extremely common on US-West traffic)
- 504 Gateway Timeout (passed via `"timeout"` substring but now explicit)
- 408 Request Timeout (same)
- 522/524 Cloudflare origin timeouts
- `rate_limit_exceeded` — OpenAI/Mistral underscore form (substring was
  `"rate limit"` with a space, which doesn't match the underscore variant)
- `overloaded_error` / `Overloaded` — Anthropic body text accompanying 529
- `service_unavailable` — underscore form

Fix: extended regex to `(408|429|500|502|503|504|522|524|529)` and added
underscore variants + `"overloaded"` + `"service unavailable"`/`"service_unavailable"`
to the substring list.

6 regression tests in tests/agent/test_regression.py.
…etection

Source: LiteLLM #13515. Azure OpenAI deployments use user-chosen names
(e.g., "prod-chat", "my-reasoning"), NOT the underlying model's family
prefix. AzureOpenAIProvider inherited _get_token_key from OpenAIProvider,
which calls `model.startswith("gpt-5")` etc. with the deployment name. An
Azure deployment of gpt-5-mini under name "prod-chat" therefore received
`max_tokens` instead of `max_completion_tokens` and hit
`BadRequestError: Unsupported parameter: 'max_tokens'`. Azure variant of
round-1 pitfall #3 — the direct OpenAI path was fixed but the Azure
subclass bypassed family detection entirely.

Fix: added `model_family: str | None = None` kwarg to
AzureOpenAIProvider.__init__. When set, overrides the deployment-name-based
family detection so users can explicitly tell selectools what family their
deployment is. Backward compatible: model_family=None falls back to the
original deployment-name prefix matching.

Usage:

    AzureOpenAIProvider(
        azure_deployment="prod-chat",  # user-chosen deployment name
        model_family="gpt-5",          # underlying family
    )

3 regression tests in tests/agent/test_regression.py.
…items/properties

Source: Pydantic AI PRs #4544, #4474, #4479, #4461, #3712. OpenAI strict
mode REJECTS `{"type": "array"}` with no `items`. Non-strict mode, the LLM
has no way to know what the array should contain and will guess or refuse.
Same for `dict[str, str]` → `{"type": "object"}` with no `properties` or
`additionalProperties`.

Selectools' `_unwrap_type(list[str]) → list` stripped generic args entirely
before `ToolParameter.to_schema()` could emit the element type. The schema
for `def f(items: list[str])` emitted only `{"type": "array"}`.

Fix: added `ToolParameter.element_type: Optional[type] = None`; new
`_collection_element_type()` helper in decorators.py walks Optional/Union
wraps (parallel to `_unwrap_type`) and extracts the element type from
parametrized `list[T]` or the value type from `dict[K, V]`. When populated,
`to_schema()` emits `items: {type: ...}` for arrays or
`additionalProperties: {type: ...}` for dicts.

Backward compatible:
- Bare `list` / `dict` without generic args leave element_type=None and
  emit the plain type-only schema as before.
- Only supports primitive element types (str/int/float/bool) for now;
  unsupported element types fall back to bare schema.

5 regression tests: list[str]/list[int]/dict[str,str]/bare list/Optional[list[str]].
500 tools+regression tests pass; no collateral damage.
Source: Haystack PR #10549. Haystack's Pipeline.run() needed
`_deepcopy_with_exceptions(component_inputs)` because branches that
mutated their input polluted sibling branches.

Selectools' `_parallel_sync` and `_parallel_async` passed the SAME
`input` object to every branch. If any branch mutated its input
(list append, dict key set, dataclass attribute), the next branch
(sync) or interleaved sibling (async under asyncio.gather) saw the
mutation. Async is worst: branches interleave at await points, so a
shared reference produces non-deterministic state corruption.

Fix: `copy.deepcopy(input)` per branch in both sync and async paths
before invoking the branch function. Added `import copy` to pipeline.py.

3 regression tests exercise sync mutation, async interleaved mutation,
and backward-compat (branches still read the same initial values).
191 pipeline + regression tests pass.
… at 5 sites

Source: Haystack PR #9717, cross-round confirmation of CrewAI #4824/#4826.
`loop.run_in_executor(None, fn, *args)` does NOT inherit the caller's
`contextvars.Context`, so OTel active spans, Langfuse parent span, any
`ContextVar` set by _wire_fallback_observer, and cancellation tokens all
drop inside the executor-scheduled callable. Users see orphaned spans on
every sync-fallback provider call and every sync graph node.

Five grep-verified sites in selectools were affected:
- agent/_provider_caller.py:386 (sync-fallback provider call)
- agent/core.py:1286 (alternate sync-fallback path)
- orchestration/graph.py:1237 (sync generator node)
- orchestration/graph.py:1251 (plain sync callable node)
- agent/_tool_executor.py:321 (sync confirm_action)

Fix: added `run_in_executor_copyctx(loop, executor, fn)` helper in
_async_utils.py that captures `contextvars.copy_context()` before
dispatch and runs the callable via `Context.run()`. All 5 sites now use
the helper with a zero-arg closure; positional args are bound by the
closure to avoid *args double-wrapping.

First cross-round compound validation: this pattern was first surfaced by
CrewAI round-2 research and parked as "needs verification"; Haystack
round-3 research then grep-confirmed 5 live sites in selectools.

Also includes BUG-28 fixup: added class-level `_model_family: str | None =
None` default on AzureOpenAIProvider so tests that bypass __init__ via
__new__ still find the attribute.

3 regression tests: contextvar propagation, executor forwarding, and a
grep assertion that all 5 sites import the helper and have no raw
`loop.run_in_executor(` calls remaining. Full non-E2E suite: 5051 passed,
0 failed.
…pped

Source: Pydantic AI PRs #4609, #4588, #4459, #4656, #4480, #4484.
Providers caught `json.JSONDecodeError` and returned `{}` → the tool then
failed with "Missing required parameter", so the LLM learned only that
it forgot a parameter — NOT that its JSON was malformed. The same LLM
would reproduce the same malformed JSON next iteration, wasting retries.

Seven grep-verified sites: 5 in _openai_compat.py + 2 in anthropic_provider.py
+ the Ollama hook override.

Fix:

1. New shared helper `_parse_tool_args(raw: Optional[str]) -> Tuple[Dict,
   Optional[str]]` in providers/_openai_compat.py. Returns `(params, None)`
   on success, `({}, error_preview)` on JSONDecodeError or non-object JSON.
   Error preview is truncated to 200 chars and includes the parser error.

2. `_parse_tool_call_arguments()` template-method contract changed from
   `-> dict` to `-> Tuple[Dict[str, Any], Optional[str]]`. Ollama override
   updated in parallel; when arguments are already a dict it returns
   `(dict, None)`, otherwise delegates to the shared helper.

3. New field on ToolCall dataclass: `parse_error: Optional[str] = None`.
   All 7 sites construct ToolCall with this populated on malformed input.

4. Both `_execute_single_tool` (sync) and `_aexecute_single_tool` (async)
   in agent/_tool_executor.py check `tool_call.parse_error` BEFORE the
   tool lookup. When set, emit a clear retry message:
   "Tool call for 'X' had malformed arguments: <preview>. Retry with
   properly escaped JSON." — then return False without executing.

8 regression tests cover: valid JSON, empty string, None input, malformed
JSON error preview, non-object JSON rejection, ToolCall.parse_error field,
tool executor source-grep checks for parse_error handling, and a grep
assertion that the silent-drop pattern is gone from _openai_compat.py and
anthropic_provider.py.

Also updated 4 provider-coverage tests that asserted the old
`result == {}` contract to unpack the new (params, parse_error) tuple.

Full non-E2E suite: 5059 passed, 0 failed.
…ption

Source: Pydantic AI PRs #4476, #4205. `async for item in gen:` without
wrapping in a context manager leaks the provider's async generator when
the loop body raises. `gen.__aexit__` runs under GC instead of
deterministically, producing `RuntimeError: async generator raised
StopAsyncIteration` on client disconnect and orphaned HTTP connections.

Zero uses of `contextlib.aclosing` existed in selectools before this fix.

Two sites:
- `agent/core.py:1316` — arun streaming path (user-facing astream)
- `agent/_provider_caller.py:505` — _astreaming_call helper

`contextlib.aclosing` was added in Python 3.10 and selectools supports
3.9+, so we ship a drop-in `aclosing` class in `_async_utils.py` that
works on any Python version. Both sites now wrap the
`provider.astream(...)` call in `async with aclosing(...) as gen:` so a
guardrail failure, structured-validation ValueError, or caller
disconnect deterministically runs the provider generator's finally block.

2 regression tests: a source-grep assertion that both sites use aclosing,
and a functional test that verifies aclosing closes an async generator
when the consumer raises mid-iteration.
Source: Pydantic AI PRs #4956, #4940, #4692. Selectools shared ONE global
max_iterations counter between tool-execution iterations AND structured-
validation retries. An agent with max_iterations=3 and an LLM that failed
structured validation 3 times in a row would terminate before reaching
the RetryConfig.max_retries=5 ceiling. Retry config was effectively
unused for structured retries.

Fix: decouple the two budgets via a new `_RunContext.structured_retries`
counter (default 0).

1. Structured-retry branches (3 sites: run/arun/astream) now check
   `ctx.structured_retries < self.config.retry.max_retries` instead of
   `ctx.iteration < self.config.max_iterations`. Each retry increments
   `ctx.structured_retries`.

2. Outer loops in all 3 paths now use
   `while ctx.iteration < self.config.max_iterations + ctx.structured_retries`
   so structured retries extend the tool-iteration budget by 1 each,
   preserving the "max_iterations caps tool iterations" semantic without
   letting structured retries eat into it.

Effect: `max_iterations=3, retry.max_retries=5` now allows up to 3 tool
iterations plus up to 5 structured-validation retries, as users expect.

3 regression tests: dataclass field check, source-grep check that all 3
branches use the new counter, and an integration test with a stub
provider that returns invalid JSON 4 times and valid JSON on attempt 5,
asserting provider.call_count == 5 under max_iterations=3 / max_retries=5.

Full non-E2E suite: 5064 passed, 0 failed.
…G-34)

Documents the 8 round-3 fixes shipped on this branch (LiteLLM + Pydantic
AI + Haystack cross-references), updates test count to 5064 (+104 new
regression tests), and records the cross-round compound validation
(CrewAI round-2 contextvars candidate → Haystack round-3 grep-confirmed
5 live sites — first time two independent competitor sources have
converged on a single selectools bug class).

Round 3 methodology highlight: baking the "grep selectools source to
confirm live" directive into every research prompt was the single
highest-leverage improvement across three rounds. Pydantic AI yielded
4 of 5 confirmed-live — highest ratio of any research agent — because
ethos match beats star count.
…amples

Cookbook (docs/COOKBOOK.md) — 23 new recipes covering:

Round-2/3 features:
- Typed Tool Parameters (list[str], dict[str,str]) — BUG-29
- Azure OpenAI with model_family — BUG-28
- FallbackProvider with Extended Retries — BUG-27
- Structured Output with Separate Retry Budget — BUG-34
- Safe Parallel Fan-Out — BUG-30
- Multi-Tenant RAG with Permission Filters — BUG-25
- Citation-Preserving Search Dedup — BUG-24
- Reranking with Top-K Control — BUG-23
- Streaming with Safe Cleanup — BUG-33
- OTel-Correct Async (ContextVars) — BUG-32
- Malformed JSON Recovery — BUG-31
- Running Agents in Jupyter/FastAPI — BUG-03
- Session Namespace Isolation — BUG-14

General patterns:
- Hybrid Search (BM25 + Vector)
- Knowledge Graph Agent
- Conversation Branching for A/B Testing
- Cost-Optimized Provider Routing
- Supervisor with Model Split
- MCP Tool Server
- Agent Evaluation in CI
- Error Recovery with Circuit Breaker
- Guardrails Pipeline
- Entity Memory Agent
- Batch Processing with Progress
- Dynamic Tool Registration
- Multi-Hop RAG with Query Expansion
- Prompt Compression for Long Conversations
- Reasoning Strategies

6 new examples (89–94):
- 89_typed_tool_parameters.py — list[str]/dict[str,str]/list[int] schemas
- 90_fallback_extended_retries.py — demonstrates all retriable error codes
- 91_structured_retry_budget.py — max_iterations vs RetryConfig.max_retries
- 92_safe_parallel_pipeline.py — deepcopy-isolated parallel branches
- 93_multi_tenant_rag.py — filter validation + citation-preserving dedup
- 94_azure_model_family.py — Azure deployment name + model_family hint

All 6 new examples verified running with PYTHONPATH=src.
Full test suite: 5064 passed, 0 failed.
Stale counts fixed:
- docs/ARCHITECTURE.md: version 0.20.1 → 0.22.0, tools 24 → 33
- docs/llms.txt: v0.21.0/5203 → v0.22.0/5064, examples 88 → 94
- docs/llms-full.txt: examples 61 → 94
- docs/QUICKSTART.md: examples 61 → 94
- docs/CONTRIBUTING.md: examples 88 → 94
- README.md: examples badge 88 → 94
- landing/index.html: version 0.21.0 → 0.22.0, tests 5203 → 5064

Module documentation for v0.22.0 features:
- docs/modules/TOOLS.md: added "Typed Collection Parameters" section (BUG-29)
  covering list[str], dict[str,str], element_type field on ToolParameter
- docs/modules/VECTOR_STORES.md: added "Citation-Preserving Dedup" (BUG-24)
  and "Metadata Filter Validation" (BUG-25) sections
- docs/modules/PROVIDERS.md: added "Model Family Override" for Azure (BUG-28)
  and updated "Failure Conditions" with extended retry codes (BUG-27)
- docs/modules/AGENT.md: added "Structured Retry Budget" section (BUG-34)
- docs/modules/PIPELINE.md: added "Branch Isolation" note for parallel (BUG-30)

CLAUDE.md: added pitfalls 27-30 covering aclosing, contextvars
propagation, malformed JSON recovery, and structured retry budget.

13 files touched. Audit-driven update, no code changes.
@johnnichev johnnichev merged commit 2238a2f into main Apr 13, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant