DuckDuckGo fact-checking + response caching layer #135

base: main

Conversation
- Replace Google Custom Search with DuckDuckGo for fact-checking
- Add in-memory response caching with TTL (24h default)
- Implement parallel sentiment/fact-check analysis
- Convert LangGraph pipeline to async (ainvoke)
- Add cache hit auto-save to local JSON files
- Fix compatibility for new fact format in generate_perspective and chunk_rag_data
- Fix scraper headers bug in extractor.py
Warning: Rate limit exceeded

⌛ How to resolve: after the wait time has elapsed, a review can be triggered again. We recommend spacing out your commits to avoid hitting the rate limit.

🚦 How rate limits work: CodeRabbit enforces hourly rate limits for each developer per organization. Paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, further reviews are re-allowed after a brief timeout. Please see our FAQ for further information.

📝 Walkthrough

Introduces an in-memory TTL LRU cache and route cache endpoints, adds a DuckDuckGo-backed async fact-checking pipeline, refactors LangGraph to run sentiment and fact-check analysis in parallel, updates multiple LLM model identifiers, and makes the LangGraph invocation asynchronous.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Routes as routes.py
participant Cache as URLCache
participant Pipeline as pipeline.py
participant LangGraph as LangGraph
participant Sentiment as sentiment.py
participant FactCheck as fact_check_tool.py
participant LLM as Groq LLM
Client->>Routes: POST /bias or /process (state)
Routes->>Cache: get(endpoint, url)
alt Cache Hit
Cache-->>Routes: cached response + _cache metadata
Routes-->>Client: 200 (cached)
else Cache Miss
Routes->>Pipeline: await run_langgraph_workflow(state)
Pipeline->>LangGraph: ainvoke(state)
par Parallel analysis
LangGraph->>Sentiment: run_parallel_analysis(state)
rect rgba(100,150,255,0.5)
Sentiment->>LLM: sentiment analysis request
LLM-->>Sentiment: sentiment result
end
rect rgba(100,200,100,0.5)
LangGraph->>FactCheck: run fact-check pipeline
FactCheck->>FactCheck: extract_claims -> plan_searches
FactCheck->>FactCheck: execute DuckDuckGo searches (parallel)
FactCheck->>LLM: verify facts
LLM-->>FactCheck: verification results
FactCheck-->>LangGraph: facts + search_results
end
end
LangGraph-->>Pipeline: merged results
Pipeline-->>Routes: workflow result
Routes->>Cache: set(endpoint, url, response)
Routes-->>Client: 200 (fresh)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (2 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/modules/facts_check/llm_processing.py (2)
143-148: ⚠️ Potential issue | 🔴 Critical

Bug: `parsed` is undefined or stale when JSON parsing fails.

If `json.loads` fails, the except block only logs and doesn't assign a fallback to `parsed`. On the first iteration this causes an `UnboundLocalError`; on subsequent iterations it silently appends the previous iteration's result again.

🐛 Proposed fix

```diff
 try:
     parsed = json.loads(content)
 except Exception as parse_err:
     logger.error(f"LLM JSON parse error: {parse_err}")
+    parsed = {
+        "verdict": "Unknown",
+        "explanation": f"Failed to parse LLM response: {parse_err}",
+        "original_claim": claim,
+        "source_link": source,
+    }
 results_list.append(parsed)
```
150-154: ⚠️ Potential issue | 🟠 Major

Bug: `claim` on Line 151 only reflects the last iteration's value.

After the `for` loop, `claim` holds the value from the final iteration. If `search_results` contains multiple claims, the returned dictionary misrepresents which claim was processed. Additionally, if `search_results` is empty, `claim` is undefined and this will raise a `NameError`.

🐛 Proposed fix — return all claims or remove the misleading key

```diff
 return {
-    "claim": claim,
     "verifications": results_list,
     "status": "success",
 }
```
🤖 Fix all issues with AI agents
In `@backend/app/cache/cache.py`:
- Around line 139-160: The set() method in cache.py is storing the caller's
value by reference which allows external mutation to corrupt cached entries;
update set(endpoint, url, value) to store a deep copy of value (e.g., via
copy.deepcopy) when creating the CacheEntry so the cache is isolated from caller
mutations; locate the set method and change the line that assigns
self._cache[key] = CacheEntry(value, self._ttl_seconds) to use a deep-copied
value, keeping the rest of the eviction and key-generation logic (see
_generate_key, CacheEntry, and get for context).
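A minimal sketch of that isolation fix, assuming the `URLCache`/`CacheEntry` shape described above (`_lock`, `_generate_key`, and eviction come from the existing module); the routes.py item below asks for the same change on the read path, so `get` is included here too:

```python
import copy

class URLCache:
    ...  # existing __init__, _generate_key, and eviction logic unchanged

    def set(self, endpoint: str, url: str, value) -> None:
        key = self._generate_key(endpoint, url)
        with self._lock:
            # Deep-copy so later caller mutations (e.g. attaching "_cache"
            # metadata to the response dict) can't corrupt the stored entry.
            self._cache[key] = CacheEntry(copy.deepcopy(value), self._ttl_seconds)

    def get(self, endpoint: str, url: str):
        key = self._generate_key(endpoint, url)
        with self._lock:
            entry = self._cache.get(key)
            if entry is None or entry.is_expired():
                return None
            # Hand back a copy as well, so reads are isolated too.
            return copy.deepcopy(entry.value)
```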
In `@backend/app/modules/fact_check_tool.py`:
- Around line 140-147: The loop that processes LLM results in fact_check_tool.py
assumes item["claim_id"] is an int and uses c_id < len(claims), which can raise
TypeError if the LLM returns a string; update the code around where c_id is set
(the loop building context) to coerce c_id = int(item.get("claim_id")) inside a
small try/except (ValueError/TypeError), skip the item if conversion fails, and
only perform the bounds check against len(claims) after successful conversion;
also preserve the existing evidence handling (item.get('result')[:300] or 'No
evidence') so invalid claim_id values do not cause the entire verification to
silently fail.
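A sketch of the suggested coercion; `claims` is named in the review, while `llm_items` and `context_parts` are placeholder names for the loop's input and output:

```python
context_parts = []
for item in llm_items:  # items parsed from the LLM response (assumed name)
    try:
        c_id = int(item.get("claim_id"))
    except (TypeError, ValueError):
        # Skip items whose claim_id is missing or not coercible to int,
        # instead of letting one bad item fail the whole verification.
        continue
    if 0 <= c_id < len(claims):
        # Evidence handling as in the review, guarded so a missing
        # result doesn't raise before the slice is applied.
        evidence = (item.get("result") or "No evidence")[:300]
        context_parts.append(f"Claim {c_id}: {claims[c_id]} | Evidence: {evidence}")
```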
In `@backend/app/routes/routes.py`:
- Around line 114-121: The cached dict is mutated after storing
(bias_score["_cache"] = ...) which corrupts the cached entry; fix by making
cache.set in cache.py store a deep copy of the value (and likewise return a deep
copy from cache.get if it returns references) so callers like the routes
functions (where bias_score is passed to cache.set) cannot mutate the stored
object; update cache.set to import copy and call copy.deepcopy(value) before
persisting and ensure cache.get returns a deepcopy as well.
- Around line 71-99: Add bounded retention to cache files by introducing a
MAX_CACHE_FILES constant and pruning oldest files before writing in
_save_cache_response; compute url_hash/filename as now, then list
CACHE_RESPONSES_DIR/*.json (use file mtime) and if count >= MAX_CACHE_FILES
delete the oldest files until under the limit (perform file operations in
asyncio.to_thread to avoid blocking), then write the new JSON as before; keep
json_serializer and logging unchanged and ensure deletions are logged and
exceptions handled.
- Around line 159-190: Protect the destructive cache endpoints by adding an auth
check: modify cache_clear and cache_delete to require an admin/API key or
internal-only access before performing operations; e.g., add a FastAPI
dependency or explicit header check (validate an "X-API-Key" or bearer token)
and return 401/403 if invalid, or alternatively verify request.client.host is
localhost if you opt for internal-only access. Update router signatures for
cache_clear and cache_delete to accept the auth dependency (or request for host
check), perform the credential validation early, and only call cache.clear() or
cache.delete() when the check passes; log unauthorized attempts in the same
functions (cache_clear, cache_delete) and keep cache_stats read-only if you want
it public.
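One way to express that check is a FastAPI dependency. The header name and env var follow the review's example, but the route path, HTTP method, and `cache` handle are illustrative, not the PR's actual code:

```python
import os

from fastapi import APIRouter, Depends, Header, HTTPException

router = APIRouter()
ADMIN_API_KEY = os.getenv("ADMIN_API_KEY", "")

def require_admin_key(x_api_key: str = Header(default="")) -> None:
    # Secure by default: reject when no key is configured or on mismatch.
    if not ADMIN_API_KEY or x_api_key != ADMIN_API_KEY:
        raise HTTPException(status_code=403, detail="Forbidden")

@router.post("/cache/clear", dependencies=[Depends(require_admin_key)])
async def cache_clear():
    cache.clear()  # `cache` as used elsewhere in routes.py
    return {"status": "cleared"}
```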
🧹 Nitpick comments (9)
backend/app/modules/pipeline.py (1)
41-41: Unused `asyncio` import.

`asyncio` is imported but not used anywhere in this file — `ainvoke` is a method on the LangGraph workflow object.

Proposed fix

```diff
-import asyncio
```

backend/app/modules/langgraph_nodes/generate_perspective.py (1)
59-71: Flexible fact parsing looks good, but note inconsistency with `chunk_rag_data.py`.

This file uses `f.get('claim', f.get('original_claim', 'Unknown'))` (key-existence fallback), while `chunk_rag_data.py` uses `fact.get("claim") or fact.get("original_claim") or "Unknown claim"` (truthiness-based fallback). The two approaches behave differently when a key exists but has an empty/falsy value — `get` with a default returns `""`, while the `or` chain skips to the next fallback.

Consider aligning both files on the same strategy for consistency.
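The difference in one tiny example (illustrative values only):

```python
f = {"claim": "", "original_claim": "Sea levels rose 8 inches since 1900"}

# Key-existence fallback: "claim" exists, so its empty value is returned.
f.get("claim", f.get("original_claim", "Unknown"))            # -> ""

# Truthiness fallback: "" is falsy, so the chain falls through.
f.get("claim") or f.get("original_claim") or "Unknown claim"  # -> "Sea levels rose 8 inches since 1900"
```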
backend/app/modules/langgraph_builder.py (1)
47-57: Minor type annotation inconsistency.

`facts` uses the built-in `list[dict]` (Line 49), while the new fields use `typing.List[str]`/`List[Any]` (Lines 55–57). Consider using one style consistently — preferably the built-in generics (`list[str]`, `list[Any]`) since this project targets Python 3.9+.

♻️ Suggested fix

```diff
-from typing import List, Any
+from typing import Any
 from langgraph.graph import StateGraph
 from typing_extensions import TypedDict
 ...
 class MyState(TypedDict):
     cleaned_text: str
     facts: list[dict]
     sentiment: str
     perspective: str
     score: int
     retries: int
     status: str
-    claims: List[str]
-    search_queries: List[Any]
-    search_results: List[Any]
+    claims: list[str]
+    search_queries: list[Any]
+    search_results: list[Any]
```

backend/app/cache/cache.py (3)
153-157: Unnecessary eviction when updating an existing key.

If the key already exists in the cache (e.g., re-caching the same URL), the `while` loop will still evict the oldest entry even though the total count won't increase. Check for the existing key first.

♻️ Suggested fix

```diff
 with self._lock:
-    # Evict oldest entries if at max size
-    while len(self._cache) >= self._max_size:
-        evicted_key, _ = self._cache.popitem(last=False)
-        logger.debug(f"Evicted cache entry: {evicted_key}")
+    # Evict oldest entries if at max size (only if this is a new key)
+    if key not in self._cache:
+        while len(self._cache) >= self._max_size:
+            evicted_key, _ = self._cache.popitem(last=False)
+            logger.debug(f"Evicted cache entry: {evicted_key}")
     self._cache[key] = CacheEntry(value, self._ttl_seconds)
```
216-225: Singleton `get_cache()` is not thread-safe.

Two threads calling `get_cache()` concurrently before any import of `__init__.py` could each create a separate instance. In practice this is likely fine since the module-level `cache = get_cache()` in `__init__.py` runs at import time (single-threaded), but adding a lock would make this robust.

♻️ Suggested fix

```diff
 _cache_instance: Optional[URLCache] = None
+_cache_lock = threading.Lock()

 def get_cache() -> URLCache:
     """Get or create the singleton cache instance."""
     global _cache_instance
-    if _cache_instance is None:
-        _cache_instance = URLCache()
+    if _cache_instance is None:
+        with _cache_lock:
+            if _cache_instance is None:
+                _cache_instance = URLCache()
     return _cache_instance
```
46-56: Replace deprecated `datetime.utcnow()` with `datetime.now(timezone.utc)` to support Python 3.12+.

`datetime.utcnow()` is deprecated since Python 3.12 because it returns a naive datetime object, which can lead to timezone-related bugs. Use `datetime.now(timezone.utc)` instead to get a timezone-aware UTC datetime.

♻️ Suggested fix

```diff
-from datetime import datetime, timedelta
+from datetime import datetime, timedelta, timezone
 ...
 def __init__(self, value: Any, ttl_seconds: int):
     self.value = value
-    self.cached_at = datetime.utcnow()
+    self.cached_at = datetime.now(timezone.utc)
     self.expires_at = self.cached_at + timedelta(seconds=ttl_seconds)

 def is_expired(self) -> bool:
     """Check if this cache entry has expired."""
-    return datetime.utcnow() > self.expires_at
+    return datetime.now(timezone.utc) > self.expires_at
```

backend/app/modules/langgraph_nodes/sentiment.py (2)
77-89: If both branches fail, only the sentiment error is surfaced.

The sequential `if` checks mean that if both `sentiment_result` and `fact_check_result` report errors, only the sentiment error is returned. Consider logging the other failure or combining them.

♻️ Possible improvement

```diff
+errors = []
+if sentiment_result.get("status") == "error":
+    errors.append(("sentiment_analysis", sentiment_result.get("message", "Unknown error")))
+if fact_check_result.get("status") == "error":
+    errors.append(("fact_checking", fact_check_result.get("message", "Unknown error")))
+
+if errors:
+    return {
+        "status": "error",
+        "error_from": errors[0][0],
+        "message": "; ".join(f"[{src}] {msg}" for src, msg in errors)
+    }
```
62-68: Redundant exception object in `logging.exception` call.

`logger.exception()` already includes the traceback and exception info. Passing `e` in the f-string is redundant.

♻️ Suggested fix

```diff
-logger.exception(f"Error in fact_check_pipeline: {e}")
+logger.exception("Error in fact_check_pipeline")
```

backend/app/modules/fact_check_tool.py (1)
66-68: Use `logger.exception` instead of `logger.error` in exception handlers for traceback visibility.

All four nodes use `logger.error(f"...")` in their `except` blocks. `logger.exception()` automatically includes the traceback, which is much more useful for debugging production issues.

♻️ Example fix (apply to all four handlers)

```diff
 except Exception as e:
-    logger.error(f"Error extracting claims: {e}")
+    logger.exception("Error extracting claims")
     return {"claims": []}
```

Also applies to: 101-103, 122-124, 184-186
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@backend/app/modules/fact_check_tool.py`:
- Around line 106-129: In execute_searches_node (and its nested run_one_search),
guard against q.get("query") being None before calling search_tool.invoke: check
query_str = q.get("query") and if it's falsy/None, log a warning including the
claim id (c_id from q.get("claim_id")), return a predictable result object
(e.g., {"claim_id": c_id, "result": "No query provided"}) and skip calling
asyncio.to_thread(search_tool.invoke, ...); otherwise proceed with the existing
try/except flow so you never pass None into search_tool.invoke.
- Around line 26-27: The DuckDuckGo search calls lack asyncio-level timeouts and
can hang; wrap each invocation of run_one_search (the function used with
asyncio.to_thread and asyncio.gather) with asyncio.wait_for or pass a timeout to
asyncio.gather so each search has an explicit timeout (e.g.,
asyncio.wait_for(run_one_search(q), timeout=10)). Update the calls around
asyncio.to_thread(...) that invoke run_one_search and the asyncio.gather(...)
call that aggregates multiple run_one_search tasks to enforce a per-call timeout
(or a gather-level timeout) and handle asyncio.TimeoutError where appropriate to
avoid blocking the entire fact-check flow when DuckDuckGoSearchRun hangs.
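A combined sketch of both suggestions above — the empty-query guard and a per-search timeout. Names follow the review (`search_tool` is the `DuckDuckGoSearchRun` instance, `queries` the planned-search dicts, `logger` the module logger); the timeout value is an assumption:

```python
import asyncio

SEARCH_TIMEOUT_SECONDS = 10  # assumed value; tune as needed

async def run_one_search(q: dict) -> dict:
    c_id = q.get("claim_id")
    query_str = q.get("query")
    if not query_str:
        # Never pass None (or "") into search_tool.invoke; return a
        # predictable placeholder result instead.
        logger.warning(f"No query provided for claim {c_id}; skipping search")
        return {"claim_id": c_id, "result": "No query provided"}
    try:
        # wait_for cancels the awaiting task on timeout; note that the
        # underlying thread may still run to completion in the background.
        result = await asyncio.wait_for(
            asyncio.to_thread(search_tool.invoke, query_str),
            timeout=SEARCH_TIMEOUT_SECONDS,
        )
        return {"claim_id": c_id, "result": result}
    except asyncio.TimeoutError:
        logger.warning(f"Search timed out for claim {c_id}")
        return {"claim_id": c_id, "result": "Search timed out"}
    except Exception:
        logger.exception(f"Search failed for claim {c_id}")
        return {"claim_id": c_id, "result": "Search failed"}

# Inside execute_searches_node:
# results = await asyncio.gather(*(run_one_search(q) for q in queries))
```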
In `@backend/app/routes/routes.py`:
- Around line 198-222: In cache_delete, when the endpoint is not one of
["process","bias"] (see function cache_delete and parameter endpoint), replace
the current plain JSON return with raising an HTTPException(status_code=400,
detail="Invalid endpoint. Use 'process' or 'bias'") so the handler returns a
proper 400 Bad Request instead of an implicit 200; keep the error text
consistent with other HTTPException usage in the function.
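The shape of that change; the route path and signature here are assumptions for illustration:

```python
from fastapi import HTTPException

@router.delete("/cache/{endpoint}")
async def cache_delete(endpoint: str, url: str):
    if endpoint not in ["process", "bias"]:
        # 400 Bad Request instead of an implicit 200 with an error body.
        raise HTTPException(
            status_code=400,
            detail="Invalid endpoint. Use 'process' or 'bias'",
        )
    cache.delete(endpoint, url)
    return {"status": "deleted", "endpoint": endpoint, "url": url}
```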
🧹 Nitpick comments (2)
backend/app/routes/routes.py (2)
80-93: Blocking file I/O in async function.
`_cleanup_old_cache_files` calls `.glob()`, `.stat()`, and `.unlink()` synchronously. Since it's `async` and called from the request path (via `_save_cache_response`), this blocks the event loop. Wrap the heavy work in `asyncio.to_thread` like you already do for the file write in `_save_cache_response`.

♻️ Proposed fix

```diff
 async def _cleanup_old_cache_files() -> None:
     """Remove oldest cache files if exceeding max limit."""
     try:
-        files = list(CACHE_RESPONSES_DIR.glob("*.json"))
-        if len(files) > CACHE_RESPONSES_MAX_FILES:
-            # Sort by modification time (oldest first)
-            files.sort(key=lambda f: f.stat().st_mtime)
-            # Remove oldest files to stay under limit
-            files_to_remove = files[:len(files) - CACHE_RESPONSES_MAX_FILES]
-            for f in files_to_remove:
-                f.unlink()
-                logger.debug(f"Evicted old cache file: {f.name}")
+        def _do_cleanup():
+            files = list(CACHE_RESPONSES_DIR.glob("*.json"))
+            if len(files) > CACHE_RESPONSES_MAX_FILES:
+                files.sort(key=lambda f: f.stat().st_mtime)
+                for f in files[:len(files) - CACHE_RESPONSES_MAX_FILES]:
+                    f.unlink()
+                    logger.debug(f"Evicted old cache file: {f.name}")
+
+        await asyncio.to_thread(_do_cleanup)
     except Exception as e:
-        logger.error(f"Failed to cleanup cache files: {e}")
+        logger.exception("Failed to cleanup cache files")
```
188-195: Admin key protection looks correct, but an empty `ADMIN_API_KEY` permanently locks these endpoints.

When the `ADMIN_API_KEY` env var is unset, it defaults to `""` (Line 58), and `not ADMIN_API_KEY` is `True`, so Line 191 always raises 403. This is secure-by-default, which is good, but it means the cache management endpoints are unusable unless the env var is explicitly configured. Consider logging a warning at startup if `ADMIN_API_KEY` is not set, so operators know these endpoints are disabled.
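The suggested startup warning could be as simple as this (hypothetical placement, e.g. at module import in routes.py, reusing the existing `ADMIN_API_KEY` and `logger`):

```python
if not ADMIN_API_KEY:
    logger.warning(
        "ADMIN_API_KEY is not set; cache management endpoints "
        "(cache_clear, cache_delete) will reject all requests with 403."
    )
```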
Summary
This PR improves the Perspective API's performance and reliability through the changes below.
Key Changes
🦆 DuckDuckGo Fact-Checking (`app/modules/fact_check_tool.py`)
- Pipeline of nodes: `extract_claims` → `plan_searches` → `execute_searches` → `verify_facts`
- Uses `langchain_community.tools.DuckDuckGoSearchRun` for free web searches
- Searches run in parallel via `asyncio.gather`

🚀 Response Caching (`app/cache/cache.py`)
- In-memory `OrderedDict` cache with TTL (24 hours default)
- Keyed by `endpoint + url`
- Configurable via `CACHE_ENABLED`, `CACHE_TTL_SECONDS`, `CACHE_MAX_SIZE`

⚡ Async Parallel Pipeline
- `run_parallel_analysis()` runs sentiment and fact-checking concurrently
- LangGraph is invoked with `ainvoke()` for non-blocking execution

📁 Cache Hit Auto-Export
- Cached responses are saved as JSON files under `backend/cache_responses/`

🐛 Bug Fixes
- Fixed the `headers` argument bug in `extractor.py` (line 44)
- Updated `generate_perspective.py` and `chunk_rag_data.py` to handle the new fact format

Performance Results
Testing
Related PRs
Summary by CodeRabbit
New Features
Improvements
Bug Fixes