Improvising Chatbot's Semantic Accuracy #239
6vam4arya wants to merge 1 commit into AOSSIE-Org:main from
Conversation
📝 Walkthrough

The chatbot's input processing pipeline was upgraded from a basic bag-of-words approach to semantic embeddings using SentenceTransformer. Lemmatization via NLTK's WordNetLemmatizer was introduced for text preprocessing. The chat route now encodes user messages into fixed 384-dimensional vectors, replacing tokenization and bag-of-words logic.
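To make the change concrete, here is a minimal sketch of the difference between the two approaches. The bag-of-words helper below is illustrative (the PR's actual helper names come from its deleted code); the model name `all-MiniLM-L6-v2` is taken from the PR, but the `sentence-transformers` call is shown only in comments since it requires downloading the model.

```python
# Minimal bag-of-words encoder like the one the PR replaces (illustrative),
# contrasted with the semantic-embedding call the new pipeline uses.

def bag_of_words(tokens, vocabulary):
    """1.0 where a vocabulary word appears in the tokenized sentence, else 0.0."""
    token_set = set(tokens)
    return [1.0 if word in token_set else 0.0 for word in vocabulary]

# Old approach: vector length = vocabulary size, no notion of synonyms.
vocab = ["hello", "vote", "election", "result"]
print(bag_of_words(["how", "do", "i", "vote"], vocab))  # [0.0, 1.0, 0.0, 0.0]

# New approach (requires the sentence-transformers package; not run here):
#   from sentence_transformers import SentenceTransformer
#   embedder = SentenceTransformer('all-MiniLM-L6-v2')
#   vector = embedder.encode("how do i vote")  # shape (384,), captures meaning
```

The key difference: "cast my ballot" and "vote" share no bag-of-words features but land close together in embedding space.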
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (1)
chatbot/app.py (1)
10-19: Remove legacy commented-out BoW code. Keeping the old pipeline as commented code increases noise and makes the runtime path harder to audit. Prefer deleting it and relying on VCS history.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chatbot/app.py` around lines 10 - 19, Remove the legacy commented-out bag-of-words helper functions to reduce noise: delete the commented definitions for tokenize, stem, and bag_of_words from chatbot/app.py so only active code remains (rely on VCS for history); ensure no other code references these function names so you don't break imports or runtime paths, and run tests to confirm nothing depends on tokenize, stem, or bag_of_words before committing.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
client/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (1)
chatbot/app.py
```python
# Downloading resources and init Lemmatizer
nltk.download('punkt')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Loading the Embedding Model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
```
🧩 Analysis chain — scripts executed (Repository: AOSSIE-Org/Agora-Blockchain):

```shell
git ls-files | grep -E "(chatbot|app\.py)" | head -20
find . -name "*.py" -type f 2>/dev/null | head -20
cat -n chatbot/app.py | sed -n '20,40p'
```
Move network-dependent initialization out of module-level code.
Lines 26–31 execute network operations (nltk.download(), SentenceTransformer model loading) at import time, causing app startup to hang, fail in offline/containerized environments, and block deployment pipelines. Move this to an explicit initialization function called during startup, and provide clear error messages if resources fail to load.
Suggested approach: Initialize lemmatizer and embedder as None at module level, create an init_nlp_resources() function to perform the downloads/loading, and ensure it's called during app startup (e.g., before handling requests).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chatbot/app.py` around lines 25 - 32, Module-level network calls
(nltk.download and SentenceTransformer model load) must be moved into an
explicit startup initializer: set lemmatizer and embedder to None at module
scope, add an init_nlp_resources() function that calls nltk.download('punkt'),
nltk.download('wordnet'), instantiates WordNetLemmatizer into lemmatizer and
loads SentenceTransformer('all-MiniLM-L6-v2') into embedder, and raise/log clear
errors if any step fails; ensure init_nlp_resources() is invoked during
application startup (before handling requests) rather than at import time so
functions/classes that reference lemmatizer/embedder will see initialized
objects.
```python
def clean_and_lemmatize(sentence):
    words = nltk.word_tokenize(sentence.lower())
    return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]
```
Filter empty tokens after regex cleanup.
Line 35 checks emptiness before punctuation removal, so punctuation-only tokens can become empty strings and still be lemmatized/returned.
Proposed fix

```diff
 def clean_and_lemmatize(sentence):
     words = nltk.word_tokenize(sentence.lower())
-    return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]
+    cleaned = []
+    for w in words:
+        token = re.sub(r'[^\w\s]', '', w).strip()
+        if token:
+            cleaned.append(lemmatizer.lemmatize(token))
+    return cleaned
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chatbot/app.py` around lines 33 - 35, clean_and_lemmatize currently filters
tokens with if w.strip() before removing punctuation, so punctuation-only tokens
become empty after re.sub and still get lemmatized; fix by cleaning each token
first then filtering and lemmatizing: in clean_and_lemmatize (which uses
nltk.word_tokenize and lemmatizer) call re.sub(r'[^\w\s]', '', w) to produce
cleaned, then if cleaned (non-empty) pass cleaned to lemmatizer and return that;
ensure tokenization, cleaning, filtering, and lemmatization order is token ->
cleaned = re.sub(...) -> if cleaned -> lemmatizer.lemmatize(cleaned).
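The ordering fix can be demonstrated in isolation. In this sketch, NLTK's tokenizer and lemmatizer are stubbed with plain Python (`str.split` and an identity function) so the clean-then-filter logic is visible and runnable without downloading NLTK data; the real function would pass `nltk.word_tokenize` and `lemmatizer.lemmatize`.

```python
# Token order fix: clean each token first, then filter empties, then lemmatize.
# nltk tokenization/lemmatization are stubbed here (assumption for the demo).
import re

def clean_tokens(sentence, lemmatize=lambda w: w, tokenize=str.split):
    cleaned = []
    for w in tokenize(sentence.lower()):
        token = re.sub(r'[^\w\s]', '', w).strip()
        if token:  # punctuation-only tokens become "" here and are dropped
            cleaned.append(lemmatize(token))
    return cleaned

print(clean_tokens("Hello , world !!!"))  # ['hello', 'world'] — no empty strings
```

With the original ordering, the tokens `,` and `!!!` would pass the `w.strip()` check and reach the lemmatizer as empty strings after the regex.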
```python
# size is fixed
input_size = 384
tags = data['tags']
```
🧩 Analysis chain — scripts executed (Repository: AOSSIE-Org/Agora-Blockchain):

```shell
# Locate and examine the chatbot/app.py file
find . -name "app.py" -path "*/chatbot/*" -type f
# Check the repository structure
git ls-files | grep -E "chatbot|checkpoint" | head -20
cat -n ./chatbot/app.py | head -80
cat -n ./chatbot/train.py
```
Read input_size from the checkpoint instead of hardcoding.
Line 63 hardcodes input_size = 384, but the checkpoint at line 60 already contains this value (saved by train.py line 146). If the model is retrained with a different input dimension, this hardcoded value will cause the model to be initialized incorrectly and fail during inference.
Use input_size = data.get("input_size", 384) to read the saved value, or simply data["input_size"] since it's always present. Additionally, validate the loaded dimension matches the embedder's output:
Proposed guard

```diff
-# size is fixed
-input_size = 384
+input_size = data.get("input_size", 384)
+expected_dim = embedder.get_sentence_embedding_dimension()
+if input_size != expected_dim:
+    raise ValueError(f"Checkpoint input_size={input_size} does not match embedder dim={expected_dim}")
 tags = data['tags']
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chatbot/app.py` around lines 62 - 65, Replace the hardcoded input_size = 384
with reading the saved value from the checkpoint (use data["input_size"] or
data.get("input_size", 384)) and ensure you validate that this loaded input_size
matches the embedder's output dimension before constructing the model; locate
the variables data and input_size in chatbot/app.py and compare the loaded
input_size against the embedder's output/embedding dimension (e.g.,
embedder.output_dim or embedder.embedding_dim) and raise/log a clear error if
they differ.
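The guard can be factored into a small pure function, shown below as a sketch. The function name `resolve_input_size` is hypothetical, and the embedder dimension is passed in as a plain integer so the validation logic can be shown without loading a real model; in `app.py` it would come from `embedder.get_sentence_embedding_dimension()`.

```python
# Sketch: read input_size from the loaded checkpoint dict and validate it
# against the embedder's output dimension (name and factoring are assumptions).

def resolve_input_size(data, embedder_dim):
    """Return the checkpoint's input size, failing fast on a mismatch."""
    input_size = data.get("input_size", 384)  # 384 = all-MiniLM-L6-v2 default
    if input_size != embedder_dim:
        raise ValueError(
            f"Checkpoint input_size={input_size} does not match "
            f"embedder dimension={embedder_dim}"
        )
    return input_size

# In app.py this might be called as:
#   input_size = resolve_input_size(data, embedder.get_sentence_embedding_dimension())
```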
```python
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)
```
Add a request-size guard before transformer encoding.
User input is unbounded before Line 92. Very large payloads can create avoidable latency/memory pressure per request.
Proposed guardrail

```diff
+MAX_MESSAGE_CHARS = 2000
+
 def chat():
     try:
         request_data = request.get_json()
         user_message = request_data.get('message', '')
+        if len(user_message) > MAX_MESSAGE_CHARS:
+            return jsonify({"message": "Message too long"}), 413
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chatbot/app.py` around lines 91 - 93, The code calls
embedder.encode(user_message) with unbounded user_message; add a request-size
guard before encoding to prevent huge inputs (e.g., check length/byte-size of
user_message and reject or truncate beyond a safe max, such as MAX_INPUT_CHARS)
and return an appropriate error/400 response. Implement this check in the same
scope where embedder.encode and X are used (referencing user_message,
embedder.encode, and X) so you either truncate user_message before calling
embedder.encode or short-circuit with a clear error when the input exceeds the
configured limit.
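The guard is easiest to test as a pure helper, sketched below. The 2000-character limit and the `(body, status)` return shape are illustrative assumptions; the Flask wiring is only indicated in comments.

```python
# Pure-function version of the request-size guard (sketch; limit and return
# shape are assumptions, Flask wiring shown only in comments).
MAX_MESSAGE_CHARS = 2000  # assumed safe upper bound

def check_message_size(user_message, limit=MAX_MESSAGE_CHARS):
    """Return (ok, error); error is a (body, status) pair when too long."""
    if len(user_message) > limit:
        return False, ({"message": "Message too long"}, 413)
    return True, None

# In the Flask route:
#   ok, error = check_message_size(user_message)
#   if not ok:
#       body, status = error
#       return jsonify(body), status

print(check_message_size("hi"))  # (True, None)
```

Keeping the check before `embedder.encode` means oversized payloads are rejected with a cheap length comparison instead of consuming transformer inference time.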
```python
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)
```
🧩 Analysis chain — scripts executed (Repository: AOSSIE-Org/Agora-Blockchain):

```shell
# Examine the structure around the embedding code
find . -type f -name "*.py" | head -20
# Look for the clean_and_lemmatize function
rg "clean_and_lemmatize" -B 3 -A 10
cd chatbot && wc -l app.py
# Read the relevant section of chatbot/app.py
cat -n chatbot/app.py | sed -n '80,110p'
# Check train.py for preprocessing usage
cat -n chatbot/train.py | head -100
# Find all uses of clean_and_lemmatize in the codebase
rg "clean_and_lemmatize" -B 2 -A 5
cat -n chatbot/app.py
```
Apply text preprocessing before embedding to maintain consistency with the clean_and_lemmatize pipeline.
The clean_and_lemmatize function is defined at line 33-35 but is never called in the inference path. Line 92 embeds raw user_message directly, bypassing the tokenization, lowercasing, and lemmatization preprocessing. This creates an inconsistency since the function was added to the codebase but is unused during inference.
Proposed fix

```diff
-# Getting semantic embedding (384-dimensional vector)
-embedding = embedder.encode(user_message)
+# Normalize text before embedding
+normalized = " ".join(clean_and_lemmatize(user_message)).strip()
+text_for_embedding = normalized if normalized else user_message
+embedding = embedder.encode(text_for_embedding)
 X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
# Normalize text before embedding
normalized = " ".join(clean_and_lemmatize(user_message)).strip()
text_for_embedding = normalized if normalized else user_message
embedding = embedder.encode(text_for_embedding)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chatbot/app.py` around lines 91 - 94, Inference currently embeds raw
user_message via embedder.encode, bypassing the existing clean_and_lemmatize
pipeline; call clean_and_lemmatize(user_message) and use its return value when
computing embedding (i.e., replace embedder.encode(user_message) with
embedder.encode(cleaned_message)) before converting to tensor X and moving to
device, ensuring cleaned_message is a str and any None/empty result is handled
consistently with downstream logic.
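The normalize-then-embed flow, including the empty-result fallback, can be sketched as follows. The encoder is passed in as a parameter and `clean_and_lemmatize` is a simplified stand-in (plain `split` instead of NLTK tokenization, no lemmatizer), so the control flow is runnable without a model; the real code would use the app's own `clean_and_lemmatize` and `embedder.encode`.

```python
# Sketch: preprocess before embedding, falling back to the raw message when
# normalization strips everything (e.g. punctuation-only input). The helpers
# here are simplified stand-ins for the app's real functions.
import re

def clean_and_lemmatize(sentence):
    # Simplified: real code uses nltk.word_tokenize + WordNet lemmatization.
    tokens = (re.sub(r'[^\w\s]', '', w) for w in sentence.lower().split())
    return [t for t in tokens if t]

def embed_message(user_message, encode):
    """Normalize, then encode; keep the raw text if normalization empties it."""
    normalized = " ".join(clean_and_lemmatize(user_message)).strip()
    text_for_embedding = normalized if normalized else user_message
    return encode(text_for_embedding)

# With an identity "encoder" we can see exactly what reaches the model:
print(embed_message("How do I VOTE?", encode=lambda t: t))  # 'how do i vote'
print(embed_message("!!!", encode=lambda t: t))             # '!!!' (fallback)
```

Note the fallback matters: without it, a punctuation-only message would be encoded as an empty string, giving a meaningless embedding.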
Description
Dependencies
To run the upgraded chatbot, install the dependencies with the following command in the terminal:

```shell
pip install -U sentence-transformers nltk torch
```
Fixes #238