
Improvising Chatbot's Semantic Accuracy#239

Open
6vam4arya wants to merge 1 commit into AOSSIE-Org:main from 6vam4arya:chatbot

Conversation


6vam4arya commented Mar 3, 2026

Description

  • Upgraded the chatbot from a keyword-based Bag of Words model to a semantic-aware system using Transformer-based embeddings (a minimal sketch of the new input path follows below).
  • Standardized input representation to a fixed 384-dimensional embedding, improving scalability and maintainability.
  • Implemented NLTK tokenization and WordNet lemmatization to improve text normalization and preserve contextual meaning.
  • Enhanced overall intent classification accuracy and contextual understanding.
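
A minimal sketch of the new input path (the example message is hypothetical; the full pipeline, including lemmatization and the intent classifier, lives in chatbot/app.py):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Any user message maps to a fixed-size semantic vector:
message = "How do I create a new election?"  # hypothetical input
embedding = embedder.encode(message)
print(embedding.shape)  # (384,): the fixed input size fed to the classifier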

Dependencies

To run the upgraded chatbot, install its dependencies with the following command in the terminal:
pip install -U sentence-transformers nltk torch
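
NLTK also needs its tokenizer and WordNet data at runtime. The app downloads these itself on startup (nltk.download('punkt') and nltk.download('wordnet') in chatbot/app.py), but they can be fetched ahead of time as well, e.g.:

python -c "import nltk; nltk.download('punkt'); nltk.download('wordnet')"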

Fixes #238

Type of change

Please mark the options that are relevant.

  • Updated UI/UX
  • Improved the business logic of code
  • Added new feature
  • Other

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

Summary by CodeRabbit

  • New Features
    • Improved semantic understanding of user messages through advanced text processing
    • Enhanced natural language processing capabilities for more contextual chatbot interactions
    • Updated input handling methodology for refined message interpretation


coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

The chatbot's input processing pipeline was upgraded from a basic bag-of-words approach to semantic embeddings using SentenceTransformer. Lemmatization via NLTK's WordNetLemmatizer was introduced for text preprocessing. The chat route now encodes user messages into fixed 384-dimensional vectors, replacing tokenization and bag-of-words logic.

Changes

Semantic Embedding & Lemmatization Integration (chatbot/app.py):
Added clean_and_lemmatize() function for text preprocessing using NLTK WordNetLemmatizer. Introduced global embedder (SentenceTransformer 'all-MiniLM-L6-v2') and lemmatizer instances. Modified the chat route to encode user messages into 384-dimensional embeddings instead of bag-of-words tokenization. Removed the previous gibberish-check logic. Updated input_size to a fixed value of 384.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 Hopping through vectors so dense and so bright,
No more the words split on spaces so light,
With lemmas and embeddings, meaning runs deep,
This semantic transformation will help your bot leap!
From bag-of-words chains to transformers so grand—
A curious rabbit approves of this plan!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: skipped, since CodeRabbit's high-level summary is enabled.
  • Title Check: the title 'Improvising Chatbot's Semantic Accuracy' directly relates to the main change, replacing Bag-of-Words with transformer-based semantic embeddings to improve accuracy.
  • Linked Issues Check: all coding objectives from issue #238 are met; the NLTK lemmatization pipeline is added, SentenceTransformer embeddings are implemented, fixed 384-dimensional embeddings are standardized, and intent classification is improved via semantic understanding.
  • Out of Scope Changes Check: all changes align with issue #238 objectives; the lemmatization pipeline, embedding model, and semantic input processing are directly required and in scope.
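
For instance, a hypothetical one-line docstring on the new helper would count toward the coverage threshold (wording is illustrative):

def clean_and_lemmatize(sentence):
    """Lowercase and tokenize a sentence, strip punctuation, and return lemmatized tokens."""
    ...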



coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
chatbot/app.py (1)

10-19: Remove legacy commented-out BoW code.

Keeping the old pipeline as commented code increases noise and makes the runtime path harder to audit. Prefer deleting it and relying on VCS history.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 10 - 19, Remove the legacy commented-out
bag-of-words helper functions to reduce noise: delete the commented definitions
for tokenize, stem, and bag_of_words from chatbot/app.py so only active code
remains (rely on VCS for history); ensure no other code references these
function names so you don't break imports or runtime paths, and run tests to
confirm nothing depends on tokenize, stem, or bag_of_words before committing.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4df55f and c90fe77.

⛔ Files ignored due to path filters (1)
  • client/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (1)
  • chatbot/app.py

Comment on lines +25 to +32
# Downloading resources and init Lemmatizer
nltk.download('punkt')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Loading the Embedding Model
embedder = SentenceTransformer('all-MiniLM-L6-v2')


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -E "(chatbot|app\.py)" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 237


🏁 Script executed:

find . -name "*.py" -type f 2>/dev/null | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 106


🏁 Script executed:

cat -n chatbot/app.py | sed -n '20,40p'

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 853


Move network-dependent initialization out of module-level code.

Lines 26–31 execute network operations (nltk.download(), SentenceTransformer model loading) at import time, causing app startup to hang, fail in offline/containerized environments, and block deployment pipelines. Move this to an explicit initialization function called during startup, and provide clear error messages if resources fail to load.

Suggested approach: Initialize lemmatizer and embedder as None at module level, create an init_nlp_resources() function to perform the downloads/loading, and ensure it's called during app startup (e.g., before handling requests).
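
A minimal sketch of that initializer, following the suggested approach above (error handling and the startup call are illustrative):

import nltk
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer

lemmatizer = None
embedder = None

def init_nlp_resources():
    """Download NLTK data and load the embedding model before serving requests."""
    global lemmatizer, embedder
    try:
        nltk.download('punkt')
        nltk.download('wordnet')
        lemmatizer = WordNetLemmatizer()
        embedder = SentenceTransformer('all-MiniLM-L6-v2')
    except Exception as exc:
        raise RuntimeError(f"Failed to initialize NLP resources: {exc}") from exc

# Invoke during app startup, before any requests are handled:
# init_nlp_resources()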

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 25 - 32, Module-level network calls
(nltk.download and SentenceTransformer model load) must be moved into an
explicit startup initializer: set lemmatizer and embedder to None at module
scope, add an init_nlp_resources() function that calls nltk.download('punkt'),
nltk.download('wordnet'), instantiates WordNetLemmatizer into lemmatizer and
loads SentenceTransformer('all-MiniLM-L6-v2') into embedder, and raise/log clear
errors if any step fails; ensure init_nlp_resources() is invoked during
application startup (before handling requests) rather than at import time so
functions/classes that reference lemmatizer/embedder will see initialized
objects.

Comment on lines +33 to +35
def clean_and_lemmatize(sentence):
words = nltk.word_tokenize(sentence.lower())
return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]

⚠️ Potential issue | 🟡 Minor

Filter empty tokens after regex cleanup.

Line 35 checks emptiness before punctuation removal, so punctuation-only tokens can become empty strings and still be lemmatized/returned.

Proposed fix
 def clean_and_lemmatize(sentence):    
     words = nltk.word_tokenize(sentence.lower())
-    return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]
+    cleaned = []
+    for w in words:
+        token = re.sub(r'[^\w\s]', '', w).strip()
+        if token:
+            cleaned.append(lemmatizer.lemmatize(token))
+    return cleaned
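
With this fix, punctuation-only tokens are dropped entirely. For a hypothetical input:

# clean_and_lemmatize("Voting results, please!!")
# raw tokens:   ['voting', 'results', ',', 'please', '!', '!']
# final output: ['voting', 'result', 'please']  # ',' and '!' clean to '' and are filtered out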
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 33 - 35, clean_and_lemmatize currently filters
tokens with if w.strip() before removing punctuation, so punctuation-only tokens
become empty after re.sub and still get lemmatized; fix by cleaning each token
first then filtering and lemmatizing: in clean_and_lemmatize (which uses
nltk.word_tokenize and lemmatizer) call re.sub(r'[^\w\s]', '', w) to produce
cleaned, then if cleaned (non-empty) pass cleaned to lemmatizer and return that;
ensure tokenization, cleaning, filtering, and lemmatization order is token ->
cleaned = re.sub(...) -> if cleaned -> lemmatizer.lemmatize(cleaned).

Comment on lines +62 to +65
# size is fixed
input_size = 384
tags = data['tags']


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the chatbot/app.py file
find . -name "app.py" -path "*/chatbot/*" -type f

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 87


🏁 Script executed:

# Check the repository structure
git ls-files | grep -E "chatbot|checkpoint" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 237


🏁 Script executed:

cat -n ./chatbot/app.py | head -80

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 2857


🏁 Script executed:

cat -n ./chatbot/train.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 5367


Read input_size from the checkpoint instead of hardcoding.

Line 63 hardcodes input_size = 384, but the checkpoint at line 60 already contains this value (saved by train.py line 146). If the model is retrained with a different input dimension, this hardcoded value will cause the model to be initialized incorrectly and fail during inference.

Use input_size = data.get("input_size", 384) to read the saved value, or simply data["input_size"] since it's always present. Additionally, validate the loaded dimension matches the embedder's output:

Proposed guard
-# size is fixed
-input_size = 384 
+input_size = data.get("input_size", 384)
+expected_dim = embedder.get_sentence_embedding_dimension()
+if input_size != expected_dim:
+    raise ValueError(f"Checkpoint input_size={input_size} does not match embedder dim={expected_dim}")
 tags = data['tags']
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 62 - 65, Replace the hardcoded input_size = 384
with reading the saved value from the checkpoint (use data["input_size"] or
data.get("input_size", 384)) and ensure you validate that this loaded input_size
matches the embedder's output dimension before constructing the model; locate
the variables data and input_size in chatbot/app.py and compare the loaded
input_size against the embedder's output/embedding dimension (e.g.,
embedder.output_dim or embedder.embedding_dim) and raise/log a clear error if
they differ.

Comment on lines +91 to +93
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)

⚠️ Potential issue | 🟠 Major

Add a request-size guard before transformer encoding.

User input is unbounded before Line 92. Very large payloads can create avoidable latency/memory pressure per request.

Proposed guardrail
+MAX_MESSAGE_CHARS = 2000
+
 def chat():
     try:
         request_data = request.get_json()
         user_message = request_data.get('message', '')
+        if len(user_message) > MAX_MESSAGE_CHARS:
+            return jsonify({"message": "Message too long"}), 413
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 91 - 93, The code calls
embedder.encode(user_message) with unbounded user_message; add a request-size
guard before encoding to prevent huge inputs (e.g., check length/byte-size of
user_message and reject or truncate beyond a safe max, such as MAX_INPUT_CHARS)
and return an appropriate error/400 response. Implement this check in the same
scope where embedder.encode and X are used (referencing user_message,
embedder.encode, and X) so you either truncate user_message before calling
embedder.encode or short-circuit with a clear error when the input exceeds the
configured limit.

Comment on lines +91 to 94
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the structure around the embedding code
find . -type f -name "*.py" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 106


🏁 Script executed:

# Look for the clean_and_lemmatize function
rg "clean_and_lemmatize" -B 3 -A 10

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 834


🏁 Script executed:

# Check the specific file and lines mentioned
cd chatbot && wc -l app.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 81


🏁 Script executed:

# Read the relevant section of chatbot/app.py
cat -n chatbot/app.py | sed -n '80,110p'

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 1522


🏁 Script executed:

# Check train.py for preprocessing usage
cat -n chatbot/train.py | head -100

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 3447


🏁 Script executed:

# Find all uses of clean_and_lemmatize in the codebase
rg "clean_and_lemmatize" -B 2 -A 5

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 449


🏁 Script executed:

# Check the entire app.py to see the full context
cat -n chatbot/app.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 4515


Apply text preprocessing before embedding to maintain consistency with the clean_and_lemmatize pipeline.

The clean_and_lemmatize function is defined at line 33-35 but is never called in the inference path. Line 92 embeds raw user_message directly, bypassing the tokenization, lowercasing, and lemmatization preprocessing. This creates an inconsistency since the function was added to the codebase but is unused during inference.

Proposed fix
-# Getting semantic embedding (384-dimensional vector)
-embedding = embedder.encode(user_message)
+# Normalize text before embedding
+normalized = " ".join(clean_and_lemmatize(user_message)).strip()
+text_for_embedding = normalized if normalized else user_message
+embedding = embedder.encode(text_for_embedding)
 X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)        
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 91 - 94, Inference currently embeds raw
user_message via embedder.encode, bypassing the existing clean_and_lemmatize
pipeline; call clean_and_lemmatize(user_message) and use its return value when
computing embedding (i.e., replace embedder.encode(user_message) with
embedder.encode(cleaned_message)) before converting to tensor X and moving to
device, ensuring cleaned_message is a str and any None/empty result is handled
consistently with downstream logic.



Development

Successfully merging this pull request may close these issues.

[Enhancement]: Transition to Semantic Embeddings and NLTK Lemmatization
