
Improvising Chatbot's Semantic Accuracy#239

Open
6vam4arya wants to merge 1 commit into AOSSIE-Org:main from 6vam4arya:chatbot

Conversation


6vam4arya commented Mar 3, 2026

Description

  • Upgraded the chatbot from a keyword-based Bag of Words model to a semantic-aware system using Transformer-based embeddings (a minimal sketch of the new input path follows below).
  • Standardized input representation to a fixed 384-dimensional embedding, improving scalability and maintainability.
  • Implemented NLTK tokenization and WordNet lemmatization to improve text normalization and preserve contextual meaning.
  • Enhanced overall intent classification accuracy and contextual understanding.
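
A minimal sketch of the new input path (the example message is hypothetical; the full pipeline, including lemmatization and the intent classifier, lives in chatbot/app.py):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Any user message maps to a fixed-size semantic vector:
message = "How do I create a new election?"  # hypothetical input
embedding = embedder.encode(message)
print(embedding.shape)  # (384,): the fixed input size fed to the classifier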

Dependencies

To run the upgraded chatbot, install its dependencies with the following command in the terminal:
pip install -U sentence-transformers nltk torch
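
NLTK also needs its tokenizer and WordNet data at runtime. The app downloads these itself on startup (nltk.download('punkt') and nltk.download('wordnet') in chatbot/app.py), but they can be fetched ahead of time as well, e.g.:

python -c "import nltk; nltk.download('punkt'); nltk.download('wordnet')"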

Fixes #238

Type of change

Please mark the options that are relevant.

  • Updated UI/UX
  • Improved the business logic of code
  • Added new feature
  • Other

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

Summary by CodeRabbit

  • New Features
    • Improved semantic understanding of user messages through advanced text processing
    • Enhanced natural language processing capabilities for more contextual chatbot interactions
    • Updated input handling methodology for refined message interpretation


coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

The chatbot's input processing pipeline was upgraded from a basic bag-of-words approach to semantic embeddings using SentenceTransformer. Lemmatization via NLTK's WordNetLemmatizer was introduced for text preprocessing. The chat route now encodes user messages into fixed 384-dimensional vectors, replacing tokenization and bag-of-words logic.

Changes

Semantic Embedding & Lemmatization Integration (chatbot/app.py):
Added clean_and_lemmatize() function for text preprocessing using NLTK WordNetLemmatizer. Introduced global embedder (SentenceTransformer 'all-MiniLM-L6-v2') and lemmatizer instances. Modified the chat route to encode user messages into 384-dimensional embeddings instead of bag-of-words tokenization. Removed the previous gibberish-check logic. Updated input_size to a fixed value of 384.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 Hopping through vectors so dense and so bright,
No more the words split on spaces so light,
With lemmas and embeddings, meaning runs deep,
This semantic transformation will help your bot leap!
From bag-of-words chains to transformers so grand—
A curious rabbit approves of this plan!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: skipped, since CodeRabbit's high-level summary is enabled.
  • Title Check: the title 'Improvising Chatbot's Semantic Accuracy' directly relates to the main change, replacing Bag-of-Words with transformer-based semantic embeddings to improve accuracy.
  • Linked Issues Check: all coding objectives from issue #238 are met; the NLTK lemmatization pipeline is added, SentenceTransformer embeddings are implemented, fixed 384-dimensional embeddings are standardized, and intent classification is improved via semantic understanding.
  • Out of Scope Changes Check: all changes align with issue #238 objectives; the lemmatization pipeline, embedding model, and semantic input processing are directly required and in scope.
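
For instance, a hypothetical one-line docstring on the new helper would count toward the coverage threshold (wording is illustrative):

def clean_and_lemmatize(sentence):
    """Lowercase and tokenize a sentence, strip punctuation, and return lemmatized tokens."""
    ...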



coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
chatbot/app.py (1)

10-19: Remove legacy commented-out BoW code.

Keeping the old pipeline as commented code increases noise and makes the runtime path harder to audit. Prefer deleting it and relying on VCS history.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 10 - 19, Remove the legacy commented-out
bag-of-words helper functions to reduce noise: delete the commented definitions
for tokenize, stem, and bag_of_words from chatbot/app.py so only active code
remains (rely on VCS for history); ensure no other code references these
function names so you don't break imports or runtime paths, and run tests to
confirm nothing depends on tokenize, stem, or bag_of_words before committing.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4df55f and c90fe77.

⛔ Files ignored due to path filters (1)
  • client/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (1)
  • chatbot/app.py

Comment on lines +25 to +32
# Downloading resources and init Lemmatizer
nltk.download('punkt')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Loading the Embedding Model
embedder = SentenceTransformer('all-MiniLM-L6-v2')


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -E "(chatbot|app\.py)" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 237


🏁 Script executed:

find . -name "*.py" -type f 2>/dev/null | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 106


🏁 Script executed:

cat -n chatbot/app.py | sed -n '20,40p'

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 853


Move network-dependent initialization out of module-level code.

Lines 26–31 execute network operations (nltk.download(), SentenceTransformer model loading) at import time, causing app startup to hang, fail in offline/containerized environments, and block deployment pipelines. Move this to an explicit initialization function called during startup, and provide clear error messages if resources fail to load.

Suggested approach: Initialize lemmatizer and embedder as None at module level, create an init_nlp_resources() function to perform the downloads/loading, and ensure it's called during app startup (e.g., before handling requests).
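
A minimal sketch of that initializer, following the suggested approach above (error handling and the startup call are illustrative):

import nltk
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer

lemmatizer = None
embedder = None

def init_nlp_resources():
    """Download NLTK data and load the embedding model before serving requests."""
    global lemmatizer, embedder
    try:
        nltk.download('punkt')
        nltk.download('wordnet')
        lemmatizer = WordNetLemmatizer()
        embedder = SentenceTransformer('all-MiniLM-L6-v2')
    except Exception as exc:
        raise RuntimeError(f"Failed to initialize NLP resources: {exc}") from exc

# Invoke during app startup, before any requests are handled:
# init_nlp_resources()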

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 25 - 32, Module-level network calls
(nltk.download and SentenceTransformer model load) must be moved into an
explicit startup initializer: set lemmatizer and embedder to None at module
scope, add an init_nlp_resources() function that calls nltk.download('punkt'),
nltk.download('wordnet'), instantiates WordNetLemmatizer into lemmatizer and
loads SentenceTransformer('all-MiniLM-L6-v2') into embedder, and raise/log clear
errors if any step fails; ensure init_nlp_resources() is invoked during
application startup (before handling requests) rather than at import time so
functions/classes that reference lemmatizer/embedder will see initialized
objects.

Comment on lines +33 to +35
def clean_and_lemmatize(sentence):
words = nltk.word_tokenize(sentence.lower())
return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]

⚠️ Potential issue | 🟡 Minor

Filter empty tokens after regex cleanup.

Line 35 checks emptiness before punctuation removal, so punctuation-only tokens can become empty strings and still be lemmatized/returned.

Proposed fix
 def clean_and_lemmatize(sentence):    
     words = nltk.word_tokenize(sentence.lower())
-    return [lemmatizer.lemmatize(re.sub(r'[^\w\s]', '', w)) for w in words if w.strip()]
+    cleaned = []
+    for w in words:
+        token = re.sub(r'[^\w\s]', '', w).strip()
+        if token:
+            cleaned.append(lemmatizer.lemmatize(token))
+    return cleaned
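
With this fix, punctuation-only tokens are dropped entirely. For a hypothetical input:

# clean_and_lemmatize("Voting results, please!!")
# raw tokens:   ['voting', 'results', ',', 'please', '!', '!']
# final output: ['voting', 'result', 'please']  # ',' and '!' clean to '' and are filtered out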
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 33 - 35, clean_and_lemmatize currently filters
tokens with if w.strip() before removing punctuation, so punctuation-only tokens
become empty after re.sub and still get lemmatized; fix by cleaning each token
first then filtering and lemmatizing: in clean_and_lemmatize (which uses
nltk.word_tokenize and lemmatizer) call re.sub(r'[^\w\s]', '', w) to produce
cleaned, then if cleaned (non-empty) pass cleaned to lemmatizer and return that;
ensure tokenization, cleaning, filtering, and lemmatization order is token ->
cleaned = re.sub(...) -> if cleaned -> lemmatizer.lemmatize(cleaned).

Comment on lines +62 to +65
# size is fixed
input_size = 384
tags = data['tags']


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the chatbot/app.py file
find . -name "app.py" -path "*/chatbot/*" -type f

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 87


🏁 Script executed:

# Check the repository structure
git ls-files | grep -E "chatbot|checkpoint" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 237


🏁 Script executed:

cat -n ./chatbot/app.py | head -80

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 2857


🏁 Script executed:

cat -n ./chatbot/train.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 5367


Read input_size from the checkpoint instead of hardcoding.

Line 63 hardcodes input_size = 384, but the checkpoint at line 60 already contains this value (saved by train.py line 146). If the model is retrained with a different input dimension, this hardcoded value will cause the model to be initialized incorrectly and fail during inference.

Use input_size = data.get("input_size", 384) to read the saved value, or simply data["input_size"] since it's always present. Additionally, validate the loaded dimension matches the embedder's output:

Proposed guard
-# size is fixed
-input_size = 384 
+input_size = data.get("input_size", 384)
+expected_dim = embedder.get_sentence_embedding_dimension()
+if input_size != expected_dim:
+    raise ValueError(f"Checkpoint input_size={input_size} does not match embedder dim={expected_dim}")
 tags = data['tags']
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 62 - 65, Replace the hardcoded input_size = 384
with reading the saved value from the checkpoint (use data["input_size"] or
data.get("input_size", 384)) and ensure you validate that this loaded input_size
matches the embedder's output dimension before constructing the model; locate
the variables data and input_size in chatbot/app.py and compare the loaded
input_size against the embedder's output/embedding dimension (e.g.,
embedder.output_dim or embedder.embedding_dim) and raise/log a clear error if
they differ.

Comment on lines +91 to +93
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)

⚠️ Potential issue | 🟠 Major

Add a request-size guard before transformer encoding.

User input is unbounded before Line 92. Very large payloads can create avoidable latency/memory pressure per request.

Proposed guardrail
+MAX_MESSAGE_CHARS = 2000
+
 def chat():
     try:
         request_data = request.get_json()
         user_message = request_data.get('message', '')
+        if len(user_message) > MAX_MESSAGE_CHARS:
+            return jsonify({"message": "Message too long"}), 413
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 91 - 93, The code calls
embedder.encode(user_message) with unbounded user_message; add a request-size
guard before encoding to prevent huge inputs (e.g., check length/byte-size of
user_message and reject or truncate beyond a safe max, such as MAX_INPUT_CHARS)
and return an appropriate error/400 response. Implement this check in the same
scope where embedder.encode and X are used (referencing user_message,
embedder.encode, and X) so you either truncate user_message before calling
embedder.encode or short-circuit with a clear error when the input exceeds the
configured limit.

Comment on lines +91 to 94
# Getting semantic embedding (384-dimensional vector)
embedding = embedder.encode(user_message)
X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the structure around the embedding code
find . -type f -name "*.py" | head -20

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 106


🏁 Script executed:

# Look for the clean_and_lemmatize function
rg "clean_and_lemmatize" -B 3 -A 10

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 834


🏁 Script executed:

# Check the specific file and lines mentioned
cd chatbot && wc -l app.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 81


🏁 Script executed:

# Read the relevant section of chatbot/app.py
cat -n chatbot/app.py | sed -n '80,110p'

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 1522


🏁 Script executed:

# Check train.py for preprocessing usage
cat -n chatbot/train.py | head -100

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 3447


🏁 Script executed:

# Find all uses of clean_and_lemmatize in the codebase
rg "clean_and_lemmatize" -B 2 -A 5

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 449


🏁 Script executed:

# Check the entire app.py to see the full context
cat -n chatbot/app.py

Repository: AOSSIE-Org/Agora-Blockchain

Length of output: 4515


Apply text preprocessing before embedding to maintain consistency with the clean_and_lemmatize pipeline.

The clean_and_lemmatize function is defined at line 33-35 but is never called in the inference path. Line 92 embeds raw user_message directly, bypassing the tokenization, lowercasing, and lemmatization preprocessing. This creates an inconsistency since the function was added to the codebase but is unused during inference.

Proposed fix
-# Getting semantic embedding (384-dimensional vector)
-embedding = embedder.encode(user_message)
+# Normalize text before embedding
+normalized = " ".join(clean_and_lemmatize(user_message)).strip()
+text_for_embedding = normalized if normalized else user_message
+embedding = embedder.encode(text_for_embedding)
 X = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)        
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` around lines 91 - 94, Inference currently embeds raw
user_message via embedder.encode, bypassing the existing clean_and_lemmatize
pipeline; call clean_and_lemmatize(user_message) and use its return value when
computing embedding (i.e., replace embedder.encode(user_message) with
embedder.encode(cleaned_message)) before converting to tensor X and moving to
device, ensuring cleaned_message is a str and any None/empty result is handled
consistently with downstream logic.



Development

Successfully merging this pull request may close these issues.

[Enhancement]: Transition to Semantic Embeddings and NLTK Lemmatization
