
Improve chatbot NLP and training stability#232

Open
Rav1Chauhan wants to merge 5 commits into AOSSIE-Org:main from Rav1Chauhan:improve-chatbot-nlp

Conversation


@Rav1Chauhan Rav1Chauhan commented Feb 4, 2026

Description

This PR proposes improvements that make the chatbot model perform better. It aims to enhance the chatbot's stability, generalization, and performance by standardizing the training and deployment processes.

NLP Standardization
The project used a very simple, naive tokenizer and stemmer. I replaced them with the standard NLTK ones:

  • Used nltk.word_tokenize for superior tokenization.
  • Used PorterStemmer for effective stemming.
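The switch can be sketched as follows. The `preprocess` helper name is an assumption for illustration, not code from the PR; note that NLTK 3.8.2 and later ship the tokenizer data as `punkt_tab`.

```python
import nltk
from nltk.stem.porter import PorterStemmer

# NLTK >= 3.8.2 ships the tokenizer models as "punkt_tab"; older releases use "punkt".
nltk.download("punkt_tab", quiet=True)

stemmer = PorterStemmer()

def preprocess(sentence):
    """Tokenize with nltk.word_tokenize, then lowercase and stem each token."""
    tokens = nltk.word_tokenize(sentence)
    return [stemmer.stem(token.lower()) for token in tokens]

# The stemmer folds inflected forms onto a common stem:
print(stemmer.stem("running"))  # -> "run"
```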
Training Refinements (Efficiency & Monitoring)

  • Logging: calculate and print the average epoch loss (instead of just the final batch loss).
  • Performance: set num_workers > 0 in the DataLoader for parallel data fetching.
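A minimal sketch of both refinements, using stand-in tensors and a throwaway linear model rather than the PR's ChatDataset and NeuralNet. On Windows, a DataLoader with num_workers > 0 must be driven from under an `if __name__ == "__main__":` guard, which is the multiprocessing fix this PR makes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 64 bag-of-words vectors of size 10, 3 intent classes.
X = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))

# num_workers > 0 enables parallel data fetching; on Windows this loop must
# run under an `if __name__ == "__main__":` guard to avoid spawn errors.
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True, num_workers=2)

model = torch.nn.Linear(10, 3)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    running_loss = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * xb.size(0)
    # Average epoch loss over all samples, not just the final batch's loss.
    avg_loss = running_loss / len(loader.dataset)
    print(f"epoch {epoch}: average loss {avg_loss:.4f}")
```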

Fixes #198

What changed

  • Replaced naive tokenizer with NLTK tokenizer
  • Added PorterStemmer
  • Improved training logging
  • Fixed Windows multiprocessing issues

Why

  • Better generalization
  • Improved stability on Windows
  • Cleaner training process

Summary by CodeRabbit

  • New Features

    • Improved NLP preprocessing for more accurate replies, confidence-based response selection with a clearer fallback, and standardized POST /chat that returns {"response": ...}.
  • Bug Fixes

    • Rejects missing/invalid message payloads with a 400 error; startup is now configurable via environment variables.
  • Refactor

    • Reorganized training and runtime to use file-backed intents and model data with a structured dataset/training pipeline.


coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Replaces simple tokenization with NLTK tokenization + PorterStemmer, introduces bag-of-words numpy vectors, refactors training into a Dataset/DataLoader + NeuralNet, adds file-backed intents/model paths, and changes the Flask inference endpoint to POST /chat with JSON validation, softmax confidence gating, and model.eval() inference.

Changes

Cohort / File(s) Summary
Application / Inference
chatbot/app.py
Rewrote Flask endpoint to POST /chat; require JSON body with message; validate input and return 400 on bad requests; load intents.json and data.pth via path constants; run model under model.eval()/torch.no_grad(); compute confidence with softmax and apply a 0.75 threshold; response field renamed to response.
Training & Dataset
chatbot/train.py
Refactored training pipeline: ChatDataset (PyTorch Dataset), DataLoader, device-aware training loop, vocabulary/intents-driven preprocessing, save model state to MODEL_PATH, and use INTENTS_PATH for intents file.
NLP Utilities & Model
chatbot/app.py, chatbot/train.py
Replaced whitespace tokenization with nltk.word_tokenize, added PorterStemmer, implemented bag_of_words returning NumPy vectors, standardized NeuralNet (uses super().__init__() and updated forward path), and added ensure_nltk_tokenizers bootstrap.
Paths & Runtime Config
chatbot/app.py, chatbot/train.py
Introduced BASE_DIR, INTENTS_PATH, MODEL_PATH; removed global try/except wrapper and CORS usage; made Flask host/port/debug configurable via environment variables.
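The bag_of_words helper the walkthrough mentions can be sketched roughly like this; it is a hedged reconstruction from the summary (stemmed tokens mapped to a NumPy vector), not the PR's exact code.

```python
import numpy as np

def bag_of_words(tokenized_sentence, all_words):
    """Return a float32 vector with 1.0 at each vocabulary word present in the sentence."""
    bag = np.zeros(len(all_words), dtype=np.float32)
    for idx, word in enumerate(all_words):
        if word in tokenized_sentence:
            bag[idx] = 1.0
    return bag

print(bag_of_words(["hi", "there"], ["hi", "how", "there", "you"]))  # -> [1. 0. 1. 0.]
```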

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,200,255,0.5)
    actor Client
    end
    rect rgba(200,255,200,0.5)
    participant Endpoint as /chat Endpoint
    participant NLP as NLP Preprocessor
    participant Model as NeuralNet
    participant Intents as Intents Store
    participant ResGen as Response Generator
    end

    Client->>Endpoint: POST /chat { "message": "..." }
    Endpoint->>Endpoint: Validate JSON & message
    Endpoint->>NLP: Tokenize & stem
    NLP-->>Endpoint: bag_of_words vector
    Endpoint->>Model: Forward(vector) in eval() with no_grad
    Model-->>Endpoint: logits
    Endpoint->>Endpoint: softmax -> confidence, predicted tag
    Endpoint->>Intents: Lookup responses for tag
    alt confidence > 0.75
        Intents-->>ResGen: select response
        ResGen-->>Endpoint: response
        Endpoint->>Client: 200 { "response": ..., "tag": ..., "confidence": ... }
    else
        Endpoint->>Client: 200 { "response": "<fallback>", "confidence": ... }
    end
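The confidence gate in the diagram can be illustrated with a small sketch. The 0.75 threshold comes from the PR; the logits here are stand-in values, not real model output.

```python
import torch

logits = torch.tensor([[1.2, 3.9, 0.3]])  # stand-in model output for one message
probs = torch.softmax(logits, dim=1)
confidence, predicted = torch.max(probs, dim=1)

if confidence.item() > 0.75:
    print(f"tag index {predicted.item()}, confidence {confidence.item():.2f}")
else:
    print("fallback: I do not understand...")
```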

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through tokens, trimmed each stem,

NumPy bags and layers hum for them,
Softmax whispers, confidence tight,
I choose a reply in morning light,
A rabbit-chat: your message — delight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Improve chatbot NLP and training stability' directly aligns with the main changes: replacing naive tokenization with NLTK, adding PorterStemmer, improving training logging, and fixing Windows multiprocessing issues.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Around line 113-114: The current entrypoint unconditionally calls
app.run(host="0.0.0.0", port=5000, debug=True) which is unsafe for production;
change the __main__ block to read configuration from environment variables
(e.g., FLASK_DEBUG, FLASK_HOST, FLASK_PORT) and use those values with safe
defaults (debug default False, host default "127.0.0.1", port default 5000),
converting types appropriately before passing them into app.run; ensure you
still allow overriding for development but never default to debug=True or
host="0.0.0.0".

In `@chatbot/train.py`:
- Around line 65-68: Update the NLTK resource check/download logic so it handles
NLTK 3.8.2+ which uses the "punkt_tab" resource: in chatbot/train.py (and mirror
the same change in chatbot/app.py where word_tokenize is used) replace the
single check for nltk.data.find("tokenizers/punkt") with a two-step check that
first looks for "tokenizers/punkt_tab" and downloads "punkt_tab" if missing,
otherwise falls back to checking/downloading "tokenizers/punkt"; alternatively,
pin NLTK to <3.8.2 in requirements if you prefer not to change runtime logic.
Ensure the change targets the existing nltk.data.find/download block that
precedes calls to word_tokenize.
🧹 Nitpick comments (5)
chatbot/train.py (3)

1-28: Code duplication: NLP utilities and NeuralNet are duplicated between train.py and app.py.

The tokenize, stem, bag_of_words functions and NeuralNet class are defined identically in both files. This violates DRY and creates maintenance burden—any future change must be applied in both places to keep training and inference consistent.

Consider extracting shared code into a common module (e.g., chatbot/nlp_utils.py and chatbot/model.py) and importing from there.


104-109: Inefficient tensor↔numpy conversions.

bag_of_words returns a torch.tensor, which is then collected into a list and converted to np.array (line 108). The ChatDataset returns numpy arrays, and the DataLoader will convert them back to tensors. This creates unnecessary conversions.

Consider either:

  1. Return a numpy array from bag_of_words during training, or
  2. Stack tensors directly with torch.stack() and keep everything as tensors.
♻️ Suggested fix using torch.stack
-    X_train = []
-    y_train = []
-
-    for (tokens, tag) in xy:
-        X_train.append(bag_of_words(tokens, all_words))
-        y_train.append(tags.index(tag))
-
-    X_train = np.array(X_train)
-    y_train = np.array(y_train)
+    X_train = torch.stack([bag_of_words(tokens, all_words) for (tokens, tag) in xy])
+    y_train = torch.tensor([tags.index(tag) for (tokens, tag) in xy], dtype=torch.long)

73-74: Hardcoded relative paths may break when run from different directories.

The paths "intents.json" and "data.pth" assume the current working directory is the chatbot folder. Running from the repository root (e.g., python chatbot/train.py) will fail with FileNotFoundError.

♻️ Suggested fix using script-relative paths

Add near the top of main():

script_dir = os.path.dirname(os.path.abspath(__file__))
intents_path = os.path.join(script_dir, "intents.json")
model_path = os.path.join(script_dir, "data.pth")

Then use intents_path and model_path instead of the hardcoded strings.

Also applies to: 171-171

chatbot/app.py (2)

55-58: Same relative path issue as train.py.

The paths "intents.json" and "data.pth" are hardcoded. Use script-relative paths for robustness.

♻️ Suggested fix
import os

script_dir = os.path.dirname(os.path.abspath(__file__))
intents_path = os.path.join(script_dir, "intents.json")
model_path = os.path.join(script_dir, "data.pth")

with open(intents_path, "r") as f:
    intents = json.load(f)

data = torch.load(model_path, weights_only=True)

107-108: Avoid catching bare Exception.

The broad except Exception catch can mask unexpected errors and make debugging difficult. Consider catching specific exceptions (e.g., KeyError, json.JSONDecodeError) or at minimum, log the full traceback for observability.

♻️ Suggested improvement
     except Exception as e:
-        return jsonify({"error": str(e)}), 500
+        import traceback
+        traceback.print_exc()  # Log full traceback for debugging
+        return jsonify({"error": "An internal error occurred"}), 500

Alternatively, catch specific exceptions you expect (e.g., KeyError for missing JSON fields).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Line 114: The string literal returned by the handler (the return statement
using jsonify) contains a RIGHT SINGLE QUOTATION MARK (U+2019); replace it with
a standard ASCII apostrophe (') in the response text so the return
jsonify({"response": ...}) uses "Sorry, I didn't understand that." to avoid
potential encoding issues.
- Around line 91-93: The code assumes request.json is a dict and calls
.get("message"), which raises if request.json is None; change to safely parse
JSON using request.get_json(silent=True) (or check request.is_json) and handle a
None result before accessing "message": e.g., payload =
request.get_json(silent=True) and then message = (payload or {}).get("message");
if payload is None return a 400 with {"error":"JSON body required"} or if
message missing return the existing {"error":"Message is required"} response.
Ensure you update the handler where message is read (the code using
request.json.get("message")) to use payload and the two distinct error
responses.
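The fix described above can be sketched as a minimal Flask handler. The route and error messages mirror the review text; this is a hypothetical sketch, not the merged code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    # get_json(silent=True) returns None on a missing or invalid JSON body
    # instead of raising, so both failure modes get a clean 400.
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify({"error": "JSON body required"}), 400
    message = payload.get("message")
    if not message:
        return jsonify({"error": "Message is required"}), 400
    return jsonify({"response": f"echo: {message}"})

client = app.test_client()
print(client.post("/chat", json={"message": "hi"}).get_json())  # -> {'response': 'echo: hi'}
```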
🧹 Nitpick comments (3)
chatbot/app.py (3)

20-28: Consider catching a more specific exception for the fallback.

The broad Exception catch on line 26-27 is intentional for NLTK version compatibility, but it could mask unrelated errors (network failures, permission issues). Consider catching Exception but logging the error for debugging.

🔧 Optional: Add logging for better debugging
+import logging
+
 def ensure_nltk_tokenizers():
     try:
         nltk.data.find("tokenizers/punkt_tab/english")
     except LookupError:
         try:
             nltk.download("punkt_tab")
-        except Exception:
+        except Exception as e:
+            logging.debug(f"punkt_tab download failed ({e}), falling back to punkt")
             nltk.download("punkt")

34-48: Code duplication: NLP utilities are duplicated in train.py.

The tokenize, stem, and bag_of_words functions are identically defined in both app.py and train.py. Consider extracting these to a shared module (e.g., nlp_utils.py) to maintain a single source of truth.


59-70: NeuralNet class is duplicated in train.py.

The model architecture is defined identically in both files. If the architecture changes, both files must be updated. Extract NeuralNet to a shared module to ensure consistency between training and inference.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Around line 99-103: The gibberish branch inside the with torch.no_grad() block
currently returns jsonify({"message": "I do not understand..."}), which is
inconsistent with the rest of the endpoint that returns {"response": ...};
update that return to use the "response" key instead (i.e., return
jsonify({"response": "I do not understand..."}) ) so clients always receive the
same top-level key—this change should be made where X.sum().item() == 0 is
checked in the endpoint function.
- Around line 20-29: The ensure_nltk_tokenizers function currently ignores
nltk.download() failures because downloads return False rather than throwing;
update ensure_nltk_tokenizers to check the return values of
nltk.download("punkt_tab") and, if that returns False, attempt
nltk.download("punkt") and if that also returns False raise a RuntimeError (or
re-raise a descriptive exception) so the app fails fast; keep the initial
nltk.data.find("tokenizers/punkt_tab/english") check, call downloads only when
missing, and include the tokenizer names ("punkt_tab" and fallback "punkt") and
the function name ensure_nltk_tokenizers in your change to make the fix
discoverable.

Comment on lines +20 to +29
def ensure_nltk_tokenizers():
    try:
        nltk.data.find("tokenizers/punkt_tab/english")
    except LookupError:
        try:
            nltk.download("punkt_tab")
        except Exception:
            nltk.download("punkt")

ensure_nltk_tokenizers()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

nltk.download() return value success failure documentation

💡 Result:

nltk.download() is an alias for nltk.downloader._downloader.download, i.e., Downloader.download(...). It returns a boolean. [1]

  • Returns True when the download completes (including the interactive UI case where info_or_id=None). [1]
  • Returns False if an ErrorMessage occurs and halt_on_error=True (the default). [1]
  • If raise_on_error=True, it raises ValueError instead of returning False on an error. [1]
  • If halt_on_error=False, it will keep going after errors and still returns True at the end (even if some packages failed). [1]

Docs / source: NLTK downloader API + implementation. [1]


Fail fast when NLTK tokenizer downloads fail.

The current code silently swallows download failures because nltk.download() returns False on error rather than raising an exception—the try/except block is ineffective. If both downloads fail, the app continues normally but crashes later on the first request that needs the tokenizer.

Check the return values and raise an error immediately if both downloads fail:

Proposed fix
 def ensure_nltk_tokenizers():
     try:
         nltk.data.find("tokenizers/punkt_tab/english")
     except LookupError:
-        try:
-            nltk.download("punkt_tab")
-        except Exception:
-            nltk.download("punkt")
+        if not nltk.download("punkt_tab"):
+            if not nltk.download("punkt"):
+                raise RuntimeError("NLTK punkt tokenizer download failed")
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 26-26: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@chatbot/app.py` around lines 20 - 29, The ensure_nltk_tokenizers function
currently ignores nltk.download() failures because downloads return False rather
than throwing; update ensure_nltk_tokenizers to check the return values of
nltk.download("punkt_tab") and, if that returns False, attempt
nltk.download("punkt") and if that also returns False raise a RuntimeError (or
re-raise a descriptive exception) so the app fails fast; keep the initial
nltk.data.find("tokenizers/punkt_tab/english") check, call downloads only when
missing, and include the tokenizer names ("punkt_tab" and fallback "punkt") and
the function name ensure_nltk_tokenizers in your change to make the fix
discoverable.

Comment on lines +99 to 103
with torch.no_grad():
    # Check if input is gibberish (no known words recognized)
    if X.sum().item() == 0:
        return jsonify({"message": "I do not understand..."})


⚠️ Potential issue | 🟠 Major

Use a consistent response key for gibberish handling.

This branch returns {"message": ...} while the rest of the endpoint returns {"response": ...}. That breaks client expectations when gibberish is detected.

🧭 Proposed fix
-        if X.sum().item() == 0:
-            return jsonify({"message": "I do not understand..."})
+        if X.sum().item() == 0:
+            return jsonify({"response": "I do not understand..."})
🤖 Prompt for AI Agents
In `@chatbot/app.py` around lines 99 - 103, The gibberish branch inside the with
torch.no_grad() block currently returns jsonify({"message": "I do not
understand..."}), which is inconsistent with the rest of the endpoint that
returns {"response": ...}; update that return to use the "response" key instead
(i.e., return jsonify({"response": "I do not understand..."}) ) so clients
always receive the same top-level key—this change should be made where
X.sum().item() == 0 is checked in the endpoint function.

Rav1Chauhan and others added 2 commits March 26, 2026 02:47
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
chatbot/app.py (1)

91-91: Variable name shadows module-level data.

The local variable data on this line shadows the module-level data dictionary (line 57) containing model parameters. While this works correctly due to scoping, it could cause confusion. Consider renaming to request_data or payload.

♻️ Suggested rename
-    data = request.json
-    if not data:
+    payload = request.json
+    if not payload:
         return jsonify({"error": "Request body must be JSON"}), 400
-    message = data.get("message")
+    message = payload.get("message")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chatbot/app.py` at line 91, The local variable named `data` assigned from
`request.json` shadows the module-level `data` dict (the model parameters), so
rename the local `data` variable to `request_data` (or `payload`) wherever you
assign `request.json` and update all subsequent references in that
function/route handler to `request_data` to avoid confusion with the
module-level `data` dict.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6b593a76-845e-4ce8-a9f5-2b8fd68515fd

📥 Commits

Reviewing files that changed from the base of the PR and between ce87ab9 and a6d55c4.

📒 Files selected for processing (1)
  • chatbot/app.py

