Improve chatbot NLP and training stability #232

Rav1Chauhan wants to merge 5 commits into AOSSIE-Org:main from
Conversation
📝 Walkthrough

Replaces simple tokenization with NLTK tokenization + PorterStemmer, introduces bag-of-words numpy vectors, refactors training into a Dataset/DataLoader + NeuralNet, adds file-backed intents/model paths, and changes the Flask inference endpoint to …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(200,200,255,0.5)
        actor Client
    end
    rect rgba(200,255,200,0.5)
        participant Endpoint as /chat Endpoint
        participant NLP as NLP Preprocessor
        participant Model as NeuralNet
        participant Intents as Intents Store
        participant ResGen as Response Generator
    end
    Client->>Endpoint: POST /chat { "message": "..." }
    Endpoint->>Endpoint: Validate JSON & message
    Endpoint->>NLP: Tokenize & stem
    NLP-->>Endpoint: bag_of_words vector
    Endpoint->>Model: Forward(vector) in eval() with no_grad
    Model-->>Endpoint: logits
    Endpoint->>Endpoint: softmax -> confidence, predicted tag
    Endpoint->>Intents: Lookup responses for tag
    alt confidence > 0.75
        Intents-->>ResGen: select response
        ResGen-->>Endpoint: response
        Endpoint->>Client: 200 { "response": ..., "tag": ..., "confidence": ... }
    else
        Endpoint->>Client: 200 { "response": "<fallback>", "confidence": ... }
    end
```
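The softmax-to-confidence step in the diagram can be sketched in plain Python. This is an illustrative stand-in, not code from the PR; `softmax` and `predict_tag` are hypothetical names, and the 0.75 threshold mirrors the flow above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tag(logits, tags, threshold=0.75):
    """Return (tag, confidence); tag is None when confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence > threshold:
        return tags[best], confidence
    return None, confidence
```

A sharply peaked logit vector yields high confidence and a tag; a flat one falls through to the fallback branch in the diagram.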
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Around line 113-114: The current entrypoint unconditionally calls
app.run(host="0.0.0.0", port=5000, debug=True) which is unsafe for production;
change the __main__ block to read configuration from environment variables
(e.g., FLASK_DEBUG, FLASK_HOST, FLASK_PORT) and use those values with safe
defaults (debug default False, host default "127.0.0.1", port default 5000),
converting types appropriately before passing them into app.run; ensure you
still allow overriding for development but never default to debug=True or
host="0.0.0.0".
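A minimal sketch of the suggested entrypoint change, assuming the environment variable names proposed above (`FLASK_DEBUG`, `FLASK_HOST`, `FLASK_PORT`); the helper name `get_run_config` is illustrative, not from the PR:

```python
import os

def get_run_config(env=None):
    """Read Flask run settings from environment variables with safe defaults:
    debug defaults to False, host to 127.0.0.1, port to 5000."""
    env = os.environ if env is None else env
    debug = env.get("FLASK_DEBUG", "0").lower() in ("1", "true", "yes")
    host = env.get("FLASK_HOST", "127.0.0.1")
    port = int(env.get("FLASK_PORT", "5000"))
    return debug, host, port

# In chatbot/app.py the __main__ block would then become roughly:
#     if __name__ == "__main__":
#         debug, host, port = get_run_config()
#         app.run(host=host, port=port, debug=debug)
```

Development can still override via the environment, but nothing defaults to `debug=True` or `host="0.0.0.0"`.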
In `@chatbot/train.py`:
- Around line 65-68: Update the NLTK resource check/download logic so it handles
NLTK 3.8.2+ which uses the "punkt_tab" resource: in chatbot/train.py (and mirror
the same change in chatbot/app.py where word_tokenize is used) replace the
single check for nltk.data.find("tokenizers/punkt") with a two-step check that
first looks for "tokenizers/punkt_tab" and downloads "punkt_tab" if missing,
otherwise falls back to checking/downloading "tokenizers/punkt"; alternatively,
pin NLTK to <3.8.2 in requirements if you prefer not to change runtime logic.
Ensure the change targets the existing nltk.data.find/download block that
precedes calls to word_tokenize.
🧹 Nitpick comments (5)
chatbot/train.py (3)
1-28: Code duplication: NLP utilities and NeuralNet are duplicated between `train.py` and `app.py`.

The `tokenize`, `stem`, and `bag_of_words` functions and the `NeuralNet` class are defined identically in both files. This violates DRY and creates a maintenance burden: any future change must be applied in both places to keep training and inference consistent.

Consider extracting the shared code into common modules (e.g., `chatbot/nlp_utils.py` and `chatbot/model.py`) and importing from there.
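A sketch of what the suggested `chatbot/nlp_utils.py` could look like. The real module would delegate to `nltk.word_tokenize` and `nltk.stem.PorterStemmer` as the PR does; the simplified tokenizer and stemmer below are stand-ins so the sketch runs without NLTK installed.

```python
# Hypothetical chatbot/nlp_utils.py. The tokenizer and stemmer here are
# crude stand-ins for nltk.word_tokenize and PorterStemmer().stem, used
# only to illustrate the shared-module shape.

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return sentence.lower().split()

def stem(word):
    """Strip a few common suffixes (stand-in for PorterStemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(tokens, all_words):
    """Return a 0/1 vector marking which vocabulary words appear in tokens."""
    stemmed = {stem(t) for t in tokens}
    return [1.0 if w in stemmed else 0.0 for w in all_words]
```

Both `train.py` and `app.py` would then `from nlp_utils import tokenize, stem, bag_of_words`, giving a single source of truth.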
104-109: Inefficient tensor↔numpy conversions.

`bag_of_words` returns a `torch.tensor`, which is then collected into a list and converted to `np.array` (line 108). The `ChatDataset` returns numpy arrays, and the `DataLoader` will convert them back to tensors. This creates unnecessary conversions.

Consider either:
- returning a numpy array from `bag_of_words` during training, or
- stacking tensors directly with `torch.stack()` and keeping everything as tensors.

♻️ Suggested fix using torch.stack

```diff
-X_train = []
-y_train = []
-
-for (tokens, tag) in xy:
-    X_train.append(bag_of_words(tokens, all_words))
-    y_train.append(tags.index(tag))
-
-X_train = np.array(X_train)
-y_train = np.array(y_train)
+X_train = torch.stack([bag_of_words(tokens, all_words) for (tokens, tag) in xy])
+y_train = torch.tensor([tags.index(tag) for (tokens, tag) in xy], dtype=torch.long)
```
73-74: Hardcoded relative paths may break when run from different directories.

The paths `"intents.json"` and `"data.pth"` assume the current working directory is the `chatbot` folder. Running from the repository root (e.g., `python chatbot/train.py`) will fail with `FileNotFoundError`.

♻️ Suggested fix using script-relative paths

Add near the top of `main()`:

```python
script_dir = os.path.dirname(os.path.abspath(__file__))
intents_path = os.path.join(script_dir, "intents.json")
model_path = os.path.join(script_dir, "data.pth")
```

Then use `intents_path` and `model_path` instead of the hardcoded strings.

Also applies to: 171-171
chatbot/app.py (2)
55-58: Same relative path issue as `train.py`.

The paths `"intents.json"` and `"data.pth"` are hardcoded. Use script-relative paths for robustness.

♻️ Suggested fix

```python
import os

script_dir = os.path.dirname(os.path.abspath(__file__))
intents_path = os.path.join(script_dir, "intents.json")
model_path = os.path.join(script_dir, "data.pth")

with open(intents_path, "r") as f:
    intents = json.load(f)

data = torch.load(model_path, weights_only=True)
```
107-108: Avoid catching bare `Exception`.

The broad `except Exception` catch can mask unexpected errors and make debugging difficult. Consider catching specific exceptions (e.g., `KeyError`, `json.JSONDecodeError`) or, at minimum, log the full traceback for observability.

♻️ Suggested improvement

```diff
 except Exception as e:
-    return jsonify({"error": str(e)}), 500
+    import traceback
+    traceback.print_exc()  # Log full traceback for debugging
+    return jsonify({"error": "An internal error occurred"}), 500
```

Alternatively, catch specific exceptions you expect (e.g., `KeyError` for missing JSON fields).
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Line 114: The string literal returned by the handler (the return statement
using jsonify) contains a RIGHT SINGLE QUOTATION MARK (U+2019); replace it with
a standard ASCII apostrophe (') in the response text so the return
jsonify({"response": ...}) uses "Sorry, I didn't understand that." to avoid
potential encoding issues.
- Around line 91-93: The code assumes request.json is a dict and calls
.get("message"), which raises if request.json is None; change to safely parse
JSON using request.get_json(silent=True) (or check request.is_json) and handle a
None result before accessing "message": e.g., payload =
request.get_json(silent=True) and then message = (payload or {}).get("message");
if payload is None return a 400 with {"error":"JSON body required"} or if
message missing return the existing {"error":"Message is required"} response.
Ensure you update the handler where message is read (the code using
request.json.get("message")) to use payload and the two distinct error
responses.
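The safe-parsing logic suggested above can be factored into a small helper. This is a sketch under the PR's assumptions; `parse_chat_payload` is a hypothetical name, and the dict argument stands in for the result of Flask's `request.get_json(silent=True)`.

```python
def parse_chat_payload(payload):
    """Validate a parsed JSON body (as returned by request.get_json(silent=True)).

    Returns (message, error_body, status); message is None when invalid.
    """
    if payload is None:
        return None, {"error": "JSON body required"}, 400
    message = payload.get("message")
    if not message:
        return None, {"error": "Message is required"}, 400
    return message, None, 200

# In the Flask handler this would be used roughly as:
#     payload = request.get_json(silent=True)
#     message, error_body, status = parse_chat_payload(payload)
#     if error_body is not None:
#         return jsonify(error_body), status
```

Keeping the validation pure (no `request` access inside) also makes it unit-testable without a Flask test client.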
🧹 Nitpick comments (3)
chatbot/app.py (3)
20-28: Consider catching a more specific exception for the fallback.

The broad `Exception` catch on lines 26-27 is intentional for NLTK version compatibility, but it could mask unrelated errors (network failures, permission issues). Consider catching `Exception` but logging the error for debugging.

🔧 Optional: Add logging for better debugging

```diff
+import logging
+
 def ensure_nltk_tokenizers():
     try:
         nltk.data.find("tokenizers/punkt_tab/english")
     except LookupError:
         try:
             nltk.download("punkt_tab")
-        except Exception:
+        except Exception as e:
+            logging.debug(f"punkt_tab download failed ({e}), falling back to punkt")
             nltk.download("punkt")
```
34-48: Code duplication: NLP utilities are duplicated in `train.py`.

The `tokenize`, `stem`, and `bag_of_words` functions are identically defined in both `app.py` and `train.py`. Consider extracting these to a shared module (e.g., `nlp_utils.py`) to maintain a single source of truth.

59-70: `NeuralNet` class is duplicated in `train.py`.

The model architecture is defined identically in both files. If the architecture changes, both files must be updated. Extract `NeuralNet` to a shared module to ensure consistency between training and inference.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@chatbot/app.py`:
- Around line 99-103: The gibberish branch inside the with torch.no_grad() block
currently returns jsonify({"message": "I do not understand..."}), which is
inconsistent with the rest of the endpoint that returns {"response": ...};
update that return to use the "response" key instead (i.e., return
jsonify({"response": "I do not understand..."}) ) so clients always receive the
same top-level key—this change should be made where X.sum().item() == 0 is
checked in the endpoint function.
- Around line 20-29: The ensure_nltk_tokenizers function currently ignores
nltk.download() failures because downloads return False rather than throwing;
update ensure_nltk_tokenizers to check the return values of
nltk.download("punkt_tab") and, if that returns False, attempt
nltk.download("punkt") and if that also returns False raise a RuntimeError (or
re-raise a descriptive exception) so the app fails fast; keep the initial
nltk.data.find("tokenizers/punkt_tab/english") check, call downloads only when
missing, and include the tokenizer names ("punkt_tab" and fallback "punkt") and
the function name ensure_nltk_tokenizers in your change to make the fix
discoverable.
```python
def ensure_nltk_tokenizers():
    try:
        nltk.data.find("tokenizers/punkt_tab/english")
    except LookupError:
        try:
            nltk.download("punkt_tab")
        except Exception:
            nltk.download("punkt")


ensure_nltk_tokenizers()
```
🧩 Analysis chain
🌐 Web query:
nltk.download() return value success failure documentation
💡 Result:
nltk.download() is an alias for nltk.downloader._downloader.download, i.e., Downloader.download(...). It returns a boolean. [1]

- Returns `True` when the download completes (including the interactive UI case where `info_or_id=None`). [1]
- Returns `False` if an `ErrorMessage` occurs and `halt_on_error=True` (the default). [1]
- If `raise_on_error=True`, it raises `ValueError` instead of returning `False` on an error. [1]
- If `halt_on_error=False`, it will keep going after errors and still return `True` at the end (even if some packages failed). [1]

Docs / source: NLTK downloader API + implementation. [1]
Fail fast when NLTK tokenizer downloads fail.
The current code silently swallows download failures because nltk.download() returns False on error rather than raising an exception—the try/except block is ineffective. If both downloads fail, the app continues normally but crashes later on the first request that needs the tokenizer.
Check the return values and raise an error immediately if both downloads fail:
Proposed fix
def ensure_nltk_tokenizers():
try:
nltk.data.find("tokenizers/punkt_tab/english")
except LookupError:
- try:
- nltk.download("punkt_tab")
- except Exception:
- nltk.download("punkt")
+ if not nltk.download("punkt_tab"):
+ if not nltk.download("punkt"):
+          raise RuntimeError("NLTK punkt tokenizer download failed")

🧰 Tools

🪛 Ruff (0.14.14)
[warning] 26-26: Do not catch blind exception: Exception (BLE001)
```python
with torch.no_grad():
    # Check if input is gibberish (no known words recognized)
    if X.sum().item() == 0:
        return jsonify({"message": "I do not understand..."})
```
Use a consistent response key for gibberish handling.
This branch returns {"message": ...} while the rest of the endpoint returns {"response": ...}. That breaks client expectations when gibberish is detected.
🧭 Proposed fix
```diff
-    if X.sum().item() == 0:
-        return jsonify({"message": "I do not understand..."})
+    if X.sum().item() == 0:
+        return jsonify({"response": "I do not understand..."})
```
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
🧹 Nitpick comments (1)
chatbot/app.py (1)
91-91: Variable name shadows module-level `data`.

The local variable `data` on this line shadows the module-level `data` dictionary (line 57) containing model parameters. While this works correctly due to scoping, it could cause confusion. Consider renaming to `request_data` or `payload`.

♻️ Suggested rename

```diff
-data = request.json
-if not data:
+payload = request.json
+if not payload:
     return jsonify({"error": "Request body must be JSON"}), 400
-message = data.get("message")
+message = payload.get("message")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chatbot/app.py` at line 91, The local variable named `data` assigned from `request.json` shadows the module-level `data` dict (the model parameters), so rename the local `data` variable to `request_data` (or `payload`) wherever you assign `request.json` and update all subsequent references in that function/route handler to `request_data` to avoid confusion with the module-level `data` dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@chatbot/app.py`:
- Line 91: The local variable named `data` assigned from `request.json` shadows
the module-level `data` dict (the model parameters), so rename the local `data`
variable to `request_data` (or `payload`) wherever you assign `request.json` and
update all subsequent references in that function/route handler to
`request_data` to avoid confusion with the module-level `data` dict.
Description
Possible improvements that can make the chatbot model perform better. This proposal aims to enhance the chatbot's stability, generalization, and performance by standardizing the training and deployment processes.
NLP Standardization
The project used a very simple, naive tokenizer and stemmer. I improved this by using the standard ones:
- Used `nltk.word_tokenize` for superior tokenization.
- Used `PorterStemmer` for effective stemming.

Training Refinements (Efficiency & Monitoring)
- Logging: calculate and print the average epoch loss (instead of just the final batch loss).
- Performance: set `num_workers > 0` in the `DataLoader` for parallel data fetching.
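The average-epoch-loss logging described above can be sketched as follows. This is pure Python for illustration; `average_epoch_loss` and the commented loop are hypothetical, and `batch_losses` stands in for the per-batch `loss.item()` values a PyTorch training loop produces.

```python
def average_epoch_loss(batch_losses):
    """Mean of the per-batch losses accumulated over one epoch."""
    if not batch_losses:
        raise ValueError("no batches in epoch")
    return sum(batch_losses) / len(batch_losses)

# Inside the real training loop this would look roughly like:
#     epoch_losses = []
#     for words, labels in train_loader:
#         ...  # forward, loss, backward, step
#         epoch_losses.append(loss.item())
#     print(f"Epoch {epoch}: avg loss {average_epoch_loss(epoch_losses):.4f}")
```

Averaging over all batches gives a smoother, more representative training signal than the last batch's loss alone.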
Fixes #198