Skip to content

Edit: 코드 리펙토링 및 v0.0.5 업데이트#7

Merged
Gyuho-Han merged 1 commit into
mainfrom
v0.0.5
Apr 15, 2026
Merged

Edit: 코드 리펙토링 및 v0.0.5 업데이트#7
Gyuho-Han merged 1 commit into
mainfrom
v0.0.5

Conversation

@Gyuho-Han
Copy link
Copy Markdown
Owner

@Gyuho-Han Gyuho-Han commented Apr 15, 2026

v0.0.5 업데이트

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved error messaging when required data files are missing; errors now display clearly in the UI rather than causing silent failures.
  • Chores

    • Added minimum version constraints to all dependencies to ensure reproducible builds and environment stability.
    • Updated documentation to reflect streamlined data processing workflow and module organization.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

This PR refactors a monolithic app.py into modular components (config.py, text_utils.py, retriever.py, chain.py) while consolidating data processing scripts (data_cleaning.py and data_preprocessing_for_RAG.py) into a single data_pipeline.py. Configuration is centralized, dependencies are version-pinned, and documentation is updated to reflect the new structure.

Changes

Cohort / File(s) Summary
Configuration Centralization
config.py
New module defining filesystem paths, model identifiers, retriever parameters, generation hyperparameters, and chunking controls as exportable constants.
Core Utility Modules
text_utils.py, chain.py, retriever.py
Three new modules implementing text cleaning/parsing, LLM generation chain initialization with Korean prompt template, and hybrid RAG retrieval with cross-encoder reranking and document formatting.
Data Pipeline Consolidation
data_pipeline.py
New unified script merging deleted data_cleaning.py and data_preprocessing_for_RAG.py functionality; orchestrates two-stage data cleaning and RAG chunk preprocessing with centralized config paths.
App Refactoring
app.py
Removed inlined text processing, retrieval, and chain initialization logic; now imports from text_utils, retriever, and chain; added try/except FileNotFoundError handling for missing documents/vectorstore with user-facing error messages.
Vector DB Integration
ingest_vector_db.py
Delegated document loading to retriever.load_documents(); replaced hardcoded embedding model with EMBEDDING_MODEL config import; updated CLI defaults to use RAG_JSON_PATH and CHROMA_DB_DIR.
Removed Scripts
data_cleaning.py, data_preprocessing_for_RAG.py
Deleted standalone data processing modules; functionality merged into data_pipeline.py.
Manifest & Dependencies
.gitignore, requirements.txt, README.md, VERSIONS.md
Updated Python cache/env ignore patterns; pinned all dependencies to minimum versions; refreshed documentation for new module structure and workflow; released as v0.0.5 (2026-04-15).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 Hop-hop, the files did split apart,
From monolith to pieces smart,
Config speaks, retrievers dance,
Chains of thought and RAG's advance,
Modular magic, clean and keen,
The tidy warren I've ever seen! 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '코드 리펙토링 및 v0.0.5 업데이트' (Code refactoring and v0.0.5 update) accurately reflects the main changes in the PR, which consolidates multiple files into modular components, extracts configuration, and documents the v0.0.5 release.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v0.0.5

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ingest_vector_db.py (1)

12-13: ⚠️ Potential issue | 🟠 Major

Validate batch_size as a positive integer.

If batch_size is 0 (or negative), Line 33 fails at runtime. Guard this in both argument parsing and ingest(...).

🛡️ Suggested fix
 def ingest(json_path: str, persist_directory: str, batch_size: int = 100) -> None:
+    if batch_size <= 0:
+        raise ValueError("batch_size must be a positive integer")
     source_path = Path(json_path)
     parser.add_argument(
         "--batch-size",
         type=int,
         default=100,
         help="한 번에 처리할 문서의 개수 (기본값: 100)",
     )
+    if args.batch_size <= 0:
+        raise ValueError("--batch-size는 1 이상의 정수여야 합니다.")

Also applies to: 33-33, 56-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ingest_vector_db.py` around lines 12 - 13, The ingest function currently
assumes batch_size > 0 causing a runtime failure; add validation in
ingest(json_path: str, persist_directory: str, batch_size: int = 100) to check
that batch_size is an int > 0 and raise a clear ValueError if not, and also
validate the CLI parsing/argparse path (where batch_size is parsed) by supplying
a positive-int type or post-parse check to reject 0 or negative values before
calling ingest; reference the ingest function and the argument parsing code that
constructs batch_size to locate and apply these guards.
🧹 Nitpick comments (1)
app.py (1)

88-93: Narrow the fallback exception scope in retrieval path.

Line 90 catches all exceptions, which can hide non-retrieval bugs and silently degrade behavior. Catch expected retrieval failures and only fallback when base_retriever exists.

♻️ Suggested refinement
-            try:
-                retrieved_docs = st.session_state.rerank_retriever.invoke(query)
-            except Exception:
-                st.warning("⚠️ Reranking 중 오류가 발생하여 기본 검색 결과를 사용합니다.")
-                retrieved_docs = st.session_state.rerank_retriever.base_retriever.invoke(query)[:5]
+            try:
+                retrieved_docs = st.session_state.rerank_retriever.invoke(query)
+            except Exception as e:
+                base = getattr(st.session_state.rerank_retriever, "base_retriever", None)
+                if base is None:
+                    raise
+                st.warning(f"⚠️ Reranking 실패로 기본 검색 결과를 사용합니다. ({e})")
+                retrieved_docs = base.invoke(query)[:5]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app.py` around lines 88 - 93, The try/except around
st.session_state.rerank_retriever.invoke(query) is too broad; change it to only
catch the specific retrieval-related exceptions thrown by your retriever (e.g.,
RetrieverError, RetrievalException, or the library-specific exception type) and
before falling back ensure st.session_state.rerank_retriever.base_retriever is
present; if no base_retriever exists re-raise the exception so unrelated bugs
aren't swallowed, and keep the existing user warning/logging when the fallback
is used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app.py`:
- Around line 123-124: The chat history is always storing final_answer_with_refs
which can be empty even when final_think (or another rendered assistant text)
was shown; change the storage logic to persist the actual rendered assistant
text by computing a rendered_text variable that uses final_answer_with_refs if
non-empty, otherwise falls back to final_think (or final_answer) so the same
text the UI displayed is saved; update any code that appends to history to use
this rendered_text and ensure refs_str/final_answer concatenation still occurs
when refs_str is present.

In `@chain.py`:
- Around line 49-55: The ChatOllama instantiation for llm lacks a request
timeout, which can cause indefinite blocking if the Ollama server is slow;
update the ChatOllama call (the llm variable) to include
client_kwargs={"timeout": 60.0} alongside the existing parameters (model,
temperature, top_p, repeat_penalty, stop) so the underlying HTTP client times
out after 60 seconds.

In `@data_pipeline.py`:
- Line 54: Remove the unnecessary f-string prefix on the print call that has no
interpolations; locate the print statement (print(f"정제 작업이 완료되었습니다!")) in
data_pipeline.py and change it to a normal string literal (remove the leading f)
so lint (F541) is not triggered.
- Around line 50-58: The percent calculation uses initial_count and will raise
on empty input; update the print block after computing final_count and
deleted_count to guard against initial_count == 0 by computing the percent only
when initial_count > 0 (e.g., set percent = deleted_count/initial_count*100 or
"N/A"/0.0 otherwise) and use that percent variable in the f-string that prints
the deletion rate; change references to initial_count, deleted_count, and
output_file in the print statements so they display a safe, explicit value when
the CSV is empty.
- Around line 92-97: The code currently directly accesses row['reference_urls']
when building the dict in processed_data.append which causes a hard failure if
the column is missing; change this to treat reference_urls as optional by
reading it with row.get('reference_urls') (or checking 'reference_urls' in row)
and defaulting to None or an empty list, and only include the key or set the
default value in the dict construction so the pipeline won't break (note
retriever.py lines 16-31 do not consume this field).

In `@README.md`:
- Line 61: Update the awkward Korean sentence in Step 2.1 (the line currently
reading "1,2 뛰어 넘어도 됨.") to the clearer phrasing "1, 2단계를 건너뛰어도 됩니다." so wording
and spacing are correct; edit the README.md Step 2.1 section to replace the old
sentence with the new, properly spaced and polite form.

In `@requirements.txt`:
- Around line 3-15: Update the minimum versions in requirements.txt to exclude
known vulnerable releases: set langchain-core>=0.3.81,
langchain-community>=0.3.27, and bump torch to a recent secure minimum (e.g.,
torch>=2.1.0 or whatever your project CI has validated); keep the other
langchain packages aligned as needed. After updating the three package lower
bounds (langchain-core, langchain-community, torch), add or update a
constraints/lock file (pip-compile/poetry lock/pip freeze) to fully pin
transitive dependencies so vulnerable versions cannot be resolved.
- Line 16: The requirements entry for bitsandbytes must be made
optional/platform-scoped to avoid failing installs on macOS/Apple Silicon;
update the requirements.txt entry for the dependency (the symbol bitsandbytes)
to include a PEP 508 environment marker such as: bitsandbytes>=0.43;
sys_platform == "linux" and platform_machine == "x86_64" (or move it into an
extras group like [extras] or requirements-dev with a similarly guarded marker),
and update the README installation section to instruct Linux x86_64 users to
install the bitsandbytes extra or platform-specific requirements when CUDA is
available.

In `@retriever.py`:
- Around line 34-39: The load_vectorstore function currently lets
OllamaEmbeddings and Chroma initialization errors propagate uncaught; wrap the
creation of OllamaEmbeddings(model=EMBEDDING_MODEL) and the Chroma(...) call in
a try/except that catches Exception, then raise a clear RuntimeError (or custom
exception) that includes context ("failed to initialize embeddings/vectorstore")
and the original exception (e.g., via str(e)) so the app startup handler can
surface actionable feedback; reference the load_vectorstore function and the
OllamaEmbeddings and Chroma initializations when making the changes.
- Around line 63-66: The current non-Qwen branch incorrectly nests the dtype
under an extra "model_kwargs" key so HuggingFaceCrossEncoder doesn't receive
torch_dtype; update the assignment to set model_kwargs as a flat dict
(torch_dtype: torch.float16) before constructing
HuggingFaceCrossEncoder(model_name=RERANKER_MODEL, model_kwargs=model_kwargs)
and keep constructing re_ranker via CrossEncoderReranker(model=model,
top_n=RERANKER_TOP_N).

In `@text_utils.py`:
- Around line 20-27: The current tag parsing is case-sensitive and brittle;
replace the string-split logic with a case-insensitive regex search (using
re.IGNORECASE and re.DOTALL) to extract the content between opening and closing
<think> tags into think_content and, if a closing tag is present, capture the
trailing text, pass it through clean_response for answer_content; also handle
the case where only an opening tag exists by capturing everything after the
opening tag as think_content and returning an empty answer_content, referencing
the variables/text handling in the existing block and the clean_response call to
preserve behavior.

In `@VERSIONS.md`:
- Line 16: The release note line "`**의존성 버전 고정**: `requirements.txt`에 최소 버전 제약을
추가하여 환경 재현성 향상.`" is inaccurate because the file uses minimum bounds rather than
strict pinning; update that heading and/or text in VERSIONS.md to read "`**최소 버전
제약**`" (or similar wording) and adjust the description to state that
`requirements.txt` uses minimum version constraints rather than exact pins so
the wording accurately reflects the dependency policy.

---

Outside diff comments:
In `@ingest_vector_db.py`:
- Around line 12-13: The ingest function currently assumes batch_size > 0
causing a runtime failure; add validation in ingest(json_path: str,
persist_directory: str, batch_size: int = 100) to check that batch_size is an
int > 0 and raise a clear ValueError if not, and also validate the CLI
parsing/argparse path (where batch_size is parsed) by supplying a positive-int
type or post-parse check to reject 0 or negative values before calling ingest;
reference the ingest function and the argument parsing code that constructs
batch_size to locate and apply these guards.

---

Nitpick comments:
In `@app.py`:
- Around line 88-93: The try/except around
st.session_state.rerank_retriever.invoke(query) is too broad; change it to only
catch the specific retrieval-related exceptions thrown by your retriever (e.g.,
RetrieverError, RetrievalException, or the library-specific exception type) and
before falling back ensure st.session_state.rerank_retriever.base_retriever is
present; if no base_retriever exists re-raise the exception so unrelated bugs
aren't swallowed, and keep the existing user warning/logging when the fallback
is used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d975d056-e978-4b85-97bf-9c58eb2df229

📥 Commits

Reviewing files that changed from the base of the PR and between c9ee1cc and 55db04b.

📒 Files selected for processing (13)
  • .gitignore
  • README.md
  • VERSIONS.md
  • app.py
  • chain.py
  • config.py
  • data_cleaning.py
  • data_pipeline.py
  • data_preprocessing_for_RAG.py
  • ingest_vector_db.py
  • requirements.txt
  • retriever.py
  • text_utils.py
💤 Files with no reviewable changes (2)
  • data_cleaning.py
  • data_preprocessing_for_RAG.py

Comment thread app.py
Comment on lines 123 to 124
if refs_str:
final_answer_with_refs = final_answer + f"\n\n🔗 **참고 자료:**\n{refs_str}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Persist the same assistant text that is rendered.

At Line 131, final_think can be rendered, but Line 136 always stores final_answer_with_refs. If final_answer_with_refs is empty, chat history can lose the shown assistant response on rerender.

🐛 Suggested fix
-        if refs_str:
-            final_answer_with_refs = final_answer + f"\n\n🔗 **참고 자료:**\n{refs_str}"
-        else:
-            final_answer_with_refs = final_answer
+        if final_answer and refs_str:
+            final_answer_with_refs = final_answer + f"\n\n🔗 **참고 자료:**\n{refs_str}"
+        else:
+            final_answer_with_refs = final_answer or ""

-        if final_answer_with_refs:
-            answer_placeholder.markdown(final_answer_with_refs)
-        else:
-            answer_placeholder.markdown(final_think)
+        display_content = final_answer_with_refs or final_think or ""
+        answer_placeholder.markdown(display_content)

-        st.session_state.messages.append({"role": "assistant", "content": final_answer_with_refs})
+        st.session_state.messages.append({"role": "assistant", "content": display_content})

Also applies to: 128-136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app.py` around lines 123 - 124, The chat history is always storing
final_answer_with_refs which can be empty even when final_think (or another
rendered assistant text) was shown; change the storage logic to persist the
actual rendered assistant text by computing a rendered_text variable that uses
final_answer_with_refs if non-empty, otherwise falls back to final_think (or
final_answer) so the same text the UI displayed is saved; update any code that
appends to history to use this rendered_text and ensure refs_str/final_answer
concatenation still occurs when refs_str is present.

Comment thread chain.py
Comment on lines +49 to +55
llm = ChatOllama(
model=LLM_MODEL,
temperature=LLM_TEMPERATURE,
top_p=LLM_TOP_P,
repeat_penalty=LLM_REPEAT_PENALTY,
stop=LLM_STOP_TOKENS,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For langchain-ollama >=0.2, what is the correct timeout parameter for ChatOllama initialization (e.g., timeoutvsrequest_timeout)?

💡 Result:

For langchain-ollama >=0.2, there is no direct timeout or request_timeout parameter for ChatOllama initialization. Official LangChain documentation and reference pages list parameters like model, temperature, num_predict, reasoning, base_url, keep_alive, but no timeout-specific init param. ChatOllama uses the ollama Python client internally, which supports a timeout parameter in Client(host=..., timeout=...). To set timeout, use client_kwargs={"timeout": 60.0} or similar when initializing ChatOllama, as it accepts client_kwargs and async_client_kwargs for passing kwargs to the underlying ollama.Client.

Citations:


🏁 Script executed:

find . -name "chain.py" -type f | head -5

Repository: Gyuho-Han/chatdds

Length of output: 71


🏁 Script executed:

# Look for dependency files to check langchain-ollama version
find . -type f \( -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" -o -name "Pipfile" \) | head -10

Repository: Gyuho-Han/chatdds

Length of output: 79


🏁 Script executed:

# Search for ChatOllama usage in the codebase
rg "ChatOllama" --type py -B 2 -A 8

Repository: Gyuho-Han/chatdds

Length of output: 931


🏁 Script executed:

cat requirements.txt

Repository: Gyuho-Han/chatdds

Length of output: 378


🏁 Script executed:

cat -n chain.py | sed -n '40,60p'

Repository: Gyuho-Han/chatdds

Length of output: 575


🏁 Script executed:

# Check for any existing client_kwargs usage in the codebase
rg "client_kwargs" --type py

Repository: Gyuho-Han/chatdds

Length of output: 43


Add an explicit timeout to prevent indefinite blocking on Ollama requests.

ChatOllama is instantiated without a timeout, so Streamlit requests can hang indefinitely if the local Ollama server becomes slow or unavailable. Use client_kwargs={"timeout": 60.0} to set a timeout on the underlying HTTP client:

llm = ChatOllama(
    model=LLM_MODEL,
    temperature=LLM_TEMPERATURE,
    top_p=LLM_TOP_P,
    repeat_penalty=LLM_REPEAT_PENALTY,
    stop=LLM_STOP_TOKENS,
    client_kwargs={"timeout": 60.0},  # Add this
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chain.py` around lines 49 - 55, The ChatOllama instantiation for llm lacks a
request timeout, which can cause indefinite blocking if the Ollama server is
slow; update the ChatOllama call (the llm variable) to include
client_kwargs={"timeout": 60.0} alongside the existing parameters (model,
temperature, top_p, repeat_penalty, stop) so the underlying HTTP client times
out after 60 seconds.

Comment thread data_pipeline.py
Comment on lines +50 to +58
final_count = len(df)
deleted_count = initial_count - final_count

print("-" * 40)
print(f"정제 작업이 완료되었습니다!")
print(f"기존 행 개수: {initial_count:,}개")
print(f"정제 후 행 개수: {final_count:,}개")
print(f"삭제된 행 개수: {deleted_count:,}개 (약 {deleted_count/initial_count*100:.1f}% 감소)")
print(f"최종 파일 저장 경로: {os.path.abspath(output_file)}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard deletion-rate calculation for empty inputs.

Line 57 divides by initial_count without checking zero, which crashes on empty CSVs.

🐛 Suggested fix
     final_count = len(df)
     deleted_count = initial_count - final_count
+    deleted_ratio = (deleted_count / initial_count * 100) if initial_count else 0.0
@@
-    print(f"삭제된 행 개수: {deleted_count:,}개 (약 {deleted_count/initial_count*100:.1f}% 감소)")
+    print(f"삭제된 행 개수: {deleted_count:,}개 (약 {deleted_ratio:.1f}% 감소)")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
final_count = len(df)
deleted_count = initial_count - final_count
print("-" * 40)
print(f"정제 작업이 완료되었습니다!")
print(f"기존 행 개수: {initial_count:,}개")
print(f"정제 후 행 개수: {final_count:,}개")
print(f"삭제된 행 개수: {deleted_count:,}개 (약 {deleted_count/initial_count*100:.1f}% 감소)")
print(f"최종 파일 저장 경로: {os.path.abspath(output_file)}")
final_count = len(df)
deleted_count = initial_count - final_count
deleted_ratio = (deleted_count / initial_count * 100) if initial_count else 0.0
print("-" * 40)
print(f"정제 작업이 완료되었습니다!")
print(f"기존 행 개수: {initial_count:,}개")
print(f"정제 후 행 개수: {final_count:,}개")
print(f"삭제된 행 개수: {deleted_count:,}개 (약 {deleted_ratio:.1f}% 감소)")
print(f"최종 파일 저장 경로: {os.path.abspath(output_file)}")
🧰 Tools
🪛 Ruff (0.15.10)

[error] 54-54: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data_pipeline.py` around lines 50 - 58, The percent calculation uses
initial_count and will raise on empty input; update the print block after
computing final_count and deleted_count to guard against initial_count == 0 by
computing the percent only when initial_count > 0 (e.g., set percent =
deleted_count/initial_count*100 or "N/A"/0.0 otherwise) and use that percent
variable in the f-string that prints the deletion rate; change references to
initial_count, deleted_count, and output_file in the print statements so they
display a safe, explicit value when the CSV is empty.

Comment thread data_pipeline.py
deleted_count = initial_count - final_count

print("-" * 40)
print(f"정제 작업이 완료되었습니다!")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefix.

Line 54 has an f-string without placeholders (F541), which can fail lint if Ruff is enforced.

🧹 Suggested fix
-    print(f"정제 작업이 완료되었습니다!")
+    print("정제 작업이 완료되었습니다!")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"정제 작업이 완료되었습니다!")
print("정제 작업이 완료되었습니다!")
🧰 Tools
🪛 Ruff (0.15.10)

[error] 54-54: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data_pipeline.py` at line 54, Remove the unnecessary f-string prefix on the
print call that has no interpolations; locate the print statement (print(f"정제
작업이 완료되었습니다!")) in data_pipeline.py and change it to a normal string literal
(remove the leading f) so lint (F541) is not triggered.

Comment thread data_pipeline.py
Comment on lines +92 to +97
processed_data.append({
"chunk_id": f"doc_{index}_chunk_{i}",
"title": row['title'],
"url": row['url'],
"reference_urls": row['reference_urls'],
"content_chunk": chunk,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make reference_urls optional to avoid brittle schema coupling.

Line 96 directly accesses row['reference_urls']; missing column causes a hard failure. This field is currently not consumed by retriever.py (retriever.py:16-31), so this should be optional.

♻️ Suggested fix
         for i, chunk in enumerate(chunks):
+            reference_urls = row.get("reference_urls", "")
+            if pd.isna(reference_urls):
+                reference_urls = ""
             processed_data.append({
                 "chunk_id": f"doc_{index}_chunk_{i}",
                 "title": row['title'],
                 "url": row['url'],
-                "reference_urls": row['reference_urls'],
+                "reference_urls": reference_urls,
                 "content_chunk": chunk,
             })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data_pipeline.py` around lines 92 - 97, The code currently directly accesses
row['reference_urls'] when building the dict in processed_data.append which
causes a hard failure if the column is missing; change this to treat
reference_urls as optional by reading it with row.get('reference_urls') (or
checking 'reference_urls' in row) and defaulting to None or an empty list, and
only include the key or set the default value in the dict construction so the
pipeline won't break (note retriever.py lines 16-31 do not consume this field).

Comment thread requirements.txt
sentence-transformers>=3.0
transformers>=4.40
torch>=2.0
bitsandbytes>=0.43
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Is bitsandbytes (>=0.43) officially supported for pip installation on macOS Apple Silicon without CUDA?

💡 Result:

No, bitsandbytes (>=0.43) is not officially supported for pip installation on macOS Apple Silicon without CUDA. While CPU support for macOS arm64 is available via pip wheels starting from recent versions like 0.49.0 (with pre-release wheels explicitly for macOS 14+ arm64), full quantization features (e.g., 4-bit QLoRA, 8-bit optimizers) are limited or not supported on Apple Silicon. Official docs list experimental CPU support for LLM.int8 and QLoRA 4-bit (✅), but 8-bit optimizers (❌) and MPS/Metal support is partial/slow (🐢 for some features, ❌ for others). Prior to 0.49.0, no PyPI wheels were available for >=0.43 on macOS arm64, leading to installation failures. The library is primarily designed for CUDA/NVIDIA GPUs, and Apple Silicon support remains experimental without CUDA.

Citations:


🏁 Script executed:

# Check if README exists and read relevant sections mentioning platform/installation
find . -maxdepth 2 -iname "readme*" -type f

Repository: Gyuho-Han/chatdds

Length of output: 72


🏁 Script executed:

# Read the full requirements.txt to understand the current dependency structure
cat -n requirements.txt

Repository: Gyuho-Han/chatdds

Length of output: 497


🏁 Script executed:

# Check for setup.py, pyproject.toml, setup.cfg to understand optional dependency management
ls -la | grep -E "(setup\.py|pyproject\.toml|setup\.cfg)"

Repository: Gyuho-Han/chatdds

Length of output: 43


🏁 Script executed:

# Check for setup.py or pyproject.toml anywhere in the project
find . -type f -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg"

Repository: Gyuho-Han/chatdds

Length of output: 43


🏁 Script executed:

# Read the README to check for platform-specific installation documentation
head -100 README.md

Repository: Gyuho-Han/chatdds

Length of output: 2567


🏁 Script executed:

# Search for any existing platform-scoped dependencies in the requirements file
rg "platform_system|platform_machine|sys_platform" requirements.txt

Repository: Gyuho-Han/chatdds

Length of output: 43


Make bitsandbytes optional/platform-scoped.

This dependency lacks official support for macOS Apple Silicon without CUDA (no PyPI wheels for >=0.43 until v0.49.0). Your README claims optimization for Apple Silicon/CPU environments, but users attempting installation on macOS will encounter failure. Restrict to Linux x86_64 where CUDA support is available.

Suggested change
-bitsandbytes>=0.43
+bitsandbytes>=0.43; platform_system == "Linux" and platform_machine == "x86_64"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
bitsandbytes>=0.43
bitsandbytes>=0.43; platform_system == "Linux" and platform_machine == "x86_64"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` at line 16, The requirements entry for bitsandbytes must be
made optional/platform-scoped to avoid failing installs on macOS/Apple Silicon;
update the requirements.txt entry for the dependency (the symbol bitsandbytes)
to include a PEP 508 environment marker such as: bitsandbytes>=0.43;
sys_platform == "linux" and platform_machine == "x86_64" (or move it into an
extras group like [extras] or requirements-dev with a similarly guarded marker),
and update the README installation section to instruct Linux x86_64 users to
install the bitsandbytes extra or platform-specific requirements when CUDA is
available.

Comment thread retriever.py
Comment on lines +34 to +39
def load_vectorstore(persist_directory=CHROMA_DB_DIR):
"""Chroma 벡터DB를 로드합니다."""
if not Path(persist_directory).exists():
raise FileNotFoundError("벡터DB가 아직 생성되지 않았습니다. 데이터를 먼저 임베딩하세요.")
embed = OllamaEmbeddings(model=EMBEDDING_MODEL)
return Chroma(persist_directory=persist_directory, embedding_function=embed)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify caller-side exception scope for startup loading
rg -n -C3 'except\s+FileNotFoundError' --type=py
rg -n -C3 'load_vectorstore\(|get_vectorstore\(' --type=py

Repository: Gyuho-Han/chatdds

Length of output: 1019


Handle vectorstore initialization failures explicitly.

OllamaEmbeddings and Chroma initialization errors bubble up uncaught from this function. The app's startup handler (app.py lines 34-39) only catches FileNotFoundError, so failures such as Ollama service unavailability or embedding model issues cause unhandled exceptions instead of actionable feedback.

Wrap both initialization calls in try-except:

Suggested fix
 def load_vectorstore(persist_directory=CHROMA_DB_DIR):
     """Chroma 벡터DB를 로드합니다."""
     if not Path(persist_directory).exists():
         raise FileNotFoundError("벡터DB가 아직 생성되지 않았습니다. 데이터를 먼저 임베딩하세요.")
-    embed = OllamaEmbeddings(model=EMBEDDING_MODEL)
-    return Chroma(persist_directory=persist_directory, embedding_function=embed)
+    try:
+        embed = OllamaEmbeddings(model=EMBEDDING_MODEL)
+        return Chroma(persist_directory=persist_directory, embedding_function=embed)
+    except Exception as e:
+        raise RuntimeError(f"벡터DB 로드 실패: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@retriever.py` around lines 34 - 39, The load_vectorstore function currently
lets OllamaEmbeddings and Chroma initialization errors propagate uncaught; wrap
the creation of OllamaEmbeddings(model=EMBEDDING_MODEL) and the Chroma(...) call
in a try/except that catches Exception, then raise a clear RuntimeError (or
custom exception) that includes context ("failed to initialize
embeddings/vectorstore") and the original exception (e.g., via str(e)) so the
app startup handler can surface actionable feedback; reference the
load_vectorstore function and the OllamaEmbeddings and Chroma initializations
when making the changes.

Comment thread retriever.py
Comment on lines +63 to +66
model_kwargs = {"model_kwargs": {"torch_dtype": torch.float16}}

model = HuggingFaceCrossEncoder(model_name=RERANKER_MODEL, model_kwargs=model_kwargs)
re_ranker = CrossEncoderReranker(model=model, top_n=RERANKER_TOP_N)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In langchain_community.cross_encoders.HuggingFaceCrossEncoder, what exact structure should model_kwargshave for sentence-transformers CrossEncoder (especially for settingtorch_dtype)?

💡 Result:

In langchain_community.cross_encoders.HuggingFaceCrossEncoder, model_kwargs is a dictionary of keyword arguments passed directly to the sentence_transformers.CrossEncoder constructor. The exact structure is a dict[str, Any], typically including keys like 'device' (e.g., 'cpu', 'cuda'), and for setting torch_dtype, use 'torch_dtype' with values such as 'float16', 'bfloat16', 'float32', or 'auto'. Example: model_kwargs = { 'device': 'cuda', 'torch_dtype': 'float16' } hf = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", model_kwargs=model_kwargs) This passes model_kwargs to sentence_transformers.CrossEncoder(model_name, self.model_kwargs), and Sentence Transformers forwards model_kwargs (including torch_dtype) to the underlying Hugging Face Transformers AutoModel.from_pretrained. Other common keys: 'revision', 'trust_remote_code', 'attn_implementation' (e.g., 'flash_attention_2'), 'device_map' (e.g., 'auto').

Citations:


🏁 Script executed:

find . -name "retriever.py" -type f

Repository: Gyuho-Han/chatdds

Length of output: 75


🏁 Script executed:

cat -n retriever.py | sed -n '50,80p'

Repository: Gyuho-Han/chatdds

Length of output: 1417


Remove the extra model_kwargs nesting in the non-Qwen branch.

At line 63, model_kwargs should be a flat dictionary passed directly to HuggingFaceCrossEncoder, not nested under another model_kwargs key. Change:

model_kwargs = {"model_kwargs": {"torch_dtype": torch.float16}}

to:

model_kwargs = {"torch_dtype": torch.float16}

The extra nesting causes the torch_dtype setting to not be applied, as the CrossEncoder constructor won't recognize the outer model_kwargs key.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@retriever.py` around lines 63 - 66, The current non-Qwen branch incorrectly
nests the dtype under an extra "model_kwargs" key so HuggingFaceCrossEncoder
doesn't receive torch_dtype; update the assignment to set model_kwargs as a flat
dict (torch_dtype: torch.float16) before constructing
HuggingFaceCrossEncoder(model_name=RERANKER_MODEL, model_kwargs=model_kwargs)
and keep constructing re_ranker via CrossEncoderReranker(model=model,
top_n=RERANKER_TOP_N).

Comment thread text_utils.py
Comment on lines +20 to +27
if "<think>" in text and "</think>" in text:
parts = text.split("</think>", 1)
think_content = parts[0].split("<think>")[-1].strip()
answer_content = clean_response(parts[1])
return think_content, answer_content
elif "<think>" in text:
think_content = text.split("<think>")[-1].strip()
return think_content, ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

<think> parsing is too strict and can miss valid tag variants.

Current split logic is case-sensitive and only matches exact <think>/</think>. Use a case-insensitive regex to make extraction robust.

Suggested fix
 def extract_think_and_answer(text: str):
     """문자열에서 <think> 부분과 실제 답변 부분을 완벽하게 분리합니다."""
-    if "<think>" in text and "</think>" in text:
-        parts = text.split("</think>", 1)
-        think_content = parts[0].split("<think>")[-1].strip()
-        answer_content = clean_response(parts[1])
-        return think_content, answer_content
-    elif "<think>" in text:
-        think_content = text.split("<think>")[-1].strip()
-        return think_content, ""
+    match = re.search(r"<think\b[^>]*>(.*?)</think>", text, flags=re.IGNORECASE | re.DOTALL)
+    if match:
+        think_content = match.group(1).strip()
+        answer_content = clean_response(text[:match.start()] + text[match.end():])
+        return think_content, answer_content
+    elif re.search(r"<think\b[^>]*>", text, flags=re.IGNORECASE):
+        think_content = re.split(r"<think\b[^>]*>", text, flags=re.IGNORECASE)[-1].strip()
+        return think_content, ""
     else:
         return "", clean_response(text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@text_utils.py` around lines 20 - 27, The current tag parsing is
case-sensitive and brittle; replace the string-split logic with a
case-insensitive regex search (using re.IGNORECASE and re.DOTALL) to extract the
content between opening and closing <think> tags into think_content and, if a
closing tag is present, capture the trailing text, pass it through
clean_response for answer_content; also handle the case where only an opening
tag exists by capturing everything after the opening tag as think_content and
returning an empty answer_content, referencing the variables/text handling in
the existing block and the clean_response call to preserve behavior.

Comment thread VERSIONS.md
- **중복 코드 제거**: `app.py`와 `ingest_vector_db.py`에 중복 존재하던 `load_documents()` 함수를 `retriever.py`로 통합.
- **데이터 파이프라인 통합**: `data_cleaning.py`와 `data_preprocessing_for_RAG.py`를 `data_pipeline.py`로 통합하여 1,2단계를 한 번에 실행.
- **Reranker 모델 자동 전환**: `config.py`에서 `RERANKER_MODEL`만 변경하면 BAAI/Qwen 모델에 맞는 kwargs가 자동 적용.
- **의존성 버전 고정**: `requirements.txt`에 최소 버전 제약을 추가하여 환경 재현성 향상.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Release note wording is inaccurate for dependency policy.

requirements.txt currently uses minimum bounds, not strict pinning. Consider changing “의존성 버전 고정” to “최소 버전 제약”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@VERSIONS.md` at line 16, The release note line "`**의존성 버전 고정**:
`requirements.txt`에 최소 버전 제약을 추가하여 환경 재현성 향상.`" is inaccurate because the file
uses minimum bounds rather than strict pinning; update that heading and/or text
in VERSIONS.md to read "`**최소 버전 제약**`" (or similar wording) and adjust the
description to state that `requirements.txt` uses minimum version constraints
rather than exact pins so the wording accurately reflects the dependency policy.

@Gyuho-Han Gyuho-Han merged commit 9d8267b into main Apr 15, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant