Notebooks #137
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. Powered by ReviewNB.
pgronkievitz left a comment
didn't check notebooks yet
question: why does this require a separate gitignore?
```python
article = Article("")
```
```python
def __init__(self, url_list: Union[List[str], str], depth: int = 1):
```
nitpick: use modern typing syntax
```diff
-def __init__(self, url_list: Union[List[str], str], depth: int = 1):
+def __init__(self, url_list: list[str] | str, depth: int = 1):
```
```python
self.depth = depth

@staticmethod
def newspaper_extractor(html):
```
```python
if text_splitter is None:
    _text_splitter: TextSplitter = SpacyTextSplitter(
        pipeline="pl_core_news_sm",
        chunk_size=chunk,
        chunk_overlap=chunk_overlap,
    )
else:
    _text_splitter = text_splitter
```
suggestion:
```diff
-if text_splitter is None:
-    _text_splitter: TextSplitter = SpacyTextSplitter(
-        pipeline="pl_core_news_sm",
-        chunk_size=chunk,
-        chunk_overlap=chunk_overlap,
-    )
-else:
-    _text_splitter = text_splitter
+_text_splitter = text_splitter or SpacyTextSplitter(
+    pipeline="pl_core_news_sm",
+    chunk_size=chunk,
+    chunk_overlap=chunk_overlap,
+)
```
```python
docs = self.load()
docs = reduce(
    lambda data, method: method(data),
    [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
    docs,
)
```
suggestion:
```diff
-docs = self.load()
-docs = reduce(
-    lambda data, method: method(data),
-    [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
-    docs,
-)
+docs = reduce(
+    lambda data, method: method(data),
+    [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
+    self.load(),
+)
```
```python
loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except Exception as e:
```
issue: use specific exceptions, not generic Exception, as it'll silently swallow exceptions you didn't foresee
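One way to act on this, sketched against the snippet above: catch only the failures you can recover from and log them, letting anything unforeseen propagate. The exception tuple below is an illustrative assumption; each loader raises its own errors, so the real list has to come from testing.

```python
import logging

logger = logging.getLogger(__name__)

loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except (FileNotFoundError, ValueError) as e:
    # Anticipated, recoverable failures: record and skip this file.
    # Anything unexpected still raises and surfaces immediately.
    logger.warning("Skipping %s: %s", file_path, e)
```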
```toml
[tool.poetry.dev-dependencies]
black = "*"
flake8 = "*"
```
```toml
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.dev-dependencies]
```
issue: legacy syntax, don't use it
```toml
black = "*"
flake8 = "*"
```
issue: be more specific about required versions
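A sketch addressing both points at once, using Poetry's dependency-group syntax (Poetry 1.2+) with explicit constraints; the version numbers are illustrative placeholders, not tested pins:

```toml
# Modern replacement for the legacy [tool.poetry.dev-dependencies] table.
[tool.poetry.group.dev.dependencies]
black = "^24.1"   # placeholder constraint
flake8 = "^7.0"   # placeholder constraint
```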
```python
docs.extend(local_loader.load())

for doc in docs:
    requests.post(url, json=doc)
```
NewBlueWizard left a comment
Fixed the code as suggested. Fixing the exceptions and module versions requires updating and testing the current code.
pgronkievitz left a comment
please use black, ruff (with as many options enabled as possible) and pyright with strict typing
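A minimal pyproject sketch of that toolchain; treat it as a starting point rather than a vetted config, since rule selection and strictness still need tuning against the codebase:

```toml
[tool.black]
line-length = 88

[tool.ruff.lint]
# Enable every rule family, then disable specific rules case by case.
select = ["ALL"]

[tool.pyright]
typeCheckingMode = "strict"
```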
```python
"""

article = Article("")
```
nitpick: are you aware this instance will be shared and may cause weird bugs with overwrites between objects? (this object is created once and will be shared between CleanWebLoader instances)
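A sketch of the per-instance fix, assuming `Article` comes from the `newspaper` package as in the PR's imports:

```python
from newspaper import Article


class CleanWebLoader:
    def __init__(self, url_list: list[str] | str, depth: int = 1):
        # One Article per loader: parsing a page can no longer clobber
        # state that another CleanWebLoader instance is still reading.
        self.article = Article("")
        self.url_list = url_list
        self.depth = depth
```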
```python
@staticmethod
def newspaper_extractor(html: str):
    """
    Extracts and cleans text content from HTML using the 'newspaper' library.

    :param html: HTML content to be processed.
    :return: Cleaned and concatenated text extracted from the HTML.
    """
    CleanWebLoader.article.set_html(html)
    CleanWebLoader.article.parse()
    return " ".join(CleanWebLoader.article.text.split())

@staticmethod
def ds_converter(docs: list[Document]):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' key.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs: list[Document]):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than 300 characters
        or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = set()
    clear_docs = []
    for doc in docs:
        if "title" not in doc.metadata.keys():
            junk_docs.append(doc)
        elif doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
        else:
            pass
    return clear_docs
```
issue: those should not be static; add a self argument and use self.article.* instead of CleanWebLoader.article.*
ds_converter and junk_remover should be separate from this class, as they've got nothing to do with it.
also - TYPING, please check with pyright set to strict
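Taken together, the refactor could look like this sketch: the helpers move to module level, `newspaper_extractor` becomes an instance method, and the annotations are strict-typing friendly. The `Document` import path is an assumption based on the langchain usage elsewhere in the PR.

```python
from newspaper import Article
from langchain.docstore.document import Document  # import path assumed


def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
    # Module-level helper: it never touches loader state.
    return [{"text": doc.page_content} for doc in docs]


def junk_remover(docs: list[Document]) -> list[Document]:
    # Module-level helper mirroring the PR's filtering logic.
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts: set[str] = set()
    clear_docs: list[Document] = []
    for doc in docs:
        if "title" not in doc.metadata:
            junk_docs.append(doc)
        elif doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
    return clear_docs


class CleanWebLoader:
    def __init__(self) -> None:
        self.article = Article("")

    def newspaper_extractor(self, html: str) -> str:
        # Instance method: state lives on self, not on the class.
        self.article.set_html(html)
        self.article.parse()
        return " ".join(self.article.text.split())
```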
```diff
@@ -0,0 +1,140 @@
+from newspaper import Article
+from functools import reduce
+from typing import List, Optional
```
nitpick: both are unnecessary - list[type] and type | None work just fine
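Concretely, with PEP 585/604 syntax the typing imports disappear; a sketch, where the second method name is assumed for illustration (wherever `Optional` was actually used):

```python
from langchain.text_splitter import TextSplitter  # import path assumed


class CleanWebLoader:
    # list[str] replaces List[str]; "X | Y" replaces Union[X, Y].
    def __init__(self, url_list: list[str] | str, depth: int = 1): ...

    # "TextSplitter | None" replaces Optional[TextSplitter].
    def load_and_split(self, text_splitter: TextSplitter | None = None): ...
```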
```python
:param chunk_overlap: Overlap size between chunks (default is 80).
:return: List of dictionaries, each representing a document with 'text' key.
"""
_text_splitter: text_splitter or TextSplitter = SpacyTextSplitter(
```
issue: that's some weird-ass typing, wtf
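For reference, the flagged line puts a value expression (`text_splitter or TextSplitter`) in the annotation slot, which no type checker will accept; the earlier suggestion gives what was presumably intended:

```python
# Annotate with the type, supply the default via `or`:
_text_splitter: TextSplitter = text_splitter or SpacyTextSplitter(
    pipeline="pl_core_news_sm",
    chunk_size=chunk,
    chunk_overlap=chunk_overlap,
)
```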
```python
loaders = {
    ".pdf": PyPDFLoader,
    ".json": JSONLoader,
    ".txt": TextLoader,
    ".csv": CSVLoader,
}
```
nitpick: move this to __init__
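A sketch of that move; `self.path` is an assumption about what the constructor stores, and callers would then look up `self.loaders[d_type]` instead of the class attribute:

```python
class LocalDataLoader:
    def __init__(self, path: list[str] | str):
        # Built per instance rather than shared on the class; the loader
        # classes are the langchain ones already imported in this file.
        self.loaders = {
            ".pdf": PyPDFLoader,
            ".json": JSONLoader,
            ".txt": TextLoader,
            ".csv": CSVLoader,
        }
        self.path = path
```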
| ".csv": CSVLoader, | ||
| } | ||
|
|
||
| def __init__(self, path: Union[List[str], str]): |
suggestion:
```diff
-def __init__(self, path: Union[List[str], str]):
+def __init__(self, path: list[str] | str):
```
```python
@staticmethod
def ds_converter(docs):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' and 'url' keys.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than 300 characters
        or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = set()
    clear_docs = []
    for doc in docs:
        if doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
    return clear_docs
```
suggestion: yep, those should be shared; if you want them to be inside of this class - make them a mixin
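A sketch of that mixin; the name `JunkFilterMixin` is made up, and the loaders' other base classes are elided since only the inheritance is the point:

```python
from langchain.docstore.document import Document  # import path assumed


class JunkFilterMixin:
    # Hypothetical home for the helpers both loaders currently duplicate.
    @staticmethod
    def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
        return [{"text": doc.page_content} for doc in docs]

    @staticmethod
    def junk_remover(docs: list[Document]) -> list[Document]:
        junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
        seen_texts: set[str] = set()
        clear_docs: list[Document] = []
        for doc in docs:
            if doc.page_content not in seen_texts and doc not in junk_docs:
                clear_docs.append(doc)
                seen_texts.add(doc.page_content)
        return clear_docs


class CleanWebLoader(JunkFilterMixin): ...


class LocalDataLoader(JunkFilterMixin): ...
```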
```python
def test_ds_converter(clean_web_loader):
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert clean_web_loader.ds_converter(docs) == expected_output
```
issue: tests should be clean functions with no side effects; this will break once example.com changes (or this fixture is unnecessary at all)
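A fixture-free sketch: `ds_converter` touches no instance state, so the test needs neither a loader object nor the network:

```python
from langchain.docstore.document import Document  # import path assumed


def test_ds_converter():
    # No fixture, no HTTP call: the helper is pure, so test it directly.
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert CleanWebLoader.ds_converter(docs) == expected_output
```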