
Fixed typos, an oversight regarding nltk data download, and added support for multi-threading/processing, Windows, ignoring UTF-8 decoding failures, etc.#47

Open
Trenza1ore wants to merge 20 commits into pgcorpus:master from Trenza1ore:master

Conversation


Trenza1ore commented on Dec 28, 2023

Changes are described in commit messages

Trenza1ore changed the title from "Changes regarding usage of rsync" to "Fixed typos, an oversight regarding nltk data download, and Windows support" on Dec 30, 2023
…multi-threading or processing, argument for ignoring UTF-8 decoding failures
Trenza1ore changed the title from "Fixed typos, an oversight regarding nltk data download, and Windows support" to "Fixed typos, an oversight regarding nltk data download, and added support for multi-threading/processing, Windows, ignoring UTF-8 decoding failures, etc." on Dec 30, 2023
…tility function for checking if a file is empty
- used `shutil.rmtree` instead of `os.rmdir`, since the latter only removes empty directories, and added a check that the directory to remove exists (win32)
- removed the `-p` option when calling wget to avoid downloading a large amount of useless data (original code)
- filtered out many garbage entries (non-book weblinks) in the bookshelves dicts and removed the "PG" prefix for values of index.html in bookshelves_ebooks_dict.pkl (original code)
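
The directory-removal and empty-file changes described in this commit message could look roughly like the sketch below. This is not the code from the PR: the helper names `remove_dir_if_exists` and `is_file_empty` are hypothetical, and the only assumption is that the standard-library calls named in the commit message (`shutil.rmtree`, `os.rmdir`) are used as documented.

```python
import os
import shutil
import tempfile


def remove_dir_if_exists(path):
    """Remove a directory tree if it exists (hypothetical helper name).

    os.rmdir raises OSError on a non-empty directory, so shutil.rmtree
    is used instead. The existence check avoids an error when the
    directory is already gone. Returns True if something was removed.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


def is_file_empty(path):
    """Return True if the file at `path` has a size of zero bytes
    (hypothetical helper name)."""
    return os.path.getsize(path) == 0


if __name__ == "__main__":
    # Build a non-empty temporary directory to demonstrate the difference.
    tmp_dir = tempfile.mkdtemp()
    tmp_file = os.path.join(tmp_dir, "empty.txt")
    open(tmp_file, "w").close()

    print(is_file_empty(tmp_file))

    try:
        os.rmdir(tmp_dir)  # fails: the directory still contains a file
    except OSError as exc:
        print("os.rmdir failed:", exc)

    print(remove_dir_if_exists(tmp_dir))  # removes the whole tree
    print(remove_dir_if_exists(tmp_dir))  # nothing left to remove
```

Returning a boolean rather than raising keeps cleanup code on Windows and POSIX identical, at the cost of hiding the case where the path exists but is not a directory.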
… bookshelves info without running time-consuming Wget
