pgcorpus · Trenza1ore · Dec 28, 2023 · Dec 30, 2023 · Dec 30, 2023 · Dec 30, 2023
diff --git a/.gitignore b/.gitignore
@@ -108,4 +108,18 @@ ENV/
 .mypy_cache/
 
 # VScode
-.vscode/
+.vscode/
+
+# Windows dependencies / batch files
+cwRsync*/*
+*.exe
+*.bat
+
+# nltk data directory
+src/nltk_data/**
+
+# Wget temporary directory
+*gutenberg*/
+
+# Jupyter notebooks for processing data
+*.ipynb
diff --git a/README.md b/README.md
@@ -14,6 +14,23 @@ SPGC-2018-07-18 contains the `tokens/` and `counts/` files of all books that wer
 
 For **most other use cases**, however, you probably want the latest, most recent version of the corpus, in which case you should use this repository to **generate the corpus locally** on your computer. In particular, you will need to generate the corpus locally if you need to work with the original full text files in `raw/` and `text/`, since these are not included in the SPGC-2018-07-18 Zenodo dataset.
 
+## Changes in this fork
+- Windows support (you still need to install `wget` and `cwRsync`; cwRsync was tested with version 5.4.1)
+- Patches to the original code:
+  - unwanted garbage in the bookshelves info (probably caused by updates to the Project Gutenberg website)
+  - oversights (bookshelves info was never fetched, missing nltk download, UTF-8 decoding error in the ebook header, etc.)
+  - bugs & typos
+- Parallelised text processing
+- Additional arguments for customisation (see [**Usage**](#usage) section)
+> **Note:**
+> This fork has so far only been tested on Windows, but it should work on other platforms wherever the original code does.
+
+## Todo
+- Better tokenisation rules?
+  - Chinese books are all empty after tokenisation -> use jieba, probably?
+  - Currently, only tokens for which `str.isalpha()` returns `True` are kept
+- Faster method for getting bookshelves info?
+
 
 ## Installation
 :warning: **Python 2.x is not supported** Please make sure your system runs Python 3.x. (https://pythonclock.org/).  
@@ -43,14 +60,81 @@ python get_data.py
 This will download a copy of all UTF-8 books in PG and will create a csv file with metadata (e.g. author, title, year, ...).
 
 Notice that if you already have some of the data, the program will only download those you are missing (we use `rsync` for this). It is hence easy to update the dataset periodically to keep it up-to-date by just running `get_data.py`.
-
+> For Windows users, see the [**Usage**](#usage) section
 
 ## Processing the data
 To process all the data in the `raw/` directory, run
 ```bash
 python process_data.py
 ```
 This will fill in the `text/`, `tokens/` and `counts/` folders.
+> To avoid losing ebooks that are actually UTF-8 but were mistakenly removed by the original code, see the [**Usage**](#usage) section
 
+## Usage
+**Recommended usage for `get_data.py` (Windows users):**
+```bash
+python get_data.py --rsync "cwRsync_5.4.1/rsync"
+```
+(replace `cwRsync_5.4.1/rsync` with the path to your rsync binary; the `.exe` extension is not needed)
 
+**Recommended usage for `process_data.py`:**
+```bash
+python process_data.py --ignore
+```
 
+**How to use `get_data.py` with customisation options:**
+```
+python get_data.py --help
+usage: Update local PG repository.
+
+This script will download all books currently not in your
+local copy of PG and get the latest version of the metadata.
+
+       [-h] [-m MIRROR] [-r RAW] [-M METADATA] [-p PATTERN] [-k] [-owr] [-q] [-c] [--rsync RSYNC] [--procedures PROCEDURES]
+
+options:
+  -h, --help            show this help message and exit
+  -m MIRROR, --mirror MIRROR
+                        Path to the mirror folder that will be updated via rsync.
+  -r RAW, --raw RAW     Path to the raw folder.
+  -M METADATA, --metadata METADATA
+                        Path to the metadata folder.
+  -p PATTERN, --pattern PATTERN
+                        Patterns to get only a subset of books.
+  -k, --keep_rdf        If there is an RDF file in metadata dir, do not overwrite it.
+  -owr, --overwrite_raw
+                        Overwrite files in raw.
+  -q, --quiet           Quiet mode, do not print info, warnings, etc
+  -c, --clean           Clean the mirror directory to remove any empty folders
+  --rsync RSYNC         Specify an alternative rsync command
+  --procedures PROCEDURES
+                        Procedures to go through, defaults to "pdlmbs": [p]ull mirror files; find [d]uplicates; hard [l]ink from mirror to raw;   
+                        get [m]etadata; get [b]ookshelf information; [s]tore bookshelf information
+```
+
+**How to use `process_data.py` with customisation options:**
+```
+python process_data.py --help
+usage: Processing raw texts from Project Gutenberg: i) removing headers,ii) tokenizing, and iii) counting words.
+       [-h] [-r RAW] [-ote OUTPUT_TEXT] [-oto OUTPUT_TOKENS] [-oco OUTPUT_COUNTS] [-p PATTERN] [-q] [-l LOG_FILE] [-c] [--ignore]
+       [--pool {process,thread}]
+
+options:
+  -h, --help            show this help message and exit
+  -r RAW, --raw RAW     Path to the raw-folder
+  -ote OUTPUT_TEXT, --output_text OUTPUT_TEXT
+                        Path to text-output (text_dir)
+  -oto OUTPUT_TOKENS, --output_tokens OUTPUT_TOKENS
+                        Path to tokens-output (tokens_dir)
+  -oco OUTPUT_COUNTS, --output_counts OUTPUT_COUNTS
+                        Path to counts-output (counts_dir)
+  -p PATTERN, --pattern PATTERN
+                        Pattern to specify a subset of books
+  -q, --quiet           Quiet mode, do not print info, warnings, etc
+  -l LOG_FILE, --log_file LOG_FILE
+                        Path to log file
+  -c, --check_empty     Whether to check if existing files are empty
+  --ignore              Whether to ignore UTF-8 decoding errors
+  --pool {process,thread}
+                        Whether to use multi-processing or multi-threading
+```
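Note on `--procedures`: in the `get_data.py` changes, each step is guarded by a membership test such as `'p' in args.procedures`, so the order of the letters does not matter and unknown letters are silently ignored. The following is a minimal sketch of how the letter codes map to steps, with stricter validation added. The helper name `selected_steps` and the `STEPS` table are hypothetical and not part of this fork; the step descriptions are taken from the help text above.

```python
# Hypothetical helper, not part of the fork: maps a --procedures string
# to the steps it enables, in the fixed order get_data.py runs them.
STEPS = {
    "p": "pull mirror files",
    "d": "find duplicates",
    "l": "hard link from mirror to raw",
    "m": "get metadata",
    "b": "get bookshelf information",
    "s": "store bookshelf information",
}


def selected_steps(procedures="pdlmbs"):
    """Return the names of the enabled steps, rejecting unknown letters."""
    unknown = sorted(set(procedures) - set(STEPS))
    if unknown:
        # Unlike the membership checks in get_data.py, fail loudly on typos.
        raise ValueError("unknown procedure letters: %s" % "".join(unknown))
    # Dict insertion order (Python 3.7+) gives the execution order.
    return [name for letter, name in STEPS.items() if letter in procedures]


if __name__ == "__main__":
    # Only refresh the metadata and the bookshelf information.
    print(selected_steps("mb"))
```

For example, `python get_data.py --procedures mbs` would skip the rsync pull, the duplicate search and the hard-linking, and only refresh the metadata and the bookshelves.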
diff --git a/get_data.py b/get_data.py
@@ -5,7 +5,7 @@
 M. Gerlach & F. Font-Clos
 
 """
-from src.utils import populate_raw_from_mirror, list_duplicates_in_mirror
+from src.utils import populate_raw_from_mirror, list_duplicates_in_mirror, remove_empty_dirs, is_win32
 from src.metadataparser import make_df_metadata
 from src.bookshelves import get_bookshelves
 from src.bookshelves import parse_bookshelves
@@ -22,6 +22,7 @@
         "This script will download all books currently not in your\n"
         "local copy of PG and get the latest version of the metadata.\n"
         )
+
     # mirror dir
     parser.add_argument(
         "-m", "--mirror",
@@ -68,16 +69,50 @@
         action="store_true",
         help="Quiet mode, do not print info, warnings, etc"
         )
+
+    # clean argument, to remove empty folders from the mirror directory
+    parser.add_argument(
+        "-c", "--clean",
+        action="store_true",
+        help="Clean the mirror directory to remove any empty folders"
+        )
+
+    # rsync command
+    parser.add_argument(
+        "--rsync",
+        help="Specify an alternative rsync command",
+        default='rsync',
+        type=str)
+
+    # procedures argument, to select which steps to run
+    parser.add_argument(
+        "--procedures",
+        help='''Procedures to go through, defaults to \"pdlmbs\":
+        [p]ull mirror files;
+        find [d]uplicates;
+        hard [l]ink from mirror to raw;
+        get [m]etadata;
+        get [b]ookshelf information;
+        [s]tore bookshelf information''',
+        default='pdlmbs',
+        type=str)
 
     # create the parser
     args = parser.parse_args()
+    mirror_dir, raw_dir, metadata_dir = args.mirror, args.raw, args.metadata
+
+    if is_win32:
+        print("Windows detected, please make sure wget is installed and added to PATH")
+        mirror_dir = mirror_dir.replace('/', '\\')
+        raw_dir = raw_dir.replace('/', '\\')
+        metadata_dir = metadata_dir.replace('/', '\\')
 
     # check that all dirs exist
-    if not os.path.isdir(args.mirror):
+    if not os.path.isdir(mirror_dir):
         raise ValueError("The specified mirror directory does not exist.")
-    if not os.path.isdir(args.raw):
+    if not os.path.isdir(raw_dir):
         raise ValueError("The specified raw directory does not exist.")
-    if not os.path.isdir(args.metadata):
+    if not os.path.isdir(metadata_dir):
         raise ValueError("The specified metadata directory does not exist.")
 
     # Update the .mirror directory via rsync
@@ -99,49 +134,61 @@
     # + 12345 -   0   .  t x                 t 
     #---------------------------------------------
     #        [.-][t0][x.]t[x.]    *         [t8]
-    sp_args = ["rsync", "-am%s" % vstring,
-               "--include", "*/",
-               "--include", "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % args.pattern,
-               "--exclude", "*",
-               "aleph.gutenberg.org::gutenberg", args.mirror
-               ]
-    subprocess.call(sp_args)
+    includes = ["*/", "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % args.pattern]
+    excludes = ["*"]
+    sp_args = ' '.join([args.rsync, "-am%s" % vstring] + ["--include=\"%s\"" % i for i in includes] + \
+        ["--exclude=\"%s\"" % i for i in excludes] + ["aleph.gutenberg.org::gutenberg", mirror_dir])
+
+    # If specified, remove any empty directory that might be caused by bugs or wrong patterns in rsync
+    if args.clean:
+        remove_empty_dirs(mirror_dir, args.quiet)
+
+    # Subprocess call (default arguments):
+    # rsync -amv --include="*/" --include="[p123456789][g0123456789]*[.-][t0][x.]t[x.]*[t8]" --exclude="*" aleph.gutenberg.org::gutenberg data/.mirror/
+    if 'p' in args.procedures:
+        subprocess.call(sp_args, shell=(os.name != "nt"))  # a command string needs a shell on POSIX
 
     # Get rid of duplicates
     # ---------------------
     # A very small portion of books are stored more than
     # once in PG's site. We keep the newest one, see
     # erase_duplicates_in_mirror docstring.
-    dups_list = list_duplicates_in_mirror(mirror_dir=args.mirror)
+    dups_list = list_duplicates_in_mirror(mirror_dir=mirror_dir) if 'd' in args.procedures else []
 
     # Populate raw from mirror
     # ------------------------
     # We populate 'raw_dir' hardlinking to
     # the hidden 'mirror_dir'. Names are standarized
     # into PG12345_raw.txt form.
-    populate_raw_from_mirror(
-        mirror_dir=args.mirror,
-        raw_dir=args.raw,
-        overwrite=args.overwrite_raw,
-        dups_list=dups_list,
-        quiet=args.quiet
+    if 'l' in args.procedures:
+        populate_raw_from_mirror(
+            mirror_dir=mirror_dir,
+            raw_dir=raw_dir,
+            overwrite=args.overwrite_raw,
+            dups_list=dups_list,
+            quiet=args.quiet
         )
 
     # Update metadata
     # ---------------
     # By default, update the whole metadata csv
     # file each time new data is downloaded.
-    make_df_metadata(
-        path_xml=os.path.join(args.metadata, 'rdf-files.tar.bz2'),
-        path_out=os.path.join(args.metadata, 'metadata.csv'),
-        update=args.keep_rdf
+    if 'm' in args.procedures:
+        make_df_metadata(
+            path_xml=os.path.join(metadata_dir, 'rdf-files.tar.bz2'),
+            path_out=os.path.join(metadata_dir, 'metadata.csv'),
+            update=args.keep_rdf
         )
 
     # Bookshelves
     # -----------
     # Get bookshelves and their respective books and titles as dicts
-    BS_dict, BS_num_to_category_str_dict = parse_bookshelves()
-    with open("metadata/bookshelves_ebooks_dict.pkl", 'wb') as fp:
-        pickle.dump(BS_dict, fp)
-    with open("metadata/bookshelves_categories_dict.pkl", 'wb') as fp:
-        pickle.dump(BS_num_to_category_str_dict, fp)
+    if 'b' in args.procedures:
+        get_bookshelves()
+
+    if 's' in args.procedures:
+        BS_dict, BS_num_to_category_str_dict = parse_bookshelves()
+        with open("metadata/bookshelves_ebooks_dict.pkl", 'wb') as fp:
+            pickle.dump(BS_dict, fp)
+        with open("metadata/bookshelves_categories_dict.pkl", 'wb') as fp:
+            pickle.dump(BS_num_to_category_str_dict, fp)
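Note on the rsync call: the hunk above joins the rsync command into one string with embedded quotes. On POSIX, `subprocess.call` treats a bare string as the program name unless `shell=True` is given, so a list of arguments is the more portable form. Below is a minimal sketch of that alternative. The function name `build_rsync_args` is hypothetical, the defaults are taken from the "default arguments" comment in the diff, and the assumption that `vstring` is `"v"` unless quiet mode is on is inferred from that comment. Whether cwRsync on Windows accepts the unquoted patterns has not been tested here.

```python
# Hypothetical sketch, not the fork's code: build the rsync command as a
# list so that no shell quoting is needed on any platform.
def build_rsync_args(rsync="rsync", pattern="*", mirror_dir="data/.mirror/",
                     quiet=False):
    """Return the rsync argument list for mirroring the PG text files."""
    # Assumption: verbose output is enabled unless quiet mode is requested.
    vstring = "" if quiet else "v"
    includes = [
        "*/",
        "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % pattern,
    ]
    excludes = ["*"]
    return (
        [rsync, "-am%s" % vstring]
        + ["--include=%s" % i for i in includes]
        + ["--exclude=%s" % e for e in excludes]
        + ["aleph.gutenberg.org::gutenberg", mirror_dir]
    )


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.call(...) to actually start the transfer.
    print(build_rsync_args())
```

Because each pattern is its own list element, the glob characters reach rsync unchanged and are never expanded by a shell.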