20 commits
2e455e4
Added argument for customization of rsync command, changed rsync argu…
Trenza1ore Dec 28, 2023
e424d95
Added Win32 support and a function for cleaning empty directories
Trenza1ore Dec 30, 2023
eabe981
Fixed typo, exposing nltk_data dir as string variable
Trenza1ore Dec 30, 2023
f759ae0
Used OS-independent path-parsing, changed behavior of process_book fu…
Trenza1ore Dec 30, 2023
5e3ea27
Used OS-independent path-parsing
Trenza1ore Dec 30, 2023
be5cb20
Added Win32 support
Trenza1ore Dec 30, 2023
0e6404d
Added Win32 support and freedom to specify the stages to go through v…
Trenza1ore Dec 30, 2023
91ef9c0
Fixed a typo and an oversight regarding nltk data download, more cust…
Trenza1ore Dec 30, 2023
d986607
modified: .gitignore
Trenza1ore Dec 30, 2023
cb4f58b
Added option to ignore UTF-8 decoding failures for "technically UTF-8…
Trenza1ore Dec 30, 2023
f694095
Added detection for books already processed, argument for specifying …
Trenza1ore Dec 30, 2023
1a06f53
Added in the missing `get_bookshelves()` call in get_data.py and an u…
Trenza1ore Jan 4, 2024
cfef209
Added an option to check if any of the resultant files are empty befo…
Trenza1ore Jan 4, 2024
c183413
Fixed several bugs in bookshelf-related code:
Trenza1ore Jan 5, 2024
93208e6
Further extended the procedures option to allow for parsing/saving of…
Trenza1ore Jan 5, 2024
8c6b530
Corrected help message for procedures option
Trenza1ore Jan 5, 2024
b81b321
Update README.md
Trenza1ore Jan 5, 2024
0349d28
Update README.md
Trenza1ore Jan 5, 2024
a96bf11
Update README.md
Trenza1ore Jan 5, 2024
0764543
Update README.md
Trenza1ore Jan 5, 2024
16 changes: 15 additions & 1 deletion .gitignore
@@ -108,4 +108,18 @@ ENV/
.mypy_cache/

# VScode
.vscode/
.vscode/

# Windows dependencies / batch files
cwRsync*/*
*.exe
*.bat

# nltk data directory
src/nltk_data/**

# Wget temporary directory
*gutenberg*/

# Jupyter notebooks for processing data
*.ipynb
86 changes: 85 additions & 1 deletion README.md
@@ -14,6 +14,23 @@ SPGC-2018-07-18 contains the `tokens/` and `counts/` files of all books that wer

For **most other use cases**, however, you probably want the latest, most recent version of the corpus, in which case you should use this repository to **generate the corpus locally** on your computer. In particular, you will need to generate the corpus locally if you need to work with the original full text files in `raw/` and `text/`, since these are not included in the SPGC-2018-07-18 Zenodo dataset.

## Changes in this fork
- Windows support (you still need to install `wget` and `cwRsync`; tested with cwRsync 5.4.1)
- Patches to the original code:
  - unwanted garbage in bookshelves info (probably due to updates to the Project Gutenberg website)
  - oversights (bookshelves info was never fetched, missing nltk download, UTF-8 decoding error in ebook header, etc.)
  - bugs & typos
- Parallelised text processing
- Additional arguments for customisation (see [**Usage**](#usage) section)
> **Note:**
> this fork has so far only been tested on Windows, but it should work on other platforms wherever the original code works.

## Todo
- Better tokenisation rules?
- Chinese books are all empty after tokenisation -> use jieba, probably?
- Only tokens that return `True` for `str.isalpha()` are kept currently
- Faster method for getting bookshelves info?
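
To illustrate the `str.isalpha()` rule mentioned in the list above, here is a minimal sketch. The helper name `keep_alpha_tokens` is hypothetical, and the whitespace split only stands in for the real tokeniser; the sketch shows which tokens survive such a filter and is not the code used in this repository.

```python
def keep_alpha_tokens(tokens):
    """Keep only tokens made up entirely of alphabetic characters."""
    return [tok for tok in tokens if tok.isalpha()]


# Whitespace splitting is an assumption made for this illustration only.
tokens = "It was the 1st of May , don't you remember ?".split()

# Tokens containing digits, apostrophes or punctuation are dropped.
print(keep_alpha_tokens(tokens))
```

Note that `str.isalpha()` is Unicode-aware, so a token such as `"中文"` passes this filter unchanged.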


## Installation
:warning: **Python 2.x is not supported** Please make sure your system runs Python 3.x. (https://pythonclock.org/).
@@ -43,14 +60,81 @@ python get_data.py
This will download a copy of all UTF-8 books in PG and will create a csv file with metadata (e.g. author, title, year, ...).

Notice that if you already have some of the data, the program will only download those you are missing (we use `rsync` for this). It is hence easy to update the dataset periodically to keep it up-to-date by just running `get_data.py`.

> Windows users: see the [**Usage**](#usage) section.

## Processing the data
To process all the data in the `raw/` directory, run
```bash
python process_data.py
```
This will fill in the `text/`, `tokens/` and `counts/` folders.
> To avoid losing ebooks that are actually valid UTF-8 but were mistakenly discarded by the original code, see the [**Usage**](#usage) section.

## Usage
**Recommended usage for `get_data.py` (Windows users):**
```bash
python get_data.py --rsync "cwRsync_5.4.1/rsync"
```
(Replace `cwRsync_5.4.1/rsync` with the path to your rsync binary; the `.exe` extension is not needed.)
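
As a rough illustration of how a custom rsync binary could be slotted into the download command, the sketch below builds the argument list. The function name `build_rsync_command` and its defaults are assumptions made for this example; the include/exclude patterns and the remote `aleph.gutenberg.org::gutenberg` are taken from `get_data.py`.

```python
def build_rsync_command(rsync="rsync", mirror_dir="data/.mirror/",
                        pattern="*", verbose=True):
    """Return the rsync invocation as a list of arguments (hypothetical helper)."""
    # File-name pattern copied from get_data.py; `pattern` narrows the book subset.
    include = "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % pattern
    return [
        rsync,
        "-am" + ("v" if verbose else ""),
        "--include", "*/",
        "--include", include,
        "--exclude", "*",
        "aleph.gutenberg.org::gutenberg",
        mirror_dir,
    ]


cmd = build_rsync_command(rsync="cwRsync_5.4.1/rsync")
print(" ".join(cmd))
# To actually start the download, one could pass the list to subprocess.call(cmd).
```

Passing the command as a list rather than a single string leaves quoting of the glob patterns to `subprocess`, which tends to behave the same way on Windows and POSIX systems.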

**Recommended usage for `process_data.py`:**
```bash
python process_data.py --ignore
```
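
To show what ignoring UTF-8 decoding errors means in practice, here is a minimal sketch. The helper `decode_book` is hypothetical and only assumes that `--ignore` maps to Python's `errors="ignore"` decoding mode; the actual implementation in `process_data.py` may differ.

```python
def decode_book(raw_bytes, ignore_errors=False):
    """Decode raw book bytes as UTF-8, optionally dropping undecodable bytes."""
    errors = "ignore" if ignore_errors else "strict"
    return raw_bytes.decode("utf-8", errors=errors)


# A header that is valid UTF-8 except for one stray byte (0xff).
sample = b"Project Gutenberg \xff eBook"

try:
    decode_book(sample)
except UnicodeDecodeError:
    print("strict decoding failed")

# With errors ignored, the stray byte is dropped and the rest is kept.
print(decode_book(sample, ignore_errors=True))
```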

**How to use `get_data.py` with customisation options:**
```
python get_data.py --help
usage: Update local PG repository.

This script will download all books currently not in your
local copy of PG and get the latest version of the metadata.

[-h] [-m MIRROR] [-r RAW] [-M METADATA] [-p PATTERN] [-k] [-owr] [-q] [-c] [--rsync RSYNC] [--procedures PROCEDURES]

options:
-h, --help show this help message and exit
-m MIRROR, --mirror MIRROR
Path to the mirror folder that will be updated via rsync.
-r RAW, --raw RAW Path to the raw folder.
-M METADATA, --metadata METADATA
Path to the metadata folder.
-p PATTERN, --pattern PATTERN
Patterns to get only a subset of books.
-k, --keep_rdf If there is an RDF file in metadata dir, do not overwrite it.
-owr, --overwrite_raw
Overwrite files in raw.
-q, --quiet Quiet mode, do not print info, warnings, etc
-c, --clean Clean the mirror directory to remove any empty folders
--rsync RSYNC Specify an alternative rsync command
--procedures PROCEDURES
Procedures to go through, defaults to "pdlmbs": [p]ull mirror files; find [d]uplicates; hard [l]ink from mirror to raw;
get [m]etadata; get [b]ookshelf information; [s]tore bookshelf information
```
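
The `--procedures` string is checked one letter at a time (for example `'p' in args.procedures` in `get_data.py`). The sketch below mirrors that idea; the `STAGES` table and the `selected_stages` helper are hypothetical names used only for illustration.

```python
# Letter-to-stage mapping, as listed in the help text above.
STAGES = {
    "p": "pull mirror files",
    "d": "find duplicates",
    "l": "hard link from mirror to raw",
    "m": "get metadata",
    "b": "get bookshelf information",
    "s": "store bookshelf information",
}


def selected_stages(procedures="pdlmbs"):
    """Return the stage descriptions enabled by a procedures string."""
    return [name for letter, name in STAGES.items() if letter in procedures]


print(selected_stages())       # every stage, in pipeline order
print(selected_stages("lm"))   # only linking and metadata
```

For instance, `python get_data.py --procedures lm` would skip the rsync pull and only refresh `raw/` and the metadata from an existing mirror.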

**How to use `process_data.py` with customisation options:**
```
python process_data.py --help
usage: Processing raw texts from Project Gutenberg: i) removing headers,ii) tokenizing, and iii) counting words.
[-h] [-r RAW] [-ote OUTPUT_TEXT] [-oto OUTPUT_TOKENS] [-oco OUTPUT_COUNTS] [-p PATTERN] [-q] [-l LOG_FILE] [-c] [--ignore]
[--pool {process,thread}]

options:
-h, --help show this help message and exit
-r RAW, --raw RAW Path to the raw-folder
-ote OUTPUT_TEXT, --output_text OUTPUT_TEXT
Path to text-output (text_dir)
-oto OUTPUT_TOKENS, --output_tokens OUTPUT_TOKENS
Path to tokens-output (tokens_dir)
-oco OUTPUT_COUNTS, --output_counts OUTPUT_COUNTS
Path to counts-output (counts_dir)
-p PATTERN, --pattern PATTERN
Pattern to specify a subset of books
-q, --quiet Quiet mode, do not print info, warnings, etc
-l LOG_FILE, --log_file LOG_FILE
Path to log file
-c, --check_empty Whether to check if existing files are empty
--ignore Whether to ignore UTF-8 decoding errors
--pool {process,thread}
Whether to use multi-processing or multi-threading
```
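
The `--pool` choice between multi-processing and multi-threading can be sketched with `concurrent.futures`. This is an illustration under assumptions, not the fork's actual code: `run_pool` and `count_words` are hypothetical names, and the real per-book work is more involved.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_words(text):
    """Stand-in for the per-book work (hypothetical)."""
    return len(text.split())


def run_pool(items, worker, pool, max_workers=None):
    """Apply `worker` to every item using a process or thread pool."""
    if pool not in ("process", "thread"):
        raise ValueError("pool must be 'process' or 'thread'")
    executor_cls = ProcessPoolExecutor if pool == "process" else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as executor:
        # map() returns results in the same order as the input items.
        return list(executor.map(worker, items))


if __name__ == "__main__":
    texts = ["call me ishmael", "it was the best of times"]
    print(run_pool(texts, count_words, pool="thread"))
```

Processes sidestep the GIL for CPU-heavy tokenisation, while threads are cheaper to start and may be sufficient when the work is mostly file I/O.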
101 changes: 74 additions & 27 deletions get_data.py
@@ -5,7 +5,7 @@
M. Gerlach & F. Font-Clos

"""
from src.utils import populate_raw_from_mirror, list_duplicates_in_mirror
from src.utils import populate_raw_from_mirror, list_duplicates_in_mirror, remove_empty_dirs, is_win32
from src.metadataparser import make_df_metadata
from src.bookshelves import get_bookshelves
from src.bookshelves import parse_bookshelves
@@ -22,6 +22,7 @@
"This script will download all books currently not in your\n"
"local copy of PG and get the latest version of the metadata.\n"
)

# mirror dir
parser.add_argument(
"-m", "--mirror",
@@ -68,16 +69,50 @@
action="store_true",
help="Quiet mode, do not print info, warnings, etc"
)

# clean argument, to remove empty folders from the mirror directory
parser.add_argument(
"-c", "--clean",
action="store_true",
help="Clean the mirror directory to remove any empty folders"
)

# rsync command
parser.add_argument(
"--rsync",
help="Specify an alternative rsync command",
default='rsync',
type=str)

# procedures argument, to select which stages to run
parser.add_argument(
"--procedures",
help='''Procedures to go through, defaults to \"pdlmbs\":
[p]ull mirror files;
find [d]uplicates;
hard [l]ink from mirror to raw;
get [m]etadata;
get [b]ookshelf information;
[s]tore bookshelf information''',
default='pdlmbs',
type=str)

# create the parser
args = parser.parse_args()
mirror_dir, raw_dir, metadata_dir = args.mirror, args.raw, args.metadata

if is_win32:
print("Windows detected, please make sure wget is installed and added to PATH")
mirror_dir = mirror_dir.replace('/', '\\')
raw_dir = raw_dir.replace('/', '\\')
metadata_dir = metadata_dir.replace('/', '\\')

# check that all dirs exist
if not os.path.isdir(args.mirror):
if not os.path.isdir(mirror_dir):
raise ValueError("The specified mirror directory does not exist.")
if not os.path.isdir(args.raw):
if not os.path.isdir(raw_dir):
raise ValueError("The specified raw directory does not exist.")
if not os.path.isdir(args.metadata):
if not os.path.isdir(metadata_dir):
raise ValueError("The specified metadata directory does not exist.")

# Update the .mirror directory via rsync
@@ -99,49 +134,61 @@
# + 12345 - 0 . t x t
#---------------------------------------------
# [.-][t0][x.]t[x.] * [t8]
sp_args = ["rsync", "-am%s" % vstring,
"--include", "*/",
"--include", "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % args.pattern,
"--exclude", "*",
"aleph.gutenberg.org::gutenberg", args.mirror
]
subprocess.call(sp_args)
includes = ["*/", "[p123456789][g0123456789]%s[.-][t0][x.]t[x.]*[t8]" % args.pattern]
excludes = ["*"]
sp_args = ' '.join([args.rsync, "-am%s" % vstring] + ["--include=\"%s\"" % i for i in includes] + \
["--exclude=\"%s\"" % i for i in excludes] + ["aleph.gutenberg.org::gutenberg", mirror_dir])

# If specified, remove any empty directory that might be caused by bugs or wrong patterns in rsync
if args.clean:
remove_empty_dirs(mirror_dir, args.quiet)

# Subprocess call (default arguments):
# rsync -amv --include="*/" --include="[p123456789][g0123456789]*[.-][t0][x.]t[x.]*[t8]" --exclude="*" aleph.gutenberg.org::gutenberg data/.mirror/
if 'p' in args.procedures:
subprocess.call(sp_args)

# Get rid of duplicates
# ---------------------
# A very small portion of books are stored more than
# once in PG's site. We keep the newest one, see
# erase_duplicates_in_mirror docstring.
dups_list = list_duplicates_in_mirror(mirror_dir=args.mirror)
dups_list = list_duplicates_in_mirror(mirror_dir=mirror_dir) if 'd' in args.procedures else []

# Populate raw from mirror
# ------------------------
# We populate 'raw_dir' hardlinking to
# the hidden 'mirror_dir'. Names are standardized
# into PG12345_raw.txt form.
populate_raw_from_mirror(
mirror_dir=args.mirror,
raw_dir=args.raw,
overwrite=args.overwrite_raw,
dups_list=dups_list,
quiet=args.quiet
if 'l' in args.procedures:
populate_raw_from_mirror(
mirror_dir=mirror_dir,
raw_dir=raw_dir,
overwrite=args.overwrite_raw,
dups_list=dups_list,
quiet=args.quiet
)

# Update metadata
# ---------------
# By default, update the whole metadata csv
# file each time new data is downloaded.
make_df_metadata(
path_xml=os.path.join(args.metadata, 'rdf-files.tar.bz2'),
path_out=os.path.join(args.metadata, 'metadata.csv'),
update=args.keep_rdf
if 'm' in args.procedures:
make_df_metadata(
path_xml=os.path.join(metadata_dir, 'rdf-files.tar.bz2'),
path_out=os.path.join(metadata_dir, 'metadata.csv'),
update=args.keep_rdf
)

# Bookshelves
# -----------
# Get bookshelves and their respective books and titles as dicts
BS_dict, BS_num_to_category_str_dict = parse_bookshelves()
with open("metadata/bookshelves_ebooks_dict.pkl", 'wb') as fp:
pickle.dump(BS_dict, fp)
with open("metadata/bookshelves_categories_dict.pkl", 'wb') as fp:
pickle.dump(BS_num_to_category_str_dict, fp)
if 'b' in args.procedures:
get_bookshelves()

if 's' in args.procedures:
BS_dict, BS_num_to_category_str_dict = parse_bookshelves()
with open("metadata/bookshelves_ebooks_dict.pkl", 'wb') as fp:
pickle.dump(BS_dict, fp)
with open("metadata/bookshelves_categories_dict.pkl", 'wb') as fp:
pickle.dump(BS_num_to_category_str_dict, fp)
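
A note on the `-c/--clean` option used above: the body of `remove_empty_dirs` in `src/utils.py` is not shown in this diff, so the following is only a sketch of how such a clean-up could work. The function name `remove_empty_dirs_sketch` and its return value are assumptions.

```python
import os


def remove_empty_dirs_sketch(root, quiet=True):
    """Delete empty directories below `root` and return the removed paths."""
    removed = []
    # Walk bottom-up so that a parent emptied by removing its children
    # is itself found empty when it is visited.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # never remove the root directory itself
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
            if not quiet:
                print("Removed empty directory: %s" % dirpath)
    return removed
```

Calling `remove_empty_dirs_sketch("data/.mirror")` would leave every directory that still contains at least one file untouched.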