As this repo no longer appears to be maintained, I think it may be useful for folks to know that Project Gutenberg now publishes the complete UTF-8 text of its library as a compressed tar archive, updated weekly.
The files are entirely UTF-8 and have uniform headers and footers that are easy to strip for text analysis and other automated uses. The headers also carry metadata in a uniform format.
https://www.gutenberg.org/cache/epub/feeds/txt-files.tar.zip (currently 9.9 GB)
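
For anyone who wants a starting point, here is a rough Python sketch of stripping the header and footer and pulling the header metadata out of one file's text. It assumes the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines and `Key: value` header fields; the function name `split_gutenberg_text` is my own invention, and I have not checked it against every file in the archive, so treat it as an illustration rather than a tested tool.

```python
import re

# Assumed marker format; verify against the files you actually download.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
# Assumed header field format, e.g. "Title: Moby Dick".
FIELD_RE = re.compile(r"^([A-Z][A-Za-z ]*): (.+)$", re.MULTILINE)


def split_gutenberg_text(text):
    """Return (metadata, body) for the text of one Gutenberg file.

    Hypothetical helper: metadata is a dict of header fields, body is
    everything between the start and end marker lines.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() < start.end():
        raise ValueError("Gutenberg start/end markers not found")

    header = text[:start.start()]
    body = text[start.end():end.start()].strip()
    metadata = {key.strip(): value.strip() for key, value in FIELD_RE.findall(header)}
    return metadata, body
```

Calling `split_gutenberg_text(open(path, encoding="utf-8").read())` on an extracted file should give a metadata dict and the bare text. It raises `ValueError` when the markers are missing, so oddly formatted files are flagged instead of being silently passed through.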