
html2text

Matthew Harris edited this page Jan 21, 2016 · 2 revisions

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
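To make the idea of "HTML in, Markdown out" concrete, here is a toy sketch built only on the standard library's `html.parser`. It is not html2text's actual implementation; the class `ToyMarkdownConverter` and the helper `toy_html_to_markdown` are hypothetical names invented for this illustration. It handles only a handful of tags (`h1`, `p`, `em`/`i`, `strong`/`b`, `a`), whereas the real script covers far more.

```python
from html.parser import HTMLParser


class ToyMarkdownConverter(HTMLParser):
    """Toy HTML-to-Markdown converter (illustration only, NOT html2text itself)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # output fragments, joined at the end
        self.href_stack = []  # hrefs of currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("em", "i"):
            self.parts.append("_")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "h1":
            self.parts.append("# ")
        elif tag == "a":
            # Remember the target so the closing tag can emit "](url)".
            self.href_stack.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("em", "i"):
            self.parts.append("_")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("h1", "p"):
            # Block-level elements end with a blank line in Markdown.
            self.parts.append("\n\n")
        elif tag == "a" and self.href_stack:
            self.parts.append("](" + self.href_stack.pop() + ")")

    def handle_data(self, data):
        self.parts.append(data)


def toy_html_to_markdown(html):
    """Convert a small HTML fragment to Markdown-flavoured text."""
    converter = ToyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()
```

Calling `toy_html_to_markdown('<h1>Title</h1><p>Some <em>emphasized</em> text.</p>')` returns a heading line `# Title`, a blank line, and then `Some _emphasized_ text.`, which is readable as plain text and also valid Markdown.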

> python html2text.py -h
Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-emphasis     don't include any formatting for emphasis
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -e, --asterisk-emphasis
                        use an asterisk rather than an underscore for
                        emphasized text
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevant when -g is
                        specified as well
  --escape-all          Escape all special characters.  Output is less
                        readable, but avoids corner case formatting issues.
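When the script is driven from another program, it can help to assemble the command line from named settings rather than hand-written strings. The sketch below is one way to do that, using only flags that appear in the help text above. The function `build_html2text_command` and its parameter names are hypothetical, and it assumes `html2text.py` is in the current directory and that `python` is on the path.

```python
def build_html2text_command(source, encoding=None, ignore_links=False,
                            ignore_images=False, dash_unordered_list=False,
                            asterisk_emphasis=False, body_width=None,
                            script="html2text.py"):
    """Build an argument list for html2text.py (hypothetical helper).

    `source` is a filename or URL; `encoding` is the optional second
    positional argument shown in the usage line.
    """
    cmd = ["python", script]
    if ignore_links:
        cmd.append("--ignore-links")
    if ignore_images:
        cmd.append("--ignore-images")
    if dash_unordered_list:
        cmd.append("--dash-unordered-list")
    if asterisk_emphasis:
        cmd.append("--asterisk-emphasis")
    if body_width is not None:
        # 0 means "no wrap" according to the help text.
        cmd.append("--body-width=%d" % body_width)
    # Positional arguments go last: (filename|url) [encoding]
    cmd.append(source)
    if encoding:
        cmd.append(encoding)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to capture the converted text, provided the script is actually present at the assumed location.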
