wireshark/tools/html2text.py

203 lines
6.6 KiB
Python
Raw Normal View History

#!/usr/bin/env python3
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
#
# html2text.py - converts HTML to text
#
# Wireshark - Network traffic analyzer
# By Gerald Combs <gerald@wireshark.org>
# Copyright 1998 Gerald Combs
#
# SPDX-License-Identifier: GPL-2.0-or-later
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
from __future__ import unicode_literals
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
__author__ = "Peter Wu <peter@lekensteyn.nl>"
__copyright__ = "Copyright 2015, Peter Wu"
__license__ = "GPL (v2 or later)"
# TODO:
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
# multiple list indentation levels
# maybe allow for ascii output instead of utf-8?
import sys
from textwrap import TextWrapper
try:
from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint
Fix issues discovered by common python linters Fix some issues discovered by common python linters including: * switch `None` comparisons to use `is` rather than `==`. Identity != equality, and I've spent 40+ hours before tracking down a subtle bug caused by exactly this issue. Note that this may introduce a problem if one of the scripts is depending on this behavior, in which case the comparison should be changed to `True`/`False` rather than `None`. * Use `except Exception:` as bare `except:` statements have been discouraged for years. Ideally for some of these we'd examine if there were specific exceptions that should be caught, but for now I simply caught all. Again, this could introduce very subtle behavioral changes under Python 2, but IIUC, that was all fixed in Python 3, so safe to move to `except Exception:`. * Use more idiomatic `if not x in y`--> `if x not in y` * Use more idiomatic 2 blank lines. I only did this at the beginning, until I realized how overwhelming this was going to be to apply, then I stopped. * Add a TODO where an undefined function name is called, so will fail whenever that code is run. * Add more idiomatic spacing around `:`. This is also only partially cleaned up, as I gave up when I saw how `asn2wrs.py` was clearly infatuated with the construct. * Various other small cleanups, removed some trailing whitespace and improper indentation that wasn't a multiple of 4, etc. There is still _much_ to do, but I haven't been heavily involved with this project before, so thought this was a sufficient amount to put up and see what the feedback is. Linters that I have enabled which highlighted some of these issues include: * `pylint` * `flake8` * `pycodestyle`
2020-09-21 05:44:41 +00:00
except ImportError: # Python 3
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
from html.parser import HTMLParser
from html.entities import name2codepoint
unichr = chr # for html entity handling
class TextHTMLParser(HTMLParser):
"""Converts a HTML document to text."""
def __init__(self):
try:
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
# Python 3.4
HTMLParser. __init__(self, convert_charrefs=True)
Fix issues discovered by common python linters Fix some issues discovered by common python linters including: * switch `None` comparisons to use `is` rather than `==`. Identity != equality, and I've spent 40+ hours before tracking down a subtle bug caused by exactly this issue. Note that this may introduce a problem if one of the scripts is depending on this behavior, in which case the comparison should be changed to `True`/`False` rather than `None`. * Use `except Exception:` as bare `except:` statements have been discouraged for years. Ideally for some of these we'd examine if there were specific exceptions that should be caught, but for now I simply caught all. Again, this could introduce very subtle behavioral changes under Python 2, but IIUC, that was all fixed in Python 3, so safe to move to `except Exception:`. * Use more idiomatic `if not x in y`--> `if x not in y` * Use more idiomatic 2 blank lines. I only did this at the beginning, until I realized how overwhelming this was going to be to apply, then I stopped. * Add a TODO where an undefined function name is called, so will fail whenever that code is run. * Add more idiomatic spacing around `:`. This is also only partially cleaned up, as I gave up when I saw how `asn2wrs.py` was clearly infatuated with the construct. * Various other small cleanups, removed some trailing whitespace and improper indentation that wasn't a multiple of 4, etc. There is still _much_ to do, but I haven't been heavily involved with this project before, so thought this was a sufficient amount to put up and see what the feedback is. Linters that I have enabled which highlighted some of these issues include: * `pylint` * `flake8` * `pycodestyle`
2020-09-21 05:44:41 +00:00
except Exception:
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
HTMLParser. __init__(self)
# All text, concatenated
self.output_buffer = ''
# The current text block which is being constructed
self.text_block = ''
# Whether the previous element was terminated with whitespace
self.need_space = False
# Whether to prevent word-wrapping the contents (for "pre" tag)
self.skip_wrap = False
# track list items
self.list_item_prefix = None
self.ordered_list_index = None
# Indentation (for heading and paragraphs)
self.indent_levels = [0, 0]
# Don't dump CSS, scripts, etc.
self.ignore_tags = ('head', 'style', 'script')
self.ignore_level = 0
# href footnotes.
self.footnotes = []
self.href = None
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
def _wrap_text(self, text):
"""Wraps text, but additionally indent list items."""
initial_indent = indent = sum(self.indent_levels) * ' '
if self.list_item_prefix:
initial_indent += self.list_item_prefix
indent += ' '
kwargs = {
'width': 72,
'initial_indent': initial_indent,
'subsequent_indent': indent
}
kwargs['break_on_hyphens'] = False
wrapper = TextWrapper(**kwargs)
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
return '\n'.join(wrapper.wrap(text))
def _commit_block(self, newline='\n\n'):
text = self.text_block
if text:
if not self.skip_wrap:
text = self._wrap_text(text)
self.output_buffer += text + newline
self.text_block = ''
self.need_space = False
def handle_starttag(self, tag, attrs):
# end a block of text on <br>, but also flush list items which are not
# terminated.
if tag == 'br' or tag == 'li':
self._commit_block('\n')
if tag == 'pre':
self.skip_wrap = True
# Following list items are numbered.
if tag == 'ol':
self.ordered_list_index = 1
if tag == 'ul':
self.list_item_prefix = ''
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
if tag == 'li' and self.ordered_list_index:
self.list_item_prefix = ' %d. ' % (self.ordered_list_index)
self.ordered_list_index += 1
if tag[0] == 'h' and len(tag) == 2 and \
(tag[1] >= '1' and tag[1] <= '6'):
self.indent_levels = [int(tag[1]) - 1, 0]
if tag == 'p':
self.indent_levels[1] = 1
if tag == 'a':
try:
href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
if '://' in href: # Skip relative URLs and links.
self.href = href
except IndexError:
self.href = None
if tag in self.ignore_tags:
self.ignore_level += 1
def handle_data(self, data):
if self.ignore_level > 0:
return
elif self.skip_wrap:
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
block = data
else:
if self.href and data == self.href:
# This is a self link. Don't create a footnote.
self.href = None
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
# For normal text, fold multiple whitespace and strip
# leading and trailing spaces for the whole block (but
# keep spaces in the middle).
block = ''
if data.strip() and data[:1].isspace():
# Keep spaces in the middle
self.need_space = True
if self.need_space and data.strip() and self.text_block:
block = ' '
block += ' '.join(data.split())
self.need_space = data[-1:].isspace()
self.text_block += block
def handle_endtag(self, tag):
block_elements = 'p li ul pre ol h1 h2 h3 h4 h5 h6'
#block_elements += ' dl dd dt'
if tag in block_elements.split():
self._commit_block()
if tag in ('ol', 'ul'):
self.list_item_prefix = None
self.ordered_list_index = None
if tag == 'pre':
self.skip_wrap = False
if tag == 'a' and self.href:
self.footnotes.append(self.href)
self.text_block += '[{0}]'.format(len(self.footnotes))
if tag in self.ignore_tags:
self.ignore_level -= 1
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
def handle_charref(self, name):
self.handle_data(unichr(int(name)))
def handle_entityref(self, name):
self.handle_data(unichr(name2codepoint[name]))
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
def close(self):
HTMLParser.close(self)
self._commit_block()
if len(self.footnotes) > 0:
self.list_item_prefix = None
self.indent_levels = [1, 0]
self.text_block = 'References'
self._commit_block()
self.indent_levels = [1, 1]
footnote_num = 1
for href in self.footnotes:
self.text_block += '{0:>2}. {1}\n'.format(footnote_num, href)
footnote_num += 1
self._commit_block('\n')
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
byte_output = self.output_buffer.encode('utf-8')
if hasattr(sys.stdout, 'buffer'):
sys.stdout.buffer.write(byte_output)
else:
sys.stdout.write(byte_output)
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
def main():
htmlparser = TextHTMLParser()
if len(sys.argv) > 1 and sys.argv[1] != '-':
filename = sys.argv[1]
f = open(filename, 'rb')
else:
filename = None
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
f = sys.stdin
try:
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
if hasattr(f, 'buffer'):
# Access raw (byte) buffer in Python 3 instead of decoded one
f = f.buffer
# Read stdin as as Unicode string
htmlparser.feed(f.read().decode('utf-8'))
finally:
if filename is not None:
f.close()
Always use html2text.py for FAQ, improve output A recent commit broke compilation with Python 3. The original author of html2text.py is deceased and the fork has increased the number of files for this "simple" helper. The html2text.py script in this patch was rewritten and its output matches with lynx (except for a few newlines around lists). This means that indentation has been added for headings, paragraphs and lists. Also, since it was written from scratch, a new license could be chosen that matches Wireshark. Since now the in-tree html2text.py script provides nicer output, remove detection of the alternative programs (elinks, links). lynx/w3m is somehow still necessary for asciidoc though. (I also looked into reusing html2text.py for the release notes to replace asciidoc, but the --format=html output produces different output (HTML adds a ToC and section numbers). For now still require lynx for release notes) Tested with Python 2.6.6, 2.7.9, 3.2.6 and 3.4.3 under LC_ALL=C and LC_ALL=en_US.UTF-8 on Linux. Tested reading from stdin and file, writing to file, pipe and tty. Tested with cmake (Ninja) and autotools on Arch Linux x86_64. Test: # For each $PATH per python version, execute (with varying LC_ALL) help/faq.py -b | tools/html2text.py /dev/stdin | md5sum help/faq.py -b | tools/html2text.py | md5sum help/faq.py -b | tools/html2text.py help/faq.py -b | tools/html2text.py >/dev/null Change-Id: I6409450a3e6c8b010ca082251f9db7358b0cc2fd Reviewed-on: https://code.wireshark.org/review/7779 Petri-Dish: Peter Wu <peter@lekensteyn.nl> Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org> Reviewed-by: Anders Broman <a.broman58@gmail.com>
2015-03-21 10:57:01 +00:00
htmlparser.close()
if __name__ == '__main__':
sys.exit(main())