Skip to main content

extract markdown flavored text from html

Project description

Python package

html5lib_to_markdown

This package offers a way to convert HTML to the Markdown format.

This package is currently targeting a SUBSET of full HTML->Markdown conversion to address internal needs. More functionality will be added as needed. Pull requests are welcome.

There are many packages that convert HTML to Markdown. Why create another package to do this task?

as_markdown = to_markdown(html) == to_markdown(as_markdown) == to_markdown(to_markdown(html)) == to_markdown(to_markdown(as_markdown))

This can't be guaranteed in all situations because of how Markdown and HTML work, but it is a goal. This library should not add artifacts.

At a minimum, our goal is this

as_markdown = to_markdown(html) == to_markdown(to_html(as_markdown))
as_html = to_html(as_markdown) == to_html(to_markdown(as_html))
  • Customized feature: A departure from the core Markdown specification was needed for a few elements:

    1. Render img tags, not Markdown format
    2. Render a tags, not Markdown format
  • Customized feature: Python2 and Python3 compatibility. This shouldn't be a feature, but it is. Some excellent packages in this space stopped supporting Python2 already. This package aims to keep Python2 support around a bit longer than the official cutoff date, because legacy systems exist.

  • Core Implementation Detail. This package is implemented as a htmllib5 "tree adapter", which means it can be potentially be layered into many htm5lib processing routines. Other packages use BeautifulSoup, lxml or HTMLParser. These other projects are all great, but require re-processing if you are already doing things with html5lib.

Unsupported Features

Angled links are not currently supported, for example:

<http://example.com>

They are not compatible with the html5lib parser, and trying to support them will require a lot of work.

Pretty Markdown?

What is pretty Markdown?

  • There is a max of 2 newlines (1 blank line) between elements.
  • blank lines are dropped to the lowest allowable nesting of blockquotes or lists
  • whitespace is shown via HTML rendering rules

Other Libraries

The interface was inspired by the bleach user-input sanitization library, which relies on html5lib

If you just need a standard and pure "HTML to Markdown" convertor, I recommend the following libraries:

License & Copyright

This package is available under the MIT License. The code and tests are: Copyright 2019 Jonathan Vanasco (jonathan@findmeon.com)

Additional code and tests are isolated in the tests_working directory (and soon to be tests), with copyright attributed to the antimarkdown and markdownify projects. both are used under the MIT License and credited in the source

Environment Variables

  • MD_DEBUG_TOKENS - will use string representation for tokens (human readable!) instead of optimizing with ints
  • MD_DEBUG_STACKS - will print() the tokens during
  • MD_DEBUG_STACKS_SIMPLE - will print() the tokens in a simplified form

TODO

See TODO.txt

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

html5lib_to_markdown-0.0.6.tar.gz (35.0 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page