Simple Python library for HTML parsing

Leaf

What is this?

This is a simple wrapper around lxml that adds a few conveniences for working with parsed HTML. It covers all my needs in HTML parsing.

Dependencies

lxml obviously :3

Features

  • Nice jquery-like CSS selectors
  • Simple access to element attributes
  • Easy way to convert HTML to other formats (bbcode, markdown, etc.)
  • A few nice functions for working with text
  • And, of course, all original features of lxml

Description

The main function of the module (for my purposes) is leaf.parse. This function takes an HTML string as argument, and returns a leaf.Parser object, which wraps an lxml object.

With this object you can do anything you want, for example:

document = leaf.parse(sample)
# get the links from the DIV with id 'menu' using CSS selectors
links = document('div#menu a')

Or you can do this:

# get first link or return None
link = document.get('div#menu a')

And you can get attributes from these results like this:

print(link.onclick)
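
To illustrate the idea behind attribute-style access, here is a minimal sketch built on the standard library's xml.etree.ElementTree rather than on leaf itself. The AttrNode class is hypothetical and is not leaf's implementation; it only shows how reading node.onclick and assigning node.href could be mapped onto an element's attributes.

```python
import xml.etree.ElementTree as ET


class AttrNode:
    """Hypothetical wrapper (not part of leaf): exposes element
    attributes as Python attributes."""

    def __init__(self, element):
        # Bypass our own __setattr__ so the wrapped element is stored
        # on the instance instead of being written as an XML attribute.
        object.__setattr__(self, "_element", element)

    def __getattr__(self, name):
        # Called only when normal lookup fails; fall back to the
        # element's attributes. Missing attributes yield None.
        return self._element.get(name)

    def __setattr__(self, name, value):
        # Assignment writes through to the underlying element.
        self._element.set(name, value)


node = AttrNode(ET.fromstring('<a href="/home" onclick="go()">Home</a>'))
print(node.onclick)
node.href = "/blah"
print(node.href)
```

In this sketch a missing attribute returns None instead of raising AttributeError; whether leaf behaves the same way is an assumption.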

You can also use standard lxml methods like object.xpath, and they return results as leaf.Parser objects.

My favorite feature is converting HTML to bbcode (or markdown, etc.):

# Let's define a simple formatter that passes text through
# and wraps links in [url][/url] (as in bbcode)
def code_formatter(element, children):
    # Replace <br> tag with line break
    if element.tag == 'br':
        return '\n'
    # Wrap links into [url][/url]
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
    # Return children only for other elements.
    if children:
        return children

This function is called recursively for each element. The children argument is a string holding the already-formatted result for that element's children.

So, let's call this parser on some leaf.Parser object:

document.parse(code_formatter)
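
The recursion described above can be sketched without leaf, using only the standard library's xml.etree.ElementTree. The walk helper below is a hypothetical stand-in for what document.parse does, not leaf's actual code: it formats each element's children first and then passes the joined result to the formatter. Because ElementTree has no attribute-style access, the formatter here reads the link with element.get("href") instead of element.href.

```python
import xml.etree.ElementTree as ET


def walk(element, formatter):
    """Hypothetical stand-in for the recursive call: format the
    children first, then hand the joined string to the formatter."""
    parts = [element.text or ""]
    for child in element:
        # A formatter may return None for an element; treat it as empty.
        parts.append(walk(child, formatter) or "")
        # Text that follows the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    return formatter(element, "".join(parts))


def bbcode_formatter(element, children):
    # Replace <br> with a line break.
    if element.tag == "br":
        return "\n"
    # Wrap links in [url=...]...[/url].
    if element.tag == "a":
        return "[url={link}]{text}[/url]".format(
            link=element.get("href"), text=children
        )
    # Pass the children through for every other element.
    return children


root = ET.fromstring('<div>See <a href="/docs">the docs</a><br/>for details.</div>')
result = walk(root, bbcode_formatter)
print(result)
```

The formatter never sees raw child elements, only their formatted text, which is why a single small function is enough to describe the whole conversion.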

More detailed examples are available in the tests.

Finally, this library has some nice functions for working with text:

Name              Description
to_unicode        Convert string to unicode string
strip_accents     Strip accents from a string
strip_symbols     Strip ugly unicode symbols from a string
strip_spaces      Strip excess spaces from a string
strip_linebreaks  Strip excess line breaks from a string
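
As a rough idea of what helpers like these do, here is a standard-library sketch of three of them. The sketch_ functions are hypothetical approximations written for illustration; leaf's own implementations may differ in what they treat as an accent, a space, or an excess line break.

```python
import re
import unicodedata


def sketch_strip_accents(text):
    # Decompose characters (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_strip_spaces(text):
    # Collapse runs of spaces and tabs into one space and trim the ends.
    return re.sub(r"[ \t]+", " ", text).strip()


def sketch_strip_linebreaks(text):
    # Collapse runs of consecutive line breaks into one and trim the ends.
    return re.sub(r"\n{2,}", "\n", text).strip()


print(sketch_strip_accents(u"caf\u00e9 na\u00efve"))
print(sketch_strip_spaces("  a   b \t c "))
print(sketch_strip_linebreaks("a\n\n\nb\n"))
```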

Change log

1.0.7

  • Fix badges in README.md
  • cleanup CHANGES.md

1.0.6

  • Fix installation script on LICENSE file

1.0.4

  • Convert documentation to Markdown
  • Add support for universal wheel

1.0.1

  • 100% test coverage
  • fixed bug in result wrapping (etree._Element has __iter__ too!)

1.0

  • add python3 support
  • first production release

0.4.4

  • fix inner_html method
  • added **kwargs to the parse function, added inner_html method to the Parser class
  • cssselect in deps

0.4.2

  • Node attribute modification via node.href = '/blah'
  • Custom default value for get: document.get(selector, default=None)
  • Get element by index: document.get(selector, index)

0.4.1

  • bool(node) returns True if element exists and False if element is None

0.4

  • First public version
