Simple Python library for HTML parsing

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering :: Information Analysis

Project description

Leaf

What is this?

This is a simple wrapper around lxml which adds some nice features to make working with lxml better. This library covers all my needs in HTML parsing.

Dependencies

lxml, obviously :3 (cssselect is also listed in the dependencies since 0.4.4)

Features

Nice jquery-like CSS selectors
Simple access to element attributes
Easy way to convert HTML to other formats (bbcode, markdown, etc.)
A few nice functions for working with text
And, of course, all original features of lxml

Description

The main function of the module (for my purposes) is leaf.parse. This function takes an HTML string as argument, and returns a leaf.Parser object, which wraps an lxml object.

With this object you can do anything you want, for example:

import leaf

sample = '<div id="menu"><a href="/home" onclick="go()">Home</a></div>'
document = leaf.parse(sample)
# get the links from the DIV with id 'menu' using CSS selectors
links = document('div#menu a')

Or you can do this:

# get first link or return None
link = document.get('div#menu a')

And you can get attributes from these results like this:

print(link.onclick)
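
To give a feel for how attribute-style access like this can work, here is a minimal sketch using only the standard library. It is not leaf's actual code: the Node class is hypothetical, it wraps xml.etree.ElementTree instead of lxml, and returning None for a missing attribute is an assumption.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical wrapper for illustration, not leaf's Parser class."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        # Private names are rejected to avoid recursing on _element.
        if name.startswith("_"):
            raise AttributeError(name)
        # Fall back to the wrapped element's attributes
        # (assumption: a missing attribute yields None).
        return self._element.get(name)


link = Node(ET.fromstring('<a href="/home" onclick="go()">Home</a>'))
print(link.onclick)
print(link.title)
```

Because the lookup happens in __getattr__, real Python attributes and methods of the wrapper still take priority over HTML attributes with the same name.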

You can also use standard lxml methods like object.xpath, and they return results as leaf.Parser objects.

My favorite feature is converting HTML into bbcode (or markdown, etc.):

# Let's define a simple formatter that passes text through
# and wraps links in [url][/url] tags (like bbcode)
def code_formatter(element, children):
    # Replace <br> tag with line break
    if element.tag == 'br':
        return '\n'
    # Wrap links into [url][/url]
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
    # Return children only for other elements.
    if children:
        return children

This function is called recursively for every element. It receives the element itself and children, a string holding the already formatted result of the element's child nodes.

So, let's run this formatter on some leaf.Parser object:

document.parse(code_formatter)
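
The recursion can be pictured with a small standard-library sketch. This is not how leaf implements parse internally: the render helper is hypothetical, it walks an xml.etree.ElementTree tree (so the input must be well-formed XML), and the formatter reads the link with element.get("href") because ElementTree has no attribute-style access.

```python
import xml.etree.ElementTree as ET


def render(element, formatter):
    """Format all children first, then pass the joined text to formatter."""
    parts = [element.text or ""]
    for child in element:
        # A formatter may return None for elements it wants to drop.
        parts.append(render(child, formatter) or "")
        # Text that follows the child tag still belongs to this element.
        parts.append(child.tail or "")
    return formatter(element, "".join(parts))


def code_formatter(element, children):
    # Replace <br> with a line break
    if element.tag == "br":
        return "\n"
    # Wrap links into [url][/url]
    if element.tag == "a":
        return "[url={link}]{text}[/url]".format(
            link=element.get("href"), text=children)
    # Return children only for other elements
    if children:
        return children


markup = '<p>See <a href="/docs">the docs</a><br/>for details.</p>'
result = render(ET.fromstring(markup), code_formatter)
print(result)
```

The formatter only ever sees finished strings for the nested content, which is why it can stay a flat list of tag checks.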

More detailed examples are available in the tests.

Finally, this library has some nice functions for working with text:

Name	Description
to_unicode	Convert string to unicode string
strip_accents	Strip accents from a string
strip_symbols	Strip ugly unicode symbols from a string
strip_spaces	Strip excess spaces from a string
strip_linebreaks	Strip excess line breaks from a string
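
For readers who want to see what helpers of this kind typically do, here is a rough standard-library sketch. The function names carry a _sketch suffix to make clear they are hypothetical stand-ins, not leaf's implementations, and the exact rules leaf applies (for example which whitespace characters it collapses) may differ.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    # Decompose characters (e.g. "é" becomes "e" plus a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_spaces_sketch(text):
    # Collapse runs of spaces and tabs into one space and trim the ends.
    return re.sub(r"[ \t]+", " ", text).strip()


def strip_linebreaks_sketch(text):
    # Collapse runs of line breaks into a single newline and trim the ends.
    return re.sub(r"[\r\n]+", "\n", text).strip()


print(strip_accents_sketch("Café crème"))
print(strip_spaces_sketch("  a   b\t c "))
print(repr(strip_linebreaks_sketch("a\n\n\nb\r\n")))
```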

Change log

1.0.7

Fix badges in README.md

Clean up CHANGES.md

1.0.6

Fix installation script on LICENSE file

1.0.4

Convert documentation to Markdown

Add support for universal wheel

1.0.1

100% test coverage

fixed bug in result wrapping (etree._Element has __iter__ too!)

1.0

add python3 support

first production release

0.4.4

fix inner_html method

added **kwargs to the parse function, added inner_html method to the Parser class

cssselect in deps

0.4.2

Node attribute modification via node.href = '/blah'

Custom default value for get: document.get(selector, default=None)

Get element by index: document.get(selector, index)

0.4.1

bool(node) returns True if element exists and False if element is None

0.4

First public version

Release history

Version	Release date
1.0.7	Jan 25, 2020
1.0.6	Jan 25, 2020
1.0.5	Jan 25, 2020
1.0.4	Jan 25, 2020
1.0.3	Sep 17, 2014
1.0.2	Mar 13, 2014
1.0.1	Mar 12, 2014
1.0	Mar 10, 2014
0.4.5	Aug 16, 2013
0.4.4	Jan 18, 2013
0.4.3	Jan 17, 2013
0.4.2	May 15, 2011
0.4.1	Apr 24, 2011
0.4	Mar 8, 2011

Download files

Source Distribution

leaf-1.0.7.tar.gz (5.8 kB)

Uploaded Jan 25, 2020 Source

Built Distribution

leaf-1.0.7-py2.py3-none-any.whl (5.9 kB)

Uploaded Jan 25, 2020 Python 2 Python 3

Hashes for leaf-1.0.7.tar.gz
Algorithm	Hash digest
SHA256	`38c7fdef9de1a67961794d981260cd2dc5c16bb705aa11c746565f9b52856aa9`
MD5	`58df91645a06b97eda494758de834fa5`
BLAKE2b-256	`18a45c8c5caac9e03ea33b2384d16f5167c474cd7194cb2d7718de1d4d6156c4`

Hashes for leaf-1.0.7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`d3ea38bf05e1cb4caee373192fc30c53a09c7890f2a000baf7b473df0a989910`
MD5	`77b50f83d8d0b5dbbe59423c26c1e712`
BLAKE2b-256	`0105dc58afe5bd51f3016a1329f7e891f77daf5b63abe518643be1b8cd9c4623`