leaf · PyPI

Simple Python library for HTML parsing

Project description

Leaf
====
.. image:: https://travis-ci.org/penpen/Leaf.png?branch=master
   :target: https://travis-ci.org/penpen/Leaf

What is this?
-------------
This is a simple wrapper around lxml that adds a few convenience features
to make working with lxml easier. This library covers all of my
HTML-parsing needs.

Dependencies
------------
`lxml <http://lxml.de/>`_ obviously :3

Features
--------
* Nice jQuery-like CSS selectors
* Simple access to element attributes
* An easy way to convert HTML to other formats (BBCode, Markdown, etc.)
* A few handy functions for working with text
* And, of course, all the original features of lxml are preserved

Description
-----------
The main function of the module (in my opinion) is leaf.parse. It takes a string of
HTML as its argument and returns a leaf.Parser object, which wraps an lxml object.
With this object you can do anything you want, like this::

    document = leaf.parse(sample)
    links = document('div#menu a')  # get links in the div with id "menu" through CSS selectors

Or you can do this::

    link = document.get('div#menu a')  # get the first link or return None

And you can get attributes from these results like this::

    print(link.onclick)
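
The attribute access shown above can be pictured with a small stand-in class. This is only a sketch of the idea built on the standard library's xml.etree.ElementTree, not leaf's actual code; the ``Node`` class is hypothetical and exists purely for illustration.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical wrapper (not leaf.Parser): exposes element attributes as Python attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so _element is found first.
        if name.startswith('_'):
            raise AttributeError(name)
        # Missing HTML attributes come back as None instead of raising.
        return self._element.get(name)


link = Node(ET.fromstring('<a href="/home" onclick="go()">Home</a>'))
print(link.onclick)
print(link.href)
```

In this sketch a missing attribute such as ``link.title`` simply yields None, which keeps lookups on scraped markup from raising.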

You can still use standard lxml methods such as object.xpath, and they return results
wrapped in leaf.Parser.
My favorite feature is converting HTML into BBCode (Markdown, etc.)::

    # Let's define a simple formatter that passes text through
    # and wraps links in [url][/url] (like BBCode)
    def omgcode_formatter(element, children):
        # Replace the <br> tag with a line break
        if element.tag == 'br':
            return '\n'
        # Wrap links in [url][/url]
        if element.tag == 'a':
            return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
        # Return only the children for other elements.
        if children:
            return children

This function is called recursively with element and children (a string holding the
result of formatting the element's children).
Now let's call this formatter on a leaf.Parser object::

    document.parse(omgcode_formatter)
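
To make the recursion concrete, here is a minimal sketch of how a formatter callback of this shape could be driven over a tree. It is an assumption about the general technique, not leaf's implementation: it uses the standard library's xml.etree.ElementTree (so the input must be well-formed), the ``render`` helper is hypothetical, and attributes are read with ``element.get('href')`` rather than ``element.href``.

```python
import xml.etree.ElementTree as ET


def render(element, formatter):
    """Hypothetical driver: format the children first, then pass the joined text to the formatter."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, formatter))
        # Text that follows a child tag belongs to the parent's output.
        parts.append(child.tail or "")
    return formatter(element, "".join(parts)) or ""


def bbcode_formatter(element, children):
    # Replace the <br> tag with a line break
    if element.tag == "br":
        return "\n"
    # Wrap links in [url][/url]
    if element.tag == "a":
        return "[url={link}]{text}[/url]".format(link=element.get("href"), text=children)
    # Other elements contribute only their children
    return children


root = ET.fromstring('<div>See <a href="http://lxml.de/">lxml</a><br/>docs</div>')
result = render(root, bbcode_formatter)
print(result)
```

Because children are rendered before their parent, the formatter only ever has to decide how to wrap an already-formatted string.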

More detailed examples are available in the tests.

Finally, this library has some handy functions for working with text:

*to_unicode* -- Convert string to unicode string

*strip_accents* -- Strip accents from a string

*strip_symbols* -- Strip ugly unicode symbols from a string

*strip_spaces* -- Strip excess spaces from a string

*strip_linebreaks* -- Strip excess line breaks from a string
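
For a rough idea of what helpers like these typically do, here is a sketch using only the standard library. The function names and behavior below are assumptions for illustration; they are not leaf's functions and may differ from them in detail.

```python
import re
import unicodedata


def remove_accents(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_spaces(text):
    # Squeeze runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text).strip()


def collapse_linebreaks(text):
    # Squeeze runs of line breaks into a single newline.
    return re.sub(r"[\r\n]+", "\n", text).strip()


print(remove_accents("café naïve"))
print(collapse_spaces("a   b \t c "))
print(collapse_linebreaks("one\n\n\ntwo\r\n"))
```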

Change log
==========

1.0
---
- add python3 support
- first production release

0.4.4
-----
- fix inner_html method
- added ``**kwargs`` to the parse function, added inner_html method to the Parser class
- cssselect in deps

0.4.2
-----
- Node attribute modification via node.href = '/blah'
- Custom default value for get: document.get(selector, default=None)
- Get element by index: document.get(selector, index)

0.4.1
-----
- bool(node) returns True if element exists and False if element is None

0.4
---
- First public version

Release history

- 1.0.7 (Jan 25, 2020)
- 1.0.6 (Jan 25, 2020)
- 1.0.5 (Jan 25, 2020)
- 1.0.4 (Jan 25, 2020)
- 1.0.3 (Sep 17, 2014)
- 1.0.2 (Mar 13, 2014)
- 1.0.1 (Mar 12, 2014)
- 1.0 (Mar 10, 2014), this version
- 0.4.5 (Aug 16, 2013)
- 0.4.4 (Jan 18, 2013)
- 0.4.3 (Jan 17, 2013)
- 0.4.2 (May 15, 2011)
- 0.4.1 (Apr 24, 2011)
- 0.4 (Mar 8, 2011)

Download files

Source Distribution

leaf-1.0.tar.gz (5.4 kB)

Uploaded Mar 10, 2014 Source

Hashes for leaf-1.0.tar.gz

Algorithm	Hash digest
SHA256	`7fd309af6e812eba3951875ee9d2ff15a28c49db5c288b9d9bce94bd4fabb051`
MD5	`b26df96abc209313ac10249edede6daa`
BLAKE2b-256	`c01e3aed1d5eb572c7c9dfe57fe58aa76dea03521a790674d833b6b7593833c0`