XML/HTML scraper using XPath queries.

Project description

Copyright (C) 2014-2018 H. Turgut Uyar <uyar@tekir.org>

Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.
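To illustrate the general idea of pulling data out of a document with XPath queries, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`. It does not use Piculet's own API. The sample document, the `QUERIES` mapping, and the `extract` helper are all hypothetical names invented for this example. ElementTree supports only a subset of XPath, so the queries are kept simple.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; any well-formed XML/XHTML would do.
DOCUMENT = """
<html>
  <head><title>The Shining</title></head>
  <body>
    <span class="year">1980</span>
    <ul class="genres">
      <li>Horror</li>
      <li>Drama</li>
    </ul>
  </body>
</html>
"""

# Hypothetical mapping from result keys to XPath queries
# (restricted to the XPath subset that ElementTree understands).
QUERIES = {
    "title": ".//title",
    "year": ".//span[@class='year']",
    "genres": ".//ul[@class='genres']/li",
}


def extract(document, queries):
    """Apply each query to the document and collect the matched text."""
    root = ET.fromstring(document.strip())
    result = {}
    for key, path in queries.items():
        # Gather the text of every matching element, skipping empty ones.
        texts = [el.text for el in root.findall(path) if el.text]
        if texts:
            # A single match becomes a plain string; several stay a list.
            result[key] = texts[0] if len(texts) == 1 else texts
    return result


data = extract(DOCUMENT, QUERIES)
print(data)
```

Keeping the queries in a plain mapping separates what to extract from how extraction is done, which is the same general idea behind describing a scraper as a specification rather than as code.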

PyPI:

https://pypi.python.org/pypi/piculet/

Repository:

https://bitbucket.org/uyar/piculet

Documentation:

https://piculet.readthedocs.io/

Piculet has been tested with Python 2.7, Python 3.4+, PyPy2 5.7+, and PyPy3 5.7+. You can install the latest version using pip:

pip install piculet

History

1.0b7 (2018-03-21)

  • Dropped support for Python 3.3.

  • Fixes for handling Unicode data in HTML for Python 2.

  • Added registry for preprocessors.

1.0b6 (2018-01-17)

  • Support for writing specifications in YAML.

1.0b5 (2018-01-16)

  • Added a class-based API for writing specifications.

  • Added predefined transformation functions.

  • Removed callables from specification maps. Use the new API instead.

  • Added support for registering new reducers and transformers.

  • Added support for defining sections in a document.

  • Refactored XPath evaluation method in order to parse path expressions once.

  • Preprocessing will be done only once when the tree is built.

  • Concatenation is now the default reducing operation.

1.0b4 (2018-01-02)

  • Added “--version” option to command line arguments.

  • Added option to force the use of lxml’s HTML builder.

  • Fixed the error where non-truthy values would be excluded from the result.

  • Added support for transforming node text during preprocessing.

  • Added separate preprocessing function to API.

  • Renamed the “join” reducer as “concat”.

  • Renamed the “foreach” keyword for keys as “section”.

  • Removed some low level debug messages to substantially increase speed.

1.0b3 (2017-07-25)

  • Removed the caching feature.

1.0b2 (2017-06-16)

  • Added helper function for getting cache hash keys of URLs.

1.0b1 (2017-04-26)

  • Added optional value transformations.

  • Added support for custom reducer callables.

  • Added command-line option for scraping documents from local files.

1.0a2 (2017-04-04)

  • Added support for Python 2.7.

  • Fixed lxml support.

1.0a1 (2016-08-24)

  • First release on PyPI.

Download files

Source distribution: piculet-1.0b7.tar.gz (32.8 kB)

Built distribution: piculet-1.0b7-py2.py3-none-any.whl (13.9 kB, Python 2 and Python 3)
