Skip to main content

A DSL for extracting data from a web page.

Project description

A DSL for extracting data from a web page. The DSL serves two purposes: finds elements and extracts their text or attribute values. The main reason for developing this is to have all the CSS selectors for scraping a site in one place.

The DSL wraps PyQuery.

Example

Given the following take template:

$ h1 | text
    save: h1_title
$ ul
    save each: uls
        $ li
            | 0 [title]
                save: title
            | 1 text
                save: second_li
$ p | 1 text
    save: p_text

And the following HTML:

<div>
    <h1>Le Title 1</h1>
    <p>Some body here</p>
    <h2 class="second title">The second title</h2>
    <p>Another body here</p>
    <ul id="a">
        <li title="a less than awesome title">A first li</li>
        <li>A second li</li>
        <li>A third li</li>
    </ul>
    <ul id="b">
        <li title="some awesome title">B first li</li>
        <li>B second li</li>
        <li>B third li</li>
    </ul>
</div>

The following data will be extracted (presented in JSON format):

{
    "h1_title": "Le Title 1",
    "p_text": "Another body here",
    "uls": [
        {
            "title": "a less than awesome title",
            "second_li": "A second li"
        },
        {
            "title": "some awesome title",
            "second_li": "B second li"
        }
    ]
}

Take templates always result in a single python dict.

For a more complex example, see the reddit sample.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

take-0.0.0.tar.gz (8.9 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page