Command-line utility for easy scraping of HTML documents

=====================================
**screp**, easy command-line scraping
=====================================


What is screp?
==============

**screp** is a command-line utility that provides easy and flexible scraping
of HTML documents. It works by finding a set of *anchors* (specified using a
CSS selector) and then extracting information relative to those anchors,
optionally post-processing it using a set of standard operations. For each
anchor it outputs a record formatted according to one of the supported formats
(CSV, JSON or general).


Invoking screp
==============

**screp** is invoked using the following syntax::

    $ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]

where:

* FORMAT_SPEC is a format specification, one of:

  - *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row
  - *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole
    output as a list of JSON objects
  - *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format
    in which term specifications are replaced by their computed values (similar
    to bash parameter substitution)

* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as
  detailed below
* FILES are local files or absolute URLs; if no FILES are specified,
  standard input is read


How does screp work?
====================

**screp** tries to automate many of the steps taken when writing your own
scraper, steps like:

* fetching the HTML documents, if necessary
* parsing HTML
* locating areas of interest in the DOM of the document
* locating interesting information around those areas
* simple processing of these pieces of information
* formatting of the information
* outputting the information

To use screp, you need to take a series of steps:

* tell screp where to get the HTML documents from; it works with multiple
  documents, from sources such as the web, the local file system or STDIN
* define the *primary anchor* using a CSS selector: the matching elements are
  the ones through which you access records of interest in the HTML documents
* specify the output format; this implies specifying:

  - *terms*, which are strings computed relative to the anchors
  - how these terms are combined to produce a record; currently screp supports
    three methods of specifying formats:

    - CSV
    - JSON
    - general format

* optionally, you can also define *secondary anchors*, which are elements
  computed relative to the *primary anchor* that can be used to define *terms*
  in a more succinct way

Defining terms
==============

A *term* has the following format::

    anchor.accessor.accessor.accessor|filter|filter|filter

In other words, a term is an anchor (primary or secondary) followed by zero or
more accessors followed by zero or more filters.

*Accessors* and *filters* (also collectively called *actions*) are functions
that take the output value of the last function (or the anchor, if this is the
first action) and output another value. In other words, they form a pipeline.
Accessors act on DOM elements and sets (actually ordered lists) of elements,
whereas filters act on strings. Each action has an in_type and an out_type. For
a term to be correctly defined the out_type of an action needs to match the
in_type of the following action.

The supported types are: 'string', 'element', 'element_set'.
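
The type-matching rule can be sketched as a small check over a chain of action
names. This is only an illustration: the ``ACTION_TYPES`` table and the
``term_out_type`` function are hypothetical names, the table copies a handful
of the signatures documented below, and nothing here reflects how screp is
implemented internally.

```python
# Hypothetical type table for a few of the predefined actions:
# name -> (in_type, out_type). It mirrors the documented signatures,
# not screp's internal data structures.
ACTION_TYPES = {
    "parent": ("element", "element"),
    "desc": ("element", "element_set"),
    "first": ("element_set", "element"),
    "text": ("element", "string"),
    "attr": ("element", "string"),
    "upper": ("string", "string"),
    "trim": ("string", "string"),
}


def term_out_type(actions, anchor_type="element"):
    """Return the out_type of a chain of action names, or raise TypeError."""
    current = anchor_type
    for name in actions:
        in_type, out_type = ACTION_TYPES[name]
        if in_type != current:
            raise TypeError(
                f"{name!r} expects {in_type!r} but received {current!r}"
            )
        current = out_type
    return current


# '$.parent.parent.attr(title)|upper' type-checks and yields a string.
print(term_out_type(["parent", "parent", "attr", "upper"]))

# '$.first' does not: 'first' needs an element_set, but an anchor is an element.
try:
    term_out_type(["first"])
except TypeError as exc:
    print("rejected:", exc)
```

Walking the chain once from left to right is enough, because each action only
depends on the out_type of the action immediately before it.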

Actions can have zero or more parameters. When the action takes parameters it
is specified as a function::

    action(parameter1, parameter2, parameter3)

When not, only the action name is specified (no parentheses).
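
To make the shape of a term concrete, here is a rough sketch of splitting a
term string into its anchor, accessors and filters. The function names
``split_top_level`` and ``parse_term`` are made up for this example, and the
quoting rules (separators inside quotes or parentheses are ignored, no escape
sequences) are an assumption; screp's real parser may behave differently.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators inside quotes or parentheses."""
    parts, current, depth, quote = [], [], 0, None
    for ch in text:
        if quote:
            # Inside a quoted parameter: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_term(term):
    """Return (anchor, accessors, filters) for a term string."""
    head, *filters = split_top_level(term, "|")
    anchor, *accessors = split_top_level(head, ".")
    return anchor, accessors, filters


print(parse_term('$.parent.desc(".price").text | trim'))
# ('$', ['parent', 'desc(".price")', 'text'], ['trim'])
```

Tracking quotes and parentheses matters because CSS selectors passed as
parameters routinely contain ``.`` characters, which would otherwise be
mistaken for accessor separators.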

Finally, terms have restrictions on the out_type of their last action (also
called the out_type of the term):

* if a term is used inside a format specification, its out_type must be
  'string'
* if a term is used to define a secondary anchor, its out_type must be
  'element'

Examples of terms
-----------------

These are correct term definitions::

    '$.parent.parent.attr(title)|upper' outputs 'string'
    '@.desc(".record").first' outputs 'element'
    'anchor.ancestors(".box").children(".price")' outputs 'element_set'

Predefined anchors and actions
==============================

The following anchors are predefined:

* **$** is the primary anchor defined by the primary anchor selector
* **@** is the anchor representing the root of the current document

The following accessors are predefined:

* **first** [in_type='element_set', out_type='element']: returns the first
  element in an element_set
* **last** [in_type='element_set', out_type='element']: returns the last
  element in an element_set
* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th
  element in an element_set; it also supports negative indexes, where -1
  represents the last element, -2 the second-to-last element, and so on
* **class** [in_type='element', out_type='string']: returns the value of the
  'class' attribute
* **id** [in_type='element', out_type='string']: returns the value of the 'id'
  attribute
* **parent** [in_type='element', out_type='element']: returns the parent of
  the current element
* **text** [in_type='element', out_type='string']: returns the text enclosed by
  the current element
* **tag** [in_type='element', out_type='string']: returns the tag of the
  current element
* **attr(attr_name)** [in_type='element', out_type='string']: returns the value
  of the current element's attribute with name 'attr_name'
* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the
  ordered list of descendants of the current element selected by the CSS
  selector specified by 'css_sel'
* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to
  .desc(css_sel).first
* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns
  the list of ancestors of the current element that satisfy the CSS selector
  specified by 'css_sel'
* **children(css_sel)** [in_type='element', out_type='element_set']: returns
  the list of children of the current element that satisfy the CSS selector
  specified by 'css_sel'
* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns
  the list of preceding siblings of the current element that satisfy the CSS
  selector specified by 'css_sel'
* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns
  the list of following siblings of the current element that satisfy the CSS
  selector specified by 'css_sel'
* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns
  the list of siblings of the current element that satisfy the CSS selector
  specified by 'css_sel'
* **matching(css_sel)** [in_type='element_set', out_type='element_set']:
  filters an element_set and returns all elements that match the CSS selector
  specified by 'css_sel'
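
The negative-index convention described for **nth** is the same one Python
uses for sequences, which the short sketch below illustrates on a plain list
standing in for an element_set. The helper ``nth_from_end`` is hypothetical.
Whether non-negative indexes start at 0 or 1 is not stated here, so the sketch
deliberately covers negative values only.

```python
# A stand-in for an element_set: an ordered list of tag names.
element_set = ["h1", "p", "p", "ul", "footer"]


def nth_from_end(items, n):
    """Return the item at negative index n (-1 is the last item)."""
    if n >= 0:
        raise ValueError("this sketch only covers negative indexes")
    return items[n]


print(nth_from_end(element_set, -1))  # footer
print(nth_from_end(element_set, -2))  # ul
```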

The following filters are predefined:

* **upper** [in_type='string', out_type='string']: converts string to uppercase
* **lower** [in_type='string', out_type='string']: converts string to lowercase
* **trim** [in_type='string', out_type='string']: removes spaces at the
  beginning and end of the string
* **strip(chars)** [in_type='string', out_type='string']: removes characters
  specified by 'chars' at the beginning and end of the string
* **replace(old, new)** [in_type='string', out_type='string']: replaces all
  occurrences of 'old' with 'new'
* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a
  regular expression substitution; *pattern* and *repl* have the formats
  taken by the **re.sub** function from the standard Python library

Specifying output formats
=========================

CSV format
----------

The CSV output format is specified using the -c option. Optionally, using the
-H option you can specify a CSV header to output before outputting records.

Example::

    -c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'
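
For orientation, this is roughly the kind of output a header plus a few
records produces when written with Python's ``csv`` module. The record values
are invented, and the quoting shown is that of ``csv.writer`` with its default
settings; the quoting rules screp applies are not specified here.

```python
import csv
import io

# Invented sample data: one header row and two records.
header = ["name", "price"]
records = [["Widget", "9.99"], ["Gadget, large", "19.50"]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)     # plays the role of the -H header
writer.writerows(records)   # one row per anchor

csv_text = buffer.getvalue()
print(csv_text, end="")
# name,price
# Widget,9.99
# "Gadget, large",19.50
```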


JSON format
-----------

The JSON output format is defined using the -j option. It formats the output as
a JSON list of objects, one for each record. The *--indent-json* flag tells
screp to indent each object. The format is specified as a comma-separated list
of *key=value* pairs, where the *key* represents the JSON key in the record
object while *value* is a term specification.

Example::

    -j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'
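
The resulting structure is a list with one object per record, keyed by the
names given in the format specification. The sketch below builds such a list
by hand with invented values and serialises it both compactly and indented.
The exact layout screp produces with *--indent-json* is not documented here,
so the ``indent=2`` setting is an assumption.

```python
import json

# Invented values for the keys used in the example above.
records = [
    {"text": "b", "ptext": "A B", "gptext": "a b c"},
    {"text": "y", "ptext": "X Y", "gptext": "x y z"},
]

compact = json.dumps(records)
indented = json.dumps(records, indent=2)  # assumed indentation width

print(compact)
print(indented)
```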


General format
--------------

The general format is specified by a general string containing term
specifications. To distinguish them from the surrounding literal text, each
term specification is surrounded by braces. When formatting a record, each term
specification is substituted with the computed value for that term.

Example::

    -f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some
    tail'
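
The substitution step can be approximated with a regular expression that finds
each brace-delimited term and replaces it with a computed value. In this
sketch ``render`` is a hypothetical helper, the computed values come from a
hard-coded dictionary instead of a real document, and terms are assumed not to
contain braces themselves.

```python
import re


def render(format_spec, evaluate):
    """Replace each {term} in format_spec with evaluate(term)."""
    return re.sub(
        r"\{([^{}]*)\}",
        lambda match: evaluate(match.group(1).strip()),
        format_spec,
    )


# Invented stand-ins for the values screp would compute for one anchor.
values = {
    '$.parent.text | replace("X", "Y")': "Y-ray",
    "$.tag": "span",
}

spec = 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
line = render(spec, values.__getitem__)
print(line)  # some header Y-ray some middle span some tail
```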


Specifying secondary anchors
============================

Secondary anchors are specified using the -a option. There can be any number of
secondary anchor definitions. The definitions have the format
**<name>=<term>** where <name> is an identifier and <term> is a term definition
relative to any of the previously defined anchors (primary or secondary) that
outputs an element. Secondary anchors can be redefined in later -a options,
but only the last definition is retained.

Secondary anchor examples
-------------------------

These are examples of secondary anchor definitions::

    -a 'p=$.parent' -a 'gp=p.parent'

    -a 'interesting=$.fdesc(".interesting-class")' -a
    'interesting=interesting.parent'
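
One way to read these definitions is as an ordered list in which each new
anchor is expressed through an anchor that already exists, and a later
definition of the same name replaces the earlier one. The sketch below follows
that reading by rewriting every secondary anchor as a term on ``$`` or ``@``.
The ``expand_anchors`` function is hypothetical, and treating a redefinition
as able to refer to its own previous value is an interpretation of the second
example, not documented behaviour.

```python
def expand_anchors(definitions):
    """Map each secondary anchor name to its term rewritten on '$' or '@'."""
    known = {"$": "$", "@": "@"}
    for definition in definitions:
        name, term = (part.strip() for part in definition.split("=", 1))
        # The anchor a term starts from is everything before the first dot.
        head, dot, rest = term.partition(".")
        if head not in known:
            raise KeyError(f"anchor {head!r} used before its definition")
        # A later definition of the same name overwrites the earlier one.
        known[name] = known[head] + dot + rest
    return {k: v for k, v in known.items() if k not in ("$", "@")}


print(expand_anchors(["p=$.parent", "gp=p.parent"]))
# {'p': '$.parent', 'gp': '$.parent.parent'}

print(expand_anchors([
    'interesting=$.fdesc(".interesting-class")',
    "interesting=interesting.parent",
]))
# {'interesting': '$.fdesc(".interesting-class").parent'}
```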
