Command-line utility for easy scraping of HTML documents
==================================
screp, easy command-line scraping
==================================
What is screp?
==============
**screp** is a command-line utility that provides easy and flexible scraping of HTML documents. It
works by finding a set of *anchors* (specified using a CSS selector) and then extracting information
relative to those anchors, optionally post-processing it using a set of standard operations. For each
anchor it outputs a record formatted according to one of the supported formats (CSV, JSON or
general).
Invoking screp
==============
**screp** is invoked using the following syntax::

    $ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]

where:
* FORMAT_SPEC is a format specification, one of:

  - *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row
  - *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole output as a list of
    JSON objects
  - *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format where computed
    values are substituted for their specifications (similar to bash parameter substitution)

* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as detailed below
* FILES are local files or absolute URLs; if none are specified, standard input is read
How does screp work?
====================
**screp** tries to automate many of the steps taken when writing your own scraper, steps like:
* fetching the HTML documents, if necessary
* parsing HTML
* locating areas of interest in the DOM of the document
* locating interesting information around those areas
* simple processing of these pieces of information
* formatting of the information
* outputting the information
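For comparison, here is a minimal sketch of what a hand-written scraper covering those steps might look like, using only the Python standard library. The sample HTML, the ``PriceParser`` class and the ``scrape`` function are invented for illustration and are not part of screp; fetching is left out so the sketch stays self-contained.

```python
from html.parser import HTMLParser

# Hypothetical sample document, invented for this sketch.
HTML = """
<ul>
  <li class="item" title="apple"><span class="price"> 1.20 </span></li>
  <li class="item" title="pear"><span class="price"> 0.80 </span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects (title, price) pairs from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []       # finished (title, price) records
        self._title = None      # title of the <li class="item"> being read
        self._in_price = False  # True while inside <span class="price">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Locate the areas of interest and the information around them.
        if tag == "li" and attrs.get("class") == "item":
            self._title = attrs.get("title")
        elif tag == "span" and attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self._title is not None:
            # Simple processing: trim the surrounding whitespace.
            self.records.append((self._title, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def scrape(html):
    """Parse the document and return the list of extracted records."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Formatting and outputting the information.
for title, price in scrape(HTML):
    print(f"{title}: {price}")
```

Every change to the page layout means editing the parser class by hand, which is the kind of boilerplate a selector-plus-terms description is meant to replace.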
To use screp, you need to take a series of steps:
* tell screp where to get the HTML documents; it works with multiple documents, from sources such
  as the web, the local file system or STDIN
* define the *primary anchor* using a CSS selector: these are the elements through which you access
  records of interest in the HTML documents
* specify the output format; this implies specifying:

  - *terms*, which are strings computed relative to the anchors
  - how these terms are combined to produce a record; currently screp supports three methods of
    specifying formats:

    - CSV
    - JSON
    - general format

* optionally, you can also define *secondary anchors*, which are elements computed relative to the
  *primary anchor* that can be used to define *terms* in a more succinct way
Defining terms
==============
A *term* has the following format::
    anchor.accessor.accessor.accessor|filter|filter|filter

In other words, a term is an anchor (primary or secondary) followed by zero or more accessors
followed by zero or more filters.
*Accessors* and *filters* (also collectively called *actions*) are functions that take the output
value of the last function (or the anchor, if this is the first action) and output another value. In
other words, they form a pipeline. Accessors act on DOM elements and sets (actually ordered lists)
of elements, whereas filters act on strings. Each action has an in_type and an out_type. For a term
to be correctly defined the out_type of an action needs to match the in_type of the following
action.
The supported types are: 'string', 'element', 'element_set'.
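The type-matching rule can be illustrated with a short Python sketch. It is not screp's implementation: ``ACTION_TYPES`` and ``term_out_type`` are hypothetical names, the table lists only a few of the actions documented below, and anchors are assumed to have type 'element'.

```python
# (in_type, out_type) for a handful of actions described in this document.
ACTION_TYPES = {
    "parent": ("element", "element"),
    "desc": ("element", "element_set"),
    "first": ("element_set", "element"),
    "text": ("element", "string"),
    "attr": ("element", "string"),
    "upper": ("string", "string"),
    "trim": ("string", "string"),
}


def term_out_type(actions, anchor_type="element"):
    """Return the out_type of a term, or raise TypeError on a mismatch."""
    current = anchor_type
    for name in actions:
        in_type, out_type = ACTION_TYPES[name]
        if in_type != current:
            raise TypeError(f"{name} expects {in_type}, got {current}")
        current = out_type
    return current


print(term_out_type(["parent", "parent", "attr", "upper"]))  # string
print(term_out_type(["desc", "first"]))                      # element
```

A chain such as ``["desc", "text"]`` is rejected by this check, because ``desc`` outputs an 'element_set' while ``text`` expects an 'element'.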
Actions can have zero or more parameters. When the action takes parameters it is specified as a
function::
    action(parameter1, parameter2, parameter3)

When not, only the action name is specified (no parentheses).
Finally, terms have restrictions on the out_type of their last action (also called the out_type of
the term):
* if a term is used inside a format specification, its out_type must be 'string'
* if a term is used to define a secondary anchor, its out_type must be 'element'
Examples of terms
-----------------
These are correct term definitions::
    '$.parent.parent.attr(title)|upper' outputs 'string'
    '@.desc(".record").first' outputs 'element'
    'anchor.ancestors(".box").children(".price")' outputs 'element_set'

Predefined anchors and actions
==============================
The following anchors are predefined:
* **$** is the primary anchor defined by the primary anchor selector
* **@** is the anchor representing the root of the current document
The following accessors are predefined:
* **first** [in_type='element_set', out_type='element']: returns the first element in an element_set
* **last** [in_type='element_set', out_type='element']: returns the last element in an element_set
* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th element in an
element_set; it also supports negative indexes, where -1 represents the last element, -2 the
second-to-last element, and so on
* **class** [in_type='element', out_type='string']: returns the value of the 'class' attribute
* **id** [in_type='element', out_type='string']: returns the value of the 'id' attribute
* **parent** [in_type='element', out_type='element']: returns the parent of the current element
* **text** [in_type='element', out_type='string']: returns the text enclosed by the current element
* **tag** [in_type='element', out_type='string']: returns the tag of the current element
* **attr(attr_name)** [in_type='element', out_type='string']: returns the value of the current element's
attribute with name 'attr_name'
* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the ordered list of
descendants of the current element selected by the CSS selector specified by 'css_sel'
* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to
.desc(css_sel).first
* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns the list of ancestors
of the current element that satisfy the CSS selector specified by 'css_sel'
* **children(css_sel)** [in_type='element', out_type='element_set']: returns the list of children
of the current element that satisfy the CSS selector specified by 'css_sel'
* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of preceding
siblings of the current element that satisfy the CSS selector specified by 'css_sel'
* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of following
siblings of the current element that satisfy the CSS selector specified by 'css_sel'
* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of siblings of
the current element that satisfy the CSS selector specified by 'css_sel'
* **matching(css_sel)** [in_type='element_set', out_type='element_set']: filters an element_set and
returns all elements that match the CSS selector specified by 'css_sel'
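The negative-index convention of **nth** mirrors Python list indexing, so it can be sketched with a plain list standing in for an element_set. This is an illustration only: the text above does not say whether non-negative indexes start at 0 or 1, and the sketch simply borrows Python's zero-based behaviour as an assumption.

```python
def first(element_set):
    """First element of an ordered list of elements."""
    return element_set[0]


def last(element_set):
    """Last element of an ordered list of elements."""
    return element_set[-1]


def nth(element_set, n):
    """n-th element; negative n counts from the end (-1 is the last one).

    Assumption: non-negative n is zero-based here, as in Python.
    """
    return element_set[n]


# Strings stand in for DOM elements in this sketch.
rows = ["row-a", "row-b", "row-c", "row-d"]
print(nth(rows, -1))  # row-d, same as last(rows)
print(nth(rows, -2))  # row-c, the second-to-last element
```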
The following filters are predefined:
* **upper** [in_type='string', out_type='string']: converts string to uppercase
* **lower** [in_type='string', out_type='string']: converts string to lowercase
* **trim** [in_type='string', out_type='string']: removes spaces at the beginning and end of the
string
* **strip(chars)** [in_type='string', out_type='string']: removes characters specified by 'chars'
at the beginning and end of the string
* **replace(old, new)** [in_type='string', out_type='string']: replaces all occurrences of 'old' with
'new'
* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a regular expression
  substitution; *pattern* and *repl* have the formats taken by the **re.sub** function
  from the Python standard library
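As a rough guide to what each filter does, the sketch below maps them onto Python string operations and chains them the way a ``|`` pipeline would. The mapping is an assumption rather than screp's code; in particular, ``trim`` is modelled with ``str.strip()``, which removes any leading and trailing whitespace, not only spaces. ``apply_filters`` is a hypothetical helper.

```python
import re


def upper(s):
    return s.upper()


def lower(s):
    return s.lower()


def trim(s):
    return s.strip()


def strip(s, chars):
    return s.strip(chars)


def replace(s, old, new):
    return s.replace(old, new)


def resub(s, pattern, repl):
    # Same argument formats as re.sub, with the input string passed last.
    return re.sub(pattern, repl, s)


def apply_filters(value, filters):
    """Apply (function, *args) tuples left to right, like a '|' pipeline."""
    for func, *args in filters:
        value = func(value, *args)
    return value


# Roughly: | trim | strip("$") | replace(",", "") | resub("\.0+$", "")
price = apply_filters(
    "  $1,200.00 ",
    [(trim,), (strip, "$"), (replace, ",", ""), (resub, r"\.0+$", "")],
)
print(price)  # 1200
```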
Specifying output formats
=========================
CSV format
----------
The CSV output format is specified using the -c option. Optionally, using the -H option you can
specify a CSV header to output before outputting records.
Example::
    -c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'

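To show the shape of output a header plus one row per record produces, here is a small sketch using Python's ``csv`` module. The records and the ``write_csv`` helper are made up for illustration, and the quoting rules are those of the ``csv`` module, which may differ from what screp emits.

```python
import csv
import io


def write_csv(header, records):
    """Return CSV text: an optional header row, then one row per record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if header:
        writer.writerow(header)
    writer.writerows(records)
    return out.getvalue()


# Hypothetical records: one (name, price) pair per anchor.
records = [("Widget, large", "9.99"), ("Gadget", "4.50")]
print(write_csv(["name", "price"], records), end="")
```

Note that a value containing a comma has to be quoted for the row to stay parseable, which is why a CSV writer is preferable to joining values by hand.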
JSON format
-----------
The JSON output format is defined using the -j option. It formats the output as a JSON list of
objects, one for each record. The *--indent-json* flag tells screp to indent each object. The format
is specified as a comma-separated list of *key=value* pairs, where the *key* represents the JSON key
in the record object while *value* is a term specification.
Example::
    -j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'

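The resulting structure, a JSON list with one object per record whose keys come from the *key=value* pairs, can be sketched with Python's ``json`` module. The record values and the ``to_json`` helper are hypothetical, and the two-space indentation is an assumption about what an indented layout might look like.

```python
import json


def to_json(records, indent_json=False):
    """Serialize a list of record dicts, optionally indented."""
    return json.dumps(records, indent=2 if indent_json else None)


# Hypothetical record matching the keys of the example above.
records = [{"text": "new", "ptext": "BRAND NEW", "gptext": "a brand new item"}]

print(to_json(records))
print(to_json(records, indent_json=True))
```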
General format
--------------
The general format is specified using the -f option, as a string containing term specifications. To
distinguish them from the surrounding literal text, each term specification is surrounded by braces.
When formatting a record, each term specification is replaced with the computed value for that term.
Example::
    -f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'

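The substitution step can be sketched in a few lines of Python. This is only an illustration: ``render`` is a hypothetical helper, the computed values are supplied by hand in a dict keyed by the term text, and the sketch assumes that term specifications never contain braces themselves.

```python
import re

# A term specification is whatever sits between a pair of braces.
TERM = re.compile(r"\{([^{}]*)\}")


def render(fmt, values):
    """Replace each {term} in fmt with its precomputed value."""
    return TERM.sub(lambda m: values[m.group(1).strip()], fmt)


fmt = 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'

# Hypothetical computed values for one record.
values = {
    '$.parent.text | replace("X", "Y")': "Y-ray",
    "$.tag": "span",
}

print(render(fmt, values))  # some header Y-ray some middle span some tail
```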
Specifying secondary anchors
============================
Secondary anchors are specified using the -a option. There can be any number of secondary anchor
definitions. The definitions have the format **<name>=<term>** where <name> is an identifier and
<term> is a term definition, relative to any of the previously defined anchors (primary or
secondary), that outputs an element. Secondary anchors can be redefined in later -a options, but
only the last definition is retained.
Secondary anchor examples
-------------------------
These are examples of secondary anchor definitions::
    -a 'p=$.parent' -a 'gp=p.parent'
    -a 'interesting=$.fdesc(".interesting-class")' -a 'interesting=interesting.parent'

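The ordering rule (each definition may refer to anchors defined before it, and a later definition replaces an earlier one of the same name) can be sketched as follows. The ``Node`` class and ``define_anchors`` function are hypothetical, and only the ``parent`` accessor is supported, to keep the sketch short.

```python
class Node:
    """Toy stand-in for a DOM element: a name and a parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def define_anchors(primary, definitions):
    """Evaluate '<name>=<term>' definitions in order; later ones win."""
    anchors = {"$": primary}
    for definition in definitions:
        name, term = definition.split("=", 1)
        base, *accessors = term.split(".")
        element = anchors[base]  # KeyError if the base anchor is not defined yet
        for accessor in accessors:
            if accessor != "parent":
                raise ValueError(f"unsupported accessor in sketch: {accessor}")
            element = element.parent
        anchors[name] = element
    return anchors


# Hypothetical tree: html > body > div > a, with <a> as the primary anchor.
html = Node("html")
body = Node("body", html)
div = Node("div", body)
link = Node("a", div)

anchors = define_anchors(link, ["p=$.parent", "gp=p.parent"])
print(anchors["p"].name, anchors["gp"].name)  # div body

# Redefinition: the second 'x' is computed from the first, then replaces it.
anchors = define_anchors(link, ["x=$.parent", "x=x.parent"])
print(anchors["x"].name)  # body
```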