
Local version of ScraperWiki libraries


<h2>ScraperWiki Python library</h2>

[![Build Status](https://travis-ci.org/scraperwiki/scraperwiki-python.png)](https://travis-ci.org/scraperwiki/scraperwiki-python)

<p>This is a Python library for scraping web pages and saving data.
It is the easiest way to save data on the ScraperWiki platform, and it
can also be used locally or on your own servers.
</p>

<h3>Installing</h3>

<code>pip install scraperwiki</code>

<h3>Scraping</h3>

<dl>
<dt>scraperwiki.<strong>scrape</strong>(url[, params][, user_agent])</dt>
<dd><p>Downloads the given <var>url</var> and returns the response body as a string.</p>
<p>If <var>params</var> is set, it is sent as the body of a POST request.</p>
<p>If <var>user_agent</var> is provided, it is used as the user-agent string.</p>
</dd>
</dl>
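<p>The POST and user-agent behaviour described above can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: <code>build_request</code> is a hypothetical helper, and the URLs and user-agent value are placeholders. It only prepares the request, so nothing is sent over the network.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, user_agent=None):
    """Hypothetical helper: prepare a request the way the text describes."""
    # Supplying a body is what turns a urllib request into a POST.
    data = urlencode(params).encode("utf-8") if params else None
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, data=data, headers=headers)


get_req = build_request("http://example.com/")
post_req = build_request(
    "http://example.com/search",
    params={"q": "kittens"},
    user_agent="my-scraper/0.1",
)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'q=kittens'
# Passing either request to urllib.request.urlopen() would download the page.
```

<p>With the library installed, the equivalent call would be <code>scraperwiki.scrape(url, params, user_agent)</code>, following the signature documented above.</p>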


<h3><span id="sql"></span>Saving data</h3>

<p>Helper functions for saving and querying an SQL database. Updates
the schema automatically according to the data you save.</p>

<p>Only SQLite is currently supported; the library creates a local SQLite
database. It is built on the Python module dumptruck. Support for
other SQL databases is expected at a later date.</p>
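<p>To make "updates the schema automatically" concrete, here is a rough sketch of the idea using only the standard <code>sqlite3</code> module. It is not how the library or dumptruck is implemented: <code>ensure_columns</code> and <code>insert_record</code> are hypothetical helpers, the records are assumed to be non-empty dicts, and column types are left undeclared for brevity.</p>

```python
import sqlite3


def ensure_columns(conn, table_name, record):
    """Hypothetical helper: create the table, then add any missing columns."""
    first_col = next(iter(record))
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ("{first_col}")')
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table_name}")')}
    for column in record:
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table_name}" ADD COLUMN "{column}"')


def insert_record(conn, table_name, record):
    """Hypothetical helper: grow the schema if needed, then insert the record."""
    ensure_columns(conn, table_name, record)
    columns = ", ".join(f'"{c}"' for c in record)
    marks = ", ".join("?" for _ in record)
    conn.execute(
        f'INSERT INTO "{table_name}" ({columns}) VALUES ({marks})',
        list(record.values()),
    )


conn = sqlite3.connect(":memory:")
insert_record(conn, "swdata", {"name": "Ada"})
insert_record(conn, "swdata", {"name": "Grace", "born": 1906})  # adds "born"

columns = [row[1] for row in conn.execute('PRAGMA table_info("swdata")')]
rows = conn.execute("SELECT name, born FROM swdata ORDER BY name").fetchall()
print(columns)
print(rows)
```

<p>The second record introduces a key the table has not seen, so a column is added before the insert; the earlier row simply has no value for it.</p>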

<dl>
<dt>scraperwiki.<strong>sql.save</strong>(unique_keys, data[, table_name="swdata"])</dt>
<dd><p>Saves a data record into the table given
by <var>table_name</var> in the datastore.</p>
<p><var>data</var> is a dict object with field names as keys;
<var>unique_keys</var> is a subset
of data.keys() that determines when a record is
overwritten.</p>
<p>For large numbers of records <var>data</var> can be a list of dicts.</p>
</dd>

<dt>scraperwiki.<strong>sql.execute</strong>(sql[, vars])</dt>
<dd><p>Executes an arbitrary SQL command, for
example CREATE, DELETE, INSERT or DROP.</p>
<p><var>vars</var> is an optional list of parameters, inserted when
the SQL command contains &lsquo;?&rsquo;s.
For example:</p>
<code>scraperwiki.sql.execute("INSERT INTO swdata VALUES (?,?,?)", [a,b,c])</code>
<p>The &lsquo;?&rsquo; convention is like "paramstyle qmark"
from <a href="http://www.python.org/dev/peps/pep-0249/">Python's
DB API 2.0</a> (but note that the API to the datastore is
nothing like Python's DB API). In particular the
&lsquo;?&rsquo; does not itself need quoting, and can in
general only be used where a literal would appear.
</p>
</dd>

<dt>scraperwiki.<strong>sql.select</strong>(sqlfrag[, vars])</dt>
<dd><p>Executes a select command on the datastore. For example:</p>
<code>scraperwiki.sql.select("* FROM swdata LIMIT 10")</code>
<p>Returns a list of dicts that have been selected.</p>
<p><var>vars</var> is an optional list of parameters, inserted when the
select command contains &lsquo;?&rsquo;s. This is like the
feature in the <var>.execute</var> command, above.</p>
</dd>

<dt>scraperwiki.<strong>sql.commit</strong>()</dt>
<dd>Commits to the database file after a series of <var>execute</var> commands. (<var>sql.save</var> commits automatically after every action.)
</dd>

<dt>scraperwiki.<strong>sql.show_tables</strong>([dbname])</dt>
<dd>Returns an array of tables and their schemas in the current database.</dd>

<dt>scraperwiki.<strong>sql.table_info</strong>(name)</dt>
<dd>Returns an array of attributes for each element of the table.</dd>

<dt>scraperwiki.<strong>sql.save_var</strong>(key, value)</dt>
<dd>Saves a single arbitrary value into a table called
<var>swvariables</var>. Intended for storing scraper state so that
a scraper can continue after an interruption.
</dd>

<dt>scraperwiki.<strong>sql.get_var</strong>(key[, default])</dt>
<dd>Retrieves a single value that was saved by <var>save_var</var>.
Only works for string, float, or int types.
For anything else, use the <a
href="http://docs.python.org/library/pickle.html">pickle
library</a> to turn it into a string.
</dd>
</dl>
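<p>The overwrite rule for <var>unique_keys</var> and the &lsquo;?&rsquo; parameter convention can be tried out with plain <code>sqlite3</code>. The sketch below is an illustration under assumptions, not the library's code: the <code>save</code> and <code>select</code> functions are hypothetical stand-ins, the table layout is fixed by hand, and a unique index on <code>id</code> plays the role of <code>unique_keys=["id"]</code>.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict

conn.execute("CREATE TABLE swdata (id INTEGER, name TEXT, score REAL)")
# A unique index over the key columns plays the role of unique_keys=["id"].
conn.execute("CREATE UNIQUE INDEX swdata_keys ON swdata (id)")


def save(record):
    """Hypothetical stand-in for sql.save(["id"], record)."""
    conn.execute(
        "INSERT OR REPLACE INTO swdata (id, name, score) VALUES (?, ?, ?)",
        [record["id"], record["name"], record["score"]],
    )
    conn.commit()  # mirrors the auto-commit described for sql.save


def select(sqlfrag, params=()):
    """Hypothetical stand-in for sql.select: prepend SELECT, return dicts."""
    return [dict(row) for row in conn.execute("SELECT " + sqlfrag, params)]


save({"id": 1, "name": "Ada", "score": 9.5})
save({"id": 2, "name": "Grace", "score": 8.0})
save({"id": 1, "name": "Ada Lovelace", "score": 9.9})  # same key: overwrites

# Each '?' stands where a literal would appear and needs no quoting.
result = select("* FROM swdata WHERE score > ? ORDER BY id", [8.5])
print(result)
```

<p>Three saves leave two rows, because the third shares its <code>id</code> with the first and replaces it.</p>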

<h3>Miscellaneous</h3>
<dl>

<dt>scraperwiki.<strong>status</strong>(type, message=None)</dt>
<dd>If run on the ScraperWiki platform (the new one, not Classic), updates
the visible status of the dataset. If not on the platform, does nothing.
<var>type</var> can be 'ok' or
'error'. If no <var>message</var> is given, it will show the time since the update.
See <a href="https://scraperwiki.com/help/developer#boxes-status">dataset
status API</a> in the documentation for details.
</dd>

<dt>scraperwiki.<strong>pdftoxml</strong>(pdfdata, options='')</dt>
<dd>Converts a byte string containing a PDF file into an XML file containing the coordinates
and font of each text string.<br/><var>options</var> is an optional string of options to be passed to the underlying pdftohtml command (see <a href="http://linux.die.net/man/1/pdftohtml">the pdftohtml documentation</a> for details).<br/>
Refer to
<a href="https://scraperwiki.com/scrapers/new/python?template=advanced-scraping-pdfs">the example</a>
for more details.
</dd>

</dl>
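<p>Once a PDF has been converted, the coordinates and font of each text string can be read with the standard XML parser. The following is a sketch under assumptions: the sample document is hand-written rather than real output, its element and attribute names (<code>page</code>, <code>text</code>, <code>top</code>, <code>left</code>, <code>font</code>) are assumed to follow the layout that pdftohtml's XML mode produces, and <code>extract_text_boxes</code> is a hypothetical helper.</p>

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of pdftohtml XML; not real output.
SAMPLE_XML = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="100" left="72" width="200" height="14" font="0">Annual report</text>
    <text top="130" left="72" width="340" height="14" font="0">Total: <b>42</b></text>
  </page>
</pdf2xml>"""


def extract_text_boxes(xml_string):
    """Hypothetical helper: list each text string with its position and font."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "font": text.get("font"),
                # itertext() also collects text inside nested tags such as <b>.
                "text": "".join(text.itertext()),
            })
    return boxes


boxes = extract_text_boxes(SAMPLE_XML)
for box in boxes:
    print(box)
```

<p>In a real scraper the string passed to the helper would be the return value of <code>scraperwiki.pdftoxml(pdfdata)</code> instead of the sample.</p>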


<p>Source distribution: scraperwiki-0.3.8.tar.gz (6.3 kB).</p>
