Scrapy project for feeding content into INSPIRE-HEP (http://inspirehep.net).

Project description

HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular content from major and minor publishers in the field of High-Energy Physics.

The project is currently in an early stage of development.

Installation for developers

We start by creating a virtual environment for our Python packages:

mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src

Now we grab the code and install it in development mode:

git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .

Development mode ensures that any changes you make to the sources are picked up automatically, so there is no need to reinstall after every change.

Finally, run the tests to make sure everything is set up correctly:

python setup.py test
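
If the suite is pytest-based (an assumption here; python setup.py test pulls in the test dependencies for you), you can also invoke the runner directly and select a subset of the tests by keyword:

py.test tests
py.test tests -k world_scientific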

Run example crawler

Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we develop them. Here is an example using the simple Sample spider:

cdvirtualenv src/hepcrawl
scrapy crawl Sample -a source_file=file://`pwd`/tests/responses/world_scientific/sample_ws_record.xml
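
For orientation, the spiders in this project follow the standard Scrapy spider pattern. The snippet below is a minimal, hypothetical sketch of such a spider, not the actual Sample spider: it accepts a source_file argument (as passed with -a above), requests the local XML file, and yields a few illustrative fields. The class name, XPaths, and field names are assumptions for demonstration only.

import scrapy


class MinimalXMLSpider(scrapy.Spider):
    # Hypothetical example spider; field names and XPaths are illustrative.
    name = 'minimal_xml'

    def __init__(self, source_file=None, *args, **kwargs):
        super(MinimalXMLSpider, self).__init__(*args, **kwargs)
        self.source_file = source_file

    def start_requests(self):
        # file:// URLs are handled by Scrapy's local file download handler.
        yield scrapy.Request(self.source_file)

    def parse(self, response):
        # Drop XML namespaces so the XPaths stay short; a real spider
        # would target the actual schema of the publisher feed.
        response.selector.remove_namespaces()
        for record in response.xpath('//article'):
            yield {
                'title': record.xpath('.//title/text()').extract_first(),
                'doi': record.xpath('.//doi/text()').extract_first(),
            }

To inspect what a crawl produces, you can also append Scrapy's -o option to the crawl command, for example -o items.json, to write the extracted items to a file.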

Thanks for contributing!

Changes

Version 0.1.0 (2015-10-26)

  • Initial commit


