hepcrawl

Scrapy project for feeds into INSPIRE-HEP (http://inspirehep.net).

Project description

HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular from major and minor publishers in the field of High-Energy Physics.

The project is currently in an early stage of development.

Installation for developers

We start by creating a virtual environment for our Python packages (the mkvirtualenv and cdvirtualenv commands are provided by virtualenvwrapper):

mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src

Now we grab the code and install it in development mode:

git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .

Development mode ensures that any changes you make to the sources are picked up automatically, so there is no need to reinstall after changing something.
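One way to confirm that the development install is the one Python actually uses is to check where the package is imported from. The following is a minimal sketch, assuming the package imports under the name hepcrawl; the helper name package_location is hypothetical and not part of the project.

```python
import importlib


def package_location(name):
    """Return the file Python imports `name` from, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Built-in modules have no __file__, so fall back to None for them.
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    # After `pip install -e .` this is expected to point into the
    # src/hepcrawl checkout rather than into site-packages.
    print(package_location("hepcrawl"))
```

If the printed path lies inside site-packages instead of your checkout, the package was probably installed in regular mode rather than development mode.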

Finally, run the tests to make sure everything is set up correctly:

python setup.py test

Run example crawler

Thanks to the command-line tools provided by Scrapy, we can easily test the spiders as we are developing them. Here is an example that runs the arXiv spider against a sample record from the test suite:

cdvirtualenv src/hepcrawl
scrapy crawl arXiv -a source_file=file://`pwd`/tests/responses/arxiv/sample_arxiv_record.xml
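The command above passes the sample record to the spider as a file:// URI built from the current directory. The same argument list can be assembled from Python, for instance when scripting several test runs. This is only a sketch: the helper names source_file_uri and crawl_command are hypothetical, and the path handling assumes a POSIX system, as the shell version does.

```python
import os


def source_file_uri(relative_path, base_dir=None):
    """Build a file:// URI for a path relative to base_dir (default: current directory)."""
    base_dir = base_dir or os.getcwd()
    absolute = os.path.abspath(os.path.join(base_dir, relative_path))
    # Mirrors file://`pwd`/... from the shell example; no percent-encoding is applied.
    return "file://" + absolute


def crawl_command(spider, relative_path, base_dir=None):
    """Return the argument list for `scrapy crawl <spider> -a source_file=<uri>`."""
    uri = source_file_uri(relative_path, base_dir)
    return ["scrapy", "crawl", spider, "-a", "source_file=" + uri]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.call() to run it.
    print(" ".join(crawl_command(
        "arXiv", "tests/responses/arxiv/sample_arxiv_record.xml")))
```

Returning a list rather than a single string keeps the command safe to hand to subprocess without shell quoting.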

Thanks for contributing!

Changes

Version 0.2.0 (2016-06-02)

11 new spiders, including arXiv, APS, Base OAI source, Elsevier and many more.
Updated HEPRecord data items to conform with updates to INSPIRE data model.
Reorganization of loaders to have one place for input and output processing of metadata.
New pipelines for pushing content crawled to INSPIRE servers.
Better error handling and reporting, including support for Sentry.

Version 0.1.0 (2015-10-26)

Initial commit

Release history

13.0.72

Oct 27, 2023

13.0.71

Oct 26, 2023

13.0.70

Oct 19, 2023

13.0.69

Oct 18, 2023

13.0.68

Oct 11, 2023

13.0.67

Sep 20, 2023

13.0.66

Sep 19, 2023

13.0.65

Sep 18, 2023

13.0.64

Aug 11, 2023

13.0.63

Aug 4, 2023

13.0.58

May 23, 2023

13.0.57

Mar 21, 2023

13.0.56

Mar 15, 2023

13.0.55

Mar 14, 2023

13.0.54

Feb 9, 2023

13.0.53

Jan 24, 2023

13.0.52

Nov 11, 2022

13.0.51

Oct 27, 2022

13.0.50

Sep 21, 2022

13.0.49

Aug 26, 2022

13.0.48

Aug 10, 2022

13.0.47

Dec 7, 2021

13.0.46

Nov 25, 2021

13.0.45

Nov 18, 2021

13.0.44

Nov 11, 2021

13.0.43

Oct 11, 2021

13.0.42

Oct 1, 2021

13.0.41

Aug 31, 2021

13.0.40

Aug 27, 2021

13.0.39

Aug 12, 2021

13.0.38

Aug 11, 2021

13.0.37

Aug 6, 2021

13.0.36

Aug 3, 2021

13.0.35

Jul 12, 2021

13.0.34

Jun 7, 2021

13.0.33

May 12, 2021

13.0.32

May 7, 2021

13.0.31

Apr 21, 2021

13.0.30

Apr 20, 2021

13.0.29

Apr 6, 2021

13.0.26

Feb 4, 2021

13.0.25

Feb 2, 2021

13.0.24

Feb 2, 2021

13.0.21

Oct 28, 2020

13.0.20

Oct 26, 2020

13.0.19

Oct 23, 2020

13.0.18

Oct 23, 2020

13.0.17

Oct 22, 2020

13.0.16

Oct 21, 2020

13.0.14

Oct 14, 2020

13.0.13

Oct 13, 2020

13.0.12

Oct 6, 2020

13.0.11

Sep 9, 2020

13.0.10

Sep 1, 2020

13.0.9

Sep 1, 2020

13.0.8

May 6, 2020

13.0.7

Oct 31, 2019

13.0.6

Oct 3, 2019

13.0.5

Sep 30, 2019

13.0.4

Sep 27, 2019

13.0.3

Sep 26, 2019

13.0.2

Aug 12, 2019

13.0.1

Jul 24, 2019

13.0.0

May 27, 2019

12.0.14

May 15, 2019

12.0.13

May 15, 2019

12.0.12

Apr 17, 2019

12.0.11

Apr 16, 2019

12.0.7

Feb 19, 2019

12.0.6

Feb 13, 2019

12.0.5

Feb 12, 2019

12.0.4

Feb 12, 2019

12.0.3

Jan 24, 2019

12.0.2

Jan 23, 2019

12.0.1

Jan 7, 2019

12.0.0

Dec 4, 2018

11.1.5

Nov 14, 2018

11.1.4

Nov 2, 2018

11.1.3

Oct 19, 2018

11.1.2

Aug 30, 2018

11.1.1

Jun 6, 2018

11.1.0

Jun 1, 2018

11.0.4

May 28, 2018

11.0.3

May 28, 2018

11.0.2

May 17, 2018

11.0.1

May 15, 2018

11.0.0

May 8, 2018

10.0.9

May 3, 2018

10.0.8

May 3, 2018

10.0.7

Apr 23, 2018

10.0.6

Mar 13, 2018

10.0.5

Mar 8, 2018

10.0.3

Mar 1, 2018

10.0.2

Feb 28, 2018

10.0.1

Feb 27, 2018

10.0.0

Feb 27, 2018

9.0.13

Feb 21, 2018

9.0.12

Feb 20, 2018

9.0.11

Feb 20, 2018

9.0.10

Feb 14, 2018

9.0.9

Feb 13, 2018

9.0.8

Feb 13, 2018

9.0.7

Feb 8, 2018

9.0.6

Feb 1, 2018

9.0.5

Feb 1, 2018

9.0.4

Jan 31, 2018

9.0.3

Jan 23, 2018

9.0.2

Jan 23, 2018

9.0.1

Jan 23, 2018

9.0.0

Jan 21, 2018

8.0.0

Jan 21, 2018

7.2.0

Jan 19, 2018

7.1.0

Jan 19, 2018

7.0.0

Jan 16, 2018

6.0.0

Jan 15, 2018

5.0.3

Jan 8, 2018

5.0.2

Dec 13, 2017

5.0.1

Dec 6, 2017

5.0.0

Dec 5, 2017

4.0.4

Nov 29, 2017

4.0.3

Nov 21, 2017

4.0.2

Nov 2, 2017

4.0.1

Nov 1, 2017

4.0.0

Nov 1, 2017

3.0.16

Oct 20, 2020

3.0.15

Oct 20, 2020

3.0.1

Oct 23, 2017

3.0.0

Oct 20, 2017

2.1.3

Oct 19, 2017

2.1.2

Oct 4, 2017

2.1.1

Oct 4, 2017

2.1.0

Sep 20, 2017

2.0.2

Sep 20, 2017

2.0.1

Aug 25, 2017

2.0.0

Aug 24, 2017

1.0.11

Aug 24, 2017

1.0.10

Jul 20, 2017

1.0.9

Jul 4, 2017

1.0.8

Jul 4, 2017

1.0.7

Jul 3, 2017

1.0.6

Jun 30, 2017

1.0.5

Jun 30, 2017

1.0.4

Jun 30, 2017

1.0.2

Jun 29, 2017

1.0.1

Jun 20, 2017

1.0.0

Jun 15, 2017

0.3.22

Jun 13, 2017

0.3.21

May 22, 2017

0.3.20

May 22, 2017

0.3.19

May 22, 2017

0.3.18

May 18, 2017

0.3.17

May 15, 2017

0.3.16

May 11, 2017

0.3.15

May 11, 2017

0.3.14

May 11, 2017

0.3.13

May 11, 2017

0.3.12

May 10, 2017

0.3.10

May 8, 2017

0.3.9

May 8, 2017

0.3.8

May 8, 2017

0.3.7

May 5, 2017

0.3.6

Apr 25, 2017

0.3.4

Mar 21, 2017

0.3.3

Mar 21, 2017

0.3.2

Mar 20, 2017

0.3.1

Mar 8, 2017

0.2.49

Feb 10, 2017

0.2.48

Jan 25, 2017

0.2.46

Dec 5, 2016

0.2.45

Nov 15, 2016

0.2.44

Nov 8, 2016

0.2.43

Oct 26, 2016

0.2.42

Oct 25, 2016

0.2.0

Jun 2, 2016

0.1.0

Nov 12, 2015

0.0.50

Feb 13, 2017

Download files

Source Distribution

hepcrawl-0.2.0.tar.gz (1.2 MB), uploaded Jun 2, 2016

Hashes for hepcrawl-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`5b1c76c7c4f2ad4784c89f734eccfa4daece94c93ddab19df48f8cae61f0dc74`
MD5	`a5c1b9d9c891f413da0778fb9904a4e0`
BLAKE2b-256	`17db3e0933de5e5547394928f3d76cbd655e11cd8b7f933c6658e8a109f5ff1c`
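
The SHA256 digest listed above can be checked against a downloaded archive before installing it. Here is a minimal sketch using only the standard library; the helper names and the local file name are assumptions made for illustration.

```python
import hashlib
import os

# SHA256 digest of hepcrawl-0.2.0.tar.gz as listed in the table above.
EXPECTED_SHA256 = "5b1c76c7c4f2ad4784c89f734eccfa4daece94c93ddab19df48f8cae61f0dc74"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = "hepcrawl-0.2.0.tar.gz"  # assumed to be in the current directory
    if os.path.exists(archive):
        print("OK" if matches_expected(archive) else "MISMATCH")
```

Reading in chunks keeps memory use flat regardless of archive size.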