scrape

a web scraping tool

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 5 - Production/Stable
Environment
- Console
- Web Environment
Intended Audience
Programming Language

Project description

# scrape

## a web scraping tool
scrape is a command-line tool for extracting webpages as text or PDF files. Its crawler can scrape entire websites, and regexps can be used to filter both the links that are followed and the text that is kept. scrape is especially useful for converting online documentation to PDF, or just as a faster alternative to wget and grep!

## Installation
* `pip install scrape`
* [Installing wkhtmltopdf](https://github.com/pdfkit/pdfkit/wiki/Installing-WKHTMLTOPDF)
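
PDF output goes through pdfkit, which drives the `wkhtmltopdf` binary, so it can help to confirm the binary is reachable before using `--pdf`. A minimal sketch, assuming Python 3 and that the executable is installed under its usual name; the helper name `has_wkhtmltopdf` is made up for illustration and is not part of scrape:

```python
import shutil


def has_wkhtmltopdf(binary="wkhtmltopdf"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if has_wkhtmltopdf():
        print("wkhtmltopdf found on PATH")
    else:
        print("wkhtmltopdf not found; see the install link above")
```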

## Usage
    usage: scrape.py [-h] [-c [CRAWL [CRAWL ...]]] [-ca]
                     [-f [FILTER [FILTER ...]]] [-l LIMIT] [-p] [-s] [-v] [-vb]
                     [urls [urls ...]]

    a web scraping tool

    positional arguments:
      urls                  urls to scrape

    optional arguments:
      -h, --help            show this help message and exit
      -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                            keywords to crawl links by
      -ca, --crawl-all      crawl all links
      -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                            filter lines of text by keywords
      -l LIMIT, --limit LIMIT
                            set crawl page limit
      -p, --pdf             write to pdf instead of text
      -r, --restrict        restrict domain to that of the seed url
      -v, --version         display current version
      -vb, --verbose        print pdfkit log messages

## Author
* Hunter Hammond (huntrar@gmail.com)

## Notes
* --pdf saves web pages as PDFs; by default they are saved as text.

* Text can be filtered by passing one or more regexps to --filter.

* To crawl subsequent pages, pass --crawl followed by one or more regexps, or pass --crawl-all instead.

* To restrict crawling to the seed url's domain, use --strict; otherwise links to any domain may be followed.

* There is no limit on the number of pages crawled unless one is set with --limit. To cancel crawling and begin processing, press Ctrl-C.
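
The notes above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not scrape's actual code: the function names are hypothetical, and it assumes that a line or link is kept when it matches at least one of the given regexps (via `re.search`).

```python
import re
from urllib.parse import urlparse


def filter_lines(lines, patterns):
    """Keep lines matching at least one regexp (assumed semantics)."""
    regexps = [re.compile(p) for p in patterns]
    return [ln for ln in lines if any(r.search(ln) for r in regexps)]


def select_links(links, seed_url, crawl_patterns=None,
                 same_domain=False, limit=None):
    """Pick which links to follow.

    crawl_patterns=None stands in for crawling all links, same_domain
    for the domain restriction, and limit for the page limit.
    """
    seed_domain = urlparse(seed_url).netloc
    regexps = [re.compile(p) for p in (crawl_patterns or [])]
    selected = []
    for link in links:
        if limit is not None and len(selected) >= limit:
            break
        if same_domain and urlparse(link).netloc != seed_domain:
            continue
        if regexps and not any(r.search(link) for r in regexps):
            continue
        selected.append(link)
    return selected


if __name__ == "__main__":
    links = [
        "http://example.com/docs/intro",
        "http://example.com/blog/post",
        "http://other.org/docs/api",
    ]
    print(select_links(links, "http://example.com/", ["docs"]))
    print(select_links(links, "http://example.com/", ["docs"],
                       same_domain=True))
```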

News
====

0.1.8
------

- removed url fragments
- replaced set_base with urlparse method urljoin
- out_file name construction now uses urlparse 'path' member
- raw_links is now an OrderedSet to try to eliminate as much processing as possible
- added clear method to OrderedSet in utils.py
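
The url handling described in this release can be sketched with the standard library. This is an illustration under assumptions, not the code in scrape itself: it is written for Python 3, where `urljoin`, `urldefrag` and `urlparse` live in `urllib.parse` (on Python 2 they are in the `urlparse` module), and the helper names and the file naming scheme are hypothetical.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_link(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url


def out_file_name(url, extension="txt"):
    """Derive a file name from the url's path (hypothetical scheme)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    # Fall back to the host name when the path is empty.
    stem = posixpath.splitext(path)[0].replace("/", "-") or parsed.netloc
    return "{0}.{1}".format(stem, extension)


if __name__ == "__main__":
    url = resolve_link("http://example.com/docs/index.html",
                       "../about.html#team")
    print(url)
    print(out_file_name(url))
```

Stripping fragments before storing links means `page.html#a` and `page.html#b` are treated as the same page, which avoids fetching it twice.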

0.1.7
------

- removed validate_domain and replaced it with a lambda instead
- replaced domain with base_url in set_base as should have been done before
- crawled message no longer prints if url was a duplicate

0.1.6
------

- uncommented import __version__

0.1.5
------

- set_domain was replaced by set_base, proper solution for links that are relative
- fixed verbose behavior
- updated description in README

0.1.4
------

- fixed output file generation, was using domain instead of base_url
- minor code cleanup

0.1.3
------

- blank lines are no longer written to text unless as a page separator
- style tags now ignored alongside script tags when getting text
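
As an illustration of the two changes in this release, the sketch below extracts text while skipping the contents of `<script>` and `<style>` tags and dropping blank lines. It uses Python 3's standard `html.parser` as a stand-in; which HTML parser scrape actually uses is not stated here, and the names `TextExtractor` and `page_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        # Keep only non-blank lines of text.
        for line in data.splitlines():
            line = line.strip()
            if line:
                self.lines.append(line)


def page_text(html):
    """Return the visible text of an HTML document, one line per entry."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```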

0.1.2
------

- added shebang

0.1.1
------

- uncommented import __version__

0.1.0
------

- reformatting to conform with PEP 8
- added regexp support for matching crawl keywords and filter text keywords
- improved url resolution by correcting domains and schemes
- added --restrict option to restrict crawler links to only those with seed domain
- made text the default write option rather than pdf, can now use --pdf to change that
- removed page number being written to text, separator is now just a single blank line
- improved construction of output file name

0.0.11
------

- fixed missing comma in install_requires in setup.py
- also labeled now as beta as there are still some kinks with crawling

0.0.10
------

- pdfkit load errors are now ignored only when there is more than one link, to avoid creating an empty pdf if an error occurs

0.0.9
------

- pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

- better implementation of crawler, can now scrape entire websites
- added OrderedSet class to utils.py
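
For readers unfamiliar with the idea, an ordered set keeps items unique while remembering the order in which they were first added, which suits a crawler that should visit each link once, in discovery order. The sketch below is a minimal Python 3 illustration and is not the class from utils.py; its internals are an assumption.

```python
from collections import OrderedDict
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Set that remembers insertion order (illustrative only)."""

    def __init__(self, iterable=()):
        self._items = OrderedDict()
        for item in iterable:
            self.add(item)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        # Re-adding an existing item keeps its original position.
        self._items.setdefault(item, None)

    def discard(self, item):
        self._items.pop(item, None)

    def clear(self):
        self._items.clear()
```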

0.0.7
------

- changed --keywords to --filter and positional arg url to urls

0.0.6
------

- use --keywords flag for filtering text
- can pass multiple links now
- will not write empty files anymore

0.0.5
------

- added --verbose argument for use with pdfkit
- improved output file name processing

0.0.4
------

- accepts 0 or 1 urls, allowing a call with just --version

0.0.3
------

- Moved utils.py to scrape/

0.0.2
------

- First entry

Release history

0.11.3

Feb 20, 2022

0.11.2

Feb 20, 2022

0.11.1

Mar 24, 2021

0.11.0

Mar 19, 2021

0.10.2

Jan 8, 2021

0.10.1

Aug 24, 2020

0.10.0

Mar 12, 2020

0.9.15

Jan 5, 2019

0.9.14

Jan 5, 2019

0.9.12

Jan 10, 2017

0.9.11

Aug 23, 2016

0.9.10

Jun 26, 2016

0.9.9

Jun 24, 2016

0.9.8

Jun 24, 2016

0.9.6

Jun 23, 2016

0.9.5

Jun 23, 2016

0.9.4

Jun 23, 2016

0.9.3

Jun 23, 2016

0.9.2

Jun 20, 2016

0.9.1

Jun 20, 2016

0.9.0

Jun 18, 2016

0.8.11

Jun 16, 2016

0.8.10

Jun 16, 2016

0.8.9

Jun 16, 2016

0.8.8

Jun 10, 2016

0.8.7

Mar 30, 2016

0.8.6

Feb 17, 2016

0.8.5

Feb 4, 2016

0.8.4

Feb 4, 2016

0.8.3

Feb 4, 2016

0.8.2

Feb 2, 2016

0.8.1

Jan 30, 2016

0.8.0

Jan 30, 2016

0.7.9

Jan 23, 2016

0.7.8

Jan 22, 2016

0.7.7

Jan 22, 2016

0.7.6

Jan 5, 2016

0.7.5

Jan 2, 2016

0.7.4

Jan 2, 2016

0.7.3

Jan 2, 2016

0.7.2

Jan 2, 2016

0.7.1

Dec 19, 2015

0.7.0

Dec 7, 2015

0.6.9

Dec 6, 2015

0.6.8

Dec 5, 2015

0.6.7

Dec 5, 2015

0.6.6

Dec 5, 2015

0.6.5

Dec 4, 2015

0.6.4

Nov 28, 2015

0.6.3

Nov 26, 2015

0.6.2

Nov 24, 2015

0.6.1

Nov 23, 2015

0.6.0

Nov 23, 2015

0.5.9

Nov 19, 2015

0.5.8

Nov 19, 2015

0.5.7

Nov 10, 2015

0.5.6

Nov 10, 2015

0.5.5

Nov 10, 2015

0.5.4

Nov 8, 2015

0.5.3

Nov 8, 2015

0.5.2

Nov 8, 2015

0.5.1

Nov 8, 2015

0.5.0

Nov 8, 2015

0.4.6

Oct 30, 2015

0.4.5

Oct 29, 2015

0.4.4

Oct 28, 2015

0.4.3

Oct 28, 2015

0.4.2

Oct 20, 2015

0.4.1

Oct 20, 2015

0.4.0

Oct 19, 2015

0.3.9

Oct 15, 2015

0.3.8

Oct 15, 2015

0.3.7

Oct 12, 2015

0.3.6

Sep 17, 2015

0.3.5

Sep 16, 2015

0.3.4

Sep 15, 2015

0.3.3

Sep 15, 2015

0.3.2

Sep 15, 2015

0.3.1

Sep 15, 2015

0.3.0

Sep 15, 2015

0.2.10

Sep 13, 2015

0.2.9

Sep 11, 2015

0.2.8

Aug 13, 2015

0.2.7

Aug 5, 2015

0.2.6

Jul 25, 2015

0.2.5

Jul 25, 2015

0.2.4

Jul 20, 2015

0.2.3

Jul 19, 2015

0.2.2

Jul 19, 2015

0.2.1

Jul 16, 2015

0.2.0

Jul 15, 2015

0.1.10

Jul 13, 2015

0.1.9

Jul 13, 2015

0.1.8 (this version)

Jul 13, 2015

0.1.7

Jul 11, 2015

0.1.6

Jul 11, 2015

0.1.5

Jul 11, 2015

0.1.4

Jul 11, 2015

0.1.3

Jul 11, 2015

0.1.2

Jul 11, 2015

0.1.1

Jul 11, 2015

0.1.0

Jul 11, 2015

0.0.11

Jul 10, 2015

0.0.10

Jul 10, 2015

0.0.9

Jul 9, 2015

0.0.8

Jul 8, 2015

0.0.7

Jul 7, 2015

0.0.6

Jul 7, 2015

0.0.5

Jul 7, 2015

0.0.4

Jul 7, 2015

0.0.3

Jul 7, 2015

0.0.2

Jul 7, 2015

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

scrape-0.1.8-py2-none-any.whl (10.1 kB)

Uploaded Jul 13, 2015 Python 2

Hashes for scrape-0.1.8-py2-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `dde55e67648f3a9c938b70c6761d3f4182f52f12078c1ef306aed5fb1430aeaa` |
| MD5 | `9e6b6b4da023b7761f0265a7ac129cf9` |
| BLAKE2b-256 | `a50d120f757d9e0c5bb7c7b2d05242a496f48eed73df5de9a10691130bbfe2bf` |
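
A downloaded wheel can be checked against the SHA256 digest listed above. A minimal sketch using Python 3's `hashlib`; the function names and the local file path are assumptions for illustration.

```python
import hashlib

# SHA256 digest listed above for scrape-0.1.8-py2-none-any.whl.
EXPECTED_SHA256 = (
    "dde55e67648f3a9c938b70c6761d3f4182f52f12078c1ef306aed5fb1430aeaa"
)


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches expected_hex."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `verify("scrape-0.1.8-py2-none-any.whl")` should return `True` for an intact download saved under that name in the current directory.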