Simple webscraper built on top of requests and beautifulsoup
Project description
A basic web scraper I use in many projects.
webscraper
Module to simplify common web requests and scraping tasks
Supports
Cached web requests (Wrapper around requests)
Built-in parsing/scraping (Wrapper around beautifulsoup)
Constructor parameters
url: Default url, used if nothing else specified
scheme: Default scheme for scraping
timeout
cache_directory: Where to save cache files
cache_time: How long a cached resource is valid, in seconds (default: 7 minutes)
cache_use_advanced
auth_method: Authentication method (default: HTTPBasicAuth)
auth_username: Authentication username. If set, enables authentication
auth_password: Authentication password
handle_redirect: Allow redirects (default: True)
user_agent: User agent to use
default_user_agents_browser: Browser to set in user agent (from default_user_agents dict)
default_user_agents_os: Operating system to set in user agent (from default_user_agents dict)
user_agents_browser: Browser to set in user agent (Overwrites default_user_agents_browser)
user_agents_os: Operating system to set in user agent (Overwrites default_user_agents_os)
html2text: HTML2text settings
html_parser: What html parser to use (default: html.parser - built in)
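Since the module wraps requests, several of the parameters above correspond naturally to keyword arguments of `requests.get`. The following is a minimal sketch of one possible mapping, not the module's actual internals: `build_request_kwargs` is a hypothetical helper, and the assumption that `auth_username` and `auth_password` are passed as a basic-auth tuple is mine.

```python
def build_request_kwargs(settings):
    """Translate README-style settings into keyword arguments for requests.get.

    Hypothetical helper for illustration; the real module may do this differently.
    """
    kwargs = {
        # handle_redirect defaults to True, as documented above.
        "allow_redirects": settings.get("handle_redirect", True),
    }
    if settings.get("timeout") is not None:
        kwargs["timeout"] = settings["timeout"]
    if settings.get("user_agent"):
        kwargs["headers"] = {"User-Agent": settings["user_agent"]}
    # Authentication is only enabled when a username is set.
    # requests treats a (username, password) tuple as HTTP Basic auth.
    if settings.get("auth_username"):
        kwargs["auth"] = (
            settings["auth_username"],
            settings.get("auth_password", ""),
        )
    return kwargs
```

With requests installed, the result could be used as `requests.get(url, **build_request_kwargs(settings))`.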
Example
# Setup WebScraper with caching
web = WebScraper({
'cache_directory': "cache",
'cache_time': 5*60
})
# First call to GitHub -> hits the internet
web.get("https://github.com/")
# Second call to GitHub (within 5 minutes of the first) -> hits the cache
web.get("https://github.com/")
Which results in the following output:
2016-01-07 19:22:00 DEBUG [WebScraper._getCached] From inet https://github.com
2016-01-07 19:22:00 INFO [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
2016-01-07 19:22:01 DEBUG [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
2016-01-07 19:22:01 DEBUG [WebScraper._getCached] From cache https://github.com
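The "From inet" / "From cache" behaviour can be illustrated with a small time-based file cache. This is a sketch under stated assumptions, not the code behind `WebScraper._getCached`: the function name `cached_fetch`, the hashed file naming, and the injected `fetch` callable are all hypothetical and chosen so the example runs without network access.

```python
import hashlib
import os
import time


def cached_fetch(url, fetch, cache_directory, cache_time=5 * 60, now=time.time):
    """Return (body, source), where source is "inet" or "cache".

    fetch is any callable that takes a URL and returns the response text.
    """
    os.makedirs(cache_directory, exist_ok=True)
    # One cache file per URL, named by a hash of the URL (an assumed scheme).
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_directory, name)

    # Serve from disk while the file is younger than cache_time seconds.
    if os.path.exists(path) and now() - os.path.getmtime(path) < cache_time:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read(), "cache"

    # Otherwise fetch the resource and refresh the cache file.
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body, "inet"
```

Passing something like `lambda u: requests.get(u).text` as `fetch` would give real HTTP behaviour; the `now` parameter exists only to make expiry easy to exercise.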
History
0.1.15a0 (2016-03-08)
First release on PyPI.
Download files
Hashes for floscraper-0.1.15a1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 1d6087210bf2082c9da79158cc6dc3b2a923fe41ba9a9c55a1199e569009aec9
MD5 | 428e57582667cd7d88b8ea94251b39c5
BLAKE2b-256 | fe961107154a5f7e0972ed172ca3fa5bd53fd8dd8f24d9b9a096d1e45ea18baf