
A lazy, simple command-line tool: a Swiss Army knife for scraper writers. Automates scraping as much as possible.

About Lazyscraper
=================

Lazyscraper is a simple command-line tool and library, a Swiss Army knife for scraper writers. It is designed to work from the command line alone and to make
writing scrapers easier for very simple tasks, such as extracting external URLs or a simple table.


Supported patterns
==================
* simpleul - Extracts a list of URLs matching the pattern ul/li/a. Returns an array of URLs with "_text" and "href" fields.
* simpleopt - Extracts a list of option values matching the pattern select/option. Returns an array of items with "_text" and "value" fields.
* exturls - Extracts a list of URLs that lead to external websites. Returns an array of URLs with "_text" and "href" fields.
* getforms - Extracts all forms from a web page. Returns complex JSON data describing each form on the page.
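
The "simpleul" and "simpleopt" patterns above can be illustrated with a minimal sketch. This is not Lazyscraper's own implementation: it uses the standard library's ``xml.etree.ElementTree`` instead of lxml, the function names ``simple_ul`` and ``simple_opt`` are hypothetical, and the sample page is invented and deliberately well-formed (real pages usually need a lenient HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE = """
<html><body>
  <ul class="menu">
    <li><a href="/about">About</a></li>
    <li><a href="http://example.org/">Partner</a></li>
  </ul>
  <select name="region">
    <option value="1">North</option>
    <option value="2">South</option>
  </select>
</body></html>
"""


def simple_ul(root):
    """Collect "_text" and "href" for every link matching ul/li/a."""
    return [
        {"_text": (a.text or "").strip(), "href": a.get("href", "")}
        for a in root.findall(".//ul/li/a")
    ]


def simple_opt(root):
    """Collect "_text" and "value" for every node matching select/option."""
    return [
        {"_text": (o.text or "").strip(), "value": o.get("value", "")}
        for o in root.findall(".//select/option")
    ]


root = ET.fromstring(SAMPLE)
links = simple_ul(root)
options = simple_opt(root)
print(links)
print(options)
```

Both helpers return a list of dictionaries, mirroring the record shape described in the pattern list.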


Command-line tool
=================
::

    Usage: lazyscraper.py [OPTIONS] COMMAND [ARGS]...

    Options:
      --help  Show this message and exit.

    Commands:
      extract   Extract data with xpath
      gettable  Extracts table with data from html
      use       Uses predefined pattern to extract page data

Examples
========

Extracts the list of photos and names of Russian government ministers and writes it to "gov_persons.csv".

>>> python lscraper.py extract --url http://government.ru/en/gov/persons/ --xpath "//img[@class='photo']" --fieldnames src,srcset,alt --absolutize True --output gov_persons.csv --format csv

Extracts the list of ministries from the Russian government website using the "simpleul" pattern on the UL tag with class "departments col col__wide", and outputs absolutized URLs.

>>> python lscraper.py use --pattern simpleul --nodeclass "departments col col__wide" --url http://government.ru/en/ministries --absolutize True
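
As a rough illustration of what absolutizing URLs means, the sketch below resolves relative links against the page URL with the standard library's ``urljoin``. It is an assumption about the general idea, not a description of how the ``--absolutize`` option is implemented; the ``absolutize`` helper, the base URL and the sample records are all hypothetical.

```python
from urllib.parse import urljoin


def absolutize(records, base_url, field="href"):
    """Return copies of the records with `field` resolved against base_url."""
    result = []
    for record in records:
        fixed = dict(record)
        if fixed.get(field):
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            fixed[field] = urljoin(base_url, fixed[field])
        result.append(fixed)
    return result


# Hypothetical records in the "_text"/"href" shape returned by the patterns.
records = [
    {"_text": "Department", "href": "/en/dept/1/"},
    {"_text": "News", "href": "news/"},
    {"_text": "Partner", "href": "http://other.example/x"},
]
absolute = absolutize(records, "http://example.org/en/ministries")
for row in absolute:
    print(row["href"])
```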


Extracts the list of territorial organization URLs from the Russian tax service website using the "simpleopt" pattern.

>>> python lscraper.py use --pattern simpleopt --url http://nalog.ru

Extracts all forms from the Russian tax service website using the "getforms" pattern. Returns JSON describing each form and each of its buttons, inputs and selects.

>>> python lscraper.py use --pattern getforms --url http://nalog.ru

Extracts the list of website URLs from the Russian Federal Treasury website and uses awk to extract the domains.

>>> python lscraper.py extract --url http://roskazna.ru --xpath "//ul[@class='site-list']/li/a" --fieldnames href | awk -F/ '{print $3}'
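
The awk step prints the third "/"-separated field of each URL, which for URLs of the form ``http://host/path`` is the host. A minimal Python sketch of the same post-processing, assuming the URLs are already available as a list of strings, could use ``urllib.parse.urlsplit``; the ``domains`` helper is a hypothetical name.

```python
from urllib.parse import urlsplit


def domains(urls):
    """Return the network location (host, plus port if present) of each URL."""
    return [urlsplit(url).netloc for url in urls]


# Sample URLs taken from the "simpleopt" output shown in this document.
urls = [
    "http://www.roskazna.ru",
    "http://altay.roskazna.ru",
    "http://amur.roskazna.ru",
]
hosts = domains(urls)
print(hosts)
```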

How to use library
==================

Extracts all links from the gov.uk website with the fields src, alt, href and _text.

>>> from lazyscraper import extract_data_xpath
>>> extract_data_xpath('http://gov.uk', xpath='//a', fieldnames='src,alt,href,_text', absolutize=True)
[{'_text': 'Skip to main content', 'src': '', 'alt': '', 'href': 'http://gov.uk#content'}, {'_text': 'Find out more about cookies', 'src': '', 'alt': '', 'href': 'https://www.gov.uk/help/cookies'}, {'_text': 'GOV.UK', 'src': '', 'alt': '', 'href': 'https://www.gov.uk'}, {'_text': 'Universal Jobmatch job search', 'src': '', 'alt': '', 'href': 'http://gov.uk/jobsearch'}, {'_text': 'Renew vehicle tax', 'src': '', 'alt': '', 'href': 'http://gov.uk/vehicle-tax'}, {'_text': 'Log in to student finance', 'src': '', 'alt': '', 'href': 'http://gov.uk/student-finance-register-login'}, {'_text': 'Book your theory test', 'src': '', 'alt': '', 'href': 'http://gov.uk/book-theory-test'}, {'_text': 'Personal tax account', 'src': '', 'alt': '', 'href': 'http://gov.uk/personal-tax-account'}, {'_text': 'Benefits', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/benefits'}, {'_text': 'Births, deaths, marriages and care', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/births-deaths-marriages'}, {'_text': 'Business and self-employed', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/business'}, {'_text': 'Childcare and parenting', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/childcare-parenting'}, {'_text': 'Citizenship and living in the UK', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/citizenship'}, {'_text': 'Crime, justice and the law', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/justice'}, {'_text': 'Disabled people', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/disabilities'}, {'_text': 'Driving and transport', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/driving'}, {'_text': 'Education and learning', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/education'}, {'_text': 'Employing people', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/employing-people'}, {'_text': 'Environment and countryside', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/environment-countryside'}, {'_text': 'Housing and local services', 'src': '', 
'alt': '', 'href': 'http://gov.uk/browse/housing-local-services'}, {'_text': 'Money and tax', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/tax'}, {'_text': 'Passports, travel and living abroad', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/abroad'}, {'_text': 'Visas and immigration', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/visas-immigration'}, {'_text': 'Working, jobs and pensions', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/working'}, {'_text': '25 Ministerial departments', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/organisations'}, {'_text': '385 Other agencies and public bodies', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/organisations'}, {'_text': 'government departments', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/organisations'}, {'_text': 'policies', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/policies'}, {'_text': 'announcements', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/announcements'}, {'_text': 'publications', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/publications'}, {'_text': 'statistics', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/statistics'}, {'_text': 'consultations', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/publications?publication_filter_option=consultations'}, {'_text': 'how government services are performing', 'src': '', 'alt': '', 'href': 'http://gov.uk/performance'}, {'_text': 'Get MOT reminders', 'src': '', 'alt': '', 'href': 'http://gov.uk/mot-reminder'}, {'_text': 'Grenfell Tower fire', 'src': '', 'alt': '', 'href': 'http://gov.uk/guidance/grenfell-tower-fire-june-2017-support-for-people-affected'}, {'_text': 'The UK and the EU', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/policies/brexit'}, {'_text': 'Universal Jobmatch job search', 'src': '', 'alt': '', 'href': 'http://gov.uk/jobsearch'}, {'_text': 'Log in to student finance', 'src': '', 'alt': '', 'href': 
'http://gov.uk/student-finance-register-login'}, {'_text': 'Passport fees', 'src': '', 'alt': '', 'href': 'http://gov.uk/passport-fees'}, {'_text': "Jobseeker's Allowance", 'src': '', 'alt': '', 'href': 'http://gov.uk/jobseekers-allowance'}, {'_text': 'Council Tax bands', 'src': '', 'alt': '', 'href': 'http://gov.uk/council-tax-bands'}, {'_text': 'Running a limited company', 'src': '', 'alt': '', 'href': 'http://gov.uk/running-a-limited-company'}, {'_text': 'Driving theory test', 'src': '', 'alt': '', 'href': 'http://gov.uk/book-a-driving-theory-test'}, {'_text': 'Vehicle tax rates', 'src': '', 'alt': '', 'href': 'http://gov.uk/calculate-vehicle-tax-rates'}, {'_text': 'Renew vehicle tax', 'src': '', 'alt': '', 'href': 'http://gov.uk/vehicle-tax'}, {'_text': 'VAT rates', 'src': '', 'alt': '', 'href': 'http://gov.uk/vat-rates'}, {'_text': 'UK bank holidays', 'src': '', 'alt': '', 'href': 'http://gov.uk/bank-holidays'}, {'_text': 'bank holidays', 'src': '', 'alt': '', 'href': 'http://gov.uk/bank-holidays'}, {'_text': 'Benefits', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/benefits'}, {'_text': 'Births, deaths, marriages and care', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/births-deaths-marriages'}, {'_text': 'Business and self-employed', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/business'}, {'_text': 'Childcare and parenting', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/childcare-parenting'}, {'_text': 'Citizenship and living in the UK', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/citizenship'}, {'_text': 'Crime, justice and the law', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/justice'}, {'_text': 'Disabled people', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/disabilities'}, {'_text': 'Driving and transport', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/driving'}, {'_text': 'Education and learning', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/education'}, {'_text': 'Employing people', 'src': 
'', 'alt': '', 'href': 'http://gov.uk/browse/employing-people'}, {'_text': 'Environment and countryside', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/environment-countryside'}, {'_text': 'Housing and local services', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/housing-local-services'}, {'_text': 'Money and tax', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/tax'}, {'_text': 'Passports, travel and living abroad', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/abroad'}, {'_text': 'Visas and immigration', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/visas-immigration'}, {'_text': 'Working, jobs and pensions', 'src': '', 'alt': '', 'href': 'http://gov.uk/browse/working'}, {'_text': 'How government works', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/how-government-works'}, {'_text': 'Departments', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/organisations'}, {'_text': 'Worldwide', 'src': '', 'alt': '', 'href': 'http://gov.uk/world'}, {'_text': 'Policies', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/policies'}, {'_text': 'Publications', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/publications'}, {'_text': 'Announcements', 'src': '', 'alt': '', 'href': 'http://gov.uk/government/announcements'}, {'_text': 'Help', 'src': '', 'alt': '', 'href': 'http://gov.uk/help'}, {'_text': 'Cookies', 'src': '', 'alt': '', 'href': 'http://gov.uk/help/cookies'}, {'_text': 'Contact', 'src': '', 'alt': '', 'href': 'http://gov.uk/contact'}, {'_text': 'Terms and conditions', 'src': '', 'alt': '', 'href': 'http://gov.uk/help/terms-conditions'}, {'_text': 'Rhestr o Wasanaethau Cymraeg', 'src': '', 'alt': '', 'href': 'http://gov.uk/cymraeg'}, {'_text': 'Government Digital Service', 'src': '', 'alt': '', 'href': 'https://www.gov.uk/government/organisations/government-digital-service'}, {'_text': 'Open Government Licence', 'src': '', 'alt': '', 'href': 
'https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/'}, {'_text': 'Open Government Licence v3.0', 'src': '', 'alt': '', 'href': 'https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/'}, {'_text': '© Crown copyright', 'src': '', 'alt': '', 'href': 'https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/copyright-and-re-use/crown-copyright/'}]
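
The returned value is a plain list of dictionaries, so it can be post-processed with the standard library. The output above contains repeated links (for example "Benefits"), so one possible follow-up step is to drop duplicates by "href" and serialize the rest as CSV. The sketch below works on a three-record excerpt of that output; the ``dedupe`` and ``to_csv`` helpers are hypothetical and not part of Lazyscraper.

```python
import csv
import io

# Excerpt of the records shown above, including one repeated link.
records = [
    {"_text": "Benefits", "src": "", "alt": "", "href": "http://gov.uk/browse/benefits"},
    {"_text": "Help", "src": "", "alt": "", "href": "http://gov.uk/help"},
    {"_text": "Benefits", "src": "", "alt": "", "href": "http://gov.uk/browse/benefits"},
]


def dedupe(records, key="href"):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def to_csv(records, fieldnames):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


unique = dedupe(records)
csv_text = to_csv(unique, ["_text", "href", "src", "alt"])
print(csv_text)
```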


Runs the "simpleopt" pattern against the Russian Federal Treasury website.

>>> from lazyscraper import use_pattern
>>> use_pattern('http://roskazna.ru', 'simpleopt')
[{'_text': 'roskazna.ru', 'value': 'http://www.roskazna.ru'}, {'_text': 'Алтайский край', 'value': 'http://altay.roskazna.ru'}, {'_text': 'Амурская область', 'value': 'http://amur.roskazna.ru'}, {'_text': 'Архангельская область и Ненецкий автономный округ', 'value': 'http://arhangelsk.roskazna.ru'}, {'_text': 'Астраханская область', 'value': 'http://astrahan.roskazna.ru'}, {'_text': 'Белгородская область', 'value': 'http://belgorod.roskazna.ru'}, {'_text': 'Брянская область', 'value': 'http://bryansk.roskazna.ru'}, {'_text': 'Владимирская область', 'value': 'http://vladimir.roskazna.ru'}, {'_text': 'Волгоградская область', 'value': 'http://volgograd.roskazna.ru'}, {'_text': 'Вологодская область', 'value': 'http://vologodskaya.roskazna.ru'}, {'_text': 'Воронежская область', 'value': 'http://31vrn.roskazna.ru'}, {'_text': 'Еврейская автономная область', 'value': 'http://birobidzhan.roskazna.ru'}, {'_text': 'Забайкальский край', 'value': 'http://chita.roskazna.ru'}, {'_text': 'Ивановская область', 'value': 'http://ivanovskaya.roskazna.ru'}, {'_text': 'Иркутская область', 'value': 'http://irkutsk.roskazna.ru'}, {'_text': 'Кабардино-Балкарская Республика', 'value': 'http://kabardino-balkaria.roskazna.ru'}, {'_text': 'Севастополь', 'value': 'http://sevastopol.roskazna.ru'}, {'_text': 'Калининградская область', 'value': 'http://kaliningrad.roskazna.ru'}, {'_text': 'Калужская область', 'value': 'http://kaluga.roskazna.ru'}, {'_text': 'Камчатский край', 'value': 'http://kamchatka.roskazna.ru'}, {'_text': 'Карачаево-Черкесская Республика', 'value': 'http://karachaevocherkessia.roskazna.ru'}, {'_text': 'Кемеровская область', 'value': 'http://kemerovskaya.roskazna.ru'}, {'_text': 'Кировская область', 'value': 'http://kirov.roskazna.ru'}, {'_text': 'Костромская область', 'value': 'http://kostroma.roskazna.ru'}, {'_text': 'Краснодарский край', 'value': 'http://krasnodar.roskazna.ru'}, {'_text': 'Красноярский край', 'value': 'http://krasnoyarsk.roskazna.ru'}, {'_text': 
'Курганская область', 'value': 'http://kurgan.roskazna.ru'}, {'_text': 'Курская область', 'value': 'http://kursk.roskazna.ru'}, {'_text': 'Ленинградская область', 'value': 'http://leningrad.roskazna.ru'}, {'_text': 'Липецкая область', 'value': 'http://lipetsk.roskazna.ru'}, {'_text': 'Ямало-Ненецкий автономный округ', 'value': 'http://yamalo-nenetskiy.roskazna.ru'}, {'_text': 'Магаданская область', 'value': 'http://magadan.roskazna.ru'}, {'_text': 'Рязанская область', 'value': 'http://ryazan.roskazna.ru'}, {'_text': 'Санкт-Петербург', 'value': 'http://piter.roskazna.ru'}, {'_text': 'Самарская область', 'value': 'http://samara.roskazna.ru'}, {'_text': 'Московская область', 'value': 'http://mo.roskazna.ru'}, {'_text': 'Мурманская область', 'value': 'http://murmansk.roskazna.ru'}, {'_text': 'Нижегородская область', 'value': 'http://nizhegorodskaya.roskazna.ru'}, {'_text': 'Новгородская область', 'value': 'http://novgorod.roskazna.ru'}, {'_text': 'Новосибирская область', 'value': 'http://novosibirsk.roskazna.ru'}, {'_text': 'Омская область', 'value': 'http://omsk.roskazna.ru'}, {'_text': 'Оренбургская область', 'value': 'http://orenburg.roskazna.ru'}, {'_text': 'Орловская область', 'value': 'http://orel.roskazna.ru'}, {'_text': 'Пензенская область', 'value': 'http://penza.roskazna.ru'}, {'_text': 'Пермский край', 'value': 'http://perm.roskazna.ru'}, {'_text': 'Приморский край', 'value': 'http://vladivostok.roskazna.ru'}, {'_text': 'Псковская область', 'value': 'http://pskov.roskazna.ru'}, {'_text': 'Республика Адыгея', 'value': 'http://adygeya.roskazna.ru'}, {'_text': 'Республика Алтай', 'value': 'http://r-altay.roskazna.ru'}, {'_text': 'Республика Башкортостан', 'value': 'http://ufa.roskazna.ru'}, {'_text': 'Республика Бурятия', 'value': 'http://buryatia.roskazna.ru'}, {'_text': 'Республика Дагестан', 'value': 'http://dagestan.roskazna.ru'}, {'_text': 'Республика Ингушетия', 'value': 'http://ingushetia.roskazna.ru'}, {'_text': 'Республика Калмыкия', 'value': 
'http://kalmykia.roskazna.ru'}, {'_text': 'Республика Карелия', 'value': 'http://karelia.roskazna.ru'}, {'_text': 'Республика Коми', 'value': 'http://komi.roskazna.ru'}, {'_text': 'Республика Крым', 'value': 'http://krym.roskazna.ru'}, {'_text': 'Республика Марий Эл', 'value': 'http://mariy-el.roskazna.ru'}, {'_text': 'Республика Мордовия', 'value': 'http://mordovia.roskazna.ru'}, {'_text': 'Республика Саха (Якутия)', 'value': 'http://sakha.roskazna.ru'}, {'_text': 'Республика Северная Осетия-Алания', 'value': 'http://alania.roskazna.ru'}, {'_text': 'Республика Татарстан', 'value': 'http://tatarstan.roskazna.ru'}, {'_text': 'Республика Тыва', 'value': 'http://tyva.roskazna.ru'}, {'_text': 'Республика Удмуртия', 'value': 'http://udmurtia.roskazna.ru'}, {'_text': 'Республика Хакасия', 'value': 'http://hakasia.roskazna.ru'}, {'_text': 'Ростовская область', 'value': 'http://rostov.roskazna.ru'}, {'_text': 'Саратовская область', 'value': 'http://saratov.roskazna.ru'}, {'_text': 'Сахалинская область', 'value': 'http://sahalin.roskazna.ru'}, {'_text': 'Свердловская область', 'value': 'http://sverdlovsk.roskazna.ru'}, {'_text': 'Смоленская область', 'value': 'http://smolensk.roskazna.ru'}, {'_text': 'Тамбовская область', 'value': 'http://tambov.roskazna.ru'}, {'_text': 'Тверская область', 'value': 'http://tver.roskazna.ru'}, {'_text': 'Томская область', 'value': 'http://tomsk.roskazna.ru'}, {'_text': 'Тульская область', 'value': 'http://tula.roskazna.ru'}, {'_text': 'Тюменская область', 'value': 'http://tumen.roskazna.ru'}, {'_text': 'Ульяновская область', 'value': 'http://ulyanovsk.roskazna.ru'}, {'_text': 'Хабаровский край', 'value': 'http://khabarovsk.roskazna.ru'}, {'_text': 'Ханты-Мансийский автономный округ - Югра', 'value': 'http://hantymansiysk.roskazna.ru'}, {'_text': 'Челябинская область', 'value': 'http://chelyabinsk.roskazna.ru'}, {'_text': 'Чеченская Республика', 'value': 'http://chechnya.roskazna.ru'}, {'_text': 'Чувашская Республика', 'value': 
'http://chuvashia.roskazna.ru'}, {'_text': 'Чукотский автономный округ', 'value': 'http://chukotka.roskazna.ru'}, {'_text': 'Ярославская область', 'value': 'http://yaroslavl.roskazna.ru'}, {'_text': 'Москва', 'value': 'http://moscow.roskazna.ru'}, {'_text': 'Ставропольский край', 'value': 'http://stavropol.roskazna.ru'}, {'_text': 'Центр по обеспечению деятельности Казначейства России', 'value': 'http://cokr.roskazna.ru'}, {'_text': 'Межрегиональное операционное УФК', 'value': 'http://moufk.roskazna.ru'}]

Requirements
============
* Python 3 https://www.python.org
* click https://github.com/pallets/click
* lxml http://lxml.de/


.. :changelog:

History
=======


0.1.0 (2018-01-14)
------------------
* First public release on PyPI and updated GitHub code

