transmogrify.webcrawler 1.1
Crawling and feeding html content into a transmogrifier pipeline
Crawling - html to import
A source blueprint for crawling content from a site or local html files.
Webcrawler imports HTML from a live website, from a folder on disk, or from a folder on disk containing html that was saved from a live website and may still have absolute links referring to that website.
To crawl a live website, supply the crawler with a base http url to start crawling from. This url must be a prefix of all the other urls you want from the site.
For example
[crawler]
blueprint = transmogrify.webcrawler
url = http://www.whitehouse.gov
max = 50
will restrict the crawler to the first 50 pages.
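The two rules above (only urls under the base are crawled, and crawling stops after max pages) can be sketched in a few lines. This is an illustration, not the crawler's actual code: the function name select_urls and the sample links are hypothetical.

```python
def select_urls(base, links, max_pages=None):
    """Keep links that start with base, stopping after max_pages."""
    selected = []
    for link in links:
        if not link.startswith(base):
            continue  # outside the base url, so not crawled
        if max_pages is not None and len(selected) >= max_pages:
            break  # page limit reached
        selected.append(link)
    return selected


links = [
    "http://www.whitehouse.gov/",
    "http://www.whitehouse.gov/briefing-room",
    "http://example.com/elsewhere",
]
print(select_urls("http://www.whitehouse.gov", links, max_pages=50))
```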
You can also crawl a local directory of html with relative links by just using a file: style url
[crawler]
blueprint = transmogrify.webcrawler
url = file:///mydirectory
Or, if the local directory contains html saved from a website and might have absolute urls in it, you can set it as the cache. The crawler will always look up the cache first.
[crawler]
blueprint = transmogrify.webcrawler
url = http://therealsite.com --crawler:cache=mydirectory
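The "cache first" lookup can be pictured roughly as follows. This is only a sketch under stated assumptions: the mapping from url to file path (cached_path), the helper read_url and the fake_fetch stand-in are all hypothetical, and the real cache layout may differ.

```python
import os
import tempfile
from urllib.parse import urlparse


def cached_path(cache_dir, url):
    """Map a url to a file under cache_dir (hypothetical layout)."""
    path = urlparse(url).path.lstrip("/") or "index.html"
    return os.path.join(cache_dir, *path.split("/"))


def read_url(url, cache_dir, fetch):
    """Return cached bytes when present, otherwise call fetch(url)."""
    candidate = cached_path(cache_dir, url)
    if os.path.isfile(candidate):
        with open(candidate, "rb") as handle:
            return handle.read()
    return fetch(url)


with tempfile.TemporaryDirectory() as cache_dir:
    # One page is already on disk; the other has to come from the "site".
    with open(os.path.join(cache_dir, "about.html"), "wb") as handle:
        handle.write(b"<html>cached copy</html>")
    fetched = []

    def fake_fetch(url):
        fetched.append(url)
        return b"<html>live copy</html>"

    hit = read_url("http://therealsite.com/about.html", cache_dir, fake_fetch)
    miss = read_url("http://therealsite.com/news.html", cache_dir, fake_fetch)

print(hit, miss, fetched)
```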
The following will not crawl anything larger than the configured maxsize (400000 in this example)
[crawler]
blueprint = transmogrify.webcrawler
url = http://www.whitehouse.gov
maxsize=400000
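A size limit like this could be checked as in the sketch below. It assumes the size is taken from the Content-Length header and that maxsize is a number of bytes; neither is confirmed by the text here, and the function name too_large is hypothetical.

```python
def too_large(headers, maxsize=None):
    """Return True when the reported size exceeds maxsize.

    maxsize=None means unlimited.
    """
    if maxsize is None:
        return False
    length = headers.get("content-length")
    if length is None:
        return False  # size unknown, so nothing to compare against
    return int(length) > maxsize


print(too_large({"content-length": "1048576"}, maxsize=400000))
print(too_large({"content-length": "2048"}, maxsize=400000))
```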
To skip crawling links by regular expression
[crawler]
blueprint = transmogrify.webcrawler
url=http://www.whitehouse.gov
ignore =
    \.mp3
    \.mp4
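To see what these patterns would skip, they can be tried against sample urls. The sketch assumes each pattern is searched for anywhere in the url (re.search); whether the crawler searches or anchors the match is not stated here, and is_ignored is a hypothetical helper.

```python
import re

ignore = [r"\.mp3", r"\.mp4"]  # same patterns as in the config above


def is_ignored(url, patterns=ignore):
    """Return True if any pattern is found anywhere in the url."""
    return any(re.search(pattern, url) for pattern in patterns)


print(is_ignored("http://www.whitehouse.gov/audio/speech.mp3"))
print(is_ignored("http://www.whitehouse.gov/index.html"))
```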
If webcrawler is having trouble parsing the html of some pages, you can preprocess the html before it is parsed, e.g.
[crawler]
blueprint = transmogrify.webcrawler
patterns = (<script>)[^<]*(</script>)
subs = \1\2
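The effect of this pattern and substitution can be checked with Python's re.sub, which uses the same \1 and \2 backreference syntax. The sample html is made up for illustration.

```python
import re

pattern = r"(<script>)[^<]*(</script>)"
sub = r"\1\2"  # keep the opening and closing tags, drop what is between

html = "<p>before</p><script>var x = 1;</script><p>after</p>"
cleaned = re.sub(pattern, sub, html)
print(cleaned)
```

The script body is emptied while the tags themselves stay in place, so the parser no longer sees the script content.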
If you'd like to skip processing links with certain mimetypes, you can use the drop:condition option. This TALES expression determines what will be processed further. See http://pypi.python.org/pypi/collective.transmogrifier/#condition-section
[drop]
blueprint = collective.transmogrifier.sections.condition
condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']
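The python: expression in the condition can be tried outside the pipeline as an ordinary function. The sample items below are made up, and keep is a hypothetical name for the expression.

```python
def keep(item):
    """Same test as the condition above: True means the item is processed further."""
    return (
        item.get('_mimetype') not in [
            'application/x-javascript', 'text/css', 'text/plain',
            'application/x-java-byte-code']
        and item.get('_path', '').split('.')[-1] not in ['class']
    )


items = [
    {'_path': 'index.html', '_mimetype': 'text/html'},
    {'_path': 'style.css', '_mimetype': 'text/css'},
    {'_path': 'applet/Main.class', '_mimetype': 'application/octet-stream'},
]
kept = [item['_path'] for item in items if keep(item)]
print(kept)
```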
Options
- site_url
- the top url to crawl
- ignore
- list of regex for urls to not crawl
- cache
- local directory to read crawled items from instead of accessing the site directly
- patterns
- Regular expressions to substitute before html is parsed. Newline separated
- subs
- Text to replace each item in patterns. Must be the same number of lines as patterns. Due to the way buildout handles empty lines, to replace a pattern with nothing (eg to remove the pattern), use <EMPTYSTRING> as a substitution.
- maxsize
- don't crawl anything larger than this
- max
- Limit crawling to this number of pages
- start-urls
- a list of urls to initially crawl
- ignore-robots
- if set, will ignore the robots.txt directives and crawl everything
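The pairing of patterns with subs, including the <EMPTYSTRING> marker, can be sketched as below. This is not the package's reformat function: apply_substitutions, the sample patterns and the sample html are hypothetical, and stripping whitespace from each line is a simplification.

```python
import re


def apply_substitutions(html, patterns, subs):
    """Apply each pattern with the substitution on the matching line."""
    pattern_lines = [line.strip() for line in patterns.splitlines() if line.strip()]
    sub_lines = [line.strip() for line in subs.splitlines() if line.strip()]
    if len(pattern_lines) != len(sub_lines):
        raise ValueError("patterns and subs must have the same number of lines")
    for pattern, sub in zip(pattern_lines, sub_lines):
        if sub == "<EMPTYSTRING>":
            sub = ""  # marker stands in for an empty replacement
        html = re.sub(pattern, sub, html)
    return html


patterns = r"""
<!--.*?-->
<br\s*/?>
"""
subs = """
<EMPTYSTRING>
<br>
"""
result = apply_substitutions("<p>a<!-- note --><br />b</p>", patterns, subs)
print(result)
```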
WebCrawler will emit items like
item = dict(_site_url = "Original site_url used",
            _path = "The url crawled without _site_url",
            _content = "The raw content returned by the url",
            _content_info = "Headers returned with content",
            _backlinks = names,
            _sortorder = "An integer representing the order the url was found within the page/site",
            )
transmogrify.webcrawler.typerecognitor
A blueprint for assigning content type based on the mime-type as given by the webcrawler
transmogrify.webcrawler.cache
A blueprint that saves crawled content into a directory structure
transmogrify.webcrawler
A transmogrifier blueprint source which will crawl a url reading in all pages until all have been crawled.
Options
- site_url
- URL to start crawling. The URL will be treated as the base and any links outside this base will be ignored
- ignore
- Regular expressions for urls not to follow
- patterns
- Regular expressions to substitute before html is parsed. Newline separated
- subs
- Text to replace
- checkext
- checkext
- verbose
- verbose
- maxsize
- don't crawl anything larger than this
- nonames
- nonames
- cache
- cache
Keys inserted
The following options set the keys of the items added to the pipeline
- pathkey
- default: _path. The path of the url not including the base
- siteurlkey
- default: _site_url. The base of the url
- originkey
- default: _origin. The original path in case retrieving the url caused a redirection
- contentkey
- default: _content. The main content of the url
- contentinfokey
- default: _content_info. Headers returned by urlopen
- sortorderkey
- default: _sortorder. A count of when a link to this item was first encountered while crawling
- backlinkskey
- default: _backlinks. A list of tuples of which pages linked to this item. (url, path)
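One way to picture these key options: each option names the dictionary key under which a value is stored on the item. The sketch below is an assumption about how that mapping could work, not the blueprint's code. build_item and the sample values are hypothetical, and the default for sortorderkey is written as _sortorder, matching the items shown in the test output.

```python
DEFAULT_KEYS = {
    "pathkey": "_path",
    "siteurlkey": "_site_url",
    "originkey": "_origin",
    "contentkey": "_content",
    "contentinfokey": "_content_info",
    "sortorderkey": "_sortorder",
    "backlinkskey": "_backlinks",
}


def build_item(values, options=None):
    """Map crawled values onto item keys, honouring any key overrides."""
    keys = dict(DEFAULT_KEYS)
    keys.update(options or {})
    return {keys[name]: value for name, value in values.items()}


values = {
    "pathkey": "subfolder/subfile1.htm",
    "siteurlkey": "http://www.whitehouse.gov/",
    "sortorderkey": 5,
    "backlinkskey": [("http://www.whitehouse.gov/subfolder", "subfolder")],
}
default_item = build_item(values)
renamed_item = build_item(values, {"pathkey": "_crawled_path"})
print(default_item)
print(renamed_item)
```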
Tests
>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{'_backlinks': [],
'_content_info': {'content-type': 'text/html'},
'_mimetype': 'text/html',
'_origin': 'file://.../test_staticsite',
'_path': '',
'_site_url': 'file://.../test_staticsite/',
'_sortorder': 0,
'_type': 'Document'}
...
>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{...
'_path': '',
...}
{...
'_path': 'cia-plone-view-source.jpg',
...}
{...
'_path': 'subfolder',
...}
{...
'_path': 'subfolder2',
...}
{...
'_path': 'file3.html',
...}
{...
'_path': 'subfolder/subfile1.htm',
...}
{...
'_path': 'file.doc',
...}
{...
'_path': 'file2.htm',
...}
{...
'_path': 'file4.HTML',
...}
{...
'_path': 'egenius-plone.gif',
...}
{...
'_path': 'plone_schema.png',
...}
{...
'_path': 'file1.htm',
...}
{...
'_path': 'subfolder2/subfile1.htm',
...}
...
>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... patterns =
...         (?s)<SCRIPT.*Abbreviation"\)
...         (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
...         (?s)State=.*<body[^>]*>
... subs =
...         </head><body>
...         <a href="\g<u>">\g<a></a>
...         <br>
... """)
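The second pattern in the test above uses named groups, which the matching substitution refers to with \g<name>. Its effect can be checked on its own with re.sub; the sample input string is made up.

```python
import re

pattern = r"(?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)"
sub = r'<a href="\g<u>">\g<a></a>'  # u is the url, a is the link text

source = "MakeLink('page.html','Next page')"
link = re.sub(pattern, sub, source)
print(link)
```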
External scripts used
http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py
TypeRecognitor
TypeRecognitor is a transmogrifier blueprint which determines the Plone type of the item from the mimetype in the headers. It reads the mimetype from the headers in _content_info set by transmogrify.webcrawler.
>>> from os.path import dirname
>>> from os.path import abspath
>>> config = """
...
... [transmogrifier]
... pipeline =
...     webcrawler
...     typerecognitor
...     clean
...     printer
...
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
...
... [typerecognitor]
... blueprint = transmogrify.webcrawler.typerecognitor
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...     file
...     text
...     image
...
... [printer]
... blueprint = collective.transmogrifier.sections.tests.pprinter
...
... """ % abspath(dirname(__file__)).replace('\\','/')
>>> from collective.transmogrifier.tests import registerConfig
>>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)
>>> from collective.transmogrifier.transmogrifier import Transmogrifier
>>> transmogrifier = Transmogrifier(plone)
>>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
{...
'_mimetype': 'image/jpeg',
...
'_path': 'cia-plone-view-source.jpg',
...
'_type': 'Image',
...}
...
- {'_mimetype': 'image/gif',
  '_path': '/egenius-plone.gif',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Image'}
- {'_mimetype': 'application/msword',
  '_path': '/file.doc',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': 'doc_to_html',
  '_type': 'Document'}
- {'_mimetype': 'text/html',
  '_path': '/file1.htm',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
- {'_mimetype': 'text/html',
  '_path': '/file2.htm',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
- {'_mimetype': 'text/html',
  '_path': '/file3.html',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
- {'_mimetype': 'text/html',
  '_path': '/file4.HTML',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
- {'_mimetype': 'image/png',
  '_path': '/plone_schema.png',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Image'}
- {'_mimetype': 'text/html',
  '_path': '/subfolder',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
- {'_mimetype': 'text/html',
  '_path': '/subfolder/subfile1.htm',
  '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
  '_transform': None,
  '_type': 'Document'}
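The mapping visible in the sample output above (mimetype to _type and _transform) can be summarised as a lookup table. The table below is inferred only from that output. The real blueprint may cover more types and use different rules, and the names TYPE_MAP and recognise are hypothetical.

```python
# Lookup table inferred from the sample output; an assumption, not the
# blueprint's actual rules.
TYPE_MAP = {
    "text/html": ("Document", None),
    "application/msword": ("Document", "doc_to_html"),
    "image/jpeg": ("Image", None),
    "image/gif": ("Image", None),
    "image/png": ("Image", None),
}


def recognise(item):
    """Set _mimetype, _type and _transform from the content-type header."""
    headers = item.get("_content_info", {})
    mimetype = headers.get("content-type", "").split(";")[0].strip()
    if mimetype in TYPE_MAP:
        item["_mimetype"] = mimetype
        item["_type"], item["_transform"] = TYPE_MAP[mimetype]
    return item


item = recognise({"_path": "file.doc",
                  "_content_info": {"content-type": "application/msword"}})
print(item["_type"], item["_transform"])
```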
Changelog
1.1 (2012-04-17)
- add start-urls option [djay]
- add ignore_robots option [djay]
- fixed bug in http-equiv refresh handling [djay]
- fixes to disk caching [djay]
- better logging [djay]
- default maxsize is unlimited [djay]
- Provide ability for the reformat function to substitute patterns with empty strings (nothing). Buildout does not support empty lines within configuration, so if a substitution is <EMPTYSTRING> this becomes an empty string. [davidjb]
- Provide a logger in the LXMLPage class so the reformat function can succeed [davidjb]
- Reformat spacing in webcrawler reformat function [davidjb]
1.0 (2011-06-29)
- many fixes for importing from local directory w/ many languages [simahawk]
- fix UnicodeEncodeError when file name/language is not english [simahawk]
- fix iterating over non-sequence [simahawk]
- fix missing import for MyStringIO [simahawk]
1.0b7 (2011-02-17)
- fix bug in cache check
1.0b6 (2011-02-12)
- only open cache files when needed so don't run out of handles
- follow http-equiv refresh links
1.0b5 (2011-02-06)
- files use file pointers to reduce memory usage
- cache saves .metadata files to record and playback headers
1.0b4 (2010-12-13)
- improve logging
- fix encoding bug caused by cache
1.0b3 (2010-11-10)
- Fixed bug in cache that caused many links to be ignored in some cases
- Fix documentation up
1.0b2 (2010-11-09)
- Stopped localhost output when no output set
1.0b1 (2010-11-08)
- change site_url to just url.
- rename maxpage to maxsize
- fix file: style urls
- Added cache option to replace base_alias
- fix _origin key set by webcrawler, instead of url now it is path as expected by further blue [Vitaliy Podoba]
- add _orig_path to pipeline item to keep original path for any further purposes, we will need [Vitaliy Podoba]
- make all url absolute taking into account base tags inside webcrawler blueprint [Vitaliy Podoba]
0.1 (2008-09-25)
- renamed package from pretaweb.blueprints to transmogrify.webcrawler. [djay]
- enhanced import view (djay)
| File | Type | Py Version | Uploaded on | Size | # downloads |
|---|---|---|---|---|---|
| transmogrify.webcrawler-1.1.zip (md5) | Source | | 2012-04-17 | 525KB | 147 |
- Author: Dylan Jay
- Home Page: http://github.com/djay/transmogrify.webcrawler
- Keywords: transmogrifier blueprint funnelweb source plone import conversion microsoft office
- License: GPL
- Package Index Owner: djay
- DOAP record: transmogrify.webcrawler-1.1.xml
