gugle.bot 1.0dev-r629
Small and Dumb Spider
Introduction
gugle.bot is a highly experimental web spider. It collects everything and does not distinguish trash from useful content.
Its intended use is to run experiments that require collecting links from web pages.
Installation
Just type:
easy_install gugle.bot
That should be enough. If gugle.bot fails to work because of an undeclared dependency, please report it to me.
Usage of this package
This package provides a console script, guglebot, which is a handy shortcut to start the spider.
Get help by typing:
guglebot --help
What data is collected?
Currently, we keep a list of URLs and a list of (referrer, target) pairs, nothing else.
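To make the shape of this data concrete, here is a minimal sketch in Python 3 of how (referrer, target) pairs could be produced from one page. It is not gugle.bot's actual code: the names `LinkCollector`, `collect_links` and `known_urls` are hypothetical, and the sketch assumes that only `<a href="...">` links are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: records (referrer, target) pairs for one page."""

    def __init__(self, referrer):
        super().__init__()
        self.referrer = referrer
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are considered in this sketch.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page that contains them.
                self.pairs.append((self.referrer, urljoin(self.referrer, value)))


def collect_links(referrer, html):
    """Return the list of (referrer, target) pairs found in `html`."""
    collector = LinkCollector(referrer)
    collector.feed(html)
    return collector.pairs


def known_urls(pairs):
    """Return every URL seen in `pairs`, in first-seen order, without repeats."""
    seen = []
    for referrer, target in pairs:
        for url in (referrer, target):
            if url not in seen:
                seen.append(url)
    return seen
```

Calling `collect_links(page_url, page_html)` for each fetched page and concatenating the results gives the list of pairs; `known_urls` then derives the list of URLs from it.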
How to inspect the collected data
Although we provide several scripts, you may need to craft your own in order to get all the information you need.
The following is a very compact list of the current scripts:
- gbdomains: Shows every domain collected.
- gbinspect: Prints a summary of all collected data.
- gblist: Prints every collected URL.
- gbgraph: Prints every (referrer, target) pair.
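If the bundled scripts are not enough, a custom script could look roughly like the following Python 3 sketch. How gugle.bot stores its data is not documented here, so the sketch assumes the URLs and pairs have already been loaded into plain Python lists; the function names `domains` and `outgoing_link_counts` are hypothetical and not part of the package.

```python
from collections import Counter
from urllib.parse import urlparse


def domains(urls):
    """Return the sorted set of host names appearing in `urls`."""
    hosts = set()
    for url in urls:
        host = urlparse(url).netloc
        if host:  # skip entries without a network location
            hosts.add(host)
    return sorted(hosts)


def outgoing_link_counts(pairs):
    """Count how many targets each referrer links to."""
    return Counter(referrer for referrer, _target in pairs)


if __name__ == "__main__":
    # Assumed in-memory data; replace with however the data is actually loaded.
    urls = ["http://example.com/", "http://example.com/about", "http://other.org/"]
    pairs = [
        ("http://example.com/", "http://example.com/about"),
        ("http://example.com/", "http://other.org/"),
    ]
    for host in domains(urls):
        print(host)
    for referrer, count in outgoing_link_counts(pairs).most_common():
        print(referrer, count)
```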
Changelog
1.0 - Unreleased
- Initial release
| File | Type | Py Version | Uploaded on | Size | # downloads |
|---|---|---|---|---|---|
| gugle.bot-1.0dev-r629.tar.gz (md5, pgp) | Source | | 2008-05-12 | 9KB | 733 |
- Author: Manuel Vazquez Acosta
- Home Page: http://manuelonsoftware.wordpress.com/
- License: GPL
Categories
- Development Status :: 2 - Pre-Alpha
- Environment :: Console
- Intended Audience :: Science/Research
- License :: OSI Approved :: GNU General Public License (GPL)
- Operating System :: POSIX :: Linux
- Programming Language :: Python
- Topic :: Internet :: WWW/HTTP :: Indexing/Search
- Topic :: Software Development :: Libraries :: Python Modules
- Topic :: Utilities
- Package Index Owner: mvaled
- DOAP record: gugle.bot-1.0dev-r629.xml
