webstemmer 0.6.1
A web crawler and HTML layout analyzer
Latest Version: 0.7.1
Webstemmer is a web crawler and HTML layout analyzer. It extracts articles from news sites as plain text, automatically removing banners, ads, and navigation links. Given only the URL of a site's top page, it runs almost fully automatically, with little human intervention.
- Author: Yusuke Shinyama <yusuke at cs nyu edu>
- Maintainer: Yusuke Shinyama <yusuke at cs nyu edu>
- Home Page: http://www.unixuser.org/~euske/python/webstemmer/
- Download URL: http://www.unixuser.org/~euske/python/webstemmer/webstemmer-0.6.1.tar.gz
- Keywords: web crawler, html parser
- License: MIT/X
- Platform: POSIX, Win32
Categories
- Development Status :: 3 - Alpha
- Environment :: Console
- Intended Audience :: Developers
- Intended Audience :: Science/Research
- License :: OSI Approved :: MIT License
- Natural Language :: English
- Natural Language :: Japanese
- Operating System :: POSIX
- Programming Language :: Python
- Topic :: Internet :: WWW/HTTP :: Indexing/Search
- Topic :: Scientific/Engineering :: Information Analysis
- Topic :: Text Processing :: Filters
- Topic :: Text Processing :: Markup :: HTML
- Package Index Owner: euske
- DOAP record: webstemmer-0.6.1.xml
