
Encoding detection collection for Python.


**For the current version,
please download and install cssutils,
which also includes v0.9.8 of encutils.
The files here have been deleted;
this page is kept only for reference.**



===================================================
encutils - encoding detection collection for Python
===================================================
:Version: 0.9
:Author: Christof Hoeke, see http://cthedot.de/encutils/
:Contributor: Robert Siemer
:Copyright: 2005-2009: Christof Hoeke
:License: encutils is dual-licensed; choose whichever license you prefer:

    * encutils is published under the
      `LGPL 3 or later <http://cthedot.de/encutils/license/>`__
    * encutils is published under the
      `Creative Commons License <http://creativecommons.org/licenses/by/3.0/>`__.

encutils is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

encutils is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License
along with encutils. If not, see <http://www.gnu.org/licenses/>.


A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.

:func:`getEncodingInfo` is probably the main function of interest. It calls
the other supplied functions, gathers their results, and returns an
:class:`EncodingInfo` object.

example::

    >>> import encutils
    >>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')

    >>> print(info)  # = str(info)
    utf-8

    >>> print(repr(info))  # doctest:+ELLIPSIS
    <encutils.EncodingInfo object encoding='utf-8' mismatch=False at...>

    >>> print(info.logtext)
    HTTP media_type: text/html
    HTTP encoding: utf-8
    Encoding (probably): utf-8 (Mismatch: False)
    <BLANKLINE>

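The log output above reports an encoding taken from the HTTP header and a
mismatch flag. As a rough illustration of that idea only, the following
standalone sketch compares the ``charset`` parameter of a ``Content-Type``
header with the encoding declared inside the document itself. It uses only the
standard library. The helper names (``charset_from_content_type``,
``charset_from_text``, ``detect``) are hypothetical and are not part of the
encutils API, and the rule that the HTTP header wins over the in-document
declaration is an assumption made for this sketch, not a statement about how
encutils resolves conflicts.

```python
import re

# Hypothetical helpers for illustration; NOT the encutils implementation.


def charset_from_content_type(header):
    """Return (media_type, charset) parsed from a Content-Type header value."""
    parts = [p.strip() for p in header.split(";")]
    media_type = parts[0].lower()
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            # Drop optional quotes around the value and normalise the case.
            charset = value.strip().strip('"\'').lower()
    return media_type, charset


# XML declaration at the very start of the document.
_XML_DECL = re.compile(
    br'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
# Any <meta ...> tag that carries a charset, in either HTML 4 or short form.
_META = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def charset_from_text(data):
    """Return the encoding declared in the first bytes of a document, or None."""
    head = data[:1024]
    match = _XML_DECL.search(head) or _META.search(head)
    return match.group(1).decode("ascii").lower() if match else None


def detect(header, data):
    """Combine both sources and flag a mismatch when they disagree."""
    media_type, http_charset = charset_from_content_type(header)
    declared = charset_from_text(data)
    mismatch = bool(http_charset and declared and http_charset != declared)
    # Assumption of this sketch: the HTTP header takes precedence.
    encoding = http_charset or declared
    return {"media_type": media_type, "encoding": encoding,
            "mismatch": mismatch}
```

Calling ``detect('text/xml; charset=utf-8', data)`` on a document whose XML
declaration names ``ISO-8859-1`` yields ``utf-8`` as the encoding and sets
``mismatch`` to ``True``. The sketch ignores byte order marks and the default
encodings that RFC 3023 assigns per media type.
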
references
    XML
        RFC 3023 (http://www.ietf.org/rfc/rfc3023.txt)

        easier explained in

        - http://feedparser.org/docs/advanced.html
        - http://www.xml.com/pub/a/2004/07/21/dive.html

    HTML
        http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2

TODO
    - parse @charset of HTML elements?
    - check for more text types if only text is given
