Epubzilla

a library for extracting data from EPUB files

Project description

Epubzilla Documentation

Epubzilla is a Python library for extracting data from EPUB documents.

Currently, only EPUB 2.0.1 is supported. Support for EPUB 3.0 is planned.
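Since only EPUB 2.0.1 is supported, it can be useful to check which version a file declares before passing it to the library. The following is a minimal sketch that does not use Epubzilla at all: it is written for Python 3 with the standard library only, and the helper names `epub_package_version` and `is_epub2` are hypothetical. It relies on the EPUB container layout, where `META-INF/container.xml` points at the OPF package document and the `<package>` element carries a `version` attribute (`2.0` for EPUB 2.0.1 publications).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of META-INF/container.xml in the EPUB container format.
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def epub_package_version(epub_file):
    """Return the `version` attribute of the OPF <package> element.

    `epub_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml names the location of the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(".//{%s}rootfile" % CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no <rootfile> element")
        package = ET.fromstring(archive.read(rootfile.get("full-path")))
        return package.get("version")


def is_epub2(epub_file):
    """True if the package document declares a 2.x version."""
    version = epub_package_version(epub_file)
    return version is not None and version.startswith("2.")
```

A caller could use `is_epub2(path)` as a guard and skip or report files that declare another version.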

Getting Help

If you have questions about Epubzilla, send an email to odeegan @ gmail . com

Requirements

Python 2.6+

lxml version 3.0.1 or later is required
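The lxml requirement can be checked at runtime. This is only a sketch: `meets_minimum` is a hypothetical helper, and the check assumes lxml exposes its version as the tuple `lxml.etree.LXML_VERSION`. If lxml is not installed, the check simply reports failure.

```python
def meets_minimum(version, minimum=(3, 0, 1)):
    """Compare a version tuple against a minimum, component by component."""
    return tuple(version[:len(minimum)]) >= tuple(minimum)


try:
    from lxml import etree
    # LXML_VERSION is a tuple of integers describing the installed lxml.
    lxml_ok = meets_minimum(etree.LXML_VERSION)
except ImportError:
    # lxml is missing entirely, so the requirement is not met.
    lxml_ok = False
```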

QuickStart

>>> from epubzilla.epubzilla import Epub
>>> epub = Epub.from_file('Manly-DeathValley-images.epub')
>>> epub.author
'Manly, William Lewis'
>>> epub.title
"Death Valley in '49"

Here are a few examples of how to navigate the data:

epub.metadata[3].tag.localname
# title

epub.metadata[3].tag.namespace
# http://purl.org/dc/elements/1.1/

epub.metadata[3].tag.text
# Death Valley in '49

epub.metadata[3].as_xhtml
# <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf">Death Valley in '49</dc:title>

epub.manifest[2].tag.attributes
# {u'href': 'www.gutenberg.org@files@12236@12236-h@ch_03.png', u'id': 'item3', u'media-type': 'image/png'}
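The `localname` and `namespace` values above correspond to the two halves of a namespace-qualified XML tag. As an illustration only (this is not Epubzilla's code, and `split_qualified_tag` is a hypothetical helper), here is how the same split looks with Python 3's standard `xml.etree.ElementTree`, which stores such tags as `{namespace}localname`:

```python
import xml.etree.ElementTree as ET


def split_qualified_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith("{"):
        namespace, _, localname = tag[1:].partition("}")
        return namespace, localname
    # Tags without a namespace have no '{...}' prefix.
    return None, tag


# A cut-down metadata block modelled on the example book above.
OPF_SNIPPET = """<metadata xmlns="http://www.idpf.org/2007/opf"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Death Valley in '49</dc:title>
</metadata>"""

title = ET.fromstring(OPF_SNIPPET)[0]
namespace, localname = split_qualified_tag(title.tag)
print(localname)   # title
print(namespace)   # http://purl.org/dc/elements/1.1/
print(title.text)  # Death Valley in '49
```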

If an element contains other elements, they can be accessed via the list property:

epub.metadata.list
# [class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>, class <Epub.Element>]

They can also be directly iterated over:

for element in epub.metadata:
    print "%s : %s" % (element.tag.localname, element.tag.text)
    for k, v in element.tag.iteritems():
        print "\t %s : %s" % (k, v)

# rights : Public domain in the USA.
# identifier : http://www.gutenberg.org/ebooks/12236
#  scheme : URI
#  id : id
# creator : William Lewis Manly
#  file-as : Manly, William Lewis
# title : Death Valley in '49
# language : en
#  type : dcterms:RFC4646
# date : 2004-05-01
#  event : publication
# date : 2010-02-15T17:50:02.335756+00:00
#  event : conversion
# source : http://www.gutenberg.org/files/12236/12236-h/12236-h.htm
# meta :
#  content : item26
#  name : cover
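For readers who want to see where such a listing comes from, the sketch below produces similar lines directly from an OPF document using Python 3 and the standard library. It does not use Epubzilla; `describe_metadata`, `local`, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def local(name):
    """Drop a leading '{namespace}' from a tag or attribute name."""
    return name.rpartition("}")[2]


def describe_metadata(opf_xml):
    """Return 'name : text' lines, each followed by tab-indented attributes."""
    package = ET.fromstring(opf_xml)
    metadata = package.find("{%s}metadata" % OPF_NS)
    if metadata is None:
        raise ValueError("package document has no <metadata> element")
    lines = []
    for element in metadata:
        text = (element.text or "").strip()
        lines.append("%s : %s" % (local(element.tag), text))
        for key, value in element.attrib.items():
            lines.append("\t %s : %s" % (local(key), value))
    return lines


# Sample package document modelled on the output shown above.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:creator opf:file-as="Manly, William Lewis">William Lewis Manly</dc:creator>
    <dc:title>Death Valley in '49</dc:title>
    <meta content="item26" name="cover"/>
  </metadata>
</package>"""

for line in describe_metadata(SAMPLE_OPF):
    print(line)
```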

If a manifest element references a file, the file's contents can be accessed via the element’s get_file() method, which returns them as a string (a Python 2 str):

type(epub.manifest[2].get_file())
# <type 'str'>
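Under the hood, a manifest item is just a member of the EPUB zip archive whose `href` is resolved relative to the package document. The sketch below shows that lookup with Python 3's standard library. It is not Epubzilla's implementation, and `read_manifest_item` is a hypothetical helper. In Python 3 the result is a `bytes` object rather than a `str`.

```python
import posixpath
import zipfile
from urllib.parse import unquote


def read_manifest_item(epub_file, opf_path, href):
    """Return the raw bytes of a manifest item.

    `opf_path` is the archive path of the package document, and `href`
    is the item's href attribute, resolved relative to that document.
    """
    base = posixpath.dirname(opf_path)
    # hrefs are URL references, so undo any percent-encoding first.
    member = posixpath.normpath(posixpath.join(base, unquote(href)))
    with zipfile.ZipFile(epub_file) as archive:
        return archive.read(member)
```

The returned bytes can be written to disk with a file opened in binary mode, or decoded if the item is a text resource.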

Project details

Release history

0.1.1

Feb 12, 2013

0.1.0

Feb 9, 2013

Download files

Source Distribution

Epubzilla-0.1.0.tar.gz (5.8 kB)

Uploaded Feb 9, 2013 Source

Hashes for Epubzilla-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e2693d39fda5944e56a8a9f69cb34b579df193d843029809a062ef83f532b8ef`
MD5	`f0322e040300288933e93e94ad4c219c`
BLAKE2b-256	`4d8228d8b65482126f141d27d506c15c0ebb538040ca335bbde431d6a381fc92`
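A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using Python 3's `hashlib`. The function names and the chunk size are illustrative choices, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published above for Epubzilla-0.1.0.tar.gz.
EXPECTED_SHA256 = "e2693d39fda5944e56a8a9f69cb34b579df193d843029809a062ef83f532b8ef"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("Epubzilla-0.1.0.tar.gz")` should return True for an intact copy of the file.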