PyDocX 0.3.1

docx (OOXML) to html converter

Latest Version: 0.3.18

pydocx is a parser that breaks down the elements of a docx file and converts them
into different markup languages. Right now, only HTML is supported. Markdown and LaTeX
will be available soon. You can extend any of the available parsers to customize it
to your needs. You can also create your own class that inherits from DocxParser
and implements the methods for a markup language that is not yet supported.

Currently Supported

* tables
    * nested tables
    * rowspans
    * colspans
    * lists in tables
* lists
    * list styles
    * nested lists
    * list of tables
    * list of paragraphs
* justification
* images
* styles
    * bold
    * italics
    * underline
    * hyperlinks
* headings


DocxParser includes abstract methods that each parser overrides to satisfy its own needs. The abstract methods are as follows:


    class DocxParser:

        def parsed(self):
            return self._parsed

        def escape(self, text):
            return text

        def linebreak(self):
            return ''

        def paragraph(self, text):
            return text

        def heading(self, text, heading_level):
            return text

        def insertion(self, text, author, date):
            return text

        def hyperlink(self, text, href):
            return text

        def image_handler(self, path):
            return path

        def image(self, path, x, y):
            return self.image_handler(path)

        def deletion(self, text, author, date):
            return text

        def bold(self, text):
            return text

        def italics(self, text):
            return text

        def underline(self, text):
            return text

        def superscript(self, text):
            return text

        def subscript(self, text):
            return text

        def tab(self):
            return True

        def ordered_list(self, text):
            return text

        def unordered_list(self, text):
            return text

        def list_element(self, text):
            return text

        def table(self, text):
            return text
        def table_row(self, text):
            return text

        def table_cell(self, text):
            return text

        def page_break(self):
            return True

        def indent(self, text, left='', right='', firstLine=''):
            return text
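
To see why overriding these methods is enough to change the output, here is a minimal, self-contained sketch of the hook pattern. `SketchParser`, `SketchHtml`, `render` and the `(kind, text)` run tuples are hypothetical stand-ins invented for this illustration. They are not part of pydocx, whose real parser walks the XML inside the docx file.

```python
class SketchParser(object):
    """Hypothetical stand-in for a base parser (not the pydocx class)."""

    def escape(self, text):
        return text

    def bold(self, text):
        return text

    def paragraph(self, text):
        return text

    def render(self, paragraphs):
        # Each paragraph is a list of (kind, text) runs. The base class
        # decides when a hook is called; subclasses decide what it returns.
        output = []
        for runs in paragraphs:
            pieces = []
            for kind, text in runs:
                text = self.escape(text)
                if kind == 'bold':
                    text = self.bold(text)
                pieces.append(text)
            output.append(self.paragraph(''.join(pieces)))
        return ''.join(output)


class SketchHtml(SketchParser):
    def bold(self, text):
        return '<b>' + text + '</b>'

    def paragraph(self, text):
        return '<p>' + text + '</p>'


document = [[('plain', 'Hello '), ('bold', 'world')]]
plain_output = SketchParser().render(document)  # hooks return text unchanged
html_output = SketchHtml().render(document)     # hooks wrap text in tags
```

The base class returns the text untouched, so any hook a subclass leaves alone simply passes its content through.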

Docx2Html inherits from DocxParser and implements basic HTML handling. For example:


    import xml.sax.saxutils


    class Docx2Html(DocxParser):

        #  Escape '&', '<', and '>' so we render the HTML correctly
        def escape(self, text):
            return xml.sax.saxutils.quoteattr(text)[1:-1]

        # return a line break
        def linebreak(self, pre=None):
            return '<br />'

        # add paragraph tags
        def paragraph(self, text, pre=None):
            return '<p>' + text + '</p>'
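
The `escape` override above leans on a standard-library detail: `xml.sax.saxutils.quoteattr` escapes the text and wraps the result in quotes, and the `[1:-1]` slice strips that wrapper again. A small sketch of that behaviour follows (the helper name `escape_text` is ours, not part of pydocx):

```python
import xml.sax.saxutils


def escape_text(text):
    # quoteattr escapes '&', '<' and '>' and wraps the result in quotes;
    # slicing off the first and last character removes the wrapper.
    return xml.sax.saxutils.quoteattr(text)[1:-1]


escaped = escape_text('1 < 2 & 3 > 2')
tabbed = escape_text('a\tb')
```

Note that `quoteattr` also turns tabs and newlines into character references, which `xml.sax.saxutils.escape` alone would not do.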

However, let's say you want to add a specific style to your HTML document. To do this, you want every paragraph to carry the class `my_implementation`. Simply extend Docx2Html and override what you need.


    class My_Implementation_of_Docx2Html(Docx2Html):

        def paragraph(self, text, pre=None):
            return '<p class="my_implementation">' + text + '</p>'

Or, let's say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser:


    class Docx2Foo(DocxParser):

        # because line breaks are denoted by '!!!!!!!!!!!!' in the FOO markup language  :)
        def linebreak(self):
            return '!!!!!!!!!!!!'

Custom Pre-Processor

When creating your own parser (as described above), you can now add your own custom pre-processor. To do so, set the `pre_processor_class` attribute on the custom parser, like so:


    class Docx2Foo(DocxParser):
        pre_processor_class = FooPrePorcessor

The `FooPrePorcessor` will need a few things to get you going:


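    class FooPrePorcessor(PydocxPrePorcessor):
        def perform_pre_processing(self, root, *args, **kwargs):
            super(FooPrePorcessor, self).perform_pre_processing(root, *args, **kwargs)
            # custom steps only run if they are called from here
            self._set_foo(root)

        def _set_foo(self, root):
            # custom pre-processing of the document root goes here
            pass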

If you want `_set_foo` to be called, you must call it from `perform_pre_processing`, which is called by the base parser for pydocx.

Everything done during pre-processing is executed prior to `parse` being called for the first time.
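
The ordering described above can be sketched with stand-in classes. Everything here (`SketchPreProcessor`, `SketchDocParser`, the `foo` attribute, the toy XML) is hypothetical and only mirrors the shape of the README example. The real pydocx base classes do more work and may take different constructor arguments.

```python
import xml.etree.ElementTree as ElementTree


class SketchPreProcessor(object):
    """Hypothetical stand-in for the base pre-processor."""

    def perform_pre_processing(self, root, *args, **kwargs):
        # The real base class would do its own bookkeeping here.
        pass


class FooSketchPreProcessor(SketchPreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        super(FooSketchPreProcessor, self).perform_pre_processing(
            root, *args, **kwargs)
        # Custom steps only run if they are called explicitly.
        self._set_foo(root)

    def _set_foo(self, root):
        # Mark every paragraph element so the parser can see the change.
        for element in root.iter('p'):
            element.set('foo', 'yes')


class SketchDocParser(object):
    """Hypothetical stand-in for a parser with a pluggable pre-processor."""

    pre_processor_class = SketchPreProcessor

    def __init__(self, xml_text):
        self.root = ElementTree.fromstring(xml_text)
        # Pre-processing finishes before any parsing starts.
        self.pre_processor_class().perform_pre_processing(self.root)
        self.parsed = self.parse(self.root)

    def parse(self, root):
        return [p.get('foo', 'no') for p in root.iter('p')]


class FooSketchParser(SketchDocParser):
    pre_processor_class = FooSketchPreProcessor


xml_text = '<body><p>one</p><p>two</p></body>'
default_flags = SketchDocParser(xml_text).parsed
foo_flags = FooSketchParser(xml_text).parsed
```

Swapping the class attribute is the only change the subclass needs; the parser itself stays untouched.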


The base parser `Docx2Html` relies on certain CSS classes being defined for certain behaviour to occur. Currently these include:

* class `pydocx-insert` -> Turns the text green.
* class `pydocx-delete` -> Turns the text red and draws a line through the text.
* class `pydocx-center` -> Aligns the text to the center.
* class `pydocx-right` -> Aligns the text to the right.
* class `pydocx-left` -> Aligns the text to the left.
* class `pydocx-comment` -> Turns the text blue.
* class `pydocx-underline` -> Underlines the text.
* class `pydocx-caps` -> Makes all text uppercase.
* class `pydocx-small-caps` -> Makes all text uppercase; however, letters that are truly lowercase will be smaller than their uppercase counterparts.
* class `pydocx-strike` -> Strikes a line through the text.
* class `pydocx-hidden` -> Hides the text.
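
If you need a starting point for these classes, the sketch below builds a `<style>` block from the descriptions in the list above. The rule bodies are our reading of those descriptions, not a stylesheet shipped with pydocx. In particular, the exact colour values and the use of `display: none` for hidden text are assumptions, and `build_stylesheet` is a hypothetical helper.

```python
# Rules inferred from the class descriptions; colour values and
# display:none for hidden text are assumptions made for this sketch.
PYDOCX_STYLES = [
    ('pydocx-insert', 'color: green;'),
    ('pydocx-delete', 'color: red; text-decoration: line-through;'),
    ('pydocx-center', 'text-align: center;'),
    ('pydocx-right', 'text-align: right;'),
    ('pydocx-left', 'text-align: left;'),
    ('pydocx-comment', 'color: blue;'),
    ('pydocx-underline', 'text-decoration: underline;'),
    ('pydocx-caps', 'text-transform: uppercase;'),
    ('pydocx-small-caps', 'font-variant: small-caps;'),
    ('pydocx-strike', 'text-decoration: line-through;'),
    ('pydocx-hidden', 'display: none;'),
]


def build_stylesheet(styles=PYDOCX_STYLES):
    # One CSS rule per class, wrapped in a <style> element.
    rules = ['.%s { %s }' % (name, body) for name, body in styles]
    return '<style>\n' + '\n'.join(rules) + '\n</style>'


stylesheet = build_stylesheet()
```

The resulting string can be placed in the `<head>` of the page that displays the converted HTML.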

Optional Arguments

You can pass `convert_root_level_upper_roman=True` to the parser, and it will convert all root-level upper-roman lists to headings instead.
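
As a rough illustration of what this option means for the output, the sketch below renders a root-level list either as an ordered list or as headings. `render_root_list` is a hypothetical helper, not pydocx code, and the heading level (`h2`) is an assumption. `upperRoman` is the OOXML numbering format for I, II, III and so on.

```python
def render_root_list(items, num_format, convert_root_level_upper_roman=False):
    # 'upperRoman' is the OOXML numbering format for I, II, III, ...
    if convert_root_level_upper_roman and num_format == 'upperRoman':
        # The heading level (h2) is an assumption made for this sketch.
        return ''.join('<h2>' + item + '</h2>' for item in items)
    list_items = ''.join('<li>' + item + '</li>' for item in items)
    return '<ol>' + list_items + '</ol>'


items = ['Introduction', 'Methods']
as_list = render_root_list(items, 'upperRoman')
as_headings = render_root_list(
    items, 'upperRoman', convert_root_level_upper_roman=True)
```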


* 0.3.1
    * Added support for several more OOXML tags including:
        * caps
        * smallCaps
        * strike
        * dstrike
        * vanish
        * webHidden
      More details in the README.
* 0.3.0
    * We switched from using stock *xml.etree.ElementTree* to using
      *xml.etree.cElementTree*. This has resulted in a fairly significant speed
      increase for Python 2.6.
    * It is now possible to create your own pre processor to do additional pre
    * Superscripts and subscripts are now extracted correctly.
* 0.2.1
    * Added a changelog
    * Added the version in pydocx.__init__
    * Fixed an issue with duplicating content if there was indentation or
      justification on a p element that had multiple t tags.